Last modified: 2013-03-25 15:21:34 UTC
When uploading large video files via the upload API, it would be very helpful to support uploading the file in chunks. That way, if your POST request gets reset, dies in the middle, or your process crashes, you don't need to start uploading from scratch. Additionally, this will avoid long-running large POSTs on the server side. Initially we should support the Firefogg chunk upload method: http://www.firefogg.org/dev/chunk_post.html Basically we just give the "next chunk url" back as a response until we get a done=1 in the POST vars, then concatenate the pieces.
Is that something we can reasonably have our PHP-based code understand, or would this take patching PHP's own HTTP upload-handling code?
We should collect the different parts, concatenate them and then pass them on to LocalFile::recordUpload2. I think we can do this, but is it wise to invent our own standard for chunked upload?
In response to comment #1: yes, it's really simple. We just collect a set of POSTed files and concatenate them; no special HTTP upload-handling code is needed. We could concatenate by running a shell command to avoid sending all that data through PHP, but it's not a big deal either way.

In response to comment #2: is there some standard that you would prefer we use? The only thing I am aware of is the Google-proposed resumable upload code: http://code.google.com/p/gears/wiki/ResumableHttpRequestsProposal

Notice that even Google (with their custom clients) has taken the 1 meg chunk at a time approach because it avoids modifications to the HTTP protocol. They have proposed modifications to the HTTP protocol, and we can support that once it "gets out there".
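For illustration only, the collect-and-concatenate step could look roughly like the Python sketch below. The thread is about PHP, so this is not the proposed implementation; the function name `concatenate_chunks` and its parameters are made up for the example. It streams each stored chunk into the target file in fixed-size blocks, which is one way to avoid holding a whole chunk in memory.

```python
import shutil
from pathlib import Path


def concatenate_chunks(chunk_paths, target_path, buffer_size=1024 * 1024):
    """Append each chunk file to target_path, in the order given.

    Hypothetical helper: chunk_paths is assumed to already be sorted in
    upload order. Returns the size of the assembled file in bytes.
    """
    with open(target_path, "wb") as target:
        for chunk_path in chunk_paths:
            with open(chunk_path, "rb") as chunk:
                # copyfileobj copies in buffer_size blocks, so a large
                # chunk is never read into memory all at once.
                shutil.copyfileobj(chunk, target, buffer_size)
    return Path(target_path).stat().st_size
```

The caller is responsible for the ordering of `chunk_paths`; the sketch does no validation of the pieces themselves.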
(In reply to comment #3)
> in response to comment #1
> yea its really simple we just collect a set of post files and concatenate
> them.. no special HTTP upload handling code. maybe concatenate by running shell
> command to avoid sending all that data through php but not a big deal either
> way.
>

Assigning this to Bryan, he's the one working on the upload API at the moment.

> in response to comment #2
> is there some standard that you would prefer we use? The only thing I am aware
> of is the google proposed resume upload code
> http://code.google.com/p/gears/wiki/ResumableHttpRequestsProposal
>
> Notice even Google (with their custom clients) has taken the 1 meg chunk at a
> time approach because it avoids modifications to HTTP protocol. They have
> proposed modifications to HTTP protocol and we can support that once it "gets
> out there".
>

Modifications to the HTTP protocol shouldn't be supported by us, but by PHP.
IIRC the new API upload module supports Firefogg chunk uploading. Marking as FIXED.
(In reply to comment #5)
> IIRC the new API upload module supports Firefogg chunck uploading. Marking as
> FIXED.

Based on comments at bug 25676, I don't think this bug is properly fixed currently. Re-opening for now.
(Copied from bug 25676 comment 12 by Neil)

(In reply to bug 25676 comment 11)
> (In reply to bug 25676 comment 9)
> > Reopening since Tim's arguments for WONTFIX pertained mostly to the Firefogg
> > add-on (client-side) rather than the FirefoggChunkedUpload extension
> > (server-side support).
>
> Actually I think the second paragraph in comment 5, where I explained why I
> don't think the server-side extension should be enabled, was longer than the
> first paragraph, which dealt with my objections to the client-side.

I've had a look at Google's Resumable Upload Protocol. They do things in a reasonable manner, also very RESTy. We have never used HTTP Headers or Status fields for application state signalling, but we can emulate most of this in ordinary POST parameters and returned data.

http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#ResumableUpload

Okay, so if the following things were added to a chunked upload protocol, would this be satisfactory?

* Before starting an upload, the client tells the server the length of the file to be uploaded, in bytes.

* With each upload chunk, the client also tells the server which range of bytes this corresponds to.

* With each response to an upload chunk, the server indicates the largest range of contiguous bytes starting from zero that it thinks it has. (The client should use this information to set its filepointer for subsequent chunks). n.b. this means it's possible for the client to send overlapping ranges; the server needs to be smart about this

* The server is the party that decides when the upload is done. (By signaling that the full range of packets are received, saying "ok, all done", and then returning the other usual information about how to refer to the reassembled file).

We could also add optional checksums here, at least for each individual chunk. (A complete-file checksum would be nice, but it's not clear to me whether it is practical for Javascript FileAPI clients to obtain them).

And each upload could return some error status, particularly if checksums or expected length doesn't match.
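To make the four points above concrete, here is a minimal in-memory sketch in Python (not PHP, and not a proposed implementation). The class name `ChunkedUpload`, the response keys and the error strings are all invented for the example, and SHA-1 is only a stand-in for whatever per-chunk checksum would actually be chosen. Overlapping ranges are handled in the simplest possible way: the most recently received bytes win.

```python
import hashlib


class ChunkedUpload:
    """Tracks one upload whose total length is declared before any data."""

    def __init__(self, total_length):
        self.total_length = total_length
        self.data = bytearray(total_length)
        self.received = []  # sorted, merged list of [start, end) ranges

    def add_chunk(self, start, payload, sha1_hex=None):
        """Store payload at byte offset start and report progress."""
        end = start + len(payload)
        if start < 0 or end > self.total_length:
            return {"error": "range-out-of-bounds"}
        if sha1_hex is not None and hashlib.sha1(payload).hexdigest() != sha1_hex:
            return {"error": "checksum-mismatch"}
        # Same-length slice assignment: overlapping bytes are overwritten.
        self.data[start:end] = payload
        self._merge(start, end)
        contiguous = self.contiguous_bytes()
        # The server, not the client, decides when the upload is done.
        return {"contiguous": contiguous, "done": contiguous == self.total_length}

    def _merge(self, start, end):
        """Insert [start, end) and coalesce touching or overlapping ranges."""
        merged = []
        for s, e in sorted(self.received + [[start, end]]):
            if merged and s <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], e)
            else:
                merged.append([s, e])
        self.received = merged

    def contiguous_bytes(self):
        """Number of contiguous bytes held, counting from offset zero."""
        if self.received and self.received[0][0] == 0:
            return self.received[0][1]
        return 0
```

A client would use the returned `contiguous` value as its file pointer for the next chunk, so an out-of-order or repeated chunk costs bandwidth but does not corrupt the result.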
(Copied from bug 25676 comment 13 by Neil) Okay, from some comments on IRC, it is apparently unclear why I just posted some suggestions for changing the Firefogg protocol. I am trying to answer Tim's objections that the way of uploading chunks is not robust enough. In Firefogg, the client is basically POSTing chunks to append until it says that it's done. The server has no idea when this process is going to end, and has no idea if it missed any chunks. I believe this is what bothered Tim about this. There was some confusion about whether PLupload was any better. As far as I can tell, it isn't, so I looked to the Google Resumable Upload Protocol for something more explicit.
(Copied from bug 25676 comment 14 by Tim)

(In reply to bug 25676 comment 12)
> Okay, so if the following things were added to a chunked upload protocol, would
> this be satisfactory?

Yes, that is the sort of thing we need.

(In reply to bug 25676 comment 13)
> In Firefogg, the client is basically POSTing chunks to append until it says
> that it's done. The server has no idea when this process is going to end, and
> has no idea if it missed any chunks. I believe this is what bothered Tim about
> this.

Yes. For example, if the PHP process for a chunk upload takes a long time, the Squid server may time out and return an error message, but the PHP process will continue and the chunk may still be appended to the file eventually. In this situation, Firefogg would retry the upload of the same chunk, resulting in it being appended twice. Because the original request and the retry will operate concurrently, it's possible to hit NFS concurrency issues, with the duplicate chunks partially overwriting each other.

A robust protocol, which assumes that chunks may be uploaded concurrently, duplicated or omitted, will be immune to these kinds of operational details. Dealing with concurrency might be as simple as returning an error message if another process is operating on the same file. I'm not saying there is a need for something complex there.
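As a rough illustration of the "return an error if another process is operating on the same file" idea, here is a small Python sketch (the real code would be PHP). The names `append_chunk` and `UploadBusy` and the `.lock` suffix are assumptions made for the example. It uses an exclusively created lock file as the busy check and only appends a chunk whose offset equals the current file size, so a retried duplicate is ignored instead of being appended twice. Whether exclusive file creation is a safe lock on the shared filesystem in question (NFS was mentioned) would need checking, and a crashed process would leave a stale lock behind that something has to clean up.

```python
import os


class UploadBusy(Exception):
    """Raised when another process holds the lock for this upload."""


def append_chunk(path, offset, payload):
    """Append payload if it starts exactly at the current end of the file.

    Returns the file size after the call, so the client can tell where
    to resume from. Hypothetical helper, not an existing API.
    """
    lock_path = path + ".lock"
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        lock_fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise UploadBusy(path) from None
    try:
        current = os.path.getsize(path) if os.path.exists(path) else 0
        if offset == current:
            with open(path, "ab") as f:
                f.write(payload)
            current += len(payload)
        # Any other offset (a retried duplicate, or a gap) is not written.
        return current
    finally:
        os.close(lock_fd)
        os.remove(lock_path)
```

With this shape, the timed-out-and-retried request from the Squid example either gets `UploadBusy` while the first request is still running, or finds that its offset is already behind the end of the file and changes nothing.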
So... the current consensus based on bug 25676 is transport-level chunk support. This points to the support being written into core and handled somewhat above the 'upload' API entry points. The flow would look like the following, based on
http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#ResumableUploadInitiate
and
http://code.google.com/p/gears/wiki/ResumableHttpRequestsProposal

Your initial POST sets all the upload parameters: filename, comment, text, token, stash etc. In addition to Content-Length (for the parameters), we set the "X-Upload-Content-Type" and "X-Upload-Content-Length" headers, which give the target file type and upload size, but we /DO NOT/ include any portion of the file in this initial request.

These special X-Upload-Content-Length headers indicate to the server that this is a resumable / chunk upload request. (Ideally we don't want to explicitly tag it with a MediaWiki-specific API parameter.) We may need a way to initially communicate to the client that the server supports resumable uploads. ( ie

The server then checks the requested size and validates all the initial upload parameters (token, valid file name etc.), then responds with a unique URL that only the current session can upload to:

  HTTP/1.1 200 OK
  Location: <upload_uri>

NOTE: We are slightly abusing the resume protocol, since normally you send a request to upload the entire file (but because small chunks are more friendly to Wikimedia's back-end system, we want clients to send things in smaller parts).

The client then starts to send the file in 1 meg chunks. The chunks are specified via the Content-Range header, i.e. something like:

  Content-Length: 10
  Content-Range: bytes 0-9/100

The server receives the Content-Range POSTs and checks that the chunk is authenticated via the session and unique URL. The chunk's byte ranges are checked, and only valid, unseen, sequential byte ranges are appended to the file.

If there are no errors, the server responds with a header specifying the next chunk:

  HTTP/1.1 308 Resume Incomplete
  Content-Length: 0
  Range: 10-19

The client then responds to the Resume Incomplete and sends the next chunk to the server. If the POST breaks or is incomplete, the client can query the server for where it left off with:

  PUT <upload_uri> HTTP/1.1
  Host: docs.google.com
  Content-Length: 0
  Content-Range: bytes */100

The client should only do this every 30 seconds for 5 min and then give up. The server should also "give up" after 30 min and invalidate any chunks that attempt to be appended to an old file. Likewise, partially uploaded files should be purged every so often, possibly with the same purge system used for stashed files?

Finally, if all is well, when the final chunk is sent the normal API response code is run, where the file is validated and stashed or added to the system.

If this sounds reasonable, all that's left to do is implementation ;)
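A sketch of the server-side range check in this flow, in Python for illustration only (the actual code would live in PHP core). The function names and the "incomplete" / "complete" result strings are invented for the example and stand in for the 308 response and the normal API response respectively. HTTP byte ranges are inclusive at both ends, so a 10-byte chunk at offset 0 of a 100-byte file is written `bytes 0-9/100`, and `bytes */100` is the "where did I leave off?" query.

```python
import re

# Matches "bytes 0-9/100" and the status query form "bytes */100".
CONTENT_RANGE = re.compile(r"^bytes (?:(\d+)-(\d+)|\*)/(\d+)$")


def parse_content_range(value):
    """Return (first, last, total); first and last are None for 'bytes */N'."""
    match = CONTENT_RANGE.match(value.strip())
    if match is None:
        raise ValueError("unsupported Content-Range: %r" % value)
    first, last, total = match.groups()
    total = int(total)
    if first is None:
        return None, None, total
    first, last = int(first), int(last)
    if first > last or last >= total:
        raise ValueError("invalid byte range: %r" % value)
    return first, last, total


def handle_chunk(bytes_stored, header_value):
    """Decide how to answer one request, given the bytes stored so far.

    Returns (result, bytes_stored_after). Only a chunk that starts exactly
    where the stored data ends is accepted; anything else leaves the count
    unchanged so the client can resynchronise.
    """
    first, last, total = parse_content_range(header_value)
    if first is not None and first == bytes_stored:
        bytes_stored = last + 1
    result = "complete" if bytes_stored == total else "incomplete"
    return result, bytes_stored
```

For example, `handle_chunk(0, "bytes 0-9/100")` accepts the chunk and reports 10 bytes stored, while repeating the same request afterwards changes nothing. Session and URL authentication, the timeouts, and the actual file writes are left out of the sketch.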
(In reply to comment #10)
> If this sounds reasonable all that left to do is implementation ;)

The parts about X-Foo HTTP headers and using PUT sound like they would make a JS client implementation difficult. Same for using HTTP headers to return data (Range header in the 308 response) and probably to a lesser degree for using status codes to convey information (308). Supporting the Google protocol verbatim is probably a good idea for interoperability, but I think we should also implement a slightly modified version that uses only POST data (query params) to send data and only the response body to receive data, just like api.php
Modern XHR browser implementations do support reading and sending custom headers and reading custom response codes etc. ie xhr.setRequestHeader('custom-header', 'value'); and xhr.getAllResponseHeaders() etc. We won't be supporting non-modern XHR systems since the browser needs to support the blob slice api... I don't think the issue would be on the client, my concern would be if there are any foreseen issues on the back end. I suppose we could support both... by also supporting these headers in the api request and response.
(In reply to comment #12)
> Modern XHR browser implementations do support reading and sending custom
> headers and reading custom response codes etc. ie
> xhr.setRequestHeader('custom-header', 'value'); and xhr.getAllResponseHeaders()
> etc.
>
> We won't be supporting non-modern XHR systems since the browser needs to
> support the blob slice api... I don't think the issue would be on the client,
> my concern would be if there are any foreseen issues on the back end.
>
> I suppose we could support both... by also supporting these headers in the api
> request and response.

Does jQuery AJAX support this?
(In reply to comment #13)
> (In reply to comment #12)
> > Modern XHR browser implementations do support reading and sending custom
> > headers and reading custom response codes etc. ie
> > xhr.setRequestHeader('custom-header', 'value'); and xhr.getAllResponseHeaders()
> > etc.
> >
> > We won't be supporting non-modern XHR systems since the browser needs to
> > support the blob slice api... I don't think the issue would be on the client,
> > my concern would be if there are any foreseen issues on the back end.
> >
> > I suppose we could support both... by also supporting these headers in the api
> > request and response.
> Does jQuery AJAX support this?

Sure, jQuery returns the raw XHR object. But it would need to be a "plugin" that extended the ajax support. Ie the jQuery plugin would take an input[type=file] or parent form and "POST" it to an api target and give you status updates via callbacks or polling.

If done at the protocol level, the jQuery plugin could be a general purpose plugin for any php based Google resumable upload implementation.
(In reply to comment #14)
> Sure, jQuery returns the raw XHR object. But it would need to be a "plugin"
> that extended the ajax support. Ie the jQuery plugin would take an
> input[type=file] or parent form and "POST" it to an api target and give you
> status updates via callbacks or polling.
>
> If done at the protocol level, the jQuery plugin could be a general purpose
> plugin for any php based Google resumable upload implementation.

OK, as long as we're not making protocol design choices that make a jQuery implementation disproportionately harder to do, I'm fine with it.