Last modified: 2013-03-25 15:21:34 UTC

Bug 17255 - Support Upload Resume via Server Side File Concatenation
Status: REOPENED
Product: MediaWiki
Classification: Unclassified
Component: API
Version: unspecified
Hardware: All All
Importance: Normal enhancement with 2 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
http://www.firefogg.org/dev/chunk_pos...
Depends on: 15227
Blocks: 16927
Reported: 2009-01-30 18:01 UTC by Michael Dale
Modified: 2013-03-25 15:21 UTC (History)

Description Michael Dale 2009-01-30 18:01:31 UTC
When uploading large video files via the upload API, it would be very helpful to support uploading the file in chunks. That way, if your POST request gets reset or dies in the middle, or your process crashes, you don't need to start uploading from scratch. Additionally, this avoids long-running large POSTs on the server side.

Initially we should support the Firefogg chunk upload method:
http://www.firefogg.org/dev/chunk_post.html

Basically we just give the "next chunk url" back as a response until we get a done=1 in the post vars, then concatenate the pieces.
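
The loop described above can be sketched roughly as follows. This is an illustrative in-memory model, not Firefogg's or MediaWiki's actual code: `handleChunkPost`, the `uploads` map and the URL format are hypothetical names, and only the `done=1` post variable comes from the description.

```javascript
// Hypothetical in-memory model of the "next chunk url" loop.
const uploads = new Map(); // uploadId -> array of received chunks (Buffers)

function handleChunkPost(uploadId, chunk, postVars) {
  if (!uploads.has(uploadId)) uploads.set(uploadId, []);
  const parts = uploads.get(uploadId);
  parts.push(Buffer.from(chunk));

  if (postVars.done === '1') {
    // Last chunk: concatenate the pieces in arrival order.
    uploads.delete(uploadId);
    return { done: true, file: Buffer.concat(parts) };
  }
  // Otherwise tell the client where to POST the next chunk (made-up URL format).
  return { done: false, nextChunkUrl: `/upload/${uploadId}/chunk/${parts.length}` };
}
```

Note that this model simply trusts arrival order and the client's `done` flag; it has no way to detect a missing or repeated chunk.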
Comment 1 Brion Vibber 2009-01-30 18:06:37 UTC
Is that something we can reasonably have our PHP-based code understand, or would this take patching PHP's own HTTP upload-handling code?
Comment 2 Bryan Tong Minh 2009-01-30 18:23:00 UTC
We should collect the different parts, concatenate them and then pass them on to LocalFile::recordUpload2. I think we can do this, but is it wise to invent our own standard for chunked upload?
Comment 3 Michael Dale 2009-01-30 18:39:42 UTC
in response to comment #1
Yeah, it's really simple: we just collect a set of POSTed files and concatenate them. No special HTTP upload-handling code is needed. We could concatenate by running a shell command to avoid sending all that data through PHP, but it's not a big deal either way.

in response to comment #2
Is there some standard that you would prefer we use? The only thing I am aware of is Google's proposed resumable upload scheme:
http://code.google.com/p/gears/wiki/ResumableHttpRequestsProposal

Notice that even Google (with their custom clients) has taken the 1 MB chunk-at-a-time approach, because it avoids modifications to the HTTP protocol. They have proposed modifications to the HTTP protocol, and we can support that once it "gets out there".
Comment 4 Roan Kattouw 2009-01-31 13:44:46 UTC
(In reply to comment #3)
> in response to comment #1
> yea its really simple we just collect a set of post files and concatenate
> them.. no special HTTP upload handling code. maybe concatenate by running shell
> command to avoid sending all that data through php but not a big deal either
> way. 
> 
Assigning this to Bryan, he's the one working on the upload API at the moment.

> in response to comment #2
> is there some standard that you would prefer we use? The only thing I am aware
> of is the google proposed resume upload code
> http://code.google.com/p/gears/wiki/ResumableHttpRequestsProposal
> 
> Notice even Google (with their custom clients) has taken the 1 meg chunk at a
> time approach because it avoids modifications to HTTP protocol. They have
> proposed modifications to HTTP protocol and we can support that once it "gets
> out there".
> 
Modifications to the HTTP protocol shouldn't be supported by us, but by PHP.
Comment 5 Roan Kattouw 2009-07-25 18:08:10 UTC
IIRC the new API upload module supports Firefogg chunk uploading. Marking as FIXED.
Comment 6 MZMcBride 2011-03-28 03:46:46 UTC
(In reply to comment #5)
> IIRC the new API upload module supports Firefogg chunck uploading. Marking as
> FIXED.

Based on comments at bug 25676, I don't think this bug is properly fixed currently. Re-opening for now.
Comment 7 MZMcBride 2011-03-28 03:54:54 UTC
(Copied from bug 25676 comment 12 by Neil)

(In reply to bug 25676 comment 11)
> (In reply to bug 25676 comment 9)
> > Reopening since Tim's arguments for WONTFIX pertained mostly to the Firefogg
> > add-on (client-side) rather than the FirefoggChunkedUpload extension
> > (server-side support).
> 
> Actually I think the second paragraph in comment 5, where I explained why I
> don't think the server-side extension should be enabled, was longer than the
> first paragraph, which dealt with my objections to the client-side.

I've had a look at Google's Resumable Upload Protocol. They do things in a
reasonable manner, also very RESTy. We have never used HTTP Headers or Status
fields for application state signalling, but we can emulate most of this in
ordinary POST parameters and returned data.

http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#ResumableUpload

Okay, so if the following things were added to a chunked upload protocol, would
this be satisfactory?

* Before starting an upload, the client tells the server the length of the file
to be uploaded, in bytes.

* With each upload chunk, the client also tells the server which range of bytes
this corresponds to.

* With each response to an upload chunk, the server indicates the largest range
of contiguous bytes starting from zero that it thinks it has. (The client
should use this information to set its filepointer for subsequent chunks). n.b.
this means it's possible for the client to send overlapping ranges; the server
needs to be smart about this

* The server is the party that decides when the upload is done. (By signaling
that the full range of packets are received, saying "ok, all done", and then
returning the other usual information about how to refer to the reassembled
file).

We could also add optional checksums here, at least for each individual chunk.
(A complete-file checksum would be nice, but it's not clear to me whether it is
practical for Javascript FileAPI clients to obtain them).

And each upload could return some error status, particularly if checksums or
expected length doesn't match.
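
A minimal sketch of the bookkeeping proposed in the list above, under stated assumptions: the class name, method name and error strings are hypothetical, the whole file is held in memory for brevity, and the optional checksums are left out.

```javascript
// Hypothetical sketch of server-side range tracking for one upload.
class ChunkedUpload {
  constructor(totalLength) {
    this.totalLength = totalLength;        // declared by the client up front
    this.data = Buffer.alloc(totalLength); // reassembly buffer
    this.received = 0;                     // bytes [0, received) are contiguous
  }

  // start and end are inclusive byte offsets; bytes is a Buffer.
  addChunk(start, end, bytes) {
    if (start < 0 || end < start || end >= this.totalLength) {
      return { error: 'bad-range', received: this.received };
    }
    if (bytes.length !== end - start + 1) {
      return { error: 'length-mismatch', received: this.received };
    }
    if (start > this.received) {
      // A gap: refuse rather than guess; the client should rewind.
      return { error: 'gap', received: this.received };
    }
    // Overlap with already-stored bytes is tolerated: keep only the new tail.
    const skip = this.received - start;
    if (skip < bytes.length) {
      bytes.copy(this.data, this.received, skip);
      this.received = end + 1;
    }
    // The server, not the client, decides when the upload is complete.
    return { done: this.received === this.totalLength, received: this.received };
  }
}
```

Every reply carries `received`, the length of the contiguous prefix the server holds, so the client can reset its file pointer after any failure. A duplicated chunk is a harmless no-op rather than a second append.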
Comment 8 MZMcBride 2011-03-28 03:55:37 UTC
(Copied from bug 25676 comment 13 by Neil)

Okay, from some comments on IRC, it is apparently unclear why I just posted
some suggestions for changing the Firefogg protocol. I am trying to answer
Tim's objections that the way of uploading chunks is not robust enough.

In Firefogg, the client is basically POSTing chunks to append until it says
that it's done. The server has no idea when this process is going to end, and
has no idea if it missed any chunks. I believe this is what bothered Tim about
this. 

There was some confusion about whether PLupload was any better. As far as I can
tell, it isn't, so I looked to the Google Resumable Upload Protocol for
something more explicit.
Comment 9 MZMcBride 2011-03-28 03:57:20 UTC
(Copied from bug 25676 comment 14 by Tim)

(In reply to bug 25676 comment 12)
> Okay, so if the following things were added to a chunked upload protocol, would
> this be satisfactory?

Yes, that is the sort of thing we need.

(In reply to bug 25676 comment 13)
> In Firefogg, the client is basically POSTing chunks to append until it says
> that it's done. The server has no idea when this process is going to end, and
> has no idea if it missed any chunks. I believe this is what bothered Tim about
> this. 

Yes. For example, if the PHP process for a chunk upload takes a long time, the
Squid server may time out and return an error message, but the PHP process will
continue and the chunk may still be appended to the file eventually. In this
situation, Firefogg would retry the upload of the same chunk, resulting in it
being appended twice. Because the original request and the retry will operate
concurrently, it's possible to hit NFS concurrency issues, with the duplicate
chunks partially overwriting each other.

A robust protocol, which assumes that chunks may be uploaded concurrently,
duplicated or omitted, will be immune to these kinds of operational details.

Dealing with concurrency might be as simple as returning an error message if
another process is operating on the same file. I'm not saying there is a need
for something complex there.
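
The "return an error if another process is operating on the same file" idea could look something like the sketch below. All names are hypothetical, and the in-process `Set` is only a stand-in: with several PHP processes and shared storage, the lock would have to live somewhere all of them can see.

```javascript
// Hypothetical per-file guard: reject a chunk while another request holds the file.
const busyFiles = new Set();

function tryAcquire(fileKey) {
  if (busyFiles.has(fileKey)) return false;
  busyFiles.add(fileKey);
  return true;
}

function release(fileKey) {
  busyFiles.delete(fileKey);
}

// Run appendFn only if nobody else is working on this file.
function appendChunkExclusive(fileKey, appendFn) {
  if (!tryAcquire(fileKey)) {
    return { error: 'file-busy' }; // the client may retry later
  }
  try {
    return { result: appendFn() };
  } finally {
    release(fileKey);
  }
}
```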
Comment 10 Michael Dale 2011-03-29 17:16:58 UTC
So... the current consensus based on bug 25676 is transport-level chunk support. This points to the support being written into core and handled somewhat above the 'upload' API entry points.

The flow would look like the following, based on http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#ResumableUploadInitiate and http://code.google.com/p/gears/wiki/ResumableHttpRequestsProposal

Your initial POST sets all the upload parameters:
filename
comment
text
token
stash 
etc.

In addition to Content-Length (for the parameters), we set the "X-Upload-Content-Type" and "X-Upload-Content-Length" headers, which give the target file type and upload size, but we /DO NOT/ include any portion of the file in this initial request. These special X-Upload-* headers indicate to the server that this is a resumable / chunked upload request. (Ideally we don't want to tag the request explicitly with a MediaWiki-specific API parameter.) We may need a way to initially communicate to the client that the server supports resumable uploads. ( ie

The server then checks the requested size, validates all the initial upload parameters ( token, valid file name etc ) then responds with a unique url that only the current session can upload to. 

HTTP/1.1 200 OK
Location: <upload_uri>

NOTE: We are slightly abusing the resume protocol, since normally you send a request to upload the entire file. Because small chunks are friendlier to Wikimedia's back-end systems, we want clients to send things in smaller parts.

The client then starts to send the file in 1 MB chunks. The chunks are specified via the Content-Range header, i.e. something like:

Content-Length: 10
Content-Range: bytes 0-9/100

The server receives the Content-Range POSTs and checks that the chunk is authenticated via the session and unique URL. The chunk's byte ranges are checked, and only valid, unseen, sequential byte ranges are appended to the file.

If there are no errors, the server responds with a header specifying the next chunk:
HTTP/1.1 308 Resume Incomplete
Content-Length: 0
Range: 10-19
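
As a rough sketch of the server-side check for one chunk: the function names, the 400 status for an unexpected range and the 200 status on completion are assumptions for illustration, not part of the proposal. Byte offsets in Content-Range are inclusive, so a 10-byte first chunk of a 100-byte file is `bytes 0-9/100` and the next chunk starts at byte 10.

```javascript
// Parse "bytes 0-9/100" into inclusive offsets and the total size.
function parseContentRange(header) {
  const m = /^bytes (\d+)-(\d+)\/(\d+)$/.exec(header);
  if (!m) return null;
  const [start, end, total] = m.slice(1).map(Number);
  if (start > end || end >= total) return null;
  return { start, end, total, length: end - start + 1 };
}

// Decide the reply for one chunk, given how many bytes are already stored.
function replyForChunk(header, storedBytes, chunkSize) {
  const r = parseContentRange(header);
  // Only the next sequential, unseen range is accepted.
  if (!r || r.start !== storedBytes) {
    return { status: 400, error: 'unexpected-range' };
  }
  const next = r.end + 1;
  if (next === r.total) return { status: 200 }; // final chunk received
  const nextEnd = Math.min(next + chunkSize, r.total) - 1;
  return { status: 308, headers: { 'Content-Length': '0', Range: `${next}-${nextEnd}` } };
}
```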

The client then responds to the Resume Incomplete and sends the next chunk to the server. If the POST breaks or is incomplete, the client can query the server for where it left off with:

PUT <upload_uri> HTTP/1.1
Host: docs.google.com
Content-Length: 0
Content-Range: bytes */100

The client should only do this every 30 seconds for 5 min and then give up. The server should also "give up" after 30 min and invalidate any chunks that attempt to be appended to an old file. Likewise partially uploaded files should be purged every so often, possibly with the same purge system used for stashed files?
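
The timing rules above can be written down as a small sketch. The constant and function names are hypothetical, and measuring the server's 30 minutes from the last activity on the upload (rather than from its start) is an assumption.

```javascript
const RETRY_INTERVAL_MS = 30 * 1000;      // client queries every 30 seconds
const CLIENT_GIVE_UP_MS = 5 * 60 * 1000;  // ... for at most 5 minutes
const SERVER_EXPIRY_MS = 30 * 60 * 1000;  // server abandons the upload after 30 minutes

// Times (ms after the failure) at which the client would send a status query.
function retrySchedule() {
  const times = [];
  for (let t = RETRY_INTERVAL_MS; t <= CLIENT_GIVE_UP_MS; t += RETRY_INTERVAL_MS) {
    times.push(t);
  }
  return times;
}

// Server side: should a chunk arriving at nowMs be rejected as stale?
function isUploadExpired(lastActivityMs, nowMs) {
  return nowMs - lastActivityMs > SERVER_EXPIRY_MS;
}
```

With these values the client makes ten status queries before giving up.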

Finally, if all is well, when the final chunk is sent the normal API response code is run, where the file is validated and stashed or added to the system.


If this sounds reasonable, all that's left to do is implementation ;)
Comment 11 Roan Kattouw 2011-03-29 17:49:17 UTC
(In reply to comment #10)
> If this sounds reasonable all that left to do is implementation ;)
The parts about X-Foo HTTP headers and using PUT sound like they would make a JS client implementation difficult. Same for using HTTP headers to return data (Range header in the 308 response) and probably to a lesser degree for using status codes to convey information (308).

Supporting the Google protocol verbatim is probably a good idea for interoperability, but I think we should also implement a slightly modified version that uses only POST data (query params) to send data and only the response body to receive data, just like api.php.
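
To make the POST-only variant concrete, here is a sketch of how the same information might travel in request parameters and a JSON response body. Every parameter name, field name and result value here is made up for illustration; nothing in this thread fixes them.

```javascript
// Hypothetical: chunk position sent as ordinary POST parameters
// instead of a Content-Range header.
function chunkRequestParams(offset, chunkLength, fileSize) {
  return new URLSearchParams({
    action: 'upload',
    offset: String(offset),
    chunksize: String(chunkLength),
    filesize: String(fileSize),
  }).toString();
}

// Hypothetical: progress returned in the response body
// instead of a 308 status and Range header.
function chunkResponseBody(received, fileSize) {
  return JSON.stringify({
    upload: {
      result: received === fileSize ? 'Success' : 'Continue',
      offset: received,
    },
  });
}
```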
Comment 12 Michael Dale 2011-03-29 18:16:23 UTC
Modern XHR browser implementations do support reading and sending custom headers and reading custom response codes etc. ie xhr.setRequestHeader('custom-header', 'value'); and xhr.getAllResponseHeaders() etc. 

We won't be supporting non-modern XHR systems since the browser needs to support the blob slice api... I don't think the issue would be on the client, my concern would be if there are any foreseen issues on the back end.

I suppose we could support both... by also supporting these headers in the api request and response.
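
A sketch of the client side, assuming a browser with blob slicing. `planChunks` and `uploadUri` are hypothetical names; the XHR part is shown only as comments because it needs a browser. One detail worth noting: the end offset passed to `slice` is exclusive, while the end offset in Content-Range is inclusive.

```javascript
// Split a file of `size` bytes into chunk descriptors.
// sliceEnd is exclusive (as Blob.slice expects); Content-Range ends are inclusive.
function planChunks(size, chunkSize) {
  const chunks = [];
  for (let start = 0; start < size; start += chunkSize) {
    const sliceEnd = Math.min(start + chunkSize, size);
    chunks.push({
      start,
      sliceEnd,
      contentRange: `bytes ${start}-${sliceEnd - 1}/${size}`,
    });
  }
  return chunks;
}

// In a browser, each descriptor c would be used roughly like this
// (uploadUri is a placeholder for the URL returned by the server):
//   const xhr = new XMLHttpRequest();
//   xhr.open('POST', uploadUri);
//   xhr.setRequestHeader('Content-Range', c.contentRange);
//   xhr.send(file.slice(c.start, c.sliceEnd));
```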
Comment 13 Roan Kattouw 2011-03-29 18:17:28 UTC
(In reply to comment #12)
> Modern XHR browser implementations do support reading and sending custom
> headers and reading custom response codes etc. ie
> xhr.setRequestHeader('custom-header', 'value'); and xhr.getAllResponseHeaders()
> etc. 
> 
> We won't be supporting non-modern XHR systems since the browser needs to
> support the blob slice api... I don't think the issue would be on the client,
> my concern would be if there are any foreseen issues on the back end.
> 
> I suppose we could support both... by also supporting these headers in the api
> request and response.
Does jQuery AJAX support this?
Comment 14 Michael Dale 2011-03-29 18:24:29 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > Modern XHR browser implementations do support reading and sending custom
> > headers and reading custom response codes etc. ie
> > xhr.setRequestHeader('custom-header', 'value'); and xhr.getAllResponseHeaders()
> > etc. 
> > 
> > We won't be supporting non-modern XHR systems since the browser needs to
> > support the blob slice api... I don't think the issue would be on the client,
> > my concern would be if there are any foreseen issues on the back end.
> > 
> > I suppose we could support both... by also supporting these headers in the api
> > request and response.
> Does jQuery AJAX support this?

Sure, jQuery returns the raw XHR object. But it would need to be a "plugin" that extended the ajax support. Ie the jQuery plugin would take an input[type=file] or parent form and "POST" it to an api target and give you status updates via callbacks or polling. 

If done at the protocol level, the jQuery plugin could be a general purpose plugin for any php based Google resumable upload implementation.
Comment 15 Roan Kattouw 2011-03-29 18:26:18 UTC
(In reply to comment #14)
> Sure, jQuery returns the raw XHR object. But it would need to be a "plugin"
> that extended the ajax support. Ie the jQuery plugin would take an
> input[type=file] or parent form and "POST" it to an api target and give you
> status updates via callbacks or polling. 
> 
> If done at the protocol level, the jQuery plugin could be a general purpose
> plugin for any php based Google resumable upload implementation.
OK, as long as we're not making protocol design choices that make a jQuery implementation disproportionately harder to do, I'm fine with it.
