When deploying this to live Commons, Roan Kattouw and I noticed that multiple simultaneous file uploads usually had one mysterious failure. Roan determined that the records of some uploaded files were disappearing from the $_SESSION record. Changing UploadWizard's configuration so that it uploads files one at a time fixed the problem (maxSimultaneousUploads = 1).

Speculation: there seems to be some race condition with updating, or possibly creating, the $_SESSION data structures that UploadWizard depends upon, so forcing uploads to go one at a time avoided it. It may be that we should do an initial call to some API to create the data structures for all the uploading files in one shot, and only then actually kick off the uploads.

Simultaneous uploads are a very useful feature, especially for large files, so we really ought to fix this. However, it's not a blocker per se.
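For reference, a minimal sketch of this workaround, assuming UploadWizard reads its settings from the $wgUploadWizardConfig array in LocalSettings.php (verify the exact key name against the extension's defaults; it's written from memory here):

    // In LocalSettings.php: force UploadWizard to upload files one at a time.
    // Key name as used in this report; it may be maxSimultaneousConnections instead.
    $wgUploadWizardConfig['maxSimultaneousUploads'] = 1;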
(In reply to comment #0)
> Speculation: there seems to be some race condition with updating, or possibly
> creating, the $_SESSION data structures that the UploadWizard depends upon.

This code in UploadStash::__construct() looks suspicious:

    if ( !isset( $_SESSION[UploadBase::SESSION_KEYNAME] ) ) {
        $_SESSION[UploadBase::SESSION_KEYNAME] = array();
    }
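To make the suspected failure mode concrete, here is a minimal, self-contained simulation of the lost-update pattern. It is illustrative only, not UploadStash code; it assumes memcached-backed sessions with no locking, so two concurrent requests can read the same snapshot and the last write wins:

    <?php
    // Two "requests" read the same session snapshot, each adds a file record,
    // and each writes its whole copy back; the later write clobbers the earlier one.
    $store = array( 'files' => array( 'file1' => 'data1' ) );

    $snapshotA = $store['files']; // request A reads the session
    $snapshotB = $store['files']; // request B reads the same state (no lock)

    $snapshotA['file2'] = 'data2'; // A stashes file2
    $snapshotB['file3'] = 'data3'; // B stashes file3

    $store['files'] = $snapshotA; // A writes back
    $store['files'] = $snapshotB; // B overwrites A's write: file2's record is lost

    var_export( $store['files'] ); // only 'file1' and 'file3' remain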
Created attachment 7891 [details]
Proposed patch (untested)

The attached patch fixes this by moving the upload stash metadata out of $_SESSION altogether and into memcached (or APC or the DB cache, whatever's available; it uses CACHE_ANYTHING), keyed by session ID and stash key, so there should be no interference between processes provided you're not trying to upload the same file twice at the same time.

The patch is untested, and I'm not sure whether you'll like the idea of moving this out of the session even on normal installs, where PHP's session locking prevents these concurrency issues from happening.
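A rough sketch of the approach the patch takes (not the patch itself; it assumes the wfGetCache()/wfMemcKey() core helpers, and the key layout, expiry, and placeholder values are illustrative):

    <?php
    // Each stashed file gets its own cache key, scoped by session ID and stash
    // key, so concurrent uploads in the same session never overwrite each
    // other's metadata.
    $stashKey = '1234abcd.jpg';                       // illustrative stash key
    $fileMetadata = array( 'mTempPath' => '/tmp/x' ); // illustrative metadata

    $cache = wfGetCache( CACHE_ANYTHING );
    $key = wfMemcKey( 'uploadstash', session_id(), $stashKey );

    // store the metadata for one stashed file
    $cache->set( $key, $fileMetadata, 86400 /* illustrative one-day expiry */ );

    // ...later, possibly in a different request for the same session
    $fileMetadata = $cache->get( $key );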
Okay, there are a number of things here...

Somewhat irrelevant to the main thrust here -- making the "local" cache of the file in UploadStash->files upon storage, rather than a memoized retrieval -- this seems wrong to me, as you only get a chance to do this in the same process as the one that created the record. Any subsequent access is slow. If the argument is that the caches are fast anyway, then the whole optimization of having UploadStash->files should be removed.

Okay, the main things you did are:

- Make entries directly in the cache. I verified that there isn't any case in MediaWiki where we have absolutely no cache available, so that seems okay.

- But now we're relying on an unspecified number of forms of cache, which will change in the future, to do intra-server communication. Unlike $_SESSION, which has to do that for a website to work at all, other forms of caching may not. This will happen to work in the MediaWiki cluster since we'll get memcached, and on a typical standalone MediaWiki you'll get a *DBM, so maybe it's okay, but it bothers me a little.

- Since you give each upload its own key, they are truly isolated, so that probably will fix the concurrency issue. This makes all the uploads truly isolated, which is nice, but it also breaks with the convention of storing all of them under a particular key (which is what everyone else does, such as the upload-by-URL and Firefogg uploads), and it also makes it difficult to find all the uploads later if we wanted to. There is no case in the application where we have to do this, but I anticipate we will want to later.

You might have a really simple solution here, but I want to think about it a bit more. Some other ideas to fix it:

- It might be as simple as eliminating the line that creates an array in $_SESSION[UploadBase::SESSION_KEYNAME]. This is PHP; multi-dimensional hashes spring into existence on assignment (see the sketch at the end of this comment). Depending on how this is implemented, maybe that solves it.

- Figure out a locking system to obtain control over $_SESSION[UploadBase::SESSION_KEYNAME]. Of course, that means we are starting to implement a database. ;(

- Actually use the database -- least desirable from the standpoint of doing a quick fix, but probably the best long-term solution.
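On the "spring into existence" point, a tiny sketch (the 'wsUploadData' key name is the one used elsewhere in this report; the rest is illustrative):

    <?php
    // PHP auto-vivification: assigning to a nested key creates the intermediate
    // array automatically, so the explicit isset()/initialize step in
    // UploadStash::__construct() is not strictly required.
    $session = array();
    $session['wsUploadData']['somekey'] = array( 'mTempPath' => '/tmp/example' );

    var_export( isset( $session['wsUploadData'] ) ); // true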
(In reply to comment #3)
> Okay, there are a number of things here...
>
> Somewhat irrelevant to the main thrust here -- making the "local" cache of the
> file in UploadStash->files upon storage, rather than a memoized retrieval --
> this seems wrong to me, as you only get a chance to do this in the same process
> as the one that created the record. Any subsequent access is slow. If the
> argument is that the caches are fast anyway, then the whole optimization of
> having UploadStash->files should be removed.

I don't see the issue here. UploadStash->files is not shared between processes or anything; memoization only works within the same process. AFAICT this revision changed nothing in this regard: it writes to $files upon storage AND upon retrieval. (Before, we only wrote upon retrieval, but did a retrieval at the end of the storage function; hitting $_SESSION twice like that is fine, hitting the cache unnecessarily is not, which is why I refactored this a little bit.)

> Okay, the main things you did are:
>
> - Make entries directly in the cache. I verified that there isn't any case in
> MediaWiki where we have absolutely no cache available, so that seems okay.
>
> - But now we're relying on an unspecified number of forms of cache, which will
> change in the future, to do intra-server communication. Unlike $_SESSION, which
> has to do that for a website to work at all, other forms of caching may not.
> This will happen to work in the MediaWiki cluster since we'll get memcached,
> and on a typical standalone MediaWiki you'll get a *DBM, so maybe it's okay,
> but it bothers me a little.

This is a good point. I guess we could do what I was considering initially, which is to choose between the old and the "new" behavior based on a $wg var which, if unset, would default to $wgSessionsInMemcached. The rationale here is that if sessions are in memcached, the new way is definitely safe (just as safe as $_SESSION would be, anyway) and even required (because there's no session locking in that case). Cases where $wgSessionsInMemcached is disabled but session locking is somehow disabled too are probably very rare. (Rough sketch at the end of this comment.)

> - Since you give each upload its own key, they are truly isolated, so that
> probably will fix the concurrency issue. This makes all the uploads truly
> isolated, which is nice, but it also breaks with the convention of storing all
> of them under a particular key (which is what everyone else does, such as the
> upload-by-URL and Firefogg uploads), and it also makes it difficult to find all
> the uploads later if we wanted to.

Yeah, you can't do array_keys( $_SESSION['wsUploadData'] ) to list your currently stashed uploads anymore, that's right. This is by design, however: having all the keys in one place like that is what caused these concurrency issues in the first place.

> - It might be as simple as eliminating the line that creates an array in
> $_SESSION[UploadBase::SESSION_KEYNAME]. This is PHP; multi-dimensional hashes
> spring into existence on assignment. Depending on how this is implemented,
> maybe that solves it.

I'm pessimistic about this: I think the session handler writes the data to memcached when the request ends, in which case it doesn't matter how exactly that data was put into $_SESSION.
> - Actually use the database -- least desirable from the standpoint of doing a
> quick fix, but probably the best long-term solution.

This is probably a good idea, as it gives us excellent concurrency control; expiry wouldn't be handled, though, so we'd have to do that ourselves the way Block.php and SqlBagOStuff (best class name ever) already do.
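Going back to the configuration switch mentioned above, a minimal sketch, assuming a hypothetical $wgUploadStashInCache setting (the name is made up here; $wgSessionsInMemcached is the existing core setting):

    <?php
    // Hypothetical $wgUploadStashInCache setting (name invented for this sketch).
    // DefaultSettings.php would ship it as null; if the wiki doesn't set it,
    // fall back to whatever $wgSessionsInMemcached says, since that's exactly
    // the case where PHP's session locking can't protect us.
    if ( $wgUploadStashInCache === null ) {
        $wgUploadStashInCache = $wgSessionsInMemcached;
    }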
(Leaving aside the memoization issue until I analyze it better.)

Yeah, I forgot to mention the requirement to then clean that DB table periodically (rough sketch below). It doesn't seem like a showstopper to me.

Sorry, I can't really type for long periods yet. I'd really like to experiment with a local memcached just to see if we can solve this problem in simpler ways. We pretty much have to do that sooner or later, or experiment with the live cluster, which is lame.
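A hedged sketch of what that periodic cleanup could look like, in the spirit of Block.php / SqlBagOStuff; the table name 'uploadstash' and column 'us_timestamp' are placeholders, not an actual schema:

    <?php
    // Delete stash rows older than a day; something like this would need to run
    // periodically (opportunistically on write, or from a maintenance script),
    // the way Block.php purges expired blocks.
    $dbw = wfGetDB( DB_MASTER );
    $cutoff = $dbw->timestamp( time() - 86400 );
    $dbw->delete(
        'uploadstash',
        array( 'us_timestamp < ' . $dbw->addQuotes( $cutoff ) ),
        __METHOD__
    );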
Progress report -- I have simulated the problem locally, and also uncovered an apparent issue where all uploads can complete successfully (from the wizard's point of view) yet there is no record of them in the UploadStash. Probably a concurrency issue again.
Okay, so, with some modifications and bugfixes, the proposed scheme does work. However...

- It also breaks compatibility with the standard session-based stash. The final step of UploadWizard uses the standard API to perform an UploadFromStash, so that step then doesn't work (see the sketch at the end of this comment).

- It eliminates our ability to list all stashed files, which is not that big a deal now, but will be in the future.

I have made a new branch (uploadstash-backend) for making a clean break with the old way of stashing files. We may use this temporarily, or we may go to a solution based on a database table, which is highly preferable (and was always anticipated anyway).
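For context, an illustrative sketch of the parameters that final step sends to action=upload to publish a stashed file; since the standard API looks the file up in the session-based stash, it can't find entries written to the cache instead. The parameter names are from memory of the current upload API, so treat 'sessionkey' in particular as an assumption, and the variables are placeholders:

    <?php
    // Illustrative only, not actual UploadWizard code.
    $params = array(
        'action'     => 'upload',
        'filename'   => 'Example.jpg',
        'sessionkey' => $stashKey,   // key handed back when the file was stashed
        'comment'    => 'Uploaded with UploadWizard',
        'token'      => $editToken,
    );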
In the meantime, the race condition could probably be mitigated by simply ensuring that uploads don't get created at around the same time: waiting for a response from the first creation before creating the second, and so on. I am reluctant to add this to the code, since then the upload processes on the frontend would have to know about each other, whereas previously they existed in splendid isolation. So let's defer this until the larger issue of how to store temporary upload data is solved.
deferred to sprint F
Well, every client-side technique I've tried to space out or otherwise 'quantize' the API calls either doesn't work well enough or is itself subject to local race conditions. The API caller would have to be in its own thread for this to be bulletproof. I can space things out so every API call is 2, 3, 4, or even 5 seconds apart, which seems to do the trick, but that just means that 5 files take unbearably long to upload -- almost worse than doing them serially.

It seems that the simplest way to solve this is to really solve the race condition on the database end. It was not all that hard to get something halfway working, although I have to consider whether I want the DB backend to look like the stash format or like the image tables. Adding a new table is more controversial, though I know some core committers are on board with the idea.
*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*
I've updated the UploadStash backend to use the database for storage instead of the session; see r92009.
I do recommend against using SMALLINT unnecessarily in the database schemas; it saves very little space and tends to introduce errors when you discover that something needs more than 16 bits after all... For width/height in particular, formats like PNG allow up to 2^31-1 for width or height, and huuuuuuuuge map or panoramic images *do* sometimes have sides bigger than 32k.
(In reply to comment #13)
> For width/height in particular, formats like PNG allow up to 2^31-1 for width
> or height, and huuuuuuuuge map or panoramic images *do* sometimes have sides
> bigger than 32k.

Good point. Fixed.
Neil and I tested with maxSimultaneousConnections=3, and it seems to work correctly now.