Last modified: 2013-10-04 15:18:07 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T56736, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 54736 - Inconsistency bug in file re-upload (possible race condition) potentially causing data loss
Inconsistency bug in file re-upload (possible race condition) potentially cau...
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
File management (Other open bugs)
1.22.0
All All
: High critical with 1 vote (vote)
: 1.22.0 release
Assigned To: Nobody - You can work on this!
:
Depends on:
Blocks: commons
  Show dependency treegraph
 
Reported: 2013-09-28 15:35 UTC by Bawolff (Brian Wolff)
Modified: 2013-10-04 15:18 UTC (History)
9 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Bawolff (Brian Wolff) 2013-09-28 15:35:14 UTC
Denniss pointed me to https://commons.wikimedia.org/wiki/?curid=28611474

Basically, the original version of the file is missing (oi_archive_name empty string). On top of that, the two old versions have the exact same upload timestamp down to the second, which seems unlikely to be correct (unless both uploaded by uploadwizard at same time, using chunked upload or upload stash maybe?).

The img_metadata looks similar, except orientation field is different, the files have different sha1's and different sizes, which would support the hypothesis that user saw file looked wrong, rotated in an image manipulation program, and re-uploaded (however that doesn't follow with the timestamp).

So weird things happening here:
*Original file is gone
*The second version has the exact same img_timestamp as the first version
**Furthermore log_timestamps are the same too. The log_ids are such that it appears the timestamps are correct
*page_id in the log entry for the second version, is incorrect (It is 0, which could indicate something wrong, perhaps with slave lag. Normally the logging code does a db query by itself to determine what log_page to use, and I believe it hits a slave for that)
*There is no dummy edit to the file description page for the first re-upload
*Hard to be sure, but it appears the original file was visible for a long time (At least 23 hours after the second version was uploaded, as this was when it was tagged as needing rotation). Then rotate bot comes along, rotates things, but it rotates the wrong 2nd version, which is already rotated
**Possibly caused by a race condition with varnishes - the re-upload sends cache purges but they happen before the first image is rendered, and the thumbnail gets stuck in varnish cache. (I think the issue where the original file is gone, is more important than the thumbnails stuck in cache, assuming that is what is happening here)

Possible theory: User uses upload wizard to upload all the files in some folder. This person accidentally selects the non-rotated version, and then later selects the rotated version, in the same batch, and puts both as they should have the same filename. Upload wizard uploads both, and tries to publish both to the same file name. Since the long part of the upload already has taken place (since its stashed), only the quick publish step has to happen, which can happen at the same time. Race conditions in filebackend causes data loss of original file.
Comment 1 Bawolff (Brian Wolff) 2013-09-28 15:40:52 UTC
p.s. Denniss's comment at https://commons.wikimedia.org/w/index.php?title=User_talk:Bawolff&oldid=105637956#Storage.2FBackend_issues.3F suggests that this is not just a one-off issue. I'll repeat his original comment here for the record:

 "While monitoring the Rotatebot log I often come around images with storage issues; the first image revision is invalid but not completely broken as the second working image seems to gather wrong rotation info from it, so does Rotatebot and in the end the images gets rotated wrong. Deleting the file and restoring the first working image (usually the first but sometimes the second revision is shown as invalid) shows in most cases an image without rotation problem. Do you have an idea why this data corruption happens? Almost all of these image have the same storage time recorded, maximum one minute difference between the two. Example here. --Denniss (talk) 09:45, 28 September 2013 (UTC)"
Comment 2 Platonides 2013-09-28 15:57:28 UTC
I think the storage problem for two revision of the same file uploaded at the same time is an old, known one. (but should be fixed anyway)

The "old-version shown 23 hours later" looks like a cache problem (not strange given that the two versions had identical timestamps, to the second)

The UW is a good theory.
Comment 3 Denniss 2013-09-28 23:34:14 UTC
New file to check: https://commons.wikimedia.org/wiki/File:Cuenca,_casas_colgantes.JPG

First image version invalid, second working with thumb upright but fullres rotated. Uploader requested rotation ~100 mins after upload and rotated image is off by 90 degrees.
Comment 4 Bawolff (Brian Wolff) 2013-09-29 00:04:45 UTC
Confirmed, if you upload two separate files with the uploadwizard that you say have the same title, bad things happen ( https://test.wikipedia.org/wiki/File:BDavis-test-del.jpg ).

Suggested work around for the near term: Have uploadwizard bail out in this situation. Its very unlikely if the user is doing this, that they really want the file. Filed as bug 54750.

Additionally we should actually fix the race condition.
Comment 5 Aaron Schulz 2013-10-01 19:12:17 UTC
(In reply to comment #4)
> Confirmed, if you upload two separate files with the uploadwizard that you
> say
> have the same title, bad things happen (
> https://test.wikipedia.org/wiki/File:BDavis-test-del.jpg ).
> 

At the same time or far apart too?
Comment 6 Bawolff (Brian Wolff) 2013-10-01 19:56:43 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > Confirmed, if you upload two separate files with the uploadwizard that you
> > say
> > have the same title, bad things happen (
> > https://test.wikipedia.org/wiki/File:BDavis-test-del.jpg ).
> > 
> 
> At the same time or far apart too?

At the same time.
Comment 7 Gerrit Notification Bot 2013-10-01 20:54:39 UTC
Change 87010 had a related patch set uploaded by Aaron Schulz:
Make sure LocalFile::lock() works for brand new uploads

https://gerrit.wikimedia.org/r/87010
Comment 8 Gerrit Notification Bot 2013-10-01 23:57:57 UTC
Change 87010 merged by jenkins-bot:
Make sure LocalFile::lock() works for brand new uploads

https://gerrit.wikimedia.org/r/87010
Comment 9 Andre Klapper 2013-10-03 16:21:49 UTC
Patch got merged, can this be closed as RESOLVED FIXED?
Comment 10 Bawolff (Brian Wolff) 2013-10-04 11:19:01 UTC
Patch is live on test. Someone should do the steps to reproduce to make sure bug is actually fixed. (I can't at the moment)
Comment 11 Bawolff (Brian Wolff) 2013-10-04 15:12:00 UTC
(In reply to comment #10)
> Patch is live on test. Someone should do the steps to reproduce to make sure
> bug is actually fixed. (I can't at the moment)

Tested, it works fine now, both files succesfully uploaded


---

However, the API publish request (the action=upload&filekey=foo&... request) for one of the files does time out, eventually returning a squid error "Request: POST http://test.wikipedia.org/w/api.php, from 208.80.154.9 via cp1010.eqiad.wmnet (squid/2.7.STABLE9) to 10.64.0.123 (10.64.0.123)<br/>
Error: ERR_READ_TIMEOUT, errno [No Error] at Fri, 04 Oct 2013 14:52:03 GMT"

Interestingly enough, the timestamp from that error is precisely when in the log the second version of the file is uploaded (14:52:03), which is kind of odd coincidence if it was just uploading both in sequence took longer than the squid timeout. I wonder if some lock isn't being released for the second one until after the user disconnects.

Anyhow, calling this bug fixed. The second part might warrant further investigation, but is a much lower priority issue - splitting to bug 54978

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links