Last modified: 2014-06-09 02:40:08 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T34551, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 32551 - Descriptionless files (Missing page_latest referential integrity issue)
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Database
Version: 1.23.0
Hardware: All All
Importance: High critical with 3 votes
Target Milestone: ---
Assigned To: Gilles Dubuc
URL: http://commons.wikimedia.org/w/index....
Keywords: testme
Duplicates: 40178 60205 61898
Depends on:
Blocks: 39094

Reported: 2011-11-21 18:40 UTC by Platonides
Modified: 2014-06-09 02:40 UTC
CC List: 31 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---



Description Platonides 2011-11-21 18:40:41 UTC
A number of files appeared on Commons on the 15th with no description (they may or may not have a page entry)
http://commons.wikimedia.org/w/index.php?title=Commons:Administrators%27_noticeboard&oldid=62820408#System_problems

select img_name, img_timestamp, rev_timestamp, page_latest, rev_timestamp - img_timestamp
from image
left join page on (page_title = img_name and page_namespace = 6)
left join revision on (rev_page = page_id and rev_parent_id = 0)
where img_timestamp >= '20111114233531' limit 7;

+------------------------------------------------+----------------+----------------+-------------+-------------------------------+
| img_name                                       | img_timestamp  | rev_timestamp  | page_latest | rev_timestamp - img_timestamp |
+------------------------------------------------+----------------+----------------+-------------+-------------------------------+
| La_playAaAa.png                                | 20111114233642 | 20111120142847 |    62765918 |                       5909205 | (description created later)
| Bruxelles_Java_Masque_Wayang_02_10_2011_06.jpg | 20111114233723 | NULL           |           0 |                          NULL |
| Raonic_and_Youzhny.jpg                         | 20111114233732 | NULL           |        NULL |                          NULL |
| Darcy's_gravesite_in_SE_Dijon_(France).jpg     | 20111114233911 | 20111120142313 |    62765628 |                       5908402 | (description created later)
| Defeater-lost-ground-album-cover.jpg           | 20111114233944 | NULL           |           0 |                          NULL |
| Jaguar_Sentado.jpg                             | 20111114234008 | NULL           |        NULL |                          NULL |
| Phil_Bryant.jpg                                | 20111114234008 | NULL           |        NULL |                          NULL |
+------------------------------------------------+----------------+----------------+-------------+-------------------------------+


The timing matches with the update of external storage servers https://blog.wikimedia.org/2011/11/18/nobody-notices-when-its-not-broken-new-database-servers-deployed/

I think this is due to the image upload nested transactions, which break the outer transactions. See http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/49500

When the servers were switched, the second half of the transaction for those files would have failed and been only half rolled back.

We have already suffered this kind of failure in the past; see bug 15430, bug 20744, bug 24978. It seems that roughly every 12 months we suffer one of these. It should finally be fixed.
Comment 1 Ariel T. Glenn 2011-11-21 19:05:52 UTC
See http://wikitech.wikimedia.org/view/Server_admin_log#November_14 and you can see the times match exactly.

I guess we could disable uploads during such maintenance.
Comment 2 Brion Vibber 2011-11-21 20:43:05 UTC
Was that the External Storage switchover with the read-only period?

Best is to make sure that doesn't happen by preparing the maintenance so there's still a read/write cluster available to accept saves.


In the case that we get stuck with it anyway, it sounds like there's some bad error handling somewhere -- ideally the upload should end up getting rejected and rolled back if the File: description page edit fails. Oooh yeah, nested transactions, not sure that works.
Comment 3 Platonides 2011-11-21 20:54:38 UTC
Look at the wikitech thread. A second begin() does an implicit commit (with some backends, including mysql) so it's quite untransactional. We start a transaction, the file classes use the db for lock, then doEditPage() has a begin() commit() too...
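
To make the mechanism concrete, here is a minimal sketch in Python with sqlite3. It is not MediaWiki code: the `ImplicitCommitDb` wrapper, the `upload` function and the two-column tables are hypothetical stand-ins, and the implicit commit is simulated by hand. It only shows how a second begin() that commits the outer transaction can leave a page row behind when the revision insert then fails.

```python
import sqlite3


def make_db():
    # isolation_level=None: autocommit mode, so BEGIN/COMMIT/ROLLBACK are explicit.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT, page_latest INTEGER)"
    )
    conn.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER)")
    return conn


class ImplicitCommitDb:
    """Hypothetical wrapper: a second begin() silently commits the open transaction."""

    def __init__(self, conn):
        self.conn = conn
        self.in_trx = False

    def begin(self):
        if self.in_trx:
            self.conn.execute("COMMIT")  # simulated implicit commit of the outer transaction
        self.conn.execute("BEGIN")
        self.in_trx = True

    def commit(self):
        self.conn.execute("COMMIT")
        self.in_trx = False

    def rollback(self):
        self.conn.execute("ROLLBACK")
        self.in_trx = False


def upload(db, title, fail_revision):
    db.begin()  # outer transaction started by the upload
    db.conn.execute("INSERT INTO page (page_title, page_latest) VALUES (?, 0)", (title,))
    try:
        db.begin()  # inner begin() from the page edit: commits the page row above
        if fail_revision:
            raise RuntimeError("simulated storage failure")
        cur = db.conn.execute(
            "INSERT INTO revision (rev_page) "
            "VALUES ((SELECT page_id FROM page WHERE page_title = ?))",
            (title,),
        )
        db.conn.execute(
            "UPDATE page SET page_latest = ? WHERE page_title = ?", (cur.lastrowid, title)
        )
        db.commit()
    except RuntimeError:
        db.rollback()  # can only undo the inner half


conn = make_db()
upload(ImplicitCommitDb(conn), "Example.jpg", fail_revision=True)
print(conn.execute("SELECT page_title, page_latest FROM page").fetchall())
```

In this toy run the page row survives the rollback with page_latest = 0 and no revision row exists, which is the same shape as the broken rows in the description.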
Comment 5 Paul Kaganer 2011-12-26 18:08:10 UTC
Any progress on resolving this bug?
Comment 6 Mark A. Hershberger 2011-12-27 15:33:32 UTC
(In reply to comment #3)
> Look at the wikitech thread. A second begin() does an implicit commit (with
> some backends, including mysql) so it's quite untransactional. We start a
> transaction, the file classes use the db for lock, then doEditPage() has a
> begin() commit() too...

See bug #27283 for this one
Comment 7 Yann Forget 2012-02-14 16:28:45 UTC
And also http://commons.wikimedia.org/wiki/File%3APOSTERMENDOZA.JPG
Comment 8 Paul Kaganer 2012-02-28 08:45:36 UTC
Any progress on resolving this bug? Please find and fix all of the descriptionless files (and publish a list of them) by creating temporary file info pages that can be filled in manually later.
Comment 9 Paul Kaganer 2012-03-03 21:47:00 UTC
See also the other side of this same problem: files whose first version does not exist. https://bugzilla.wikimedia.org/show_bug.cgi?id=34934
Comment 10 Paul Kaganer 2012-06-28 12:31:35 UTC
Any progress on resolving this bug?
Comment 12 Foroa 2012-08-07 09:59:14 UTC
And in the last few months, new files have been added to the list. Is there a structural way to identify and repair them? Thumbnails and external references seem to remain intact.
Comment 13 Paul Kaganer 2012-10-25 15:03:51 UTC
It will soon be a year since this bug was opened. Has it at least been possible to determine the cause of the problem?
Comment 14 Andre Klapper 2012-10-25 18:30:37 UTC
The cause is likely explained in comment 2.
Comment 15 Foroa 2012-10-26 09:14:26 UTC
It seems impossible to edit/categorise the "repaired files",
 including http://commons.wikimedia.org/wiki/File:Two-stroke_engine_moving_parts_(Montagu,_Cars_and_Motor-Cycles,_1928).svg

Some of the remaining older ones: http://commons.wikimedia.org/wiki/File:POSTERMENDOZA.JPG
http://commons.wikimedia.org/wiki/File:Ogeret.JPG

What should we do with the faulty files we detect from now on?
Comment 16 Platonides 2012-10-28 16:58:02 UTC
Interesting, trying to create the page results in an edit conflict. This is new.

These three pages have a page entry, but no associated revision (and page_latest = 0).

There are 16 images like this:
mysql> select page_namespace,page_title from page where page_latest = 0;
+----------------+---------------------------------------------------------------------------+
| page_namespace | page_title                                                                |
+----------------+---------------------------------------------------------------------------+
|              6 | Bruxelles_Java_Masque_Wayang_02_10_2011_06.jpg                            |
|              6 | POSTERMENDOZA.JPG                                                         |
|              6 | Ogeret.JPG                                                                |
|              6 | SANTA_MARIA_DE_PUIG-AGUILAR_-_7.JPG                                       |
|              6 | Luftaufnahmen_Nordseekueste_2012_05_D50_by-RaBoe_066.jpg                  |
|              6 | Kit_body_brugge1314.png                                                   |
|              6 | Kit_left_arm_union1314.png                                                |
|              6 | Kit_body_union1314.png                                                    |
|              6 | Kit_right_arm_union1314.png                                               |
|              6 | Luftaufnahmen_Nordseekueste_2012_05_D50_by-RaBoe_067.jpg                  |
|              6 | Two-stroke_engine_moving_parts_(Montagu,_Cars_and_Motor-Cycles,_1928).svg |
|              6 | Pastel_Raymond_Martin_2.jpg                                               |
|              6 | Coxeter_diagram_finite_rank4_correspondence.png                           |
|              6 | USS_Tempest_PC-2_Crest.png                                                |
|              6 | RUS-CZE_2012-06-08_pl.svg                                                 |
|              6 | Mérite_national_chevalier_FRANCE.jpg                                      |
+----------------+---------------------------------------------------------------------------+
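
A slightly broader check than page_latest = 0 is to look for any page whose page_latest points at no revision at all. The sketch below is only an illustration in Python with sqlite3: the two tables are a hypothetical cut-down schema, and 'Healthy.jpg' is a made-up control row next to one title from the list above.

```python
import sqlite3


def find_broken_pages(conn):
    """Return (namespace, title) for pages whose page_latest matches no revision."""
    return conn.execute(
        """
        SELECT p.page_namespace, p.page_title
        FROM page AS p
        LEFT JOIN revision AS r ON r.rev_id = p.page_latest
        WHERE r.rev_id IS NULL
        ORDER BY p.page_title
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE page (
        page_id INTEGER PRIMARY KEY,
        page_namespace INTEGER,
        page_title TEXT,
        page_latest INTEGER
    );
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
    INSERT INTO page VALUES (1, 6, 'Healthy.jpg', 10);  -- hypothetical intact page
    INSERT INTO page VALUES (2, 6, 'Ogeret.JPG', 0);    -- page_latest = 0, no revision
    INSERT INTO revision VALUES (10, 1);
    """
)
print(find_broken_pages(conn))
```

The anti-join also catches a page_latest that is non-zero but dangling, which a plain page_latest = 0 filter would miss.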
Comment 17 Foroa 2013-04-15 13:12:31 UTC
Obviously a low priority item: 17 months now.
Comment 18 Jarek Tuszynski 2013-05-24 11:55:50 UTC
I deleted and reuploaded Bruxelles_Java_Masque_Wayang_02_10_2011_06.jpg while trying to fix the impossible-to-edit file with no license. Should we do the same with the rest?
Comment 19 Foroa 2013-05-24 14:33:00 UTC
Strange, as the picture is in the system and can be used on wikipedias, such as http://commons.wikimedia.org/wiki/File:POSTERMENDOZA.JPG . 

There are several more in http://commons.wikimedia.org/wiki/Special:UnusedFiles and http://commons.wikimedia.org/wiki/Special:UncategorizedFiles.
Comment 20 Bawolff (Brian Wolff) 2014-02-19 01:55:18 UTC
There are a significant number (~700 a month) of files where, during upload, an entry in the page table is created but page_latest = 0 (and log_page also seems to be 0). If we just count upload log entries with log_page = 0 (which seems to be symptomatic of the issue, though I haven't 100% verified that it can't happen in other situations), we get:

MariaDB [commonswiki_p]> select substr( log_timestamp, 1, 6) as YYYYMM, count(*) as 'uploads missing log_page' from logging_logindex  where log_type='upload' and log_timestamp > '20121000000000' and log_page = 0 group by substr( log_timestamp, 1, 6);
+--------+--------------------------+
| YYYYMM | uploads missing log_page |
+--------+--------------------------+
| 201210 |                     1323 |
| 201211 |                      572 |
| 201212 |                      624 |
| 201301 |                     1179 |
| 201302 |                      666 |
| 201303 |                      762 |
| 201304 |                      955 |
| 201305 |                      637 |
| 201306 |                      598 |
| 201307 |                      555 |
| 201308 |                      818 |
| 201309 |                     3268 |
| 201310 |                      808 |
| 201311 |                      852 |
| 201312 |                      806 |
| 201401 |                     1116 |
| 201402 |                      435 |
+--------+--------------------------+
17 rows in set (7.08 sec)


This is not a good thing.
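
As a quick sanity check on the table, the monthly counts can be summarised with a few lines of Python. The numbers are copied from the query output above; nothing else is assumed, except that 201402 is a partial month because the query was run on 2014-02-19.

```python
from statistics import mean, median

# Monthly counts of upload log entries with log_page = 0, copied from the table above.
monthly = {
    "201210": 1323, "201211": 572, "201212": 624, "201301": 1179,
    "201302": 666, "201303": 762, "201304": 955, "201305": 637,
    "201306": 598, "201307": 555, "201308": 818, "201309": 3268,
    "201310": 808, "201311": 852, "201312": 806, "201401": 1116,
    "201402": 435,
}

total = sum(monthly.values())
typical = median(monthly.values())
average = mean(monthly.values())
peak_month = max(monthly, key=monthly.get)

print(f"total={total} median={typical} mean={average:.0f} peak={peak_month}")
```

For these 17 rows the total is 15974 and the median is 806; the mean is pulled up by the 201309 spike.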
Comment 21 Marius Hoch 2014-02-23 13:12:58 UTC
*** Bug 60205 has been marked as a duplicate of this bug. ***
Comment 22 Marius Hoch 2014-02-23 13:44:34 UTC
Increasing priority: This is starting to cause noticeable problems on Commons
Comment 23 Rayson Ho 2014-02-23 19:27:36 UTC
This bug also affected my file (CASCON2013-2.JPG on Commons) and I was trying to edit the corrupted page but it didn't save.

This bug only affected 1 of my uploads thus far (I've uploaded 1000+ files onto commons), but I voted the importance up as I think the bug can cause major issues.

Rayson

(P.S. This is Rayson from the Open Grid Scheduler / Grid Engine project: http://gridscheduler.sourceforge.net/ - besides free software, we also contribute to free media repos like Wiki Commons.)
Comment 24 Andre Klapper 2014-02-23 21:13:32 UTC
(In reply to Marius Hoch from comment #22)
> Increasing priority: This is starting to cause noticeable problems on Commons

Aaron / Sean: Could somebody take a look at this, please?
Comment 25 magog.the.ogre 2014-02-24 00:11:55 UTC
Is the problem related to this? I see no history on the following files on Commons:

File:Admission ticket to Benjamin Rush lecture.jpg
File:Admission ticket to John Morgan lecture.jpg
File:Admission ticket to Caspar Wistar lecture.jpg
File:Papilio anchisiades, mating.jpg
Comment 26 Marius Hoch 2014-02-24 00:29:24 UTC
(In reply to magog.the.ogre from comment #25)
> Is the problem related to this? I see no history on the following files on
> Commons:
> 
> File:Admission ticket to Benjamin Rush lecture.jpg
> File:Admission ticket to John Morgan lecture.jpg
> File:Admission ticket to Caspar Wistar lecture.jpg
> File:Papilio anchisiades, mating.jpg

All these files have (as far as I can see) a valid history and a valid value in page_latest. They are just messed-up redirects... So no, that's not related.
Comment 27 Marius Hoch 2014-02-28 19:54:10 UTC
*** Bug 61898 has been marked as a duplicate of this bug. ***
Comment 28 Marius Hoch 2014-02-28 19:58:03 UTC
Raising priority again (per duplicate bug and ongoing issues)
Comment 29 Andre Klapper 2014-02-28 22:05:56 UTC
Note: Dup bug 61898 comment 4 has some investigations by Sean.
Comment 30 Bawolff (Brian Wolff) 2014-03-01 00:50:08 UTC
(In reply to Andre Klapper from comment #29)
> Note: Dup bug 61898 comment 4 has some investigations by Sean.

Could you change its component to not be security so we can see? (assuming nothing sensitive is there. This hardly seems like a security issue, but obviously it's hard to make that judgement without knowing what's on the bug)
Comment 31 Greg Grossmeier 2014-03-03 15:52:32 UTC
(In reply to Bawolff (Brian Wolff) from comment #30)
> (In reply to Andre Klapper from comment #29)
> > Note: Dup bug 61898 comment 4 has some investigations by Sean.
> 
> Could you change its component to not be security so we can see? (assuming
> nothing sensitive is there. This hardly seems like a security issue, but
> obviously it's hard to make that judgement without knowing what's on the bug)

(Using Chatham House Rules ;) )

The reasoning for filing it under Security was given in bug 61898 comment 0:

"I'm filing this under Security to keep it on a low radar. They're innocent to keep around, but in my experience filing these publicly may cause them to disappear at some point when too many people spread the link without proper context or try to poke at it."

No comment on the validity of the reasoning, but there are URLs in that bug that are not on this one.
Comment 32 Rob Lanphier 2014-03-20 21:58:55 UTC
Hi Gilles, could you insert this into the planning process for the next sprint?
Comment 33 Gerrit Notification Bot 2014-03-21 06:02:49 UTC
Change 119932 had a related patch set uploaded by Brian Wolff:
Rollback transactions if job throws an exception.

https://gerrit.wikimedia.org/r/119932
Comment 34 Bawolff (Brian Wolff) 2014-03-21 06:37:48 UTC
I was finally able to reproduce this.

Steps to reproduce:
*Make sure $wgEnableAsyncUploads = true;
*Artificially make it so that Revision::insertOn throws some sort of exception. My current theory is that maybe a very intermittent issue with the external store dbs is triggering the "Unable to store text to external storage" exception in ExternalStore::insertWithFallback. That's just a random guess based on its proximity to approximately where the db inserts fail. Could be totally wrong. Anyway, it's probably somewhere around that block of code. Someone with access to the job queue log could confirm if there are any common exceptions in the log, and look at the results of the upload job for the various example files in question.
*Upload an image using the api and stashed upload, making sure that the async option is specified.
*Do runJobs.php --type PublishStashedFile to publish the file. This should return an error from the exception triggered via step 2

Actual behaviour:
*Locally (I don't know about commons), the upload api does correctly report a stash failure error to the user. Which is as it should be
*Image does not show up until the page is purged (action=purge) or ~24 hours have passed, as the negative entry in file memcache is not cleared
*No RC entry (Although I did notice a gap in the rc_id field which is kind of weird, maybe coincidence)
*No dummy edit in page history
*Log entry is missing log_page
*Page table entry is inserted in DB, however page_latest is set to 0, which is a referential integrity violation and should never happen [The big bad of this bug]
*Page contents is missing, and replaced with missing revision error. Page cannot be edited (Get edit conflict warning). Only way to fix page is to re-upload a new image over it or delete and undelete the page.



Expected behaviour:
*Well you know, upload just works properly ;)
*Referential integrity should never ever be violated. There should never be an uneditable page.
*Obviously a better behaviour would be for the page to be blank instead of "broken", however that's probably still a "bad" behaviour. Ideally we would want the user's text to always be on the page. It's unclear to me if we would prefer that the image totally go away if we can't edit the page properly (ie the operation be totally atomic), or if the current behaviour of the file being saved is preferable (probably. The less data loss the better).
*Cache should be cleared so that images don't randomly appear 24 hours after the fact.
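
The "fully atomic" option from the list above can be sketched as follows. This is a Python/sqlite3 illustration only, not the MediaWiki code path: `create_file_page`, `failing_store` and the cut-down tables are hypothetical, and the text store is reduced to a callable that may raise. The point is that the page row and its first revision are written in one transaction, so a failure leaves neither behind.

```python
import sqlite3


def open_db():
    # Autocommit mode so the transaction boundaries below are explicit.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT, page_latest INTEGER)"
    )
    conn.execute(
        "CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER, rev_text TEXT)"
    )
    return conn


def failing_store(text):
    # Stand-in for a text store that is temporarily unavailable.
    raise RuntimeError("Unable to store text to external storage")


def create_file_page(conn, title, text, store_text):
    """Insert the page and its first revision together, or nothing at all."""
    conn.execute("BEGIN")
    try:
        cur = conn.execute(
            "INSERT INTO page (page_title, page_latest) VALUES (?, 0)", (title,)
        )
        page_id = cur.lastrowid
        stored = store_text(text)  # may raise
        cur = conn.execute(
            "INSERT INTO revision (rev_page, rev_text) VALUES (?, ?)", (page_id, stored)
        )
        conn.execute(
            "UPDATE page SET page_latest = ? WHERE page_id = ?", (cur.lastrowid, page_id)
        )
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")  # the page row disappears with the failed revision
        return False


conn = open_db()
print(create_file_page(conn, "Example.jpg", "description", failing_store))
print(conn.execute("SELECT COUNT(*) FROM page").fetchone()[0])
print(create_file_page(conn, "Example.jpg", "description", lambda text: text))
```

Whether the uploaded file itself should also be discarded in the failure case is the open question raised above; this sketch only covers the page and revision rows.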

(In reply to Gerrit Notification Bot from comment #33)
> Change 119932 had a related patch set uploaded by Brian Wolff:
> Rollback transactions if job throws an exception.
> 
> https://gerrit.wikimedia.org/r/119932

If I'm correct about the cause (which as it stands is just a guess), and the patch actually works, what it will cause is that pages experiencing this bug won't be in a "broken" state (ref integrity failure), but just be blank, so users can create them later as they please. This is not a full fix to the issue; we still need to figure out what situation is causing the issue, and reduce its occurrence (Could someone look at job queue logs for one of the example files like File:ContentLanguage.svg if logs go back that far, and see what the last error was?). We also may want to try to catch whatever exception is happening, and retry the operation, if we're assuming it's a very intermittent issue.

> 
> The reasoning for filing it under Security was given in bug 61898 comment 0:
> 
> "I'm filing this under Security to keep it on a low radar. They're innocent
> to keep around, but in my experience filing these publicly may cause them to
> disappear at some point when too many people spread the link without proper
> context or try to poke at it."
> 
> No comment on the validity of the reasoning, but there are URLs in that bug
> that are not on this one.

That seems unnecessary given how often we're getting new examples - Most recently [[commons:File:ContentLanguage.svg]] which was uploaded a week ago (Earlier we seemed to be getting new examples every day). Commons admins also have db access via tool labs and are more than capable of getting a list of all affected files themselves if they were so inclined.
Comment 35 Sean Pringle 2014-03-25 12:22:27 UTC
Possible that bug 63058 is related. Unproven, but that one was certainly blocking some commits on commonswiki today.
Comment 36 Bawolff (Brian Wolff) 2014-03-25 15:02:17 UTC
(In reply to Sean Pringle from comment #35)
> Possible that bug 63058 is related. Unproven, but that one was certainly
> blocking some commits on commonswiki today.

Blocking the commit wouldn't cause this since (from what I understand) that would cause a rollback once PHP timed out. This bug needs to have a COMMIT issued in the middle of a transaction to cause an inconsistent state.
Comment 37 Bawolff (Brian Wolff) 2014-04-05 21:12:03 UTC
*** Bug 40178 has been marked as a duplicate of this bug. ***
Comment 38 Gerrit Notification Bot 2014-04-06 01:18:21 UTC
Change 124135 had a related patch set uploaded by Brian Wolff:
Make chunked upload jobs robust in face of exceptions.

https://gerrit.wikimedia.org/r/124135
Comment 39 Gerrit Notification Bot 2014-04-06 01:53:00 UTC
Change 124136 had a related patch set uploaded by Brian Wolff:
When uploading a new file, save to memcached directly after commit

https://gerrit.wikimedia.org/r/124136
Comment 40 Gerrit Notification Bot 2014-04-06 02:22:53 UTC
Change 124137 had a related patch set uploaded by Brian Wolff:
Make doEditContent call $dbw->rollback() if exception happens

https://gerrit.wikimedia.org/r/124137
Comment 41 Bawolff (Brian Wolff) 2014-04-06 03:50:50 UTC
So the patch in comment 38 or the patch in comment 40 will stop the "Missing revision #0 error" on new files. Instead when that situation happens, it will cause the page to just not be created, which is an improvement over the page becoming corrupt. (Either patch will fix the issue. Both together make things even more robust)

The patch in comment 39 would make it so that when this issue happens, the image appears immediately instead of 24 hours later due to stale negative memcache entry.
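
The stale negative cache entry can be illustrated with a toy model. Plain dicts stand in for memcached and the image table, expiry is ignored, and `lookup` and `publish` are hypothetical names, so this is only a sketch of the idea of writing the cache entry right after the commit.

```python
MISSING = object()  # sentinel for a cached "file does not exist" answer


def lookup(cache, db, name):
    """Return file info, consulting the cache first and caching misses too."""
    cached = cache.get(name)
    if cached is MISSING:
        return None  # negative entry: the database is not consulted
    if cached is not None:
        return cached
    row = db.get(name)
    cache[name] = MISSING if row is None else row
    return row


def publish(cache, db, name, info, refresh_cache):
    db[name] = info  # stands in for the committed upload
    if refresh_cache:
        cache[name] = info  # overwrite any negative entry straight after the commit


cache, db = {}, {}
print(lookup(cache, db, "New.jpg"))   # miss, and the miss is cached
publish(cache, db, "New.jpg", {"size": 1}, refresh_cache=False)
print(lookup(cache, db, "New.jpg"))   # still a miss: stale negative entry
publish(cache, db, "New.jpg", {"size": 1}, refresh_cache=True)
print(lookup(cache, db, "New.jpg"))   # now visible immediately
```

In the real system the negative entry would eventually expire, which matches the roughly 24 hour delay described above.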

So what remains to be done for this bug (other than review of those patches):
*Figure out root cause of edits not going through
**Most recent example (for a file) is [[commons:File:Stephens_college_at_the_corner_of_Waugh_and_Broadway_.jpg]] uploaded at approximately 2014-03-30 00:24:12. It would be helpful if someone who had access to job queue logs could grep through for a PublishStashedFile job at about that time, for that title, and see what the job's error output is.

*Figure out if there is something that should be done about root cause (which really depends on what it is)

*Figure out if it makes sense in the face of an edit failure to simply try and doEditContent() a second time(?) Or if there is some other behaviour that makes sense to do in the case where we try to edit a page and it doesn't work.
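
If retrying turns out to make sense, the shape would be something like the sketch below. It is a generic Python helper, not a proposal for the actual doEditContent() call site; `with_retry` and `flaky_edit` are hypothetical, and it assumes the failed attempt has already been rolled back cleanly before the next one starts.

```python
def with_retry(operation, attempts=2):
    """Run operation(), retrying on any exception up to `attempts` times in total."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as error:  # a real caller would catch a narrower type
            last_error = error
    raise last_error


calls = {"count": 0}


def flaky_edit():
    # Hypothetical edit that fails on the first call only.
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("intermittent failure")
    return "saved"


print(with_retry(flaky_edit))
```

A retry only helps if the underlying failure really is intermittent, which is still unconfirmed at this point in the thread.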
Comment 42 Gerrit Notification Bot 2014-04-06 09:12:05 UTC
Change 124137 merged by jenkins-bot:
Make doEditContent call $dbw->rollback() if exception happens

https://gerrit.wikimedia.org/r/124137
Comment 43 Gerrit Notification Bot 2014-04-06 09:26:06 UTC
Change 124136 merged by jenkins-bot:
When uploading a new file, save to memcached directly after commit

https://gerrit.wikimedia.org/r/124136
Comment 44 Gerrit Notification Bot 2014-04-06 09:26:13 UTC
Change 124135 merged by jenkins-bot:
Make chunked upload jobs robust in face of exceptions.

https://gerrit.wikimedia.org/r/124135
Comment 45 Bawolff (Brian Wolff) 2014-04-09 01:46:26 UTC
Resetting to NEW for now, until we confirm whether this is fixed or not.

Aaron tried to look up [[commons:File:Rancho_Camulos_National_Historic_Landmark_Plaque.jpg]] in the jobqueue log for me, but it wasn't in the log. Which first of all confused me, and then caused me to look at UploadWizard code, and realize it only uploads via job queue if file is > 10 MB. So I guess a lot of the errors weren't coming from exceptions thrown by the upload job like I previously thought, but perhaps just directly through the api (Although perhaps a small number came through job classes).

Which means that anomie's patch dc7d342d93b12 (which was deployed to commons today, April 8 at 21:12), may actually fix this issue. The patch in comment 42 (not yet deployed) may also equally fix the issue. Anyways, I guess we should watch to see if these files stop occurring. The most recent example I can find is from today at 17:20 ( [[commons:file:ABS-3401.0-OverseasArrivalsDeparturesAustralia-ShorttermMovementResidentDepartures_SelectedDestinations-Trend-NumberMovements-Cambodia-A83808853V.svg]] ), so, so far so good. Not that that means much yet as in the past there have been gaps of a week between such files sometimes.

----

There's also the separate issue of what's causing the underlying error. There are still going to be files without a description page; they just won't have the "revision 0 is missing" error any more. We should figure out what's up with that. Perhaps that should be split to a separate bug (?)
Comment 46 Andre Klapper 2014-05-05 16:23:47 UTC
Platonides, Paul, Yann, Trijnstel: Do you know if this has been still a problem in the last three weeks?
Comment 47 Steinsplitter 2014-05-06 16:08:25 UTC
(In reply to Andre Klapper from comment #46)
> Platonides, Paul, Yann, Trijnstel: Do you know if this has been still a
> problem in the last three weeks?

Unfortunately yes.
Comment 48 Bawolff (Brian Wolff) 2014-05-14 22:58:23 UTC
(In reply to Steinsplitter from comment #47)
> (In reply to Andre Klapper from comment #46)
> > Platonides, Paul, Yann, Trijnstel: Do you know if this has been still a
> > problem in the last three weeks?
> 
> Unfortunately yes.

I haven't been able to find any recent examples in the db. Have there been any new files with this issue in, say, the last month (or even in the later part of April)? If so, please include examples.
Comment 49 Andre Klapper 2014-05-25 15:14:40 UTC
(In reply to Bawolff (Brian Wolff) from comment #48)
> I haven't been able to find any recent examples in the db. Have there been
> any new files with this issue in, say, the last month (or even in the later
> part of April)? If so, please include examples.

Replies welcome.
Comment 50 Andre Klapper 2014-06-05 14:45:19 UTC
Reducing priority until somebody can provide recent examples. 
See comment 48, comment 49.
Comment 51 Bawolff (Brian Wolff) 2014-06-05 14:56:57 UTC
Calling fixed. I think comment 44 + Anomie's api patch fixed the issue. If anyone encounters more examples please speak up.
Comment 52 Bawolff (Brian Wolff) 2014-06-09 02:40:08 UTC
btw, splitting off larger issue of sometimes no description gets posted, to bug 66355.
