Last modified: 2014-06-09 02:40:08 UTC
A number of files appeared on Commons on the 15th with no description (they may or may not have a page entry): http://commons.wikimedia.org/w/index.php?title=Commons:Administrators%27_noticeboard&oldid=62820408#System_problems

select img_name, img_timestamp, rev_timestamp, page_latest, rev_timestamp - img_timestamp
from image
left join page on (page_title = img_name and page_namespace = 6)
left join revision on (rev_page = page_id and rev_parent_id = 0)
where img_timestamp >= '20111114233531'
limit 7;

+------------------------------------------------+----------------+----------------+-------------+-------------------------------+
| img_name                                       | img_timestamp  | rev_timestamp  | page_latest | rev_timestamp - img_timestamp |
+------------------------------------------------+----------------+----------------+-------------+-------------------------------+
| La_playAaAa.png                                | 20111114233642 | 20111120142847 |    62765918 |                       5909205 | (description created later)
| Bruxelles_Java_Masque_Wayang_02_10_2011_06.jpg | 20111114233723 |           NULL |           0 |                          NULL |
| Raonic_and_Youzhny.jpg                         | 20111114233732 |           NULL |        NULL |                          NULL |
| Darcy's_gravesite_in_SE_Dijon_(France).jpg     | 20111114233911 | 20111120142313 |    62765628 |                       5908402 | (description created later)
| Defeater-lost-ground-album-cover.jpg           | 20111114233944 |           NULL |           0 |                          NULL |
| Jaguar_Sentado.jpg                             | 20111114234008 |           NULL |        NULL |                          NULL |
| Phil_Bryant.jpg                                | 20111114234008 |           NULL |        NULL |                          NULL |
+------------------------------------------------+----------------+----------------+-------------+-------------------------------+

The timing matches the update of the external storage servers: https://blog.wikimedia.org/2011/11/18/nobody-notices-when-its-not-broken-new-database-servers-deployed/

I think this is due to the image upload's nested transactions, which break the outer transaction.
See http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/49500

When the servers were switched, the second half of the transaction for those files would have failed and been half rolled back. We have already suffered this kind of failure in the past; see bug 15430, bug 20744, bug 24978. It seems we hit one of these roughly every 12 months. It should finally be fixed.
http://wikitech.wikimedia.org/view/Server_admin_log#November_14 and you can see the times match exactly. I guess we could disable uploads during such maintenance.
Was that the External Storage switchover with the read-only period? Best is to make sure that doesn't happen, by preparing the maintenance so there's still a read/write cluster available to accept saves. In the case that we get stuck with it anyway, it sounds like there's some bad error handling somewhere -- ideally the upload should end up getting rejected and rolled back if part of the File: description page edit fails. Oooh yeah, nested transactions; not sure that works.
Look at the wikitech thread. A second begin() does an implicit commit (with some backends, including MySQL), so it's quite untransactional. We start a transaction, the file classes use the db for locking, then doEditPage() has a begin()/commit() too...
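To make the implicit-commit problem concrete, here is a toy Python model (not MediaWiki code; all names here are hypothetical) of the behaviour described above: a second BEGIN silently commits the outer transaction, so a later rollback can no longer undo the first half of the work.

```python
class FakeMySQLConnection:
    """Toy model of MySQL's behaviour: issuing BEGIN while a
    transaction is already open implicitly commits the pending work."""

    def __init__(self):
        self.committed = []    # rows made permanent
        self.pending = []      # rows in the currently open transaction
        self.in_transaction = False

    def begin(self):
        if self.in_transaction:
            # Implicit commit: the outer transaction's writes become
            # permanent here and can no longer be rolled back.
            self.committed.extend(self.pending)
            self.pending = []
        self.in_transaction = True

    def write(self, row):
        self.pending.append(row)

    def commit(self):
        self.committed.extend(self.pending)
        self.pending = []
        self.in_transaction = False

    def rollback(self):
        self.pending = []
        self.in_transaction = False


db = FakeMySQLConnection()
db.begin()                  # outer transaction: the file upload
db.write('image row')
db.begin()                  # the page edit starts its own transaction...
db.write('revision row')
db.rollback()               # ...which fails and rolls back
# Only the inner work is undone; the image row was silently committed,
# leaving an image with no description page.
assert db.committed == ['image row']
```

This matches the symptom seen on Commons: the image row survives while the revision row vanishes.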
Here is another similar case from one of the old upgrades that corrupted the database.

Bad file info: http://commons.wikimedia.org/wiki/File:Nevsky_prospekt_SPb_000000162_1_m.jpg
The file does exist: http://upload.wikimedia.org/wikipedia/commons/5/57/Nevsky_prospekt_SPb_000000004_1_m.jpg
See the upload log: http://commons.wikimedia.org/w/index.php?title=Special:Log&type=&user=&page=File:Nevsky+prospekt+SPb+000000162+1+m.jpg&year=&month=-1&tagfilter=&hide_patrol_log=1

Please fix it as well!
Any progress on resolving this bug?
(In reply to comment #3)
> Look at the wikitech thread. A second begin() does an implicit commit (with
> some backends, including mysql) so it's quite untransactional. We start a
> transaction, the file classes use the db for lock, then doEditPage() has a
> begin() commit() too...

See bug #27283 for this one.
And also http://commons.wikimedia.org/wiki/File%3APOSTERMENDOZA.JPG
Any progress on resolving this bug? Please find and fix all of the descriptionless files (and publish a list of them) by creating temporary file info pages, for further manual filling.
See also the other side of this same problem: files whose first version does not exist. https://bugzilla.wikimedia.org/show_bug.cgi?id=34934
I found these three files, all related to this bug:
> http://commons.wikimedia.org/wiki/File:Bruxelles_Java_Masque_Wayang_02_10_2011_06.jpg
> http://commons.wikimedia.org/wiki/File:Coxeter_diagram_finite_rank4_correspondence.png
> http://commons.wikimedia.org/wiki/File:USS_Tempest_PC-2_Crest.png
Over the last months, new files have been added to the list. Is there a structural way to identify and repair them? Thumbnails and external references seem to remain intact.
It will soon be a year since this bug was opened. Has anyone been able to determine the cause of the problem?
The cause is likely explained in comment 2.
It seems impossible to edit/categorise the "repaired files", including http://commons.wikimedia.org/wiki/File:Two-stroke_engine_moving_parts_(Montagu,_Cars_and_Motor-Cycles,_1928).svg

Some of the remaining older ones:
http://commons.wikimedia.org/wiki/File:POSTERMENDOZA.JPG
http://commons.wikimedia.org/wiki/File:Ogeret.JPG

What should we do with the faulty files we detect from here on?
Interesting, trying to create the page results in an edit conflict. This is new. These three pages have a page entry, but no associated revision (and page_latest = 0). There are 16 images like this:

mysql> select page_namespace, page_title from page where page_latest = 0;
+----------------+---------------------------------------------------------------------------+
| page_namespace | page_title                                                                |
+----------------+---------------------------------------------------------------------------+
|              6 | Bruxelles_Java_Masque_Wayang_02_10_2011_06.jpg                            |
|              6 | POSTERMENDOZA.JPG                                                         |
|              6 | Ogeret.JPG                                                                |
|              6 | SANTA_MARIA_DE_PUIG-AGUILAR_-_7.JPG                                       |
|              6 | Luftaufnahmen_Nordseekueste_2012_05_D50_by-RaBoe_066.jpg                  |
|              6 | Kit_body_brugge1314.png                                                   |
|              6 | Kit_left_arm_union1314.png                                                |
|              6 | Kit_body_union1314.png                                                    |
|              6 | Kit_right_arm_union1314.png                                               |
|              6 | Luftaufnahmen_Nordseekueste_2012_05_D50_by-RaBoe_067.jpg                  |
|              6 | Two-stroke_engine_moving_parts_(Montagu,_Cars_and_Motor-Cycles,_1928).svg |
|              6 | Pastel_Raymond_Martin_2.jpg                                               |
|              6 | Coxeter_diagram_finite_rank4_correspondence.png                           |
|              6 | USS_Tempest_PC-2_Crest.png                                                |
|              6 | RUS-CZE_2012-06-08_pl.svg                                                 |
|              6 | Mérite_national_chevalier_FRANCE.jpg                                      |
+----------------+---------------------------------------------------------------------------+
Obviously a low priority item: 17 months now.
I deleted and re-uploaded Bruxelles_Java_Masque_Wayang_02_10_2011_06.jpg while trying to fix the impossible-to-edit file with no license. Should we do the same with the rest?
Strange, as the picture is in the system and can be used on wikipedias, such as http://commons.wikimedia.org/wiki/File:POSTERMENDOZA.JPG . There are several more in http://commons.wikimedia.org/wiki/Special:UnusedFiles and http://commons.wikimedia.org/wiki/Special:UncategorizedFiles.
There are a significant number (~700 a month) of files where, during upload, an entry in the page table is created but page_latest = 0 (and log_page also seems to be 0). If we just count upload entries with log_page = 0 (which seems to be symptomatic of the issue, though I haven't 100% verified it can't happen in other situations):

MariaDB [commonswiki_p]> select substr( log_timestamp, 1, 6 ) as YYYYMM, count(*) as 'uploads missing log_page'
    from logging_logindex
    where log_type = 'upload' and log_timestamp > '20121000000000' and log_page = 0
    group by substr( log_timestamp, 1, 6 );
+--------+--------------------------+
| YYYYMM | uploads missing log_page |
+--------+--------------------------+
| 201210 |                     1323 |
| 201211 |                      572 |
| 201212 |                      624 |
| 201301 |                     1179 |
| 201302 |                      666 |
| 201303 |                      762 |
| 201304 |                      955 |
| 201305 |                      637 |
| 201306 |                      598 |
| 201307 |                      555 |
| 201308 |                      818 |
| 201309 |                     3268 |
| 201310 |                      808 |
| 201311 |                      852 |
| 201312 |                      806 |
| 201401 |                     1116 |
| 201402 |                      435 |
+--------+--------------------------+
17 rows in set (7.08 sec)

This is not a good thing.
*** Bug 60205 has been marked as a duplicate of this bug. ***
Increasing priority: this is starting to cause noticeable problems on Commons.
This bug also affected my file (CASCON2013-2.JPG on Commons) and I was trying to edit the corrupted page but it didn't save. This bug only affected 1 of my uploads thus far (I've uploaded 1000+ files onto commons), but I voted the importance up as I think the bug can cause major issues. Rayson (P.S. This is Rayson from the Open Grid Scheduler / Grid Engine project: http://gridscheduler.sourceforge.net/ - besides free software, we also contribute to free media repos like Wiki Commons.)
(In reply to Marius Hoch from comment #22)
> Increasing priority: This starts to cause noticeable problems on commons

Aaron / Sean: Could somebody take a look at this, please?
Is the problem related to this? I see no history on the following files on Commons:

File:Admission ticket to Benjamin Rush lecture.jpg
File:Admission ticket to John Morgan lecture.jpg
File:Admission ticket to Caspar Wistar lecture.jpg
File:Papilio anchisiades, mating.jpg
(In reply to magog.the.ogre from comment #25)
> Is the problem related to this? I see no history on the following files on
> Commons:
>
> File:Admission ticket to Benjamin Rush lecture.jpg
> File:Admission ticket to John Morgan lecture.jpg
> File:Admission ticket to Caspar Wistar lecture.jpg
> File:Papilio anchisiades, mating.jpg

All these files have (as far as I can see) a valid history and a valid value in page_latest. They are just messed-up redirects... So no, that's not related.
*** Bug 61898 has been marked as a duplicate of this bug. ***
Raising priority again (per duplicate bug and ongoing issues)
Note: Dup bug 61898 comment 4 has some investigations by Sean.
(In reply to Andre Klapper from comment #29)
> Note: Dup bug 61898 comment 4 has some investigations by Sean.

Could you change its component to not be Security so we can see? (Assuming nothing sensitive is there. This hardly seems like a security issue, but obviously it's hard to make that judgement without knowing what's in the bug.)
(In reply to Bawolff (Brian Wolff) from comment #30)
> (In reply to Andre Klapper from comment #29)
> > Note: Dup bug 61898 comment 4 has some investigations by Sean.
>
> Could you change its component to not be security so we can see? (assuming
> nothing sensitive is there. This hardly seems like a security issue, but
> obviously its hard to make that judgement without knowing whats on the bug)

(Using Chatham House Rules ;) )

The reasoning for filing it under Security was given in bug 61898 comment 0:

"I'm filing this under Security to keep it on a low radar. They're innocent to keep around, but in my experience filing these publicly may cause them to disappear at some point when too many people spread the link without proper context or try to poke at it."

No comment on the validity of the reasoning, but there are URLs in that bug that are not on this one.
Hi Gilles, could you insert this into the planning process for the next sprint?
Change 119932 had a related patch set uploaded by Brian Wolff: Rollback transactions if job throws an exception. https://gerrit.wikimedia.org/r/119932
I was finally able to reproduce this. Steps to reproduce:
* Make sure $wgEnableAsyncUploads = true;
* Artificially make it so that Revision::insertOn throws some sort of exception. My current theory is that a very intermittent issue with the external store dbs is triggering the "Unable to store text to external storage" exception in ExternalStore::insertWithFallback. That's just a random guess based on the timing proximity of the failed db inserts, and could be totally wrong; anyway, it's probably somewhere around that block of code. Someone with access to the job queue log could confirm whether there are any common exceptions in the log, and look at the results of the upload jobs for the various example files in question.
* Upload an image using the api and a stashed upload, making sure that the async option is specified.
* Run runJobs.php --type PublishStashedFile to publish the file. This should return an error from the exception triggered in step 2.

Actual behaviour:
* Locally (I don't know about Commons), the upload api correctly reports a stash failure error to the user, which is as it should be.
* The image does not show up until the page is action=purge'd or ~24 hours pass, because the negative entry in the file memcache is not cleared.
* No RC entry (although I did notice a gap in the rc_id field, which is kind of weird; maybe coincidence).
* No dummy edit in the page history.
* The log entry is missing log_page.
* A page table entry is inserted in the DB, but page_latest is set to 0, which is a referential integrity violation and should never happen [the big bad of this bug].
* The page contents are missing, replaced with a missing-revision error. The page cannot be edited (you get an edit conflict warning). The only way to fix the page is to re-upload a new image over it, or to delete and undelete the page.

Expected behaviour:
* Well, you know, upload just works properly ;)
* Referential integrity should never ever be violated. There should never be an uneditable page.
* Obviously a better behaviour would be for the page to be blank instead of "broken"; however, that's probably still a "bad" behaviour. Ideally we would want the user's text to always end up on the page. It's unclear to me whether we would prefer the image to go away entirely if we can't edit the page properly (i.e. the operation would be fully atomic), or whether the current behaviour of the file being saved is preferable (probably; the less data loss the better).
* The cache should be cleared so that images don't randomly appear 24 hours after the fact.

(In reply to Gerrit Notification Bot from comment #33)
> Change 119932 had a related patch set uploaded by Brian Wolff:
> Rollback transactions if job throws an exception.
>
> https://gerrit.wikimedia.org/r/119932

If I'm correct about the cause (which as it stands is just a guess), and the patch actually works, the effect will be that pages hitting this bug won't be in a "broken" state (referential integrity failure), but will simply be blank, so users can create them later as they please. This is not a full fix: we still need to figure out what situation is causing the issue and reduce its occurrence (could someone look at the job queue logs for one of the example files, like File:ContentLanguage.svg, if the logs go back that far, and see what the last error was?). We may also want to catch whatever exception is happening and retry the operation, if we're assuming it's a very intermittent issue.

> The reasoning for filing it under Security was given in bug 61898 comment 0:
>
> "I'm filing this under Security to keep it on a low radar. They're innocent
> to keep around, but in my experience filing these publicly may cause them to
> disappear at some point when too many people spread the link without proper
> context or try to poke at it."
>
> No comment on the validity of the reasoning, but there are URLs in that bug
> that are not on this one.
That seems unnecessary given how often we're getting new examples - most recently [[commons:File:ContentLanguage.svg]], which was uploaded a week ago (earlier we seemed to be getting new examples every day). Commons admins also have db access via Tool Labs and are more than capable of getting a list of all affected files themselves if they were so inclined.
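The rollback-on-exception approach proposed in the patch above can be sketched as follows. This is a toy Python model under stated assumptions, not MediaWiki's actual API: ToyDB, publish_stashed_file, and run_job are all hypothetical names, and the exception stands in for the "Unable to store text to external storage" error.

```python
class ToyDB:
    """Minimal stand-in for a DB connection with a pending transaction."""
    def __init__(self):
        self.rows = []       # committed rows
        self.pending = []    # rows written inside the open transaction
    def write(self, row):
        self.pending.append(row)
    def commit(self):
        self.rows.extend(self.pending)
        self.pending = []
    def rollback(self):
        self.pending = []

def publish_stashed_file(db, fail=False):
    """Hypothetical upload job body: writes the page row, then the
    revision row; optionally fails in between, like the real bug."""
    db.write('page row')
    if fail:
        raise RuntimeError('Unable to store text to external storage')
    db.write('revision row')

def run_job(db, fail):
    """Job runner that rolls back on any exception, so a failed job
    leaves no orphan page row with page_latest = 0 behind."""
    try:
        publish_stashed_file(db, fail)
        db.commit()
        return True
    except RuntimeError:
        db.rollback()   # undo the partial write instead of leaking it
        return False

db = ToyDB()
assert run_job(db, fail=True) is False
assert db.rows == []                            # nothing half-committed
assert run_job(db, fail=False) is True
assert db.rows == ['page row', 'revision row']  # clean retry succeeds
```

The key design point is that the failed run leaves the database untouched, so a later upload of the same file starts from a clean state instead of hitting an edit conflict against a revisionless page.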
Possible that bug 63058 is related. Unproven, but that one was certainly blocking some commits on commonswiki today.
(In reply to Sean Pringle from comment #35)
> Possible that bug 63058 is related. Unproven, but that one was certainly
> blocking some commits on commonswiki today.

Blocking the commit wouldn't cause this, since (from what I understand) that would cause a rollback once PHP timed out. This bug needs a COMMIT issued in the middle of a transaction to cause an inconsistent state.
*** Bug 40178 has been marked as a duplicate of this bug. ***
Change 124135 had a related patch set uploaded by Brian Wolff: Make chunked upload jobs robust in face of exceptions. https://gerrit.wikimedia.org/r/124135
Change 124136 had a related patch set uploaded by Brian Wolff: When uploading a new file, save to memcached directly after commit https://gerrit.wikimedia.org/r/124136
Change 124137 had a related patch set uploaded by Brian Wolff: Make doEditContent call $dbw->rollback() if exception happens https://gerrit.wikimedia.org/r/124137
So the patch in comment 38 or the patch in comment 40 will stop the "Missing revision #0" error on new files. Instead, when that situation happens, the page will simply not be created, which is an improvement over the page becoming corrupt. (Either patch will fix the issue; both together make things even more robust.) The patch in comment 39 makes it so that when this issue happens, the image appears immediately instead of 24 hours later due to a stale negative memcache entry.

So what remains to be done for this bug (other than review of those patches):
* Figure out the root cause of edits not going through.
** The most recent example (for a file) is [[commons:File:Stephens_college_at_the_corner_of_Waugh_and_Broadway_.jpg]], uploaded at approximately 2014-03-30 00:24:12. It would be helpful if someone with access to the job queue logs could grep for a PublishStashedFile job at about that time, for that title, and see what the job's error output was.
* Figure out whether something should be done about the root cause (which really depends on what it is).
* Figure out whether it makes sense, in the face of an edit failure, to simply try doEditContent() a second time(?), or whether there is some other behaviour that makes sense when we try to edit a page and it doesn't work.
Change 124137 merged by jenkins-bot: Make doEditContent call $dbw->rollback() if exception happens https://gerrit.wikimedia.org/r/124137
Change 124136 merged by jenkins-bot: When uploading a new file, save to memcached directly after commit https://gerrit.wikimedia.org/r/124136
Change 124135 merged by jenkins-bot: Make chunked upload jobs robust in face of exceptions. https://gerrit.wikimedia.org/r/124135
Resetting to NEW for now, until we confirm whether this is fixed or not.

Aaron tried to look up [[commons:File:Rancho_Camulos_National_Historic_Landmark_Plaque.jpg]] in the job queue log for me, but it wasn't in the log. That first confused me, and then caused me to look at the UploadWizard code and realize it only uploads via the job queue if the file is > 10 MB. So I guess a lot of the errors weren't coming from exceptions thrown by the upload job like I previously thought, but perhaps came directly through the api (although perhaps a small number came through the job classes). Which means that anomie's patch dc7d342d93b12 (deployed to Commons today, April 8, at 21:12) may actually fix this issue. The patch in comment 42 (not yet deployed) may also equally fix the issue.

Anyway, I guess we should watch to see whether these files stop occurring. The most recent example I can find is from today at 17:20 ( [[commons:file:ABS-3401.0-OverseasArrivalsDeparturesAustralia-ShorttermMovementResidentDepartures_SelectedDestinations-Trend-NumberMovements-Cambodia-A83808853V.svg]] ), so, so far so good. Not that that means much yet, as in the past there have sometimes been gaps of a week between such files.

----

There's also the separate issue of what's causing the underlying error. There are still going to be files without a description page; they just won't have the "revision 0 is missing" error any more. We should figure out what's up with that. Perhaps that should be split off into a separate bug (?)
Platonides, Paul, Yann, Trijnstel: Do you know if this has been still a problem in the last three weeks?
(In reply to Andre Klapper from comment #46)
> Platonides, Paul, Yann, Trijnstel: Do you know if this has been still a
> problem in the last three weeks?

Unfortunately yes.
(In reply to Steinsplitter from comment #47)
> Unfortunately yes.

I haven't been able to find any recent examples in the db. Have there been any new files with this issue in, say, the last month (or even the later part of April)? If so, please include examples.
(In reply to Bawolff (Brian Wolff) from comment #48)
> I haven't been able to find any recent examples in the db. Has there been
> any new files with this issue in say the last month (or even in the later
> part of april). If so please include examples.

Replies welcome.
Reducing priority until somebody can provide recent examples. See comment 48, comment 49.
Calling this fixed. I think comment 44 plus Anomie's api patch fixed the issue. If anyone encounters more examples, please speak up.
BTW, splitting off the larger issue (sometimes no description gets posted at all) to bug 66355.