Last modified: 2011-11-29 03:20:59 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from the pages displaying bug reports and their history, links might be broken. See T25264, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 23264 - Dumps twisted in several languages
Status: RESOLVED FIXED
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal critical with 7 votes
Target Milestone: ---
Assigned To: Ariel T. Glenn
URL: upload.wikimedia.org/wikipedia/common...
Depends on:
Blocks:

Reported: 2010-04-20 21:01 UTC by LinguistManiac
Modified: 2011-11-29 03:20 UTC (History)
CC List: 18 users

Description LinguistManiac 2010-04-20 21:01:05 UTC
Many dumps are twisted (corrupted) in many languages.

While syntactically correct, titles do not correspond to content.

E.g. "A mír na Zemi!" in the Czech wiki has the text of "Singapore" in the dump (cswiki dump as of 20100411). I've discovered this all across the languages, though it does not seem to affect all articles.
If you need more examples, I can provide them.



  <page>
    <title>A mír na Zemi!</title>
    <id>70749</id>
    <revision>
      <id>5178497</id>
      <timestamp>2010-04-03T22:56:32Z</timestamp>
      <contributor>
        <username>Chalupa</username>
        <id>3656</id>
      </contributor>
      <comment>obrázek z commons</comment>
      <text xml:space="preserve">{{Infobox stát|
    genitiv = Singapuru
  | úřední název = Republic of Singapore&lt;br /&gt;新加坡共和国&lt;br /&gt;Republik Singapura&lt;br /&gt;சிங்கப்பூர் குடியரசு
  | vlajka = Flag of Singapore.svg
  | článek o vlajce = Singapurská vlajka
  | znak =
  | mapa umístění = LocationSingapore.png
...
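
For anyone who wants to spot-check a dump for this kind of mismatch, the following is a minimal sketch (not part of the dump tooling) that streams `<page>` elements like the one above and prints each title next to the start of its text. The function names `iter_pages` and `spot_check` are made up for illustration, and the sketch assumes one `<text>` element per page, as in a pages-articles dump.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> read from a binary file-like object."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to limit memory use


def spot_check(path, limit=20):
    """Print the first `limit` titles of a .xml.bz2 dump with the start of their text."""
    with bz2.open(path, "rb") as stream:
        for index, (title, text) in enumerate(iter_pages(stream)):
            if index >= limit:
                break
            print(title, "->", (text or "")[:60].replace("\n", " "))
```

A human still has to judge whether the printed text belongs to the title; in the example above, a page titled "A mír na Zemi!" whose text starts with a country infobox for Singapore would stand out immediately.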
Comment 1 Ariel T. Glenn 2010-04-20 21:09:33 UTC
The recent dumps for cswiki, ltwiktionary, thwiki and elwiki had to be interrupted, as they were hung.  They were restarted by forcefully shooting threads. Please use files from the previous dumps.  You'll see on the index page (http://dumps.wikimedia.org/backup-index.html) messages like "Dump complete, 2 items failed".  

If you are seeing this in some other dump than the above, please note it here. Thanks.
Comment 2 LinguistManiac 2010-04-20 22:00:06 UTC
e.g. glwiki
Comment 3 ABX 2010-04-27 11:54:37 UTC
I see it in the pl.wiktionary dump too, and reported it a few days ago in bug #18651
Comment 4 Tomasz Finc 2010-05-03 18:54:46 UTC
I'm seeing this show up consistently for any stalled run


Warning: XMLReader::read(): compress.bzip2:///mnt/dumps/public/frwiktionary/20100422/frwiktionary-20100422-pages-articles.xml.bz2:6208817: parser error : Extra content at the end of the document in /usr/local/apache/common-local/wmf-deployment/maintenance/backupPrefetch.inc on line 151
Warning: XMLReader::read(): [[zh:extravasa in /usr/local/apache/common-local/wmf-deployment/maintenance/backupPrefetch.inc on line 151
Warning: XMLReader::read():               ^ in /usr/local/apache/common-local/wmf-deployment/maintenance/backupPrefetch.inc on line 151
Warning: XMLReader::read(): An Error Occured while reading in /usr/local/apache/common-local/wmf-deployment/maintenance/backupPrefetch.inc on line 151
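
Warnings like these mean the previous dump being read is not a well-formed XML document. As a rough illustration only (this is not the PHP code in backupPrefetch.inc), a dump file could be checked before it is reused with a streaming parse along these lines; `check_well_formed` and `check_dump` are hypothetical names.

```python
import bz2
import xml.etree.ElementTree as ET


def check_well_formed(stream):
    """Return None if the stream parses as well-formed XML, else the parser's message."""
    try:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            elem.clear()  # nothing is kept; we only care whether parsing succeeds
    except ET.ParseError as err:
        return str(err)
    return None


def check_dump(path):
    """Check a .xml.bz2 file; also report a truncated or damaged bzip2 stream."""
    try:
        with bz2.open(path, "rb") as stream:
            return check_well_formed(stream)
    except (EOFError, OSError) as err:
        return "bzip2 error: %s" % err
```

A file that passes this check can still have titles and text out of sync, so it only rules out the structural damage shown in the warnings above.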
Comment 5 Tomasz Finc 2010-05-04 21:39:36 UTC
We're going along two paths for this right now.

For the first we are:

1) Turning off prefetch, since some previous snapshots are bad
2) Turning off spawning child procs, since we are seeing inter-process messaging break down

This is now being tested on snapshot2 and if we see no issues will be propagated to production. 

For the second we are testing out a potential bugfix to the core issue and making sure it has no unexpected consequences.

This is being tested on snapshot3
Comment 6 Ascánder Suárez 2010-05-07 16:54:21 UTC
It could be a new instance of the same problem reported last June for the Spanish Wikipedia (see Bug 18694 [https://bugzilla.wikimedia.org/show_bug.cgi?id=18694]).

Our problem was: 

Most articles last modified between mid January 2009 and mid April 2009 appeared with wrong content in dumps. The reason was never discovered, but as this problem affected several kinds of users, we dealt with it by updating all articles last modified in that period (spell checking, cosmetic changes, and finally, useless changes).
Comment 7 Ariel T. Glenn 2010-05-12 02:38:18 UTC
Just an update: I'm taking the opportunity to refactor dumpTextPass, fetchText, backuos.inc and dumpBackup.php so the Maintenance class is used appropriately and so that we can add timeouts to reads and writes properly.  Should be testing the new code on one of the snapshot hosts tomorrow afternoon.  This should address the "revisions out of sync" as well as "backup processes hang indefinitely on write" issues.
Comment 8 Malafaya 2010-05-19 16:18:49 UTC
Any news on this? Fresher dumps would be welcome :).
Comment 9 Ariel T. Glenn 2010-05-19 21:54:07 UTC
tests look good, going to try to run some production dumps this afternoon.
Comment 10 andreasmeier80 2010-05-21 08:02:36 UTC
Is there any news about this?
Comment 11 Ariel T. Glenn 2010-05-21 18:07:34 UTC
I am running cswiki now; when it's done I'll make it available via the downloads page. It should be inspected closely by a regular user of the dumps to see if it's correct.  If someone else watching this bug is on a smaller project and would be interested in getting dumps now and checking them for accuracy, I'd be happy to run a set.  Once I have a few dumps verified as ok, I'll do a full run through all the projects.
Comment 12 Rich Farmbrough 2010-05-21 19:04:45 UTC
If you run on Rowiki pages-articles, I can do some ad hock checking against previous versions.  Not sure how much help that would be. And I have to say pages-articles dumps would still be useful even if they are somewhat broken, as long as we know!
Comment 13 Rich Farmbrough 2010-05-21 19:06:01 UTC
(In reply to comment #12)
> If you run on Rowiki pages-articles, I can do some ad hock checking against
: Ad hock?  Means I will be drinking I suppose.
Comment 14 Egmontaz 2010-05-21 22:04:20 UTC
I can check el.wikipedia dump against the last erroneous and the previous 2 good ones, I usually work with pages-meta-current, but will do it with pages-articles too.
Comment 15 Malafaya 2010-05-22 02:00:20 UTC
(In reply to comment #11)
> I am running cswiki now; when it's done I'll make it available via the
> downloads page. It should be inspected closely by a regular user of the dumps
> to see if it's correct.  If someone else watching this bug is on a smaller
> project and would be interested in getting dumps now and checking them for
> accuracy, I'd be happy to run a set.  Once I have a few dumps verified as ok,
> I'll do a full run through all the projects.

If you make ptwikt (wikt, not wiki) available, I'll be happy to analyze it too.
Comment 16 Ariel T. Glenn 2010-05-22 02:03:06 UTC
Currently running: cs, ro, el.  I'll start up ptwikt once one of those completes. We won't be updating the central index page but I'll add a note here as they become available.
Comment 17 Ariel T. Glenn 2010-05-22 04:59:21 UTC
rowiki is complete and can be retrieved from 
http://dumps.wikimedia.org/rowiki/20100521/
Comment 18 Ariel T. Glenn 2010-05-22 05:34:22 UTC
cswiki has completed: http://dumps.wikimedia.org/cswiki/20100521/
Comment 19 Ariel T. Glenn 2010-05-22 07:17:17 UTC
elwiki is temporarily on hold.  I ran across a revision with unretrievable text: see http://el.wikipedia.org/w/index.php?title=%CE%A3%CF%85%CE%B6%CE%AE%CF%84%CE%B7%CF%83%CE%B7_%CF%87%CF%81%CE%AE%CF%83%CF%84%CE%B7:Geraki/%CE%91%CF%81%CF%87%CE%B5%CE%AF%CE%BF_9&oldid=1422393 for the particular revision.  I'll probably restart the job tomorrow and ignore this one revision's text.  At some point we should decide how to mark up pages for which there are errors.  We already put 'deleted' in some fields, perhaps we should have error indications as well.

If folks have other thoughts, please chime in during the next 6-7 hours or so (while I sleep).
Comment 20 Malafaya 2010-05-24 00:55:00 UTC
You may have forgotten about pt.wikt's dump... ;)
Comment 21 Ariel T. Glenn 2010-05-24 03:51:19 UTC
No, I didn't forget. I was hoping for the folks with ro and cs to look at those before I continued on. However, since you asked again (and they haven't commented yet), I'm running it now.
Comment 22 Ariel T. Glenn 2010-05-24 04:57:00 UTC
Please see (and check closely) http://dumps.wikimedia.org/ptwiktionary/20100524/ and let me know if there are any issues.  Thanks.
Comment 23 Malafaya 2010-05-24 23:20:24 UTC
Ariel,
I didn't find any inconsistencies so far, but I never found any in previous pt.wikt dumps either. I will let you know if I find anything meanwhile.
Thanks.
Comment 24 Ariel T. Glenn 2010-05-25 00:12:01 UTC
glwiki run completed, see http://dumps.wikimedia.org/glwiki/20100524/
Comment 25 Ariel T. Glenn 2010-05-25 15:25:34 UTC
elwiki run completed, see http://dumps.wikimedia.org/elwiki/20100525/
Comment 26 Rich Farmbrough 2010-05-25 20:15:48 UTC
OK, the notable difference is that the file namespace seems to have changed name since April - it is now Fisier. However, I developed signatures for both dumps and compared them. The correlation is very high, and internal consistency also seems good. Spot checks of differences were supported by the history pages of the wiki.
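
The comment does not say how the signatures were built. One possible approach, sketched here purely as an assumption, is to hash each page's text in both dumps and compare the results per title. The names `page_signatures` and `compare_signatures` are illustrative, and the sketch assumes one `<text>` element per page.

```python
import hashlib
import xml.etree.ElementTree as ET


def page_signatures(stream):
    """Map each page title to a SHA-1 digest of its text (one revision per page)."""
    signatures = {}
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = elem.tag.rsplit("}", 1)[-1]  # ignore any XML namespace
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            data = (elem.text or "").encode("utf-8")
            signatures[title] = hashlib.sha1(data).hexdigest()
        elif name == "page":
            elem.clear()
    return signatures


def compare_signatures(old, new):
    """Summarise how two {title: digest} maps differ."""
    shared = old.keys() & new.keys()
    changed = sorted(t for t in shared if old[t] != new[t])
    # Index the old dump by digest to find text that now sits under another title.
    old_by_digest = {}
    for title, digest in old.items():
        old_by_digest.setdefault(digest, set()).add(title)
    misplaced = sorted(
        t for t, d in new.items()
        if d in old_by_digest and t not in old_by_digest[d]
    )
    return {
        "only_old": sorted(old.keys() - new.keys()),
        "only_new": sorted(new.keys() - old.keys()),
        "changed": changed,
        "unchanged": len(shared) - len(changed),
        "misplaced": misplaced,
    }
```

Entries under "changed" are candidates for spot checks against the wiki's history pages. Entries under "misplaced" carry text that belonged to a different title in the older dump, which can be a legitimate page move or copy but is also what the mismatch reported in this bug would look like.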
Comment 27 Malafaya 2010-05-26 21:45:36 UTC
Attending to Tomasz's request and confirming what I said above (#23), no inconsistencies were found in the latest ptwiktionary dump. Obviously, I didn't check everything but did a few random checks, so it's possible that inconsistencies exist and were not noticed.
Comment 28 Platonides 2010-05-26 21:47:07 UTC
> glwiki run completed, see http://dumps.wikimedia.org/glwiki/20100524/

Content of glwiki-20100524-pages-meta-current seems good. I scanned the main namespace fully and all pages seemed related to the purported title.
Comment 29 Egmontaz 2010-05-27 08:14:33 UTC
elwiki seems good. I did random checks and the usual queries I do, and all seem nice and consistent; none of the problems I had with the previous dump showed up.
Comment 30 Ariel T. Glenn 2010-05-29 03:54:59 UTC
Running one worker (one queue of dumps) now.  They should be showing up on http://dumps.wikimedia.org/backup-index.html already.
Comment 31 andreasmeier80 2010-06-02 12:36:05 UTC
There is a problem with testwiki, see http://dumps.wikimedia.org/testwiki/20100531/
Comment 32 Ariel T. Glenn 2010-06-02 15:00:51 UTC
Yes, testwiki won't run correctly until my fixes are deployed in the production branch (special case).
Comment 33 Ariel T. Glenn 2010-06-04 19:43:22 UTC
I have moved all of the previous bad dumps (April 11 through May 2 2010) to a separate location so they will no longer show up on the download page.  Dumps will continue running, doing projects with the oldest good dump first. 

Some fixes which should help to prevent an occurrence of this bug have been committed to trunk.
Comment 34 Rich Farmbrough 2010-06-21 10:09:07 UTC
2010-06-21 08:04:07 enwiki (new): missing status record

Not sure what's happening here, the date seems to be today's date - i.e. when I looked on the 16th it said

2010-06-15 08:04:07 enwiki (new): missing status record
Comment 35 Ariel T. Glenn 2010-06-21 14:08:54 UTC
enwiki dumps are not running right now (any job that might have started can be ignored).  We expect to start a job for it later in the week once migration to the new storage server has been completed.
Comment 36 Rich Farmbrough 2010-06-29 21:31:54 UTC
Yahoo extracts failed

# 2010-06-29 20:54:09 failed Extracted page abstracts for Yahoo
Database returned error "0: "

    * abstract.xml
Comment 37 Malafaya 2010-07-16 10:46:34 UTC
As of today, several dumps (simplewiki, elwiki, cswiki) seem stuck since 5th July.
Comment 38 Frozen Wind 2010-07-22 06:33:32 UTC
simplewiki is stuck on rev 1832000, which I can fetch without trouble.
Comment 39 Ariel T. Glenn 2011-09-18 06:30:29 UTC
Closing this since there have been no new reports of text content drift after putting the text length check in place (and fixing the underlying bug in mid 2010 that caused the issue).
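
The thread does not show the text length check itself. As a sketch of the general idea only, and assuming the expected byte length of each revision is known from its metadata, fetched text can be rejected when its length does not match. Everything here (`text_length_ok`, `fetch_checked`, the `fetch` callback, the retry count) is hypothetical and not the actual dump code.

```python
def text_length_ok(text, expected_bytes):
    """True if the UTF-8 encoded text has exactly the expected number of bytes."""
    return len(text.encode("utf-8")) == expected_bytes


def fetch_checked(rev_id, expected_bytes, fetch, retries=3):
    """Call fetch(rev_id) until the result passes the length check.

    `fetch` is a caller-supplied function returning the revision text or None.
    Returns the text, or None if no attempt produced text of the expected length.
    """
    for _attempt in range(retries):
        text = fetch(rev_id)
        if text is not None and text_length_ok(text, expected_bytes):
            return text
    return None
```

A length match does not prove the text is the right one, but text drifting to a neighbouring revision will usually differ in length, so the check catches most out-of-sync results cheaply.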
