Last modified: 2013-06-18 15:39:36 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T20694, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 18694 - Spanish wikipedia XML dump problems
Status: RESOLVED FIXED
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal normal with 6 votes
Target Milestone: ---
Assigned To: Ariel T. Glenn
Duplicates: 19420 19598 20114
Reported: 2009-05-06 09:15 UTC by elephantus_l
Modified: 2013-06-18 15:39 UTC

Description elephantus_l 2009-05-06 09:15:58 UTC
I downloaded two consecutive Spanish Wikipedia XML dump files
(eswiki-20090504-pages-articles.xml.bz2 and, before that, eswiki-20090421-pages-articles.xml.bz2). When I imported the file into WikiTaxi, it showed a strange error on a large number of pages: the titles and the content of the pages were mixed up. That is, the title would be one thing and the text itself would obviously be from a different page (or it would be a combination of two pages). So I looked into the original XML file itself and this is what I found, for example:

  <page>
    <title>Gómez Plata</title>
    <id>454035</id>
    <revision>
      <id>25156038</id>
      <timestamp>2009-03-28T06:38:04Z</timestamp>
      <contributor>
        <username>SajoR</username>
        <id>130444</id>
      </contributor>
      <minor />
      <comment>leve mejora</comment>
      <text xml:space="preserve">'''Montserrat Domínguez''' ([[Madrid]], [[1963]]) es una [[periodismo|periodista]] [[España|española]].

Considera que la primera obligación de un periodista es ser crítico con el poder y es optimista respecto a la situación actual del periodismo. Su trabajo le ofrece, en su opinión, &quot;un motor de vida&quot;.

Es aficionada a la [[lectura]] y a los viajes.

== Biografía ==

Estudió [[Ciencias de la Información]] por la [[Universidad Complutense de Madrid]]. Posteriormente cursó un Master en Periodismo por la [[Universidad de Columbia]].


So the title of the page is Gómez Plata (a municipality in Colombia), but the page is about a Spanish journalist.

This didn't happen when I downloaded other Wikipedia dumps (en, de, nl, sv). Could someone please look into this problem? Thank you.
Comment 1 Tomasz Finc 2009-05-06 23:58:15 UTC
I can indeed find this in eswiki-20090504-pages-articles.xml.bz2 but not in eswiki-20090504-pages-meta-current.xml, which is bizarre. A quick look at things didn't reveal any big errors in the code. This is going to take a bit more time to find. Thank you for testing the other dumps to see if this was happening there as well.
Comment 2 elephantus_l 2009-05-07 06:46:39 UTC
Apparently the messed-up pages are those with a timestamp from approximately mid-January 2009 to mid- or late-April 2009. The pages older or younger than that aren't affected.
Comment 3 Tomasz Finc 2009-05-26 08:41:52 UTC
It doesn't seem to affect all articles within that time range, though, which makes it a bit hard to find other examples. Can you give me a list of 10 or so more that are still affected in the latest dump (20090519)? Gómez Plata got updated and is now dumping correctly.
Comment 4 Tomasz Finc 2009-06-29 07:42:52 UTC
*** Bug 19420 has been marked as a duplicate of this bug. ***
Comment 5 Platonides 2009-07-10 21:33:20 UTC
eswiki-20090702-pages-articles is still affected.

For instance,
[[MediaWiki:anonnotice]] has the content of [[Carlos Iglesias]].

[[Wikipedia:Portada]] has
  <page>
    <title>Wikipedia:Portada</title>
    <id>2271189</id>
    <revision>
      <id>25284089</id>
      <timestamp>2009-04-02T13:56:29Z</timestamp>
      <contributor>
        <username>Muro de Aguas</username>
        <id>214907</id>
      </contributor>
      <minor />
      <comment>wapedia no es propiedad de wikipedia</comment>
      <text xml:space="preserve">#REDIRECT [[Plantilla:Ficha de militar]]</text>
    </revision>
  </page>

whereas that revision is http://es.wikipedia.org/w/index.php?title=Wikipedia:Portada&diff=25284089&oldid=24586619

Has a clean dump* been done since the problem was detected?

*A dump not based on the previous one.
Comment 6 Platonides 2009-07-11 00:13:10 UTC
(In reply to comment #2)
> Apparently the messed-up pages are those with a timestamp from approximately
> mid-January 2009 to mid- or late-April 2009. The pages older or younger than
> that aren't affected.

Could it be a slave whose autoincrement column became desynchronized?


Comment 7 Enrique 2009-07-13 13:26:11 UTC
Does eswiki-20090710-pages-articles.xml resolve this error?
Comment 8 Platonides 2009-07-13 13:31:46 UTC
No. But you can use the pages-meta-current to skip it.
Comment 9 Enrique 2009-07-13 13:55:34 UTC
Then this error will not have a solution?
Tomorrow I will try pages-meta-current to avoid the error, but I would prefer to wait for a fix.
Comment 10 Alejandro Tejada Capellan 2009-07-17 19:51:41 UTC
(In reply to comment #8)
> No. But you can use the pages-meta-current to skip it.
OK, it is not much of a problem to download 1 GB instead of
758 MB, but could someone find any hint in the PHP code that creates this backup that actually explains why this error is occurring only in the Spanish Wikipedia and not in the English version?
I have verified that the English version's database backup does not show these errors.
Comment 11 Enrique 2009-07-17 20:37:19 UTC
Hi, I downloaded pages-meta-current to avoid the error, but I see many red links; on the official Wikipedia these links are blue.
Many templates are empty or uncategorised.
Comment 12 Tomasz Finc 2009-07-21 01:55:29 UTC
(In reply to comment #10)
> (In reply to comment #8)
> > No. But you can use the pages-meta-current to skip it.
> OK, it is not much of a problem to download 1 GB instead of
> 758 MB, but could someone find any hint in the PHP code that creates this
> backup that actually explains why this error is occurring only in the Spanish
> Wikipedia and not in the English version?

Sadly we've suffered a loss of a good chunk of our previously run snapshots so comparing to
those will be a bit hard. If I can catch the problem happening actively then it will be much
easier.

Comment 13 Brion Vibber 2009-08-07 19:50:44 UTC
*** Bug 20114 has been marked as a duplicate of this bug. ***
Comment 14 Platonides 2010-01-10 22:34:41 UTC
*** Bug 19598 has been marked as a duplicate of this bug. ***
Comment 15 Ascánder Suárez 2010-02-25 07:25:52 UTC
Hi, this problem is still present in the backups of the Spanish Wikipedia. It is hard to calculate the number of articles affected, but they are among those modified between January and April 2009. No affected articles have been observed so far outside this time range.

The affected articles (i.e. articles showing wrong content in the periodic backups, including the last published one: eswiki-20100221-pages-articles.xml.bz2) seem to be the same across different backups.

Here are some examples of wrong content:

Article [[:es:Escudo de la Polinesia Francesa]] shows contents belonging to [[:es:Anexo:Gobernadores de Corrientes]]

Article [[:es:Pleurodema]] shows contents belonging to [[:es:Aviación virtual]]

Redirect [[:es:Candelilla]] points to [[:es:Eugène Scribe]] instead of [[:es:Euphorbia antisyphilitica]].

Redirect [[:es:Knut Schreiner]] points to [[:es:Euphorbia antisyphilitica]] instead of [[:es:Euroboy]]

Here is an upper bound on the number of articles affected (i.e. articles updated between January and April 2009): 44193 articles/annexes, 5634 pages in other namespaces and 99067 redirects. Unlike articles, redirects are easy to check, and I can say that almost all of them show wrong content in the last backup.
For instance, these are the redirects updated in the first hour of March 1, 2009, and they are all wrong:

'Aeropuerto de Ontario' --> 'Agustín de Pedrayes'
'Corno' --> 'El aprendiz de brujo (Dukas)'
'Kenneth Burrell' --> 'Aeropuerto Internacional LA/Ontario'
'Oro amarillo' --> 'Emo'
'Claro de Luna (Beethoven)' --> 'oro'
'Claro de Luna (Maupassant)' --> 'Sonata para piano n.º 14 (Beethoven)'
'Claro de luna (Debussy)' --> 'Sonata para oboe y piano (Poulenc)'
'Idioma retorromance' --> 'Adrenalynn'
'Boubacar traoré' --> 'Suite bergamasque'
'Rodrigo Sepúlveda Lara' --> 'Claro de luna (astronomía)'
'Oxnard' --> 'Rodrigo Sepúlveda'
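
A list like the one above could be extracted from a dump roughly as follows. This is only a minimal sketch, not the script actually used for this report: the function names (`local_name`, `iter_suspect_redirects`) and the exact date window are hypothetical, the page layout is assumed to match the XML quoted earlier in this bug, and only the `#REDIRECT` keyword is recognised (localized aliases are ignored).

```python
import re
import xml.etree.ElementTree as ET

# Matches the target in wikitext such as "#REDIRECT [[Plantilla:Ficha de militar]]".
# Assumption: only the English keyword is handled, not localized aliases.
REDIRECT_RE = re.compile(r"#\s*REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_suspect_redirects(stream, start="2009-01-01", end="2009-05-01"):
    """Yield (title, target, timestamp) for redirects last edited in [start, end)."""
    title = timestamp = text = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "timestamp":
            timestamp = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            match = REDIRECT_RE.match(text or "")
            # ISO 8601 timestamps compare correctly as plain strings.
            if match and timestamp and start <= timestamp < end:
                yield title, match.group(1).strip(), timestamp
            title = timestamp = text = None
            elem.clear()  # release the page's children; dumps are large
```

With a real dump, the `stream` argument could be the file object returned by `bz2.open(path, "rb")`. Each target yielded this way still has to be compared against the live site to decide whether it is wrong; the sketch only narrows down the candidates.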
Comment 16 Ascánder Suárez 2010-03-24 10:44:32 UTC
This problem is vanishing.

As mentioned before, the pages affected are among those last edited between January 2009 and April 2009, so with the help of [[:es:Usuario:Boticario]] and his bot CEM-bot, the pages with these characteristics were reviewed, first for spelling corrections and then, for the remaining ones, for cosmetic changes. There are still several pages and redirects that show wrong content in the last dump, with an upper bound of 38 redirects and 16260 pages.

All of the 38 remaining redirects contain the character "_" in their title and thus are not accessible through the site.

In order to put an end to this problem, and unless someone proposes a better idea, I'll suggest that Boticario edit them by introducing a useless space at the end of the first line, or something equally useless and invisible.

For the record, here is the list of redirects that I don't know how to access (notice the underscores in their title):

* [[:es:A _c]]
* [[:es:A _d]]
* [[:es:A _e _c]]
* [[:es:Siglo_II_d._C.]]
* [[:es:La_440]]
* [[:es:S._I.]]
* [[:es:580_a._C.]]
* [[:es:589_a._C.]]
* [[:es:588_a._C.]]
* [[:es:585_a._C.]]
* [[:es:584_a._C.]]
* [[:es:582_a._C.]]
* [[:es:594_a._C.]]
* [[:es:600_a._C.]]
* [[:es:559_a._C.]]
* [[:es:556_a._C.]]
* [[:es:550_a._C.]]
* [[:es:558_a._C.]]
* [[:es:555_a._C.]]
* [[:es:551_a._C.]]
* [[:es:546_a._C.]]
* [[:es:529_a._C.]]
* [[:es:528_a._C.]]
* [[:es:526_a._C.]]
* [[:es:525_a._C.]]
* [[:es:522_a._C.]]
* [[:es:521_a._C.]]
* [[:es:520_a._C.]]
* [[:es:510_a._C.]]
* [[:es:515_a._C.]]
* [[:es:K._O.]]
* [[:es:Brasilia,_D._F.]]
* [[:es:Brasilia, D._F.]]
* [[:es:1200_a._C.]]
* [[:es:500_a._C.]]
* [[:es:Marina de EE._UU.]]
* [[:es:Francis S._Collins]]
Comment 17 Platonides 2010-03-24 22:00:15 UTC
Look at http://es.wikipedia.org/w/index.php?title=Francis_S.%C2%A0Collins&diff=prev&oldid=25984099 It is not a space, it is a non-breaking space. This redirect was created by an indefinitely blocked vandal who used a non-breaking space (0xc2 0xa0) instead of the normal one (0x20).

For some reason action=view is treating the non-breaking space (character 160) as a normal space (character 32) and thus doesn't find it.
We have 73 articles like that. Most were made by Rosarino, but also by Gunderson, Muro Bot, Wiki Winner and Jtspotau.
We should probably delete them.
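
Titles of this kind can be spotted mechanically. The following is a minimal sketch with hypothetical helper names (`has_nbsp`, `describe_spaces`); it assumes the titles are already available as decoded Unicode strings and looks only for U+00A0, not for other unusual whitespace.

```python
NBSP = "\u00a0"  # non-breaking space; encoded in UTF-8 as the bytes 0xC2 0xA0


def has_nbsp(title):
    """Return True if the title contains a non-breaking space."""
    return NBSP in title


def describe_spaces(title):
    """List (index, code point) for each normal or non-breaking space in the title."""
    return [(i, ord(ch)) for i, ch in enumerate(title) if ch in (" ", NBSP)]


# Two titles that look identical on screen but differ in one character.
titles = ["Francis S.\u00a0Collins", "Francis S. Collins"]
flagged = [t for t in titles if has_nbsp(t)]
```

Here `flagged` keeps only the first title, and `describe_spaces` reports code point 160 where a reader would expect 32.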
Comment 18 Platonides 2010-03-25 15:11:02 UTC
I opened bug 22939 to handle the nbsp titles.
Comment 19 Steef 2010-05-03 11:33:37 UTC
It seems that not only eswiki is affected by this.

There is a report on dewiki Village Pump about this issue on dewikisource ([http://de.wikipedia.org/w/index.php?title=Wikipedia:Fragen_zur_Wikipedia&oldid=73896610#Dump]).
Comment 20 Diederik van Liere 2011-02-06 08:57:16 UTC
It seems that bug 18651 is related as well (https://bugzilla.wikimedia.org/show_bug.cgi?id=18651).
Comment 21 Mark A. Hershberger 2011-05-03 18:56:54 UTC
Giving dump bugs to Ariel.
Comment 22 Ariel T. Glenn 2011-08-29 16:27:38 UTC
As of June 2010 (http://svn.wikimedia.org/viewvc/mediawiki?view=revision&revision=67324), we check the length of revision content from the db against what we have in previous dumps (or what we think we are retrieving from the db). Are people still seeing this issue?
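
The general idea of such a length check can be illustrated with the sketch below. It is not the MediaWiki dump code from the revision linked above: the function names, the `fetch_from_db` callback, the use of a UTF-8 byte length and the choice to raise an error on a second mismatch are all assumptions made for illustration.

```python
def text_matches_length(text, expected_len):
    """Compare the UTF-8 byte length of the text with the recorded length."""
    return text is not None and len(text.encode("utf-8")) == expected_len


def choose_revision_text(prefetched, expected_len, fetch_from_db):
    """Reuse text from a previous dump only if its length matches; otherwise refetch.

    prefetched    -- text found in the previous dump, or None (hypothetical input)
    expected_len  -- byte length recorded for the revision (hypothetical input)
    fetch_from_db -- zero-argument callable returning the text from the database
    """
    if text_matches_length(prefetched, expected_len):
        return prefetched
    fresh = fetch_from_db()
    if not text_matches_length(fresh, expected_len):
        raise ValueError("revision text length mismatch even after refetch")
    return fresh
```

A check like this only catches swapped content when the two texts differ in length, so it reduces the risk of mismatched text rather than ruling it out.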
Comment 23 Ariel T. Glenn 2011-09-18 06:39:59 UTC
Closing, since no further reports were submitted after the text length check was put in place and the underlying bug causing the text content mismatch was fixed in mid-2010.
