Last modified: 2013-06-18 13:30:34 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T20651, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 18651 - hewiktionary pages-articles.xml dump corrupted
Status: RESOLVED FIXED
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal major with 3 votes
Target Milestone: ---
Assigned To: Ariel T. Glenn
Depends on:
Blocks:
Reported: 2009-05-02 10:43 UTC by Maxim Iorsh
Modified: 2013-06-18 13:30 UTC (History)
5 users (show)



Attachments
Script which detects errors in database dumps (4.81 KB, text/plain)
2009-07-16 21:23 UTC, Maxim Iorsh

Description Maxim Iorsh 2009-05-02 10:43:40 UTC
The file hewiktionary-20090423-pages-articles.xml (available as compressed download from http://download.wikimedia.org/hewiktionary/20090423/hewiktionary-20090423-pages-articles.xml.bz2) has corrupted data inside.

For example, some entries lack text altogether, e.g. "MP3" has empty <text> tag. The article is available at http://he.wiktionary.org/wiki/MP3, its content is obviously non-empty.

 <page>
  <title>MP3</title>
  <id>11639</id>
  <revision>
   <id>88117</id>
   <timestamp>2009-02-16T18:12:13Z</timestamp>
   <contributor>
    <username>Interwicket</username>
    <id>2170</id>
   </contributor>
   <minor/>
   <comment>iwiki +[[:ko:MP3]]</comment>
   <text xml:space="preserve"/>
  </revision>
 </page>
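
A minimal sketch of how empty entries like the one above could be found automatically. It is written in Python rather than the Perl of the attached script, and the names `local_name`, `find_empty_text_pages` and `SAMPLE` are hypothetical. It assumes the usual page/title/revision/text layout of pages-articles.xml and ignores the export namespace. An empty <text> is not proof of corruption, since a page can legitimately be blank, so every hit still has to be compared with the live wiki.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def find_empty_text_pages(source):
    """Return titles of pages whose <text> element is missing or blank.

    `source` is a file name or a binary file object holding the dump.
    """
    empty_titles = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = None
        text = None
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text
        if not (text or "").strip():
            empty_titles.append(title)
        elem.clear()  # free the page's children before moving on
    return empty_titles


# Tiny made-up dump fragment, for illustration only.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
<page><title>MP3</title><id>11639</id>
<revision><id>88117</id><text xml:space="preserve"/></revision></page>
<page><title>ok</title><id>1</id>
<revision><id>2</id><text xml:space="preserve">body</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    print(find_empty_text_pages(io.BytesIO(SAMPLE)))
```

For a real dump, pass the path of the decompressed file instead of the in-memory sample.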

Other entries have text belonging to *other* entries, e.g. the entry "דת" (http://he.wiktionary.org/wiki/דת) has text from the entry "ויסקי" (http://he.wiktionary.org/wiki/ויסקי).

 <page>
  <title>דת</title>
  <id>11808</id>
  <revision>
   <id>92454</id>
   <timestamp>2009-03-22T03:06:03Z</timestamp>
   <contributor>
    <username>Interwicket</username>
    <id>2170</id>
   </contributor>
   <minor/>
   <comment>iwiki +[[:th:דת]]</comment>
   <text xml:space="preserve">
==וִיסְקִי==
{{ניתוח דקדוקי|
|כתיב מלא=ויסקי
|הגייה='''vis'''ki
|חלק דיבר=שם־עצם
|מין=זכר
|שורש=
|דרך תצורה=
|נטיות=
}}
[[תמונה:Scotch Whisky (aka).jpg|שמאל|ממוזער|184px|ויסקי]]
# משקה [[אלכוהולי]], המופק על ידי זיקוק סוגים שונים של [[דגנים]] אשר עברו תהליך [[הלתתה]]. לאחר הזיקוק, מיושן הנוזל בחביות עץ אלון לפרק זמן משתנה. אחוז האלכוהול ברוב מותגי הוויסקי עומד על 40.
#:* ב[[יום הולדת|יום הולדתי]] שתיתי '''ויסקי''' לשוכרה.
#:* "איזה בן קיבוץ בא לבקר בעיר; נבוך עם התרמיל ובלוריתו המתנפנפת; הוא '''ויסקי''' טוב מוזג, תשתה בחור צעיר; ומחייך: "ספר, אז מה נשמע ברפת"" ("בלדה לעוזב קיבוץ", מילים: יענקל'ה רוטבליט)

===מקור===
# מקור שמו של הוויסקי מגיע מהשפות הקלטיות. במקור נקרא המשקה "ויסקי באה" (uisge beatha באיות אירי או uisge baugh באיות סקוטי) שמשמעותו המילולית היא: "מי החיים".

===תרגום===
* אנגלית: {{ת|אנגלית|whiskey}}

===ראו גם===
* [[וודקה]]
* [[טקילה]]

===קישורים חיצוניים===
{{מיזמים|ויקיפדיה=ויסקי|ויקישיתוף=Category:Whisky|שם ויקישיתוף=ויסקי}}

{{תבנית:משקאות חריפים}}

[[קטגוריה:משקאות]]

[[el:ויסקי]]
   </text>
  </revision>
 </page>
Comment 1 Chad H. 2009-05-02 16:00:09 UTC
Update components.
Comment 2 Brion Vibber 2009-07-13 19:09:56 UTC
Is this still current?
Comment 3 Maxim Iorsh 2009-07-14 10:50:12 UTC
It seems that these specific XML entries are ok in later dumps, but other entries are corrupt instead. I have some scripts which process pages-articles.xml and detect such corruptions, but they need some manual assistance. Please let me know if you want examples of corrupted entries for some newer dump.
Comment 4 Tomasz Finc 2009-07-14 20:07:41 UTC
If you could share the scripts that you use to detect this that would be great.
Comment 5 Maxim Iorsh 2009-07-16 21:23:02 UTC
Created attachment 6348
Script which detects errors in database dumps
Comment 6 Maxim Iorsh 2009-07-16 21:39:51 UTC
This script should be run as

 ./HeWiktionary_2_CulmusDic.pl hewiktionary-pages-articles.xml > hewiktionary-culmus.xml

where hewiktionary-pages-articles.xml is the dump in question. It will produce a series of reports of the form

 ...
 Bad word in heading: כלכלן
 Bad word in heading: זבד
 Bad word in heading: חג
 ...

I made an effort to ensure that these reports refer to actual dump errors with high probability. If the first report you check does not turn out to be an error, try a few more. The example is from the 20090713 dump (http://download.wikimedia.org/hewiktionary/20090713/, file http://download.wikimedia.org/hewiktionary/20090713/hewiktionary-20090713-pages-articles.xml.bz2). Take any report and check the entry in the pages-articles.xml file which corresponds to a page with that name.

E.g. for "כלכלן" look for "<title>כלכלן</title>". You will find an XML entry for the page http://he.wiktionary.org/wiki/כלכלן, but the contents of the entry have nothing to do with the actual contents of the wiki page. I guess that the XML entry <text xml:space="preserve"> contents come from http://he.wiktionary.org/wiki/דינמיט.

The inner workings of the script are probably of no interest to you. It parses wiki pages and complains when a page seems too inconsistent with the usual Hebrew Wiktionary page template. It should mainly serve as a dump error detector.
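
For readers without Perl at hand, here is a rough Python sketch of one check in the same spirit. It is not the attached script. It assumes, based on the דת example above where the text begins with ==וִיסְקִי==, that a healthy page's first level-2 heading is the vocalized form of its title. The names `strip_marks` and `heading_matches_title` are hypothetical. Full versus defective spelling can make a correct page look wrong, so a mismatch is only a hint and still needs manual review.

```python
import re
import unicodedata

# First level-2 heading ("==word==") on its own line; "===sub===" is skipped.
HEADING_RE = re.compile(r"^==\s*([^=].*?)\s*==\s*$", re.MULTILINE)


def strip_marks(s):
    """Remove combining marks such as Hebrew vowel points (category Mn)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def heading_matches_title(title, wikitext):
    """Compare the first level-2 heading with the page title.

    Returns True or False, or None when the text has no level-2 heading.
    """
    match = HEADING_RE.search(wikitext)
    if match is None:
        return None
    return strip_marks(match.group(1)) == strip_marks(title)
```

A result of False for a page is the kind of case the reports above point at: the title and the heading inside <text> name different words.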
Comment 7 ABX 2010-04-12 09:55:47 UTC
Posting again in the correct bug report.

Just noticed the same thing. The dump is broken at
http://download.wikimedia.org/plwiktionary/20100411/plwiktionary-20100411-pages-articles.xml.bz2.
Several entries have an empty text field. For example, here is the real
content of the [[pl:wikt:til]] article:
http://pl.wiktionary.org/w/index.php?title=til&oldid=1279710 while in the dump
there is:

  <page>
    <title>til</title>
    <id>7330</id>
    <revision>
      <id>1279710</id>
      <timestamp>2010-04-01T23:12:41Z</timestamp>
      <contributor>
        <username>Interwicket</username>
        <id>5613</id>
      </contributor>
      <minor />
      <comment>iwiki +[[:sv:til]]</comment>
      <text xml:space="preserve" />
    </revision>
  </page>

There are also misplaced entries. The content of [[pl:wikt:ez]] appears under
<title>bela</title> in this archive.
Comment 8 Mark A. Hershberger 2011-05-03 18:56:58 UTC
Giving dump bugs to Ariel.
Comment 9 Ariel T. Glenn 2011-08-29 16:27:35 UTC
We check length of revision content from the db against what we have in previous dumps (or what we think we are retrieving from the db), as of June 2010 (http://svn.wikimedia.org/viewvc/mediawiki?view=revision&revision=67324); are people still seeing this issue?
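
A rough illustration of the kind of length check described above, in Python. This is not the code from r67324. The names `text_length_ok`, `choose_text` and `fetch_from_db` are hypothetical, and it assumes the expected length is a byte count of the UTF-8 encoded text.

```python
def text_length_ok(text, expected_bytes):
    """True when the UTF-8 size of `text` equals the recorded length."""
    return len(text.encode("utf-8")) == expected_bytes


def choose_text(expected_bytes, prefetched, fetch_from_db):
    """Use text from a previous dump only if its length matches.

    `prefetched` is the text found in the older dump, or None.
    `fetch_from_db` is a zero-argument callable (hypothetical) that loads
    the revision text from the database when the older copy is unusable.
    """
    if prefetched is not None and text_length_ok(prefetched, expected_bytes):
        return prefetched
    fresh = fetch_from_db()
    if not text_length_ok(fresh, expected_bytes):
        raise ValueError("revision text length does not match expected length")
    return fresh
```

A length comparison will not catch swapped text of identical size, but it would catch both symptoms reported here in most cases: empty text and text belonging to another page.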
Comment 10 Ariel T. Glenn 2011-09-18 06:37:12 UTC
Closing, since no further reports were submitted after the text length check was put in place and the underlying bug causing text content mismatch was fixed in mid 2010.
