Last modified: 2014-02-12 23:40:06 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T23917, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 21917 - mwdumper does not generate some page_ids
Status: NEW
Product: Utilities
Classification: Unclassified
Component: mwdumper
Version: unspecified
Hardware: PC Windows Vista
Importance: Normal major
Target Milestone: ---
Assigned To: Brion Vibber
Depends on:
Blocks:
Reported: 2009-12-21 17:29 UTC by Kein Zantezuken
Modified: 2014-02-12 23:40 UTC (History)

Description Kein Zantezuken 2009-12-21 17:29:18 UTC
I have an XML dump of ruwiki. To reduce the time required to import the content, I converted the XML into an SQL script. After executing that script I noticed (via MediaWiki) that some articles are missing, although the 'text' table has all the data for these articles. After some searching I found the cause: the 'page' table has no data about these articles' names and IDs; mwdumper just didn't generate an 'INSERT INTO page' command for some articles (approx. 40% of the dump).

I used the unofficial mwdumper build from bug 18328, because I have no up-to-date jar builds other than that one (the official one is from 2006 and is not compatible with the current XML dumps).
Comment 1 Platonides 2009-12-21 17:30:44 UTC
Which xml dump is it?
Can you provide some of those missing articles?
Comment 2 Kein Zantezuken 2009-12-21 17:38:53 UTC
It's http://download.wikimedia.org/ruwiki/20091207/ruwiki-20091207-pages-articles.xml.bz2

Article "Операционная система" for example (line number 515495 in the dump).
Comment 3 Kein Zantezuken 2009-12-22 22:31:04 UTC
OK, here is one of the missing articles:
http://shinra.ru/kein/w/operating_system.xml (40 KB, UTF-8)

Dunno how else I can help :< Really annoying bug; it makes the whole dump useless since I can't import things properly ;<
Comment 4 Platonides 2009-12-23 14:06:12 UTC
I do find the insert for Операционная система in the page table:


$ bzcat ruwiki-20091207-pages-articles.xml.bz2|java  -jar mwdumper.jar --format=sql:1.5 | grep -m 1 "'Операционная_система'"

INSERT INTO page (page_id,page_namespace,page_title,page_restrictions,page_counter,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len) VALUES (3428,0,'Эмоция','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20389302,28622),(3432,0,'Человек_разумный','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20412105,62890), ...

... (4590,0,'1545_год','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,17287964,3074),
(4591,0,'Операционная_система','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20354505,39406),
(4593,0,'Рим','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20389427,116221),(4595,0,'Двоичные_приставки','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,19830413,15461)...

...(4904,0,'23_января','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20288736,13963),(4905,0,'24_января','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20288479,14120),(4906,0,'25_января','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20288290,29154),(4907,0,'26_января','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20313506,16559),(4908,0,'27_января','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20313334,14701),(4909,0,'28_января','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20420267,22861),(4910,0,'29_января','',0,0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,20352730,19695);

Maybe mysql didn't accept the full line for some reason?
Comment 5 Kein Zantezuken 2009-12-23 17:07:26 UTC
No, as I said, I didn't get an INSERT for that page at all.
Can you compile the latest mwdumper for Windows, please? Then I can test.
Comment 6 Platonides 2009-12-23 17:46:12 UTC
I used the same mwdumper.jar as you. Jar files work cross platform. How were you looking for the insert?
Comment 7 Kein Zantezuken 2009-12-23 19:40:49 UTC
I searched the whole dump for an INSERT into `page` with 'Операционная система'. I found many articles which I already have in the DB, but the articles which are missing from my current DB are missing from the generated SQL dump as well.
Well, the only thing missing is the INSERT into `page`; old_text, old_data and old_id are there.
Comment 8 Platonides 2009-12-23 22:05:45 UTC
Don't look for 'Операционная система'; you must look for 'Операционная_система'. It will be in DB form, with spaces converted into underscores.
There will be three instances: [[Операционная система]], which is the insert line I included above, [[Category:Операционная система]] and [[Template:Операционная система]].
All of them are "INSERT INTO page" lines, albeit really long ones.
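
As a rough illustration of the search described above, here is a minimal Python sketch. The function names `to_db_key` and `find_page_inserts` are hypothetical, and the normalization is deliberately simplified (only spaces are converted to underscores; real MediaWiki title normalization does more). It assumes the tuple layout shown in the mwdumper output quoted earlier in this thread, i.e. `(page_id,page_namespace,'page_title',...`.

```python
import re


def to_db_key(title: str) -> str:
    """Simplified DB form of a title: spaces become underscores.

    Assumption: this ignores the other normalization MediaWiki performs
    and SQL escaping of characters such as apostrophes.
    """
    return title.strip().replace(" ", "_")


def find_page_inserts(sql_lines, title):
    """Yield (line_number, page_id, namespace) for every page-table tuple
    whose title matches the DB form of `title`."""
    key = to_db_key(title)
    # Matches the start of one value tuple: (page_id,page_namespace,'page_title',
    pattern = re.compile(r"\((\d+),(-?\d+),'" + re.escape(key) + r"',")
    for lineno, line in enumerate(sql_lines, start=1):
        if not line.startswith("INSERT INTO page "):
            continue  # skip inserts into other tables such as text or revision
        for match in pattern.finditer(line):
            yield lineno, int(match.group(1)), int(match.group(2))
```

Because the title is stored without its namespace prefix, the same key can show up several times with different namespace numbers, so the sketch reports the namespace next to each page_id. For a real dump, `sql_lines` could be a file object opened with `encoding="utf-8"`.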
Comment 9 Kein Zantezuken 2009-12-25 15:26:23 UTC
Yeah, I found 'Операционная_система' in the SQL dump, but that's weird... the whole INSERT into page was skipped; I can't find any page_ids for all these articles in that INSERT script. Weird. Though I do have [[Category:Операционная система]] and [[Template:Операционная система]] ;<
Annoying.
Well, anyway, the bug is INVALID; sorry for the false report.
Comment 10 Kein Zantezuken 2009-12-25 18:05:16 UTC
Or perhaps it is valid, since mwdumper does not generate a correct SQL dump. There are too many duplicate entries, but why? The DB is empty. Looks like mwdumper does something wrong.
Comment 11 Kein Zantezuken 2009-12-25 18:48:34 UTC
Here is the full log: http://shinra.ru/kein/out.7z
As you can see, all errors are related to 'rev_comment' only, so if the generated SQL is correct there should not be any issue with missing articles. But there is ;<
Comment 12 Platonides 2010-03-29 22:06:56 UTC
Which option did you select for the database? utf-8, binary or backwards-compatible with mysql4?
Comment 13 Kein Zantezuken 2010-03-30 11:36:36 UTC
binary
Comment 14 Umherirrender 2010-04-06 11:46:54 UTC
Works for me with ruwiki-20100331-pages-articles.xml.

Do the tables page, revision and text all have the same number of rows? (1478943)
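
One way to run this row-count check is sketched below in Python. This is only an illustration under stated assumptions: `row_counts` and `tables_agree` are hypothetical helper names, and the demo uses an in-memory SQLite database as a stand-in, whereas the real import lives in MySQL and would need a MySQL DB-API driver to supply the connection.

```python
import sqlite3


def row_counts(conn, tables=("page", "revision", "text")):
    """Return {table: row count} for the given tables via a DB-API connection."""
    counts = {}
    cur = conn.cursor()
    for table in tables:
        # Table names come from the fixed tuple above, never from user input.
        cur.execute(f"SELECT COUNT(*) FROM `{table}`")
        counts[table] = cur.fetchone()[0]
    return counts


def tables_agree(counts):
    """True when every table reports the same number of rows."""
    return len(set(counts.values())) == 1


if __name__ == "__main__":
    # Stand-in database with a deliberately short page table.
    demo = sqlite3.connect(":memory:")
    for name in ("page", "revision", "text"):
        demo.execute(f"CREATE TABLE `{name}` (id INTEGER)")
    demo.executemany("INSERT INTO `page` VALUES (?)", [(1,)])
    demo.executemany("INSERT INTO `revision` VALUES (?)", [(1,), (2,)])
    demo.executemany("INSERT INTO `text` VALUES (?)", [(1,), (2,)])
    result = row_counts(demo)
    print(result, "consistent" if tables_agree(result) else "mismatch")
```

A page count lower than the revision and text counts would match the symptom reported in this bug, where the text rows exist but the page rows do not.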

Maybe it is an encoding problem; try appending
&characterEncoding=utf8
to the --output parameter.
