Last modified: 2011-11-29 03:20:57 UTC
Starting around 5/25 we are seeing Yahoo! abstracts show up either empty, with just <feed> </feed>, as with the en wiki (http://download.wikimedia.org/enwiki/20090530/enwiki-20090530-abstract.xml), or incomplete, as in http://download.wikimedia.org/eswiki/20090601/eswiki-20090601-abstract.xml. The problem seems to have shown up in the last iteration of dumps, starting 5/25, but did not affect all of the dumps run that day.
dumpBackup.php is having some oddities here that cause the full abstract to be truncated. I can get it to dump what seems to be a full log or stub dump, but as soon as I try "--current" or "--latest" it chokes very early on data going to stdout. Running '/usr/bin/php' -q '/apache/common/php-1.5'/maintenance/dumpBackup.php --wiki='plwiki' --current --report=10000 --server='x.x.x.x' on amane, I consistently get it stalling right after <page> <title>ACM A.M. Turing Award</title> <id>17</id>, even though an strace shows that it's still receiving data. The process ends one revision entry later.
After digging deeper into the code chain, I found that old_text entries matching 'historyblobcurstub' are not being retrieved correctly. Skipping over those entries makes dumpBackup work as expected. Now to find out why those revisions are not being pulled correctly.
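For the record, the skip is just a guard in the text-retrieval path. A minimal sketch of the idea, assuming MediaWiki's 'object' convention in old_flags (serialized blob objects carry their class name in old_text); getTextSafe() is a hypothetical wrapper for illustration, not the actual change:

  // Hypothetical workaround: skip legacy HistoryBlobCurStub rows instead
  // of dereferencing them into the old `cur` table.
  function getTextSafe( $row ) {
      $flags = explode( ',', $row->old_flags );
      if ( in_array( 'object', $flags )
          && stripos( $row->old_text, 'historyblobcurstub' ) !== false ) {
          // Fetching cur_text for these rows hangs (see below), so
          // emit an empty revision text for now.
          return '';
      }
      return Revision::getRevisionText( $row );
  }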
Diving even deeper shows that mysql_unbuffered_query is not returning at all for a simple SQL call like SELECT cur_text FROM `cur` WHERE cur_id = '17' LIMIT 1 when processing a historyblobcurstub entry.
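A standalone repro attempt, assuming the legacy PHP mysql extension that the DB layer wraps here (host and credentials are placeholders). Worth noting that with this extension, issuing a new query on a connection whose previous unbuffered result set has not been fully fetched is known to misbehave, which may be exactly the state dumpBackup is in when the stub is dereferenced:

  // Placeholder credentials; point at the plwiki DB host.
  $conn = mysql_connect( 'x.x.x.x', 'dumpuser', 'secret' );
  mysql_select_db( 'plwiki', $conn );

  // In isolation this returns immediately; inside the dump, where an
  // earlier unbuffered result set is still streaming on the same
  // connection, the call never comes back.
  $res = mysql_unbuffered_query(
      "SELECT cur_text FROM `cur` WHERE cur_id = '17' LIMIT 1", $conn );
  var_dump( mysql_fetch_assoc( $res ) );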
That query shouldn't quote the number; otherwise MySQL converts between text and integer, potentially *once per row*. It should still be using an index, but given the table size and poor optimizing, could it just be timing out?
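i.e. the stub dereference should compare against the integer column unquoted, something like (sketch, same placeholder connection as above):

  // Integer literal: no string-to-number cast in the comparison.
  // EXPLAIN on both forms will show whether the key on cur_id is used.
  $res = mysql_unbuffered_query(
      "SELECT cur_text FROM `cur` WHERE cur_id = 17 LIMIT 1", $conn );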
We're looking happy again after today's code update, which brought numerous fixes. http://download.wikipedia.org/ltwiktionary/20090917/ is correctly generating an abstract file.
And just for completeness, the fix came in at http://www.mediawiki.org/wiki/Special:Code/MediaWiki/56347.