Last modified: 2011-11-29 03:20:57 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and is kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T21046, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 19046 - Yahoo abstracts are incomplete or empty

Status: RESOLVED FIXED
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All
OS: All
Priority: Normal
Severity: major
Target Milestone: ---
Assigned To: Tomasz Finc
URL: http://download.wikimedia.org/enwiki/...
Depends on:
Blocks:

Reported: 2009-06-01 17:47 UTC by Tomasz Finc
Modified: 2011-11-29 03:20 UTC
CC: 2 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Tomasz Finc 2009-06-01 17:47:18 UTC
Starting around 5/25 we are seeing Yahoo! abstracts show up as either empty with 

<feed>
</feed>

as with en wiki http://download.wikimedia.org/enwiki/20090530/enwiki-20090530-abstract.xml

or incomplete, as in:

http://download.wikimedia.org/eswiki/20090601/eswiki-20090601-abstract.xml

The problem seems to have shown up in the last iteration of dumps starting 5/25 but did not affect all of them on that day.
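For comparison, a healthy abstract dump wraps one <doc> entry per article inside the feed, roughly like this (contents abridged and invented for illustration):

<feed>
  <doc>
    <title>Wikipedia: Anarchism</title>
    <url>http://en.wikipedia.org/wiki/Anarchism</url>
    <abstract>Anarchism is a political philosophy ...</abstract>
    <links>
      <sublink linktype="nav"><anchor>History</anchor><link>http://en.wikipedia.org/wiki/Anarchism#History</link></sublink>
    </links>
  </doc>
</feed>

The empty dumps above contain only the bare <feed> wrapper, and the incomplete ones stop partway through the entries.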
Comment 1 Tomasz Finc 2009-07-01 23:53:00 UTC
dumpBackup.php is having some oddities here, causing the full abstract to be truncated. I can get it to dump what seems to be a full log or stub dump, but as soon as I try "--current" or "--latest" it chokes very early on data going to stdout.

Running

'/usr/bin/php' -q '/apache/common/php-1.5'/maintenance/dumpBackup.php   --wiki='plwiki'   --current   --report=10000  --server='x.x.x.x' --current

on amane I get it stalling right after

  <page>
    <title>ACM A.M. Turing Award</title>
    <id>17</id>

consistently, even though running an strace shows that it's still receiving data. The process ends after one revision entry.

Comment 2 Tomasz Finc 2009-07-03 01:14:26 UTC
After digging deeper into the code chain I was able to find that old_text entries matching 'historyblobcurstub' are not being retrieved correctly. Skipping over those entries makes dumpBackup work as expected. Now to find out why those revisions are not being pulled correctly.
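For context, a 'historyblobcurstub' row does not contain the page text itself; it holds a small serialized pointer object whose text has to be fetched with a second query against the legacy cur table. A simplified sketch of that resolution, loosely based on the HistoryBlobCurStub class of that era (method and field names approximate):

<?php
// Simplified sketch, not the exact MediaWiki code: the stub stores only a
// cur_id, so getting its text costs an extra database round trip.
class HistoryBlobCurStub {
    var $mCurId;

    function getText() {
        $dbr = wfGetDB( DB_SLAVE );
        $row = $dbr->selectRow( 'cur', array( 'cur_text' ),
            array( 'cur_id' => $this->mCurId ) );
        return $row ? $row->cur_text : false;
    }
}

Skipping these rows avoids that extra query, which is consistent with the dump then running to completion.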
Comment 3 Tomasz Finc 2009-07-03 01:57:36 UTC
Diving even deeper shows that mysql_unbuffered_query is not returning at all for a simple sql call like

SELECT  cur_text  FROM `cur`  WHERE cur_id = '17'  LIMIT 1

when processing historyblobcurstub
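One possible explanation (an assumption, not something confirmed above): the dump streams its main result set with mysql_unbuffered_query, and the PHP manual requires all rows of an unbuffered result to be fetched before another query can be sent on the same connection. Resolving a historyblobcurstub entry mid-stream issues exactly such a second query. A contrived sketch of that pattern (host, credentials and queries are placeholders):

<?php
$link = mysql_connect( 'x.x.x.x', 'wikiuser', 'secret' );
mysql_select_db( 'plwiki', $link );

// The dump keeps memory low by streaming revisions unbuffered.
$stream = mysql_unbuffered_query( 'SELECT rev_id, rev_text_id FROM revision', $link );

while ( $row = mysql_fetch_assoc( $stream ) ) {
    // A historyblobcurstub entry triggers a second query on the SAME link
    // before the streaming result set has been drained -- the situation the
    // PHP manual warns against for unbuffered queries.
    $inner = mysql_unbuffered_query(
        "SELECT cur_text FROM `cur` WHERE cur_id = 17 LIMIT 1", $link );
    // ...
}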
Comment 4 Platonides 2009-07-03 10:58:45 UTC
That query shouldn't quote the number. Otherwise mysql converts from text to integer *once per row*.
It should still be using an index, but given the table size and poor optimizing, could it be just timing out?
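In PHP, the quoting goes away if the id is cast to an integer before it is interpolated into the SQL; a minimal sketch (variable names are illustrative, not from the dump code):

<?php
// Casting to int makes the generated SQL compare cur_id against a bare number.
$curId = (int)$curId;
$sql = "SELECT cur_text FROM `cur` WHERE cur_id = $curId LIMIT 1";
// Produces: SELECT cur_text FROM `cur` WHERE cur_id = 17 LIMIT 1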
Comment 5 Tomasz Finc 2009-09-17 02:19:37 UTC
We're looking happy again after today's code update, which brought in numerous fixes.

http://download.wikipedia.org/ltwiktionary/20090917/ is correctly generating an abstract file.
Comment 6 Tomasz Finc 2009-09-17 18:21:17 UTC
And just for completeness, the fix came in at http://www.mediawiki.org/wiki/Special:Code/MediaWiki/56347


