Last modified: 2011-11-29 03:20:57 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and is kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T21046, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 19046 - Yahoo abstracts are incomplete or empty

Status: RESOLVED FIXED
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All
OS: All
Priority: Normal
Severity: major
Target Milestone: ---
Assigned To: Tomasz Finc
URL: http://download.wikimedia.org/enwiki/...
Depends on:
Blocks:

Reported: 2009-06-01 17:47 UTC by Tomasz Finc
Modified: 2011-11-29 03:20 UTC
CC: 2 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Tomasz Finc 2009-06-01 17:47:18 UTC
Starting around 5/25 we are seeing Yahoo! abstracts show up as either empty with 

<feed>
</feed>

as with en wiki http://download.wikimedia.org/enwiki/20090530/enwiki-20090530-abstract.xml

or incomplete, as in:

http://download.wikimedia.org/eswiki/20090601/eswiki-20090601-abstract.xml

The problem seems to have shown up in the last iteration of dumps starting 5/25 but did not affect all of them on that day.
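For comparison, a healthy abstract dump wraps one <doc> entry per article inside the feed, roughly like this (contents abridged and invented for illustration):

<feed>
  <doc>
    <title>Wikipedia: Anarchism</title>
    <url>http://en.wikipedia.org/wiki/Anarchism</url>
    <abstract>Anarchism is a political philosophy ...</abstract>
    <links>
      <sublink linktype="nav"><anchor>History</anchor><link>http://en.wikipedia.org/wiki/Anarchism#History</link></sublink>
    </links>
  </doc>
</feed>

The empty dumps above contain only the bare <feed> wrapper, and the incomplete ones stop partway through the entries.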
Comment 1 Tomasz Finc 2009-07-01 23:53:00 UTC
dumpBackup.php is having some oddities here, causing the full abstract to be truncated. I can get it to dump what seems to be a full log or stub dump, but as soon as I try "--current" or "--latest" it chokes very early on data going to stdout.

Running

'/usr/bin/php' -q '/apache/common/php-1.5'/maintenance/dumpBackup.php   --wiki='plwiki'   --current   --report=10000  --server='x.x.x.x' --current

on amane I get it stalling right after

  <page>
    <title>ACM A.M. Turing Award</title>
    <id>17</id>

consistently, even though running an strace shows that it's still receiving data. The process ends after one revision entry.

Comment 2 Tomasz Finc 2009-07-03 01:14:26 UTC
After digging deeper into the code chain I was able to find that old_text entries matching 'historyblobcurstub' are not being retrieved correctly. Skipping over those entries makes dumpBackup work as expected. Now to find out why those revisions are not being pulled correctly.
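For context, a 'historyblobcurstub' row does not contain the page text itself; it holds a small serialized pointer object whose text has to be fetched with a second query against the legacy cur table. A simplified sketch of that resolution, loosely based on the HistoryBlobCurStub class of that era (method and field names approximate):

<?php
// Simplified sketch, not the exact MediaWiki code: the stub stores only a
// cur_id, so getting its text costs an extra database round trip.
class HistoryBlobCurStub {
    var $mCurId;

    function getText() {
        $dbr = wfGetDB( DB_SLAVE );
        $row = $dbr->selectRow( 'cur', array( 'cur_text' ),
            array( 'cur_id' => $this->mCurId ) );
        return $row ? $row->cur_text : false;
    }
}

Skipping these rows avoids that extra query, which is consistent with the dump then running to completion.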
Comment 3 Tomasz Finc 2009-07-03 01:57:36 UTC
Diving even deeper shows that mysql_unbuffered_query is not returning at all for a simple sql call like

SELECT  cur_text  FROM `cur`  WHERE cur_id = '17'  LIMIT 1

when processing historyblobcurstub
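One possible explanation (an assumption, not something confirmed above): the dump streams its main result set with mysql_unbuffered_query, and the PHP manual requires all rows of an unbuffered result to be fetched before another query can be sent on the same connection. Resolving a historyblobcurstub entry mid-stream issues exactly such a second query. A contrived sketch of that pattern (host, credentials and queries are placeholders):

<?php
$link = mysql_connect( 'x.x.x.x', 'wikiuser', 'secret' );
mysql_select_db( 'plwiki', $link );

// The dump keeps memory low by streaming revisions unbuffered.
$stream = mysql_unbuffered_query( 'SELECT rev_id, rev_text_id FROM revision', $link );

while ( $row = mysql_fetch_assoc( $stream ) ) {
    // A historyblobcurstub entry triggers a second query on the SAME link
    // before the streaming result set has been drained -- the situation the
    // PHP manual warns against for unbuffered queries.
    $inner = mysql_unbuffered_query(
        "SELECT cur_text FROM `cur` WHERE cur_id = 17 LIMIT 1", $link );
    // ...
}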
Comment 4 Platonides 2009-07-03 10:58:45 UTC
That query shouldn't quote the number. Otherwise mysql converts from text to integer *once per row*.
It should still be using an index, but given the table size and poor optimizing, could it be just timing out?
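In PHP, the quoting goes away if the id is cast to an integer before it is interpolated into the SQL; a minimal sketch (variable names are illustrative, not from the dump code):

<?php
// Casting to int makes the generated SQL compare cur_id against a bare number.
$curId = (int)$curId;
$sql = "SELECT cur_text FROM `cur` WHERE cur_id = $curId LIMIT 1";
// Produces: SELECT cur_text FROM `cur` WHERE cur_id = 17 LIMIT 1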
Comment 5 Tomasz Finc 2009-09-17 02:19:37 UTC
We're looking happy again after today's code update, which brought in numerous fixes.

http://download.wikipedia.org/ltwiktionary/20090917/ is correctly generating an abstract file.
Comment 6 Tomasz Finc 2009-09-17 18:21:17 UTC
And just for completeness, the fix came in at http://www.mediawiki.org/wiki/Special:Code/MediaWiki/56347


