Last modified: 2010-05-15 15:33:08 UTC

Bug 1324 - page2xml trouble
Product: MediaWiki
Classification: Unclassified
Component: Special pages
Hardware: All All
Importance: Normal normal with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2005-01-13 17:19 UTC by Jamesday
Modified: 2010-05-15 15:33 UTC

Description Jamesday 2005-01-13 17:19:57 UTC
At Wikimedia sites the following page2xml query is now being automatically
killed on sight, after use of it caused a problem on one of the database servers
handling de:

/* page2xml */  SELECT old_id as id,old_timestamp as timestamp,old_user as
user,old_user_text as user_text,old_comment as comment,old_text as
text,old_flags as flags  FROM `old`  WHERE old_namespace='4' AND
old_title='whatever'  ORDER BY old_timestamp 

This query asks for all versions of the article, at about one disk seek per
revision for the current schema, unless the revision is in cache. Worse, it ends
up using the old_title index, so it actually retrieves from disk all revisions
with that title in all namespaces. When the server overload prompted me to take
a look, there were three of these running, with run times of 891, 583 and 339
seconds.
The last was killed by the changed querybane rules after 805 seconds.

Adding a USE INDEX (name_title_timestamp) hint improves it significantly and
should be done, but the query still needs a LIMIT added to it.

The planned schema change, if it stores old articles with article ID as the
first part of the primary key, would make this far more efficient because
adjacent revisions would be in the same database pages. It would still
potentially retrieve hundreds of thousands of revisions for a popular article, so
some limit is still going to be needed even with the new schema. If page2xml is
intended to be a general "retrieve an article" call, that limit needs to be very
low: tens, not hundreds, of revisions.

For Wikimedia sites, limit 10 is a good choice at present, assuming the primary
purpose isn't retrieving the article history.
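
For illustration, here is a sketch of the query with both suggestions applied.
It assumes MySQL's USE INDEX hint syntax and keeps the original ascending
ORDER BY, so as written it would return the 10 oldest revisions. Whether the
oldest or the newest revisions are wanted is not stated in this report, and this
form has not been run against a Wikimedia server.

```sql
/* page2xml */
SELECT old_id        AS id,
       old_timestamp AS timestamp,
       old_user      AS user,
       old_user_text AS user_text,
       old_comment   AS comment,
       old_text      AS text,
       old_flags     AS flags
FROM `old` USE INDEX (name_title_timestamp)  -- index hint named above
WHERE old_namespace = '4'
  AND old_title = 'whatever'
ORDER BY old_timestamp  -- ascending, as in the original; DESC would give newest first (assumption)
LIMIT 10;               -- limit suggested above for Wikimedia sites
```

The hint only steers the optimizer away from the single-column old_title index;
the LIMIT is what bounds the number of revision rows fetched per call.
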
Comment 1 JeLuF 2005-01-15 08:00:09 UTC
enwiki doesn't have the old_title index, nor does a fresh REL1_4 install.
dewiki has this index.

Should old_title index be dropped? I'm not aware of any queries that use
old_title but not old_namespace. 
Comment 2 Jamesday 2005-01-15 08:22:30 UTC
The old_title index was in some of the wikis - de, fr, pl, probably others, but
not in the new ones I looked at. I'll remove it when practical, since it's not
present in the newer wikis. The query does select the correct index when that
one is removed. Not going to be enough to make it practical to return all old
metadata and text, though.
Comment 3 Jamesday 2005-11-12 23:22:45 UTC
Obsolete - old is gone now.
