Last modified: 2011-11-29 03:20:59 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T16415, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 14415 - Dump the article titles lists (all-titles-in-ns0.gz) unsorted
Status: RESOLVED WORKSFORME
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal enhancement with 1 vote
Target Milestone: ---
Assigned To: Brion Vibber
Reported: 2008-06-05 16:52 UTC by Andrew Dunbar
Modified: 2011-11-29 03:20 UTC (History)

Description Andrew Dunbar 2008-06-05 16:52:15 UTC
For some uses it's useful to have the list of page titles in their natural unsorted order.

It's trivial for anybody to sort these lists if they are distributed unsorted.

However, once the lists have been sorted, it's impossible for anybody to restore them to their original order.

The sort that's done is likely by byte or by codepoint, which is useful for English but wrong for most other languages. Re-sorting an already sorted list into a similar but different order is close to the worst case for quicksort, which most text-file sorters use.

Retrieving the page titles from the full dump is a lot more work than the sorting would be, and is more error-prone.

It would save a bit of work on the servers that do all the sorting for every wiki each time a new dump is made.
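
To illustrate the point about codepoint order, here is a minimal Python sketch. The sample titles and the helper names (`codepoint_sorted`, `rough_collation_key`, `roughly_collated`) are made up for illustration, and the "rough" key (strip accents, ignore case) is only a crude stand-in for real language-specific collation, not what any dump tool actually does.

```python
import unicodedata


def codepoint_sorted(titles):
    # Python compares str values code point by code point,
    # so this mirrors a plain codepoint sort.
    return sorted(titles)


def rough_collation_key(title):
    # Assumption: stripping combining marks and folding case is a crude
    # approximation of dictionary order, not a real collation algorithm.
    decomposed = unicodedata.normalize("NFKD", title)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original title is a tie-breaker so the order is deterministic.
    return (stripped.casefold(), title)


def roughly_collated(titles):
    return sorted(titles, key=rough_collation_key)


# Hypothetical sample titles, not taken from any real dump.
titles = ["Zebra", "Äpfel", "apple", "Éclair", "Banana"]
print(codepoint_sorted(titles))
print(roughly_collated(titles))
```

In the codepoint sort, uppercase ASCII precedes lowercase ASCII and accented letters come last, which is rarely what a reader of a non-English wiki would expect.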
Comment 1 Brion Vibber 2008-06-05 20:07:09 UTC
There's no such thing as unsorted, just different possible sort orders.

They come out of the database in raw index order (namespace, title). If you want some other order, you can trivially sort them yourself.
Comment 2 Andrew Dunbar 2008-06-06 16:14:19 UTC
Please look at one of these files. It's "raw index order" I'm asking for.

The files are not provided in "raw index order" but in fact are provided sorted.
Comment 3 Brion Vibber 2008-08-14 20:43:54 UTC
They come out in the natural index order. Here's the query:

"select page_title from page where page_namespace=0;"

That would follow the (page_namespace,page_title) index.
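
As a rough illustration of why "index order" and "sorted" can coincide, the sketch below runs the same query against an in-memory SQLite database. This is an assumption-laden stand-in: the production database is not SQLite, the table layout is reduced to three columns, and the index name `name_title` and the sample rows are chosen for the example. SQL gives no ordering guarantee without ORDER BY; an engine that answers the query by walking a (page_namespace, page_title) index will simply tend to emit titles in title order.

```python
import sqlite3


def titles_in_index_order(rows):
    # Simplified, hypothetical schema: only the columns the query touches.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page ("
        "page_id INTEGER PRIMARY KEY, "
        "page_namespace INTEGER NOT NULL, "
        "page_title TEXT NOT NULL)"
    )
    # Composite index matching the (page_namespace, page_title) index
    # mentioned in the comment above.
    conn.execute(
        "CREATE UNIQUE INDEX name_title ON page (page_namespace, page_title)"
    )
    conn.executemany("INSERT INTO page VALUES (?, ?, ?)", rows)
    # Same query as quoted above, with no ORDER BY clause.
    cursor = conn.execute("select page_title from page where page_namespace=0")
    result = [row[0] for row in cursor]
    conn.close()
    return result


# Rows are inserted in page ID order, which differs from title order.
rows = [(1, 0, "Zebra"), (2, 0, "Apple"), (3, 1, "Talk_page"), (4, 0, "Mango")]
print(titles_in_index_order(rows))
```

If the printed list comes back alphabetical rather than in insertion order, that is the effect described here: the order of the index, not an explicit sort step.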


Note that if you want them in some other order, such as page ID order, you can get that by pulling the stub XML dumps. These include page and revision metadata, ordered by page ID, then revision ID. (The current-version-only ones include just the latest revision, which you can easily discard.)
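
A minimal sketch of pulling titles out of a stub dump in the order the pages appear, using only the Python standard library. The inline `SAMPLE_STUB` document is a heavily simplified, hypothetical stand-in for the real export format, and the helper names are invented for the example. Filtering to namespace 0 is not shown.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; real stub dumps carry an XML namespace
# and many more elements per page and revision.
SAMPLE_STUB = """<mediawiki>
  <page><title>Zebra</title><id>1</id><revision><id>10</id></revision></page>
  <page><title>Talk:Zebra</title><id>2</id><revision><id>11</id></revision></page>
  <page><title>Apple</title><id>3</id><revision><id>12</id></revision></page>
</mediawiki>"""


def local_name(tag):
    # Drop a "{namespace}" prefix if the parser attached one.
    return tag.rsplit("}", 1)[-1]


def titles_in_dump_order(stream):
    titles = []
    # Stream the file so a large dump never has to fit in memory.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) == "page":
            for child in elem:
                if local_name(child.tag) == "title":
                    titles.append(child.text)
                    break
            # Discard the revision metadata once the title is recorded.
            elem.clear()
    return titles


print(titles_in_dump_order(io.BytesIO(SAMPLE_STUB.encode("utf-8"))))
```

For a real compressed dump, the stream would presumably come from something like `gzip.open(path, "rb")`, where `path` is whatever file was downloaded.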
