Last modified: 2011-11-29 03:20:59 UTC

Bug 14415 - Dump the article titles lists (all-titles-in-ns0.gz) unsorted
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Hardware: All All
Importance: Normal enhancement with 1 vote
Target Milestone: ---
Assigned To: Brion Vibber
Reported: 2008-06-05 16:52 UTC by Andrew Dunbar
Modified: 2011-11-29 03:20 UTC (History)


Description Andrew Dunbar 2008-06-05 16:52:15 UTC
For some uses it's useful to have the list of page titles in their natural unsorted order.

It's trivial for anybody to sort these lists if they are distributed unsorted.

It's impossible, however, for anybody to restore these lists to their original order.

The sort that's done is likely by byte or by code point, which is useful for English but wrong for most other languages. Sorting an already sorted list into a similar but different order is close to the worst case for quicksort, which most text file sorters use.
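
To illustrate the byte versus code point remark, here is a minimal Python sketch. The sample titles and the function names `codepoint_sorted` and `byte_sorted` are hypothetical, chosen only for this example. It shows that sorting by code point and sorting by UTF-8 bytes give the same order, and that this order places an accented title after every unaccented one, which is not where a German dictionary would put it.

```python
def codepoint_sorted(titles):
    """Sort titles by Unicode code point (Python's default str ordering)."""
    return sorted(titles)


def byte_sorted(titles):
    """Sort titles by their UTF-8 byte sequences, as a binary column would."""
    return sorted(titles, key=lambda title: title.encode("utf-8"))


# Hypothetical titles, not taken from any real dump.
sample = ["Zebra", "apple", "Ärger", "Apple", "zebra"]

# Upper case sorts before lower case, and "Ärger" lands after "zebra".
print(codepoint_sorted(sample))
# UTF-8 preserves code point order, so both sorts agree.
print(byte_sorted(sample) == codepoint_sorted(sample))
```

A locale-aware collation would be needed to get a language-appropriate order; which collation is right depends on the wiki's language, so it is left out of this sketch.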

Retrieving the page titles from the full dump is a lot more work than the sorting would be, and is more error-prone.

It would save a bit of work on the servers that do all the sorting for every wiki each time a new dump is made.
Comment 1 Brion Vibber 2008-06-05 20:07:09 UTC
There's no such thing as unsorted, just different possible sort orders.

They come out of the database in raw index order (namespace, title). If you want some other order, you can trivially sort them yourself.
Comment 2 Andrew Dunbar 2008-06-06 16:14:19 UTC
Please look at one of these files. It's "raw index order" I'm asking for.

The files are not provided in "raw index order" but in fact are provided sorted.
Comment 3 Brion Vibber 2008-08-14 20:43:54 UTC
They come out in the natural index order. Here's the query:

"select page_title from page where page_namespace=0;"

That would follow the (page_namespace,page_title) index.
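
The effect described here can be reproduced on a small scale. The sketch below is an assumption-laden stand-in: it uses SQLite from Python rather than the MySQL setup the dumps were produced from, a reduced `page` table with only three columns, a hypothetical index name `name_title`, and invented rows. It runs the query quoted above without an `ORDER BY` and compares the result with page ID order.

```python
import sqlite3


def titles_in_index_order(rows):
    """Load (page_id, page_namespace, page_title) rows and query namespace 0.

    Returns two lists: titles as returned by the query without ORDER BY,
    and titles explicitly ordered by page_id.
    """
    conn = sqlite3.connect(":memory:")
    # Reduced, hypothetical version of the page table.
    conn.execute(
        "CREATE TABLE page ("
        " page_id INTEGER PRIMARY KEY,"
        " page_namespace INTEGER NOT NULL,"
        " page_title TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE UNIQUE INDEX name_title ON page (page_namespace, page_title)"
    )
    conn.executemany("INSERT INTO page VALUES (?, ?, ?)", rows)

    # SQL promises no particular order without ORDER BY; the order seen here
    # comes from the planner scanning the (page_namespace, page_title) index.
    by_index = [
        row[0]
        for row in conn.execute(
            "SELECT page_title FROM page WHERE page_namespace = 0"
        )
    ]
    by_id = [
        row[0]
        for row in conn.execute(
            "SELECT page_title FROM page WHERE page_namespace = 0 ORDER BY page_id"
        )
    ]
    conn.close()
    return by_index, by_id


# Invented rows: page IDs reflect creation order, not title order.
rows = [(1, 0, "Zebra"), (2, 0, "Apple"), (3, 1, "Apple"), (4, 0, "Mango")]
by_index, by_id = titles_in_index_order(rows)
print(by_index)
print(by_id)
```

In this sketch the first list comes back in title order because the index scan yields it that way, which is why the dumped file looks sorted even though no explicit sort step is run.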

Note that if you want them in some other order, like say page ID order, you can get that by pulling the stub XML dumps. These include page & revision metadata, ordered by page ID then revision ID. (The current-version only ones would only include the latest revision, which you can easily discard.)
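
As a rough sketch of that approach, the Python below reads page titles from a stub dump in the order the pages appear, which per the comment above is page ID order. It assumes the export format puts `<title>` and `<ns>` elements directly under each `<page>`; older export versions may lack `<ns>`, in which case this sketch skips the page. The function names and the tiny inline sample are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace prefix, e.g. '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def titles_in_dump_order(stream, ns="0"):
    """Return titles of pages in namespace `ns`, in the order they appear."""
    titles = []
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        # Only direct children of <page>; nested revision fields are ignored.
        fields = {local_name(child.tag): (child.text or "") for child in elem}
        if fields.get("ns") == ns:
            titles.append(fields.get("title", ""))
        elem.clear()  # release the finished page to keep memory use low
    return titles


# Hypothetical miniature stub dump; the namespace URI is a placeholder.
sample_dump = b"""<mediawiki xmlns="http://example.org/export">
  <page><title>Zebra</title><ns>0</ns><id>1</id>
    <revision><id>10</id></revision></page>
  <page><title>Talk:Zebra</title><ns>1</ns><id>2</id>
    <revision><id>11</id></revision></page>
  <page><title>Apple</title><ns>0</ns><id>3</id>
    <revision><id>12</id></revision></page>
</mediawiki>"""

print(titles_in_dump_order(io.BytesIO(sample_dump)))
```

For a real dump the stream would be the decompressed file opened in binary mode; streaming with `iterparse` avoids loading the whole document at once.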
