Last modified: 2013-06-04 11:39:18 UTC

Bug 21195 - Include page count in database dumps
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Hardware: All All
Importance: Lowest enhancement
Assigned To: Ariel T. Glenn
: analytics
Reported: 2009-10-19 23:13 UTC by Gurch
Modified: 2013-06-04 11:39 UTC

Description Gurch 2009-10-19 23:13:59 UTC
The dumps "pages-meta-current" and "pages-articles", as well as the hypothetical article-namespace-only dump that I would like to see (bug 18919), should include the total number of pages in the dump at the start of the file in the "siteinfo" section.

Among other things, it would be useful for displaying dump search progress to the user. Attempts to estimate the total number based on a small proportion of the file seem to produce wildly inaccurate results, especially with the en.wikipedia dump (pages are approximately ordered by creation time, and it seems the older a page is, the larger it is, which makes sense). Even if it were more accurate, it would be helpful to have the exact number to hand. And obviously the extra few bytes in a 25GB file are negligible :)

An analogous thing could probably be done for some of the other dumps.
Comment 1 Ariel T. Glenn 2013-06-04 06:41:07 UTC
We don't know the total number of pages that will be dumped until the end of the dump;  deleted and hidden revisions are skipped and we don't know how many of those there are in advance.  So writing it in the header isn't really feasible.

If you want to get that number yourself for some use, the quickest way is probably to count the title tags in the stubs pages-articles (or stubs meta-current) file.
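
A minimal sketch of the counting approach suggested above, assuming the stubs file is gzip-compressed XML with one <title> element per page. The function names and the example path are hypothetical, not part of any dump tooling.

```python
import gzip


def count_titles(lines):
    """Count <title> opening tags in an iterable of text lines."""
    return sum(line.count("<title>") for line in lines)


def count_titles_in_gzip(source):
    """Stream a gzip-compressed stubs file and count its <title> tags.

    `source` may be a file path or a binary file object. The file is read
    line by line, so it is never fully held in memory.
    """
    with gzip.open(source, "rt", encoding="utf-8") as stream:
        return count_titles(stream)
```

Usage would look like `count_titles_in_gzip("stub-articles.xml.gz")`, where the file name is a placeholder for whichever stubs file has been downloaded. This is still a full pass over the stubs file, but that file is far smaller than the full-text dump.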
Comment 2 Andrew Dunbar 2013-06-04 07:46:53 UTC
The easiest way to include a value in a file that isn't known until after the file is created is to include a null field with enough space. We know a 64-bit unsigned int needs a maximum of 20 ASCII characters so anything like these:

<foo key1="val1" length="00000000000000000000" key3="val3">bar</foo>

<foo key1="val1"                               key3="val3">bar</foo>

is easy to output on the first pass; we can then go back and fill in the field once its value is known.
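
A rough sketch of that two-step idea, assuming the output is an uncompressed, seekable binary stream. The `pagecount` attribute and the `write_dump` function are made up for illustration and are not part of the real dump schema.

```python
from xml.sax.saxutils import escape

FIELD_WIDTH = 20  # the largest unsigned 64-bit value has 20 decimal digits
PLACEHOLDER = b"0" * FIELD_WIDTH


def write_dump(out, titles):
    """Write pages to `out`, then patch the reserved count field in place."""
    out.write(b'<siteinfo pagecount="')
    field_offset = out.tell()          # remember where the field starts
    out.write(PLACEHOLDER)             # first pass: reserve fixed-width space
    out.write(b'"/>\n')

    count = 0
    for title in titles:
        out.write(b"<page><title>" + escape(title).encode("utf-8") + b"</title></page>\n")
        count += 1

    end = out.tell()
    out.seek(field_offset)             # go back and overwrite the placeholder
    out.write(str(count).zfill(FIELD_WIDTH).encode("ascii"))
    out.seek(end)                      # restore the position at end of file
    return count
```

Because the field has a fixed width, overwriting it does not shift any later bytes. That property only holds for the uncompressed byte stream.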

Downloading another huge file, decompressing it, and scanning through it is very inefficient in comparison. It would mean one slow pass just to pre-calculate the maximum value for a progress bar whose function is to give you an idea how long a slow pass is going to take. And we won't be able to have a progress bar on the new prior slow pass (-:
Comment 3 Ariel T. Glenn 2013-06-04 09:31:55 UTC
The output files are compressed. We would need to uncompress and rewrite the entire file to do what you suggest. It's possible that in the new incremental format that will be devised this summer, such data could be provided (for the incremental, not for the full).

We could provide a separate file with these numbers in it but I'd get them by the same grep I recommended to you.
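
To illustrate why the compressed output cannot simply be patched in place, here is a small sketch assuming bz2 compression (the thread does not name the compressor) and a made-up `pagecount` attribute. Two inputs of identical length that differ only in the count field still compress to different byte streams, so filling in the field means recompressing.

```python
import bz2


def first_difference(a, b):
    """Return the index of the first differing byte, or None if a == b."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


# Hypothetical miniature dump: same body, header differs only in the count.
body = b"".join(b"<page><title>Page %d</title></page>\n" % i for i in range(1000))
with_placeholder = b'<siteinfo pagecount="' + b"0" * 20 + b'"/>\n' + body
with_count = b'<siteinfo pagecount="' + b"1000".rjust(20, b"0") + b'"/>\n' + body

compressed_placeholder = bz2.compress(with_placeholder)
compressed_count = bz2.compress(with_count)

# The uncompressed inputs have equal length, yet the compressed streams differ.
offset = first_difference(compressed_placeholder, compressed_count)
print("compressed streams first differ at byte", offset)
```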
Comment 4 Andrew Dunbar 2013-06-04 09:58:54 UTC
Aha that makes sense. Well it is just a wish but not an entirely frivolous one I hope (-:
Comment 5 Ariel T. Glenn 2013-06-04 11:08:24 UTC
woops, so did not mean to do that.
Comment 6 Ariel T. Glenn 2013-06-04 11:09:23 UTC
I want to mark this as 'later' but I don't see a place to do that.  Let's revisit this after we have the new sort-of-seekable (probably) dump format and see how things look.
Comment 7 Andre Klapper 2013-06-04 11:39:18 UTC
(In reply to comment #6)
> I want to mark this as 'later' but I don't see a place to do that.

I've disabled it. You can use lowest priority.
