Last modified: 2013-06-04 11:39:18 UTC
The dumps "pages-meta-current" and "pages-articles", as well as the hypothetical article-namespace-only dump that I would like to see (bug 18919), should include the total number of pages in the dump at the start of the file in the "siteinfo" section. Among other things, it would be useful for displaying dump search progress to the user. Attempts to estimate the total number based on a small proportion of the file seem to produce wildly inaccurate results, especially with the en.wikipedia dump (pages are approximately ordered by creation time, and it seems the older a page is, the larger it is, which makes sense). Even if it were more accurate, it would be helpful to have the exact number to hand. And obviously the extra few bytes in a 25GB file are negligible :) An analogous thing could probably be done for some of the other dumps.
We don't know the total number of pages that will be dumped until the end of the dump; deleted and hidden revisions are skipped and we don't know how many of those there are in advance. So writing it in the header isn't really feasible. If you want to get that number yourself for some use, the quickest way is probably to count the title tags in the stubs pages-articles (or stubs meta-current) file.
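A sketch of that counting approach in Python (the dump filename below is only an example, and `count_titles` is a made-up helper; stubs dumps are gzip-compressed XML in which each page has exactly one `<title>` element):

```python
import gzip

def count_titles(lines):
    # Each <page> in a stubs dump contains exactly one <title> tag,
    # so counting <title> occurrences counts pages.
    return sum(line.count("<title>") for line in lines)

# Real usage against an actual dump file (filename is an assumption):
# with gzip.open("enwiki-latest-stub-meta-current.xml.gz", "rt",
#                encoding="utf-8") as f:
#     print(count_titles(f))

# Tiny inline demo:
sample = ["<page>\n", "  <title>Foo</title>\n", "</page>\n",
          "<page>\n", "  <title>Bar</title>\n", "</page>\n"]
print(count_titles(sample))  # prints 2
```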
The easiest way to include a value in a file that isn't known until after the file is created is to reserve a null field with enough space. A 64-bit unsigned int needs at most 20 ASCII characters, so anything like these:

<foo key1="val1" length="00000000000000000000" key3="val3">bar</foo>
<foo key1="val1" key3="val3">bar</foo>

is easy to output on the first pass; then we go back and fill in the field once we know its value. Downloading another huge file, decompressing it, and scanning through it is very inefficient in comparison. It would mean one slow pass just to pre-calculate the maximum value for a progress bar whose only function is to give you an idea of how long a slow pass is going to take. And we wouldn't be able to have a progress bar on the new prior slow pass (-:
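A minimal Python sketch of that null-field technique, assuming an uncompressed output file (as the next comment notes, the real dumps are compressed, so this wouldn't apply there as-is); the `pagecount` attribute and element names are invented for illustration:

```python
import os
import tempfile

PLACEHOLDER = "0" * 20  # max width of a 64-bit unsigned int in decimal

def write_with_count(path, titles):
    with open(path, "w+") as f:
        f.write('<siteinfo pagecount="')
        digit_pos = f.tell()            # remember where the placeholder starts
        f.write(PLACEHOLDER + '">\n')
        n = 0
        for title in titles:
            f.write(f"  <page><title>{title}</title></page>\n")
            n += 1
        f.write("</siteinfo>\n")
        # Second "pass": seek back and overwrite the placeholder in place.
        # The replacement is zero-padded to the same width, so no bytes
        # after it need to move.
        f.seek(digit_pos)
        f.write(str(n).zfill(len(PLACEHOLDER)))

path = os.path.join(tempfile.gettempdir(), "demo_dump.xml")
write_with_count(path, ["Foo", "Bar", "Baz"])
```

The key point is that the fix-up is a constant-width in-place overwrite, not a rewrite of the file.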
The output files are compressed. We would need to uncompress and rewrite the entire file to do what you suggest. It's possible that in the new incremental format that will be devised this summer (http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps), such data could be provided (for the incremental, not for the full). We could provide a separate file with these numbers in it but I'd get them by the same grep I recommended to you.
Aha that makes sense. Well it is just a wish but not an entirely frivolous one I hope (-:
Whoops, did not mean to do that.
I want to mark this as 'later' but I don't see a place to do that. Let's revisit this after we have the new sort-of-seekable (probably) dump format and see how things look.
(In reply to comment #6)
> I want to mark this as 'later' but I don't see a place to do that.

I've disabled it (see http://lists.wikimedia.org/pipermail/wikitech-l/2012-November/064240.html). You can use lowest priority.