Last modified: 2013-06-04 11:39:18 UTC
The dumps "pages-meta-current" and "pages-articles", as well as the hypothetical article-namespace-only dump that I would like to see (bug 18919), should include the total number of pages in the dump at the start of the file in the "siteinfo" section. Among other things, it would be useful for displaying dump search progress to the user. Attempts to estimate the total number based on a small proportion of the file seem to produce wildly inaccurate results, especially with the en.wikipedia dump (pages are approximately ordered by creation time, and it seems the older a page is, the larger it is, which makes sense). Even if it were more accurate, it would be helpful to have the exact number to hand. And obviously the extra few bytes in a 25GB file are negligible :) An analogous thing could probably be done for some of the other dumps.
We don't know the total number of pages that will be dumped until the end of the dump; deleted and hidden revisions are skipped and we don't know how many of those there are in advance. So writing it in the header isn't really feasible. If you want to get that number yourself for some use, the quickest way is probably to count the title tags in the stubs pages-articles (or stubs meta-current) file.
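A sketch of that counting approach in Python (the dump filename below is only an example, and `count_titles` is a made-up helper; stubs dumps are gzip-compressed XML in which each page has exactly one `<title>` element):

```python
import gzip

def count_titles(lines):
    # Each <page> in a stubs dump contains exactly one <title> tag,
    # so counting <title> occurrences counts pages.
    return sum(line.count("<title>") for line in lines)

# Real usage against an actual dump file (filename is an assumption):
# with gzip.open("enwiki-latest-stub-meta-current.xml.gz", "rt",
#                encoding="utf-8") as f:
#     print(count_titles(f))

# Tiny inline demo:
sample = ["<page>\n", "  <title>Foo</title>\n", "</page>\n",
          "<page>\n", "  <title>Bar</title>\n", "</page>\n"]
print(count_titles(sample))  # prints 2
```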
The easiest way to include a value in a file that isn't known until after the file is created is to reserve a null field with enough space. A 64-bit unsigned int needs at most 20 ASCII characters, so anything like these:

<foo key1="val1" length="00000000000000000000" key3="val3">bar</foo>
<foo key1="val1" key3="val3">bar</foo>

is easy to output on the first pass; then we go back and fill in the field once we know its value. Downloading another huge file, decompressing it, and scanning through it is very inefficient in comparison. It would mean one slow pass just to pre-calculate the maximum value for a progress bar whose only function is to give you an idea of how long a slow pass is going to take. And we wouldn't be able to have a progress bar on the new prior slow pass (-:
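A minimal Python sketch of that null-field technique, assuming an uncompressed output file (as the next comment notes, the real dumps are compressed, so this wouldn't apply there as-is); the `pagecount` attribute and element names are invented for illustration:

```python
import os
import tempfile

PLACEHOLDER = "0" * 20  # max width of a 64-bit unsigned int in decimal

def write_with_count(path, titles):
    with open(path, "w+") as f:
        f.write('<siteinfo pagecount="')
        digit_pos = f.tell()            # remember where the placeholder starts
        f.write(PLACEHOLDER + '">\n')
        n = 0
        for title in titles:
            f.write(f"  <page><title>{title}</title></page>\n")
            n += 1
        f.write("</siteinfo>\n")
        # Second "pass": seek back and overwrite the placeholder in place.
        # The replacement is zero-padded to the same width, so no bytes
        # after it need to move.
        f.seek(digit_pos)
        f.write(str(n).zfill(len(PLACEHOLDER)))

path = os.path.join(tempfile.gettempdir(), "demo_dump.xml")
write_with_count(path, ["Foo", "Bar", "Baz"])
```

The key point is that the fix-up is a constant-width in-place overwrite, not a rewrite of the file.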
The output files are compressed. We would need to uncompress and rewrite the entire file to do what you suggest. It's possible that in the new incremental format that will be devised this summer (http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps), such data could be provided (for the incremental, not for the full). We could provide a separate file with these numbers in it but I'd get them by the same grep I recommended to you.
Aha that makes sense. Well it is just a wish but not an entirely frivolous one I hope (-:
Whoops, did not mean to do that.
I want to mark this as 'later' but I don't see a place to do that. Let's revisit this after we have the new sort-of-seekable (probably) dump format and see how things look.
(In reply to comment #6)
> I want to mark this as 'later' but I don't see a place to do that.

I've disabled it (see http://lists.wikimedia.org/pipermail/wikitech-l/2012-November/064240.html). You can use lowest priority.