Last modified: 2011-10-06 18:06:02 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T28499, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 26499 - Include uncompressed size and other metadata in each dump file
Status: NEW
Product: Datasets
Classification: Unclassified
Component: General/Unknown (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Ariel T. Glenn
: analytics
Depends on:
Blocks:
Reported: 2010-12-30 09:35 UTC by Andrew Dunbar
Modified: 2011-10-06 18:06 UTC
CC: 7 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Andrew Dunbar 2010-12-30 09:35:38 UTC
Since dump files can be so huge, it is common to process them while still compressed. To enable progress reports when processing compressed dumps we need to know the total size, which of course we can't know unless we decompress the file. It would be trivial to add a metadata field, say on the root element, stating the uncompressed size of the dump file.

The easiest way would be to include it as a fixed-length string, say 16 hexadecimal characters, which would allow for 64 bits. When initially generating the dump this field would be set to "0000000000000000". Once dump generation is complete we know the total size and can go back and fill in this field without altering the length of the dump.

Of course, if we generate dump files directly in their compressed form, this may not be possible.

Depending on how we generate the dumps we might know how many lines they have, which would also be very useful for those of us processing them line by line.
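
For illustration, a minimal Python sketch of the fixed-width placeholder idea above; the attribute name and file layout are made up for the example and are not part of the real dump schema:

  PLACEHOLDER = b"0000000000000000"   # 16 hex digits, room for a 64-bit size

  def write_dump(path, pages):
      with open(path, "wb") as f:
          f.write(b'<mediawiki uncompressed-size="')
          field_offset = f.tell()              # remember where the field starts
          f.write(PLACEHOLDER + b'">\n')
          for page in pages:
              f.write(page.encode("utf-8"))
          f.write(b"</mediawiki>\n")
          total = f.tell()                     # total uncompressed size in bytes
      # Patch the placeholder in place; the file length does not change.
      with open(path, "r+b") as f:
          f.seek(field_offset)
          f.write(b"%016x" % total)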
Comment 1 Brion Vibber 2011-01-02 19:53:25 UTC
Dump files are generated directly to their compressed form, so these exact things aren't really possible to put in. (The compression formats might make it possible to get the uncompressed size without doing all the decompression work, though.)

Having a count of the number of articles/revisions in the snapshot available as additional metadata outside the file is probably more doable (and easier to relate to a progress report on XML stream processing).
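
For illustration, gzip is one format that does record the uncompressed size without requiring full decompression: the last four bytes of a gzip member hold the size modulo 2^32, so it is only reliable for single-member files under 4 GB, and bzip2 stores no comparable field. A minimal Python sketch:

  import struct

  def gzip_uncompressed_size(path):
      with open(path, "rb") as f:
          f.seek(-4, 2)                              # last four bytes of the file
          return struct.unpack("<I", f.read(4))[0]   # size modulo 2**32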
Comment 2 Diederik van Liere 2011-01-28 16:15:46 UTC
+1
Comment 3 Adam Wight 2011-02-23 08:16:55 UTC
Metadata is sorely needed in most of your dump files, IMO.

This is an unorthodox suggestion, but consider the fact that the .bz2 format can be split and reassembled without codec operations. You could postpone writing the beginning of the root element and metadata until the end of the backup, and then put all sorts of useful information in the "header", which is compressed separately and its chunks prepended to the remainder of the data.

(Hope you're taking advantage of this fact to recombine parallel jobs as well.)
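
For illustration, a minimal Python sketch of the prepend trick, with made-up metadata: the body is compressed first, the header is compressed once the totals are known, and the two bzip2 streams are simply concatenated; bzip2 readers accept concatenated streams.

  import bz2

  body = bz2.compress(b"<page>...</page>\n</mediawiki>\n")
  header = bz2.compress(b'<mediawiki uncompressed-size="12345">\n')  # hypothetical attribute

  with open("dump.xml.bz2", "wb") as f:
      f.write(header + body)

  with bz2.open("dump.xml.bz2", "rb") as f:    # reads both streams transparently
      print(f.read().decode("utf-8"))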
Comment 4 Adam Wight 2011-02-24 22:23:39 UTC
A rough proposal for the metadata; please help elaborate: (page_id_start, page_id_end, generator_id_string, snapshot_timestamp, namespaces, history_selector, uncompressed_size ...)

If one of the job outputs is corrupted, for example, this will make it easy to diagnose and recover.
Comment 5 Platonides 2011-06-02 21:50:25 UTC
> Dump files are generated directly to their compressed form, so these exact
> things aren't really possible to put in.
You can just keep the count when writing it (eg, libbzip2 has counters just for giving the applications that convenience).
Comment 6 Brion Vibber 2011-06-02 21:54:24 UTC
(In reply to comment #5)
> > Dump files are generated directly to their compressed form, so these exact
> > things aren't really possible to put in.
> You can just keep the count when writing it (eg, libbzip2 has counters just for
> giving the applications that convenience).

Well yes, but you won't have that final count until you've finished writing the entire file, so you can't really include it in the header of the file. You can put it in another file, or maybe you can append it as some kind of metadata at the *end* of the compressed file, or a second file directory entry or something depending on the format.
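
For illustration, a minimal Python sketch of the separate-file variant: count the uncompressed bytes (and pages) while compressing, then write the totals to a small companion file once the dump is finished. The file name and fields are made up for the example.

  import bz2, json

  def compress_dump(pages, out_path):
      comp = bz2.BZ2Compressor()
      uncompressed = npages = 0
      with open(out_path, "wb") as out:
          for page in pages:
              data = page.encode("utf-8")
              uncompressed += len(data)
              npages += 1
              out.write(comp.compress(data))
          out.write(comp.flush())
      with open(out_path + ".meta", "w") as meta:      # sidecar metadata file
          json.dump({"uncompressed_bytes": uncompressed, "pages": npages}, meta)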
Comment 7 Platonides 2011-06-02 22:35:03 UTC
Sorry, I didn't pay enough attention to the first post; I was thinking of providing that metadata separately.
Comment 8 Diederik van Liere 2011-06-02 22:40:04 UTC
Or alternatively, first create the page XML elements, and once that's done and you have collected metadata like the number of articles, uncompressed size, etc., prepend the metadata, <siteinfo>, and <mediawiki> XML elements to the XML file. A simple cat operation would do that; finally, append the closing </mediawiki> tag at the end of the XML document.
Comment 9 Platonides 2011-06-03 22:00:31 UTC
Diederik, they are not created uncompressed in memory.

I think we should just move to xz (mainly for the space benefits), which would provide the uncompressed size as an added value.
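
For illustration, the xz container records the uncompressed size in its stream index, so xz --list can report it without decompressing. A minimal sketch that shells out to xz; the column position in the robot output is an assumption, so check xz(1) on your system:

  import subprocess

  out = subprocess.run(["xz", "--robot", "--list", "dump.xml.xz"],
                       capture_output=True, text=True, check=True).stdout
  for line in out.splitlines():
      if line.startswith("file\t"):
          fields = line.split("\t")
          print("uncompressed bytes:", fields[4])      # assumed column position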
Comment 10 Diederik van Liere 2011-06-03 22:04:31 UTC
xz compression sounds good to me!
Comment 11 Adam Wight 2011-06-04 11:07:57 UTC
Make it a requirement that the compression library is able to report compressed block boundaries as it is working, so an index can be generated.  This will open many possibilities for mediawiki on mobile, DVD, and other resource-limited scenarios.

N.B.: the libbzip2 counters are not accessible from PHP.
Comment 12 Ariel T. Glenn 2011-08-29 18:07:24 UTC
(In response to comment 11) 
No, they aren't, but I have a C library that could be used to build such an index for bzip2 files without a ton of work; specifically, there is a utility to find the offset to a block containing a specific pageID. Since 7z and gzip aren't block-oriented, it's not possible to generate an index for those files.

However, this feature is not as useful as you might think. For dump files that contain all revisions, it can take quite a while to locate a given pageID. That's because there are a few pages which, if the guesser happens to land in the middle of them, are ginormous (up to 163 GB) and take up to an hour to read through. If one prebuilt an index that mapped revision IDs to page IDs and kept this in memory, things could be sped up a fair amount; alternatively, one could work just with the current revisions.

(In response to comment 9)
Moving to xz will mean a rewrite of my bz2 library and utils and all the bits that rely on them, so that's not likely to happen until Dumps 2.0.

(In response to comment 8)
The easiest way to provide metadata of this nature is, like the md5 sums, to provide it in a separate file.
Comment 13 Andrew Dunbar 2011-08-29 18:54:26 UTC
There is a little tool for indexing the blocks in bzip2: http://bitbucket.org/james_taylor/seek-bzip2

There is a more complicated one for gzip too:
http://svn.ghostscript.com/ghostscript/tags/zlib-1.2.3/examples/zran.c
Comment 14 Ariel T. Glenn 2011-08-29 19:39:55 UTC
Yeah, I'm familiar with seek-bzip2, but it didn't do what I needed for my use case.  I wanted to be able to easily locate a given XML page in a dump file without an index. The gzip tool appears to read through the entire file (and then keep it in memory) for random access, something we wouldn't want to do for large files like the en wikipedia dumps. 

Another approach is to make each page a separate bzip2 stream; I haven't decided whether that's a good thing or not (and it too would require reworking a bunch of things that aren't designed to handle multiple streams).
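
For illustration, a minimal Python sketch of the one-stream-per-page layout, with a side index of byte offsets for random access (the layout is hypothetical, not the current dump format):

  import bz2

  def write_streams(pages, path):
      """pages: iterable of (page_id, xml_text); returns {page_id: byte offset}."""
      index = {}
      with open(path, "wb") as f:
          for page_id, xml in pages:
              index[page_id] = f.tell()
              f.write(bz2.compress(xml.encode("utf-8")))   # one bzip2 stream per page
      return index

  def read_page(path, offset):
      with open(path, "rb") as f:
          f.seek(offset)
          data = f.read()                   # a real reader would read in chunks
      return bz2.BZ2Decompressor().decompress(data).decode("utf-8")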
Comment 15 Ángel González 2011-08-29 22:04:36 UTC
I have a similar one, too, although in this case it recompressed the bzip2 files with given parameters.

I didn't expect it to work efficiently with history dumps, but nonetheless I'm surprised that the pages get *that* big.
Comment 16 Ariel T. Glenn 2011-08-29 22:19:33 UTC
See Administrators'_noticeboard/Incidents, a total of 561938 revs last time I looked (which was over a month ago, surely even worse now).
Comment 17 Adam Wight 2011-10-06 18:06:02 UTC
What about saving several indexes of data each in their own file?

For illustration,

  tlwiki-20110926-pages-meta-history.xml.bz2.index-on-revision.sqlite3
  tlwiki-20110926-pages-meta-history.xml.bz2.index-on-page.sqlite3
  tlwiki-20110926-pages-meta-history.xml.bz2.index-on-title.sqlite3
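
A minimal sketch of what one such companion index could hold (the schema is made up for the example): a SQLite table mapping page ids to byte offsets in the compressed dump.

  import sqlite3

  db = sqlite3.connect("tlwiki-20110926-pages-meta-history.xml.bz2.index-on-page.sqlite3")
  db.execute("CREATE TABLE IF NOT EXISTS page_index (page_id INTEGER PRIMARY KEY, offset INTEGER)")
  db.executemany("INSERT OR REPLACE INTO page_index VALUES (?, ?)",
                 [(1, 0), (2, 4096)])                  # example rows only
  db.commit()

  (offset,) = db.execute("SELECT offset FROM page_index WHERE page_id = ?", (2,)).fetchone()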


