Last modified: 2011-10-06 18:06:02 UTC
Since dump files can be so huge, it is common to process them while still compressed. To enable progress reports when processing compressed dumps we need to know the total size, which of course we can't know without decompressing. It would be trivial to add a metadata field, say on the root element, stating the uncompressed size of the dump file. The easiest way would be to include it as a fixed-length string, say 16 hexadecimal characters, which would allow for 64 bits. When initially generating the dump this field would be set to "0000000000000000". After dump generation completes we know the final size and can go back and fill in this field without altering the length of the dump. Of course, if we generate dump files directly in their compressed form this may not be possible. Depending on how we generate the dumps we might also know how many lines they have, which would be very useful for those of us processing them line by line.
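A minimal sketch of the fixed-width placeholder idea, for an *uncompressed* dump: write 16 hex digits up front, then patch them in place once the final size is known. The `bytes` attribute name is invented for illustration; it is not part of the real dump schema.

```python
import os

PLACEHOLDER = "0000000000000000"  # 16 hex chars = room for a 64-bit size

def write_dump(path, pages):
    # Emit the dump with a fixed-width size placeholder on the root element.
    with open(path, "w") as f:
        f.write('<mediawiki bytes="%s">\n' % PLACEHOLDER)
        for p in pages:
            f.write("  <page>%s</page>\n" % p)
        f.write("</mediawiki>\n")

def patch_size(path):
    # Overwrite the placeholder in place; same width, so no offsets shift.
    size = os.path.getsize(path)
    with open(path, "rb+") as f:
        head = f.read(64)
        pos = head.index(PLACEHOLDER.encode())
        f.seek(pos)
        f.write(b"%016x" % size)
```

Because the replacement is exactly the same length as the placeholder, the patch never changes the file size, so the value written is already final.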
Dump files are generated directly to their compressed form, so these exact things aren't really possible to put in. (The compression formats might make it possible to get the uncompressed size without doing all the decompression work, though.) Having a count of the number of articles/revisions in the snapshot available as additional metadata outside the file is probably more doable (and easier to relate to a progress report on XML stream processing).
+1
Metadata is sorely needed in most of your dump files, IMO. This is an unorthodox suggestion, but consider the fact that the .bz2 format can be split and reassembled without codec operations. You could postpone writing the beginning of the root element and its metadata until the end of the backup, then put all sorts of useful information in the "header", which is compressed and the chunks prepended to the remainder of the data. (Hope you're taking advantage of this fact to recombine parallel jobs as well.)
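The same splice trick can be sketched at stream granularity (bzip2 also allows splicing at block boundaries, but streams are what stdlib tools expose): compress the body first, compress the metadata header afterwards, and prepend the compressed header bytes. Standard bzip2 tools, and Python's `bz2` module, decompress concatenated streams as one logical file. The `size-hint` attribute is a made-up example, not a real schema field.

```python
import bz2

# Body is written (and compressed) first, while counters accumulate.
body = bz2.compress(b"<page>...</page>\n</mediawiki>\n")

# Header is compressed last, once the metadata is known, then prepended.
header = bz2.compress(b'<mediawiki size-hint="12345">\n')

combined = header + body  # no recompression needed to splice
restored = bz2.decompress(combined)
```

No codec work happens at splice time; the two compressed chunks are simply concatenated, which is also how parallel compression jobs can be recombined.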
A rough proposal for the metadata, please help elaborate: (page_id_start, page_id_end, generator_id_string, snapshot_timestamp, namespaces, history_selector, uncompressed_size ...) If one of the job outputs is corrupted, for example, this will make it easy to diagnose and recover.
> Dump files are generated directly to their compressed form, so these exact
> things aren't really possible to put in.

You can just keep the count when writing it (eg, libbzip2 has counters just for giving the applications that convenience).
(In reply to comment #5)
> > Dump files are generated directly to their compressed form, so these exact
> > things aren't really possible to put in.
> You can just keep the count when writing it (eg, libbzip2 has counters just for
> giving the applications that convenience).

Well yes, but you won't have that final count until you've finished writing the entire file, so you can't really include it in the header of the file. You can put it in another file, or maybe you can append it as some kind of metadata at the *end* of the compressed file, or a second file directory entry or something, depending on the format.
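Keeping the count while writing is straightforward to sketch: wrap the compressor in a small writer that tallies the uncompressed bytes fed in, then emit the total to a separate metadata file once the dump is finished. This is an illustrative class, not the actual dump code.

```python
import bz2

class CountingWriter:
    """Compress to a file while counting the uncompressed bytes written."""

    def __init__(self, path):
        self.f = open(path, "wb")
        self.comp = bz2.BZ2Compressor()
        self.uncompressed = 0

    def write(self, data):
        self.uncompressed += len(data)          # tally before compression
        self.f.write(self.comp.compress(data))  # may buffer internally

    def close(self):
        self.f.write(self.comp.flush())         # drain remaining output
        self.f.close()
```

After `close()`, `uncompressed` holds the final count, which can be written to a side file (much like the md5sums files) since the compressed header can no longer be patched.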
Sorry, I didn't pay enough attention to the first post; I was thinking of providing that metadata separately.
Or alternatively: first create the page XML elements, and once that's done and you have collected metadata like the number of articles, uncompressed size, etc., prepend the metadata, <siteinfo> and opening <mediawiki> elements to the XML file. A simple cat operation would do that; finally, append the closing </mediawiki> tag at the end of the document.
Diederik, they are not created uncompressed in memory. I think we should just move to xz (mainly for the space benefits), which would provide the uncompressed size as an added value.
xz compression sounds good to me!
Make it a requirement that the compression library is able to report compressed block boundaries as it is working, so an index can be generated. This will open many possibilities for mediawiki on mobile, DVD, and other resource-limited scenarios. n.b. -- the libbzip2 counters are not accessible from php.
(In response to comment 11) No, they aren't, but I have a C library that could be used to build such an index without a ton of work, for bzip2 files; specifically, there is a utility to find the offset to a block containing a specific pageID. Since 7z and gzip aren't block-oriented, it's not possible to generate an index for those files. However, this feature is not as useful as you might think. For dump files that contain all revisions, it can take quite a while to locate a given pageID. That's because there are a few pages which, if the guesser happens to land in the middle of them, are ginormous (up to 163 GB) and take up to an hour to read through. If one prebuilt an index that mapped revision IDs to page IDs and kept this in memory, things could be sped up a fair amount; alternatively one could work just with the current revisions.

(In response to comment 9) Moving to xz would mean a rewrite of my bz2 library and utils and all the bits that rely on them, so that's not likely to happen until Dumps 2.0.

(In response to comment 8) The easiest way to provide metadata of this nature is, like the md5 sums, to provide it in a separate file.
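The prebuilt-index lookup described above can be sketched as a binary search over sorted (first page id in block, byte offset) pairs: find the last block whose first page id is at or below the target. The index values here are fabricated for illustration.

```python
import bisect

# Hypothetical prebuilt block index: (first_page_id, byte_offset) pairs,
# sorted by page id, one entry per compressed block.
index = [(1, 0), (500, 40960), (1200, 81920), (9000, 123000)]
page_ids = [pid for pid, _ in index]

def block_offset(page_id):
    """Offset of the block that may contain page_id, or None if before all."""
    i = bisect.bisect_right(page_ids, page_id) - 1
    return index[i][1] if i >= 0 else None
```

Keeping such an index in memory turns the "guess and scan" search into one lookup plus a single decompress-and-scan of the chosen block, avoiding long reads through the giant pages.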
There is a little tool for indexing the blocks in bzip2: http://bitbucket.org/james_taylor/seek-bzip2 There is a more complicated one for gzip too: http://svn.ghostscript.com/ghostscript/tags/zlib-1.2.3/examples/zran.c
Yeah, I'm familiar with seek-bzip2, but it didn't do what I needed for my use case: I wanted to be able to easily locate a given XML page in a dump file without an index. The gzip tool appears to read through the entire file (and then keep the result in memory) for random access, something we wouldn't want to do for large files like the en wikipedia dumps. Another approach is to make each page a separate bzip2 stream; I haven't decided whether that's a good thing or not (and it too would require reworking a bunch of things that aren't designed to handle multiple streams).
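The one-stream-per-page idea can be sketched with stdlib `bz2`: each page becomes its own stream, an offset is recorded per page, and a reader decompresses exactly one stream starting at that offset (a `BZ2Decompressor` stops at end of stream, leaving trailing bytes untouched). Page ids and contents here are made up.

```python
import bz2

# Build a dump where every page is an independent bzip2 stream,
# recording the byte offset at which each page's stream begins.
pages = {10: b"<page>ten</page>", 42: b"<page>forty-two</page>"}
offsets = {}
blob = b""
for pid, xml in sorted(pages.items()):
    offsets[pid] = len(blob)
    blob += bz2.compress(xml)

def read_page(blob, offset):
    # A single BZ2Decompressor consumes exactly one stream; bytes from
    # the following streams end up in .unused_data and are ignored here.
    return bz2.BZ2Decompressor().decompress(blob[offset:])
```

The cost is worse compression (each tiny stream pays full header overhead and loses cross-page redundancy), which is part of why it's not obviously a good trade for real dumps.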
I have a similar one too, although in this case it recompresses the bzip2 files with given parameters. I didn't expect it to work efficiently with history dumps, but nonetheless I'm surprised that the pages get *that* big.
See Administrators'_noticeboard/Incidents, a total of 561938 revs last time I looked (which was over a month ago; surely even worse now).
What about saving several indexes of data, each in its own file? For illustration:

tlwiki-20110926-pages-meta-history.xml.bz2.index-on-revision.sqlite3
tlwiki-20110926-pages-meta-history.xml.bz2.index-on-page.sqlite3
tlwiki-20110926-pages-meta-history.xml.bz2.index-on-title.sqlite3
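One such side file could be as simple as a small sqlite3 database mapping page ids to the byte offset of the compressed block containing them. Table and column names below are invented for illustration, and an in-memory database stands in for the real `…index-on-page.sqlite3` file.

```python
import sqlite3

# Stand-in for tlwiki-...-index-on-page.sqlite3; schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_index (page_id INTEGER PRIMARY KEY, offset INTEGER)")
conn.executemany("INSERT INTO page_index VALUES (?, ?)",
                 [(1, 0), (500, 40960), (1200, 81920)])

# Find the block that may contain page 700: the last entry at or below it.
(offset,) = conn.execute(
    "SELECT offset FROM page_index WHERE page_id <= ? "
    "ORDER BY page_id DESC LIMIT 1", (700,)).fetchone()
```

Shipping the indexes as separate sqlite3 files means consumers that don't need random access can ignore them entirely, and the dump files themselves stay unchanged.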