Bug 27114 - do we really need to recombine stub and page file chunks into single huge files?
Status: RESOLVED FIXED
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All
OS: All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Ariel T. Glenn
Depends on:
Blocks: 27110

Reported: 2011-02-02 19:54 UTC by Ariel T. Glenn
Modified: 2011-09-18 07:01 UTC
CC: 3 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Ariel T. Glenn 2011-02-02 19:54:06 UTC
We run the English Wikipedia (enwiki) dumps by producing multiple stub and page text files instead of one huge stub file and one huge page/meta/history file.

Recombining these into one file takes a long time; for the stubs it's not horrible, as those files are smaller, but for the history files it is extremely time-intensive (2 weeks).  We could shorten that for the bz2 files by working on dbzip2, Brion's parallel bzip2 project from 2008, but we probably can't do anything to speed up the recombining of the 7z files.

Do we really need to provide one huge file for these things?  Example: the combined bz2 history file is around 300 GB, the combined 7z file around 32 GB, and it will only get worse.  Are several small files ok?  Maybe we can just skip this step.

This needs community discussion: are the whole files useful?  What happens if we wind up running 50 jobs and producing 50 pieces? Is that just too annoying? Or is it better, because people can process those 50 files in parallel at home? Would it be better if we serve up, say, no more than 20 separate pieces?  Do people care at all, as long as they get the data on a regular basis?
Comment 1 Adam Wight 2011-02-23 08:34:33 UTC
See my comment on bug #26499. You can simply "cat" bzip2 files together.
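A minimal Python sketch of that (assuming Python 3.3+, whose bz2 module handles multi-stream files; the chunk contents are invented):

    import bz2

    # Two pieces compressed independently, each its own bzip2 stream
    # (the file-level equivalent of `cat chunk1.bz2 chunk2.bz2 > all.bz2`).
    chunk1 = bz2.compress(b"<page>first piece</page>\n")
    chunk2 = bz2.compress(b"<page>second piece</page>\n")
    combined = chunk1 + chunk2

    # A multi-stream bzip2 file decompresses to the concatenated plaintext.
    assert bz2.decompress(combined) == (b"<page>first piece</page>\n"
                                        b"<page>second piece</page>\n")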
Comment 2 Ariel T. Glenn 2011-02-24 20:01:54 UTC
You can, but if you want the resulting file to have only one header and one footer, then you need to strip the headers and footers from the pieces, which means decompression and recompression.

If all we want is to provide the pieces (with all their headers and footers) in a single package for easy download, I'd rather give users a simple way to download all the pieces than keep essentially two copies of the data and therefore use twice the storage.
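For illustration, a small Python sketch (invented piece contents, Python 3.3+ bz2 module) of why plain concatenation keeps every piece's wrapper:

    import bz2

    # Hypothetical dump pieces, each a complete document with its own wrapper.
    piece1 = bz2.compress(b"<mediawiki>\n  <page>A</page>\n</mediawiki>\n")
    piece2 = bz2.compress(b"<mediawiki>\n  <page>B</page>\n</mediawiki>\n")

    # `cat`-ing the pieces decompresses fine, but the output contains two
    # <mediawiki> wrappers, so it is not one well-formed XML document.
    text = bz2.decompress(piece1 + piece2).decode()
    print(text.count("<mediawiki>"))   # -> 2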
Comment 3 Adam Wight 2011-02-24 22:27:31 UTC
Well, I happen to agree with you that multiple files are easier to deal with, but the trend seems to be towards the single, huge file.  Modern file transfer and storage make the two approaches close to equivalent.  I am in neither camp.

The header and footer could be created as isolated bz2 chunks at a cost of only a few bytes.  Then they would be easy to verify and strip back off without running a codec.  Unfortunately, PHP's bzflush() is a NOP and does not call the underlying bzlib flush, but you could close and reopen the file...
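A minimal Python sketch of that idea (not the PHP dump code itself; the XML snippets are placeholders): each part is written as its own bzip2 stream, so the wrapper can later be dropped by byte slicing alone, without recompressing the body.

    import bz2

    header = b'<mediawiki xmlns="...">\n'    # placeholder wrapper
    body   = b"  <page>...</page>\n" * 3     # placeholder page data
    footer = b"</mediawiki>\n"

    # Compress each part as an isolated bzip2 stream and remember its length.
    streams = [bz2.compress(part) for part in (header, body, footer)]
    h_len, b_len, f_len = (len(s) for s in streams)
    data = b"".join(streams)

    # The whole file still decompresses as one document...
    assert bz2.decompress(data) == header + body + footer

    # ...and stripping the wrapper is pure byte slicing, no codec involved.
    body_stream = data[h_len:h_len + b_len]
    assert bz2.decompress(body_stream) == body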

It seems valuable to preserve the metadata of each job output (see bug #26499), so assuming the pages are organized under a root job-segment element, there is really no header to strip off but the "<?xml version" cruft.

Here's an interesting, if irrelevant, recommendation for a new "xml fragment" representation:
    http://www.w3.org/TR/xml-fragment
Note also section C.3, where they discuss how fragments could be used to index into a huge document in order to minimize parsing.  (Yes, I am axe-grinding for bug #27618!)
Comment 4 Diederik van Liere 2011-02-28 23:14:52 UTC
My 2 cents:
I cannot think of a use case where a single file is preferred over multiple smaller files. If the argument is "I need to download x chunks and I don't want to", then I suggest writing a simple script that gets all the chunks. We could even provide such a script (see the sketch below) if that's really important.
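Such a script could be only a few lines. A hypothetical Python sketch (the URL pattern and piece count are invented placeholders, not the real dumps layout):

    #!/usr/bin/env python3
    """Fetch all pieces of a chunked dump (hypothetical URL pattern)."""
    import urllib.request

    BASE = "https://dumps.example.org/enwiki/20110201"
    NAME = "enwiki-20110201-pages-meta-history{n}.xml.bz2"

    for n in range(1, 28):                     # e.g. 27 history pieces
        filename = NAME.format(n=n)
        url = "{}/{}".format(BASE, filename)
        print("fetching", url)
        urllib.request.urlretrieve(url, filename)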
Comment 5 Ariel T. Glenn 2011-09-18 07:01:05 UTC
Well, I've been producing pieces without recombining for a while now, without complaints... silence = consent! Closing.
