Last modified: 2013-06-18 15:08:14 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and preserved for historical purposes. Logging in is not possible, and beyond displaying bug reports and their history, links may be broken. See T29618, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 27618 - Add title index to backup dumps
Status: RESOLVED FIXED
Product: Datasets
Classification: Unclassified
Component: General/Unknown (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: High enhancement (vote)
Target Milestone: ---
Assigned To: Ariel T. Glenn
Keywords: patch, patch-reviewed
Depends on:
Blocks:
Reported: 2011-02-21 18:52 UTC by Adam Wight
Modified: 2013-06-18 15:08 UTC (History)
7 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
ROUGH (1.60 KB, patch)
2011-03-18 06:45 UTC, Adam Wight

Description Adam Wight 2011-02-21 18:52:45 UTC
There are several readers available for MediaWiki xml.bz2 dumps; some read the native format directly, while others transform the data.

All suffer from the lack of an index into this data, which is a major barrier to development and adoption by users.

The simplest remedy would be to register a dump filter which creates a text file mapping article title -> byte offset.  If this is done during the backup process, there is almost no resource overhead.

I can write a patch if other developers agree this would be a worthwhile pursuit.
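The proposed filter can be sketched in a few lines. The following is a minimal Python illustration, not the actual patch: the function name `write_indexed_dump`, the one-bz2-stream-per-N-pages layout, and the `offset:title` index line format are all assumptions for the sake of example. The key point is that each stream is flushed to disk before the next begins, so `tell()` on the underlying file yields a valid offset for every page in that stream.

```python
import bz2

def write_indexed_dump(pages, dump_path, index_path, pages_per_stream=100):
    """Write pages as concatenated bz2 streams and record, for each page,
    the byte offset of the compressed stream that contains it."""
    with open(dump_path, "wb") as out, \
         open(index_path, "w", encoding="utf-8") as idx:
        for i in range(0, len(pages), pages_per_stream):
            offset = out.tell()          # start of this compressed stream
            comp = bz2.BZ2Compressor()   # fresh stream => seekable boundary
            for title, text in pages[i:i + pages_per_stream]:
                idx.write(f"{offset}:{title}\n")
                out.write(comp.compress(f"<page>{title}\n{text}</page>\n".encode()))
            out.write(comp.flush())      # complete the stream on disk
```

A reader can then seek to the recorded offset and decompress just that one stream to find the page, instead of decompressing the whole dump.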
Comment 1 Mark A. Hershberger 2011-02-22 17:59:03 UTC
(In reply to comment #0)
> The simplest remedy would be to register a dump filter which creates a text
> file mapping article title -> byte offset.  If this is done during the backup
> process, there is almost no resource overhead.
> 
> I can write a patch if other developers agree this would be a worthwhile
> pursuit.

I'm interested.  CCing Ariel for input and assigning to you.  Let's have a patch!
Comment 2 Ariel T. Glenn 2011-02-23 00:12:29 UTC
How will this work for runs that do parts in parallel?  I still don't know whether those pieces should be recombined later, but at present we are running on the assumption that they should be.  Not a big issue; you'll just need to write a little script to recalculate the byte offsets for the combined dump when that phase runs, keeping track of the bit alignment to get the page start byte in later pieces right.

This would be handy for a number of things actually, so I'd like to see it happen.
Comment 3 Adam Wight 2011-02-23 00:16:44 UTC
Interesting--
Also, the byte offsets are into the compressed data, of course (ftell(STDOUT)), and the boundaries between bz2 chunks also become very relevant.

Thanks, I'll have a patch for review this week!
Comment 4 Adam Wight 2011-03-18 06:45:44 UTC
Created attachment 8310 [details]
ROUGH

Not much to show yet, but in case someone wants to lend a hand...
My intention is that:
* each backup job records the arguments with which it was invoked
* an index entry is recorded for each page, giving its offset into the compressed data being generated

Problems:
1) there is no convention for saving to a second file stream (the index file)
2) the PHP bz2 library does not expose the libbz2.so "tell" function, nor could that function work without flushing buffers.  Perhaps the recorded offset can be addressed by bz2 chunk, then by uncompressed offset.
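The two-level addressing proposed in problem 2 — compressed offset of the containing chunk, then an offset within its decompressed contents — sidesteps the missing "tell" entirely. A reader-side sketch of what such an address would buy (the function name `read_at` is illustrative; it assumes each chunk is a complete bz2 stream):

```python
import bz2

def read_at(dump_path, stream_offset, uncompressed_offset, length):
    """Fetch `length` bytes of text addressed by the pair
    (byte offset of a bz2 stream, offset within its decompressed data)."""
    with open(dump_path, "rb") as f:
        f.seek(stream_offset)
        decomp = bz2.BZ2Decompressor()
        # The decompressor stops at the end of this stream; any following
        # streams' bytes land in decomp.unused_data and are ignored.
        data = decomp.decompress(f.read())
    return data[uncompressed_offset:uncompressed_offset + length]
```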
Comment 5 p858snake 2011-04-30 00:09:44 UTC
*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*
Comment 6 Sumana Harihareswara 2011-11-10 00:16:38 UTC
Adding the need-review keyword because my impression is that Adam wanted other developers to check his approach and give feedback.  Thanks for the patch, Adam!
Comment 7 Diederik van Liere 2011-11-13 01:14:58 UTC
I like this idea and I think two things need to be added to this patch:
1) Currently only the title is written to the index file; it should also include the namespace, or use the page_id instead of the title.
2) As Ariel mentioned, we are generating the dumps in multiple parts, so the index file should also keep track of which file the article can be found in.

Best,

Diederik
Comment 8 Ariel T. Glenn 2011-11-14 15:23:35 UTC
Out of curiosity, what do the various bz2 offline readers need: a byte offset, byte and bit, or bzip2 boundary and offset?

I expect the offline readers don't really use namespace or page ids for anything, so adding the full page title (i.e. namespace:title) should suffice.  If we're talking only about things in the main article space then it doesn't matter at all (but what about images?)...
Comment 9 Ángel González 2011-11-14 15:46:09 UTC
I used bzip2 boundary + title hash.
If your index is 315 MB, even after dropping the ability to perform random search, you will hardly be efficient on a consumer PC with maybe just 512 MB of RAM.
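The "bzip2 boundary + title hash" scheme trades the full title list for a fixed-size record per page, which is what makes it fit in a few hundred MB of RAM. A minimal sketch of that idea (function names are illustrative, and crc32 is just one possible hash; collisions mean a lookup returns candidate offsets that the reader must still verify by decompressing):

```python
import zlib
from bisect import bisect_left

def build_hash_index(entries):
    """entries: iterable of (title, stream_offset) pairs.  Returns a sorted
    list of (crc32(title), stream_offset) -- a small fixed-size record per
    page instead of storing every title string."""
    return sorted((zlib.crc32(t.encode("utf-8")), off) for t, off in entries)

def lookup(index, title):
    """Binary-search the sorted index; return all candidate stream offsets
    whose hash matches (collisions are possible, so verify after seeking)."""
    h = zlib.crc32(title.encode("utf-8"))
    i = bisect_left(index, (h, 0))
    out = []
    while i < len(index) and index[i][0] == h:
        out.append(index[i][1])
        i += 1
    return out
```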
Comment 10 Sumana Harihareswara 2011-11-21 10:33:47 UTC
Adam, do you now have enough code review to revise your patch against current MediaWiki trunk?  Thank you!
Comment 11 Adam Wight 2012-02-23 10:11:54 UTC
I like Ariel's solution in r107870 and r107839.  Are there plans to enable the multistream buildindex job on all dumps?
Comment 12 Ariel T. Glenn 2012-02-23 10:16:49 UTC
Yes but it's buggy. I need to get a bit of other crap off my plate and fix it first; then after a couple of stable runs I'll shove it out the door to the other projects.
Comment 13 Ariel T. Glenn 2013-06-04 06:32:00 UTC
This was enabled on all wikis quite some time back so closing :-)
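For reference, the multistream dumps that resulted from this work ship with a companion index file whose lines have the form `offset:page_id:title`, where the offset points at the bz2 stream containing the page. A small parser sketch (the function name is illustrative; note that titles may themselves contain colons, so the split must be bounded):

```python
def parse_multistream_index(lines):
    """Parse 'offset:page_id:title' lines from a -multistream index file.
    Split at most twice from the left, since titles can contain colons."""
    for line in lines:
        offset, page_id, title = line.rstrip("\n").split(":", 2)
        yield int(offset), int(page_id), title
```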


