Last modified: 2013-06-18 15:08:14 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and preserved for historical purposes. Logging in is not possible, and beyond displaying bug reports and their history, links may be broken. See T29618, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 27618 - Add title index to backup dumps
Status: RESOLVED FIXED
Product: Datasets
Classification: Unclassified
Component: General/Unknown (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: High enhancement (vote)
Target Milestone: ---
Assigned To: Ariel T. Glenn
Keywords: patch, patch-reviewed
Depends on:
Blocks:
Reported: 2011-02-21 18:52 UTC by Adam Wight
Modified: 2013-06-18 15:08 UTC (History)
7 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
ROUGH (1.60 KB, patch)
2011-03-18 06:45 UTC, Adam Wight

Description Adam Wight 2011-02-21 18:52:45 UTC
There are several readers available for MediaWiki xml.bz2 dumps; some read the native format directly, while others transform the data.

All suffer from the lack of an index into this data, which is a major barrier to development and adoption by users.

The simplest remedy would be to register a dump filter which creates a text file mapping article title -> byte offset.  If this is done during the backup process, there is almost no resource overhead.

I can write a patch if other developers agree this would be a worthwhile pursuit.
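The proposed filter can be sketched in a few lines. The following is a minimal Python illustration, not the actual patch: the function name `write_indexed_dump`, the one-bz2-stream-per-N-pages layout, and the `offset:title` index line format are all assumptions for the sake of example. The key point is that each stream is flushed to disk before the next begins, so `tell()` on the underlying file yields a valid offset for every page in that stream.

```python
import bz2

def write_indexed_dump(pages, dump_path, index_path, pages_per_stream=100):
    """Write pages as concatenated bz2 streams and record, for each page,
    the byte offset of the compressed stream that contains it."""
    with open(dump_path, "wb") as out, \
         open(index_path, "w", encoding="utf-8") as idx:
        for i in range(0, len(pages), pages_per_stream):
            offset = out.tell()          # start of this compressed stream
            comp = bz2.BZ2Compressor()   # fresh stream => seekable boundary
            for title, text in pages[i:i + pages_per_stream]:
                idx.write(f"{offset}:{title}\n")
                out.write(comp.compress(f"<page>{title}\n{text}</page>\n".encode()))
            out.write(comp.flush())      # complete the stream on disk
```

A reader can then seek to the recorded offset and decompress just that one stream to find the page, instead of decompressing the whole dump.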
Comment 1 Mark A. Hershberger 2011-02-22 17:59:03 UTC
(In reply to comment #0)
> The simplest remedy would be to register a dump filter which creates a text
> file mapping article title -> byte offset.  If this is done during the backup
> process, there is almost no resource overhead.
> 
> I can write a patch if other developers agree this would be a worthwhile
> pursuit.

I'm interested.  CCing Ariel for input and assigning to you.  Let's have a patch!
Comment 2 Ariel T. Glenn 2011-02-23 00:12:29 UTC
How will this work for runs that do parts in parallel?  I still don't know whether those pieces should be recombined later, but at present we are running on the assumption that they should be.  Not a big issue; you'll just need to write a little script to recalculate the byte offsets for the combined dump when that phase runs, keeping track of the bit alignment to get the page start byte in later pieces right.

This would be handy for a number of things actually, so I'd like to see it happen.
Comment 3 Adam Wight 2011-02-23 00:16:44 UTC
Interesting--
Also, the byte offsets are into the compressed data, of course (ftell(STDOUT)), and the boundaries between bz2 chunks also become very relevant.

Thanks, I'll have a patch for review this week!
Comment 4 Adam Wight 2011-03-18 06:45:44 UTC
Created attachment 8310 [details]
ROUGH

Not much to show yet, but in case someone wants to lend a hand...
My intention is that:
* each backup job records the arguments with which it was invoked
* an index entry is recorded for each page, giving its offset into the compressed data being generated

Problems:
1) there is no convention for saving to a second file stream (the index file)
2) the PHP bz2 library does not expose the libbz2.so "tell" function, nor could that function work without flushing buffers.  Perhaps the recorded offset can be addressed by bz2 chunk, then by uncompressed offset.
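The two-level addressing proposed in problem 2 — compressed offset of the containing chunk, then an offset within its decompressed contents — sidesteps the missing "tell" entirely. A reader-side sketch of what such an address would buy (the function name `read_at` is illustrative; it assumes each chunk is a complete bz2 stream):

```python
import bz2

def read_at(dump_path, stream_offset, uncompressed_offset, length):
    """Fetch `length` bytes of text addressed by the pair
    (byte offset of a bz2 stream, offset within its decompressed data)."""
    with open(dump_path, "rb") as f:
        f.seek(stream_offset)
        decomp = bz2.BZ2Decompressor()
        # The decompressor stops at the end of this stream; any following
        # streams' bytes land in decomp.unused_data and are ignored.
        data = decomp.decompress(f.read())
    return data[uncompressed_offset:uncompressed_offset + length]
```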
Comment 5 p858snake 2011-04-30 00:09:44 UTC
*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*
Comment 6 Sumana Harihareswara 2011-11-10 00:16:38 UTC
Adding the need-review keyword because my impression is that Adam wanted other developers to check his approach and give feedback.  Thanks for the patch, Adam!
Comment 7 Diederik van Liere 2011-11-13 01:14:58 UTC
I like this idea and I think two things need to be added to this patch:
1) Currently only the title is written to the index file; it should also include the namespace, or use the page_id instead of the title.
2) As Ariel mentioned, we are generating the dumps in multiple parts, so the index file should also keep track of which file the article can be found in.

Best,

Diederik
Comment 8 Ariel T. Glenn 2011-11-14 15:23:35 UTC
Out of curiosity, what do the various bz2 offline readers need: a byte offset, byte and bit, or bzip2 boundary and offset?

I expect the offline readers don't really use namespace or page ids for anything, so adding the full page title (i.e. namespace:title) should suffice.  If we're talking only about things in the main article space then it doesn't matter at all (but what about images?)...
Comment 9 Ángel González 2011-11-14 15:46:09 UTC
I used bzip2 boundary + title hash.
If your index is 315 MB, even after dropping the ability to perform random search, you will hardly be efficient on a consumer PC with maybe just 512 MB of RAM.
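The "bzip2 boundary + title hash" scheme trades the full title list for a fixed-size record per page, which is what makes it fit in a few hundred MB of RAM. A minimal sketch of that idea (function names are illustrative, and crc32 is just one possible hash; collisions mean a lookup returns candidate offsets that the reader must still verify by decompressing):

```python
import zlib
from bisect import bisect_left

def build_hash_index(entries):
    """entries: iterable of (title, stream_offset) pairs.  Returns a sorted
    list of (crc32(title), stream_offset) -- a small fixed-size record per
    page instead of storing every title string."""
    return sorted((zlib.crc32(t.encode("utf-8")), off) for t, off in entries)

def lookup(index, title):
    """Binary-search the sorted index; return all candidate stream offsets
    whose hash matches (collisions are possible, so verify after seeking)."""
    h = zlib.crc32(title.encode("utf-8"))
    i = bisect_left(index, (h, 0))
    out = []
    while i < len(index) and index[i][0] == h:
        out.append(index[i][1])
        i += 1
    return out
```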
Comment 10 Sumana Harihareswara 2011-11-21 10:33:47 UTC
Adam, do you now have enough code review to revise your patch against current MediaWiki trunk?  Thank you!
Comment 11 Adam Wight 2012-02-23 10:11:54 UTC
I like Ariel's solution in r107870 and r107839.  Are there plans to enable the multistream buildindex job on all dumps?
Comment 12 Ariel T. Glenn 2012-02-23 10:16:49 UTC
Yes but it's buggy. I need to get a bit of other crap off my plate and fix it first; then after a couple of stable runs I'll shove it out the door to the other projects.
Comment 13 Ariel T. Glenn 2013-06-04 06:32:00 UTC
This was enabled on all wikis quite some time back so closing :-)
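For reference, the multistream dumps that resulted from this work ship with a companion index file whose lines have the form `offset:page_id:title`, where the offset points at the bz2 stream containing the page. A small parser sketch (the function name is illustrative; note that titles may themselves contain colons, so the split must be bounded):

```python
def parse_multistream_index(lines):
    """Parse 'offset:page_id:title' lines from a -multistream index file.
    Split at most twice from the left, since titles can contain colons."""
    for line in lines:
        offset, page_id, title = line.rstrip("\n").split(":", 2)
        yield int(offset), int(page_id), title
```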


