Last modified: 2011-12-06 02:33:18 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T20919, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 18919 - Provide database dumps of just article namespace and/or remove project-space from "articles" dump
Status: NEW
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Ariel T. Glenn
Reported: 2009-05-25 19:13 UTC by Gurch
Modified: 2011-12-06 02:33 UTC (History)

Description Gurch 2009-05-25 19:13:20 UTC
At the moment I can download "pages-meta-current", a dump of all pages, or "pages-articles", which is articles, templates, image descriptions and "primary meta pages". The latter is nice if I want to redistribute Wikipedia's content, but if I'm just trying to gather some data about articles, and I don't want to try to download them all individually, I only need the articles.

Since for en.wikipedia the "pages-articles" dump contains 8559359 pages, and there are only 2892000 articles, I'm obviously getting a lot of stuff I don't actually need. Seems it would save GBs of bandwidth (and processing time for users) if there was just a dump of article text.
Comment 1 Brion Vibber 2009-05-26 22:53:07 UTC
I suspect that the amount of actual page content in the template, project, image, etc pages is much smaller than the content in the article pages, so it may be a much smaller difference in download size than it looks like from the page counts.

Tomasz, can you take a peek and see if it looks like it might be worth creating such dumps?
Comment 2 Gurch 2009-05-29 03:35:27 UTC
(In reply to comment #1)
> I suspect that the amount of actual page content in the template, project,
> image, etc pages is much smaller than the content in the article pages

On most wikis I suspect it is, but you should see some of the templates en.wikipedia (and possibly others) has come up with recently. :)
Comment 3 Alex Z. 2009-05-29 05:33:44 UTC
The dump includes all non-talk namespaces except user:

From the toolserver, for enwiki, mainspace makes up about 62% of the page content (SUM(page_len)) and 69% of the number of pages (including redirects).

So you could probably save about 35-40% by making a mainspace-only dump.
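The estimate above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in this report (the page and article counts from the description, the two percentages from comment 3); the variable names are made up for illustration. Note that page_len measures uncompressed wikitext, so the saving on a compressed download may differ.

```python
# Figures quoted in this bug report (enwiki, May 2009).
total_pages = 8_559_359          # pages in the "pages-articles" dump
articles = 2_892_000             # article count given in the description
mainspace_content_share = 0.62   # share of SUM(page_len), per comment 3
mainspace_page_share = 0.69      # share of pages incl. redirects, per comment 3

# Fraction of dumped pages that are counted as articles.
article_share = articles / total_pages

# Rough saving from dropping everything outside mainspace.
saving_by_content = 1 - mainspace_content_share
saving_by_pages = 1 - mainspace_page_share

print(f"articles / dumped pages: {article_share:.1%}")
print(f"saving by content:       {saving_by_content:.0%}")
print(f"saving by page count:    {saving_by_pages:.0%}")
```

The content-based figure is the one that falls inside the 35-40% range mentioned above.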
Comment 4 Gurch 2009-05-29 07:37:19 UTC
(In reply to comment #3)
> The dump includes all non-talk namespaces except user:

Ah... so by "primary meta-pages" it actually means "all project pages", which on en.wikipedia includes rather a lot of junk.

Given that, even with the extra stuff, the "pages-articles" dump is only meant for redistributors of projects' content, project-space should probably be removed. Category, template and portal are all needed because they're part of the actual content, but project isn't. That would probably account for a fair bit of the reduction in size.
Comment 5 Brion Vibber 2009-06-17 16:54:20 UTC
Project space includes licensing & credit information which shouldn't be removed. There's not really a good separation between "syndicate me" and "for internal use"...
Comment 6 Gurch 2009-06-20 23:19:27 UTC
(In reply to comment #5)
> Project space includes licensing & credit information which shouldn't be
> removed. There's not really a good separation between "syndicate me" and "for
> internal use"...

Too true... my efforts to get en.wikipedia's template and category namespaces properly organized into "part of content, intended for redistribution" and "for internal use only" were shot down by people who didn't really know what they were talking about. You are right that project-space is the same (though the licenses are only a couple of pages).

My original request for a dump of articles only stands if it's considered worthwhile, but I guess nothing better can be provided for redistributors with things as they are.
Comment 7 Alex Z. 2009-06-25 19:37:15 UTC
A database dump of only articles would likely be very useful for bot operators and people doing statistics. For people who don't plan on redistributing or republishing the content outside of Wikipedia, the actual content of templates, category pages, image pages, etc. may not be necessary for what they want to do, nor the license information (and couldn't we just put a link to http://wikimediafoundation.org/wiki/Terms_of_Use on the download page?)
Comment 8 Gurch 2009-10-19 01:14:25 UTC
Another thing to add to this request: when the end user is only interested in article content (which is most of the time, especially for 'internal' uses) having the non-article namespaces there as well means that this extra data also has to be parsed and then excluded, adding a significant amount of time to the already lengthy process of searching through one of these dumps. So the saving would not just be on bandwidth.
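The parse-then-exclude step described above might look roughly like the following. It is a minimal sketch, not how any existing dump tool works: it assumes the dump is MediaWiki export XML in a schema version where every <page> carries an <ns> element (older dumps lack it, and the namespace would have to be inferred from the title prefix instead), and the helper names local_name and iter_articles are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace-uri}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(source, wanted_ns="0"):
    """Yield (title, text) for each page whose <ns> equals wanted_ns.

    source is a filename or a binary file object holding export XML.
    Assumes each <page> has an <ns> child (not true of older dumps).
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {}
        # Take the first title, ns and text found under this page.
        for node in elem.iter():
            name = local_name(node.tag)
            if name in ("title", "ns", "text") and name not in fields:
                fields[name] = node.text or ""
        if fields.get("ns") == wanted_ns:
            yield fields.get("title", ""), fields.get("text", "")
        elem.clear()  # drop the finished page's children


# Tiny hand-written sample: one article, one template, one project page.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Earth</title><ns>0</ns><revision><text>Third planet.</text></revision></page>
  <page><title>Template:Infobox</title><ns>10</ns><revision><text>{{{1}}}</text></revision></page>
  <page><title>Wikipedia:About</title><ns>4</ns><revision><text>Project page.</text></revision></page>
</mediawiki>"""

articles = list(iter_articles(io.BytesIO(SAMPLE)))
print(articles)
```

Even with a streaming parser like this, every non-article page still has to be read and parsed before it can be thrown away, which is the cost this comment is pointing at.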
Comment 9 Mark A. Hershberger 2011-05-03 18:56:51 UTC
Giving dump bugs to Ariel.
