At the moment I can download "pages-meta-current", a dump of all pages, or "pages-articles", which contains articles, templates, image descriptions, and "primary meta pages". The latter is nice if I want to redistribute Wikipedia's content, but if I'm just trying to gather some data about articles, and I don't want to download them all individually, I only need the articles.
Since the en.wikipedia "pages-articles" dump contains 8,559,359 pages while there are only 2,892,000 articles (so articles are only about a third of the pages), I'm obviously getting a lot of stuff I don't actually need. It seems it would save GBs of bandwidth (and processing time for users) if there were a dump of just the article text.
I suspect that the amount of actual page content in the template, project, image, etc. pages is much smaller than the content in the article pages, so the difference in download size may be much smaller than the page counts suggest.
Tomasz, can you take a peek and see if it looks like it might be worth creating such dumps?
(In reply to comment #1)
> I suspect that the amount of actual page content in the template, project,
> image, etc pages is much smaller than the content in the article pages
On most wikis I suspect it is, but you should see some of the templates en.wikipedia (and possibly others) has come up with recently. :)
The dump includes all non-talk namespaces except user:
From the toolserver, for enwiki, mainspace makes up about 62% of the page content (SUM(page_len)) and 69% of the number of pages (including redirects).
So you could probably save about 35-40% by making a mainspace-only dump.
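(For anyone who wants to reproduce these figures: a minimal sketch in Python of the query against a toolserver replica, assuming the standard MediaWiki page table and PyMySQL; the replica host and credentials file are placeholders.)

import pymysql

conn = pymysql.connect(
    host="enwiki-p.db.toolserver.org",  # placeholder replica host
    db="enwiki_p",
    read_default_file="~/.my.cnf",      # placeholder credentials file
)
with conn.cursor() as cur:
    # Per-namespace page counts and content sizes, as cited above.
    cur.execute("SELECT page_namespace, COUNT(*), SUM(page_len) "
                "FROM page GROUP BY page_namespace")
    rows = cur.fetchall()

pages = sum(int(r[1]) for r in rows)
content = sum(int(r[2]) for r in rows)
main_pages = sum(int(r[1]) for r in rows if r[0] == 0)   # namespace 0 = mainspace
main_content = sum(int(r[2]) for r in rows if r[0] == 0)
print("mainspace: %.0f%% of pages, %.0f%% of content"
      % (100.0 * main_pages / pages, 100.0 * main_content / content))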
(In reply to comment #3)
> The dump includes all non-talk namespaces except user:
Ah... so by "primary meta-pages" it actually means "all project pages", which on en.wikipedia includes rather a lot of junk.
Given that, even with the extra stuff, the "pages-articles" dump is only meant for redistributors of the projects' content, project space should probably be removed. Category, template, and portal are all needed because they're part of the actual content, but project space isn't. That would probably account for a fair bit of the reduction in size.
Project space includes licensing & credit information which shouldn't be removed. There's not really a good separation between "syndicate me" and "for internal use"...
(In reply to comment #5)
> Project space includes licensing & credit information which shouldn't be
> removed. There's not really a good separation between "syndicate me" and "for
> internal use"...
Too true... my efforts to get en.wikipedia's template and category namespaces properly organized into "part of content, intended for redistribution" and "for internal use only" were shot down by people who didn't really know what they were talking about. You are right that project-space is the same (though the licenses are only a couple of pages).
My original request for an articles-only dump stands, if it's considered worthwhile, but I guess nothing better can be provided for redistributors as things stand.
A database dump of only articles would likely be very useful for bot operators and people doing statistics. For people who don't plan on redistributing or republishing the content outside of Wikipedia, the actual content of templates, category pages, image pages, etc. may not be necessary for what they want to do, nor is the license information (and couldn't we just put a link to http://wikimediafoundation.org/wiki/Terms_of_Use on the download page?).
Another thing to add to this request: when the end user is only interested in article content (which is most of the time, especially for 'internal' uses), having the non-article namespaces in the dump means that this extra data also has to be parsed and then excluded, adding a significant amount of time to the already lengthy process of searching through one of these dumps. So the saving would not just be on bandwidth.
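To illustrate the overhead, here is a minimal sketch of the filtering every such consumer currently has to do themselves: stream the pages-articles XML and keep only mainspace pages. This assumes an export schema with a per-page <ns> element (0.5 or later); the file name and schema URL are placeholders.

import bz2
import xml.etree.ElementTree as ET

SCHEMA = "{http://www.mediawiki.org/xml/export-0.5/}"  # assumed schema version

with bz2.open("enwiki-pages-articles.xml.bz2", "rb") as f:
    for _event, elem in ET.iterparse(f):
        if elem.tag == SCHEMA + "page":
            if elem.findtext(SCHEMA + "ns") == "0":  # namespace 0 = article
                title = elem.findtext(SCHEMA + "title")
                text = elem.findtext(SCHEMA + "revision/" + SCHEMA + "text")
                # ... process the article here ...
            elem.clear()  # drop the page's contents so memory stays roughly flat

Even when the non-article pages are skipped like this, the parser still has to read and tokenize all of them, which is where the wasted time goes.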
Giving dump bugs to Ariel.