Last modified: 2013-01-08 19:13:19 UTC

Wikimedia Bugzilla is closed!

Wikimedia has migrated from Bugzilla to Phabricator. Bug reports are now handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links may be broken. See T41667, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 39667 - Divide wikis into database lists by approximate size for performance engineering
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All
OS: All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: easy, performance, shell
Depends on:
Blocks: 15434 43741
Reported: 2012-08-26 15:18 UTC by MZMcBride
Modified: 2013-01-08 19:13 UTC
CC: 7 users

See Also:
Web browser: ---
Mobile Platform: ---


Attachments
Sizes! (13.20 KB, text/plain), 2012-11-16 02:06 UTC, Sam Reed (reedy)
ss_total_pages (13.74 KB, text/plain), 2012-11-16 16:54 UTC, Sam Reed (reedy)

Description MZMcBride 2012-08-26 15:18:48 UTC
There are a number of bugs in which small wikis are unfairly impacted by the performance constraints of large wikis. For example, many Special pages have been disabled across all Wikimedia wikis (cf. bug 15434). A small wiki such as ch.wikipedia.org, with 151 content pages, is treated the same as a wiki with over four million content pages. This doesn't make any sense.

This situation is unacceptable. A small wiki should not see a reduced user experience because of the existence of (almost entirely unrelated) wikis that have millions of content pages. We know the approximate sizes involved, so we should be able to safely and sanely tier these wikis (and then periodically check those tiers for accuracy and appropriateness). While we all wish that every wiki could be treated equally, it doesn't make any sense to punish small wikis indefinitely due to circumstances over which they have no control or involvement (i.e., an explosion in growth on a sibling project).

Some stats are available at <https://wiki.toolserver.org/view/Wiki_server_assignments>. There are other lists at Meta-Wiki, I believe. And I can query the *links tables for size if that's deemed necessary.

As far as I understand this, step one would be to make a set of groupings and then create individual wiki lists. Or perhaps just have a small.dblist or a large.dblist and add conditional statements based on that?

It looks like a small.dblist may already exist, even? Is that a list of small wikis (<https://noc.wikimedia.org/conf/small.dblist> doesn't load for me)?
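
To make that concrete, here is a rough sketch of the kind of conditional I have in mind. The file name small.dblist and the wiring below are placeholders, not how wmf-config actually does (or will do) this:

# Illustrative sketch only: gate expensive features on membership in a
# size-based dblist. Assumes this runs in CommonSettings.php (so $wgDBname
# is set) and that small.dblist holds one database name per line.
$smallWikis = array_filter( array_map( 'trim', file( 'small.dblist' ) ) );

if ( in_array( $wgDBname, $smallWikis ) ) {
	// Small wiki: these queries are cheap here, so leave query page updates enabled.
	$wgDisableQueryPageUpdate = array();
} else {
	// Large wiki: keep the expensive special pages disabled.
	$wgDisableQueryPageUpdate = array(
		'Ancientpages',
		'Deadendpages',
		'Mostlinked',
		'Wantedpages',
	);
}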
Comment 1 Alex Monk 2012-08-26 16:13:07 UTC
This looks useful: http://meta.wikimedia.org/wiki/List_of_Wikimedia_projects_by_size

Where should the line be between a large and a small wiki?
Comment 2 MZMcBride 2012-08-26 16:16:19 UTC
(In reply to comment #1)
> Where should the line be between a large and a small wiki?

Any number is going to be arbitrary. Maybe the actual first step is to write a maintenance script that can evaluate the size of the wikis in the cluster and then output a file based on their sizes (with a --size flag or something). So it'd be something like "php measureWikis.php --size=10000 > large.dblist"?

Measuring the number of content pages is probably easiest, as it's a stored value (in site_stats) and it gives a decent comparison between wikis (or it should in theory, at least).
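
Very rough sketch of what such a maintenance script could look like (the class name, the option handling, and the use of all.dblist are placeholders; nothing like this exists yet):

<?php
# measureWikis.php (hypothetical): print the wikis whose content page count
# is at least --size, one database name per line, suitable for a dblist.
require_once __DIR__ . '/Maintenance.php';

class MeasureWikis extends Maintenance {
	public function __construct() {
		parent::__construct();
		$this->mDescription = 'List wikis with at least --size content pages';
		$this->addOption( 'size', 'Minimum number of content pages', true, true );
	}

	public function execute() {
		$threshold = (int)$this->getOption( 'size' );
		// Assumes all.dblist (one database name per line) is readable locally.
		$wikis = array_filter( array_map( 'trim', file( 'all.dblist' ) ) );
		foreach ( $wikis as $wiki ) {
			// Connect to each wiki's slave and read the stored article count.
			$dbr = wfGetDB( DB_SLAVE, array(), $wiki );
			$articles = (int)$dbr->selectField( 'site_stats', 'ss_good_articles', '', __METHOD__ );
			if ( $articles >= $threshold ) {
				$this->output( "$wiki\n" );
			}
		}
	}
}

$maintClass = 'MeasureWikis';
require_once RUN_MAINTENANCE_IF_MAIN;

Then "php measureWikis.php --size=10000 > large.dblist" (or a lower threshold redirected into small.dblist) would produce the lists.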
Comment 3 Krinkle 2012-08-26 17:03:36 UTC
(In reply to comment #1)
> This looks useful:
> http://meta.wikimedia.org/wiki/List_of_Wikimedia_projects_by_size
> 
> Where should the line be between a large and a small wiki?

That Meta page is auto-generated from Special:Statistics, which in turn just reads the site_stats database table. So (not to be nitpicky), just to be clear: if and when we use a server-side script to create dblist groups[1] by page count, it can simply query the database directly; there is no need to go through that wiki page.

[1] https://gerrit.wikimedia.org/r/gitweb?p=operations/mediawiki-config.git;a=tree
Comment 4 Krinkle 2012-08-26 17:05:34 UTC
By the way, for technical aspects we should probably use total page count as opposed to article count. That way file pages, categories, and user pages are also taken into account, because as far as the database is concerned, pages and revisions are all the same whether or not they are articles.

Fortunately both total page count and article count are tracked in site_stats.
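
For illustration only (not code that exists anywhere), both counters can be read in one query on any given wiki:

// Illustrative sketch: read both counters from site_stats on the current wiki.
$dbr = wfGetDB( DB_SLAVE );
$row = $dbr->selectRow(
	'site_stats',
	array( 'ss_total_pages', 'ss_good_articles' ),
	'',
	__METHOD__
);
// $row->ss_total_pages counts every page (articles, files, categories,
// user pages, ...); $row->ss_good_articles counts content pages only.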
Comment 5 MZMcBride 2012-08-26 19:02:10 UTC
Marking this as easy. Writing a maintenance script to query the cluster and output the dblist(s) should be trivial.
Comment 6 MZMcBride 2012-09-25 03:11:09 UTC
# Disable all the query pages that take more than about 15 minutes to update
# wgDisableQueryPageUpdate @{
'wgDisableQueryPageUpdate' => array(
	'enwiki' => array(
		'Ancientpages',
		// 'CrossNamespaceLinks', # disabled by hashar - bug 16878
		'Deadendpages',
		'Lonelypages',
		'Mostcategories',
		'Mostlinked',
		'Mostlinkedcategories',
		'Mostlinkedtemplates',
		'Mostrevisions',
		'Fewestrevisions',
		'Uncategorizedcategories',
		'Wantedtemplates',
		'Wantedpages',
	),
	'default' => array(
		'Ancientpages',
		'Deadendpages',
		'Mostlinked',
		'Mostrevisions',
		'Wantedpages',
		'Fewestrevisions',
		// 'CrossNamespaceLinks', # disabled by hashar - bug 16878
	),
),
# @} end of wgDisableQueryPageUpdate

Source: <http://noc.wikimedia.org/conf/InitialiseSettings.php.txt>. Just pasting this here so I don't lose it.
Comment 7 Sam Reed (reedy) 2012-11-15 23:37:06 UTC
(In reply to comment #5)
> Marking this as easy. Writing a maintenance script to query the cluster and
> output the dblist(s) should be trivial.

I've actually just restored small.dblist from the history books.

It's VERY out of date:

https://gerrit.wikimedia.org/r/gitweb?p=operations/mediawiki-config.git;a=blob;f=small.dblist;h=5b0a78abf7fe1018576518382cae7a4f5342e422;hb=HEAD
Comment 8 MZMcBride 2012-11-16 00:20:07 UTC
(In reply to comment #7)
> (In reply to comment #5)
>> Marking this as easy. Writing a maintenance script to query the cluster and
>> output the dblist(s) should be trivial.
> 
> I've actually just restored small.dblist from the history books.

I'm not sure what value that provides other than nostalgia. It's a very out-of-date list that needs a maintenance script of some kind to re-generate (update) it. If you want to use "small.dblist" as the name of the small-databases list for nostalgia's sake (and continuity's sake as well, I suppose), that's fine, I guess. But we're really nowhere closer to resolving this bug.
Comment 9 Sam Reed (reedy) 2012-11-16 02:06:09 UTC
Created attachment 11366 [details]
Sizes!
Comment 10 Sam Reed (reedy) 2012-11-16 02:06:51 UTC
(In reply to comment #9)
> Created attachment 11366 [details]
> Sizes!

That's using the value of "SELECT ss_good_articles FROM site_stats".
Comment 11 Sam Reed (reedy) 2012-11-16 02:19:23 UTC
Basic script (work in progress!) to dump all the wikis sorted by ss_good_articles: https://gerrit.wikimedia.org/r/#/c/33694
Comment 12 Sam Reed (reedy) 2012-11-16 16:54:44 UTC
Created attachment 11379 [details]
ss_total_pages
Comment 13 Sam Reed (reedy) 2012-12-07 22:34:25 UTC
Updated https://gerrit.wikimedia.org/r/#/c/33694 moar and added the dblists to noc  conf etc
Comment 14 MZMcBride 2012-12-26 03:30:43 UTC
(In reply to comment #13)
> Updated https://gerrit.wikimedia.org/r/#/c/33694 moar and added the dblists
> to noc conf etc

This change has now been merged.

I wonder what more is needed to resolve this bug.
Comment 15 Andre Klapper 2013-01-04 15:08:36 UTC
(In reply to comment #13 by Reedy)
> Updated https://gerrit.wikimedia.org/r/#/c/33694 moar and added the dblists
> to noc  conf etc

Reedy: Any idea what else is needed to resolve this request completely?
Comment 16 Sam Reed (reedy) 2013-01-04 15:15:25 UTC
Personally (let Max chime in), I would've thought that this was enough.

We've now got a script to make size-related dblists (the parameters might want changing at a later date, but that's trivial). Those dblists have been created and are exposed via noc.

The next task is to potentially do something for bug 15434 using those new lists.
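
As a rough illustration of that follow-up (this assumes small.dblist ends up wired into $wgConf as a tag, which may or may not be how it actually gets done), the InitialiseSettings entry could look something like:

'wgDisableQueryPageUpdate' => array(
	'default' => array(
		'Ancientpages',
		'Deadendpages',
		'Mostlinked',
		'Mostrevisions',
		'Wantedpages',
		'Fewestrevisions',
	),
	// Wikis in small.dblist are cheap to update, so re-enable everything there.
	'small' => array(),
),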
Comment 17 MZMcBride 2013-01-06 02:31:57 UTC
Marking this bug resolved/fixed now that bug 43668 ("Re-enable disabled Special pages on small wikis (wikis in small.dblist)") exists. Thanks again, Reedy!
