Last modified: 2010-05-15 15:28:10 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T3058, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 1058 - dealing with large categories
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Categories
Version: 1.3.x
Hardware: All All
Importance: Normal normal with 4 votes
Target Milestone: ---
Assigned To: Jamesday
Keywords: testme
Depends on:
Blocks:
Reported: 2004-12-10 11:24 UTC by Hemanshu Desai
Modified: 2010-05-15 15:28 UTC (History)

Description Hemanshu Desai 2004-12-10 11:24:46 UTC
There is currently no limit on the number of pages that can be in a category.
There is also no limit on how big the category page itself can be. This leads to
attempts to list thousands of articles on one page.

I want to suggest that it be possible to use the first letter of the article
title as a variable. It would then be possible to edit the template and change it
from [[Category:XYZ]] to
[[Category:XYZ:{{VARIABLE_CONTAINING_FIRST_LETTER_OF_TITLE}}]]. This will
separate the category by first letter. Alternatively, this could be extended to
the first 2 letters. All that is required is to create the variable.

The choice of first letter or first few letters may be arbitrary but this could
be seen as a temporary fix until some other solution can be found (perhaps until
category pages can be rendered better?)

The advantage is that only the templates need to be changed once the variable is
available.
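
A minimal sketch of the bucketing idea in Python, for illustration only. The function name bucketed_category and the upper-casing of the prefix are assumptions; the report does not specify how the variable would be named or normalized.

```python
def bucketed_category(base: str, title: str, prefix_len: int = 1) -> str:
    """Return a sub-category name keyed on the first letters of a title.

    Hypothetical helper: mirrors [[Category:XYZ:{{first letter}}]].
    """
    prefix = title.strip()[:prefix_len].upper()
    # Fall back to the plain category if the title is empty.
    return f"{base}:{prefix}" if prefix else base


# Group a few example titles by their first-letter bucket.
titles = ["Amsterdam", "Athens", "Berlin"]
buckets = {}
for t in titles:
    buckets.setdefault(bucketed_category("XYZ", t), []).append(t)
```

With prefix_len=2 the same helper gives the "first 2 letters" variant, e.g. XYZ:BE for "Berlin".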
Comment 1 Jamesday 2004-12-10 13:52:09 UTC
This resulted from some discussion of how to handle large categories (which are
typically used from within templates). Largest categories in some big wikis are:

en
| Disambiguation                          | 16338 |
| GFDL_images                             | 13884 |
| Public_domain_images                    |  8181 |
| People_stubs                            |  6983 |
| Geography_stubs                         |  4771 |

de
| Mann                   | 22669 |
| Begriffsklärung       | 12601 |
| GFDL-Bild              |  5913 |
| Deutscher              |  5047 |
| Frau                   |  3632 |
| Autor                  |  3442 |

fr
| Wikipédia:ébauche        | 11090 |
| Homonymie                  |  4183 |
| Années                    |  2759 |

(from use jawiki; select cl_to, count(*) as c  from categorylinks group by cl_to
order by c desc limit 10;)

In addition, in conjunction with other load, this query took over 140 seconds on
Ariel:

SELECT DISTINCT cur_title,cur_namespace,cl_sortkey FROM cur,categorylinks WHERE
cl_to='Disambiguation' and cl_from=cur_id ORDER (truncated)

The de Mann query took over 160 seconds. It's typically less than this in run
time but it should be switched to run on slaves, ideally before the contemplated
switch to suda as master. LIMIT in this case may be ineffective if the order by
isn't in the order of the index fields being used in the where.
Comment 2 Jamesday 2004-12-17 04:29:31 UTC
The proposed variable won't be sufficient. Assuming first letter reduces the
size to 1/10th the current size, a 22,000 entry category will be reduced to an
uncomfortably large 2,200 entry category. de has a category "mann" for all men.
Commons has a category for GFDL images, likely to contain half or more of all
images on that site and perhaps more than half of all images on all projects.
For this reason, we need some better approach to large categories.

As a partial solution, variables for the first letter, 2 letters, 3 letters, 4
letters and so on up to 9 letters should help. Assuming only 500,000 GFDL images on
Commons and use of these categories, the first 4-5 letters would bring it down to a
tolerable number of members, but it's good to be prepared with the rest...:) Not
an ideal solution: better approaches are welcome.
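
The arithmetic behind these estimates can be sketched as follows. This assumes, as the comment does, that each extra letter cuts a category to roughly 1/10th of its size; the 200-member threshold for "tolerable" and both function names are assumptions made for illustration.

```python
def expected_bucket_size(members: int, prefix_len: int, fanout: int = 10) -> float:
    """Estimated members per sub-category if each letter divides by `fanout`."""
    return members / fanout ** prefix_len


def min_prefix_len(members: int, max_size: int = 200, fanout: int = 10,
                   max_len: int = 9):
    """Smallest prefix length whose estimated bucket size is <= max_size."""
    for n in range(max_len + 1):
        if expected_bucket_size(members, n, fanout) <= max_size:
            return n
    return None  # even 9 letters would not be enough under this model


print(expected_bucket_size(22_000, 1))   # 2200.0, the "mann" example
print(min_prefix_len(500_000))           # 4, for 500,000 GFDL images
```

Real titles are far from evenly distributed over first letters, so actual buckets would be lumpier than this estimate suggests.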

Restructuring the schema in MediaWiki 1.5 will help but probably not enough long
term because the cur and category tables are frequently edited, disabling the
query cache (not database page cache) for those tables. One approach might be to
have reporting copies of categories and cur updated periodically and report from
them. Since the query cache will then work for a while, total load would be
reduced. Allpages subpages would also benefit from this reporting table approach
for cur.

For those unfamiliar with it, the MySQL query cache does an exact string
comparison on a query and if that query has happened recently enough to be in
cache, the cached result is returned, instead of executing the query normally.
All cached results for a query are flushed whenever any table used in the query
is changed.
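
As a rough illustration of the behaviour just described, here is a toy model in Python. It is not MySQL's implementation: the class and method names are made up, and it ignores eviction of old entries, modelling only the exact-string match and the flush on table change.

```python
class ToyQueryCache:
    """Toy model: results keyed by exact query text, flushed per table."""

    def __init__(self):
        self._results = {}   # exact query text -> (tables used, result)
        self.executions = 0  # how many times a query really ran

    def select(self, sql, tables, run):
        hit = self._results.get(sql)
        if hit is not None:
            return hit[1]            # cached result, query is not executed
        result = run()
        self.executions += 1
        self._results[sql] = (frozenset(tables), result)
        return result

    def table_changed(self, table):
        # Any write to a table drops every cached query that used it.
        self._results = {q: v for q, v in self._results.items()
                         if table not in v[0]}


cache = ToyQueryCache()
q = "SELECT cl_from FROM categorylinks WHERE cl_to='Mann'"
cache.select(q, ["categorylinks"], lambda: [1, 2, 3])
cache.select(q, ["categorylinks"], lambda: [1, 2, 3])          # cache hit
cache.select(q.lower(), ["categorylinks"], lambda: [1, 2, 3])  # text differs: miss
cache.table_changed("categorylinks")                           # e.g. a page is edited
cache.select(q, ["categorylinks"], lambda: [1, 2, 3])          # runs again
```

In this model a frequently edited table keeps emptying the cache, which is why a periodically refreshed reporting copy would let cached results survive longer.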
Comment 3 Jamesday 2005-02-03 00:18:26 UTC
This is now substantially improved in MediaWiki 1.4 beta 5. Not 
closing yet because there are still cases where these pages take a 
long time and I'm observing the change to drop the cur_namespace 
index to see if that assists.
Comment 4 Antoine "hashar" Musso (WMF) 2005-08-17 19:16:56 UTC
Reassigning to jamesday as he is monitoring the issue :)
Comment 5 Chad H. 2009-02-01 07:51:31 UTC
Bumping this. I know cur is long since gone, so is this as big of a problem with the current schema?
Comment 6 Gurch 2009-02-01 17:01:10 UTC
(In reply to comment #5)
> Bumping this. I know cur is long since gone, so is this as big of a problem
> with the current schema?

Categories are only listed 200 pages at a time, so this bug presumably isn't a problem any more. (Though of course there are still many problems with large categories in general).
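
A sketch of how listing 200 pages at a time can stay cheap, using Python's sqlite3 module rather than MySQL. The column names come from the queries quoted in comment 1; the index name, the helper category_page, and the assumption that sort keys are unique are illustrative and not taken from MediaWiki's code.

```python
import sqlite3

PAGE_SIZE = 200


def category_page(conn, category, start_key="", page_size=PAGE_SIZE):
    """Return one page of members plus the sort key where the next page starts."""
    rows = conn.execute(
        "SELECT cl_sortkey, cl_from FROM categorylinks "
        "WHERE cl_to = ? AND cl_sortkey >= ? "
        "ORDER BY cl_sortkey LIMIT ?",
        (category, start_key, page_size + 1),  # one extra row reveals a next page
    ).fetchall()
    next_key = rows[page_size][0] if len(rows) > page_size else None
    return rows[:page_size], next_key


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_sortkey TEXT)")
# The ORDER BY follows the index order, so LIMIT can stop after a few rows.
conn.execute("CREATE INDEX cl_sortkey_idx ON categorylinks (cl_to, cl_sortkey)")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?, ?)",
    [(i, "Mann", f"Page{i:05d}") for i in range(450)],
)

first, next_key = category_page(conn, "Mann")
second, next_key2 = category_page(conn, "Mann", next_key)
third, end = category_page(conn, "Mann", next_key2)
```

Here the 450 example rows come back as pages of 200, 200 and 50, and each request only reads about one page of index entries regardless of the category's total size.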
Comment 7 Aryeh Gregor (not reading bugmail, please e-mail directly) 2009-02-16 15:39:33 UTC
This was long ago fixed by paginating category pages, AFAICT.
