Last modified: 2010-05-15 15:28:10 UTC
there is no limit currently to the number of pages that can be in a category. There is also no limit to how big the category page itself is. this leads to attempts to list thousands of articles on one page. I want to suggest that it be possible to use the first letter of the article title as a variable. So it would be possible to edit the template and change it from [[Category:XYZ]] to [[Category:XYZ: {{VARIABLE_CONTAINING_FIRST_LETTER_OF_TITLE}}]] This will separate the category by first letter... alternately this could be extended to first 2 letters.... all that is required is to create the variable. The choice of first letter or first few letters may be arbitrary but this could be seen as a temporary fix until some other solution can be found (perhaps until category pages can be rendered better?) The advantage is that only the templates need to be changed once the variable is available.
This resulted from some discussion of how to handle large categories (which are typically used from within templates). Largest categories in some big wikis are: en | Disambiguation | 16338 | | GFDL_images | 13884 | | Public_domain_images | 8181 | | People_stubs | 6983 | | Geography_stubs | 4771 | de | Mann | 22669 | | Begriffsklärung | 12601 | | GFDL-Bild | 5913 | | Deutscher | 5047 | | Frau | 3632 | | Autor | 3442 | fr | Wikipédia:ébauche | 11090 | | Homonymie | 4183 | | Années | 2759 | (from use jawiki; select cl_to, count(*) as c from categorylinks group by cl_to order by c desc limit 10;) In addition, in conjunction with other load, this query took over 140 seconds on Ariel: SELECT DISTINCT cur_title,cur_namespace,cl_sortkey FROMcur,categorylinks WHERE cl_to='Disambiguation' and cl_from=cur_id ORDER (truncated) The de Mann query took over 160 seconds. It's typically less than this in run time but it should be switched to run on slaves, ideally before the contemplated switch to suda as master. LIMIT in this case may be ineffective if the order by isn't in the order of the index fields being used in the where.
The proposed variable won't be sufficient. Assuming first letter reduces the size to 1/10th the current size, a 22,000 entry category will be reduced to an uncomforably large 2,200 entry category. de has a category "mann" for all men. Commons has a category for GFDL images, likely to contain half or more of all images on that site and perhaps more than half of all images on all projects. For this reason, we need some better approach to large categories. As a partial solution, variables for the first letter, 2 letters, 3 letters, 4 letters and so on to 9 letters should help. Assuming only 500,000 GFDL images on Commons and use of these categories the first 4-5 letters would take it to a tolerable range of members but it's good to be prepared with the rest...:) Not an ideal solution: better approaches are welcome. Restructuring the schema in MediaWiki 1.5 will help but probably not enough long term because the cur and category tables are frequently edited, disabling the query cache (not database page cache) for those tables. One approach might be to have reporting copies of categories and cur updated periodically and report from them. Since the query cache will then work for a while, total load would be reduced. Allpages subpages would also benefit from this reporting table approach for cur. For those unfamiliar with it, the MySQL query cache does an exact string comparison on a query and if that query has happened recently enough to be in cache, the cached result is returned, instead of executing the query normally. All cached results for a query are flushed whenever any table used in the query is changed.
This is now substantially improved in MediaWiki 1.4 beta 5. Not closing yet because there are still cases where these pages take a long time and I'm observing the change to drop the cur_namespace index to see if that assists.
Reassigning to jamesday as he is monitoring the issue :)
Bumping this. I know cur is long since gone, so is this as big of a problem with the current schema?
(In reply to comment #5) > Bumping this. I know cur is long since gone, so is this as big of a problem > with the current schema? Categories are only listed 200 pages at a time, so this bug presumably isn't a problem any more. (Though of course there are still many problems with large categories in general).
This was long ago fixed by paginating category pages, AFAICT.