Last modified: 2014-11-18 18:07:09 UTC
On large categories, both the articles and subcats are paged. This is fine, but the totals given for both articles and subcats are incorrect. Test case: Category:Business Expected result: "There are [correct total] subcategories to this category." Actual result: "There are 17 subcategories to this category." - it listed only the ones on the page. Quick fix: change text to say "on this page" - then it's true, but not so useful for the reader. Better fix: show the correct total for the entire category.
*** Bug 1336 has been marked as a duplicate of this bug. ***
*** Bug 1588 has been marked as a duplicate of this bug. ***
Hello! There are some differences now between the test wikis FiverAlpha and Nukawiki. See: http://jadesukka.homelinux.org:8180/betawiki/Betawiki:Categories/counters#bugzilla
http://test.leuksman.com/index.php?title=Category:Todo&from=Z looks pretty good compared to http://jadesukka.homelinux.org:8180/mediawiki15c/index.php?title=Luokka:Todo&from=Z
Regards, Reinhardt [[user:gangleri]]
*** Bug 3095 has been marked as a duplicate of this bug. ***
It appears this error has made its way to 1.5.0, as seen on my site here: http://www.marveldatabase.com/wiki/index.php/Category:Characters I definitely think that David Gerard's 'better fix' should be the course of action. Category pages should show the FULL count for all articles and subcategories of the category in question. Also, the category page only displays the first 200 total links, both subcategories and articles combined. (See my referenced link.) Shouldn't it show the first 200 articles and ALL (limited to x, for SQL query-time reasons) subcategories?
*** Bug 3714 has been marked as a duplicate of this bug. ***
Hope this gets fixed..... eventually...
Any word on this bug getting some attention before version 1.6?
*** Bug 4390 has been marked as a duplicate of this bug. ***
Five similar bugs! This is a very urgent request, Mr. Developer!
(changed URL to category currently with paging)
Is SELECT COUNT(*) good enough here? Or do we need something more efficient?
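For reference, the query under discussion, run against the categorylinks table where cl_to holds the category's title key, would be something like:

    SELECT COUNT(*) FROM categorylinks WHERE cl_to = 'Business';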
First, it should display a TOTAL of all pages within the category (and a total of all categories in the category). Second, it should give the user the option to display the first 20, 50, 100 or 200, with 'previous' and 'next' links: something closer to the 'recent changes' page setup. Currently, the way it works is fine, but it lacks a TOTAL sum; the only way to find out is to flip through page by page and count yourself. I also like the idea of adding the choice of limiting the count (20, 50, 100 or 200) like all the other special pages in MediaWiki. Just my thoughts...
(In reply to comment #12) > Is SELECT COUNT(*) good enough here? Or do we need something more efficient? I don't see why not.
The rumor was that SELECT COUNT(*) was too expensive for large categories (e.g. Category:GFDL on Commons, with tens of thousands of entries). I'm not sure how real that problem is.
I know this will probably require the creation of another unnecessary table, but why don't we have a table that keeps tallies of the articles in each category? This data could be cached server-side for larger wikis and cached client-side for smaller ones. Everyone is happy and we still get our count! When the cache expires, the query is run. Life is beautiful. :)
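As a rough sketch of the caching half of that idea, using MediaWiki's $wgMemc object (the cache key and the one-hour expiry here are made up for illustration):

    global $wgMemc;
    $key = wfMemcKey( 'category-count', $title->getDBkey() );
    $count = $wgMemc->get( $key );
    if ( $count === false ) {
        $dbr =& wfGetDB( DB_SLAVE );
        $count = $dbr->selectField( 'categorylinks', 'COUNT(*)',
            array( 'cl_to' => $title->getDBkey() ), __METHOD__ );
        $wgMemc->set( $key, $count, 3600 ); # re-run the query at most hourly
    }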
We could just remove the count from the UI. That's what we do for [[Special:Whatlinkshere]], which has precisely the same schema problem.
(In reply to comment #17) > We could just remove the count from the UI. That's what we do for > [[Special:Whatlinkshere]], which has precisely the same schema problem. Which is now being requested at bug 4394 (Provide a count in Special:Whatlinkshere)
Editors often count up how many articles are in maintenance categories. Removing the automatic count would make that a lot more work. And it's useful to readers to know what is below the virtual fold before they invest the time it takes to get there. Perhaps there is a clever query that would give an exact count for reasonably-sized categories and an "overflow" indication for anything larger than a certain threshold (possibly with a "click here to get an exact count" link). Though it might be easiest just to implement the simple COUNT solution at an off-peak time and check what kind of a performance hit there is, if any. That would be useful information in general, and would also help set the threshold optimally.
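One possible shape for the threshold idea, untested: cap the scan inside a derived table so the query never reads more than N+1 rows, and display "10,000+" when the cap is hit. The 10001 limit here is an arbitrary example:

    SELECT COUNT(*) AS capped_count
    FROM (SELECT 1 FROM categorylinks
          WHERE cl_to = 'Living_people'
          LIMIT 10001) AS capped;

If capped_count comes back as 10001, show the overflow indication instead of an exact number.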
Has anyone been able to do a successful test of what kind of performance issues this may create? Could we not extrapolate an estimate? Grab a page-render time for a <200-article COUNT and multiply by 50 and 500 to estimate for a maximum threshold of 10,000 or 100,000? (Accounting for other page-rendering factors, of course...)
*** Bug 6146 has been marked as a duplicate of this bug. ***
Design Suggestions:

1. MediaWiki:Categoryarticlecount and MediaWiki:subcategorycount should have params $1 to $7 at least, making messages possible like: "showing 200 articles from # 401 'cat ears' to # 600 'dog pounds' of a total of 782 articles in category 'cats and dogs'."
2. The count of articles/subcats per page (currently 200) should be settable in "user preferences" and selectable via paging option links (display previous/next 10|50|100|200|500|1000|...), as elsewhere.
3. The count of columns of articles/subcats per page (currently 3) should be settable in "user preferences" (as 1...10, e.g.) and possibly via paging option links.
4. Show a page directory, if there is more than one page, as with Special:Allpages, where pages are listed and linked to in a "$first_to_$last" fashion. If there are more page directory entries than fit on a page, make a directory of page directories. (Iterate until the page size is matched.)
5. With the page directory, MediaWiki:Categoryarticlepagingcount and MediaWiki:subcategorypagingcount should be introduced, having params $1 to $12 at least, allowing messages like: "showing 678 of 12,345 page directory entries in category 'cats & dogs', from # 679 to # 1356 at level 2, corresponding to 134,500 subcategories of 12,345,678,987, covering subcategory # 135,600 (lions) to subcategory # 1,234,200 (elephants)"
*** Bug 7950 has been marked as a duplicate of this bug. ***
There has been no developer commentary on this in nearly a year. Did we ever determine whether it was indeed too expensive? What about the new table/field to keep track, as discussed? Any thoughts, Brion / Rob / Tim?
Offhand, I might recommend changing the indexing for the categorylinks table to split up by namespace, _then_ sort key. This would allow counting, paging, extracting etc the various namespace groups separately. Thus category pages (subcats) and image pages (media) could be treated separately rather easily. Other namespaces would also end up being paged separately, though, so it might take some UI restructuring.
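Since categorylinks doesn't currently carry the member's namespace (that lives on the page row for cl_from), this would mean denormalizing it in. A rough, hypothetical sketch, with cl_namespace as an invented column name:

    -- hypothetical: cl_namespace does not exist in the current schema
    ALTER TABLE categorylinks ADD cl_namespace int NOT NULL default 0;
    ALTER TABLE categorylinks
      ADD INDEX cl_ns_sortkey (cl_to, cl_namespace, cl_sortkey);

    -- counting just subcategories then becomes a narrow index scan
    -- (14 = NS_CATEGORY)
    SELECT COUNT(*) FROM categorylinks
    WHERE cl_to = 'Business' AND cl_namespace = 14;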
*** Bug 8235 has been marked as a duplicate of this bug. ***
How is the 'members' data generated on [[Special:Categories]]? (Bug 146) It seems that each category has had a SELECT COUNT(*) run to build the info (comment #12), or is this data held somewhere else in a table? If this is indeed a COUNT(*), wouldn't this be several times more expensive (comment #15) than just counting the number of members when looking at just ONE category? Example: http://www.marveldatabase.com/index.php?title=Special:Categories&limit=500&offset=500 - look at the 'Characters' category with over 6,400 'members'. http://www.marveldatabase.com/Category:Characters - shows 'There are 200 articles in this category', when we know this is only a count of how many appear on this page. Could we somehow use this 'member' count to resolve this bug? (Don't forget this 'total count' should have separate counts for 'articles' and 'images' in a category.) If we can show "Character Galleries (1,095 members)", then we should be able to accurately state "There are 1,095 articles in this category"...
There is no COUNT(*). If there were, there would be an actual count. ;) As it is, we just iterate through the returned rows in PHP until there aren't any more, and then keep the number handy as the count of how many categories there are. Since we have a LIMIT clause, of course, this means the count only goes up to the limit. The question is whether a COUNT(*) would, in fact, be too slow. Just nobody has bothered to determine whether it is, apparently. Don't look at me, I'm not qualified. Maybe bug Domas or something.
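In sketch form, the current logic amounts to something like this (simplified; the real code builds the page listing as it goes):

    $dbr =& wfGetDB( DB_SLAVE );
    $res = $dbr->select( 'categorylinks', 'cl_from',
        array( 'cl_to' => $this->title->getDbKey() ), __METHOD__,
        array( 'ORDER BY' => 'cl_sortkey', 'LIMIT' => 200 ) );
    $count = 0;
    while ( $row = $dbr->fetchObject( $res ) ) {
        $count++; # can never exceed the LIMIT, hence the misleading totals
    }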
So what is the LIMIT? (10,000? 100,000?) And under what circumstances do we run through this 'iteration'? Furthermore, why don't we just create a field that holds this 'pseudo-count' with each category, which gets updated with each iteration? It would behave like a 'cache', such that the figure isn't up-to-the-second, but depending on how often the figures are updated, it might be worthwhile... So far, Brion's idea of first splitting up by namespace, then sort key, sounds best long-term, I think. It would substantially lighten the load for searching/counting/paging etc. Would this take much reworking of the categorylinks table?
(In reply to comment #29) > So what is the LIMIT? The paging limit is 200.
If somebody would like to test the often-discussed question "Is SELECT COUNT(*) too slow?", they might want to change this code in CategoryPage.php. The field is indexed, so it should be rather fast. Replace the following function (retrieved from MW 1.9.2):

    function getPagesSection() {
        $dbr =& wfGetDB( DB_SLAVE );
        $catCount = $dbr->selectField( 'categorylinks', 'count(*)',
            array( 'cl_to' => $this->title->getDbKey() ), 'getPagesSection' );
        $ti = htmlspecialchars( $this->title->getText() );
        $r = "<div id=\"mw-pages\">\n";
        $r .= '<h2>' . wfMsg( 'category_header', $ti ) . "</h2>\n";
        $r .= wfMsgExt( 'categoryarticlecount', array( 'parse' ), $catCount ); # was: count( $this->articles )
        $r .= $this->formatList( $this->articles, $this->articles_start_char );
        $r .= "\n</div>";
        return $r;
    }
Given this small change, it might be wise to add a switch to MediaWiki, like $wgFullCategoryArticleCount = true / false. This way, the best configuration can be set by the admin.
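For illustration, it could be wired up roughly like this (using the variable name suggested above, defaulting to false so large wikis are unaffected):

    # In LocalSettings.php:
    $wgFullCategoryArticleCount = true;

    # In CategoryPage.php's getPagesSection():
    global $wgFullCategoryArticleCount;
    if ( $wgFullCategoryArticleCount ) {
        $catCount = $dbr->selectField( 'categorylinks', 'count(*)',
            array( 'cl_to' => $this->title->getDbKey() ), __METHOD__ );
    } else {
        $catCount = count( $this->articles ); # old behaviour: per-page count
    }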
This might be exactly what we are looking for... Could someone please implement this? I also very much like the idea of a switch (off by default, for Wikipedia's sake, of course ;)
Rob/Brion/et al, Does Gunter's solution look viable?
It looks viable. See URL: http://en.wikipedia.org/wiki/Special:Categories?offset=Living_peopla It takes just a few seconds to load, and that's the worst possible case at present. Capping it at 250,000 should be fine if we're okay with a few category pages taking an extra second or two to load. This is assuming that the only issue is latency. Adding "shell" to flag this for sysadmins to look at for performance impact.
That page loads quite quickly for me, and it does not seem to be cached. It takes as long as it does because all 50 categories are being looked up; one can guesstimate that the performance hit for checking a single category should be a small fraction of that. Does the information presented in this bug, above, provide a feasible solution?
Is there any way to accurately time how long this takes from a raw query standpoint? http://en.wikipedia.org/w/index.php?title=Special:Categories&offset=Living_peopla&limit=1
Somewhere, probably not on this laptop, I've got an uncommitted working copy of MediaWiki which effectively tracks and updates simple tallies of the number of pages in each category, and the number of subcategories, in a relatively efficient manner, by hooking into the link update process. This would make such a count theoretically trivially inexpensive. I'll dig it out this weekend and whack a patch somewhere for further review and criticism.
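In sketch form, such hook-driven tally maintenance could be as cheap as a couple of indexed UPDATEs per links update. The table and column names below are placeholders for illustration, not necessarily what my working copy uses:

    $dbw =& wfGetDB( DB_MASTER );
    # $added / $removed: category names this page gained or lost in this edit
    foreach ( $added as $cat ) {
        $dbw->update( 'category',
            array( 'cat_pages = cat_pages + 1' ),
            array( 'cat_title' => $cat ), __METHOD__ );
    }
    foreach ( $removed as $cat ) {
        $dbw->update( 'category',
            array( 'cat_pages = cat_pages - 1' ),
            array( 'cat_title' => $cat ), __METHOD__ );
    }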
(In reply to comment #37) > Is there any way to accurately time how long this takes from a raw query > standpoint? > > http://en.wikipedia.org/w/index.php?title=Special:Categories&offset=Living_peopla&limit=1 It takes 4.42 seconds on the toolserver to run the COUNT(*) query on Category:Living people, with 221446 rows. A proper solution would be better.
Created attachment 4268 [details] Show true category counts

The count(*) solution seems to work fine for me: subsecond responses, as it hits the cl_sortkey index. I made the table have 1.1 million rows, and a category with 221,000 entries in it still ran in under a second. (This is for Postgres; MySQL also seemed to use the index, but was much slower about it. Maybe this is better on a newer version of MySQL? I only have an early 5.0 available at the moment.) Still, there is no reason not to make this an option for sites that don't have large categorylinks tables but want a true count. Patch attached: although probably deprecated by Rob's future one, maybe someone else can find a use for it until then.
Created attachment 4704 [details] First pass at a patch (not yet ready for commit) This patch adds a category table, as well as a Category class. The first application is this bug. I've gotten the basic functionality working, in that I can successfully view a category page and get the correct count. It needs considerable polishing, which I hope to put in tomorrow. Ideally I'll commit it to trunk in non-scary form tomorrow (i.e., a form that allows it to require no expensive operations on Wikimedia), and persuade Brion or someone to populate it properly at some point thereafter. However, I don't exactly trust my timetables. :) I've marked Greg's patch obsolete. It would be neat if we could tolerate such large scans, because the normalized solution is of course a lot nicer, but we can't, unfortunately. If Postgres can handle the scanning, it would be neat if it implemented the category table as a view, with the count columns running the queries transparently and ignoring manual updates.
Created attachment 4705 [details] Almost-complete patch

The patch is nearly complete. Basic features:

* Adds Category and CategoryList classes to represent categories themselves.
* Adds a category table, giving each category a name, ID, and counts of all members, subcats only, and files.
* Adds an updatelog table to track DB updates that can't be inferred from the schema, and uses it to track whether the category table has been populated.
* Adds a maintenance script to populate the category table efficiently. This script is careful to wait for slaves and should be safe to run on a live database. The maintenance script's includes file is called by update.php.
* Until the category table is populated, the patch handles weird category table rows gracefully. It detects whether they're obviously impossible, and if so, it outputs appropriate messages. Sample messages:
** "The following 200 pages are in this category, out of 556 total." (category table has sane entry)
** "The following 158 pages are in this category, out of 158 total." (category table is ignored, since this is the full set anyway)
** "The following 200 pages are in this category." (category table entry is impossible, like less than 200: no commitments as to total number)

I would like review on this before I commit it, which I would like to do by the end of next weekend at the latest. The one glaring blocker is that it totally ignores link updates: the only way to populate the category table is by running the maintenance script. Obviously it's useless until this is fixed. The other flaws that occurred to me I've written down as FIXMEs, some of which I just need to code a fix for and others of which I'd like comments on. Comments from anyone are appreciated, but especially Brion, if he has the time.
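For readers following along, a category table along the lines described might look roughly like this. This is guessed from the description above, not copied from the patch:

    CREATE TABLE category (
      cat_id int unsigned NOT NULL auto_increment,
      cat_title varchar(255) binary NOT NULL,
      -- counts are signed ints so that underflow shows up as a negative
      -- number instead of an error
      cat_pages int NOT NULL default 0,
      cat_subcats int NOT NULL default 0,
      cat_files int NOT NULL default 0,
      PRIMARY KEY (cat_id),
      UNIQUE INDEX cat_title (cat_title)
    );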
Hello, I'll probably test your patch, but I don't know when. Can the maintenance script be run from the web, or do I need console access? (In reply to comment #42) > * Adds a maintenance script to populate the category table efficiently. This > script is careful to wait for slaves and should be safe to run on a live > database. The maintenance script's includes file is called by update.php.
Console access is required for all maintenance scripts, but a PHP or Java web-based SSH/telnet client provided by your host will usually suffice.
Created attachment 4706 [details] Same as previous attachment, with all files actually included I forgot to svn add a few things, whoops.
Created attachment 4728 [details] Patch with all functionality, needs review Okay, it now updates correctly on LinksUpdate and page deletion. I think this is it, but I really really want review from Brion or Tim before I check it in, since otherwise it will probably get reverted.
Looks good (I spent about 10 minutes reading through the code, not a substitute for thorough testing). The only immediate issue I can see is that it's not obvious whether you've considered this scenario: The schema change is applied, and a category which does not have an already-populated row in the category table is deleted from a page. If you haven't considered that, you'd get a PHP error for trying to subtract 1 from an unsigned field with value 0. Maybe I didn't pick up on that logic in your code. Well done, by the way. It's great to see this finally fixed up.
(In reply to comment #47) > The schema change is applied, and a category which does not have an > already-populated row in the category table is deleted from a page. If you > haven't considered that, you'd get a PHP error for trying to subtract 1 from a > signed field with value 0. > > Maybe I didn't pick up on that logic in your code. I explicitly noted that in a comment: + # ... If [a row for] one of the "deleted" [categories] didn't + # exist, we might legitimately not create it, but it's simpler to just + # create it and then give it a negative value, since the value is bogus + # anyway. You may have missed that the fields are actually signed, so you'll just get -1 with no errors, since that's a perfectly valid value for a signed int. I made it signed so that underflow errors are more obvious (as the comment in tables.sql says). I actually thought of a better way to do the update, in a way that doesn't require explicit LOCK IN SHARE MODE, since it's all one query (but no subqueries needed either). I'll change it to use that and then just check it in. Hopefully any errors that are spotted are minor enough to just be fixed on the spot instead of having to back it out.
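Something along these lines, sketched here for illustration only (the committed version may differ): MySQL's INSERT ... ON DUPLICATE KEY UPDATE creates the missing row and applies the (possibly negative) adjustment in a single statement, with no explicit locking and no subqueries:

    INSERT INTO category (cat_title, cat_pages)
    VALUES ('Some_category', -1)
    ON DUPLICATE KEY UPDATE cat_pages = cat_pages - 1;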
Patch committed in r32085.