Last modified: 2014-11-18 18:07:09 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T3212, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 1212 - On paged categories, article totals and subcat totals are incorrect


Summary:	On paged categories, article totals and subcat totals are incorrect

Status:	RESOLVED FIXED

Product:	MediaWiki
Classification:	Unclassified
Component:	Categories (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal normal with 19 votes (vote)
Target Milestone:	---
Assigned To:	Aryeh Gregor (not reading bugmail, please e-mail directly)

URL:	http://en.wikipedia.org/wiki/Special:...
Whiteboard:
Keywords:	patch-need-review

Duplicates:	1336 1588 3095 3714 4390 6146 7950 8235 (view as bug list)
Depends on:
Blocks:	13326 11132
	Show dependency tree / graph

Reported:	2004-12-27 17:11 UTC by David Gerard
Modified:	2014-11-18 18:07 UTC (History)
CC List:	16 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Show true category counts (2.71 KB, patch) 2007-10-18 14:31 UTC, Greg Sabino Mullane	Details
First pass at a patch (not yet ready for commit) (12.55 KB, patch) 2008-03-09 04:31 UTC, Aryeh Gregor (not reading bugmail, please e-mail directly)	Details
Almost-complete patch (22.33 KB, patch) 2008-03-10 01:49 UTC, Aryeh Gregor (not reading bugmail, please e-mail directly)	Details
Same as previous attachment, with all files actually included (27.09 KB, patch) 2008-03-10 23:03 UTC, Aryeh Gregor (not reading bugmail, please e-mail directly)	Details
Patch with all functionality, needs review (33.03 KB, patch) 2008-03-16 19:26 UTC, Aryeh Gregor (not reading bugmail, please e-mail directly)	Details
Show Obsolete (4) Add an attachment (proposed patch, testcase, etc.)

Description David Gerard 2004-12-27 17:11:31 UTC

On large categories, both the articles and subcats are paged. This is fine, but
the totals given for both articles and subcats are incorrect.

Test case: Category:Business

Expected result: "There are [correct total] subcategories to this category."

Actual result: "There are 17 subcategories to this category." - it listed only
the ones on the page.

Quick fix: change text to say "on this page" - then it's true, but not so useful
for the reader.
Better fix: show the correct total for the entire category.

Comment 1 Brion Vibber 2005-01-29 08:12:00 UTC

*** Bug 1336 has been marked as a duplicate of this bug. ***

Comment 2 Brion Vibber 2005-02-25 08:01:08 UTC

*** Bug 1588 has been marked as a duplicate of this bug. ***

Comment 3 lɛʁi לערי ריינהארט 2005-06-14 23:04:31 UTC

Hallo!

There are some differences now between the test wikis FiverAlpha and Nukawiki:

See:
http://jadesukka.homelinux.org:8180/betawiki/Betawiki:Categories/counters#bugzilla

http://test.leuksman.com/index.php?title=Category:Todo&from=Z
looks pretty good compared to
http://jadesukka.homelinux.org:8180/mediawiki15c/index.php?title=Luokka:Todo&from=Z

Regards Reinhardt [[user:gangleri]]

Comment 4 David Gerard 2005-09-09 05:16:58 UTC

*** Bug 3095 has been marked as a duplicate of this bug. ***

Comment 5 Jamie Hari 2005-10-12 05:39:35 UTC

It appears this error has made its way to 1.5.0 as seen on my site here: 
http://www.marveldatabase.com/wiki/index.php/Category:Characters

I definitely think that David Gerard's 'better fix' should be the course of 
action. Category pages should show the FULL count for all articles and sub-
categories of the category in question.

Also, the category page only displays the first 200 total links, both sub-
categories and articles combined. (See my referenced link) Shouldn't it show the 
first 200 articles and ALL (limited to x, for SQL query-time reasons) sub-
categories?

Comment 6 lɛʁi לערי ריינהארט 2005-10-15 13:11:47 UTC

*** Bug 3714 has been marked as a duplicate of this bug. ***

Comment 7 arenalpha 2005-10-15 14:51:05 UTC

Hope this gets fixed..... eventually...

Comment 8 Jamie Hari 2005-12-19 19:50:30 UTC

Any word on this bug getting some attention before version 1.6?

Comment 9 Rob Church 2005-12-26 18:20:23 UTC

*** Bug 4390 has been marked as a duplicate of this bug. ***

Comment 10 Borgx 2005-12-27 00:10:45 UTC

Five similar bugs!. This is a very urgent request Mr.Developer!

Comment 11 Zigger 2005-12-27 14:30:54 UTC

(changed URL to category currently with paging)

Comment 12 Tim Starling 2006-01-06 13:24:30 UTC

Is SELECT COUNT(*) good enough here? Or do we need something more efficient?

Comment 13 Jamie Hari 2006-01-06 14:36:39 UTC

First, it should display a TOTAL of all pages within the category (and a total of 
all categories in the category). Second, it should give the user the option to 
display the first 20, 50, 100 or 200 and have a 'previous' and 'next' feature. 
Something closer to the 'recent changes page' setup... 

Currently, the way it works is fine, but it lacks a TOTAL sum, the only way to 
find out is to flip through page by page and count yourself. I also like the idea 
of adding the choice of limiting the count (20, 50, 100 or 200) like all the 
other special pages in MediaWiki.

Just my thoughts...

Comment 14 Rob Church 2006-01-06 15:50:58 UTC

(In reply to comment #12)
> Is SELECT COUNT(*) good enough here? Or do we need something more efficient?

I don't see why not.

Comment 15 Brion Vibber 2006-01-06 20:37:31 UTC

The rumor was that SELECT COUNT(*) was too expensive for large categories (eg 
Category:GFDL on Commons, with tens of thousands of entries). I'm not sure how real 
that problem is.

Comment 16 Jamie Hari 2006-01-06 23:38:05 UTC

I know how this will probably require the creation of another unessecary table, 
but why don't we have a table that keeps the tallies of articles in each  
category... This data could be cached server-side for larger wikis and cached 
client-side for smaller ones. Everyone is happy and we still get our count!

When the cache expires, the query is called. Life is beautiful.   :)

Comment 17 Tim Starling 2006-01-07 00:14:48 UTC

We could just remove the count from the UI. That's what we do for
[[Special:Whatlinkshere]], which has precisely the same schema problem.

Comment 18 Borgx 2006-01-07 05:52:18 UTC

(In reply to comment #17)
> We could just remove the count from the UI. That's what we do for
> [[Special:Whatlinkshere]], which has precisely the same schema problem.


Which is now being requested at bug 4394 (Provide a count in Special:Whatlinkshere)

Comment 19 Christopher Beland 2006-01-08 01:00:02 UTC

Editors often count up how many articles are in maintenance categories. 
Removing the automatic count would make that a lot more work.  And it's useful
to readers to know what is below the virtual fold before they invest the time it
takes to get there.

Perhaps there is a clever query that would give an exact count for
reasonably-sized categories and an "overflow" indication for anything larger
than a certain threshold (possibly with a "click here to get an exact count"
link).  Though it might be easiest just to implement the simply COUNT solution
at an off-peak time and check to see what kind of a performance hit there is, if
any.  That would be useful information in general, and would also help set the
threshold optimally.

Comment 20 Jamie Hari 2006-02-04 01:43:35 UTC

Has anyone been able to do a successful test on what kind of performance issues 
this may create? Could we not extrapolate an estimate? Grab a page-render time for 
a <200 article COUNT and multiply by 50 and 500 to figure for a maximum threshold 
of 10,000 or 100,000? (Accounting for other page-rendering factors, of course...)

Comment 21 Rob Church 2006-05-30 21:27:41 UTC

*** Bug 6146 has been marked as a duplicate of this bug. ***

Comment 22 Purodha Blissenbach 2006-05-31 05:31:00 UTC

Design Suggestions:
1. MediaWiki:Categoryarticlecount and MediaWiki:subcategorycount shoud have params $1 to $7 at least making 
messages possible like: "showing 200 articles from # 401 'cat ears' to # 600 'dog pounds' of a total of 782 
articles in category 'cats and dogs'."

2. Count of articles/subcats per page (currently 200) to be installable in "user preferences" and selectable 
via paging option links (display previous/next 10|50|100|200|500|1000|....) as elsewhere.

3. Count of columns of articles/subcats per page (currently 3) to be set in "user preferences" (as 1...10, e.g
.) and possibly via paging option links.

4. Show a page directoy, if there is more than one page, like with Special:Allpages, where pages are listed and 
linked to in a "$first_to_$last" fashion. If there are more page diretory entries than fit on a page, make a 
directory of page directories. (Iterate til page size is matched)

5. With the page directory, MediaWiki:Categoryarticlepagingcount and MediaWiki:subcategorypagingcount should  
be introduced having params $1 to $12 at least, allowing messages like: "showing 678 of 12.345 page directory 
entries in category 'cats & dogs', from # 679 to # 1356 at level 2, corresponding to 134.500 subcategories of 
12.345.678.987, covering subcategory # 135.600 (lions) to subcategory # 1.234.200 (eliphants)"

Comment 23 Brion Vibber 2006-11-16 15:58:09 UTC

*** Bug 7950 has been marked as a duplicate of this bug. ***

Comment 24 Jamie Hari 2006-11-24 02:18:12 UTC

There has been no developer commentary on this in nearly a year. Did we ever 
determine if it was indeed too expensive?

What about the new table field to keep track as discussed. 

Any thoughts Brion / Rob / Tim?

Comment 25 Brion Vibber 2006-11-29 22:32:38 UTC

Offhand, I might recommend changing the indexing for the categorylinks table to split up by namespace, _then_ sort key.

This would allow counting, paging, extracting etc the various namespace groups separately.
Thus category pages (subcats) and image pages (media) could be treated separately rather easily.
Other namespaces would also end up being paged separately, though, so it might take some UI
restructuring.

Comment 26 Raimond Spekking 2006-12-12 15:59:39 UTC

*** Bug 8235 has been marked as a duplicate of this bug. ***

Comment 27 Jamie Hari 2006-12-31 05:34:16 UTC

How is the 'members' data generated on [[Special:Categories]]? (Bug 146)

It seems that each category has had a SELECT COUNT(*) to build the info (comment
#12) or is this data held somewhere else in a table?

If this is indeed a COUNT(*), wouldn't this be several times more expensive
(comment #15) than just counting the number of members when looking at just ONE
category?  

Example:
http://www.marveldatabase.com/index.php?title=Special:Categories&limit=500&offset=500
 - Look at the 'Characters' category with over 6,400 'members'.

http://www.marveldatabase.com/Category:Characters  - Shows 'There are 200
articles in this category', when we know this is only a count of how many appear
on this page.

Could we somehow use this 'member' count to resolve this bug? 
(Don't forget this 'total count' should have a separate count for each
'articles' and 'images' in a category.)

If we can show: "Character Galleries ‎(1,095 members)", then we should be able
to accurately state: "There are 1,095 articles in this category"...

Comment 28 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-12-31 17:42:00 UTC

There is no COUNT(*).  If there were, there would be an actual count.  ;)  As it
is, we just iterate through the returned rows in PHP until there aren't any
more, and then keep the number handy as the count of how many categories there
are.  Since we have a LIMIT clause, of course, this means the count only goes up
to the limit.

The question is whether a COUNT(*) would, in fact, be too slow.  Just nobody has
bothered to determine whether it is, apparently.  Don't look at me, I'm not
qualified.  Maybe bug Domas or something.

Comment 29 Jamie Hari 2006-12-31 19:17:36 UTC

So what is the LIMIT? (10,000/100,000?) And upon what circumstance do we run
through this 'iteration'?

Furthermore, why don't we just create a field that holds this 'pseudo-count'
with each category, which gets updated with each iteration. It will behave like
a 'cache' such that the figure isn't up-to-the-second, but depending on how
often the figures are updated, it might be worthwhile...

So far, Brions idea of first splitting up by namespace, then sort key sounds
best, long-term, I think. It would substantially lighten the load for
search/counting/paging etc.

Would this take much re-working of the categorylinks table?

Comment 30 Rob Church 2006-12-31 19:27:39 UTC

(In reply to comment #29)
> So what is the LIMIT?

The paging limit is 200.

Comment 31 Gunter Schmidt 2007-02-25 21:03:33 UTC

If somebody would like to try the often diskussed question: "Is select count(*) too slow?", he might want to change this code in 
CategoryPage.php. The field is indexed, so it should be rather fast.

Replace the following function (retrieved from MW 1.9.2):

function getPagesSection() {
  $dbr =& wfGetDB( DB_SLAVE );
  $catCount = $dbr->selectField( 'categorylinks', 'count(*)', array('cl_to' => $this->title->getDbKey() ), 'getPagesSection');
  $ti = htmlspecialchars( $this->title->getText() );
  $r = "<div id=\"mw-pages\">\n";
  $r .= '<h2>' . wfMsg( 'category_header', $ti ) . "</h2>\n";
  $r .= wfMsgExt( 'categoryarticlecount', array( 'parse' ), $catCount );#count( $this->articles) );
  $r .= $this->formatList( $this->articles, $this->articles_start_char );
  $r .= "\n</div>";
  return $r;
}

Comment 32 Gunter Schmidt 2007-02-25 21:10:40 UTC

Given this small change, it might be wise to add a switch to MediaWiki, like $wgFullCategoryArticleCount = true / false. This way, 
the best configuration can be set by the admin.

Comment 33 Jamie Hari 2007-03-17 03:37:50 UTC

This might be exactly what we are looking for...
Could some one please implement this?

I very much also like the idea of a switch (off by default for Wikipedia's sake, of course ;)

Comment 34 Jamie Hari 2007-04-30 03:43:54 UTC

Rob/Brion/et al,

Does Gunter's solution look viable?

Comment 35 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-05-10 17:24:39 UTC

It looks viable.  See URL:

http://en.wikipedia.org/wiki/Special:Categories?offset=Living_peopla

It takes just a few seconds to load, and that's the worst possible case at
present.  Capping it at 250,000 should be fine if we're okay with a few category
pages taking an extra second or two to load.  This is assuming that the only
issue is latency.

Adding "shell" to flag this for sysadmins to look at for performance impact.

Comment 36 Jamie Hari 2007-09-18 22:57:29 UTC

That page seems to load quite quickly for me and it seems to be not cached. It is taking as long as it does for all 50 categories to be looked up. One can guesstimate the performance hit to check 1 category should be fractional, comparatively.

Does the information present in this bug, above, possibly provide a feasible solution?

Comment 37 Jamie Hari 2007-09-18 22:59:06 UTC

Is there any way to accurately time how long this takes from a raw query standpoint?

http://en.wikipedia.org/w/index.php?title=Special:Categories&offset=Living_peopla&limit=1

Comment 38 Rob Church 2007-09-18 23:03:00 UTC

Somewhere, probably not on this laptop, I've got an uncommitted working copy of MediaWiki which effectively tracks and updates simple tallies of the number of pages in each category, and the number of subcategories, in a relatively efficient manner, by hooking into the link update process. This would make such a count theoretically trivially inexpensive.

I'll dig it out this weekend and whack a patch somewhere for further review and criticism.

Comment 39 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-09-18 23:11:01 UTC

(In reply to comment #37)
> Is there any way to accurately time how long this takes from a raw query
> standpoint?
> 
> http://en.wikipedia.org/w/index.php?title=Special:Categories&offset=Living_peopla&limit=1

It takes 4.42 seconds on the toolserver to run the COUNT(*) query on Category:Living people, with 221446 rows.  A proper solution would be better.

Comment 40 Greg Sabino Mullane 2007-10-18 14:31:07 UTC

Created attachment 4268 [details]
Show true category counts

The count(*) solution seems to work fine for me - subsecond responses as it hits the cl_sortkey index. I made the table have 1.1 million rows, and a category with 221,000 entries in it still ran in under a second. (This is for Postgres - MySQL also seemed to use the index, but was much slower about it - maybe this is better on a newer version of MySQL? I only have an early 5.0 available at the moment). Still, no reason not to make this an option for sites that don't have large categorylinks tables, but want a true count. Patch attached: although probably deprecated by Rob's future one, maybe someone else can find use of it until then.

Comment 41 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-03-09 04:31:14 UTC

Created attachment 4704 [details]
First pass at a patch (not yet ready for commit)

This patch adds a category table, as well as a Category class.  The first application is this bug.  I've gotten the basic functionality working, in that I can successfully view a category page and get the correct count.  It needs considerable polishing, which I hope to put in tomorrow.  Ideally I'll commit it to trunk in non-scary form tomorrow (i.e., a form that allows it to require no expensive operations on Wikimedia), and persuade Brion or someone to populate it properly at some point thereafter.  However, I don't exactly trust my timetables.  :)

I've marked Greg's patch obsolete.  It would be neat if we could tolerate such large scans, because the normalized solution is of course a lot nicer, but we can't, unfortunately.  If Postgres can handle the scanning, it would be neat if it implemented the category table as a view, with the count columns running the queries transparently and ignoring manual updates.

Comment 42 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-03-10 01:49:41 UTC

Created attachment 4705 [details]
Almost-complete patch

The patch is nearly complete.  Basic features:

* Adds Category and CategoryList classes to represent categories themselves.
* Adds a category table, giving each category a name, ID, and counts of all members, subcats only, and files.
* Adds an updatelog table to track DB updates that can't be inferred from the schema, and uses it to track whether the category table has been populated.
* Adds a maintenance script to populate the category table efficiently.  This script is careful to wait for slaves and should be safe to run on a live database.  The maintenance script's includes file is called by update.php.
* Until the category table is populated, the patch handles weird category table rows gracefully.  It detects whether they're obviously impossible, and if so, it outputs appropriate messages.  Sample messages:
** "The following 200 pages are in this category, out of 556 total." (category table has sane entry)
** "The following 158 pages are in this category, out of 158 total." (category table is ignored, since this is the full set anyway)
** "The following 200 pages are in this category." (category table entry is impossible, like less than 200: no commitments as to total number)

I would like review on this before I commit it, which I would like to do by the end of next weekend at the latest.  The one major glaring blocker omission is that it totally ignores link updates: the only way to populate the category table is by running the maintenance script.  Obviously it's useless until this is fixed.  The other flaws that occurred to me I've written down as FIXMEs, some of which I just need to code a fix for and others of which I'd like comments on.  Comments from anyone are appreciated, but especially Brion, if he has the time.

Comment 43 iubito (Sylvain Machefert) 2008-03-10 06:07:56 UTC

Hello,

I'll probably test your patch, but don't know when.

Can the maintenance be run from the web, or do I need console access ?

(In reply to comment #42)
> * Adds a maintenance script to populate the category table efficiently.  This
> script is careful to wait for slaves and should be safe to run on a live
> database.  The maintenance script's includes file is called by update.php.

Comment 44 Robert Leverington 2008-03-10 10:17:50 UTC

Console access is required for all maintenace scripts, but a PHP or Java web-based SSH\telnet client provided by your host will usually suffice.

Comment 45 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-03-10 23:03:08 UTC

Created attachment 4706 [details]
Same as previous attachment, with all files actually included

I forgot to svn add a few things, whoops.

Comment 46 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-03-16 19:26:16 UTC

Created attachment 4728 [details]
Patch with all functionality, needs review

Okay, it now updates correctly on LinksUpdate and page deletion.  I think this is it, but I really really want review from Brion or Tim before I check it in, since otherwise it will probably get reverted.

Comment 47 Andrew Garrett 2008-03-17 05:08:12 UTC

Looks good (I spent about 10 minutes reading through the code, not a substitute for thorough testing). The only immediate issue I can see is that it's not obvious whether you've considered this scenario:

The schema change is applied, and a category which does not have an already-populated row in the category table is deleted from a page. If you haven't considered that, you'd get a PHP error for trying to subtract 1 from a signed field with value 0.

Maybe I didn't pick up on that logic in your code.

Well done, by the way. It's great to see this finally fixed up.

Comment 48 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-03-17 13:20:18 UTC

(In reply to comment #47)
> The schema change is applied, and a category which does not have an
> already-populated row in the category table is deleted from a page. If you
> haven't considered that, you'd get a PHP error for trying to subtract 1 from a
> signed field with value 0.
> 
> Maybe I didn't pick up on that logic in your code.

I explicitly noted that in a comment:

+		# ... If [a row for] one of the "deleted" [categories] didn't
+		# exist, we might legitimately not create it, but it's simpler to just
+		# create it and then give it a negative value, since the value is bogus
+		# anyway.

You may have missed that the fields are actually signed, so you'll just get -1 with no errors, since that's a perfectly valid value for a signed int.  I made it signed so that underflow errors are more obvious (as the comment in tables.sql says).

I actually thought of a better way to do the update, in a way that doesn't require explicit LOCK IN SHARE MODE, since it's all one query (but no subqueries needed either).  I'll change it to use that and then just check it in.  Hopefully any errors that are spotted are minor enough to just be fixed on the spot instead of having to back it out.

Comment 49 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-03-18 00:20:29 UTC

Patch committed in r32085.

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links