Last modified: 2014-07-19 00:06:58 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T18036, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 16036 - PAGESINCATEGORY inaccurate for large categories
PAGESINCATEGORY inaccurate for large categories
Status: NEW
Product: MediaWiki
Classification: Unclassified
Categories (Other open bugs)
All All
: Low normal with 8 votes (vote)
: ---
Assigned To: Nobody - You can work on this!
: shell
: 19068 20977 68240 (view as bug list)
Depends on:
Blocks: 16660 29782
  Show dependency treegraph
Reported: 2008-10-19 05:27 UTC by MER-C
Modified: 2014-07-19 00:06 UTC (History)
18 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Description MER-C 2008-10-19 05:27:43 UTC
*[[Category:All articles proposed for deletion]]
*[[Category:All disputed non-free images]]
*[[Category:Articles for deletion]]

The page count as given by PAGESINCATEGORY for the above categories are 3034, 2460 and 2776 respectively. Manually paginating through the category contents gives 583, 241 and 745.

All three categories are populated by templates and some entries exit the category via deletion.
Comment 1 Foroa 2008-10-27 09:31:20 UTC
Simular problems on [[:commons:category:Halo]] and [[:commons:category:Duplicate]]. Mostly emptied by bot move and deletion. 
Comment 2 Chad H. 2009-06-03 15:56:24 UTC
*** Bug 19068 has been marked as a duplicate of this bug. ***
Comment 3 Alexandre Emsenhuber [IAlex] 2009-10-04 06:54:26 UTC
*** Bug 20977 has been marked as a duplicate of this bug. ***
Comment 4 Christian Thiele 2010-01-06 09:29:53 UTC
On de.wikipedia there is maybe the same problem: "SELECT cat_pages FROM category WHERE cat_title='Mann';" gives 260098, but "SELECT count(*) FROM categorylinks WHERE cl_to='Mann';" gives 259087. I don't know if it's the same bug, because the category "Mann" (man) is normally not inserted by template but directly (I don't know any article, where it's added by template).

But wrong numbers in PAGESINCATEGORY makes this not very usable...
Comment 5 db [inactive,noenotif] 2010-06-29 20:12:30 UTC
See also bug 12188
Comment 6 Church of emacs 2010-08-15 12:01:33 UTC
Does anyone have a clue what's causing this? The error seems to go in both directions (sometimes there are too many articles in the category, sometimes too few). Perhaps we don't put article saves and category updates in one transaction and if one succeeds and the other fails we get wrong numbers?

Here's a simple query to see where there are wrong numbers in the database:

SELECT cat_title as title, cat_pages as storedcount, realcount FROM category,
(SELECT cl_to, count(cl_to) as realcount FROM categorylinks GROUP BY cl_to) as rc
WHERE cl_to=cat_title AND rc.realcount!=cat_pages LIMIT 10;

(In reply to comment #5)
> See also bug 12188

That's not really the same bug.
Comment 7 Church of emacs 2010-08-15 12:51:28 UTC
From what I've tested so far, this problem seems to be two different problems:

a) categorylinks contains obsolete entries. See cl_to='1284_establishments' for on dewiki an example. This was an imported article located at [[:de:Portal Diskussion:Peterhouse, Cambridge]]. Even though the article was moved and the category removed ( ), the reference in categorylinks remains.

b) cat_pages is inaccurate. See for example which clearly contains 32 pages + 5 subcats = 37 cats. Yet, cat_pages says only 36 pages are in that category.
(What I don't understand in this case is why PAGESINCATEGORY returns the correct amount of 37. This seems to be inconsistent with the cat_pages result on the toolserver).

By the way, all in all there are 30 results on dewiki for a cat_pages <-> count mismatches.
Comment 8 DieBuche 2010-12-14 08:50:04 UTC
Does this only happen for template-populated categories?
Possibly related bugs:
Bug 16686
Bug 16765
Bug 21230
Comment 9 Philippe Verdy 2010-12-15 12:40:59 UTC
it looks like there are stale entries in table categorylinks. This is a referential constraint : one deletion occured in table category, which probably did not occur at the same time in table cateforylinks, probably due to concurrent accesses and incorrecct recovery after a failure.

So we need now a SQL-enabled bot to check the counts and cleanup the stale  categorylinks entries. This causes another bug: in debuggging categories (populated by templates, such as those detecting pages using obsoleted or missing parameters or incorrect template parameter values), we can cleanup the listed pages by correcting these template invokations (most of these cleanup actions are performed by bots).

However, the page reappears in the categoy some minutes later, because apparently it is found with a request on "categorylinks" only. When we nulledit these pages, they disappear immedirately from the category, because the query cache will only list the entries found in the "category" page. A few minutes later (about 5 minutes), when no edit was really performed, the same pages reappear without any reason (no edit was performed), because the stale entries in "categorylinks" are still there and where not deleted by the nulledit.

Checking now all categories could take a lot of time and SQL rssources. However, a null-edit action should be able to cleanup the categories that the null-edit is trying to refresh : not only it should look at what is in table "category" but also if the null-edited page is found somewhere in "categorylinks", to delete the unwanted stale entries: if the two "SELECT count(*)" from the two tables are incoherent, then it should list the existing entries and compare them to what is expected for the null-edited page, and delete those superfluous entries.

Note that MySQL does not properly rollbacks transactions (a full support for transactions is very costly for the server and could cause dealys and huge consumption of ressources, plus additional deadlocks: to avoid this, you should make sure that deletions or additions in any of the two tables are always performed in the same order (first "category", then "categorylinks"): this will remain true if transaction support is enabled (in another engine than MySQL, such as Oracle or Postgres).

In my opinion, the full support for transaction only works with Oracle, Sybase, MS-SQL, Informix, Ingres, PostGres, but if transactions is not enabled (i.e. all requests are autocommitted), then you need a better management of errors and their recovery (by inserting insert/update/delete requests in a queue list with a callback to handle the retires with a timed delay, for example by reinserting the page that failed to update completely in the existing update queue).
Comment 10 Ryan Kaldari 2011-04-04 19:51:02 UTC
If this is not specific to deletion categories, the bug title should be changed.
Comment 11 Roan Kattouw 2011-06-28 16:17:47 UTC
Per Aryeh, these counts can be fixed by running populateCategory.php with the --force parameter.
Comment 13 Mark A. Hershberger 2011-06-28 17:06:45 UTC
per comment 11, adding "shell" keyword
Comment 14 Bawolff (Brian Wolff) 2011-06-28 17:45:24 UTC
(In reply to comment #12)
> is another example where this is causing issues

Something weird is happening. This bug should no longer happen on categories with less than 200 pages in them, and 63 < 200.
Comment 15 Bawolff (Brian Wolff) 2011-06-28 19:09:05 UTC
(In reply to comment #14)
> (In reply to comment #12)
> >
> > is another example where this is causing issues
> Something weird is happening. This bug should no longer happen on categories
> with less than 200 pages in them, and 63 < 200.

I see why. We don't refresh the cat counts if everything is 0 (or specifically if everything is right except for the images, and the image section is really 0 or >200). This behaviour is kind of weird (Also, we only check if a specific section is correct, which could result in recounting a 200000 big category if it only happens to have 5 images in it, and the image count is wrong) [this all seems kind of wrong, but off topic to this bug so I'll stop talking about that and eiter fix it or split it off to some other bug]

Anyhow, the upshot of all this is - If you add a single image to that category, view that category page (this is the important part, you need to view the category page while it has the single image in it), and then remove that image from the category, it will probably reset the number of files in that category count - no shell required.
Comment 16 Betacommand 2013-02-24 18:31:01 UTC
Has any progress been made on resolving this issue?
Comment 17 Andre Klapper 2013-02-25 14:45:00 UTC
Betacommand: No, otherwise it would be mentioned here. :)
Comment 18 Sam Reed (reedy) 2013-05-16 21:24:37 UTC
Does this need doing everywhere?
Comment 19 Hercule 2013-06-04 16:23:25 UTC
I confirm this is not only for deletion categorie. An example on is :

Today it counts 1 850 pages + 3 sub-categorys and via the api I retrieve 1863 items.
Comment 21 Hercule 2013-06-05 15:45:49 UTC
(In reply to comment #20)
> empty_disambiguation_categories 
> and
> take
> sometimes weeks before getting correct. It became much worse last month.

I don't think this is related to this bug. Your problem is due to the cache management. That's something else.

Today I made an update of every pages in this morning, and update the category page. None changed the count of articles.
Comment 22 Bawolff (Brian Wolff) 2014-07-19 00:06:58 UTC
*** Bug 68240 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.