Last modified: 2014-09-23 23:09:29 UTC
As far as I can tell, entries are never deleted from the category table. This means that if a category is added to a page and removed, it will be in the category table forever, even if a category description page never existed.
Its probably safe to say that if a category has no pages in it and has no corresponding description page, that it ceases to be a category, since it doesn't exist in any visible place on the wiki.
On the English Wikipedia, more than 37% of the category table are these non-existent categories.
mysql> SELECT COUNT(*) FROM category LEFT JOIN page ON page_namespace=14 AND page_title=cat_title WHERE page_id IS NULL AND cat_pages=0 AND cat_files=0 AND cat_subcats=0;
| COUNT(*) |
| 419270 |
This is out of 1112968 entries in the category table.
I'm not sure that this is actually a problem. Is it?
Its not a user-visible problem, but it is completely unnecessary clutter in the database, and could be potentially annoying to toolserver users and people using the SQL dumps.
If not an actual change to the category handling code, perhaps there could be a maintenance script to this effect?
I have to agree with Alex and Dan on this...data that serves no purpose and is unrelated to anything is usually considered A Bad Thing in database administration. It causes bloat, potential slow-downs (not that that's a big concern when you've got a bazillion servers <g>), and is just that much more data that might cause problems in a corruption or large-scale-change scenario. Something should definitely be implemented to clean up the unused entries.
I figured that it would be nice to completely embarrass myself by writing some actual code. Attached patch adds a function to Category.php that deletes a category from the category table, and adds a maintenance script which calls that on all categories which do not have any subcats, pages, files, or a description page. I don't have a wiki to test this on, at least not one I have shell to, and I really don't know how things work, so it would be nice to know what I screwed up.
Created attachment 8792 [details]
Add a deleteFromTable to Category.php, new maintenance script
One of the things that I screwed up is that I forgot to attach the actual patch.
Created attachment 8795 [details]
There was a syntax error in that maintenance script, and apparently my !== was too specific for PHP to handle. I also added the necessary code to a deleteIfUnused() function in Category.php, although the maintenance script doesn't use it since it needs the --dry-run functionality. This script is tested against r92419 with categories containing only pages, though the logic is obviously the same for subcats and files. Also tested with both extant and non-extant category description pages. Only categories with no members and no description page are removed from the database, and there is no bad behavior when viewing old revisions or anything like that.
(In reply to comment #1)
> I'm not sure that this is actually a problem. Is it?
A user hit this bug today because UploadWizard relies on MediaWiki's allcategories API module to list category suggestions. The allcategories API module reads directly from the category table, so users end up selecting category suggestions for bogus categories (categories without any members or a description page). This is most certainly a problem.
This table is used in several places to get a list of categories. One of those places is ApiQueryAllCategories. This api is used to get suggestions for category titles. And is now returning suggestions for categories that have only existed by accident in the past and have no category-namespace page and no members.
Raising priority and adding 1.19 blocker (hopefully sooner, but it's hitting UploadWizard and causing bug 30718)
Dan, is there any reason that we should keep a record of what categories used to exist (that is, archive that list somewhere while deleting them from the Category table)?
Well, none that I can think of, and using that script on my test wiki provided no ill effect. These categories are deleted anyway, looking at old versions of pages that used to be in them will still show a redlink in the categories list either way, and I can't think of any use for a list of all categories past and present. Our lists of pages and files also don't include deleted items, and as others have said it's causing problems that categories refuse to go away when you delete them. UploadWizard had to add a workaround to avoid suggesting nonexistent categories.
(In reply to comment #9)
> This table is used in several places to get a list of categories. One of those
> places is ApiQueryAllCategories. This api is used to get suggestions for
> category titles. And is now returning suggestions for categories that have only
> existed by accident in the past and have no category-namespace page and no
> Raising priority and adding 1.19 blocker (hopefully sooner, but it's hitting
> UploadWizard and causing bug 30718)
Note, uploadWizard seems to have been fixed independently of this.
If we really care about this sort of thing, we could delete the category entries every time cat_pages gets down to 0 and the description page doesn't exist. But then we could have a lot of deleted and then recreated and then deleted entries in the category table.
Re: comment 7 (the patch)
Looks mostly good I think (I didn't take a super in-depth look, but from my quick read over, it looks pretty good). Two issues I see so far:
*When getting the entries from the category table in the maintinace script, you should sort by cat_pages. There's an index on this so it should be fast, and its a lot quicker to delete everything with 0 entries if you start with the things with 0 entries instead of going through all entries
*If you already have a category row available, do Category::newFromRow( $row ); (or even just directly look at the values in $row). Doing Category::newFromName will result in a second query for the category table which you already have.
*I don't like the idea of introducing methods that aren't used.
switch to milestone, remove release tracking dep
pushing to future since release is imminent.
Is it my imagination or has this not yet been published, despite being noticed and corrected three years ago? Certainly Wikipedia's "List of" categories show that the problem still exists. Out of the first 100, it looks like one is a redirect (List of Albanians) and none of the others exist at all.
I've tagged this for the next release, but this is my first time doing that, so if I did anything wrong, please correct me!
(In reply to Robert Morley from comment #15)
> Is it my imagination or has this not yet been published, despite being
> noticed and corrected three years ago?
It was noticed. It was not corrected. Dan proposed making a script that could be run manually to remove categories without any entries. I had some mild objections at the time. My biggest objection is that if we don't want such categories, we should have something automatic to remove them (ie not a maintenance script), where the comment 7 solution would require a human to press a button every so often to get rid of the extra categories.
> I've tagged this for the next release, but this is my first time doing that,
> so if I did anything wrong, please correct me!
Those tags aren't treated too seriously. Generally though I would only recommend setting target milestones if you are in a position to fix the bug.
So possible solutions to this bug:
*Have allcategories default to having the acmin parameter be 1. This is closer to the probable intent of users. Downside - will exclude pages in the category namespace that don't have any category members. However it should be noted that this is already the case. [[Category:*NSYNC members]] is a thing, but https://en.wikipedia.org/w/api.php?action=query&list=allcategories&acprefix=*N&aclimit=100&acprop=size doesn't include it.
Another possible solution (perhaps more proper):
Change line 3384 on includes/WikiPage.php to delete the entry in the category table iff. cat_pages = 0 and there is no description page for that category. If we did that, we would probably want to ensure that creation of a page in the category namespace creates an entry in the category table even if it has no entries.
I hadn't looked closely at the patch and thought he'd added both the automatic deletion functionality and the maintenance script. I see now that that was incorrect.
For your first solution, I think your point about *NSYNC means that a default of acmin=1 would be an easy-to-write interim solution.
I really feel that the second solution is the way to go in the long run, though. The category table never getting cleaned up is basically a resource leak, which is bad in its own right, but could also conceivably be used to expose unwanted profanity or sensitive data with no easy way to get rid of it.
I was thinking about sometimes-empty categories, but even if a category is empty, it'll almost always have a category page associated with it (in which case it should be kept, as you mentioned). Even if it doesn't, I don't think deleting it will really matter. I suppose you might want it left as a category suggestion or what have you, but really, if that's what you want, you should be creating a page for the thing anyway.