Last modified: 2014-11-18 18:07:23 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T3211, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 1211 - Subcategory paging is not separate from article or image paging
Subcategory paging is not separate from article or image paging
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Categories (Other open bugs)
1.12.x
All All
: Normal normal with 48 votes (vote)
: ---
Assigned To: Aryeh Gregor (not reading bugmail, please e-mail directly)
http://en.wikipedia.org/wiki/Category...
:
: 1990 4910 10488 11696 16249 17359 24271 (view as bug list)
Depends on:
Blocks: 450 23682
  Show dependency treegraph
 
Reported: 2004-12-27 17:03 UTC by David Gerard
Modified: 2014-11-18 18:07 UTC (History)
35 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Proposed patch (9.85 KB, patch)
2005-11-24 03:36 UTC, Joel Nothman
Details
alternate category page (17.23 KB, text/plain)
2006-07-28 17:19 UTC, Jim Hu
Details
Patch to always show all subcategories (2.63 KB, patch)
2007-09-06 19:05 UTC, Peter Schlömer
Details

Description David Gerard 2004-12-27 17:03:59 UTC
Category:Business is large, so the list of articles pages, as it should - there
are 200 articles on a page. The first page is all up to F.

However, the subcategories are also paged and only go up to F, even though there
are only 17 on the first page. This is not optimal for usability - I was
actually going to the page to check which subcat was right for an article, but
have to page unnecessarily.

Expected result: That subcats would not be paged, or would also page at 200.

Actual result: Subcats page alphabetically in step with articles, though this
reduces the usability of the subcat listing.

Test case: category:business
Comment 1 Mark Clements (HappyDog) 2005-02-26 20:10:11 UTC
Had a similar problem with Category:Computer_stubs.  I added a sub=category
(Category:Software stubs) but it wasn't showing up in the list.  Spent a while
trying to get it to show up, until I stumbled across it if I looked atarticles
from 'S' onwards.  Sub-Categories should be completely separate from articles. 
I understand that mediawiki is getting top 200 matches and then sorting them
into separate arrays, but unless this is sorted it makes sub-categories in
well-populated categories redundant.

The list of sub-categories should be completely separate from the list of articles.
Comment 2 lɛʁi לערי ריינהארט 2005-02-26 21:35:05 UTC
Halló!

To my opinion (hopefully all) subcategories should be listed whenever a
categgory page is generated. A problem can occur if there are more then 200
subcategories the maximal number of items displayed.

Some "turn arounds" for the actual implementation are discussed in English at:
[[sr:Википедија:Tests for Корисник:Bonzo]] .
One sugestion is to add a space in [[:category:XXXX| YYYY]] or a "!"
[[:category:XXXX|!YYYY]] depending on what consensus you have in your wiki, if
you are using a navigation template etc.

Regards Reinhardt
Comment 3 lɛʁi לערי ריינהארט 2005-02-26 21:58:14 UTC
here the direct link
http://sr.wikipedia.org/wiki/Wikipedia:Tests_for_%D0%9A%D0%BE%D1%80%D0%B8%D1%81%D0%BD%D0%B8%D0%BA:Bonzo
Comment 4 lɛʁi לערי ריינהארט 2005-03-10 19:11:53 UTC
[[en:Category:Stub categories]] is a category having more than 200 subcategories.
Regards Reinhardt
Comment 5 Brion Vibber 2005-04-27 00:22:37 UTC
*** Bug 1990 has been marked as a duplicate of this bug. ***
Comment 6 Niklas Laxström 2005-06-08 12:22:54 UTC
Is this really only an enhancement? It seems a loss of function to me.
Comment 7 David Gerard 2005-06-08 22:58:29 UTC
Indeed. The subcats not appearing *at all* until the second page if they're out of
luck in the alpha sorting is a definite bug. Marking "minor" rather than
"enhancement".
Comment 8 R. S. Shaw 2005-06-26 19:01:45 UTC
This needs fixing. All subcats should come before any articles.  This bug is forcing burdensome workarounds, like setting the sort keys of Category tags on each of 40 subcategory pages done for http://en.wikipedia.org/wiki/Category:Computing 
Comment 9 elian 2005-09-09 00:15:19 UTC
This is definitely a usability bug. If someone fixes this, please care for
http://bugzilla.wikimedia.org/show_bug.cgi?id=3095 as well. 
Comment 10 Joel Nothman 2005-11-24 03:33:24 UTC
[Note that #3095 was a duplicate of #1212]

I see two solutions to this problem (together with #1212):
(1) get ALL entries under the category and sort them out
(2) make additional SQL queries

I have also proposed a synthesis of these two approaches, whereby (1) is used 
for categories containing less than a certain threshold number of entries and 
(2) is used for large categories. I don't know if this is more efficient at all.
Comment 11 Joel Nothman 2005-11-24 03:36:43 UTC
Created attachment 1096 [details]
Proposed patch

Using two-pronged approach with options (1) and (2) given in most recent
comment. May be useless =)
Comment 12 Melancholie 2006-03-15 02:53:38 UTC
*** Bug 4910 has been marked as a duplicate of this bug. ***
Comment 13 Jim Hu 2006-07-25 16:15:54 UTC
It seems to me that both subcategories and articles for large categories should
be displayed in a way similar to how Special:Allpages displays links to ranges
of pages once the size exceeds some threshold.

This is a severe problem for me, since I am running a wiki on biology/genomics,
where the category pages reflect the taxonomy tree, and there are articles for
every gene in the organisms I'm covering.  This means that sometimes the
subcategories are buried under TENS OF THOUSANDS of other pages! :(

So, I'll try to take this on if no one else is working in this direction...
Comment 14 Jim Hu 2006-07-28 17:19:55 UTC
Created attachment 2163 [details]
alternate category page

This alternative Category Page is similar to the earlier patch, but has some
significant differences.  If the number of items in either subcategories or
articles is larger than $limit = $wgCategoryPagingLimit different things
happen:
*If the number of items is < 5*$limit, the user gets pages of with $limit items
with the usual previous and next links
*If the number of items is larger than that, the user gets "Show $limit pages
starting with..." and a list of links to chunks.  The script divides the total
number of items so that there will be <= 20 total chunks, which means that the
number of clicks it will take to get to a particular item will vary.
Comment 15 Jim Hu 2006-07-28 17:26:28 UTC
p.s.  
1) The alternate category page can be seen in action at:
http://dimer.tamu.edu/GO/wiki/index.php/Category:cellular_organisms

2) I am not entirely happy with speed or the cleanliness of the code, but I
posted in the hope that others will build on it and share their versions.  This
can take a couple of minutes on my server for a really large category.  The link
is to one with 95,320 articles, and it just took about a minute to load.  

3) Whether or not you like the other aspects, I think the Category page should
give the correct number of subcategories and articles in the Category, instead
of just giving the size of the subset retrieved by the SELECT statement. This
alternate does so by doing select count(*) for both subcategories and articles.
Comment 16 Steve Bennett 2007-07-07 04:54:54 UTC
No progress on this? This is awful for usability - I thought a certain category just didn't exist at all, because it didn't show up on the list of subcategories. It's incredibly counterintuitive when you see something like this:

Subcategories:
Aardvark

Pages in category "blah"
A
[lots of pages]
B
[lots of pages]
(previous 200) (next 200)


There's no reason to think that there are more subcategories that you can't see.

Would an easier fix be to simply repeat the link to the more pages, something like:

Subcategories:
Aardvark
(previous) (next)

Pages in category "blah"
A
[lots of pages]
B
[lots of pages]
(previous 200) (next 200)

Comment 17 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-07-08 01:53:37 UTC
*** Bug 10488 has been marked as a duplicate of this bug. ***
Comment 18 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-07-08 03:36:07 UTC
This really should not be difficult, I don't think.  A committer just has to review those patches, or spend maybe half an hour writing their own.
Comment 19 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-08-27 03:20:09 UTC
Ah, I see why it was done this way.  It's more efficient.  Otherwise you have to add a WHERE page_namespace != NS_CATEGORY and muck up the efficiency.  Actually, I think that might require that all the rows of the category be read regardless of the LIMIT, depending on how MySQL does joins, but even if not, you could still easily read ten times as many rows as strictly needed this way.  A schema change to add a boolean "cl_subcategory" to categorylinks, along with appropriate indexing, would fix the inefficiency and possibly be useful for other things too.  Alternatively we could ignore the deficiency and scan a bunch of extra rows on certain category views, which is what I would provisionally be inclined to do.

Proposed patch #1 deliberately scans five times as many rows as necessary, not really acceptable.  Proposed patch #2 isn't in diff format and so I can't easily review it.
Comment 20 Peter Schlömer 2007-09-06 19:05:17 UTC
Created attachment 4083 [details]
Patch to always show all subcategories

This is my take on this. I just always show all subcategories; they are no longer paged. This could of course lead to problems if a category has a lot of subcategories; and it requires an additional database query (although I think that's unavoidable if you don't want to read through all rows using PHP).

I've also moved the top "previous"/"next" links below the subcategories; after this change, leaving them at the very top makes no sense.
Comment 21 Brion Vibber 2007-10-18 14:30:38 UTC
*** Bug 11696 has been marked as a duplicate of this bug. ***
Comment 22 Dashiva 2007-12-17 01:26:20 UTC
No paging at all on subcategories would need to be configurable, but that's simple enough. Full-fledged subcategory paging would probably require changes that affect extensions that extend CategoryPage/Viewer as well. I could only find CategoryTree (and DPL using some static methods), so this might not be such a big problem.
Comment 23 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-12-17 01:49:55 UTC
(In reply to comment #20)
> Created an attachment (id=4083) [details]
> Patch to always show all subcategories
> 
> This is my take on this. I just always show all subcategories; they are no
> longer paged. This could of course lead to problems if a category has a lot of
> subcategories; and it requires an additional database query (although I think
> that's unavoidable if you don't want to read through all rows using PHP).
> 
> I've also moved the top "previous"/"next" links below the subcategories; after
> this change, leaving them at the very top makes no sense.

After some consideration: this patch is unacceptable, given that I believe it requires scanning the entire set of category links for a page.  This can be hundreds of thousands of rows, and take seconds, on the English Wikipedia.  Some kind of schema change is almost certainly necessary for this to be done efficiently.
Comment 24 Tim Laqua 2007-12-17 02:25:20 UTC
(In reply to comment #23)
> After some consideration: this patch is unacceptable, given that I believe it
> requires scanning the entire set of category links for a page.  This can be
> hundreds of thousands of rows, and take seconds, on the English Wikipedia. 
> Some kind of schema change is almost certainly necessary for this to be done
> efficiently.

What do you mean?  The patch specifies 'page_namespace' => NS_CATEGORY as part of the WHERE clause in the additional sub-category query.
Comment 25 Tim Laqua 2007-12-17 05:27:06 UTC
(In reply to comment #24)
> (In reply to comment #23)
> > After some consideration: this patch is unacceptable, given that I believe it
> > requires scanning the entire set of category links for a page.  This can be
> > hundreds of thousands of rows, and take seconds, on the English Wikipedia. 
> > Some kind of schema change is almost certainly necessary for this to be done
> > efficiently.
> What do you mean?  The patch specifies 'page_namespace' => NS_CATEGORY as part
> of the WHERE clause in the additional sub-category query.

I ran an explain on the two queries (trimmed output):
explain select page_title, page_len, page_is_redirect, cl_sortkey from page, categorylinks use index (cl_sortkey) where cl_from=page_id AND cl_to='Departments' AND page_namespace=14 order by cl_sortkey;

table         | type   | possible_keys      | Extra                    |
categorylinks | ref    | cl_sortkey         | Using where; Using index |
page          | eq_ref | PRIMARY,name_title | Using where              |


explain select page_title, page_len, page_is_redirect, cl_sortkey from page, categorylinks use index (cl_sortkey) where cl_from=page_id AND cl_to='Departments' order by cl_sortkey;

table         | type   | possible_keys | Extra                    |
categorylinks | ref    | cl_sortkey    | Using where; Using index |
page          | eq_ref | PRIMARY       |                          |


As far as I understand, the page_namespace addition to the WHERE clause brings in the name_title index f/ the page table, but the type is still eq_ref and unless i'm misinterpreting, I can't see where we'd be taking a huge performance hit.

I haven't worked too much w/ query optimization, especially on the scale of the wikipedia DB - but I'd certainly like to know how significant of a performance hit there is bringing in the name_title index (assuming that really is the only significant difference here).
Comment 26 Tim Laqua 2007-12-17 13:13:06 UTC
I talked to Domas about it, the name_title key wouldn't be used, so as Simetrical said - it could indeed lead to scanning a pile of rows that wouldn't be needed w/o page_namespace in the WHERE clause.

So ignore the last two comments.  ;-)
Comment 27 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-12-17 17:32:04 UTC
The difference, AFAIK, is not in the queries themselves, but in how they behave with LIMIT (which EXPLAIN ignores).  The query with no condition will scan a maximum of 200 rows, because it scans until it gets 200 rows that satisfy the condition.  The query with condition page_namespace=14 will have to scan every single row, and will not be able to omit rows that don't match that condition, because page_namespace is not in (and cannot be in) the cl_sortkey index it's using to retrieve the rows for the first table.  It has to retrieve a row from categorylinks and join to page before it can decide whether to keep it.  Compare the two queries' performance on a large category (this is the toolserver, so much slower than it would be on the live server, but you can tell the difference anyway):

mysql> SELECT page_namespace, page_title FROM categorylinks USE INDEX (cl_sortkey) JOIN page ON cl_from=page_id WHERE cl_to='Living_people' ORDER BY cl_sortkey LIMIT 200;
+----------------+--------------------------------+
| page_namespace | page_title                     |
+----------------+--------------------------------+
|              4 | Biographies_of_living_persons  | 
|              0 | 20Bello                        | 
|              0 | 40_Cal.                        | 
|              0 | 50/50_Twin                     | 
|              0 | 6025_(musician)                | 
. . .
|              0 | Hisham_Abbas                   | 
|              0 | Imran_Abbas                    | 
+----------------+--------------------------------+
200 rows in set (3.63 sec)

mysql> SELECT page_namespace, page_title FROM categorylinks USE INDEX (cl_sortkey) JOIN page ON cl_from=page_id WHERE cl_to='Living_people' AND page_namespace=14 ORDER BY cl_sortkey LIMIT 200;
+----------------+----------------+
| page_namespace | page_title     |
+----------------+----------------+
|             14 | Pete_Rock      | 
|             14 | Royal_children | 
|             14 | Mar_Roxas      | 
+----------------+----------------+
3 rows in set (12 min 49.22 sec)

Now note that the latter is probably more like three seconds on the real servers (cf. <http://en.wikipedia.org/wiki/Special:Categories?offset=Living_peopl>, which evidently no one has killed yet).  But the point remains, the latter is O(N) in the whole category size and the former is O(N) (maybe O(N log N), I don't know) in just the result set.  The difference here was a factor of over 200, the difference on the real servers is probably over 300 (figuring 3 s for the latter, <10 ms for the former), and it will only increase as categories get bigger.  A scalable solution is needed if this is to be put in core.
Comment 28 Brion Vibber 2007-12-18 01:22:19 UTC
The sane solution here is probably to just have a distinct table for "subcategories" from other pages.

Alternatively, an additional column could be added to categorylinks which indicates which set of categorization it belongs in, with an appropriate index.

A separate table would be a quicker and more backwards-compatible upgrade, as it wouldn't require a complete rebuild of the (large) categorylinks table. The upgrade would go like this:

1) create the new subcategorylinks table
2) copy categorylinks data for page_namespace=14 into subcategorylinks
3) delete categorylinks data for page_namespace=14
4) profit!

The two sets of data would then be trivial to page separately.
Comment 29 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-12-18 02:52:57 UTC
And it's even safe, since it's impossible to move non-categories to categories or vice versa (a point that didn't occur to me when similar ideas were mentioned elsewhere).
Comment 30 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-12-18 03:03:16 UTC
. . . Of course, any such change will probably break anything else that deals with categorylinks, to some extent.
Comment 31 John Blumel 2007-12-18 03:31:16 UTC
(In reply to comment #28)
> The sane solution here is probably to just have a distinct table for
> "subcategories" from other pages...

I'm a submitter of a duplicate of this issue (although I didn't notice it referenced here, but may have just overlooked it) and I'm afraid I'm utterly unfamiliar with the MediaWiki schema or the code that deals with this, but I just wanted to raise the question of how this would work with subcategories that also have subcategories and categories/subcategories that are subcategories of multiple categories which may also be subcategories.

In our wiki,

http://wiki.hmssurprise.org

every category but one,

http://wiki.hmssurprise.org/index.php/Category:WikiPOBia

is (or will be) a subcategory of some other category, and in many instances is a subcategory of multiple categories. Examples of this can be found particularly under the following two categories (all the entries under these categories, subcategories and articles, were generated and botted to the site, and this is probably quite atypical):

http://wiki.hmssurprise.org/index.php/Category:Guide_for_the_Perplexed
http://wiki.hmssurprise.org/index.php/Category:Maturin%27s_Medicine

The whole categorization scheme here was organized with this particular issue in mind, and we currently are not plagued by this problem, but I wouldn't want to see the fix cause it's own set of problems, or wreak havic with our existing categorization scheme.
Comment 32 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-12-18 03:42:48 UTC
(In reply to comment #31)
> I'm a submitter of a duplicate of this issue (although I didn't notice it
> referenced here, but may have just overlooked it) and I'm afraid I'm utterly
> unfamiliar with the MediaWiki schema or the code that deals with this, but I
> just wanted to raise the question of how this would work with subcategories
> that also have subcategories and categories/subcategories that are
> subcategories of multiple categories which may also be subcategories.

There would be no difference to the current way things work.
Comment 33 R. S. Shaw 2007-12-18 21:49:51 UTC
Above it was suggested that a new subcategorylinks table be added.  There may be a simpler solution which is also potentially less disruptive.  This wouldn't involve a complete rebuild of categorylinks, but would require a upgrade phase.  It would go like this:

1) Add a column, say specialref, to categorylinks, default NULL, indexed if value not NULL.
2) Make a pass over categorylinks where namespace=14, updating specialref to a nonnull value, say 14.

The normal subcat display would use the index, the page member display would not but would skip records which have a non-NULL specialref.  Accessors ignorant of the change would continue to work as now.  Add of a new subcat of course sets specialref.

(The page list retrieval has to skip the subcat entries, but that can be expected to be a small percentage.)



Comment 34 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-12-18 22:05:35 UTC
That requires an ALTER TABLE, which is much more of a pain than adding a table.  Copying millions of rows from categorylinks to subcategorylinks: easy, just run some queries.  Altering categorylinks: locks the table to writes while you're doing it, so you have to take slave out of rotation, apply alteration, re-add to rotation, repeat for each slave, switch the master, repeat for master.  It's more disruptive for the sysadmins, not less.

Besides, as I say, there are indexing issues.  For this to be effective here, you would need to change the cl_sortkey index from (cl_to, cl_sortkey, cl_from) to (cl_to, cl_sortkey, cl_subcat, cl_from).  This would break the use of the index for ordering by cl_from, unless you were retrieving by cl_subcat.  Unmodified apps would still work, they'd just take forever on large categories, as it filesorted the entire set of pages in the given category.  This is as far as I can see, without testing, but I'm fairly confident this is correct for MySQL.
Comment 35 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-12-18 22:06:39 UTC
Of course, your solution would *not* disrupt uses of categorylinks that don't use the cl_sortkey index, which is quite possibly most of them.  It does have that in its favor.
Comment 36 Tim Starling 2008-03-13 13:16:37 UTC
Fixing bug 164 requires changing the cl_sortkey index anyway. So if a fix were within reach for the two of them, the categorylinks rebuild could be combined and we could stay with a single table. 
Comment 37 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-03-13 14:16:42 UTC
Why does it require changing the cl_sortkey index?  It just requires rebuilding the contents of the cl_sortkey column, which can be done live (sorting on category pages may or may not be a bit odd in the interim, depending on how different the new collation algorithm is).
Comment 38 Danny B. 2008-05-04 23:37:53 UTC
Bug 450 related.
Comment 39 Túrelio 2008-06-03 08:42:45 UTC
I'm asking to step up the Severity of this bug at least to Normal (better to Major) as it really restricts everyday usability of categories on Commons. As another user in the current discussion on Commons put it, we set up a system with currently >2.7 mio files and our category system has problems when there are >200 files in a category. [http://commons.wikimedia.org/wiki/Commons:Village_pump#How_to_show_first_all_sub-cats_of_a_cat.3F]
Comment 40 billclark 2008-12-19 21:59:54 UTC
I suggest that we add a cl_namespace column to the table and also add that column to the cl_sortkey index.  This will allow efficient independent selection (and thus paging) of a category's members based on the namespace, which would not only resolve this bug but would help with #450 as well.  It's really not feasible to resolve #450 unless the different namespaces can be paged independently.

I'll be happy to provide a software patch for both of these bugs (I'm almost done with the one for #450) but want to be sure that the proposed schema change is acceptable, first.
Comment 41 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-12-20 23:13:38 UTC
(In reply to comment #40)
> I suggest that we add a cl_namespace column to the table and also add that
> column to the cl_sortkey index.  This will allow efficient independent
> selection (and thus paging) of a category's members based on the namespace,
> which would not only resolve this bug but would help with #450 as well.  It's
> really not feasible to resolve #450 unless the different namespaces can be
> paged independently.
>
> I'll be happy to provide a software patch for both of these bugs (I'm almost
> done with the one for #450) but want to be sure that the proposed schema change
> is acceptable, first.

First of all, this implementation of a fix would depend on bug 16012.  All schema changes to large tables are suspended until there's a reasonable way of executing them, apparently, and categorylinks is certainly a large table.

Second of all, you're (further) denormalizing the database here.  This might be justifiable -- certainly it seems like the only reasonable way to fix bug 450 -- but it's a cost.  You'd have to make sure that anything that moves pages also updates cl_namespace.  This is not currently a cost we have anywhere else, AFAIK: either things are associated with the page_id and don't have to be updated, or are associated with the namespace/title combination rather than the page itself (e.g., pl_namespace and pl_title) and therefore still don't have to be updated.  We can expect that bugs will occur here as cl_namespace sometimes isn't updated on moves, for whatever reason.

Thirdly, this would probably *require* bug 450 to be fixed to be reasonably efficient.  In the current setup, we'd need something like 'SELECT ... WHERE cl_from=xxx AND cl_namespace NOT IN (' . NS_FILE . ',' . NS_CATEGORY . ') ORDER BY cl_sortkey', which MySQL (at least older versions like 4.0) is too stupid to execute efficiently: it can't use the index for ordering if there's a range condition like IN in the WHERE clause.  (I think it would be smart enough in 5.0, or at least 5.1.)

And as for the proposal in bug 450, I see two issues.  First of all, you'd surely need one query per namespace on each category view, which could be a couple dozen on a lot of wikis.  That seems kind of excessive, if it's avoidable.  Second of all, the current structure of the category table reflects the pages/subcats/files breakdown.  If counts are given separately for all namespaces, you'd need to store all the counts in the category table.  This would either require an ALTER TABLE for every namespace added or removed (not happening), or else breaking off a new categorycount table like (cc_cat, cc_namespace, cc_count) to store the counts and deleting the cat_pages, cat_subcats, cat_files from the category table.
Comment 42 Roan Kattouw 2008-12-21 00:06:06 UTC
(In reply to comment #41)
> And as for the proposal in bug 450, I see two issues.  First of all, you'd
> surely need one query per namespace on each category view, which could be a
> couple dozen on a lot of wikis.  That seems kind of excessive, if it's
> avoidable.  Second of all, the current structure of the category table reflects
> the pages/subcats/files breakdown.  If counts are given separately for all
> namespaces, you'd need to store all the counts in the category table.  This
> would either require an ALTER TABLE for every namespace added or removed (not
> happening), or else breaking off a new categorycount table like (cc_cat,
> cc_namespace, cc_count) to store the counts and deleting the cat_pages,
> cat_subcats, cat_files from the category table.
> 

If we throw the separate paging thing out the window citing performance concerns, we could page by (ns, sortkey), which should be feasible given the right index (especially since the range condition is now also the first field of the ORDER BY). Of course that would page the main namespace first, then Talk, then User, etc., but that's the price we'd have to pay for not liking burned database servers. Jump-ahead links to a certain namespace would still be possible as well.
Comment 43 Sean Colombo 2009-01-24 17:07:29 UTC
Would it be rocking the boat too much to suggest that this bug gets upgraded to a severity above normal? I know it doesn't matter toooooo too much, but it might make us more likely to solve it ;)

It's just that most reasonably sized wikis tend to have categories with >200 articles and categories that large will quite often have subcategories and it's really not obvious to a user what's going on when they see only SOME of the subcategories on the first page.
Comment 44 Siebrand Mazeland 2009-02-02 15:59:41 UTC
+patch, +need-review
Comment 45 Aryeh Gregor (not reading bugmail, please e-mail directly) 2009-02-02 17:38:10 UTC
-patch, -need-review.  Patch was reviewed and rejected, see comment 23.  Also removing schema-change until it's clear that we need one (we could just adjust the sortkeys in-place, perhaps).

The most straightforward solution to this seems to be to reserve the first character of the cl_sortkey column for a namespace designator.  So just make it "c" for subcategories, "a" for articles, "f" for files.  Then the three could be paged separately.  This would require some interface work too, of course.  A script would then have to be run before the change goes live, updating the categorylinks table; until this is done, category pagination would be kind of wonky, but there would be no other ill effects (at least if it works ;) ).
Comment 46 Roan Kattouw 2009-02-28 14:52:45 UTC
*** Bug 17359 has been marked as a duplicate of this bug. ***
Comment 47 Subfader 2009-02-28 15:01:44 UTC
*** Bug 16249 has been marked as a duplicate of this bug. ***
Comment 48 Subfader 2009-03-01 16:18:36 UTC
What's the problem about solving this? 
I'm thinking about manipulating CategoryPage.php myself but my php skills suck and it would take me ages of trial & error.
I want to:

sort articles desc, subcats asc

subcats in 3 cols, articles in 2 cols

different default navigation ammount for all 3 sections (subcats: e.g. all, articles: 200, files: 20)

split navigation logically: the 3 sections have their own navigation: going NEXT to articles 200-400 will still list subcats and images 0-X
Comment 49 Subfader 2009-06-04 13:41:22 UTC
Can anyone at provide a patch that sorts articles Articles in the recerve order?

Changing function finaliseCategoryState() to array_reserve only sorts batch per page (200) but not the whole set.
Comment 50 Brian Jason Drake 2009-06-26 06:05:41 UTC
If we can't fix this quickly (and it looks like we can't), can we at least add a message to each category page explaining this issue? I can't even find it documented anywhere except in bug reports.

I skimmed through the above comments and couldn't find an explanation of how the subcategories are split between pages at, say, http://commons.wikimedia.org/wiki/Category:Girls. The subcategories go from A-D, then from G-T, even though the media goes from A-L, then L-Z. Also, if you were viewing the first 200 category items, you'd never guess that the page "Super Girl (contest)" was in that category as well.
Comment 51 Tim Starling 2010-06-02 01:35:04 UTC
(In reply to comment #41)
> First of all, this implementation of a fix would depend on bug 16012.  All
> schema changes to large tables are suspended until there's a reasonable way of
> executing them, apparently, and categorylinks is certainly a large table.

Schema changes are in no way blocked by bug 16012, that was a misconception. I did a couple of schema changes during the deployment of 1.16wmf4, and I'm happy to do more in the next deployment. There are some conditions which need to be met for easy migration, but those won't change when bug 16012 is implemented. The table can be any size, as long as it fits into the free disk space, since MySQL needs to make a copy when it does an ALTER TABLE.

[...]
> Thirdly, this would probably *require* bug 450 to be fixed to be reasonably
> efficient.  In the current setup, we'd need something like 'SELECT ... WHERE
> cl_from=xxx AND cl_namespace NOT IN (' . NS_FILE . ',' . NS_CATEGORY . ') ORDER
> BY cl_sortkey', which MySQL (at least older versions like 4.0) is too stupid to
> execute efficiently: it can't use the index for ordering if there's a range
> condition like IN in the WHERE clause.  (I think it would be smart enough in
> 5.0, or at least 5.1.)

Do you mean the category page query? Presumably that would be cl_to=xxx not cl_from=xxx.

You don't need bug 450 to be fixed. You can keep the current (cl_to, cl_sortkey, cl_from) index and add an additional (cl_to, cl_namespace, cl_sortkey, cl_from) index. Then we can support both UIs.

When namespace breakdowns are enabled, there would be some complexity involved in making the main category page query skip subcategories and files, but I think it would be possible to do it efficiently, even in MySQL 4.0, by using ranges and unions.

For instance, we could have a union with the first subquery having a condition like (cl_sortkey > $offset AND cl_namespace < NS_CATEGORY), and the second subquery having a condition like (cl_namespace > NS_CATEGORY). 

> And as for the proposal in bug 450, I see two issues.  First of all, you'd
> surely need one query per namespace on each category view, which could be a
> couple dozen on a lot of wikis.  That seems kind of excessive, if it's
> avoidable.  

The category page view with the namespace breakdown would have to page through the results one namespace at a time, it wouldn't display, say, pages in all namespaces that start with the letter "L". As in the current scheme, the UI has to reflect the indexes. The query would have ORDER BY cl_namespace, cl_sortkey.

> Second of all, the current structure of the category table reflects
> the pages/subcats/files breakdown.  If counts are given separately for all
> namespaces, you'd need to store all the counts in the category table.  This
> would either require an ALTER TABLE for every namespace added or removed (not
> happening), or else breaking off a new categorycount table like (cc_cat,
> cc_namespace, cc_count) to store the counts and deleting the cat_pages,
> cat_subcats, cat_files from the category table.

Breaking namespaces out into a new table would be the only decent solution, assuming we need that feature.
Comment 52 Tim Starling 2010-06-02 01:40:45 UTC
(In reply to comment #51)
> You don't need bug 450 to be fixed. You can keep the current (cl_to,
> cl_sortkey, cl_from) index and add an additional (cl_to, cl_namespace,
> cl_sortkey, cl_from) index. Then we can support both UIs.

Although of course that means not fixing the current bug. My mistake.

To fix the current bug and not include namespaces in the sorting order would require a boolean cl_is_subcat field, defined to be the same as cl_namespace=NS_CATEGORY. Then the index would be (cl_to, cl_is_subcat, cl_sortkey, cl_from).
Comment 53 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-06-02 18:02:32 UTC
(In reply to comment #51)
> Schema changes are in no way blocked by bug 16012, that was a misconception. I
> did a couple of schema changes during the deployment of 1.16wmf4, and I'm happy
> to do more in the next deployment. There are some conditions which need to be
> met for easy migration, but those won't change when bug 16012 is implemented.
> The table can be any size, as long as it fits into the free disk space, since
> MySQL needs to make a copy when it does an ALTER TABLE.

Okay.

> Do you mean the category page query? Presumably that would be cl_to=xxx not
> cl_from=xxx.

Yes, right.  I always get them confused.  cl_page and cl_category would have made much more sense as column names.  Although less consistent with pagelinks.

> You don't need bug 450 to be fixed. You can keep the current (cl_to,
> cl_sortkey, cl_from) index and add an additional (cl_to, cl_namespace,
> cl_sortkey, cl_from) index. Then we can support both UIs.

I think that by "fixing bug 450" I meant that you'd need to store the namespace in the categorylinks table somehow.

> When namespace breakdowns are enabled, there would be some complexity involved
> in making the main category page query skip subcategories and files, but I
> think it would be possible to do it efficiently, even in MySQL 4.0, by using
> ranges and unions.
> 
> For instance, we could have a union with the first subquery having a condition
> like (cl_sortkey > $offset AND cl_namespace < NS_CATEGORY), and the second
> subquery having a condition like (cl_namespace > NS_CATEGORY). 

Yes, this is true.  Maybe I hadn't seen that hack yet in 2008.

> The category page view with the namespace breakdown would have to page through
> the results one namespace at a time, it wouldn't display, say, pages in all
> namespaces that start with the letter "L". As in the current scheme, the UI has
> to reflect the indexes. The query would have ORDER BY cl_namespace, cl_sortkey.

Do you mean it would have WHERE cl_to='X' AND cl_namespace=Y ORDER BY cl_sortkey?  It wouldn't make sense in any UI to display the first 200 results from the main namespace only, so you'd have to go through fifty pages to get to the files.  You could cut it down to one query by doing a UNION, with different WHERE clauses in each part of the union.

> Breaking namespaces out into a new table would be the only decent solution,
> assuming we need that feature.

If we're displaying different namespaces under different headings as bug 450 suggests, we'd presumably want to keep the total count for each namespace present in the UI.  It could be dropped, but it doesn't seem like there's any point.

(In reply to comment #52)
> To fix the current bug and not include namespaces in the sorting order would
> require a boolean cl_is_subcat field, defined to be the same as
> cl_namespace=NS_CATEGORY. Then the index would be (cl_to, cl_is_subcat,
> cl_sortkey, cl_from).

I don't see why this is necessary or advantageous.  You could just do several queries, maybe UNIONed together, one for each namespace, and that will work with a (cl_to, cl_namespace, cl_sortkey, cl_from) index.  The only advantage would be that pages can't get moved between the category namespace and other namespaces, so if we only stored subcategory-ness, we wouldn't have to update anything extra on page moves, and there'd be little to no risk of errors creeping in.  This is an advantage of sticking to the current subcats/pages/files approach, but we can't have that if we want to fix bug 450.  (Not sure if we need to or not.)
Comment 54 Lejonel 2010-07-05 18:06:09 UTC
*** Bug 24271 has been marked as a duplicate of this bug. ***
Comment 55 Jan Ainali 2010-07-05 20:18:56 UTC
(In reply to comment #50)
> If we can't fix this quickly (and it looks like we can't), can we at least add
> a message to each category page explaining this issue?

This would be a reasonable interrim solution. Just knowing that there are more subcategories is a huge usability benefit.
Comment 56 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-07-21 21:04:01 UTC
I'll be working on this bug.  I hope to have a solution coded up within a couple of weeks.  I wrote a post to wikitech-l about it, and encourage people to respond there rather than here (since this involves several bugs):

http://lists.wikimedia.org/pipermail/wikitech-l/2010-July/048399.html
Comment 57 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-08-13 21:16:35 UTC
This was fixed in trunk as of r70415.  Subcategories, files, and other pages sort in three separate sections.
Comment 58 Deni 2010-09-06 12:58:45 UTC
I don't think this is fixed, just see: http://en.wikipedia.org/wiki/Category:Computer_stubs
It says "This category has the following 8 subcategories, out of 16 total."
Comment 59 Niklas Laxström 2010-09-06 13:28:04 UTC
Bugs are marked as fixed right away when the code is fixed. It then takes time before end users see the fix, until entities like Wikimedia update their code to newer version. When that happens, bugs can be marked as verified or closed, but that doesn't seem the actual practice here.
Comment 60 Deni 2010-09-06 13:52:54 UTC
I checked the version of the English Wikipedia (http://en.wikipedia.org/wiki/Special:Version), it's r72153, so it should be "fixed".
Comment 61 Derk-Jan Hartman 2010-09-06 13:53:55 UTC
That is a different branch of the software. Don't rely on those numbers.
Comment 62 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-09-06 18:18:03 UTC
Wikimedia is way behind on code updates.  There's no guarantee on when Wikipedia will see the features, but they're fixed in trunk -- which means there's no more development work to do, so there's no purpose for the bug to remain open.  The fix is done, it just has to be deployed, which will happen along with all the other tens of thousands of revisions that aren't live yet.

It might not go live for months to come, who knows.  The last full scap was around r64689, and that was based on REL1_16, which was based on r62818 of trunk.  That was in February.  The revision numbers don't tell you what's live, as Derk-Jan points out, since Wikimedia is using the wmf branch of the code, not trunk.  I only gave the revision number so it would produce a commit link.
Comment 63 David Gerard 2010-09-06 19:33:38 UTC
So do we file a new bug that asks to implement the fixed bug?
Comment 64 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-09-06 19:45:09 UTC
If anyone is going to be interested in deploying this out of schedule, they won't need a bug to remind them to do so.  Plus, I doubt it will happen till bug 164 is fixed.
Comment 65 johnnymrninja 2011-02-09 07:09:39 UTC
The patch is included in release 1.17. See http://svn.wikimedia.org/viewvc/mediawiki/branches/REL1_17/phase3/RELEASE-NOTES?view=co
Comment 66 Bawolff (Brian Wolff) 2011-02-09 23:57:03 UTC
(In reply to comment #65)
> The patch is included in release 1.17. See
> http://svn.wikimedia.org/viewvc/mediawiki/branches/REL1_17/phase3/RELEASE-NOTES?view=co

As an aside, last i heard the category stuff was going to be deployed slightly later then the rest of 1.17 (On Wikimedia).
Comment 67 Subfader 2011-02-12 23:58:27 UTC
If someone can link an example category, pls do so.

Also, please let us know when this was included in a stable MW release.
Comment 68 Aryeh Gregor (not reading bugmail, please e-mail directly) 2011-02-13 18:33:14 UTC
This is in 1.17, as comment 65 says.  It will be in a stable release when 1.17 is released.  It will be enabled on Wikimedia at some time in the relatively near future.  Please be patient.
Comment 69 johnnymrninja 2011-03-08 09:25:48 UTC
1.17 is live, though I have not been able to confirm that this is fixed. I've edited categories and they do not move to the front. It could be some sort of category lag... Can someone confirm that this is now working properly on WP?
Comment 70 Roan Kattouw 2011-03-08 10:20:48 UTC
(In reply to comment #69)
> 1.17 is live, though I have not been able to confirm that this is fixed. I've
> edited categories and they do not move to the front. It could be some sort of
> category lag... Can someone confirm that this is now working properly on WP?
1.17 was deployed with the category changes stripped out. Tim has been working on doing the required database schema changes yesterday and today, and I guess we'll roll out the related software changes in the next few days.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links