Last modified: 2014-11-17 10:36:02 UTC
Just like in a book index, category listings should sort numbers by their value, not just as dumb strings of characters. Example: [[http://en.wikipedia.org/wiki/Category:Antonov]], partial listing: ... Antonov An-2 Antonov An-218 Antonov An-22 Antonov An-225 Antonov An-24 Antonov An-26 Antonov An-28 Antonov An-3 ... Of course, it should be ... Antonov An-2 Antonov An-3 Antonov An-22 Antonov An-24 Antonov An-26 Antonov An-28 Antonov An-218 Antonov An-225 ...
Does this really depend on bug 164? That one concerns character-by-character sorting, or an implementation of the Unicode collation algorithm. This one requires some more smarts to reduce a long string of characters, possibly including figures, decimal points, or commas, to a single entity for sorting purposes. I could be wrong, but this sounds like it would be a parallel programming effort.
I'd say it probably does, because bug 164 is basically "add a sort key to the database schema instead of using binary sort". But then again, from that perspective they could be viewed as mutually dependent, or independent. Certainly a) they're closely related, b) bug 164 will get fixed before this does, and c) it would be a good idea to get whoever fixes that to notice this as well, even if strictly speaking this could be resolved without resolving that. I assume (not being especially familiar with sorting algorithms) that numerical sorting could be worked into sort keys somehow. If it can, then it's closely related to 164 and would probably be solved together; if it can't, then that would imply that to implement it you have to re-sort the entire table every time an entry is added or removed, which is unacceptable and will never be implemented. But hey, remove it if you don't think it fits. I don't mind.
They're definitely related, but I'm not sure which bug is dependent on the other. I suspect that whoever implements either, it would be good if they kept the other in mind. I assumed that there would have to be a hook that sorts the results of a query. Unicode sorting and numerical sorting would plug into the hook as two separate procedures, maybe, and there may be an advantage to one or the other going first. But then, I have no clue of how the back end works.
(In reply to comment #3) > I assumed that there would have to be a hook that sorts the results of a query. Unicode sorting and > numerical sorting would plug into the hook as two separate procedures, maybe, and there may be an > advantage to one or the other going first. No, the results are currently sorted via SQL "ORDER BY", not PHP. Otherwise you'd have to pull up the entire table of names, which is ridiculously wasteful for larger categories/pages (e.g., 1,250,000+ article names to display 100). And sorting directly according to some complicated algorithm as you query the rows is similarly infeasible, because the (comparatively expensive) collation function would have to be executed on every possible pairing. What this (and bug 164) would require is for an extra column to be added to various tables, a sort key. PHP would calculate the sort key only when a title is created, would tell SQL to stick it in the column, and then the query would just have "ORDER BY sortkey" or what have you, which would be a binary sort and therefore very fast as sorts go. (There's also some discussion about using native collation algorithms packaged with newer versions of MySQL, but they appear to have some serious limitations.) So the major change these require is changing the database schema and working out sorting functions. Once that's implemented, tweaking the sorting function would just mean a minor change to the PHP (well, and recalculating the sort keys for every page in the wiki). What you *do* with the sort key is pretty much icing, so these two bugs are pretty much the same. Of course, all the above should be taken with a slight grain of salt, because I haven't actually looked at the code and am not an expert in the matter. But this is my impression from various sources.
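To make the sort-key idea above concrete, here is a minimal sketch in Python with an in-memory SQLite table. It is only an illustration, not MediaWiki code: the table name `page`, the column `sortkey`, the helper names, and the fixed padding width of 6 digits are all assumptions. The key is computed once when a title is stored, and the query then does nothing more than a plain binary "ORDER BY sortkey".

```python
import re
import sqlite3

PAD_WIDTH = 6  # assumed upper bound on digits per number; not from the thread


def make_sort_key(title: str, width: int = PAD_WIDTH) -> str:
    """Upper-case the title and zero-pad every run of digits.

    After padding, plain byte-wise string order agrees with numeric
    order for numbers of up to `width` digits.
    """
    return re.sub(r"\d+", lambda m: m.group().zfill(width), title.upper())


def build_demo_db(titles):
    """Store each title with a sort key computed once, at insert time."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page (title TEXT, sortkey TEXT)")
    conn.execute("CREATE INDEX page_sortkey ON page (sortkey)")
    conn.executemany(
        "INSERT INTO page (title, sortkey) VALUES (?, ?)",
        [(t, make_sort_key(t)) for t in titles],
    )
    return conn


def first_page(conn, limit):
    """Fetch one page of titles using only a binary ORDER BY on the key."""
    rows = conn.execute(
        "SELECT title FROM page ORDER BY sortkey LIMIT ?", (limit,)
    )
    return [row[0] for row in rows]


titles = [
    "Antonov An-2", "Antonov An-218", "Antonov An-22", "Antonov An-225",
    "Antonov An-24", "Antonov An-26", "Antonov An-28", "Antonov An-3",
]
conn = build_demo_db(titles)
print(first_page(conn, 4))
```

A fixed width is the simplest choice but breaks for numbers longer than the width, and it does not handle decimal points or thousands separators. Changing the key function would also mean recalculating the stored keys, as noted above.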
"Invisible" sort keys have already been implemented for a long time, see http://en.wikipedia.org/wiki/Wikipedia:Categories#Category_sorting . For an example, see how I've fixed the Antonov category with these edits: http://en.wikipedia.org/w/index.php?title=Special:Contributions&go=prev&offset=20070130204422&limit=50&target=Gpvos . It may still be nice to have a way to have MediaWiki do this automatically in the future, but I would consider it extremely low priority.
Can the "Invisible" sort key be used to solve the problem on this page: http://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita ? Currently if you sort by Rank you see 1, 10, 100, 101 which isn't correct. Otherwise, if this is an unrelated problem should I file a separate bug report?
I have fixed this on a MediaWiki installation I run by inserting the line usort($this->articles, 'strnatcasecmp'); as the very first line of finaliseCategoryState(), in includes/CategoryPage.php
(In reply to comment #6) > Can the "Invisible" sort key be used to solve the problem on this page: > http://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita ? > > Currently if you sort by Rank you see 1, 10, 100, 101 which isn't correct. > Otherwise, if this is an unrelated problem should I file a separate bug report? just to add a +1 to flaxter@gmail.com's comment, I noticed the same bug here: http://en.wikipedia.org/wiki/Energy_density#Energy_densities_ignoring_external_components sorting by Energy density by volume (MJ/L) gets you a couple of interesting sequences, including 43.5, 5.6, 6.02, 72.4 75.1, 8.8, 83.8, 9 38.2, 4.633016x10^104, 40.8 I can see mishandling the exponential value, but am confused as to why it didn't end up between 4 and 5, instead of 38 and 40. The ordering doesn't even make sense. At first I thought it was just ignoring the decimal place, but that doesn't even work for any but the first string I copied. I'm no coder, so sorry, I can offer no suggestions as to fixes, but it does appear to be a common thing with the sorting.
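The sequences quoted above are what plain character-by-character string comparison produces. The short Python sketch below shows this. It does not claim to reproduce how the sortable-table script actually works, and the helper `parse_value` and its handling of the "x10^" notation are assumptions made for illustration.

```python
values = [
    "9", "83.8", "40.8", "5.6", "4.633016x10^104", "72.4",
    "38.2", "8.8", "43.5", "6.02", "75.1",
]

# Plain string comparison: characters are compared left to right,
# and "." sorts before every digit.
as_text = sorted(values)


def parse_value(text: str) -> float:
    """Hypothetical parser: treat 'x10^N' as scientific notation."""
    return float(text.replace("x10^", "e"))


# Comparison by numeric value.
as_numbers = sorted(values, key=parse_value)

print(as_text)
print(as_numbers)
```

Under string comparison "4.633016x10^104" lands between "38.2" and "40.8" because "3" sorts before "4", and "4." sorts before "40". Sorted by value, it moves to the very end of the list.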
(In reply to comment #8) > (In reply to comment #6) > > Can the "Invisible" sort key be used to solve the problem on this page: > > http://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita ? > > > > Currently if you sort by Rank you see 1, 10, 100, 101 which isn't correct. > > Otherwise, if this is an unrelated problem should I file a separate bug report? > > just to add a +1 to flaxter@gmail.com's comment, I noticed the same bug here: > http://en.wikipedia.org/wiki/Energy_density#Energy_densities_ignoring_external_components > > sorting by Energy density by volume (MJ/L) gets you a couple of interesting > sequences, including > 43.5, 5.6, 6.02, 72.4 > 75.1, 8.8, 83.8, 9 > 38.2, 4.633016x10^104, 40.8 > > I can see mishandling the exponential value, but am confused as to why it > didn't end up between 4 and 5, instead of 38 and 40. The ordering doesn't even > make sense. At first I thought it was just ignoring the decimal place, but that > doesn't even work for any but the first string I copied. > > I'm no coder, so sorry, I can offer no suggestions as to fixes, but it does > appear to be a common thing with the sorting. This bug is about category sorting not about the table "sortable" script.
It seems like this has been fixed (http://en.wikipedia.org/wiki/Category:Antonov_aircraft sorts correctly). I would find it hard to believe that it hasn't been - I posted a fix for this two and a half years ago!
(In reply to comment #10) > It seems like this has been fixed > (http://en.wikipedia.org/wiki/Category:Antonov_aircraft sorts correctly). That category is using custom sortkeys with 3 digit 0-padding (see http://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmprop=sortkey|title&cmtitle=Category:Antonov_aircraft&cmlimit=max ). Thus the behaviour there does not indicate the bug is fixed. The bug is still present. Re-opening > I > would find it hard to believe that it hasn't been - I posted a fix for this two > and a half years ago! Where?
> > I > > would find it hard to believe that it hasn't been - I posted a fix for this two > > and a half years ago! > > Where? comment 7 above.
That doesn't work fully. It does fix the order within a single view of a category page, but the way it's broken up between next/prev boundaries doesn't change. As a result, the last entry on one page of a category may not be followed, in sort order, by the first entry on the next page, which would just be weird. Thus I don't think we should do that.
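The page-boundary problem can be reproduced with a small Python sketch. It is a simplified model, not MediaWiki code: `sorted()` on the raw titles stands in for the database's binary ordering, `natural_key` is a rough stand-in for PHP's strnatcasecmp, and the page size of 4 is arbitrary.

```python
import re


def natural_key(title: str):
    """Split a title into text and number parts for natural-order comparison.

    re.split with a capture group puts text at even indices and digit
    runs at odd indices, so the parts always compare type-to-type.
    """
    parts = re.split(r"(\d+)", title)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def pages_in_binary_order(titles, page_size):
    """Cut the list into pages the way a binary-sorted query would."""
    ordered = sorted(titles)
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]


def resort_each_page(pages):
    """Apply the natural sort only within each already-fetched page."""
    return [sorted(page, key=natural_key) for page in pages]


titles = [
    "Antonov An-2", "Antonov An-218", "Antonov An-22", "Antonov An-225",
    "Antonov An-24", "Antonov An-26", "Antonov An-28", "Antonov An-3",
]
pages = resort_each_page(pages_in_binary_order(titles, page_size=4))
for number, page in enumerate(pages, start=1):
    print(number, page)
```

Each page is in natural order internally, but page 1 ends with An-225 while page 2 starts with An-3, so the listing as a whole is still out of order.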
Hmm, the ICU library seems to support natural number sorting. I have not tested it, though. It may be possible to implement this as a custom collation.