Last modified: 2009-06-15 20:33:20 UTC
Currently, categories sort all uppercase letters before all lowercase letters. For example, in http://en.wikipedia.org/wiki/Category:Bo-Bo_locomotives , the article "VR Class Sr2" is listed before "Victorian Railways E class (electric)". This is especially problematic in categories where abbreviations such as "SSX" or "NBA" are common. Logically, uppercase letters should sort the same as their lowercase equivalents. I understand this happens because category sorting uses Unicode ordering, but would it be possible to (essentially) say that "A = a" so that they sort correctly? The current guidelines at http://en.wikipedia.org/wiki/Wikipedia:Categorization#Using_sort_keys imply that most articles would need a DEFAULTSORT key to fix this, but there is resistance to adding DEFAULTSORTs that really shouldn't be needed.
I believe the problem is that the sortkey is compared as binary data, so capital letters come before lowercase letters. Sorting with a UTF-8 collation would fix it, but Wikimedia is still using MySQL 4, which I don't believe supports that. Short of upgrading to MySQL 5, this could be partly fixed by forcing sortkeys to lower case before saving them to the database, but that could break other things.
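To illustrate the difference, here is a minimal Python sketch (not MediaWiki code) that sorts the two titles from the report first by their raw bytes and then with case folded. The variable names are made up for the example, and str.casefold only stands in for whatever normalisation or collation the database would actually apply.

```python
# Titles taken from the example in the original report.
titles = ["Victorian Railways E class (electric)", "VR Class Sr2"]

# Binary ordering: compare the raw UTF-8 bytes.
# "R" (0x52) is smaller than "i" (0x69), so "VR ..." sorts first.
binary_order = sorted(titles, key=lambda t: t.encode("utf-8"))

# Case-insensitive ordering: fold case before comparing, so "A" == "a".
# This approximates lowercasing the sortkey before it is stored.
folded_order = sorted(titles, key=str.casefold)

print(binary_order)
print(folded_order)
```

With binary ordering "VR Class Sr2" comes first, as in the category today; with folded keys "Victorian Railways E class (electric)" comes first.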
Gotcha... I'm guessing that MySQL 5 would be way too big a jump at this point, right?
Is this a dupe of something? Bug 164 comes to mind.
It's a "sort by something other than Unicode character point" bug, so yes, I'd say so.

(In reply to comment #2)
> Gotcha... I'm guessing that MySQL 5 would be way too big a jump at this point,
> right?

It's in the works. It's been in the works for a while. It will probably still be in the works for a while to come :D
*** This bug has been marked as a duplicate of bug 164 ***