Last modified: 2014-11-17 09:21:37 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T2164, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 164 - Support collation by a certain locale (sorting order of characters)


Summary:	Support collation by a certain locale (sorting order of characters)

Status:	RESOLVED FIXED

Product:	MediaWiki
Classification:	Unclassified
Component:	Internationalization (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal major with 109 votes (vote)
Target Milestone:	---
Assigned To:	Aryeh Gregor (not reading bugmail, please e-mail directly)

URL:	http://www.mediawiki.org/wiki/Bugzill...
Whiteboard:
Keywords:	i18n, schema-change, upstream

Duplicates:	353 608 1304 2489 2602 2628 2818 3343 4622 4963 6928 9247 9614 10771 10846 10966 11322 11503 13093 17260 18599 19197 19279 19967 24142 24953 25129 (view as bug list)
Depends on:
Blocks:	unicode 6948
	Show dependency tree / graph

Reported:	2004-08-18 17:40 UTC by Peter Gervai (grin)
Modified:	2014-11-17 09:21 UTC (History)
CC List:	61 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Thai sorting function See comment #61 (2.68 KB, application/zip) 2008-02-08 06:30 UTC, Octra Bond	Details
sorting orders patch for Chinese language (249.66 KB, application/7z) 2009-01-13 11:54 UTC, Philip Tzou	Details
Add an attachment (proposed patch, testcase, etc.)

Description Peter Gervai (grin) 2004-08-18 17:40:41 UTC

National wikipedias still have to bear English sorting order, which orders the accented characters 
outside of normal sort which is "not acceptable" for the languages. 
This problem is most visible in the rise of Categories, but present in every automatic list, like 
Allpages or the list of registered editors.

I would guess either mySQL or PHP can utilise national locales/collation orders which would simply 
require the sites to specify their locale and there we go. Or do I guess it wrong? (I am only 
familiar with perl and postgres, where this works.)

Comment 1 Brion Vibber 2004-09-30 18:22:45 UTC

*** Bug 353 has been marked as a duplicate of this bug. ***

Comment 2 Brion Vibber 2004-09-30 18:23:23 UTC

*** Bug 608 has been marked as a duplicate of this bug. ***

Comment 3 Brion Vibber 2004-09-30 18:27:17 UTC

Just a note: English sort order is not used, but rather the raw order of the characters as they appear in Unicode.

MySQL's collation options are useless as they are server-wide and (through 4.0.x) don't support Unicode. We likely need to add a sort key field that 
we generate ourselves, appropriate for each language.

Probably relevant: http://www.unicode.org/reports/tr10/

Comment 4 Jamesday 2004-09-30 18:58:08 UTC

Or we could wait for the character set support in version 4.1, due for release
within a month or two. http://dev.mysql.com/doc/mysql/en/Charset.html

Comment 5 Antoine "hashar" Musso (WMF) 2004-11-09 15:12:59 UTC

*** Bug 616 has been marked as a duplicate of this bug. ***

Comment 6 Peter Gervai (grin) 2004-12-09 08:43:30 UTC

Any progress on #4 ? I don't use mysql, so I couldn't guess whether it's looking became useful or not.

Comment 7 Klaus-Eduard Runnel 2005-01-21 03:30:06 UTC

I'd like to point out that there can be Categories on national Wikis which
should be sorted using locale matching some third language. E.g. there is a
French-specific Category on Estonian Wiktionary (Vikisõnaraamat).

Comment 8 Peter Gervai (grin) 2005-01-21 13:21:00 UTC

re #7: It is true, it is valid, but I believe it's a minority problem. 

A national wikipedia probably sorts 99% of its lists (categories) by using the
local sorting order.
This is actually possible using per-db collating order.

I don't know how problematic would be to do the sorting in PHP, which would make
it possible to have whatever order anyone please. I still wonder how unicode will
sort under mysql 4.1...

Oh, 4.1 seem to be out for a while now (I know that's not news to you ppl :))

Comment 9 Brion Vibber 2005-01-29 07:12:27 UTC

*** Bug 1304 has been marked as a duplicate of this bug. ***

Comment 10 zhengzhu 2005-03-07 04:05:36 UTC

A simple implementation for this is running at http://tinyurl.com/5l24b

Comment 11 Ævar Arnfjörð Bjarmason 2005-04-09 00:53:06 UTC

Changed the summary to better reflect what the bug is about.

Comment 12 peter green 2005-04-09 01:16:37 UTC

also remember you can set the sort key of an article manually for categories by
using [[category:categoryname|sortkey]]

Comment 13 Peter Gervai (grin) 2005-04-09 09:10:59 UTC

re: #12
It is not about "just" sorting articles in categories, but - for example -
sorting of THE CATEGORY KEYS. You cannot force it, and even if you could, it
would be very very very messy.
But naturally sorting is everywhere in WP, from all_articles to
registered_users... even if the most visible spot is category sort. But that's
another matter, if categories could be fixed that would silence most of us
complainers for a while. :)

Comment 14 Mormegil 2005-06-22 15:36:19 UTC

*** Bug 2489 has been marked as a duplicate of this bug. ***

Comment 15 Brion Vibber 2005-06-29 09:11:52 UTC

*** Bug 2602 has been marked as a duplicate of this bug. ***

Comment 16 Ævar Arnfjörð Bjarmason 2005-07-04 18:08:21 UTC

*** Bug 2628 has been marked as a duplicate of this bug. ***

Comment 17 Andrew Dunbar 2005-07-04 19:54:49 UTC

Bug 2628 was just closed as a duplicate of this bug, but it deals with case-sensitivity which 
is quite independent to locale. In fact on Wiktionary where all languages share a namespace, 
locale sorting doesn't make sense unless it can be set on a per-category basis.

If Bug 2628 was closed because a fix is due which will take care of both locale and case-
sensitivity, then perhaps it would be prudent to mention case in the summary.

Comment 18 Mormegil 2005-07-12 11:09:45 UTC

*** Bug 2818 has been marked as a duplicate of this bug. ***

Comment 19 Arthit Suriyawongkul 2005-07-13 22:00:38 UTC

from Bug 2818,
in order to handle many locale/language-specific sorting perferences,
may be we can have a mechanism that allow contributor to specify the sorting
preference of each category.

may be we can add something like this to the category page,

[[Category-sort-option:sort_option]]

for examples:

[[Category-sort-option:en-GB]]
[[Category-sort-option:ja]]
[[Category-sort-option:th]]


in the case of Wikipedia, this may be optional.
if not presented, use a default one - determined from the language of the Wikipedia.
(in which, may be problematic in the case of languages like English, German,
French, .. which a single language is shared by many countries (locales) .. so
what should be a default ?? -- not a technical question, more "political".)

Comment 20 Antoine "hashar" Musso (WMF) 2005-07-14 03:49:22 UTC

MediaWiki 1.5 only support UTF-8 charset now.

MySQL 4.1 support UTF8
character set support for InnoDB since 4.1.2

Looks like the issue can be assigned to wikimedia servers
and pending for a mysql 4.1.x upgrade

Comment 21 Brion Vibber 2005-07-14 04:30:36 UTC

1) MediaWiki has no support for collations currently.
2) MySQL's UTF-8 support doesn't support 4-byte characters, according to MySQL docs. It's unknown whether this will cause data 
corruption for some pages and needs to be investigated.
3) It's unknown at this time whether we can make use of collations for sorting without that also infecting equality checks (and 
unique indexes).
4) If 3), the affects of this for creating page name conflicts and other issues is unknown.
5) MySQL may or may not have sufficient collation options.
6) We may want this option without requiring particular versions of MySQL.

Comment 22 Andrew Dunbar 2005-08-10 02:10:11 UTC

Since nobody has commented on the capitalisation issue at all here which I
opened as Bug 2628 and mentioned above after that bug was incorrectly closed as
a duplicate of this one, I am now re-opening bug 2628 as it is clearly a
separate issue soluble without any of the issues brought up here.

Comment 23 Rick Block 2005-08-25 02:48:39 UTC

Related to this, although perhaps different enough to warrant a separate bug entry (just a note for now) - "case 
insensitivity" with regard to searching should treat accented and unaccented characters as the same.  This would allow, 
for example, an article named "Zürich" to be found by a "Go" search for "Zurich" without requiring creation of a redirect.

Comment 24 Roman Maurer 2005-08-25 16:52:16 UTC

Related to Rick's comment, please note that while Germans consider ü to be "u
with umlaut", Slovenians consider č to be a separate letter, unrelated to c, so
sorting order should not treat them the same.

Comment 25 peter green 2005-08-31 19:51:18 UTC

one possible option would be to let languages plug in a filter to generate sort
keys from thier text.

basically the functions task would be to take a string of unicode text and
produce a (prefferablly but not nessacerally human readable) string of bytes
from it in any way it wishes such that the output byte strings sort in the
desired way.

e.g. the one for english could be a diacritic stripper ones for languages were
letters with diacritics have a two letter equivilent could make those
replacements etc

Comment 26 Antoine "hashar" Musso (WMF) 2005-09-03 14:11:05 UTC

*** Bug 3343 has been marked as a duplicate of this bug. ***

Comment 27 Ævar Arnfjörð Bjarmason 2005-09-11 19:40:43 UTC

severity => enhancement

Comment 28 Minh Nguyễn 2005-09-12 01:19:07 UTC

Perhaps the per-category sorting, à la Comment 19, should be made available only
as an extension to be installed at Wiktionary. I still support per-locale
sorting, however.

Comment 29 Neil Harris 2005-10-06 11:53:04 UTC

A very crude algorithm for creating sort keys is to normalize the input string
using Unicode normalization form NFKD ("canonical decomposition"), and then
simply delete all combining characters (which are helpfully partitioned into
only four ranges of code points) from the string. 

This quite effectively "de-accents" the text string, as well as converting
ligatures to their component letters, superscript digits to their base digits,
and so on.

For extra thoroughness, you could also case-fold to lowercase at the same time,
and even use some of the extra rules specified in the NAMEPREP RFC to do things
like map ess-zet to "ss".

Perhaps the sort key for each article title (since that's what most sorts work
on) could be stored somewhere in the database alongside the title, eliminating
the need to recompute it each time it is needed? This would also mean that it
would be possible to use native database sorting to get data in sort order,
without needing a post-processing sort.

Yes, I know this is not the same as implementing true internationalized
collation, but it would be a great deal better than the status quo for all
Latin-character based languages, and would not mess up non-Latin scripts (or if
it did, you could simply fix it by only stripping combining marks _after Latin
letters_).

Comment 30 Neil Harris 2005-10-06 12:02:59 UTC

...and since sorting UTF-8 characters will result in the same order as sorting
by Unicode code point, it should work with all versions of MySQL, regardless of
whether or not they are Unicode-aware.

And, after a little Googling, there seems to be a PHP implementation for Unicode
normalization already written as part of MediaWiki: UTFNormal.php, so all you
need to know are the code point ranges for combining marks, which are:

U+0300 - U+036F: Combining diacritical marks
U+FE20 - U+FE2F: Combining half marks
U+1DC0 - U+1DFF: Combining diacritical marks supplement

I think you can safely ignore the combining diacritical marks for symbols.

Comment 31 Neil Harris 2005-10-06 12:19:45 UTC

...and, finally, to avoid the problem of indeterminate ordering for different
strings which map to the same canonical sort key, doing the SQL query as 

 ... ORDER BY article_sort_key, article_title

will ensure that any two articles with distinct titles will always sort in a
well-defined order.

Comment 32 Arthit Suriyawongkul 2005-10-06 13:02:16 UTC

Canonical decomposition / normalization is a nice idea, for a short term (better
than nothing).
But be warned that, although it will work for many European languages,
it will not work with French
(see: http://www.unicode.org/unicode/reports/tr10/#Contextual_Sensitivity )
and some CTL languages, like Thai.

In Thai language, we will considers only consonants in the first comparison,
the vowels in the 2nd run, and tonemarks and other symbols in later runs.

Considering:
(1) กา U+0E01 U+0E32
(2) เก U+0E40 U+0E01
(3) ขา U+0E02 U+0E32

normalization doesn't change any of above terms,
hence terms will sorted like:
(1) (3) (2)
where a preferred one is:
(1) (2) (3)

Comment 33 Pablo Saratxaga 2005-10-06 13:39:16 UTC

As GNU/Linux operating system is being used on the wikipedia
servers; then a possible solution would be to delegate the
alphabetical string sorting to the the system (glibc and locales); there already
are correct LC_COLLATE defintions
for glibc for almost all languages that have a quite good
number of articles; and for those that don't have yet (or the
languages still to grow) an LC_COLLATE definition could be
easily writen and installed.
I can help with that if needed (I work for Mandriva and part
of my work is actually checking and testing the locales, and
write new ones for whatever new language we are requested
support for; I've already corrected or written from scratch
various glibc sorting order definitions already)

Comment 34 lɛʁi לערי ריינהארט 2005-10-15 13:17:18 UTC

¿is this related to?
bug 2900: Implementing your own alphabetical index?

Comment 35 Ævar Arnfjörð Bjarmason 2005-11-30 03:58:59 UTC

(In reply to comment #34)
> ¿is this related to?
> bug 2900: Implementing your own alphabetical index?

no

Comment 36 Brion Vibber 2006-01-15 21:06:47 UTC

*** Bug 4622 has been marked as a duplicate of this bug. ***

Comment 37 Arthit Suriyawongkul 2006-01-25 22:47:59 UTC

What else we need right now for the spec ?
What are still missing ? Where are the points that we have to agreed first ?
 in order to have a spec for an implementation. thx

----

Also, anybody has an idea where this feature is going to be implemented ?
Say, in 1.6 branch or 2.0 branch, for instances.

So we can see if it worth to come up with a workaround (like normalization, etc.
manually) in this while,
or it's better to wait a bit.

Comment 38 lɛʁi לערי ריינהארט 2006-02-11 23:17:16 UTC

*** Bug 4963 has been marked as a duplicate of this bug. ***

Comment 39 James 2006-05-24 00:51:44 UTC

I sincerely hope that for each wikipedia language, there can be a specified sorting order, perhaps in the system messages or in its own category in special pages.  
Each language handles the topic differently, some with un/accented characters considered the same (ā, a, á; for example).  If each wiki could determine its sorting 
order on its own, perhaps as a string of characters separated by semi-colons and commas, such as: a,A,ā,Ā; æ, Æ; b,B; c, C; (and so on).  But if this were solved it 
would be a great help to other non-English wikipedias.

James

Comment 40 mlewan 2006-05-24 15:57:22 UTC

There is an additional problem with Japanese, which has complicated ways of
sorting, so a separate dictionary is needed to decide how to sort the words. You
cannot tell from just looking at the word 中国 whether it shall be sorted under
"naka" or "chuu" as 中 has both those pronunciations (and a few more ones).
Unicode numbering does not work. I have seen an api to sort according to the
"onyomi" or "kunyomi" of the kanji, but that sorting is as useless as to sort
according to the Unicode number. A separate dictionary with the pronunciation
for each word is the only solution, I know of.

Comment 41 Minh Nguyễn 2006-05-25 01:13:13 UTC

I think the Unihan Database contains that kind of information.
<http://www.unicode.org/charts/unihan.html>

Comment 42 mlewan 2006-05-26 07:09:27 UTC

(In reply to comment #41)
> I think the Unihan Database contains that kind of information.
> <http://www.unicode.org/charts/unihan.html>

The Unihan database knows that 中 can be pronounced chuu or naka, but it doesn't
on its own know that in 中国 the only possible Japanese pronunciation is chuu.
On http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=4E2D you find
compound data, but it is from third party sources. The about page states:

"Chinese and Japanese compound data presented in the on-line database come from
the on-line CEDICT and Jim Breen's EDICT projects. These additional data are not
available in the text-file version."

Wikimedia could use CEDICT and EDICT, but it may be a detour to try to retrieve
it from unicode.org.

Comment 43 Sukh 2006-05-31 14:20:53 UTC

Is there any progress on this?  Gurmukhi sorting order on Punjabi language Wikis
(and indeed on all wikis) does not follow byte order so this is a problem.

Comment 44 Brion Vibber 2006-05-31 18:40:12 UTC

This requires major database alterations and possibly a massive, 
potentially breaking MySQL upgrade. Nothing done to date.

Comment 45 Michael Zajac 2006-08-09 15:28:39 UTC

See also Bug 6948: Natural number sorting in category listings

Comment 46 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-08-10 06:01:52 UTC

*** Bug 2628 has been marked as a duplicate of this bug. ***

Comment 47 Matthew W. Jackson 2006-08-29 19:16:43 UTC

If another field is going to be added for sort-key, it might be worthwhile to
let the users override the sort-order on a per-article basis.  Technically
speaking, shouldn't "George Washington" be sorted under "Washington, George". 
Since there's no way for a machine to deduce the proper sort order, leaving it
up to the user is the only way to handle it.

So, if database collation can be used, then one field needs to be added to the
database: A Sort-Title.  If, for whatever reason, database collations are not an
option, two fields need to be added: A Sort-Title and a Sort-Key.  A Sort-Title
field would be unicode test, but a Sort-Key would basically be a binary field
that could be sorted byte-by-byte (or ordinally, or lexigraphically, or whatever
you want to call it).

Adding an override for the sort order to the article would reduce the need for
piped category links, since by default an article would sort by it's sort-title
instead of its title.

Comment 48 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-09-04 15:20:09 UTC

*** Bug 6928 has been marked as a duplicate of this bug. ***

Comment 49 Tisza Gergő 2006-09-07 09:34:38 UTC

*** Bug 6928 has been marked as a duplicate of this bug. ***

Comment 50 Alexander Sigachov 2007-03-10 22:09:50 UTC

*** Bug 9247 has been marked as a duplicate of this bug. ***

Comment 51 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-04-17 16:52:40 UTC

*** Bug 9614 has been marked as a duplicate of this bug. ***

Comment 52 Brion Vibber 2007-08-08 17:25:56 UTC

*** Bug 10771 has been marked as a duplicate of this bug. ***

Comment 53 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-08-17 06:57:20 UTC

*** Bug 10966 has been marked as a duplicate of this bug. ***

Comment 54 Rob Church 2007-08-22 05:43:24 UTC

*** Bug 10846 has been marked as a duplicate of this bug. ***

Comment 55 nav 2007-08-24 13:06:16 UTC

Hi Guys,
Do you have any news about fixing this bug? 
When do you expect that it will be possible to order categories in non-English wikies?
Thanks

Comment 56 Brion Vibber 2007-08-24 14:18:19 UTC

No, it's a big change, no hurry on it.

Comment 57 Stu Derby 2007-09-07 17:22:35 UTC

Could someone confirm that the current collation is still based on the UTF-8 encoding? The last reference to that is a couple years old. I'm particularly interested in the collation on en.wikipedia.org, if it matters. (Not a big deal, but someone is seriously arguing that Swedish collation is correct on the "en" site, in part because certain characters sort "correctly" by default).

Comment 58 Brion Vibber 2007-09-08 18:57:10 UTC

Nothing has changed. No "Swedish" collation is, or has ever, been used.

Binary sorting is used only.

Comment 59 Shinjiman 2007-09-13 09:20:25 UTC

*** Bug 11322 has been marked as a duplicate of this bug. ***

Comment 60 Lejonel 2007-10-01 08:47:59 UTC

*** Bug 11503 has been marked as a duplicate of this bug. ***

Comment 61 Octra Bond 2008-02-08 06:30:34 UTC

Created attachment 4632 [details]
Thai sorting function
See comment #61

This zip archive contains comparing function for sorting Thai words.

Since Thai words cannot be sorted alphabetically under UTF-8 encoding, especially words lead with pre-vowel -- เ, แ, โ, ใ, ไ. Unicode treats these characters like binary so the words always show up at the end.

For example, we have a list: กา, เก, ขา, เข, คา. Sorting under Unicode will result กา, ขา, คา, เก, เข and it becomes 4 indexes of characters (ก, ข, ค, เ) which is not preferable. It should be กา, เก, ขา, เข, คา and should have only 3 indexes (ก, ข, ค). The fact is เ, แ, โ, ใ, ไ were vowels, not consonants. The consonant is located after it.

Secondly, all of tone symbols such as ก็ ก่ ก้ ก๊ ก๋ ก์ (seen as upper diacritics) were also treated as binary and on the same precedence as other consonants. This will result another case of incorrect alphabetical order.

For example, we have a list: กอ, ก่อ, ก้อ, กา, ก่า, ก้า. Sorting under Unicode will result กอ, กา, ก่อ, ก่า, ก้อ, ก้า which is totally wrong. Thai dictionaries commonly do not arrange tone symbol at first. It should be กอ, ก่อ, ก้อ, กา, ก่า, ก้า like I just presented before.

This function was released 15 years ago by a government agency but only few people keep an eye on it. Fortunately, I found this valuable function at http://tinyurl.com/ypot5n (I must shorten the URL because it is too long) then I adapt into PHP and it wonderfully works.

You can directly test PHP scripts. I already demonstrate how to sort (which requires an array). For Unicode framework, please use "thaisort_u8.php" instead of another one.

Comment 62 Arthit Suriyawongkul 2008-02-09 08:53:06 UTC

reply to comment  #33 from Pablo Saratxaga

in that case (specific to MediaWiki installed on Wikimedia servers, for Wikimedia serving purpose, and not MediaWiki software in general), ICU4C/libicu also a good candidate,
as it supports wide range of locales/charsets, including locale data from Common Locale Data Repository (CLDR).
libicu now available in many popular Linux distros.

http://www.icu-project.org/

Comment 63 Arthit Suriyawongkul 2008-02-09 08:59:43 UTC

for MediaWiki software in general,
we might try this new PHP Internationalization extension (Intl), an ICU wrapper for PHP.
Matched to our interests, this extension support locale-sensitive collation (sorting),
and it can do normalization as well.

More features list and info can be found at:

http://docs.php.net/manual/en/intro.intl.php

Comment 64 Mormegil 2008-02-09 11:30:02 UTC

(In reply to comment #63)
> for MediaWiki software in general,
> we might try this new PHP Internationalization extension (Intl), an ICU wrapper
> for PHP.

You don’t want to sort in PHP code. You need the sorting to be done directly in the database. MySQL already has quite good support for locale-dependent sorting (collation), (AFAICT generally much better than what is in CLDR, see e.g. http://dev.mysql.com/doc/refman/6.0/en/charset-general.html) however, MediaWiki (on Wikimedia sites; you can try to use it in your own wiki installation) does not use it.

Comment 65 Octra Bond 2008-02-10 06:46:29 UTC

From #61

Every language of Wikipedia uses UTF-8 encoded database so they cannot sort correctly on some language, especially Thai, and we cannot convert database encoding for just one locale (they will not able to store any useful Unicode characters). So I must present the function to sort after database does. Thai Wikipedia has had this problem since it was founded.

Comment 66 Octra Bond 2008-02-10 06:57:43 UTC

Or another way, could anyone complain about this case and submit my file to the Unicode Consortium insead? Then we (as Thai people) will not worry about it anymore. :D

Comment 67 Mormegil 2008-02-10 09:30:33 UTC

(In reply to comment #65)
> Every language of Wikipedia uses UTF-8 encoded database so they cannot sort
> correctly on some language, especially Thai, and we cannot convert database
> encoding for just one locale (they will not able to store any useful Unicode
> characters). So I must present the function to sort after database does. Thai
> Wikipedia has had this problem since it was founded.

I am not sure I understand your objection. The character encoding and collation algorithm are two separate issues. There is no reason why a database could not be in UTF-8 and, at the same time, sort according to some complicated system. See http://dev.mysql.com/doc/refman/6.0/en/charset-unicode-sets.html for a list of currently supported collations with Unicode encoding in MySQL. Unfortunately, I don’t see Thai in there, so if you want to submit your code to somebody, it should be primarily MySQL. ;-)

(Note that e.g. the Czech collation is quite complicated, too, and it _is_ currently supported by MySQL with Unicode charset. For instance, Aaaaaaa sorts before Aáaaaaa, which sorts before Aaaaaab, i.e. you cannot sort in a single pass.)

Comment 68 Octra Bond 2008-02-11 01:46:58 UTC

I think I understand your point. I might contact MySQL instead. The last question, how we can do with Thai Wikipedia? Is there somewhere to set collation separately (or in Mediawiki however)?

Comment 69 Octra Bond 2008-02-11 01:52:16 UTC

*which is not to set at the database.

Comment 70 Octra Bond 2008-02-11 02:19:02 UTC

If nowhere to set, I must present an optional variable such as "$wgDBcollation" to add collation literals through sql statement. See syntax to collate in MySQL web site. (Sorry if my posts annoy you. I can't edit comments like Mediawiki.)

Comment 71 Octra Bond 2008-02-11 02:26:45 UTC

*Because the fact is one encoding may have more than one collation, like UTF-8 that Mormegil shows me.

Comment 72 Mormegil 2008-02-11 20:25:35 UTC

(In reply to comment #68)
> Is there somewhere to set
> collation separately (or in Mediawiki however)?

Collation can be set at each table, or even column, so that this is “only” a task of altering the database structure a bit. The problem is that this has not really been tested with MediaWiki (AFAIAA), and there could be some problems, especially with non-BMP characters (characters above U+0FFFF), because of limited implementation of Unicode in MySQL (see e.g. http://bugs.mysql.com/bug.php?id=14052). So that I doubt developers would be willing to make such switch on any Wikimedia wiki before much more work and testing would be done there.

I was considering trying and testing a partial interim solution at least for category sorting, where we have cl_sortkey, which is not used for much else besides sorting. So that there should no real problems switching this from BINARY sorting to some specified, which would be a good first step.

Another possibility of implementing this would be to have a copy of page_title column+index (and possibly dtto for some other tables), only with a defined collation and use this column for sorting instead of the original page_title. It would be easier if you could add just another index on the same column with another collation, but you can’t do that in MySQL, IIANM.

(On a sidenote, basic support for database-level Unicode has been in MediaWiki for quite some time, see http://www.mediawiki.org/wiki/Manual:$wgDBmysql5, although it is still marked as “WARNING: THIS IS EXPERIMENTAL!” and not much work (and capabilities) has been put there since, AFAIK.)

Comment 73 Brion Vibber 2008-02-13 00:24:16 UTC

A couple quick notes:

* Wikimedia currently runs MySQL 4.0, which has no native Unicode support.

* MySQL 4.1 and 5.x have not-quite-done UTF-8 support, which is insufficient for storage in the primary tables. In most cases this would be sufficient for a sorting-specific summary table (such as the titlekey table used in the TitleKey extension I whipped up)

* Collation affects not only sorting, but unique key restraints. This makes a MySQL-based collation unsuitable for the primary table fields, but again would be acceptable in a summary table used for sorting.

* Some sites are multilingual, which may make an ideal sort order rather difficult to manage. :)

Comment 74 Tisza Gergő 2008-02-13 22:41:21 UTC

So are we waiting still for Wikimedia to upgrade its MySQL servers? The solutions described in comment 25 and comment 29 (two and a half years ago...) look simple, and they would work with any database, without need for potentially breaking changes.

Comment 75 Mormegil 2008-02-14 11:43:18 UTC

(In reply to comment #74)
> solutions described in comment 25 and comment 29 (two and a half years ago...)
> look simple, and they would work with any database

The solution in comment 25 is not general enough. As I mentioned above, Czech sorting cannot be done in one step, so that you would have to keep _two_ sort keys (and sort on the tuple (key1, key2)), and the older version of the Czech sorting standard required _three_ steps (it has been simplified in the last revision). I have no idea what strange sorting systems are there… Not to mention that we would need to develop our own collation algorithms (That is far from “simple” as you write!), when we have better things to do… (NIH, anyone?)

The comment 29 ''could'' be a bit better than the current solution (although it is only an interim solution), although some people might disagree even with that, I guess.

Comment 76 Tisza Gergő 2008-02-14 12:53:03 UTC

For multipass sorting you just need to generate a key for each pass, and join them together with a space (which is the first character of the ASCII and Unicode alphabet apart from control characters). But most of the time you don't even need that, just transform the word to lowercase, strip out all the accents that don't matter in the first pass, and the words with equal keys will be ordered based on their original names, which is usually just what the second pass would do. For example see here a simple character mapping based sortkey for Hungarian (which also has a two-pass, aaa < aáa < aab sort order): http://translatewiki.net/wiki/Support#section-2fce6e673f06ad27330b562f8836a458

Number sorting (bug 6948) can also be handled this way, for example by adding a larger-than-digits character to every digit-nondigit boundary.

Comment 77 Li-sung 2008-02-21 15:00:11 UTC

*** Bug 13093 has been marked as a duplicate of this bug. ***

Comment 78 Anon Sricharoenchai 2008-04-28 11:01:03 UTC

(In reply to comment #33)
> As GNU/Linux operating system is being used on the wikipedia
> servers; then a possible solution would be to delegate the
> alphabetical string sorting to the the system (glibc and locales); there already
> are correct LC_COLLATE defintions
> for glibc for almost all languages that have a quite good
> number of articles; and for those that don't have yet (or the
> languages still to grow) an LC_COLLATE definition could be
> easily writen and installed.
> I can help with that if needed (I work for Mandriva and part
> of my work is actually checking and testing the locales, and
> write new ones for whatever new language we are requested
> support for; I've already corrected or written from scratch
> various glibc sorting order definitions already) 

1. MediaWiki category sorting does currently depend on MySQL sort function.
   According to, "includes/CategoryPage.php",

<pre><nowiki>
		$res = $dbr->select(
			array( 'page', 'categorylinks' ),
			array( 'page_title', 'page_namespace', 'page_len', 'page_is_redirect', 'cl_sortkey' ),
			array( $pageCondition,
			       'cl_from          =  page_id',
			       'cl_to'           => $this->title->getDBKey()),
			       #'page_is_redirect' => 0),
			#+ $pageCondition,
			__METHOD__,
			array( 'ORDER BY' => $this->flip ? 'cl_sortkey DESC' : 'cl_sortkey',
			       'USE INDEX' => 'cl_sortkey', 
			       'LIMIT'    => $this->limit + 1 ) );
</nowiki></pre>

   which will query sql like this,

<pre><nowiki>
      SELECT  page_title,page_namespace,page_len,page_is_redirect,cl_sortkey  FROM `page`,`categorylinks` FORCE INDEX (cl_sortkey) WHERE (cl_sortkey >= 'my begin page name') AND (cl_from = page_id) AND cl_to = 'my category name'  ORDER BY cl_sortkey LIMIT 201
</nowiki></pre>

   This will have advantage in that, php doesn't need to fetch all page title (for that category) from database.  It just fetch only the titles needed to appear in current displaying page.


2. From my experiment in Ubuntu GNU/Linux, setting LC_ALL or LC_COLLATE doesn't affect php5 sort() function.

<pre><nowiki>
$ LC_ALL=C php5 -r '$a = array("a", "b", "A", "B"); sort($a, SORT_LOCALE_STRING|SORT_STRING); foreach ($a as $b) { print "$b\n"; }'
A
B
a
b
$ LC_ALL=en_US php5 -r '$a = array("a", "b", "A", "B"); sort($a, SORT_LOCALE_STRING|SORT_STRING); foreach ($a as $b) { print "$b\n"; }'
A
B
a
b
$ LC_COLLATE=en_US php5 -r '$a = array("a", "b", "A", "B"); sort($a, SORT_LOCALE_STRING|SORT_STRING); foreach ($a as $b) { print "$b\n"; }'
A
B
a
b
</nowiki></pre>

   While it work in "sort" program,
<pre><nowiki>
$ (echo a; echo b; echo A; echo B) | LC_ALL=C sort
A
B
a
b
$ (echo a; echo b; echo A; echo B) | LC_ALL=en_US sort
a
A
b
B
$ (echo a; echo b; echo A; echo B) | LC_COLLATE=en_US sort
a
A
b
B
</nowiki></pre>

Comment 79 Anon Sricharoenchai 2008-05-09 14:28:27 UTC

(In reply to comment #21)
> 1) MediaWiki has no support for collations currently.
> 2) MySQL's UTF-8 support doesn't support 4-byte characters, according to MySQL docs. It's unknown whether this will cause data 
> corruption for some pages and needs to be investigated.
> 3) It's unknown at this time whether we can make use of collations for sorting without that also infecting equality checks (and 
> unique indexes).
> 4) If 3), the affects of this for creating page name conflicts and other issues is unknown.

1. You can set default collation of the table to "utf8_bin", so its won't infect equality checks.

2. The category sorting can be selected by passing collation via cgi variable, for example,

   http://.../wiki/Category:Abc?collation=czech_ci

2.1 The mysql query will look like,

   mysql> SELECT ... ORDER BY cl_sortkey COLLATE utf8_czech_ci LIMIT 201

2.2 The collation may also be set in Special:Preferences.
2.3 The drop down list for all available collations in Special:Preferences will depend on the database.
    In MySQL 5.x, MediaWiki will query the following command the get the available collations,

   mysql> show collation like 'utf8_%';

> 5) MySQL may or may not have sufficient collation options.
> 6) We may want this option without requiring particular versions of MySQL.
> 

There's no way to not depend on MySQL sorting, which depend on the versions of MySQL.
By not depend on database sorting, MediaWiki will need to fetch all page title
(for that category) from database.

With my solution, it just let the available collations to depend on the database engine.

Comment 80 Brion Vibber 2008-05-09 16:24:45 UTC

If the indexed field doesn't have collation specified, that sorting is going to be done as a filesort -- fetching the entire potentially matching data set, sorting it in a temporary table, and then returning results in order *every time a request is made*.

That's obviously not going to be suitable.

Comment 81 Anon Sricharoenchai 2008-05-12 05:24:11 UTC

(In reply to comment #80)
> If the indexed field doesn't have collation specified, that sorting is going to
> be done as a filesort -- fetching the entire potentially matching data set,
> sorting it in a temporary table, and then returning results in order *every
> time a request is made*.

Do you mean that,

1. select ... order by ... collate utf8_czech_ci limit 201;
2. select ... order by ... collate utf8_bin limit 201;

You mean that 1 will slower than 2, if the default collation is utf8_bin?
You mean that 1 will need to do something in temporary table, if it sort using the non-default collation?

> 
> That's obviously not going to be suitable.
>

Comment 82 Anon Sricharoenchai 2008-05-12 11:20:25 UTC

I've just summarized all ideas from this discussion in, http://www.mediawiki.org/wiki/Bugzilla/164

Please update your ideas there.

Comment 83 Anon Sricharoenchai 2008-05-12 11:27:20 UTC

I've just summarized all ideas from this discussion in, http://www.mediawiki.org/wiki/Bugzilla/164

Please update your ideas there.

Comment 84 Lars Aronsson 2008-09-07 16:04:38 UTC

According to the reference manual, MySQL 6.0 introduces
support for supplementary Unicode characters.
Would that be a solution to this old bug?

MySQL 6.0 is still under development, so it might not
be mature for Wikimedia servers yet. But waiting for
it to become stable might be smarter than developing
an advanced work-around.

Or is the new functionality in MySQL 6.0 still
insufficient, so we will need the work-around anyway?

Comment 85 Mormegil 2008-09-08 08:05:34 UTC

(In reply to comment #84)
> MySQL 6.0 is still under development, so it might not
> be mature for Wikimedia servers yet. But waiting for
> it to become stable might be smarter than developing
> an advanced work-around.

Note that currently, Wikimedia servers still run on MySQL 4.0, even though the current GA release is 5.0. I wouldn’t hold my breath waiting for 6.0… (But yes, I agree that it is not too smart trying to implement our own collation (etc.) in PHP, when that is obviously a DBMS’s job.)

Comment 86 Anon Sricharoenchai 2008-09-08 12:02:40 UTC

(In reply to comment #85)
> 
> Note that currently, Wikimedia servers still run on MySQL 4.0, even though the
> current GA release is 5.0. I wouldn’t hold my breath waiting for 6.0… (But

We could implement this as an optional feature, for those who are using mysql 6.0.
This optional feature will enable any mediawiki server (other than Wikimedia servers) with mysql 6.0, to have the correct sorting order.

> yes, I agree that it is not too smart trying to implement our own collation
> (etc.) in PHP, when that is obviously a DBMS’s job.)
>

Comment 87 Tisza Gergő 2008-09-08 20:25:23 UTC

(In reply to comment #84)
> But waiting for
> it to become stable might be smarter than developing
> an advanced work-around.

Why wait for years when there is a clean temporary solution which is reasonably easy to implement, has no extra resource cost (category sorting is already done by custom sortkeys), and can be easily undone if neccessary? Besides, there is already a work-around: people DEAULTSORT directives in articles (in languages that use non-ASCII letters often, that means pretty much sorting every single page by hand). This is a huge waste of editor time which could be put to much better use, and it would be similarly difficult to remove the manual sortkeys if a superior software solution appeared. Allowing language-specific algorithms to define default sortkeys would be better then the current situation in every single aspect.

Comment 88 Philippe Verdy 2008-10-20 17:10:19 UTC

UCA collation keys is really the solution; a pseudo-UCA-like system is currently used in French Wikitionary:

In almost all articles that contain any other characters that lowercase ASCII letters or space, we have inserted {{clé de tri}}, with an optional parameter (by default this parameter takes the article name itself). This template uses {{DEFAULTSORT:}} but computes itself the sortkey.

When the parameter is used, it must specifiy the article name without diacritics (but the original capitalization of the article title must be kept: it is implicitly converted to lowercase when computing internally the primary collation key. The parameter, when it is used should contain only base letters (but not necessarily ASCII, because it could also be Greek or Cyrillic base letters, or Hebrew, Arabic, Chinese...); it should contain no diacritic, no controls like joiners, apostrophes and ignorables removed, and all separators and punctuation converted to a single space. It can contain base digits (also without diacritics, and non-decimal digits shuold be converted to decimal digits appropriate for the language, e.g. the Roman VIII digit character should be rewritten using Latin letters).

If letters are ligatures, the parameter 1 must be specified with the letters separated withoutthe ligature (for the French article name "œuf", the secondary key to specify in parameter 1 is "oeuf", so that it sorts at letter "o").

For the Catalan middle dot diacritic (that can only follow a letter l), it should be removed from the secondary key, treated as an ignorable, even though it is also a punctuation; this is not much a problem for most articles: only for the article related to the punctuation itself, you can keep it withut relpacing it by a space or removing it). Some intelligence must be taken into accoutn for such specific cases (but don't forget that Wiktionnary is built to contain words in many languages, but they will still be sorted according to the Wiktionnary's home language, here French).

If there are any other characters than letters, digits or spaces, the parameter MUST be specified, otherwise the indexing key will not be computed correctly.
If there are only plain letters, spaces, or digits, but some letters are capitals, the template {{clé de tri}} (which means "sort key" in French) MUST be used any way, even if it takes no parameter, because the indexing key will be computed correctly to take into account the case differences.

Then the index key is computed like this:
* compute the primary key : the parameter 1 (or its default, the article name) converted to lowercase.
* the secondary key is the parameter 1 (or its default, the article name): its use is to sort while still taking the case differences into account (note that ideally, this secondary key should be the parameter 1 written in reverse order from end to beginning to better match the French sort order, but there's still no function in Mediawiki to reverse a string).
* the third-level key is the unfiltered article name, it is used to sort while taking into account other differences: diacritics, differences of punctuation.
* The 3 keys are then concatenated, separated by " !" (a space followed by an exclamation point; ideally it should be a control soft as SOH=U+0001, however MediaWiki does not like this very much and generates garbage in the key that completely breaks the sort. The choice of " !" works correctly for that.

However:
(1) when the secondary key equals the primary key, it is replaced by an empty string: this occurs when the article name only use lowercase letters, or letters in a monocameral script: "a !a !à" becomes really "a ! !à", without the secondary key "a" (which had to be given explicitly in parameter 1 for the full article name "à")
(2) when the third-level key equals the secondary key, it is replaced by an empty string: this occurs when the default value of parameter 1 is used, i.e. it is the article name itself (when the artcile name contains no diacritics and no punctuation or ignorable characters): "a !A !A" becomes really  just "a ! A !" without the third level key (i.e. the full article name "A" here)
(3) key separators at end are dropped: "a !a !a" becomes really "a ! !" according to rules (1) and (2), but then just "a" according to rule three: this means that all lowercase or monocameral article names without diacritics or punctuation but just spaces have the shortest keys.

You'll note immediately that with articles whose titles only differ by case, the sort key for the articles with capitals is longer than the name without capitals. It's not perfect, but it works most of the time (and it is sufficient to sort the French Wiktionnary).

Then Mediawiki is building its own indexing keys from the keys supplied: if they are too long for its internal limits, they are truncated and MediaWiki appends "__UNIQUE__" followed apparently by the revision id or some magic number (if truncation occurs, not all levels will be used, but anyway, at least the sort will work to group a small number of articles in the same displayed page, with just minor sort problems...

I am still not satisfied by the way the secondary key and ternary key are generated: they are too long. Using full UCA (tailored to take into account the language order), we could dramatically reduce their length.

Also we could sort in traditional order for other languages that use digrams or trigrams or whose letters are not ordered in the binary order, such as letters that are foreign to that language and that are placed at end of their reduced alphabet (but note the problem with those letters or digrams, you would have to remap them after letter "z" to get a significant chracter usable as the prefix for sort groups displayed in the lsit or used in the "from=" parameter of categories. This is not a problem for French, including for its two orthographic ligatures "œ" and "æ" (the first one is not frequent but still present in widely used words, the second one is used only in some scientific, medical, or juridic terms, whose origin comes from Latin).

In the French wiktionnary, this works remarkably well to have categories that are correctly ordered and meet user expectation (with just minor sort quirks: you can still find the searched entries just within a very small distance of search while navigating the categories with lots of articles, notably the categories per language that are supposed to index ALL words and expressions used in a language): you can effectively search from some letters (yu just have to type them in lowercase only, however the search form in categories will not enforce it).

Look also in the French Wikitionnary to see what is happening in this system for sorting the "russe" category for the Russian language (using cyrillic), or "bulgare", or hebrew (yes, it works quite well for words that contain hebrew points), or "grec", look at "polonais" to see how Polish words are sorted (We are still looking at corercting the articles where {{clé de tri}} has not been specified where it should).

Note that categories that are populated with both article names and subcategories, the subcategories should not be indexed using {{clé de tri}} but using {{DEFAULTSORT:}} directly (where you'll specify a key starting by a capital, so that all subcategories will be in the head of the list, before all other the articles that you'll navigate with "next 200" or "previous 200".)

This solution is exploitable too in Wikipedia (however Wikipedia does not keep the initial lowercase letters and forces their capitalization.

All this work should be automatable without having to use {{clé de tri}} and figuring out how to specifu the secondary key. It would be great if MediaWiki used such algorithm by default (i.e. in absence of {{DEFAULTSORT:}} and of any explicit key in "[[category:name|key]]" (just "[[category:name]]".

Then it's up to Mediawiki to determine its maximum key length from such algorithm and truncate keys where appropriate to follow them with the "__UNIQUE__id" (which is visible in the "previous 200" and "next 200" links) (It would also be preferable that the unique id, becomes a separate key for the indexing and for the navigation with next/previous 200)

Comment 89 Philip Tzou 2009-01-13 11:54:13 UTC

Created attachment 5669 [details]
sorting orders patch for Chinese language

This patch can provide pinyin, strokes and bopomofo sorting orders for MediaWiki in Chinese language. However, sorting orders of other languages can be created and customised easily by inheriting related functions defined in Language.php, then use "update.php" to update your mysql database.

Comment 90 Happy-melon 2009-06-15 20:33:20 UTC

*** Bug 19197 has been marked as a duplicate of this bug. ***

Comment 91 Chad H. 2009-06-18 15:49:56 UTC

*** Bug 19279 has been marked as a duplicate of this bug. ***

Comment 92 Le Chat 2009-06-18 20:26:14 UTC

With bug 19279 being marked as a duplicate of this, I'm wondering what's the state of play on the (perhaps more specific) issue of the ordering of pages in English Wikipedia - on Special:PrefixIndex for example (where we have the totally counter-intuitive ordering Z - space - a. This bug looks to have been kicking around for five years - is there a solution, or at least partial one, in the pipeline? (Perhaps based on whatever it is that generates case-insensitive lists for type-ahead in the search box.)

Comment 93 Michael Zajac 2009-06-18 23:51:14 UTC

It's even more embarrassing in Wiktionary.  Who wants a dictionary that doesn't list words in alphabetical order?  Over there, some editors have been routing around WikiMedia limitations by building index pages with other tools 

  http://en.wiktionary.org/wiki/Index:English/a1

Comment 94 Jean-Sébastien Girard 2009-06-23 21:29:49 UTC

I've just realised that for any given page in French in en:wikt that has one or more accent (and that is not counting capitals that often needs to be discounted), you need to add a sortkey for every single template that categorise: etymology templates, topical templates, part-of-speech templates etc. etc. You CANNOT use a defaultsort on the off chance another clsoely related language will need a different sorting! It's just ridiculous and start running in the 5+ pointless bits of code very fast.

People complain that Wikipedia pages are complicated to edit? Ah! Come to Wiktionary and get your ass whipped!

Comment 95 Shinjiman 2009-07-08 18:36:24 UTC

There are some data that might be useful in the CLDR from Unicode.org, it might be a good resource to producing the way that are collated.

Comment 96 Brion Vibber 2009-07-20 03:53:08 UTC

*** Bug 18599 has been marked as a duplicate of this bug. ***

Comment 97 Niklas Laxström 2009-07-28 13:12:12 UTC

*** Bug 19967 has been marked as a duplicate of this bug. ***

Comment 98 Chad H. 2009-07-30 23:29:58 UTC

*** Bug 17260 has been marked as a duplicate of this bug. ***

Comment 99 Ihar Mahaniok 2009-09-01 15:11:10 UTC

Hi,

Philip, what's the status of it?

I've asked around, and the most straightforward and the best way to address this seems to use ICU library at http://icu-project.org/. It will auto-magically give correct sorting for all (or most) of the world languages.

It would be nice to know what are the hurdles or obstacles or where I can help with getting this implemented.

Comment 100 Anon Sricharoenchai 2009-09-02 04:55:09 UTC

#88, Does UCA explicitly specify the binary representation of the generated key?

#99,
http://www.mediawiki.org/wiki/Bugzilla/collation_by_locale#Drawback_3
Could anyone confirm, if libicu provide an api to generate the binary representation of the collation key?

Comment 101 Ihar Mahaniok 2009-09-02 15:32:10 UTC

#100, apparently, yes. see http://bugs.icu-project.org/apiref/icu4c/classCollationKey.html. But I don't have direct experience with this.

Comment 102 Anon Sricharoenchai 2009-09-03 06:52:25 UTC

(In reply to comment #101)
> #100, apparently, yes. see
> http://bugs.icu-project.org/apiref/icu4c/classCollationKey.html. But I don't
> have direct experience with this.
> 

Thanks a lot.
Hmm, but seems that this class is deprecated in ICU 2.8, and they recommend to use Collator::getSortKey instead,

http://bugs.icu-project.org/apiref/icu4c/classCollator.html#e524fd43a06d4429e2c76bef35874d4c

Comment 103 Anon Sricharoenchai 2009-09-03 08:24:34 UTC

In case, we are going to use Collator::getSortKey(), the problem will be that,
php-intl seem to not yet implement the direct interface to getSortKey(),
http://docs.php.net/manual/en/collator.sortwithsortkeys.php
:(

Should we first file this bug to php-intl?

Comment 104 Anon Sricharoenchai 2009-09-03 08:54:27 UTC

(In reply to comment #103)
> In case, we are going to use Collator::getSortKey(), the problem will be that,
> php-intl seem to not yet implement the direct interface to getSortKey(),
> http://docs.php.net/manual/en/collator.sortwithsortkeys.php
> :(
> 
> Should we first file this bug to php-intl?
> 

http://pecl.php.net/bugs/bug.php?id=16831

Comment 105 Alexandre Emsenhuber [IAlex] 2009-09-09 12:06:27 UTC

*** Bug 19967 has been marked as a duplicate of this bug. ***

Comment 106 Anon Sricharoenchai 2009-09-18 11:32:43 UTC

(In reply to comment #3)
> Just a note: English sort order is not used, but rather the raw order of the
> characters as they appear in Unicode.
> 
> MySQL's collation options are useless as they are server-wide and (through
> 4.0.x) don't support Unicode. We likely need to add a sort key field that 
> we generate ourselves, appropriate for each language.
> 
> Probably relevant: http://www.unicode.org/reports/tr10/
> 

Brion, what method are we going to implement for key genration?

There are 2 interesting approaches in,
http://www.mediawiki.org/wiki/Bugzilla/collation_by_locale#Proposed_solution_3
* implement the direct C/C++ interface to ICU4C/libicu's Collator::getSortKey() in mediawiki
* use php-intl, for key generation (waiting for http://pecl.php.net/bugs/bug.php?id=16831 to be resolved)

Each approach has its own drawback,
http://www.mediawiki.org/wiki/Bugzilla/collation_by_locale#Drawback_3
* The first approach will make the mediawiki installation to be architecture dependent.  It also introduce an additional dependency (libicu) of mediawiki installation.
* The second approach introduces more dependencies (php-intl extension and libicu) of mediawiki installation.

Which one does you prefer?

Comment 107 Anon Sricharoenchai 2009-09-23 05:03:14 UTC

The key generation module may be implemented as an extension, so that the core mediawiki installtion will not be architecture dependent nor require additional dependency.

This extension will just do the task of key generation, mediawiki core will then put the generated key in the field categorylinks.cl_sortkey.
It also need the script maintenance/updateSortkey.php to be invoked when this extension is added or removed.

How do you think about this idea?

Comment 108 Anon Sricharoenchai 2009-09-23 08:53:12 UTC

(In reply to comment #107)
> The key generation module may be implemented as an extension, so that the core
> mediawiki installtion will not be architecture dependent nor require additional
> dependency.
> 
> This extension will just do the task of key generation, mediawiki core will
> then put the generated key in the field categorylinks.cl_sortkey.
> It also need the script maintenance/updateSortkey.php to be invoked when this
> extension is added or removed.
> 
> How do you think about this idea?
> 

Hmm, we cannot store this generated key in categorylinks.cl_sortkey, since the firstChar() of the original key is needed in rendering the Category pages.

Seems that, we really need to add additional field in table categorylinks.

Comment 109 Anon Sricharoenchai 2009-09-23 09:18:36 UTC

Note that, an extension that generates the sort key should also provide the firstChar() interface.

In some languages, when the sorting is implemented, the behavior of firstChar() must be properly handled.

See: http://www.mediawiki.org/wiki/Bugzilla/collation_by_locale#firstChar.28.29

Comment 110 Le Chat 2009-09-23 10:38:58 UTC

I find it rather disheartening that this very simple thing has not been solved after so many years (how is it conceivable that software designed for encyclopedias and dicionaries can't even get alphabetical order right?)

Here's how I would do it (this is all just abstract thought): have a new column in the table of pages, to contain a unique key: X__Y, where Y is the page title, and X is some language-dependent function of Y which represents Y alphabetically. For example, for English, the function would convert all letters to the same case, convert modified letters (diacritics and so on) to their basic unmodified form, strip out most punctuation, probably strip out spaces too (though there are alternative approaches to spaces).

Then alphabetical indexing of the table (sorry, don't know exactly what I mean, but whatever is used to generate the contents lists at Special:PrefixIndex/AllPages) would be based on this new column, not on the page title itself, and the value in this column would also be passed as the category sort key. This way you solve two problems at once - the contents lists and the categories (the preceding suggestions seem to relate to the categories only). AND it can probably be leveraged to improve search/Go (to avoid the need to create redirects to titles with diacritics and so on).

Don't know if this is any use, but anyway, I hope someone can get these very basic issues solved soon.

Comment 111 Anon Sricharoenchai 2009-09-23 12:56:55 UTC

(In reply to comment #110)
> 
> Here's how I would do it (this is all just abstract thought): have a new column
> in the table of pages, to contain a unique key: X__Y, where Y is the page
> title, and X is some language-dependent function of Y which represents Y
> alphabetically. For example, for English, the function would convert all
> letters to the same case, convert modified letters (diacritics and so on) to
> their basic unmodified form, strip out most punctuation, probably strip out
> spaces too (though there are alternative approaches to spaces).
> 
> Then alphabetical indexing of the table (sorry, don't know exactly what I mean,
> but whatever is used to generate the contents lists at
> Special:PrefixIndex/AllPages) would be based on this new column, not on the
> page title itself, and the value in this column would also be passed as the
> category sort key. This way you solve two problems at once - the contents lists
> and the categories (the preceding suggestions seem to relate to the categories

Storing the sort key in table of pages does not solve the categories case, since the category sort key for each page can be customized to be different from the page's title, like this,

[[Category:My Category|My customized key]]

> only). AND it can probably be leveraged to improve search/Go (to avoid the need
> to create redirects to titles with diacritics and so on).
>

Comment 112 Le Chat 2009-09-23 15:28:35 UTC

Yes, obviously customized keys would still override the stored ones - is there any problem associated with this? (I suppose a similar language-dependent function would have to be applied to customized sort keys as well, to ensure their comparability with the ones from the table, and there might be changeover issues.)

Comment 113 Mormegil 2009-09-23 16:44:19 UTC

(In reply to comment #112)
> Yes, obviously customized keys would still override the stored ones - is there
> any problem associated with this?

But the customized keys are in the page ↔ category relationship, not on the page itself, i.e. one page can have more than one sort key, e.g.:

 [[Category:Lists|Foo]]
 [[Category:Foo|*List of foos]]

So that your point that you would create a new column “in the table of pages”, is not a sufficient solution. But (no offense!) I don’t think your idea adds anything new to this discussion. All this has been debated thoroughly, only minor details seem to remain, plus that simple matter of programming…

Comment 114 Andrew Dunbar 2009-09-24 00:24:15 UTC

Yes both the page titles and the custom sort keys are going to need collation keys.

Ugly or confusing naming could result.

Collation keys are intended for efficient binary comparison and are not guaranteed to be human readable text, therefore they cannot be used for what we already call custom category sort keys. And since the custom category sort keys have to collate into correct sequence with plain old page titles they both need to use the same kind of binary collation key.

Comment 115 Anon Sricharoenchai 2009-09-24 07:51:20 UTC

*** Bug 19279 has been marked as a duplicate of this bug. ***

Comment 116 Anon Sricharoenchai 2009-09-24 08:13:00 UTC

Umm, technically, seem that, this bug has been heavily focused on collation only in category pages.
However, the Bug 19279 (as reported by Le Chat), is about collation of the page title which is employed in Special:PrefixIndex and Special:AllPages, but it is also marked as a duplicate of this bug.

Should we deduplicate them?

As the collation in PrefixIndex/AllPages has not been thoroughly discussed,
could we solve category pages first, and leave the PrefixIndex/AllPages to be resolved later?
Leaving categories and PrefixIndex to have different collation is still make sense?

I think it make sense, since wikipedia in some languages has thoroughly fixed the collation in category pages by using the custom sort key, so its collation is currently different to the PrefixIndex.

Comment 117 Le Chat 2009-09-24 08:36:15 UTC

If two birds can be killed with one stone, then let's take aim. Of course there's a workaround for categories with the sort keys, and sort keys will continue to have other uses (and thus require supporting) even if this bug is fixed. However there is no workaround for PrefixIndex, so that should be worked on with greater urgency, but clearly the same solution could be leveraged to make categories work more intuitively too. And it will potentially have other uses as well, such as making "Go" work without redirects in many cases. Is it maybe time to stop talking about this and for someone who knows what they're doing just to write the code?

Comment 118 Anon Sricharoenchai 2009-09-24 11:13:18 UTC

(In reply to comment #117)
> uses as well, such as making "Go" work without redirects in many cases. Is it
> maybe time to stop talking about this and for someone who knows what they're
> doing just to write the code? 
> 

Writing the code is easy, I think the pending work is to choose the solutions among these,

http://www.mediawiki.org/wiki/Bugzilla/collation_by_locale#Proposed_solution_3

When a final solution is picked, the coding can then be started.

Comment 119 Philippe Verdy 2009-09-29 06:31:02 UTC

I think that you should first implement it in two steps:

== STEP ONE ==

First create a function that computes a collation key from a given string within a given locale, and provide it as a string function, something like {{#collate:locale|string}}

* e.g. "{{#collate:fr|Clé}}" could return something like "cle !élc !Clé", using the " !" substring as a separator between collation elements in each level : note that here the secondary key is inverted, as specified in French.

* The exact collation keys should be made readable by taking, at each collation level, the first character that is part of the same collation group (at a given level).

* If you can find a more compact way to represent collation keys, you could as well accept the TAB character (instead of space+first non-space character) as the collation level separator, if the database will accept to store it and display it, however it will be difficult to use such key within prefix searches and in links, even if they are {{urlencode:d}}.

* This will allow Wiktionnaries to compute collation keys more reliably, but will still also allow using those strings for populating differrently the sort key for specific categories. The locale parameter can just be the language identifier: at least those languages already supported in WikiMedia projects, for which we can at least bind them to a default collation order appropriate for their scripts.

* The collation key computed should only use 3 levels (the fourth one is the string itself in its binary Unicode form, and is implicitly handled, it does not need to be specified or stored).
** Most of the time, the primary level will be readable as if it was using a very simplified script, with ignorable and diacritics characters dropped as well as apostrophes, other separators chanded to a single space, and all characters in lowercase.
** The most complex substring will be that for the secondary level : this is the one that cannot be computed easily today, it should not preserve the case differences and ignorable characters, the differences of accents should be in it (and the one that the French Wiktionnary requests as a parameter for its template "[[Modèle:clé de tri]], but in fact it requests it with its original case, and then generates the secondary key by converting it to lowercase, but uses it directly as the third-level key).
** Most of the time the third level will be very similar to the original string (with its significant case), with just the ignorable characters removed.

* Some considerations should be done for languages with complex scripts: the primary key should at least be able to extract a meaningful first character usable when rendering categories : an initial Hangul syllable can be decomposed to its initial jamo, Chinese ideographs should be mapped to a radical/strokes mapping, so that the radical can be used as the first "character". The Unicode's Unihan database can be helpful.

* For locales that use contractions, the "first" character in a collation group may be a digram or trigram : displayin the content of a category sorted this way should be able to use that digram/trigram as the title, instead of just the first physical Unicode character. We could imagine the a string function would return this "first letter" even if it is a digram/trigram, with something like {{#collatefirst:locale|string}}. For example {{#collatefirst|br|Christian}} would return "ch" because it is a single letter in Breton, and it is the first in the primary collation group that contains also contains "cH", "Ch" and "CH" in Breton, sorted between "c" and the trigram "c'h", the later is also distinct and comes before "d"). Such cases (named contractions in UCA) are frequent and needed in Spanish and Nordic languages.

== STEP TWO ==

* Then, if articles are categorized without any DEFAULTSORT: key and without sort key parameter, use the same function to automatically use this function to generate their collation key but ONLY when displayin the category, using the project's default locale, or with the locale specified within that target category (but don't store the collation key with the article, unless you are ready to have a server update task that will be able to recompute the collation keys of pages and subcategories that have been categorized in it, because the locale of a category could change over time, or could be set much later) :

* Category pages would contain their own option to specify their prefered collation, some thing like {{DEFAULTCOLLATE:languagecode}} in the text of that category page, to change it from the default's project locale/language (this additional and magic keyword would only be meaningful for category pages, and distinct from the string function above).

* But this would not prohibit pages to set their own sort key if they wish so when they categorize themselves in such category.


I am convinced that there's no need to use collation locales according to user's own preferences, the locales should be a property of both the project site and of the target category (which should be specific to a given reference language, so that it can effectively specify itself what is its prefered locale. For this reason, the collation key can be stored as it is today, when assigning different sortkeys for the same page but within distinct categories.

Having the string function {{#collate:language|string}} would still allow to sort some pages in specific groups (like "*" today in many wikis, for special elements that require higher priority in categories with large enough populations).

Comment 120 Philippe Verdy 2009-09-29 06:46:25 UTC

Note with my example sort key format, compatible with collation :

I gave "cle !élc !Clé" as a french collation key for "Clé" (remember, here the level separator is " !", i.e. space+first non space, because MediaWiki drops multiple spaces : this sequence is not needed for any primary level key because the punctuation is normally dropped at this level and replaced by spaces, then sequences of spaces are packed to single one)

But nothing prohibits you to use this format instead: "c le !élc !Clé"

i.e. with an extra space separator between the firstChar() and the rest of the primary-level key.
This space is enough to generate the section titles in categories (so here it will effectively sort in section "c"...


For the Breton example, "Christian" will give this collation key : "ch ristian !naitsirch !Christian"

(note that Breton also uses accents and treat them like in French dictionnaries, in reversed order at the secondary level, i.e. from the last letter or digram or trigram, to the first one), so it will be displayed in section "ch", and not "c".

Comment 121 Stu Derby 2009-11-10 01:42:13 UTC

In comment #116:
> since wikipedia in some languages has thoroughly fixed the collation in category pages by using the custom sort key

No, the custom sort key is still a hacky workaround that's less than thorough; not least because even with a custom sort key, the sorting is still case-sensitive. To achieve the proper case-insensitive sort that a normal reference work uses (in English at least), the en wikipedia is more-or-less using an ad-hoc casing rule - "first-letter uppercase, all others lower" - and there are lots of articles not 100% correctly "fixed" in this way.

Furthermore and FYI, on en, some folk have decided that case-sensitivity in the sort is a feature, not a bug.

Comment 122 Laurent Jauquier 2009-11-16 21:58:22 UTC

5 years old! This major (!) bug must absolutely be fixed. This makes no sense in our indexes! « À » is sorted after « Zèbre » !!

Comment 123 Roan Kattouw 2009-11-16 22:01:11 UTC

(In reply to comment #122)
> 5 years old! This major (!) bug must absolutely be fixed. This makes no sense
> in our indexes! « À » is sorted after « Zèbre » !!
> 

We know. Please don't spam this bug with comments like "I WANT THIS TO BE FIXED!!!". Only comment when you have something technical and relevant to say.

Comment 124 Le Chat 2009-11-17 08:25:28 UTC

If you "know", why do you change the severity level back to "normal"? This is a really fundamental bug - maybe developers living in their ASCII world don't realize it, but for people who write dictionaries and encyclopedias (which is what this software is most famously used for), particularly non-English ones, it really is a major impediment to quality. Can we at least have some information as to what is being done about this (we're told there's no shortage of ready solutions), and roughly when we can expect a fix to be implemented?

Comment 125 Varga Benjámin 2009-11-17 13:26:56 UTC

+1. It really seems to us, that the severity of this problem is not clear enough for those living in "ASCII world". ;) Any solution will be greatly appreciated...

Comment 126 Michael Zajac 2009-11-17 23:47:42 UTC

I agree with le Chat: how can any dictionary or encyclopedia be taken seriously if it can't even alphabetize its headwords or entries?  This is a fundamental requirement.

Comment 127 Alex Z. 2009-11-18 00:01:38 UTC

As Roan said a few comments ago, please only comment when you have something *technical and relevant* to say. We already know that its a problem.

Comment 128 Le Chat 2009-11-18 11:55:25 UTC

It seems that the developers *don't* know that it's a problem; or at least, since it's taking over five years to solve, they don't appreciate how much of a problem it is. When I wrote with something technical and relevant I was told it wasn't required since we already know how to solve this problem - I don't know what else we can do to drive the point home that we *really really* should have a solution implemented. Can a dev at least answer my question about what is being done about this and when we can expect a fix to be made? (Changing severity back to major, in case that has any effect on focusing minds.)

Comment 129 Mormegil 2009-11-18 13:55:55 UTC

(In reply to comment #128)
> I don't know
> what else we can do to drive the point home that we *really really* should have
> a solution implemented.

What you can do to have a solution implemented? Write the code. This is an open-source project, after all…

> Can a dev at least answer my question about what is
> being done about this and when we can expect a fix to be made?

“A dev”? “What is being done”? This is an open-source project developed primarily by volunteers. AFAIK there is no official paid-by-Wikimedia-Foundation project to implement this functionality. You can hardly detect whether some volunteer somewhere spends his time working on this (and I know of no such person). We cannot expect a fix to be made until somebody goes and creates it…

Go implement the functionality, or go and bug the Foundation to throw some money on the problem and assign a paid developer to it. There is no point in debating the severity of it *here* (it’s not like there are bored developers lurking around here and not fixing the bug just because they do not consider it a problem).

OTOH, should the bug really be “ASSIGNED to Philip Tzou”? Is he actively working on solving this bug? I see he attached his patches here, but that is only a part of the solution.

Comment 130 Le Chat 2009-11-18 17:20:05 UTC

Well, there are people who know the architecture and the programming language and can presumably do this very quickly (at least the basic step) if they realize how much of a priority it is. The first step is just to apply a few extra functions to category sort keys with the intention of converting them into collation keys. Once that's in place, people can work on actually writing such functions for their particular languages. Later, when those functions are written, they can be used additionally to generate a proper alphabetically ordered table of pages for use in the contents listings. Or some similar workflow - but there needs to be a plan of action, and that can't be effected by just anyone, only by the devs who are in charge (no use pretending that everyone's equal - only certain devs actually have the power to make anything happen).

Comment 131 Andrew Dunbar 2009-11-19 08:11:38 UTC

> Well, there are people who know the architecture and the programming language
> and can presumably do this very quickly (at least the basic step)

Why do you presume it can be done quickly? It doesn't seem so to me and I
regard this bug to be of huge importance.

> if they realize how much of a priority it is. The first step is just to
> apply a few extra functions to category sort keys with the intention of
> converting them into collation keys.

Which functions would those be? What makes you think category sort keys can
be converted to collation keys?

For instance category sort keys are human-readable and human editable but
collation keys are typically binary and not human readable. Category sort
keys are included directly in the text of a category link such as
[[Category:foo|bar]].

Do to their binary nature collation keys cannot appear here. So you would
need to decide whether to remove all category sort keys or make category sort
keys interact with collation keys that would be added elsewhere.

For collation keys to replace category sort keys you would need to establish
that cateogry sort keys have no legitimate uses other than forcing alphabetic
order in cases where the current order results in nonalphabetic sequences.
I can assure you that people do use category sort keys for other purposes and
some might be vociferously upset if these were removed without discussion.

For collation keys to interact with category sort keys you need to generate
and maintain in the database, collation keys for each page title and for each
category sort key since collation keys must be of the same nature to be able
to compare and hence sort them.

Now Unicode does specifiy a "Unicode Collation Algorithm" (UCA) which we could
and probably should use. It is language agnostic but provides for "tailoring"
for individual languages.

The UCA definitely generates binary keys. Not printable. Not human readable.

UCA keys can be very long. I use them in an offline tool for the English
Wiktionary and initially set their maximum length to 1024, 4 times the maximum
length of a page title. We already had about 10 pages for which 1024 was too
short so I had to set it to 2048! Many people might not like all page titles
and category sort keys to now require 9x their current amount of space in the
database.

UCA does allow for various types of sort key compression however. In which case
we would need to choose one to use since it will not be possible to mix and
match them.

PHP currently seems to have no implementation of UCA. We would need to create
it from scratch, or find a way to use one in C.

For multilingual wikis such as Commons and all of the Wiktionaries just havine
one collation language will not work since users of each language will expect
things to be in the correct order for their language.

For the Wiktionaries this means each category needs a way to declare which
language collation to use and each page needs to declare which subset of possible
language collation keys to generate for that page.

For Commons I'm not sure what the requirements would be but the may differ from
those of the Wiktionaries.

These new fields will need support in the database schema. The ones requiring
multiple language collations will reqire more drastic database changes quite
different from what we now have.

> Once that's in place, people can work on actually writing such functions for
> their particular languages. Later, when those functions are written, they can
> be used additionally to generate a proper alphabetically ordered table of
> pages for use in the contents listings. Or some similar workflow

UCA tailoring would make the particular language collations very easy as long as
we have a decent implementation of UCA that easily works with tailoring.

> - but there needs to be a plan of action, and that can't be effected by just
> anyone, only by the devs who are in charge (no use pretending that everyone's
> equal - only certain devs actually have the power to make anything happen).

Not true. Anyone with commit access can add such code. Myself for instance.
My understaning is that there are not technically any dev in charge at the
moment since Brion stepped down though there certainly are a few such as Tim
who are acknowledged to have a greater understanding of the entire codebase
and hence greater trust, and you definitely want those people to check such
changes and would expect them to revert any premature commits.

Comment 132 Philip Tzou 2009-11-19 08:27:39 UTC

I'm busy with my study and I can't help this until I have passed my exam next year January. I'm sorry to make your guys unhappy, though Chinese Wikipedia also needs such functions. :)

Comment 133 Le Chat 2009-11-19 12:53:22 UTC

We seem to be overcomplicating things now. Of course category sort keys and binary collation keys are different beasts. We just need a couple of stub functions (we - or rather individual projects - can worry about what they should do later), F1 and F2, such that F1(T) means the collation key derived from page title T, and F2(K) means the collation key derived from human-readable category sort key K (the two functions will be similar, but potentially not identical). Then apply them at the appropriate places in the code generating the category listings. There will also need to be a function that retrieves the first letter back from a binary key, as has been noted. All these functions will be language- and project-dependent - no-one is in a position to write all of them, but once the basic (and trivial) framework is in place, people who know about the various languages will be able to work on writing these functions. If a project thinks it needs the option of multiple sort algorithms, then that's pretty simple to implement as well, though starting with just one algorithm per project seems perfectly satisfactory - all the main public-facing projects are language-specific anyway, so even if English Wiktionary contains Czech words, we would still expect them to be sorted in the English order.

Comment 134 Philippe Verdy 2009-11-19 13:51:53 UTC

Collation keys can coexist with custom sort keys.
Custom sort keys are useful and will continue to be useful, to tweak the default collation (which for now is simply a binary ordering of codepoints).

Tdoday, basically, the categories are sorted by [sort key, full page name]. (possibly truncated to a reasonable size using unique identifier).

What we want is to be able to sort by: [collationkey([sort key, full page name]), sort key, full pagename) (also with possible truncation of the whole). 

This will preserve the existing tweaks made in pages when they reference categories, and will help sort all the rest.

Ideally, the collation keys should be generated "on the wild" by the SQL engine itself (because it would allow alternate sort orders, according to user locale preferences or according to web query parameters set by GUI buttons, especially for Chinese where several sort orders are common: sort by Pinyin, sort by radical/strokes, sort by traditional dictionary orders), as part of its supported "ORDER BY" clause for getting the list of article names to display in categories.

But if the SQL engine does not have such support, this must be implemented in the PHP code and collation keys can be stored in a new datacolumn (the extra data column can be added or filled conditionnally : if the SQL engine supports the needed collations, this column can remain NULL to save storage space).

If the SQL engine does not have support for dynamic collations, then the alternate (user locale-based) sort orders will not easy to implement because of the cost that it would require in the SQL client-side (in PHP) for heavily populated categories, where the support for true locale-based collation orders is the most wanted, unless the database can store multiple collation keys (for distinct specific locales): supporting the storage of multiple collation keys for different locales can severaly impact the server performance as it would require an extra join to a separate 1:N table to store the collation keys indexed by (pageid, locale); instead of storing these keys in the same SQL table used for storing the category index.

Additionally, the stored collation keys will sometimes need to be updated (when the CLDR data for locale-tailored collations will be updated or when there will be updated in the Unicode version with new characters): updating a large volume of stored collation keys will require a lot of work, and this can impact the availability of the wiki project, unless the data model includes a versioning system that allows at least two versions for the same locale to coexist for some time, and then allows switching from one version to the next before cleaning up the old collation keys after the collation keys have been updated to the new tailoring.

Comment 135 Philippe Verdy 2009-11-19 14:03:26 UTC

Finally, the proposed patch for Chinese is not sufficient. It just supports a single collation, and is not suitable. It will require lots of works to support other locales.

I really suggest that such patch uses either the SQL-supported collations, when they exist for wellknown locales, or that the collation keys are computed using a strong and well supported design, most probably ICU for C (already available as a extension module for PHP and Perl) also available as ICU for Java (and that can be integrated in some SQL engine backends), that natively includes the support for CLDR locales (and their widely discussed tuned standard tailorings), in addition to easy integration of locales using MediaWiki supported tailorings (using the LDML syntax) for the locales currently not supported in CLDR but needed for the many languages that MediaWiki sites support.

Comment 136 Le Chat 2009-11-19 14:13:17 UTC

I'm assuming the SQL engine doesn't support this (if it did, then it would have been turned on years ago, right?) But I don't understand the emphasis on multiple locales - the main need is just for ''one'' sort algorithm per database (any more would be a luxury). The human->binary algorithm just needs to be applied once when a page is added to a category; I don't see any major performance issues resulting. (Of course there would be inconsistencies for a time after the algorithms have been rewritten, but we wouldn't expect any tweaks to those algorithms to have a major impact on the sort order.)

Comment 137 Roan Kattouw 2009-11-19 14:17:38 UTC

(In reply to comment #134)
> Tdoday, basically, the categories are sorted by [sort key, full page name].
> (possibly truncated to a reasonable size using unique identifier).
> 
This is not true. We're sorting by (sort key, pageID) where the sortkey is whatever the page sets as sortkey (or the page name if not set), and pageID is a unique number identifying the page (roughly proportional to creation time).

> Ideally, the collation keys should be generated "on the wild" by the SQL engine
> itself (because it would allow alternate sort orders, according to user locale
> preferences or according to web query parameters set by GUI buttons, especially
> for Chinese where several sort orders are common: sort by Pinyin, sort by
> radical/strokes, sort by traditional dictionary orders), as part of its
> supported "ORDER BY" clause for getting the list of article names to display in
> categories.
> 
SQL engines probably support this, but using it would result in an unacceptably inefficient query. Only queries sorting by the actual values of fields are efficient, and only if there is an index on those fields.

> But if the SQL engine does not have such support, this must be implemented in
> the PHP code and collation keys can be stored in a new datacolumn (the extra
> data column can be added or filled conditionnally : if the SQL engine supports
> the needed collations, this column can remain NULL to save storage space).
> 
If you sort this stuff in PHP, you need to grab the entire list before you can reliably sort it. Doing that for [[Category:Living people]] has no chance of staying within the memory limit.

> If the SQL engine does not have support for dynamic collations, then the
> alternate (user locale-based) sort orders will not easy to implement because of
> the cost that it would require in the SQL client-side (in PHP) for heavily
> populated categories, where the support for true locale-based collation orders
> is the most wanted, unless the database can store multiple collation keys (for
> distinct specific locales): supporting the storage of multiple collation keys
> for different locales can severaly impact the server performance as it would
> require an extra join to a separate 1:N table to store the collation keys
> indexed by (pageid, locale); instead of storing these keys in the same SQL
> table used for storing the category index.
> 
For supporting multiple collations, a separate table sounds a lot more sane in terms of read performance, as it would suffer less from the problems mentioned above.

> Additionally, the stored collation keys will sometimes need to be updated (when
> the CLDR data for locale-tailored collations will be updated or when there will
> be updated in the Unicode version with new characters): updating a large volume
> of stored collation keys will require a lot of work, and this can impact the
> availability of the wiki project, unless the data model includes a versioning
> system that allows at least two versions for the same locale to coexist for
> some time, and then allows switching from one version to the next before
> cleaning up the old collation keys after the collation keys have been updated
> to the new tailoring.
> 
Expensive writes are better than expensive reads.

Comment 138 Le Chat 2009-11-19 15:27:02 UTC

>If you sort this stuff in PHP, you need to grab the entire list before you can
>reliably sort it. Doing that for [[Category:Living people]] has no chance of
>staying within the memory limit.

Not if you compute the collation key when you add a page to a category, and index it based on that key. (Just as you currently index it based on the human-readable sort key.)

Comment 139 Philippe Verdy 2009-11-19 15:51:05 UTC

#88,#100,#101 "Does UCA explicitly specify the binary representation of the generated
key?"

yes and no:
- Yes there's an interface to retrieve the computed sort key.
- But no: it is not stable across versions of Unicode (which can add characters in the DUCET), of ICU (that can change the representation), and of CLDR (that can modify also the locale-dependant tailorings).

In other words : precomputed (stored) collations will need to be versioned too, any upgrade will require recomputing the collation keys (but a real problem on large wikis, if this is implemented in the SQL native engine (because the database index will have to be FULLY rescanned and updated, which may take a considerable down time on large wikis).

This will be much less problematic if the colaltion keys are not part of the SQL engine itself, but stored in a separate table that can store multiple keys for distinct collations, because it can also be used for storing successive versions. In that case, recomputing the collation keys can be deferred, and once a category is finished, the collation version can be updated in the category, and then the collation keys for the previous version can be deleted, and then the next category can be handled. This will imply NO downtime on the server, as new collations can be added on the fly (and separately for each category).

Consequence: add to the MediaWiki data model a table of supported locales-collation (those that can be specified in the site-wide default) with their supported version. Attach a unique id for the versioned collation, and create a separate category collation table containing the collation keys.

Conceptually (some details omitted) :

CREATE TABLE collations(
  coll_id INTEGER NOT NULL,              -- primary key for this table
  coll_name VARCHAR(32) NOT NULL,        -- e.g. "root", "fr", "en-US", "en-GB", "zh-Latn", "zh-Bopo", "zh-Hans.radical-strokes"
  coll_isprefered SMALLINT NOT NULL,     -- identifier of the version
  coll_version VARCHAR(32) NOT NULL,     -- unique description of the version
  PRIMARY KEY (coll_id),
  UNIQUE KEY(coll_name, coll_version),   -- strong constraint
  INDEX ON (coll_name, coll_ispreferred) -- optional, for fast retrieval of the prefered version of a given collation
);

CREATE TABLE categorysort (
  cl_id INTEGER NOT NULL,   -- in fact, the primary key of the categorylinks table
  coll_id INTEGER NOT NULL, -- primary key of the collations table
  cs_key VARCHAR(255),      -- in fact a binary value, computed from ICU's Collator:getKey(UString).
  PRIMARY KEY(cl_id, coll_id)
);

Then no need to change the categorylinks table, which will continue to store the full pagename, and the custom sortkey.

----

To support the firstChar() conceptual API, each entry in the collations table above would also need another table containing the possible lowest strings (in collation order) that use the same weight value at the primary level:

CREATE TABLE collations_headings(
  coll_id INTEGER NOT NULL, -- primary key of the collations table
  ch_weight INTEGER NOT NULL, -- primary collation weight value
  ch_cluster VARCHAR(32) NOT NULL -- single default grapheme cluster from the first string starting with this primary weight. 
  PRIMARY KEY (coll_id, cf_cluster)
)

One problem is that ICU does not currently contains such a list of default grapheme clusters suitable for all primary weight values in each collation. Is there a way to generate it anyway? Note that some primary weights can sometimes only exist with multiple characters

E.g. "ch" is a default grapheme cluster in the Breton collation, LC_COLLATE=br. It makes no sense in Breton to mix "c" and "ch" together under the same heading, given that they sort separately. The same thing occurs in many languages (e.g. Spanish, Swedish...).

However it probably does not occur (?) in the DUCET (I did not verify this assertion), so may be we can just avoid storing all possible Unicode characters to convert them to their default primary heading, and instead we just have to store the graphemes specified in the examplar set for the language (or other graphemes that are explicitly present in the locale specific tailoring rules.

The other graphemes usable as first char can then be taken automatically generated from the first Unicode character present in the Mediawiki custom cl_sortkey (whose default is the full pagename, including the namespace name localized to the default locale of the wiki), according to the DUCET which is shared as the base of all locales.

Comment 140 Philippe Verdy 2009-11-19 15:59:48 UTC

#138: I fully agree. Using stored collation keys will remove the dependency with the database support, and will also avoid the problem of downtime when upgrading the collations and recomputing the categorylinks SQL index to migrate the collation keys...

I still defned the concept of multiple collations in parallel, if only just for supporting the migration of collation rules.

In addition it will offer users preferences for their preferred collation order, and will also allow to view the same category using different orders, without additional processing (as long as the category page itself, or the site-wide default collations, specify the collations as supported by default)

Comment 141 Philippe Verdy 2009-11-19 16:09:07 UTC

#137 From Roan Kattouw 2009-11-19 14:17:38 UTC
>(In reply to comment #134)
>> Tdoday, basically, the categories are sorted by [sort key, full page name].
>> (possibly truncated to a reasonable size using unique identifier).
>> 
>This is not true. We're sorting by (sort key, pageID) where the sortkey is
>whatever the page sets as sortkey (or the page name if not set), and pageID is
>a unique number identifying the page (roughly proportional to creation time).

I was speaking conceptually. The "pageID" is absolutely not conceptual and makes absolutely no sense lingusitically for most users, for sorting pages having the same sort key specified. We really want these pages to be sorted according to the specified sort key (when it is present), then by the full page name; and both using the same locale-dependant collation order.

The pageID was just added to make sure that the sort key was unique (actually this is not a requirement), and for allowing navigation from page to page without missing entries or duplicating them across pages, but it is definitely not part of a good sort key (this is the only reason why we want a unique id there, but in fact it is redundant and is an alternate binary representation of the full pagename which is also unique, but which is completely unsuitable for collation).

Comment 142 Roan Kattouw 2009-11-19 16:40:25 UTC

(In reply to comment #141)
> The pageID was just added to make sure that the sort key was unique (actually
> this is not a requirement),
It is a requirement software-wise. If two categorylinks entries are indistinguishable, category paging will break.

> and for allowing navigation from page to page
> without missing entries or duplicating them across pages,
Exactly.

> but it is definitely
> not part of a good sort key (this is the only reason why we want a unique id
> there, but in fact it is redundant and is an alternate binary representation of
> the full pagename which is also unique, but which is completely unsuitable for
> collation).
> 
Yeah the full page name could also be used as a tiebreaker (from a technical standpoint we probably want a (namespace, title) pair instead, where namespace is a numeric value).

Comment 143 Philippe Verdy 2009-11-19 17:58:47 UTC

(in reply to comment #142 from Roan)

>The pageID was just added to make sure that the sort key was unique (actually
>> this is not a requirement),
> It is a requirement software-wise.

Absolutely not, it is not required FOR SORTING (this was explicitly stated in my previous statements, dont cut it where you think it arranges you). It does not have to be "software-wise", because exactly it is completely undesirable for sorting.

You are confusion uniqueness requirements (for navigation and indexing for fast SQL retrieval) with sort order requirements. The need for the separate id is completely orthogonal, because the ID is just a small alias strictly equivalent to the full page name that it represents (i.e. to the unique identifier created by the namespace and the page name, which are what we really want to show and sort, but not what we need to avoid duplicated or missed entries in the "next 200"/"previous 200" links). Such page id should always remain invisible in the GUI interface (only visible in the URL as a query parameter for the next/previous 200 links), and should be COMPLETELY ignored by the SQL "ORDER BY" clause, even if it MUST still be part of the SELECT'ed columns in the SQL query.

----

Couldn't the namespace be part of the [[category-sort-option:]] or similar ? Currently they are sorted by name (and the initial of the namespace name is also used to generate the headings : it is convenient for some categories, but not all, and in fact, the namespace could be used to generate other subheading levels, or given lower prirority when compared to the unprefixed pagename.

I see no valid option for sorting by numeric namespace id (except for sorting by namespaces, independantly of their name), and at least it should never be made visible in the "first char" used in section heandings...

So may be another [[category-ns-option:]] could be set in category pages (but this will probably be part of another proposal for enhancement).

Comment 144 Philippe Verdy 2009-11-19 18:12:31 UTC

Note that in SQL, the ORDER BY clause needs NOT unique sort keys.
Not even for sort stability, because the SQL engine will ensure the sort stability itself from the default store order in the index (when there's a qualifying index for this sort order), or from the implicit positional "rowID's" automatically for each SELECT'ed row added to the temporarily created table/index during the query execution.

Some SQL engines (like Oracle, Sybase, Informix, but probably not all index formats supported in various versions of MySQL) also store the rowID's of the component tables from which columns are extracted in the SELECT clause, in order to allow the columns under the read cursor to be updatable, but won't do that for columns computed from expressions, which are not updatable. This just requires opening the cursor "FOR UPDATE" (and sometimes, you can also limit the list of table columns that need to be updated within the cursor loop, which remains open during the transaction, in order to minimize the extra storage for dependant rowID's).

Comment 145 Roan Kattouw 2009-11-19 22:13:25 UTC

(In reply to comment #144)
> Some SQL engines (like Oracle, Sybase, Informix, but probably not all index
> formats supported in various versions of MySQL) also store the rowID's of the
> component tables from which columns are extracted in the SELECT clause, in
> order to allow the columns under the read cursor to be updatable, but won't do
> that for columns computed from expressions, which are not updatable. This just
> requires opening the cursor "FOR UPDATE" (and sometimes, you can also limit the
> list of table columns that need to be updated within the cursor loop, which
> remains open during the transaction, in order to minimize the extra storage for
> dependant rowID's).
> 

Let's not make assumptions about individual DB engines here, but use SQL queries that unambiguously state what we want without relying on such behavior.

About the whole page ID thing: as long there's a sorting tiebreaker of some kind making every entry unique (not necessarily visibly unique in the UI, just unique to the software), paging will be fine. Whether that tiebreaker is the page ID or the page title or the water level in the San Francisco Bay at the time of the page creation measured in 128ths of inches doesn't matter all that much from the implementation side, although of course only the second is remotely understandable for users.

Comment 146 Philippe Verdy 2009-11-20 00:11:53 UTC

I did not make assumptions about SQL engines, because this message exactly says the opposite: there are varieties there, and Mediawiki could/should also work with other SQL backends then just MySQL, even if it is open-sourced, but mainly supported now by Sun which has its own strategies on it.

But I wanted to really note that the page id had no use for sorting results, given that we really want to sort by full page name, as a single entity or split in two parts if we consider the namespace...

And may be into more parts, if we consider the subpage names that are also part of the full page name by specially tailoring the "/" character in the collation order so that "a/c" will still sort with sort with "a", and not after "ä" or "a!", by giving a less than minimum primary weight to the "/" character used as subpages separator (a value even before characters with default-ignorable primary weight values).

Another way to perceive multicolumn hierarchical sort keys is to think about them as if they were separated in a single string by an implicit (non-coded) character with a non null but less than minimal primary weight value lower than default ignorable characters. In UCA impelmentations there a reserved weight for this function (fully ignorable characters get the null primary weight which does not participate at all to the computed collation key, this invisible character takes the next weight, and then the default-ignorable characters come just after, followed by combining characters, then whitespaces, and then generally in the order: punctuation and symbols, numerals, letters and ideographs).

This implicit separator may be the same as the one for separating series of weight at each level within the same source string (but many UCA implementations, including ICU, do not need to encode this separator in the computed multilevel collation key, as they allocate the collation weights in non overlapping numeric ranges where the highest one is used for the primary weights ; but they still maintain a reserved value for easily computing multicolumn sort keys by simple concatenation with the encoded binary primary weight for this the column separator).

Note also that in UCA, some groups of distinct strings, whose collation key are computed at the maximum level (generally level 4 with the DUCET, but possibly longer in collations tailored for some complex languages), can still have the same collation key: this will be true if the stricts contain different characters, but they are still canonically equivalent (i.e. they have the same NFC form). One string in each subgroup will be in NFC form, the others are in alternate non canonical forms and may even be longer):

It is not a problem for language-sensitive collation, but effectively this sometimes requires adding a final binary comparison between the Unicode-encoded strings, to make sure that the sort order will be stable across database updates (such comparison needs not be stored in the collation key itself, given that we will still retreive the full page name from the database, for displaying them, and not just the collation key which is just used in the ORDER BY clause.

Comment 147 Andrew Dunbar 2009-11-20 01:50:55 UTC

MySQL does support CLDR/UCA collation and I believe it to be fast. The reason we don't use it is that MySQL does not support the full range of Unicode whereas MediaWiki does. MySQL is limited to the BMP (Basic Multilingual Plane). I believe for this reason MediaWiki does not even use any of MySQL's support for UTF-8 but instead stores all text as binary and handles UTF-8 issues itself.

The feature request for non-BMP support in MySQL seems to be this one marked low priority with only 2 people watching it: http://forge.mysql.com/worklog/task.php?id=1213

Comment 148 Philippe Verdy 2009-11-20 04:25:06 UTC

May be it's low in their priority, but with the development of Unicode and the now ongoing efforts to add now many more characters in plane 1 (notably lots of symbols, including common things like circled letters and numbers) and 2 (ideographs) for modern use, this position will become unmaintainable, and the need for extending the MySQL UTF-8 support will increase.
After all, MySQL is now controled by Sun, which is an active member of Unicode Technical Commity, and even Sun will want a better interfaction with its Java platform that has excellent Unicode implementation and is used as a reference platform (even before the newest algorithms are ported to C/C++). ICU is also supported by IBM that has a strategic need in C/C++, Java, and intergration in web services (both on the server- and client-side of applications).

Here also IBM wants itegration with its dB2 SQL platform but has siognificant offers over MySQL. Both companies (also Apple, Oracle, and many others) are highly engaged in the Unicode development process, internationalisation, and want support for their platforms (including in China, where currently MySQL will not be an option, as long as full support of Unicode will not be there).
in fact, extending the support to the missing plnes is now a small effort in comparison to what has already been made by them. And another refernce platform (Perl) can also help generate the missing code or data tables that are needed (the PHP platform makes no efforts itself, as its Unicode support completely depends on ICU now, which is now a reference library that is becoming nearly universal on all platforms, with the except of Windows where Microsoft has rewritten everything in its CLR for .Net). May be MySQL will sooner or later replace its own implementation simply by integrating ICU, to save the collaborative efforts...

The only seriously missing development for collation for now is on the client side: we still lack a good support of collation in Javascript (which just as a string.localeCompare(string) method with an unspecified collation table, and incompatible implementations across browsers with lots of bugs.): the few demonstrations that I have seen need to use server-side components to actually collate or process the Unicode data through AJAX! I think that this will come when JSP server-side applications will start to want Collators without depending on native libraries (that cannot be deployed easily from servers to servers, or over computing grids). Collator factories should become as universal in I18N framewaorks, as it is now for translation resource bundles, date/time formats, case mappings, encoding convertors, and regular expressions.

So are there voluntaries to implement it in MediaWiki using its native PHP platform (through the intl10n library), independantly of MySQL (which is just a backend, from which MediaWiki should not heavily depend exclusively).

Comment 149 Andrew Dunbar 2009-11-20 11:05:53 UTC

I've added some feedback on the MySQL bug. I urge anyone here who cares about this bug to also pay attention to that bug. Perhaps by working on their code, perhaps voting for the bug, perhaps providing further helpful feedback. Perhaps they will raise its importance if they see it is important for MediaWiki and Wikipedia - a top 5 website. I believe MediaWiki and MySQL have worked together in the past.

http://forge.mysql.com/worklog/task.php?id=1213

Comment 150 Peter Gervai (grin) 2009-11-20 15:09:45 UTC

Well I didn't want to chime in (you seem to be able to shout at each others without my assistance :-)) but please do not forget that the solution for MEDIAWIKI (as opposed to Wikipedia) ought not base on Mysql specific features. 

If there are two approaches, and one of them is database backend agnostic then that one should be preferred.

Of course I didn't want to mention that I believe code speed (resource hunger) is a real factor here, since this is accessed frequently.

By no means I want to speak against helping Mysql, since it's a nice open source project. I just use postgresql, that's all. :-)

Comment 151 Philippe Verdy 2009-11-20 15:43:56 UTC

Anyway, the backend compatibility layer that maps the SQL into the SQL dialect spoken by the SQL engine can still adapt itself : if there's a true support for collation supported by the SQL backend, it can be used instead of the client-side (in PHP) implementation in the MediaWiki code itself, if it provides additional performance benefits and stronger security. If there's no real difference, then we can simply abandon the SQL-based collation (even if its supported) and use the PHP implementation everywhere

This will simplify the design of the SQL-backend adaptation layer within the MediaWiki code.

This will allow easier integration of other backends, such as Postgres (suggested in Comment #150), or Oracle, Sybase, Informix, MSSQL, or other backends available on Windows servers through ODBC, or others accessible through JDBC via a PHP-to-Java adaptation layer... if the PHP code used in MediaWiki can also run from within a Java-based JSP application server or .Net-based ASP application server, or from within other newer VMs that are already being developed to support both the Java and .Net environments simultaneously. There are very exentive research to make all these VMs compatible with each other (and to ease the migration with mutual compatibility, and with transparent deployment facilities in heterogeneous environments like computing grids and cloud computing).

Note that in some future, even the WMF projects could benefit a lot of such facilities, when its servers will be virtualized to scale better with reduced costs, because we all know that the WMF projects need more and more money each year to support its ever growing databases and audience, and the increased need for supporting better internationalization. It is highly strategic (and should also be discussed in the WMF Strategy wiki) to reduce the technical  dependencies (also because there's now a demonstrated need for open-projects to work together, such as within the Open Alliance which has lots of contents to share with the WMF, thanks to their compatible licencing schemes).

Comment 152 Anthony 2009-11-25 05:42:19 UTC

Peter, for PostgreSQL you can ignore all the hacks and workarounds and just use the built in collation.

psql (8.4.1)
Type "help" for help.

a=# create table category (title varchar(255) not null primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "category_pkey" for table "category"
CREATE TABLE
a=# insert into category values ('Apple');
INSERT 0 1
a=# insert into category values ('Banana');
INSERT 0 1
a=# insert into category values ('aaa');
INSERT 0 1
a=# insert into category values ('banana');
INSERT 0 1
a=# insert into category values ('aáa');
INSERT 0 1
a=# insert into category values ('apple');
INSERT 0 1
a=# insert into category values ('Apple');
ERROR:  duplicate key value violates unique constraint "category_pkey"
a=#  insert into category values ('APPLE');
INSERT 0 1
a=# insert into category values ('aab');
INSERT 0 1
a=# select * from category order by title;
 title
--------
 aaa
 aáa
 aab
 apple
 Apple
 APPLE
 banana
 Banana
(8 rows)

The easiest solution would probably be to drop support for MySQL.  Not that that's ever going to happen.

Comment 153 Andrew Dunbar 2009-11-25 07:57:50 UTC

MySQL can do the collation that Anthony mentions for PostgreSQL above in #152 and much more besides.

PostgreSQL has all the problems and more of MySQL when it comes to MediaWiki's collation needs. It depends on the underlying OS, some of which don't do UTF-8 collation at all and as yet does not make us of ICU. See the page here: http://wiki.postgresql.org/wiki/Todo:ICU

I can't find any information on whether PostgreSQL supports Unicode beyong the Basic Multilingual Plane.

Comment 154 Philippe Verdy 2009-11-25 23:06:01 UTC

That's another good resson why collation should be supported directly within the MediaWiki software, which already depends completely of PHP, so that it should really use the best integration as possible using the dedicated ICU integration module for PHP.
This also means that it will be much better to simply store the computed collation keys directly within the database schema, unless the database adapter for PHP already supports ICU (and there's a commitment from the database vendor to integrate ICU as part of its design).
The Todo:ICU in PostgreSQL will be fine when it will be effectively implemented, and this integration becomes fully supported.
But anyway, the first thing to do is to press the PHP developers to have their own commitment to offer full support and integration of ICU within PHP, and if this is still nto the case, making sure that the ICU integration module for PHP comes from a viable project (otherwise there will be the need, in MediaWiki, to develop an adaptation layer for collation, that will support transparent change for another PHP integration module, or a later integrtion within PHP core itself).

The second thing to look for (and that is still missing) is a support for a ICU-like project (or port) for Javascript (for integration on the client-side, in the browser) with here also an Javascript-written adapter layer, that allows replacement of the Javascript-written collator by some future API supported natively by browsers (because it will perform much better).

The best integration tools (for client-side collation) that I have seen, using Javascript, fully depends on AJAX (i.e. with collaboration with serverside-scripts that can provide precomputed collation data, or that can compute the collation keys from the client-provided texts): some interesting demos use JSON requests or XML requests though AJAX, but this adds some delays and increases the number of HTTP requests needed to sort lots of client-side data (for example when sorting the rendered HTML table columns, which currently just uses the Javascript "localeCompare" function which seems to use only the DUCET or some locale-neutral collation, without taking into account the actual locale).

It would be much better if all major browser engines (for IE, Mozilla for Firefox, Wekbit for Safari/Chrome/KHTML) decided to extend the very poor support of Unicode and locales within Javascript/ECMAScript strings, using ICU as a base foundation or at least for the services API that it can implement (even if those browsers use different integration strategies): they should still support the same collation rules, with the same syntax in a similar language, such as the languages already documented in the Unicode standard, including the possibility to use the collation tailoring data already coming the CLDR project, and the possibility for these implementations to still support user-specified tailorings (so without hardcoding them in a way that would completely depend on the implemented Unicode version and the limited list of locales already supported by CLDR).

There are two standard languages defined for collation tailorings : one is XML-based but is extremely verbose (it is probably easier to use from a DOM-based object view, and most probably more efficient at run-time), another equivalent one is much more compact and more readable and much easier to specify by users or in scripts. Both syntaxes are automatically and easily convertible between each other, with equivalent compilation times and complexities, but the comapct form is easier to transmit in small scripts over HTTP (including through AJAX), and the compact form is much faster to parse as it won't depend on a ressource-hungry XML parser (due to its required and complex conformance rules). For Javascript clients, a JSON-based syntax for collation tailorings may also be even more efficient without requiring additional complex code written in Javascript itself.

Comment 155 Andrew Dunbar 2009-11-27 02:07:39 UTC

Here is another related MySQL bug to watch: http://bugs.mysql.com/bug.php?id=25666

Comment 156 Andrew Dunbar 2009-11-27 21:50:21 UTC

Apparently non-BMP UTF-8 was in MySQL 6.0 which was around as alpha for a while and whose features are all due to be added into upcoming 5.X versions: http://lists.mysql.com/packagers/418

Comment 157 Chris 2010-01-19 16:35:07 UTC

I'm not sure if i'm posting in the right category, but I figured I should post it here before making a new one post.
There's a problem for sorting with Korean page names.
If the page is 가, it will sort under ㄱ if you don't specify where to sort it (which is what it should do).

However, if I manually sort it to like this: [[Category:name|ㄱ]] it won't sort under the same ㄱ as before. Instead it will make another ㄱ to sort under.
So making if a user manually sorts a page under ㄱ, and another has makes a page that starts with the letter ㄱ and mediawiki automatically sort it -- even though they should be sorted under the same letter, they are not. My mediawiki is the latest version by the way (1.15.1)

Incase your confused:

ㄱ
-가 (manually sorted page)

ㄴ
-(page)

.
.
.

ㅈ
-(page)

<<end of korean alphabet, there should be no more Korean letters, but pages that weren't sorted manually will appear after>>

ㄱ (appears at the end, identical letter as before and should not be recreated by mediawiki)
- 기 (a page that was sorted by mediawiki, this should be sorted in the same section as page "가")

Comment 158 Le Chat 2010-02-27 16:37:31 UTC

As I understand it, this bug was wrongly assigned to someone who is not working on it (at least, not in its full generality), so I'm boldly de-assigning it, in the hope that someone will finally take it up.

Comment 159 T. Gries 2010-03-07 11:15:15 UTC

When setting up a new(!) wiki, and when using MySQL 4.1/5.0 UTF-8 (not: binary) a correct page title sort order can be achieved when applying this patch

in maintenance/tables.sql change 
  page_title varchar(255) binary NOT NULL,
to
  page_title varchar(255) NOT NULL,

before you run the installer.

The column page_title will then become of encoding type

  character set utf8 collate utf8_general_ci

and the sorting order of page titles is correct for (at least European languages). You then get page titles in Special:Pages listed in the order "Anton", "Äpfel", "École supérieure", "Zulu" - and not any longer "Anton", "Zulu", "Äpfel", "École supérieure" as is currently the case on all WMF wikis.

This tip is only valid for fresh installations: you cannot change this important column when updating from an existing database even not when mysqldumping and importing; attention: the tip has not yet checked for unwanted side effects.

For collations see 
[1] http://www.collation-charts.org/mysql60/ and 
[2] http://www.collation-charts.org/mysql60/mysql604.utf8_general_ci.european.html

Comment 160 Stu Derby 2010-03-12 01:32:28 UTC

Re: comment 159 

While the suggested patch clearly dramatically improves the page title sorting, it doesn't make it 100% correct for all "European languages" though it may make it so for some, and I think it would be an improvement for all but the Nordic languages.

The different European languages have different and conflicting collation sequences; no single order will work for all of them.

For example:

German speakers expect "Ö" and "O" to collate as equivalents; Danes, Norwegians, and Swedes expect "Ö" to be treated as a separate letter that sorts after "Z".

French has an unusual collation rule; the last accent on a word is more significant than the first, leading to this collation:
cote < côte < coté < côté. This is unique to French, so far as I know.

Hungarian sorts the combination "dzs" after the combination "dz" (i.e. "foodzsbar" should sort before "foodzbar" in Hungarian).

In Estonian, "Z" sorts as an equivalent to "S".

Etc. etc. etc.

The suggested patch may be very appropriate for individual Wikimedia installations, but is not so as part of the proper fix.

Comment 161 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-03-12 17:12:17 UTC

utf8 support in MySQL supports a number of common languages, including Danish, Swedish, French, Hungarian, and Estonian:

http://dev.mysql.com/doc/refman/5.1/en/charset-unicode-sets.html

Comment 162 Huji 2010-03-12 17:48:42 UTC

(In reply to comment #160)

Actually, I think the issues you raised with collations and languages has nothing to do with MediaWiki; if anything, MySQL is to be blamed. All we want here is for MediaWiki to use the full extent of collation features offered by MySQL. Perhaps not all languages will benefit from it, but at least some of them do, and that's enough a reason for a change like this.

Comment 163 T. Gries 2010-03-12 21:47:56 UTC

(In reply to comments #159 and #160)
> While the suggested patch clearly dramatically improves the page title sorting,
> it doesn't make it 100% correct for all "European languages" though it may make it so for some, and I think it would be an improvement for all but the Nordic
> languages.... Norwegians, and Swedes expect "Ö" to be treated as a separate letter that sorts after "Z".

Yes, I know this, too.

> The suggested patch may be very appropriate for individual Wikimedia
> installations, but is not so as part of the proper fix.

Thanks for pointing out. I must admit, that you are _fully_ right, also knowing of the collation differences. We developers should collaboratively think to find a satisfying solution for the "collation problem".

This could be done in Berlin during the Wikimedia Developer Workshop 2010, held on April 14.-16. in Berlin, Germany, as part of the Wikimedia Conference 2010. http://techblog.wikimedia.org/2010/03/registration-open-for-the-developer-workshop-in-berlin/

Comment 164 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-03-12 22:06:57 UTC

(In reply to comment #163)
> Thanks for pointing out. I must admit, that you are _fully_ right, also knowing
> of the collation differences. We developers should collaboratively think to
> find a satisfying solution for the "collation problem".

The options I know of are:

1) Use MySQL collation support, just using utf8 everywhere.  This will mess up equality comparisons in ways we almost certainly don't want, doesn't support characters outside the BMP, etc.

2) Roll our own sortkeys.  Probably requires a lot more work, and a lot less efficient (need to keep extra column+index around in all relevant tables).  Also might be tricky, since the default Unicode algorithm gives very long sort keys, which we'd have to squish somehow (AFAIK).

3) Use some utf8 collation for cl_sortkey, but keep everything else binary.  We could also add an extra page_sortkey using utf8 collation, which is no more expensive than (2).  This sounds like the best option to me offhand.  It would be as easy to implement as anything here, and wouldn't have such bad side effects.  It would somewhat mess up pages with non-BMP characters in the name, though.

Comment 165 T. Gries 2010-03-12 22:18:10 UTC

(In reply to comment #164)
> (In reply to comment #163)
> The options I know of are:
> 1) Use MySQL collation support, just using utf8 everywhere.  This will mess up
> equality comparisons in ways we almost certainly don't want, doesn't support
> characters outside the BMP, etc.
> 
> 2) Roll our own sortkeys.  Probably requires a lot more work, and a lot less
> efficient (need to keep extra column+index around in all relevant tables). 
> Also might be tricky, since the default Unicode algorithm gives very long sort
> keys, which we'd have to squish somehow (AFAIK).
> 
> 3) Use some utf8 collation for cl_sortkey, but keep everything else binary.  We
> could also add an extra page_sortkey using utf8 collation, which is no more
> expensive than (2).  This sounds like the best option to me offhand.  It would
> be as easy to implement as anything here, and wouldn't have such bad side
> effects.  It would somewhat mess up pages with non-BMP characters in the name,
> though.

Can you start writing an extension, which outputs the list of pages http://en.wikipedia.org/wiki/Special:AllPages in a user-definable collation such as utf8_general_ci; utf8_unicode_ci; utf8_swedish_ci ? This would be a good starting point in my view.

Comment 166 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-03-12 22:30:03 UTC

This shouldn't be an extension, we may as well have a page_sortkey added in core.  Makes more sense there.  However, categories are lower-hanging fruit, since we just need to change cl_sortkey's collation -- the column already exists.  Plus, categories are more visible.  This needs a sysadmin either way, however, not just developers.

Comment 167 T. Gries 2010-03-12 22:38:48 UTC

(In reply to comment #166)
> This shouldn't be an extension, we may as well have a page_sortkey added in
> core.  Makes more sense there.  However, categories are lower-hanging fruit,
> since we just need to change cl_sortkey's collation -- the column already
> exists.  Plus, categories are more visible.  This needs a sysadmin either way,
> however, not just developers.

Sounds reasonable, ok, but who comes up with a sustainable solution for the core code and this buzilla ?

Comment 168 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-03-14 16:26:48 UTC

Changing cl_sortkey to utf8 *is* a solution.  It needs a sysadmin to do it on Wikimedia (if we want to do it), and we'd probably also want some core code to filter out non-BMP characters before they get to the sortkey.

Comment 169 Liangent 2010-05-14 15:33:03 UTC

[[mw:Extension:CategoryMultisort]] and [[mw:Extension:CategoryMultisortChinese]] written by me is expected to resolve the problem on Chinese Wikipedia.

Comment 170 Tim Starling 2010-06-02 00:48:00 UTC

I think using ICU sort keys, possibly with a pure PHP port to support non-WM installations, would be a good way to go. It would support all the languages we want to support, which is important.

Comment 171 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-07-21 21:03:40 UTC

I'll be working on this bug.  I hope to have a solution coded up within a couple of weeks.  I wrote a post to wikitech-l about it, and encourage people to respond there rather than here (since this involves several bugs):

http://lists.wikimedia.org/pipermail/wikitech-l/2010-July/048399.html

I'll also note that the enormous amount of bickering and repetition here meant that I wound up skimming it, so if I missed your suggestion, please repeat it on the wikitech-l thread.

Comment 172 Philippe Verdy 2010-07-22 06:00:34 UTC

>> But if the SQL engine does not have such support, this must be implemented in
>> the PHP code and collation keys can be stored in a new datacolumn (the extra
>> data column can be added or filled conditionnally : if the SQL engine supports
>> the needed collations, this column can remain NULL to save storage space).
>> 
>If you sort this stuff in PHP, you need to grab the entire list before you can
>reliably sort it. Doing that for [[Category:Living people]] has no chance of
>staying within the memory limit.

And this was false. Because you assume that the generation of sort keys has to be done on database queries for listing the content of a category, when instead this generation os sortkeys can be done safely only on database inserts/updates for each separate page.

What I wanted to say is that the computed sortkeys will have to be stored. But several sort keys for the same page in the same category are possible (one for each collation locale indicated by the target category).

There will be no memory limit, but of course this will have a storage cost, as the stored sort keys will have to be queried along with the list of pages to display. 

The good question to ask is: where do we store these sortkeys? Currently we have a SQL relation table containing a unique index on (categoryID, sortkey, pageID) and this is enough to perform the join with the table of pages. However thre's still only one sortkey per page and per category.

That sortkey is needlessly using the pageID within the generated sortkey (this is is visible when crossing a page limit and navigating throught pages) so in fact the unique index is on (categoryID, "augmented sortkey"). Conceptually wrong and bogous (I think this was just a fast patch when there were unicity problems and multiple pages could be specified with the same sortkey).

What is limiting you of changing the relation table containing the list of pages in categories, of using instead a unique index on:
 (categoryID, sortkey, pageID, localeID)
where the localeID is one of the locales supported by the project, which specifies: the language for which a collation is being performed, and a collation variant (for example, in Chinese, sort by radical/strokes with locale="zh-Hans", or sort by pinyin with locale="zh-Latn")

The generation of the concent of the sortkey column is the only major problem requiring a design decision. This is where it should not even depend on the SQL engine, and where it can be implemented within PHP, using the PHP extension that allows using ICU functions. That string does not have to be extremely long and does not have to be be humane readable.

It can be safely be stored with a reasonnable length limit. So ICU-generated sortkeys are still safe if they get truncated. Notably because the unique index on:
 (categoryID, sortkey, pageID, localeID)
is also unique on its restriction:
 (categoryID, pageID, localeID)
And the sortkey generated by ICU, even if it's a string of binary bytes can still safely be stored in a table index that does not support blobs but want only "VARCHAR(n)" types, by serializing the binary sortkey to a safe encoding (the most basic that will work is hexadecimal) that does not even require the support of Unicode or UCA collation. Just use an ASCII only column to store the computed binary sortkey serialized as an ASCII-only string.

But if the database engine supports strings of bytes, just don't serialize the blob, use the supported SQL type that can store it directly, for example VARBINARY(n), if it remains sortable in binary order.

With this design, you are completely independant of the SQL engine, it will work identically on MySQL, PostrgresSQL, or others. And you'll have solved the problems of collation with multiple locales according to their rules, and possibly according to visitors preferences.

Note: above, the localeID is not directly a VARCHAR(n) containing "en", or "zh-Hans". It is an arbitrary unique numeric identifier that maps in fact to a collation rule within a locale, and this collation rule may need to be updated from time to time: when upgraded a collation, you'll generate additional keys with a new localeID. And when this is done the table of supported collations will indicate which localeID is the current one to use, and you'll be able to perform easiky the cleanup of old sortkeys that where computed within the old rule.

It's not a complicate design, and it offers stability warranties and supports as well the possibility of upgrading the collations.

The possibility of offering multiple sortkeys for the same page in the same category comes as a bonus, and you can assign "localeID=0" to store the user-specified sortkey that has been assigned in a page using the second parameter of the [[category:pagename|sortkey]] link or the parameter of the {{DEFAULTSORT:sortkey}} if this paramter was missing (this will avoid having to reparse the pages just to retrieve this user-specified sortkey).

Comment 173 Roan Kattouw 2010-07-22 07:15:01 UTC

(In reply to comment #172)
> And this was false. Because you assume that the generation of sort keys has to
> be done on database queries for listing the content of a category, when instead
> this generation os sortkeys can be done safely only on database inserts/updates
> for each separate page.
> 
> What I wanted to say is that the computed sortkeys will have to be stored. But
> several sort keys for the same page in the same category are possible (one for
> each collation locale indicated by the target category).
> 
[etc., etc., etc.]

I suggest you read comment #171 and the wikitech-l thread linked from there. That thread is an implementation proposal by Aryeh, who, unlike pretty much everyone else commenting on this bug (myself included), is actually gonna start implementing it; he suggested he might start today. This bug already has 173 comments, and posting more long comments about hypothetical implementations while not replying to the implementation plan of the one person actually being serious about getting this done is not useful and might just push this bug to 200 comments without getting anywhere. If we're gonna discuss anything, let's discuss the current implementation plan: it is the only relevant plan to discuss at this point.

Comment 174 Philippe Verdy 2010-07-22 09:13:30 UTC

> If we're gonna discuss anything, let's
discuss the current implementation plan: it is the only relevant plan to
discuss at this point.

This is EXACTLY what I was discussing: proposing an implementation design, which also considers the fact that collations will also need to evolve over time (for example the UCS repertoire is evolving (so the DUCET table is modified), and collation rules are being corrected for some languages, in the CLDR project) : each change will generate a new internal localeid to support it, and there will be possibly different keys during the transition, even if (finally) an old collation rule will be deleted (with its old sortkeys) after the new sortkeys will have been fully recomputed.

So this is clearly not "blah blah". And this is certainly relevant for the fact that you're considering implementing some or all of the suggestions (and you'll have to test your solutions, including on their performance impact.

I propose a simple model that can be very space-efficient, and that also avoids reparsing all pages if ever a collation rule is changed, or if additional collation rules are added in a category to support multiple sort orders (notably within Chinese categories that could support different orders).

My proposal does not even depend on the backend SQL server capabilities (all it has to support is at least a binary order on ASCII-only VARCHAR(n) to store the computed and truncated sortkeys, that will be generated by the PHP front-end (using ICU) and followed by an ASCII-only serialization. This means that the simplest "ORDER BY" queries to retrieve correctly ordered lists of pages will work independantly of the backend.

The function used in PHP to generate the binary-ordered sortkey (that will finally be effectively stored) should also be accessible in MediaWiki as a builtin parser function, that will take two parameters: the locale code, and the text.

For example, as {{SORTKEY:text|locale}}, where the ''locale'' specified can be optional and will take the default value of the {{CONTENTLANGUAGE}} of the project).

This builtin parser function could also be used to correctly sort the sortable Mediawiki tables inserted in articles, by using correct sortkeys generated by this template, if the generated sortkey is readable and effectively serialized as ASCII-only, but it does not necessarily have to be truncated by this function, even if it will be truncated when the same function result will be used to store sortkeys in the schema).

This parser function should even be the first development to realize, before even changing the category-page indexes, because it can be applied and tested immediately in existing categories (for example by using categorizing templates in Wiktionary), without even upgrading the SQL schema.

Comment 175 Roan Kattouw 2010-07-22 09:59:32 UTC

(In reply to comment #174)
> > If we're gonna discuss anything, let's
> discuss the current implementation plan: it is the only relevant plan to
> discuss at this point.
> 
> This is EXACTLY what I was discussing: proposing an implementation design,
AFAICT you're proposing your own design without mentioning where it differs from or coincides with Aryeh's proposal. From briefly reading your latest comment, it looks like a lot of what you're proposing is already in Aryeh's proposal, such as the binary-ordered sortkey thing. Other things, such as the attention you're paying to the possibility that collation rules may change, are not covered in Aryeh's proposal, so those are valuable additions.

So you're right, it's certainly not "blah blah", but it would help a lot if you limited your comments to the differences between your and Aryeh's proposal, rather than leaving it up to the reader to find and filter the parts the two proposals have in common.

Comment 176 Philippe Verdy 2010-07-22 10:04:53 UTC

Anyway, the Aryeh's proposal is not found or documented at the location you indicate. It just says that he will try to work on it today and still asks for solutions and asks for questions that are unanswered in his comment on the wikitech-l list.

In a similar spirit, the generation of the section heading for categories could
also be a different builtin parser function such as:

: {{COLLATIONMAP:text|locale|level|clusters}}

with the similar parameters, and clusters=1 be default (more below). It will
return a non-opaque string that can be displayed in category pages, and could
even become a valid anchor for searching starting at some text position in the
list (ordered using the specified locale). It should be only a level-1
collation mapping by default (only level 1 will be considered for displaying
headings, however some categories could later specify another default collation
level for such mapping.

Note that collation-based mappings is a VERY different concept from the concept
of collation-based sortkeys (read the Unicode UCA specification): sortkeys are
opaque and intended to generate a full order of texts, mappings are readable
but only intended to produce a partial order.

Another optional parameter could be given after the collation level, to
indicate how many locale grapheme clusters should be included in the heading.
The default 
 headings in categories should just use 1 grapheme cluster from the text given
to the first parameter of {{COLLATIONMAP:text|locale|level|clusters}} (more
below).

In a category you could *LATER* specify additional sort orders (and collation
mappings at the same time) using a syntax like:
{{SORTORDER:locale|level}} (the default collation level will be 1 for
categories).

Example 1: in a Chinese category page, you could specify:
; {{SORTORDER:zh-hans}}
: to indicate that pages in that category will be also available using the
radical/stroke order of sinograms. 
; {{SORTORDER:zh-latn}}
: to indicate that pages in that category will be also available using the
Pinyin Latin order.

Example 2: in a Korean cateogry, where the primary order is based on the
decomposition into "jamos", it will be enough to specify:
; {{SORTORDER:ko}}
: (for the default South Korean order of jamos)
; {{SORTORDER:ko-kp}}
: (for the default North Korean order of jamos)

Indicating the collation level with a value different from 1 could generate sub
headings for level 2, but I think it should only display them with the
specified level, all using the same heading from the level-2 collation mapping.

Indicating clusters=2 (or more) in the builtin parserfunction {{COLLATIONMAP:}}
may be used to generate more precise headings (for example in heavily English
or French populated categories, use 2 grapheme clusters to generate headings on
the 2 first letters, but keep the collation level to 1). By default the builtin
parser function will not limit the number of grapheme clusters (so it will
remap all the text), but a category could still specify this maximum number of
clusters to consider for headings.

By default a category will consider either
* clusters=1 : section headings will be generated only for the first cluster
(this is what is currently used in categories). or
* clusters=0 : section headings will not be generated at all.
(this default could be a per-project default).

The generation of section headings (using the same PHP function used by
{{COLLATIONMAP:}}) does not require any modification of the schema. Headings
can be computed and generated on the fly, from the retrieved list of pages.
Headings should not even influence the sort order, they are just convenient
groupings for display.

The {{COLLATIONMAP:}} function described here would in fact enter in the list
of string builtin function, as it falls in the same category as other mapping
functions like {{UC:}} or {{LC:}}. This is just another mapping.

The generation of sort keys (using the same PHP function used by {{SORTKEY:}})
however requires active parsing to store them in the schema. So it can only be
done later. This developement should be made *later* (when the SQL schema for
category indexes will be adjusted to support multiple/upgradable collation
orders).

So yes, a builtin parser function such as
{{COLLATIONMAP:text|locale|level|clusters should first be implemented and
tested separately before being used to group items in category headings, but it
can be implemented separately from the developement and deployment of the SQL
schema for indexing categories with {{SORTKEY:}}), and integrated later in the
code that will present the list of pages to the users.

Comment 177 Roan Kattouw 2010-07-22 10:17:54 UTC

(In reply to comment #176)
> Anyway, the Aryeh's proposal is not found or documented at the location you
> indicate. It just says that he will try to work on it today and still asks for
> solutions and asks for questions that are unanswered in his comment on the
> wikitech-l list.
> 
He does document a fair bit of what he wants to do (e.g. separating files, categories and regular pages with sortkey prefixes). It's not a complete proposal and he is asking for input, but he hasn't left everything open either.

Also, why keep the discussion in two places? Why not join the discussion on wikitech-l?

Comment 178 Philippe Verdy 2010-07-22 10:48:39 UTC

Because his specification is really incomplete, and he said that Bug#164 was useless (despite of the fact that I had described my solution extensively in Bug#164long before Ariyeh started working on it.

And yes, before ever attempting to change the schema, I support the prior developement and extensive testing of builtin parser functions supported by PHP functions which will be shared later to support the updated SQL schema.

Only this developmeent alone will have significant difficulties:
* notably integrating ICU in a PHP module installation, or
* rewriting the collation algorithms entirely with PHP;
* having to support the DUCET updates caused by new Unicode versions or corrections;
* having to support multiple collation orders by per-locale tailorizations (coming from CLDR or from other sources).

The need to support upgraded collation orders is also an important decision factor for the schema, if sortkeys are stored in a SQL backend, that's why I speak about it very early:
* collations supported by SQL backends have very strong limitations, or any upgrade would require shutting down the servers for hours or days to perform the upgrade of collated indexes.
* in their missing full ISO 10646 "level 3 implementation" for the support of supplementary planes.

All this is something that can be avoided completely by using ICU and not depending on SQL backends for their support of many more collation locales that we need in our international projects:

* the schema just needs to be able to store multiple sortkeys, so that newer sortkeys (computed with the new rules) can be progressively computed in the background by a bot or server script or some upgrades occuring on the fly when processing articles.
* older sortkeys that were using a older generation rule can be deleted in a simple DELETE operation after the new collation rule for a corrected locale has been made the default one, or can be deleted one by one each time a new generation sortkey is recomputed and has been inserted (there's not even the need to perform the two sucessive operations in a transaction if the first INSERT withe the new rule has been sucessful).

Because we have now multiple sortkeys per indexed page in a category, we can conveniently support multiple sortkeys for different locales and offer a good experience for users that will want alternate sort orders (notably Chinese users that will want presentations in radical/stroke order, or in pinyin order).

----

Another note about how to serialize the opaque sortkeys:
the builtin function {{SORTKEY:text|locale|level}} described above will not
limit the length of the generated binary sortkey, however it should serilize it
in a valid Unicode text that can be used in tables.

A convenient serialization of bytes to characters that will sort correctly is
Base-36 using the alphabet [0-9A-Z] (no padding necessary) or Base-32 (it
avoids modular arithmetics but will serialize into longer strings)

If sortkeys are about to be stored, retrieved in the SQL schema, and sorted by
the SQL clause "ORDER BY...sortkey...", then:

- either the SQL backend allows storing and sorting binary sequences of bytes
as VARBINARY(N) : then no extra serialization is needed, store directly that
opaque sort key, after truncation to the max length value (N) indicated in the
SQL type of the "sortkey" table column.

- or the SQL backend does not support sortable binary sequences of arbitrary
bytes, but can only sort VARCHAR(N), then use a similar Base-32 or Base-36
conversion to create compatible sortkeys, and then store the converted string
after truncating to the max length value (N) indicated in the SQL type of the
'sortkey" table column.

- in both cases, the stored sortkeys will NEVER be exposed to users, its sole
purpose is to make the SQL "ORDER BY" clause work properly.

To start listing a category from a given artbitrary Unicode text, use the
"start=" HTTP query parameter and compute internally the sortkey associated
with it to generate the value used in SQL clause "WHERE sortkey >= 'value'".

- Section headings in categories will never need to be stored, they are
generated on the fly by reading the page names retrieved in the SQL result set
using the {{COLLATIONMAP:}} function, with the specified locale in the
"uselang=" HTTP query parameters, and the specified (or default) "clusters="
parameter (whose default will be 1 or 0 as indicated above). They will be
diretly readable by users and do not require decoding anything from the stored
sortkey.

- the readable collation mappings and the opaque sortkeys should be coherent in
the same locale, but they are clearly different: pagenames that are
collation-mapped should sort in the same natural order as the section headings
generated from them, so it's absolutely not needed to generate sort keys from
collation-ampped headings computed in the fly.

Comment 179 Roan Kattouw 2010-07-22 11:16:18 UTC

(In reply to comment #178)
> All this is something that can be avoided completely by using ICU and not
> depending on SQL backends for their support of many more collation locales
This is exactly what Aryeh is proposing. I think everyone agrees that it's better to use binary sorting with munged sortkeys rather than SQL backends' half-baked collation support, so you don't need to argue that.

> - Section headings in categories will never need to be stored, they are
> generated on the fly by reading the page names retrieved in the SQL result set
> using the {{COLLATIONMAP:}} function, with the specified locale in the
> "uselang=" HTTP query parameters, and the specified (or default) "clusters="
> parameter (whose default will be 1 or 0 as indicated above). They will be
> diretly readable by users and do not require decoding anything from the stored
> sortkey.
> 
This is not that simple, as was pointed out on wikitech-l: what if you've got a language where á sorts the same as a (that is, a and á are the same for sorting purposes), then your sorted result could look like:

Áa
Áb
Ac
Ád
Ae
Af
Ág
...

We humans understand that the proper heading for this is "A", not "Á", but how will the software know that? Even if we store the original, unmunged sortkeys (note that the sortkey is not necessarily equal to the page name: [[Albert Einstein]] sorts as "Einstein, Albert") so we can differentiate A from Á, we can't just take the first or even the most frequent letter: neither is accurate in this case.

Comment 180 Philippe Verdy 2010-07-22 11:42:55 UTC

Your issue ***IS*** addressed in my proposal:

*Both* {{COLLATIONMAP:Áa}} and {{COLLATIONMAP:Aa}} will be unambiguously "aa" in the primary collation level for that locale. They only differ at the secondary collation level (with accents).

You did not understand that a collation-mapping is DIFFERENT from an opaque sort key, even if both are built using the same collation rules for the same locale.

The case of "Albert Einstein" sorting as "Einstein, Albert" will pass through the standard generation of the sortkey from the string "Einstein, Albert" explicitly given in MediaWiki source code as the parameter of the {{DEFAULTSORT:Einstein, Albert}} special function or as the second parameter of the [[category:...|Einstein, Albert]] link.

----

So here is a development and deployment map:


1) Develop to PHP functions that will compute:

  function sortkey($text, $locale, $level=1)
    - it will return an opaque array of byte values
    - $locale may be given a default value from the project's content language, but this is not specified here but part of its integration in MediaWiki
    - $level may take the default value of 1.
    - the algorithm involves parsing steps to transform the $text parameter into normalized form, then parse it by collation element, and then locating the collation element in the tailored collation table, which is indexed by collation element value and returns an array of collation weights, one for each level.
    - it packs all the collation weights into the returned opaque array of byte values, by appaending all non-zero collation weights for each collation element at the same collation level before appending the collation weights for higher successive levels.
 
  function collationmap($text, $locale, $level=1, $clusters)
    - it will return a remapped text using the same $locale and $level parameters
    - it will use the same DUCET table and the same per-locale tailorings
    - the same parsing steps are performed
    - but instead of packing the collation weigths, it scans the collation table in the reverse order, by locating the first collation element (a small Unicode string, often limited to a single character) that has the same collation weights up to the specified level. When this smallest collation element is found, append this to the result string.

  function base36($bytes)
    - it packs the opaque binary array of bytes into plain ASCII that has safe binary order and can be safely be stored in a VARCHAR(N) table field, or that can be returned in a MediaWiki function.

  This module should use ICU and implement the locale tailorings, and should be able to support a full DUCET table,and allow lookups from a collation element to an array of collation weights, or the reverse (and ordered) lookup from a collation weight to a collation element for the function collationmap())

2) Integrate these functions in a Media Wiki extension for builtin parser functions.

  {{SORTKEY:text|locale|level}}
    - this will return base36(sortkey($text, $locale, $level))
    - by default take $level=1
    - by default take $locale={{CONTENTLANGUAGE}} (or {{int:lang}} ?)
    - it can be used as a much better implementation of Wikipedia templates like [[Template:Sort]]

  {{COLLATIONMAP:text|locale|level|clusters}}
    - this will return collationmap($text, $locale, $level, $clusters)
    - it can be used to simulate the generation of headings in categories, but as well within mediawiki tables
    - by default take $clusters=null (no limitation of length)
    - by default take $level=1
    - by default take $locale={{CONTENTLANGUAGE}} (or {{int:lang}} ?)


3) build a function for mapping category sortkeys into stored sort keys, this will depend on the SQL backend capabilities and on the schema constraint length for the sortkey data columns:

   function sqlsortkey($text, $locale, $level)
     - it will return either :
        substring(sortkey($text, $locale, $level), 0, $maxlength)
     - or :
        substring(base36(sortkey($text, $locale, $level)), 0, $maxlength)
     - the choice will depend on the support of VARBINARY(N) and its sortability  in the SQL engine, or of only VARCHAR(N)
     - the sortkey will not have to be UTF-8, and will not need any support of the same locales for collation tailoring in the SQL backend.

4) update the schema to support the description of supported internal collation ids.
   - add a mapping function from the human readable $locale parameter to the collation id associated to the curent version of the collation rule currently applicable to a locale.
   - support this mapping with a simple "collations" relational table

5) build a new category index based on:
  (categoryid, collationid, sortkey, pageid)
  - categoryid and pageid are standard MediaWiki page ids (in any modification version).
  - collationid will come from the previous mapping (using the locale identifier string, and where the locale will be determined by HTTP query parameters like "uselang=", i.e. {{int:lang}}, or from the project's default {{CONTENTLANGUAGE}}).
  - the sortkey column will be computed using PHP's:
    sqlsortkey($text, $locale, $level)
    described above.

....

Comment 181 Philippe Verdy 2010-07-22 11:55:26 UTC

But effectively, we would even be smarter for category indices, by separating the namespaces, so that they will be paginated and navigatable independantly (this means that namespaces will have their own prefered sort order, specified in the project, but this should not require any complex computing of sortkeys: we already have namespace ids that sort naturally in numerical order)

This means that an extended category index would be:
(categoryid, collationid, sortkey, namespaceid, pageid)

Namespaces should have never been included at all in the computation of the default sortkeys, so instead of using the default sortkey from {{FULLPAGENAME}} (the way it is today), it should just use {{PAGENAME}}.

... Unless a category specifies specifically that it wants to mix the various namespaces, in which case the namespace name will be part of the default sort key, and the namespaceid will become 0 in all pages categorized in that category.

This is a decision factor for the last steps of the development plan above, it should not influence the development of the 2 first steps above.

Comment 182 Philippe Verdy 2010-07-22 13:18:28 UTC

Note that my developement plan will NOT imply an immmediate change of the SQL schema. At first, if only working on the frist 2 steps, no schema change is necessary to effectively test the two functions.

Notably, you'll be able to create test pages containing lists of words formatted as table rows using a template. You'll be able to show the words, in one column, then the opaque (base-36 converted) sort keys in another column, and then the visible collation-mapped string in a third column.

Then you'll include this list within a page adding the table headers and using the existing "sortable" table class to see how they collate when sorting by the second column, and you'll also be able to sort again on the third column, to see that the collation-mapped string will not alter this sort order.

If all this passes, this can be deployed and immediately used in Wiktionnary (notably French Wiktionnary) for sorting categories containing huge lists of words, or in english Wiktionnary for use in the template that generate sort keys for people's "Last name, first name".

This will be already a considerable progress, even if there will be no immediate support for multiple sort orders (in Chinese).

The change of schema, where the {{defaultsort:sortkey}} parameter or the second parameter of the [[category:page|sortkey]] link will be used effectively, so that we no longer have to use a sortkey parameter in pages and templates can be delayed a lot.

But the two developed functions (at least the first one {{SORTKEY:text|locale|level}} which is a bit simpler to implement than the second one {{COLLATIONMAP:text|locale|level}} that is a bit trickier as it requires special handling for collation elements that are mapped to more than 1 collation weight per level) should be a lot simplified, and will be enough tested before even thinking about reusing them for sorting pages in categories and presenting the lists with headings.

There is a separate modification to develop for grouping category contents by namespace. This should be developed and deployed first before even trying to change the schema for sort keys.

In fact, the first integration for displaying categories will be to use {{COLLATIONMAP:}} internally for displaying the headings in category pages. This won't require any change as well for the SQL schema.

Finally, the use of {{SORTKEY:}} and the addition of new keywords for use in category pages, that will allow a category to support multiple collations, will happen only at the last step. Because this is the most sensitive one and this is the only step that may finally involve a change of schema. There's much enough work to do for building all the first integration steps, as extensive tests in various locales will be needed, so that all wikis report their problems with the implemented collations and their supported tailorings.

Comment 183 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-07-22 17:57:42 UTC

Okay, look, if you make posts this long I'm not going to be able to read them.  You have to state your points briefly and clearly, or else I'm just not going to have time to read through carefully and figure out exactly what you're trying to say.  Walls of text are really not helpful.

Furthermore, you're focusing too narrowly on implementation details, like the exact interface a function should implement.  Please try to state your concerns primarily in terms of use-cases or problems -- scenarios that the implementation will need to handle.  Secondarily, it's fine if you propose specific solutions, but I can't evaluate your solutions unless you state clearly what problems they're trying to solve.

I'll comment on a few things you said.  If I miss anything important, please restate it *briefly*, and I can address that too.

(In reply to comment #172)
> What I wanted to say is that the computed sortkeys will have to be stored. But
> several sort keys for the same page in the same category are possible (one for
> each collation locale indicated by the target category).

If we store separate sortkeys per category link per locale, it would take vastly too much space, since there are likely to be dozens of locales.  categorylinks is already a big table; we can't justify making it ten times bigger for this.  Each category can have one locale and only one locale.

> That sortkey is needlessly using the pageID within the generated sortkey (this
> is is visible when crossing a page limit and navigating throught pages) so in
> fact the unique index is on (categoryID, "augmented sortkey"). Conceptually
> wrong and bogous (I think this was just a fast patch when there were unicity
> problems and multiple pages could be specified with the same sortkey).

It is not needless.  We need a unique key to sort by, or else paging doesn't work if many pages have the same sort key (e.g., if it was manually assigned).  See bug 4912.

> The generation of the concent of the sortkey column is the only major problem
> requiring a design decision. This is where it should not even depend on the SQL
> engine, and where it can be implemented within PHP, using the PHP extension
> that allows using ICU functions. That string does not have to be extremely long
> and does not have to be be humane readable.

Yes, we will certainly use ICU or something.  GerardM has said he much prefers CLDR, so maybe we'll use that instead.

> It can be safely be stored with a reasonnable length limit. So ICU-generated
> sortkeys are still safe if they get truncated. Notably because the unique index
> on:
>  (categoryID, sortkey, pageID, localeID)
> is also unique on its restriction:
>  (categoryID, pageID, localeID)
> And the sortkey generated by ICU, even if it's a string of binary bytes can
> still safely be stored in a table index that does not support blobs but want
> only "VARCHAR(n)" types, by serializing the binary sortkey to a safe encoding
> (the most basic that will work is hexadecimal) that does not even require the
> support of Unicode or UCA collation. Just use an ASCII only column to store the
> computed binary sortkey serialized as an ASCII-only string.
> 
> But if the database engine supports strings of bytes, just don't serialize the
> blob, use the supported SQL type that can store it directly, for example
> VARBINARY(n), if it remains sortable in binary order.
> 
> With this design, you are completely independant of the SQL engine, it will
> work identically on MySQL, PostrgresSQL, or others. And you'll have solved the
> problems of collation with multiple locales according to their rules, and
> possibly according to visitors preferences.

This was always the intended design.  (But note that truncating these sort keys might not always be fully safe: it could cause tons of collisions, depending on the nature of the sort key generation algorithm.  These would be resolved arbitrarily, by page_id, which is confusing to users.)

> It's not a complicate design, and it offers stability warranties and supports
> as well the possibility of upgrading the collations.

Upgrading the collation can be done in-place.  The worst case is that categories sort weirdly for a few hours.  Also, we would only realistically have to change the collation often on smaller wikis, since the largest wikis should have high-quality collations that cover almost all their pages to begin with.  I don't think we need to adjust the schema to prepare for this.

Comment 184 Roan Kattouw 2010-07-22 18:40:19 UTC

(In reply to comment #183)
> It is not needless.  We need a unique key to sort by, or else paging doesn't
> work if many pages have the same sort key (e.g., if it was manually assigned). 
> See bug 4912.
> 
It is needless in the sense that the cl_from field is right there and is indexed. We should just be using that field rather than tacking its contents onto the sortkey because we're too lazy to rewrite IndexPager so it can handle paging by more than one field (which is not very hard).

Comment 185 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-07-22 18:47:23 UTC

Oh, I didn't get what he was saying.  Yes, obviously we should use the actual cl_from field, not tack it onto cl_sortkey (is that what we do these days?).  I'm going to have to make a lot of changes to the category pager anyway to support paging by multiple things, so I can do that while I'm at it.

Comment 186 Philippe Verdy 2010-07-24 05:53:30 UTC

(In reply to comment #183)
> Upgrading the collation can be done in-place.  The worst case is that
> categories sort weirdly for a few hours.  Also, we would only realistically
> have to change the collation often on smaller wikis, since the largest wikis
> should have high-quality collations that cover almost all their pages to begin
> with.  I don't think we need to adjust the schema to prepare for this.

That'a a bad assumption : even the highest quality collations will need to be updated from time to time:
- Unicode evolves and new characters get encoded (new versions are published about every 1-2 years after synchronization and final balloting at both ISO WG2 and the UTC.
- The content of the Unicode DUCET is NOT stable: characters are inserted in the sequence so that the full list of collation weights needs to be offseted where the new characters get inserted.
- Collations for languages get corrected. We should be able to upgrade these rules when the CLDR project produces new tailorings (CLDR updates are published separately, about every 1-2 years.)

These corrections may be rare (every few months), but when they will occur, any upgrade could take many hours that could horce the site to go offline when recomputing sortkeys, or NO correction will be possible. Upgrading "in place" is effectively what I proposed, but how will you track which pages need to to reindexed?

A collation ID in the stored index can really help determine which collation rule was used to generate the stored sortkey; In addition it will allow to support multiple collations. This is the mean by which the "in place" recomputing can be safely be done.

Note: truncating the sortkeys will ALWAYS be needed, just because the database column will still have a length limit. Truncating is not so bad anyway, because:
- the compact binary sequence of primary collation weights, that starts the sort key will be at the begining of the sort key. Further length is used to store the compacted sequence of secundary collation weights, then the sequence of ternary collation weights.
- if truncation occurs, the effect will be that only the smallest differences will not be represented.

But if you accept to store only non-truncated sort keys, you'll still have the case where some pages will have some long name, plus the case where someone will have indicated for that page a very long {{DEFAULTSORT:sortkey}} or very long text in the second parameter of [[category:...|sortkey]]. To avoid this:
- page names already have a length limit. This also limits the length of sort keys computed from only them
- we should already truncate the string given in {{DEFAULTSORT:sortkey}} or {{category:..|sortkey]] so that the concatenation of this string and of the page name can be used to compute the binary sortkey.

If you can accept arbitrary lengths, so go with it, but it will be unsafe and your schema will not be able to put that in a sortable column (you'll be only able to put it in a referenced BLOB, just like the the text of articles, and databases can never sort external BLOB's)

Anyway you did not reply to the idea of first developin the parser functions and test them. Developping the SQL schema extension should not be attempted before at least the first function {{SORTKEY:text|locale|level}} has been fully developed and tested on specific pages (it can be tested easily in tables).

And with just this function, it should be possible on specific wikis to use it immediately to sort specific categories (for example by using templates using that function).

The second function {{COLLATIONMAP:text|locale|level|clusters}} is not needed immediately to develop the schema, but will be useful to restore the functionality of headings. Headings don't need to be stored as they can be computed on the fly, directly by reading sequentially the sorted result set from the SQL query:

You can compute headings from the returned page names, or from the existing stored "cl_sortkey" which should be used now ONLY to store the plain-text specified in articles with {{DEFAULTSORT:sortkey}} and [[category:...|sortkey]].
The existing cl_sortkey is just a forced "hint", it does not make the sort order unique. Otherwise it should remain completely empty with the new schema. It will always be locale neutral and will take precedence on the page name : to sort the pages effectively, the content of the cl_sortkey content and the pagename should be always concatenated inernally to compute the binary sortkey for various  locales.

Comment 187 Philippe Verdy 2010-07-24 06:17:18 UTC

(In reply to comment #185)
> Oh, I didn't get what he was saying.  Yes, obviously we should use the actual
> cl_from field, not tack it onto cl_sortkey (is that what we do these days?). 
> I'm going to have to make a lot of changes to the category pager anyway to
> support paging by multiple things, so I can do that while I'm at it.

This does not require any change in the schema. This can be made immediately by MediaWiki developers and will not influence the developement of all corrections needed for bug #164 itself.

The unique index just has to include the cl_from field (whatever it contains, it is already the unique ID of the page, and it is already stored). Only the SQL queries and the HTTP requests have to be able to specify this cl_from field separately.

Tacking it in the sortkey was a bad decision made by "lazy" developers before they realized that storing a separate cl_from field was necessary. It just polluted the indexes and had the bad effect of truncating more precious space for correctly sorting some pages.

Drop this tacking immediately in the PHP code: the cl_sortkey is only intended to store the clear-text sortkey "hint" specified in {{DEFAULTSORT:sortkey}} or [[category:...|sortkey]] to override the sort order, and this clear-text sortkey is not really a true sortkey but a sort hint to override the default sort order. 

For example in people's names whose page name is "FirstNames LastName" but that we want to sort as if they were "LastName, FirstNames" by indicating only {{DEFAULTSORT:LastName !}} (it should not needed to include the FirstNames in the wiki text, as this sort hint will not be unique and the group of pages using the same hint will still need to sort within this group using their natural order). Even in that case, there's no need to bogously tack the cl_from field in the stored field.

Comment 188 Philippe Verdy 2010-07-24 07:00:47 UTC

(In reply to comment #183)
> Okay, look, if you make posts this long I'm not going to be able to read them. 
> You have to state your points briefly and clearly, or else I'm just not going
> to have time to read through carefully and figure out exactly what you're
> trying to say.  Walls of text are really not helpful.
> 
> Furthermore, you're focusing too narrowly on implementation details, like the
> exact interface a function should implement.

An "interface" function is DEFINITELY NOT an "implementation" detail. My comment #180 _correctly_ describes and summarizes what is really wanted.

It's not focusing on how it will be implemented (using ICU or not, how to integrate it in PHP if it's used, otherwise how to represent the DUCET and how to represent the language tailorings...), and it explains why the development can be planned in several clearly independant steps that can be tested separately.

It correctly explains the dependancies and why any change in the SQL schema can be and should be delayed. In fact I consider the step (1) in my plan to have a high priority on all the rest, and it does not imply any immediate change in the schema to support it.

There will be some time needed only to assert that the implemented collations are effectively correct for various languages: this will be verifiable by users looking at list of terms in some wiki page, using a "sortable wikitable" whose rows are using the new {{SORTKEY:text|locale|level}} parser function.

If this is validated by users, then only two independant development phases can be started (in parallel):

- to develop the new schema for stored sortkeys, based only on the internal PHP functions implementing {{SORTKEY:text|locale|level}}. The schema should NOT be deployed before the collations have been tested extensively by users and their internal data structures reflect the wanted collations order and tailorings.

- to develop the {{COLLATIONMAP:text|locale|level|clusters}} parser function (for later inclusion in the generation of the correct "column headings" in the lists displayed by category pages, because for now these headings are almost useless for anything else than English, or in heavily populated categories).

The second function will be separately integrable in the display of category pages (before or after the schema has been updated for stored sort keys), but can be tested separately as well. And anyway, it will already be a very useful general clear-text function for use in wiki templates that are trying to equate "equivalent" strings (for example in a {{#switch:}}, even better than {{LC:text}}) and will be very convenient for some languages like Arabic, Hebrew and Russian that have optional diacritics in their orthographies (such as cantillation marks, dots for vowels, and tonic accents) or that may contain compatibility ligatures or display forms (e.g. narrow/wide kanas and square clusters of kanas in Japanese, "compatibility jamos" in Korean).

The name of this second function could be abbreviated as {{CMAP:}} (it does not matter, but it should be implemented based on its functional description by Unicode in one of its UAX) or could be localized as well like other "magic" wiki keywords. And it probably should use {{CONTENTLANGUAGE}} as the default value for the second parameter "locale" (because {{int:lang}} may cause rendering problems in some pages that do not want to depend on the user's prefered language). And by default the "level" parameter will be 1 (could be up to 3, any higher value will return the same text without performing any mapping because higher collation values have to contain all the collation differences), and the default "clusters" will be unlimited.

Comment 189 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-07-25 17:22:32 UTC

(In reply to comment #186)
> That'a a bad assumption : even the highest quality collations will need to be
> updated from time to time:
> - Unicode evolves and new characters get encoded (new versions are published
> about every 1-2 years after synchronization and final balloting at both ISO WG2
> and the UTC.
> - The content of the Unicode DUCET is NOT stable: characters are inserted in
> the sequence so that the full list of collation weights needs to be offseted
> where the new characters get inserted.
> - Collations for languages get corrected. We should be able to upgrade these
> rules when the CLDR project produces new tailorings (CLDR updates are published
> separately, about every 1-2 years.)

It's not critical for most wikis to use the very latest collations, though.  The English Wikipedia, for example, will do fine with even very out-of-date collations, since the article names overwhelmingly use characters that have been in Unicode for an extremely long time.  The same is true for most other large wikis; but on the other hand, to change collations on small wikis will be fast in any event.

However, Tim Starling independently suggested a cl_collation column, and my initial implementation does have one.  The current one doesn't allow the same row to be stored multiple times with different collations, so sorting might still be wrong as the collation is updated, but this might not be a big problem in practice.  If it is, we can go with your idea of extending the primary key to (cl_collation, cl_from, cl_to) instead of (cl_from, cl_to).

> Anyway you did not reply to the idea of first developin the parser functions
> and test them. Developping the SQL schema extension should not be attempted
> before at least the first function {{SORTKEY:text|locale|level}} has been fully
> developed and tested on specific pages (it can be tested easily in tables).
>
> The second function {{COLLATIONMAP:text|locale|level|clusters}} is not needed
> immediately to develop the schema, but will be useful to restore the
> functionality of headings. Headings don't need to be stored as they can be
> computed on the fly, directly by reading sequentially the sorted result set
> from the SQL query:

I tried to figure out what you were talking about here, and eventually figured out that you just want me to expose Language::convertToSortkey() and Language::firstLetterForLists() (as they're currently called) as parser functions, for the benefit of lists.  That might be useful, and should be easy to do, although it's not part of my assigned job here.

> You can compute headings from the returned page names, or from the existing
> stored "cl_sortkey" which should be used now ONLY to store the plain-text
> specified in articles with {{DEFAULTSORT:sortkey}} and
> [[category:...|sortkey]].

I've introduced cl_raw_sortkey to store that, while retaining cl_sortkey for actual sorting.  Based on your feedback, it might make more sense to rename cl_raw_sortkey to cl_sortkey_prefix, put {{defaultsort:}} into it, and make it the empty string by default.

(In reply to comment #187)
> This does not require any change in the schema. This can be made immediately by
> MediaWiki developers and will not influence the developement of all corrections
> needed for bug #164 itself.

Correct.  I'll probably do it anyway in the course of the work I'm doing, though, since I'll be rewriting CategoryPage.php's sort logic in any case.

> For example in people's names whose page name is "FirstNames LastName" but that
> we want to sort as if they were "LastName, FirstNames" by indicating only
> {{DEFAULTSORT:LastName !}} (it should not needed to include the FirstNames in
> the wiki text, as this sort hint will not be unique and the group of pages
> using the same hint will still need to sort within this group using their
> natural order).

I can append the page title to cl_raw_sortkey before computing cl_sortkey.  That shouldn't be a problem.  As noted above, maybe renaming it to cl_sortkey_prefix would make more sense.

(In reply to comment #188)
> An "interface" function is DEFINITELY NOT an "implementation" detail. My
> comment #180 _correctly_ describes and summarizes what is really wanted.

Unfortunately, it's hard for me to understand what you're saing.  However, I think I've got it now, at least mostly.  I don't think parser functions are essential, here, and they're not strictly part of this bug.  They can be added after the initial implementation is finished.

> It correctly explains the dependancies and why any change in the SQL schema can
> be and should be delayed. In fact I consider the step (1) in my plan to have a
> high priority on all the rest, and it does not imply any immediate change in
> the schema to support it.

Writing those functions is not part of my job.  I expect the i18n people will handle that.  I'm just doing a prototype, which is agnostic about what sortkey function you give it.  My current sortkey function is just strtoupper(), for testing.  (This does improve English sorting when $wgCapitalLinks is off.)

> - to develop the new schema for stored sortkeys, based only on the internal PHP
> functions implementing {{SORTKEY:text|locale|level}}. The schema should NOT be
> deployed before the collations have been tested extensively by users and their
> internal data structures reflect the wanted collations order and tailorings.

This is basically what I'm doing, except I'm not going to work out exactly what sortkey function to use.

> - to develop the {{COLLATIONMAP:text|locale|level|clusters}} parser function
> (for later inclusion in the generation of the correct "column headings" in the
> lists displayed by category pages, because for now these headings are almost
> useless for anything else than English, or in heavily populated categories).

I'm not doing this part, but I'm setting up the framework so that i18n people can work on it (minus the parser function, which can be added later).  At least, if your COLLATIONMAP is meant to be anything like my Language::firstLetterForLists().  I'm not sure if it is.  If it's not, please explain better, because I don't follow.

Comment 190 Philippe Verdy 2010-07-26 18:38:10 UTC

Yes Language::firstLetterforList(s) maps more or less to COLLATIONMAP, but COLLATIONMAP is a more generic concept which reflects what is defined in Unicode standard annexes, which speaks about various mappings (including collan mapppings, but also case mappings)

One bad thing about the name Language::firstLetterforList(s) is that it implies that this should only be the first letter. In fact, for many locales, the significant unit is not the letter, but the collation element (for exemple digrams like "ch" or trigrams like "c’h").

For some categories, it should be convenient also to be able to use longer substrings, containing more than one grapheme cluster (in Wiktionnary for lists of terms belonging to a language, or in lists of people names, a category may need to be indexed and anchored with section headings containing the first 2 or 3 grapheme clusters, because the first grapheme is not discriminant enough and will remain identical an all columns of the disaplyed list on one page, and even sometimes on several or many successive pages : the first letter heading does not help, and is just an unneeded visual pollution)

For other categories that have very few contents, too many headings are frequently added that also appear as pollution. Being able to suppress all of them, by specifying 0 graphemeclusters in that category will better help readers locate the wanted item.

The collation map also has several levels of implementations, which match exactly the same levels as collation levels used to generate sort keys.

----

About sort keys now:

As sort keys are intended to be opaque binary objects, they do not qualify as being used directly as a parserfunction, without being exposed by a serialization to safe Unicode text, even if it means nothing for reader. That's why I proposed a Base-36 mapping to plain ASCII which will still sort correctly in binary order, and for use in sortable tables, but it could use any arbitrary Base that sorts correctly using binary ordering, and that uses ONLY valid Unicode characters.

The chosen base should be easy to compute, but all the standard Base-64 variants do not qualify (there's no warranty for the last two "letters" of all base-64 variants). We could use Base-62 (using all 10 digits, and the 26 pairs of Basic Latin letters), or Base-32 (simpler to compute but will generate longer texts). The only intent is not really to make the string visible in pages, but to help in the generation of accurate sort keys in sortable columns.

For now these sort keys are generated by templates as invisible text spans (using a CSS style="display:none" attribute), but ideally, the templates used in sortable tables that generate custome sortkeys should put them in some HTML attribute that can be specified on table cells, and that the Javascript will use directly. In my opinin, these opaque strings should be as compact as possible, but still safe for use inclusing in pages, and directly usable by simple Javascript without requiring any complex reimplemementation of UCA in the Javascript code.


Why do I think that exposing the functions as parser functions will be useful ? that's because it allows the implementation to be tested extensively on lots of cases, but only within a limited set of pages, long before the schema is developed, finalized and finally deployed.

In other words, it will not block the development of the schema update, as long as we agree about what are the essential functions to support, i.e. their interface that will be exposed (partly) in parser functions.

Also because I'm convinced that the exposed parser functions will not have this syntax changed, and that what they return will be well known:

- The {{COLLATIONMAP:}} function is very well described and will effectively return humane-readable text. Its formal implementation should follow the standard Unicode definitions.

You can expose it at least in a test wiki where you'll be able to track very easily the result and progress of its implementation (just create a page containing test words in various languages, arranged in a Wikitable).

- The {{SORTKEY:}} function can be exposed as well (and tested on the same test page for various languages, using the same list of words). Its result will be opaque for humane and compact. It will be easy to assert that it generates the expected sort order by using it in a sortable wikitable.


Both functions will be deployable rapidly, even on wikis that won't want to apply the schema change (so they will continue to use a single collation order for ALL their categories, and will anyway be able to sort specific categories using another supplied locale matching the category name).

If you think about it, changing the SQL schema may be rejected at end by lots of people. Exposing the parser functions will provide a convenient alternative that can be deployed much more simply, and with MUCH LESS risks, using the existing facilities offered by [[category:...|sortkey]] and {{DEFAULTSORT:sortkey}}, except that their parameter will be computed using the exposed {{SORTKEY:}} function:

  {{DEFAULTSORT:{{SORTKEY:text|locale|level}}}}

or:

  [[category:...|{{SORTKEY:text|locale|level}}]]

both being generalizable through helper templates.


There is such existing helper template named [[Modèle:Clé de tri]] in French Wiktionnary, that will NO LONGER need that we pass the article name without accents or extra punctuations or apostrophes (ignored at collation level 1), this parameter becoming ignored. Currently we use the template like this:

  {{clé de tri}}

when the article name contains nothing else than basic Latin letters or spaces, otherwise we have to pass:

  {{clé de tri|Basic latin}}

And we contantly need to verify that the passed parameter is correct. Instead the template would just generate this very simple code:

  {{DEFAULTSORT:{{SORTKEY:{{PAGENAME}}}}}}

As the {{SORTKEY:text|locale|level}} will use by default:

  locale={{CONTENTLANGUAGE}}|level=1

this will be sufficient for French Wiktionnary. In fact it may also be sufficient in English Wikipedia.


But in Chinese Wikipedia, one may still want to be able to use:

  {{DEFAULTSORT:{{SORTKEY:{{PAGENAME}}|{{int:lang}}}}}}

to support the prefered collation order of the user (traditional radical/stroke order, simplified radical/stroke order,  Bopomofo order, Pinyin order)

(Note also that section headings ("first letter") will have to be "translated" to correctly report the "first letter" of the Pinyin romanization, because the page names listed will continue to display their distinctive ideographs ! The only way to do that is to use the collation mapping exposed by {{COLLATIONMAP:}})

But you'll note that it won't be possible to sort the categories using multiple locales, so the page will be stored and indexed by parsing it using {{int:lang}} forced to {{CONTENTLANGUAGE}}, which will just be "zh", using only the simplified radical/stroke order by default.

To support more collations, the categories will need to support them explicitly (but this would force to reparse the page multiple times, once for each additinal locale specified in the category).

The alternative would be to create multiple parallel categories, one for each sort order, but then each article will have to specify all these categories.

My opinion is that the same category should be sortable using different locales, and that's why they should support multiple sortkeys par indexed page, one for each specified locale. Some wikis will only sort on the {{CONTENTLANGUAGE}} by default, but the Chinese Wiktionnary will benefit of sorting automatically all categories using at least the default "zh" locale which is an alias for "zh-hans", plus the "zh-hant" locale for traditional radical/stroke order, "zh-latn"  for the Pinyin order, and "zh-bpmf" for the Bopomofo order.

The exact locale to which "zh" corresponds will be a user preference, but one will be able to navigate by clicking the automatically generated links that will allow them to specify the other collation orders supported specifically by the category or by default throughout the wiki project.

For example, the Chinese Wiktionnary will display links on the page showing the available choice as:

 Sort by : [default] | [simplified radical/stroke] | [traditional radical/stroke] | [pinyin] | [bopomofo]

How can this be possible ? Either the wikiproject specifies that all categories will support these 4 orders, or the category page will list explicitly the additional sort orders that will be supported by the category.

The [default] link will use the index prefixes specified in the existing syntaxes [[category:...|sortkeyprefix]] or {{DEFAULTSORT:sortkeyprefix}}

All the other links will display the list sorted using the additional locales specified, but will ignore the sortkeyprefixes specified in categorized pages or indirectely via templates.

To add the additional sort orders in a category, you'll just need to insert in the category page some code like:

{{SORTAS:zh-hans}}
{{SORTAS:zh-hant}}
{{SORTAS:zh-latn}}
{{SORTAS:zh-bpmf}}

No article needs to be changed, these additional sort orders will just discard/ignore the sortkeyprefix when generating the actual opaque sortkey (specified with {{DEFAULTSORT:sortkeyprefix}} or in [[category:...|sortkeyprefix]].

However if the wikiproject offers several project-wide default locales the sortkeyprefix specified in pages will be honored for ONLY for these locales, and made immediately visible as the preselected [default] link, in the choice of sort orders.



Lets say for example that the Chinese Wiktionnary wants to support by default only the "zh-hans" and "zh-hant" collations.

This means that all categories will contain [default] sort keys computed for these two collations, from the text consisting in:

  {{{sortkeyprefix}}}{{KEYSEPARATOR}}{{PAGENAME}}

if a sortkeyprefix is specified, or just {{PAGENAME}} if no sortkeyprefix is specified.

A constant {{KEYSEPARATOR}} will be used that should sort lower than every other valid text usable in {{{sortkeyprefix}}} or {{PAGENAME}}. Ideally, this should be a control character like U+000A (LF) or U+0009 (TAB), after making sure that:

- this character will never appear in valid {{{sortkeyprefix}}} or {{PAGENAME}} (Mediawiki already process blanks and convert them to plain SPACE)

- this character will have a NON-ZERO (ignorable) primary collation weight that is smaller than all other collation weights. Its primary collation weight should then be 1 (if needed the collation weights coming from the DUCET or from loalized tailoring will just have to be offseted by 1, if they are non-zero)

- this character will have a ZERO collation weight for all the remaining supported levels in each locale

For all the additional sort orders specified in category pages, the {{{sortkeyprefix}} will be ignored as well as the {{KEYSEPARATOR}}, so the pages will just be indexed on {{PAGENAME}}, within the specified locale.

In the example Chinese Wiktionnary a category specifying:

{{SORTAS:zh-hans}}
{{SORTAS:zh-hant}}
{{SORTAS:zh-latn}}
{{SORTAS:zh-bpmf}}

will generate 4 additional (non [default]) sort keys, that will add to the two sortkeys already generated for "zh-hans" and "zh-hant" except that they will ignore the specified sortkeyprefixes.

This means that it will generate up to 6 sortkeys: 1 or 2 for "zh-hans", 1 or 2 for "zh-hant", and only 1 for each of "zh-latn" and "zh-bmpf"

In the English Wiktionary or on Commons, that will only use the "en" default collation order (identical to {{CONTENTLANGUAGE}}), it will be possible to specify, for specific categories, an additional sort order when the category is directly related to a specific language.

By default, that category will be sorted using the English collation rule, but it will be possible to select the additional specified collation order (in which case the defaultsortprefix specified in indexed pages will be ignored, the list will just be shown by using the natural order of page names in the manually clicked sort order).

So the [[Category:Chinese]] in English Wiktionnary will be able to specify at least:

{{SORTAS:zh-hans}}

And the [[Category:German]] in English Wiktionnary will be able to specify at least:

{{SORTAS:de|2}}

And the [[Category:French]] in English Wiktionnary will be able to specify at least:

{{SORTAS:fr|2}}

And this should be enough to be able to view the natural order of these languages (French will require collation level 2 for correct ordering by grouping letters with the same accents, and sorting them in backward order in this level)...

Note also that if the categories is presented in any selected locale, the secntion headings will also be computed in that same locale, and with the same collation level. By default it will show only 1 grapheme cluster. But if needed you can specify:

{{SORTAS:fr|2|3}}

to indicate that the category should use the first three French grapheme clusters (determined from collation mappings) for the headings, if the category is heavily populated, so you'll get the following headings:

a, à, â, ... aa, aar, ab, aba, abc, ac, aca, ace, acé, ... ad, ae, aé, ... b, ba, bac, bad, baf, bag, bai, bal, bam, ban, bar, bas, bat, bât, ....

(note that at level 2, the same heading will contain all words sharing the same letters and accents. Case will be ignored.

Headings don't need to be stored, they are generated on the fly from the index prefixes (returned in the result set, but only if this is one of the wiki's [default] sort orders, because otherwise it will be empty if the user selected a non-[default] sort order) and the pagename (also present in the result set).

Note that if you still want to present a category ordered so that all lowercase letters will sort tegether separately from all other uppercase letters, you'll need to indicate a separate collation order, by specifying an additional non-default locale in that specific category. This will look probalby not natural for most users, that's why it will be a distinct locale and not the default. For example to use it in English or in French:

{{SORTAS:en-x-uc}}<!-- ALL uppercase before lowercase in collation level 1-->
{{SORTAS:fr-x-lc}}<!-- ALL lowercase before uppercase in collation level 1 -->

This variant means that the natural tailoring is modified so that case differences will be considered as primary differences in that specific distinct localized collation. There will no longer be any ternary difference but there will be twice more headings generated (here as no maximum grapheme clusters is specified, only 1 grapheme cluster will be used in the headings, so you'll get the headings:

 A, B, C, D, ... Z, a, b, c, d, ... z, (in with en-x-uc)
 a, b, c, d, ... z, A, B, C, D, ... Z, (in with fr-x-uc)

And in both cases there will be no headings separating differences of accents (because we did not indeicate the collation level, the collation level remains 1)

Such options MUST use a standard BCP47 code, so the option needs to be at least 5 characters long after the language code, or must be used after the standard BCP47 suffix "-x-" for private extensions that are not indicating a language.

Comment 191 Philippe Verdy 2010-07-26 19:07:34 UTC

In all this discussion it appears that the development can be made in two separate projects developped independantly.

You can continue to develop the SQL schema extension, provided that:

- you call a PHP function/method that will be developped separately in a "Collator" class, to compute the collation sortkeys for each locale and specified collation level.

- you call a PHP function/method that will be developped separately in a "Collator" class, to compute the collation mappings for each locale and specified collation level and maximum grapheme clusters, in order to generate the headings on the fly in the category list

- you think about a HTTP query parameter that will allow to change the locale (this parameter exists, it's "uselang", but another one will be needed eventually to specify sort options like "-x-lc" or "-x-uc" (the "-x-" separator will be used impliclty. So the HTTP query may contain: &uselang=fr&opt=lc). These parameters will be used later when you'll be able to store multiple sortkeys.


Separately, another project will implement the Collator class, that will load the DUCET table, load the tailorings designed per locale or locale+"-x-"+option, and prepare the table of collation elements and their associated collation weights for each level. Then it will implement the methods:

- Collator::Collator(locale, level=1) which instanciates a Collator object for the specified locale string (language + optional options) and the collation level (it does not need to prepare the full table of colaltion weights, just those needed for the specified level). The default level will be 1.

- Collator::sortkey(text) which returns the opaque collation key (a binary array of bytes) for the specified Collator instance.

- Collator::map(text, maxelements='') which returns the humane readable text after applying the collation mapping associated to the Collator instance. It will stop after returning at most 'maxelements' collation elements in the remapped string, or will process all the text if maxelements is null or not specified. For category headings, you'll give by default maxelements=1 (unless another calue was specified when preparing a specific locale with the extra option of {{SORTAS:locale|maxelements}} in the visited category page.


Separately another simple project will use the Collator class to expose them in parser functions. In fact this third project will be very easy and fast to complete, even if the Collator class is not fully developped with all its options.

A very basic Collator class can be used to develop the parser function extension that will expose:
-  {{SORTKEY:text|locale|level}}
   defaults: locale={{CONTENTLANGUAGE}}|level=1
- {{COLLATIONMAP:text|locale|level|maxelements}}
   defaults: locale={{CONTENTLANGUAGE}}|level=1|maxelements=<!--empty/null-->

With it you can immediately deply it on a test wiki server, where you'll build test lists in a "sortable wikitable". and you can continue building the Collator class. to support various locales and all the needed options.

Comment 192 Philippe Verdy 2010-07-26 19:20:35 UTC

Note that this is the minimum functional interface for the Collator class. Other methods will be certianly needed to cache the already prepared collators, or to cache the preloaded tailoring tables and the common DUCET.

In other words, the Collator::Collator() constructor is in fact probably using the service of a CollatorFactory class that will cache the needed shared data. This will save a lot of overhead and will be important for the performance.

This factory should NOT be used directly by the SQL schema extension, that MUST only use the Collator::getCollator(locale, level) static function which will use the CollatorFactory in order to instantiate the Collator object through its constuctor.

The CollatorFactory will use ICU, most probably, via its already existing PHP-extension interface module. But a basic PHP-only implementation is still possible without using ICU immediately, or if MediaWiki must be kept compatible with existing PHP servers that can't load such native extensions (using DLLs or shared libraries or requiring to recompile/relink PHP itself) in their installation.

The CollatorFactory should NOT prepare all supported locales. Only those that are requested **on demand**.

Comment 193 Philippe Verdy 2010-07-26 19:41:57 UTC

Exemple of interface for you:

<?php

require('.../CollatorFactory.php');
$wgCollatorFactory=CollatorFactory(); // this is the only global PHP variable

...

// prepare a Collator object
//(note that $locale is NOT case sensitive in BCP47)

$locale = $wgContentLanguage; // use {{CONTENTLANGUAGE}} by default
if (isset($options) && $options > '')
  $locale .= '-x-' . $options; // append sort options if any (like 'uc', 'lc')
$level = 1; //use collation level 1 by default

$collator = $wgCollatorFactory->get($locale, $level);

...

// use the collator object to compute opaque sortkeys (opaque binary string)
// to store in categories:

$sortkey = $collator->sortKey($text);

// optionally map it to plain-text, if the database can't store and
// sort VARBINARY(N) fields, but only VARCHAR(N) objects:

$sortkey = base32($sortkey);

...

// use the collator object to compute a heading on the fly from a pagename:

$collationElements = 1; // by default generate only 1 collation element

$heading = $collator->map($pageName, $collationElements);

Comment 194 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-07-26 19:49:15 UTC

(In reply to comment #190)
> Yes Language::firstLetterforList(s) maps more or less to COLLATIONMAP, but
> COLLATIONMAP is a more generic concept which reflects what is defined in
> Unicode standard annexes, which speaks about various mappings (including collan
> mapppings, but also case mappings)

Which particular Unicode standards are relevant here?  It would probably be a good idea for me to take a look at them.

> For some categories, it should be convenient also to be able to use longer
> substrings, containing more than one grapheme cluster (in Wiktionnary for lists
> of terms belonging to a language, or in lists of people names, a category may
> need to be indexed and anchored with section headings containing the first 2 or
> 3 grapheme clusters, because the first grapheme is not discriminant enough and
> will remain identical an all columns of the disaplyed list on one page, and
> even sometimes on several or many successive pages : the first letter heading
> does not help, and is just an unneeded visual pollution)
> 
> For other categories that have very few contents, too many headings are
> frequently added that also appear as pollution. Being able to suppress all of
> them, by specifying 0 graphemeclusters in that category will better help
> readers locate the wanted item.

This is all beyond the scope of what I'm currently doing, but it should be possible to add on without too much trouble as a separate project.

> Why do I think that exposing the functions as parser functions will be useful ?
> that's because it allows the implementation to be tested extensively on lots of
> cases, but only within a limited set of pages, long before the schema is
> developed, finalized and finally deployed.

This presupposes that the sortkey generation algorithm is settled upon before the database changes are.  In fact, it's exactly the opposite: I've been contracted to do the backend changes, and other people will figure out how exactly to generate the sortkeys later.  Really, we could do the changes in either order.

> Both functions will be deployable rapidly, even on wikis that won't want to
> apply the schema change (so they will continue to use a single collation order
> for ALL their categories, and will anyway be able to sort specific categories
> using another supplied locale matching the category name).
> 
> If you think about it, changing the SQL schema may be rejected at end by lots
> of people.

The schema change will be part of the core software and will not be an optional update.  Anyone who doesn't want to accept it will likely have to stick with 1.16 forever, because we're not going to support the old schema.

> Exposing the parser functions will provide a convenient alternative
> that can be deployed much more simply, and with MUCH LESS risks, using the
> existing facilities offered by [[category:...|sortkey]] and
> {{DEFAULTSORT:sortkey}}, except that their parameter will be computed using the
> exposed {{SORTKEY:}} function:
> 
>   {{DEFAULTSORT:{{SORTKEY:text|locale|level}}}}
> 
> or:
> 
>   [[category:...|{{SORTKEY:text|locale|level}}]]
> 
> both being generalizable through helper templates.

It's not particularly less risky.  It does encourage each wiki to hack around the problem locally instead of pushing for a proper fix in the software.  You shouldn't have to do DEFAULTSORT on any pages where the right sorting is human-detectable -- it should only have to be for things like "Abraham Lincoln" being sorted as "Lincoln, Abraham", which the software can't do automatically.

> (Note also that section headings ("first letter") will have to be "translated"
> to correctly report the "first letter" of the Pinyin romanization, because the
> page names listed will continue to display their distinctive ideographs ! The
> only way to do that is to use the collation mapping exposed by
> {{COLLATIONMAP:}})

Surely the most sensible idea is just to disable the section headings altogether for CJK?

> My opinion is that the same category should be sortable using different
> locales, and that's why they should support multiple sortkeys par indexed page,
> one for each specified locale. Some wikis will only sort on the
> {{CONTENTLANGUAGE}} by default, but the Chinese Wiktionnary will benefit of
> sorting automatically all categories using at least the default "zh" locale
> which is an alias for "zh-hans", plus the "zh-hant" locale for traditional
> radical/stroke order, "zh-latn"  for the Pinyin order, and "zh-bpmf" for the
> Bopomofo order.
> 
> The exact locale to which "zh" corresponds will be a user preference, but one
> will be able to navigate by clicking the automatically generated links that
> will allow them to specify the other collation orders supported specifically by
> the category or by default throughout the wiki project.

This is doable if it's desired, as long as the number of locales is very limited (like four, not a hundred).  However, it will not be part of my initial implementation unless Tim asks me to do it.

> In the English Wiktionary or on Commons, that will only use the "en" default
> collation order (identical to {{CONTENTLANGUAGE}}), it will be possible to
> specify, for specific categories, an additional sort order when the category is
> directly related to a specific language.

This is certainly a useful feature for some wikis (especially Wiktionaries), and it could be added fairly easily.  It might make it into my initial implementation.

(In reply to comment #191)
> In all this discussion it appears that the development can be made in two
> separate projects developped independantly.
> 
> You can continue to develop the SQL schema extension, provided that:
> 
> - you call a PHP function/method that will be developped separately in a
> "Collator" class, to compute the collation sortkeys for each locale and
> specified collation level.
> 
> - you call a PHP function/method that will be developped separately in a
> "Collator" class, to compute the collation mappings for each locale and
> specified collation level and maximum grapheme clusters, in order to generate
> the headings on the fly in the category list

This is what I'm doing.

> - you think about a HTTP query parameter that will allow to change the locale
> (this parameter exists, it's "uselang", but another one will be needed
> eventually to specify sort options like "-x-lc" or "-x-uc" (the "-x-" separator
> will be used impliclty. So the HTTP query may contain: &uselang=fr&opt=lc).
> These parameters will be used later when you'll be able to store multiple
> sortkeys.

These don't need to be developed until that feature is actually implemented, which will probably not be in my initial implementation, unless Tim Starling asks me to.

Comment 195 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-07-26 19:50:19 UTC

By the way, I'm keeping track of progress here:

http://www.mediawiki.org/wiki/User:Simetrical/Collation

Comment 196 Philippe Verdy 2010-07-26 20:37:46 UTC

Note that the CollatorFactory may fail to locate the specified locale for which a collator is being requested. Additionally, several locales may share exactly the same collator.

In all cases, the collatorFactory will return a valid Collator object, whose locale can be identified: the Collator object returned by:

$collator = $wgCollatorFactory->get($locale, $level);

will have a property that will contain the effective locale code (normalized) and an other property containing the collation level from which it was effectively built. You should be able to access it simply with something like:

$effectiveLocale = $collator->locale();
$effectiveLevel = $collator->level();

after just getting the collator instance from the factory.

This may be useful to avoid storing duplicate equivalent binary sortkeys, or simply to determine which effective locale to use in SQL select queries (to retrieve the sorted list of pagenames ordered by a specified locale), when the SQL schema will be able to store several sortkeys for the same page in the same category.

The factory will also instanciate a collator with an effective locale and an effective collation level only once, caching it in an internal array, for repeated use.

This will save the complex preparation of tables, and will avoid building tables for all supported languages (for example in Commons where lots of languages may be desirable, weahc one with possibly several sort options, or supported conversions to other scripts or script variants).

The factory however should probably be able to load the DUCET table associated to the CLDR "root" locale completely and immediately when it is first instanciated and stored in the global variable (there's probably no need to test this each time vecaue of lazy initializations with null member fields); and it should most probably build the default collator (for $locale=$wgContentLanguage, and $collationlevel=1) immediately, storing it in the first position of its member array of already prepared Collator instances.

But you may think the opposite, in order to speedup the server startup by some (milli-)seconds or reduce the initial CPU/memory stress in the garbage collator of PHP. However I'm not convince that the server will be ready faster, and the extra tests that will be performed at each use of the $wgCollatorFactory->get() method may impact the performance at runtime...

Note also the ICU uses the same approach of a CollatorFactory to build and cache reusable Collator instances, because it's a proven good design pattern for implementing and using collators.

A collator object may also be used to compare to texts without even generating their sortkeys, or without mapping them, so it may help to include in the Collator interface this method:

$collator->compare($text1, $text2);

that will return an integer (in other words, a Collator also implements the Comparator interface), by parsing $text1 and $text2 collation element by collation element up to the end at level 1, comparing their collation weights only at theis level, before restarting with the next level. When the collator was instanciated at level 1, the successive collation elements need not be stored, but for higher levels, it helps if they are parsed only once and kept in an indexed array that will allow faster lookup for the next levels in the table of collation weights for these levels.

Comment 197 Philippe Verdy 2010-07-26 21:01:11 UTC

I note that you have started to define the Collator class as the Language class.

Well I'm not sure that this should be the same class: a language has several collators, even if a collator is for only one language (including a virtual language like "root" which is the base of all localizations including English, with lots of common properties that are shared in multiple languages derived from root).

Also I note that the collator you define is a stub. Great for your development in the SQL schema, but someone else must work in implementing the CollatorFactory, using ICU, or a full PHP implementation, or only a stub, depending on the PHP script path you'll require() to load this factory class.

Technically, collators also share lots of properties across several languages. This is why it should be separated from the Language class.

May be the Language class can query the CollatorFactory about the list of collators (i.e. their complete locale code, including options) that are supported by that language, using an interface method of the loaded global CollatorFactory instance.

This way, a stub factory, building a single Collator instance will still work immediately.

Someone with expertise in Unicode collations will help you build the Factory if you want a PHP-only implementation.

Someone else with ICU expertize could also work on its existing integration in PHP (needs testing in the MediaWiki installation scripts for deployments) and to use it in the CollatorFactory and in the Collator class.

The most tricky development is the Collator::map() function, because Unicode does not describe the algorithm completely.

I know the principles and could offer help if you give some pointers where it would be beneficial.

But really, to test a complete implementation of the CollatorFactory, you'll need to be able to expose it in the two optional builtin parser functions that I described. As this can be clearly developped separately even if you start with a stub for the factory, and tested with the very basic Parser Functions installed on a test server.

So I maintain my position for the non-risky ParserFunctions (notably also because it will be simpler for an existing non-Wikimedia installation of MediaWiki to install the simple functions only, without upgrading to the new schema immediately, knowing that the factory can also be a basic stub as well on these installations).

Comment 198 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-07-26 21:43:21 UTC

(In reply to comment #187)
> Drop this tacking immediately in the PHP code: the cl_sortkey is only intended
> to store the clear-text sortkey "hint" specified in {{DEFAULTSORT:sortkey}} or
> [[category:...|sortkey]] to override the sort order, and this clear-text
> sortkey is not really a true sortkey but a sort hint to override the default
> sort order. 
> 
> For example in people's names whose page name is "FirstNames LastName" but that
> we want to sort as if they were "LastName, FirstNames" by indicating only
> {{DEFAULTSORT:LastName !}} (it should not needed to include the FirstNames in
> the wiki text, as this sort hint will not be unique and the group of pages
> using the same hint will still need to sort within this group using their
> natural order). Even in that case, there's no need to bogously tack the cl_from
> field in the stored field.

How do you propose this be implemented?  We would need some character that sorts before all characters that can legitimately occur in a sort key.

(In reply to comment #197)
> I note that you have started to define the Collator class as the Language
> class.

This is completely provisional and can easily be changed later.  As far as I understand, I'm not the one who will work on it.

> But really, to test a complete implementation of the CollatorFactory, you'll
> need to be able to expose it in the two optional builtin parser functions that
> I described. As this can be clearly developped separately even if you start
> with a stub for the factory, and tested with the very basic Parser Functions
> installed on a test server.
> 
> So I maintain my position for the non-risky ParserFunctions (notably also
> because it will be simpler for an existing non-Wikimedia installation of
> MediaWiki to install the simple functions only, without upgrading to the new
> schema immediately, knowing that the factory can also be a basic stub as well
> on these installations).

Upgrading the schema will be handled by running maintenance/update.php, which must be run on every update anyway.  We typically have a number of schema changes in every release, and those always must be applied when deploying the new code.  So this is not an issue.

Comment 199 Philippe Verdy 2010-07-26 22:23:47 UTC

>> (Note also that section headings ("first letter") will have to be "translated"
>> to correctly report the "first letter" of the Pinyin romanization, because the
>> page names listed will continue to display their distinctive ideographs ! The
>> only way to do that is to use the collation mapping exposed by
>> {{COLLATIONMAP:}})
>
>Surely the most sensible idea is just to disable the section headings
>altogether for CJK?

Don't need to do that.

The Collator instance returned by the factory for $locale="zh-latn" (which sorts using Pinyin) just has to return an empty string for its map() method, as long as this is a stub which can't safely map ideographs to Latin initial consonnants of the Pinyin syllable.

Note that the syntax of the Latin Pinyin ampping should be quite similar to the syntax used in Minnan as the Minnan syllables have more or less the same structure as the Pinyin syllables. for ideographs that are not supported, in the Pinyin romanization, there will be no substitution so they will sort at end and will not start by a Latin letter.

Here also, it is possible to infer a common heading for all of them, such as their radical component, or just the first ideographic radical encoded in the DUCET with a non zero primary weight, or even the first stroke.

The first CJK stroke is:
31C0  ; [*10E0.0020.0002.31C0] # CJK STROKE T

But I should look at the exact sequence in the "zh" tailoring of the DUCET, in the CLDR database.

There's a demo page for "zh-hans" collation here (in the ICU project which is used by the Unicode's CLDR project as a reference implementation):

http://demo.icu-project.org/icu-bin/locexp?d_=fr&_=zh_Hans

The interesting part is the data for "Rules" which orders the examplar sinograms, where the data for "Set" just show them in the numeric codepoint order or as ranges of character with ascending numeric code point values.
But on both cases they just concentrate on the most common basic subset used in GB2312, this is not the complete set defined in GB18030 and Unicode.

For actual transforms from Sinograms to Latin (Pinyin) there's this demo page:

http://demo.icu-project.org/icu-bin/translit

To see how the DUCET orders ideographs, look at:

http://unicode.org/charts/collation/

The first sinogram (non-stroke) defined with a non-zero primary weight in the DUCET sequence is U+4E00 (一) at it seems that it provides a very convenient geading for every sinogram that we can't sort or convert to Pinyin.

Note that in all cases all the sinograms are at end of the DUCET, just before the unsupported/reserved/unassaigned characters.

The "zh-hans" is just moving all the other letters starting in the "Variable" subset upward (in fact it just moves upwards the letters starting at Latin), to fit the sinograms before them.

The "zh-Hant" collation is like "zh-Hans" but swaps some positions, according to their expected radical, but also because they differ in thei stroke count.

Only about 31000 sinograms have known transiptions to Pinyin (this number is progressing), all the other will then appear under the remaining heading group starting by U+4E00 (一), except those in the "CJK-Extension blocks" that should be listed under the heading U+3400 (㐀).

Most non-extension CJK have now a pinyin transcription, most CJK extensions don't (but they are also the rarest character used, so there should not be a lot of pages indexed there)...

>> For example in people's names whose page name is "FirstNames LastName" but that
>> we want to sort as if they were "LastName, FirstNames" by indicating only
>> {{DEFAULTSORT:LastName !}} (it should not needed to include the FirstNames in
>> the wiki text, as this sort hint will not be unique and the group of pages
>> using the same hint will still need to sort within this group using their
>> natural order). Even in that case, there's no need to bogously tack the cl_from
>> field in the stored field.
>
> How do you propose this be implemented?  We would need some character that
> sorts before all characters that can legitimately occur in a sort key.

This is exactly what UCA defines as a "tailoring". We have such a character available that we don't use in pagenames and in sortkey prefixes: Control characters.

Note that control characters are present in the DUCET, and they are NOT ALL ignorable. The first of them is TAB (U+0009) and it is the first character that we DON'T USE, and that has a primary weight in the DUCET and that is not ignorable or null.

All the characters have null weights are listed wihin the NULL group, which is immediately followed by ignorable characters.

TAB has a primary weight of 0201 in the DUCET (it comes far later, after all the ignorables)

The first ignorable character has a primary weight 0021, so the primary weight 0001 is free to serve as the separator.

All we have to do is then to tailor the primary weight of TAB to assign it the weight 0001 instead of 0000. In that case, the souce text to give to Collator:sortKey() just has to use the TAB character between the two fields.

If there's no sortkey prefix, don't generate the TAB, use directly the pagename:
<? php

require('.../CollatorFactory.php'); // choose the script for the implementation
global $wgCollatorFactor = CollatorFactory();

...
$collator = $wgCollatorFactory->get($locale, $level);

...
if ($sortkeyprefix != '')
  $text = $sortkeyprefix + '\t' + $pagename;
else
  $text = $pagename;
$sortkey = $collator->sortkey($text);

//optional to convert VARBINARY(N) to VARCHAR(N) (depends on SQL backend)
$sortkey = varbinaryToVarchar($sortkey); // e.g. Base-32

// you may also need to truncate to N characters,
// to fit the SQL sortable field maximum length constraint
// (according to the schema for this SQL backend)...

Comment 200 Philippe Verdy 2010-07-26 22:44:37 UTC

Note that if your collator stub does not really compute true sort keys (compacted in binary format and representing collation weights), it may return spaces (U+0020), that is represented by byte 0x20 in the UTF-8 encoding. The separator used must be lower than this encoded character, so the VARCHAR(N) database field has to accept this character.

If it does not, you may offset all the UTF-8 bytes returned by your stub by 1 (because UTF-8 bytes never use the byte value 0xFF), so that you'll be able to ue the byte value 0x20 for the separator (the UTF-8 encoded SPACE will be stored in the sortkey field as 0x21 after it has been offseted).

This does not matter because sortkeys are opaque, so this is stimple to do in the stub. The stored sort key field DOES NOT have to use the UTF-8 encoding explicitly, it must just accept binary encoded bytes that can fit a non-Unicode character. If the database wants you to specify a charset for this binary sortable field, use ISO-8859-1, as long as the database will not alter the binary order of ISO-8859-1 when computing ORDER BY clauses.

I really hope that we will always have the possibility of using VARBINARY(N) fields allowing compact random byte values, because it will be much more efficient for our use of opaque binary sortable sort keys (that are just streams of bytes).

Be also careful about the sign of bytes (i.e. their range value), otherwise byte 0x80.0xFF will sort before 0x00..0x7F in the order by clause. This should be the case if the field is declared to use the ISO 8859-1 encoding with its binary order.

Which SQL backends (and dialects) do you support in MediaWiki ? This may help me understanding some development constraints. I know you support MySQL, but are there simpler backends like BerkeleyDB, dBase files, or flat text files for small projects supported only via some ODBC driver with a PHP interface ?

Comment 201 V85 2010-07-26 23:02:33 UTC

For those of us who are less technical, but keen to see this bug resolved, is there somewhere where we can submit alphabets, so that they may be added for collation? I.e. where we submit the sorting order of specific languages for use with {{sort:en}} (or whatever the mark-up will be)?

Comment 202 Philippe Verdy 2010-07-28 22:06:54 UTC

I would suggest instead making those proposals in the CLDR project.
It already contains lots of data for many languages, related to their expected "sort order" (in fact more generally about their collation).

Go to http://www.unicode.org/cldr and visit the existing ressources browser, and the demo of ICU which uses the same data.

Our collator should "in fine" use the CLDR tailorings that are already defined, with very minor changes (if they are really needed).

Comment 203 Philippe Verdy 2010-07-28 22:18:49 UTC

May I even suggest that the name of the new column does not even use "sortkey"

i.e. "cl_sort_prefix" instead of "cl_sortkey_prefix"

Why? because this column should NEVER need to convert that prefix itself to a binary sortkey, it should just still be the humane-readable prefix that was specified in {{DEFAULTSORT:sort_prefix}} or in [[Category:...|sort_prefix]].

This will avoid confusion, but it will also allow simpler recomputing of actual sortkeys for various locales, without having to parse the page again for each locale:

Note that when evaluating and expanding the parameters of {{DEFAULTSORT:sort_prefix}} and of [[Category:...|sort_prefix]] in the wiki code of a page (or when transcluding templates), the result text should NEVER depend on the UI language (so if {{int:lang}} is used in that parameter, it should evaluate always as if it was {{CONTENTLANGUAGE}}).

Comment 204 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-08-03 17:37:45 UTC

(In reply to comment #200)
> Which SQL backends (and dialects) do you support in MediaWiki ? This may help
> me understanding some development constraints. I know you support MySQL, but
> are there simpler backends like BerkeleyDB, dBase files, or flat text files for
> small projects supported only via some ODBC driver with a PHP interface ?

We primarily support MySQL.  We also have at least theoretical support for PostgreSQL, SQLite, MSSQL, Oracle, and DB2.  However, if this doesn't work for all DB backends, it's not a big deal, as long as it works with MySQL.

(In reply to comment #201)
> For those of us who are less technical, but keen to see this bug resolved, is
> there somewhere where we can submit alphabets, so that they may be added for
> collation? I.e. where we submit the sorting order of specific languages for use
> with {{sort:en}} (or whatever the mark-up will be)?

We'll be using some kind of pre-rolled algorithm from CLDR or such, so you don't need to do this.  If the deployed algorithm turns out to be deficient, and it's not a bug on our side, you should probably contact CLDR, not us.

(In reply to comment #203)
> May I even suggest that the name of the new column does not even use "sortkey"
> 
> i.e. "cl_sort_prefix" instead of "cl_sortkey_prefix"
> 
> Why? because this column should NEVER need to convert that prefix itself to a
> binary sortkey, it should just still be the humane-readable prefix that was
> specified in {{DEFAULTSORT:sort_prefix}} or in [[Category:...|sort_prefix]].

Yes, this stores the plaintext, unconverted prefix.  I don't think the name needs to be changed, though.  Both names work well enough.

> This will avoid confusion, but it will also allow simpler recomputing of actual
> sortkeys for various locales, without having to parse the page again for each
> locale:

Yes, that's intended to work.

> Note that when evaluating and expanding the parameters of
> {{DEFAULTSORT:sort_prefix}} and of [[Category:...|sort_prefix]] in the wiki
> code of a page (or when transcluding templates), the result text should NEVER
> depend on the UI language (so if {{int:lang}} is used in that parameter, it
> should evaluate always as if it was {{CONTENTLANGUAGE}}).

If people want to put crazy stuff in sortkeys that changes based on who's viewing it, we can't stop them.  Curly braces are evaluated at an earlier stage than category links, so we can't make them behave differently based on whether they're being used in a sortkey, I don't think.  You could also put {{CURRENTTIME}} in sortkeys, or any other silly thing like that.

Comment 205 Philippe Verdy 2010-08-04 17:09:44 UTC

(In reply to comment #204)
> If people want to put crazy stuff in sortkeys that changes based on who's
> viewing it, we can't stop them.  Curly braces are evaluated at an earlier stage
> than category links, so we can't make them behave differently based on whether
> they're being used in a sortkey, I don't think.  You could also put
> {{CURRENTTIME}} in sortkeys, or any other silly thing like that.

That's because the Mediawiki parser still operates on the wrong level, and performs text subtitutions always without ingoring the context of use. Not only this is bad, but this is also very inefficient, because each processing level is converting a fullpage to another fullpage that needs to be reparsed again at the next level.

A much better scheme based on a true gramatical analyser using a finite state automata would really help defining the state at which the parser (or its ParserFunctions extensions) operate, without ever having to create huge complete page buffers between each level (which uses costly appending operations with many memory reallocations).

In other words, when you start parsing "[[" you enter in a "start-of-link" state, which then parses "category:" until it has found a colon, in which case the whole prefix is case-folded and goes to the "in-category" state, then the parser scans up to the next "{" or "|" or "]". It can then correctly process all the text using such rules.

I have suggested since long that the MediaWiki syntax should be reformulated using a LALR(1) formal syntax, from which a finite-state automata can be automatically built to cover all the state information, and then ported to PHP (Yacc, Bison or even better PCTCS could do that without difficulty). Then instead of calling parsing functions that process all the text, then return the converted text for processing to the next level, it will do that in a much simpler (and faster) processing loop, calling much simple generation methods and altering its state in a much cleaner and faster way (no more need to append small lexical items to various buffers, the atoms will be pased from level to level using a chained implementation.

This would also significantly speedup the expansion of templates, and would allow the parser to make distintions when {{int:....}} is encountered in the context of a [[category:...]] (which would have temporarily forced the UI language to the CONTENTLANGUAGE, until the final "]]" token is encountered that would restore the UI language within the parser's internal state. this would also have significant performance benefits (for example, no more need to convert and expand all the parameters of {{#if:...}} or {{#switch:...}}, only convert them lazily when they are really needed for generating the effective output.

Yes this comment is going too far out of the topic of this bug, it is architectural. PHP has all the rools needed to support the construction of tree-like data: instead of passing only strings to parser generation functions, you would pass it an ordered array, whose elements are the parsed parameters, themselves being either strings or subparsed arrays containing strings and other arrays. The builtin functions would then request back to the MediaWiki parser the evaluation/expansion of ONLY the parameters they need, and MediaWiki would still be able to call the expansions of ONLY these parameters, recursively. Most of the items in the wikicode would then be atomic and processed lazily, only when they are effectively needed.

The "crazy" things like {{CURRENTTIME}} or {{time:...}} or {{int:...}} found within [[category:...]] and whose result depends on time or on user preferences would be easily avoided. This would also simplify a lot the management of whitespaces (if they need ot be trimmed and/or compressed depends on the builtin expansion function called by the parser.

Comment 206 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-08-04 17:14:25 UTC

(In reply to comment #205)
> That's because the Mediawiki parser still operates on the wrong level, and
> performs text subtitutions always without ingoring the context of use. Not only
> this is bad, but this is also very inefficient, because each processing level
> is converting a fullpage to another fullpage that needs to be reparsed again at
> the next level.

People have tried to fix this before, and failed miserably.  It is not practical in the immediate future, if ever.  Everyone likes to say how great it would be to write a "real" parser, but nobody ever seems interested in producing working code that actually replicates the current parser to a good enough level to maintain acceptable backward compatibility.  If you're interested in that, please give it a try and get in touch with us when you have a working implementation.  Otherwise, it's not worth discussing pointlessly yet again.

Comment 207 Philippe Verdy 2010-08-04 17:41:52 UTC

In reply to comment #206)
Top-down analyzers are very easy to write, provided that you allow parsing to be made via TWO generation points, instead of one: the first one converts and accumulates the tokens in an array as long as it is not able to be processable as a whole; this requiers a first parsing loop. Then the final step will ask the expansion of parsed tokens lazily, in a basic processing loop which will "eat" the tokens lazily, replacing them by their expansion, until there remains only one token, a basic string that can be used as an expanded parameter.

I may try working on it, it's not so complicate to formulate, even without using Yacc/bison/PCTCS, I have many ideas on how it will work : basically each non-terminal token in the formal syntax becomes TWO functions, one is used when parsing individual tokens, until the last one is reached, then a second function will be used to expand itself, that will be used later (after returning from the  BY the caller, and ONLY when it will need the expansion in its OWN expansion function.

For example the MediaWiki parser would become TWO functions: mediawiki_parse() (which returns an array of tokens) and mediawiki_expand() which is stored as the first token in the array. All ..._expand() functions would have two parameters: one for the context in which they are to be evaluated (containing for example time properties, cache properties, language properties ), and the array itself to be expanded up to the end until they produce some simple text The main concept is that tokens will expand lazily, but with full control by the non-terminal parsed token which knows its context of expansion.

Basically the first step for each non-terminal token in the formal grammar is performing a bottom-up conversion, from a top-down analysis (performed recursively); this first step builds a AST tree (stored as a basic array of children). The second step is performing a *conditional* top-down expansion of the AST array into a flat text, ignoring nodes that don't need to be expanded as their effective expansion will not be needed for the expansion of the current AST node. Both steps are recursive, but in separate simple processing loops.

OK this is still out of topic, but I don't know for now where to post these concepts. I'll stop discussing this here.

Comment 208 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-08-27 00:27:48 UTC

*** Bug 24953 has been marked as a duplicate of this bug. ***

Comment 209 Chad H. 2010-09-10 17:53:07 UTC

*** Bug 25129 has been marked as a duplicate of this bug. ***

Comment 210 Roan Kattouw 2011-03-09 13:58:51 UTC

*** Bug 24142 has been marked as a duplicate of this bug. ***

Comment 211 Mark A. Hershberger 2011-03-19 04:03:09 UTC

On wikitech-l, Roan writes 

> This support is now in MW with the category collation code Aryeh
> wrote, and developers can now proceed to write language-specific
> collations if they want to.

See http://permalink.gmane.org/gmane.science.linguistics.wikipedia.technical/52572

Comment 212 Lejonel 2011-07-15 19:53:46 UTC

(In reply to comment #211)
> On wikitech-l, Roan writes 
> 
> > This support is now in MW with the category collation code Aryeh
> > wrote, and developers can now proceed to write language-specific
> > collations if they want to.

So other collations are possible for categories. But this bug is also about sorting in other places. Is there any support for sorting lists in special pages? For example duplicate bug 353 is about Danish letters in Special:Allpages.

Comment 213 Aryeh Gregor (not reading bugmail, please e-mail directly) 2011-07-15 20:12:10 UTC

Getting sorting working in other places would require work, but we could use the same infrastructure.  For Special:Allpages, we'd need to add an extra column and index to the page table, which seems a bit excessive for such a minor special page.

Comment 214 Niklas Laxström 2011-07-16 11:51:15 UTC

There are lots of places where pages are listed - don't they also require or benefit form such index?

Comment 215 Aryeh Gregor (not reading bugmail, please e-mail directly) 2011-07-17 14:12:58 UTC

Maybe.  It depends.  Given the number of comments here and the number of people being spammed on every post here, though, I'd strongly suggest that people open new bugs for each place they want to have improved collation, rather than continuing to use this bug.  You can mark them as depending on this one so interested parties can find them.

Comment 216 Philippe Verdy 2011-07-17 16:50:16 UTC

Aryeh, Can you point me with a link where your implementation was added in MW ? I'd like to review it, and locate possible bugs or caveats.
Also because the announces integration does not give documentation details about how we can use it (Wiki code syntax in pages, possible support templates, integration points, server settings, possible user preferences, possible parser functions for generating helper data in the HTML so that collation will also work on the browser side because there's still no support anywhere of collation in Javascript and because most Javascript code found to handle special sorts are bogous and do not implement true UCA, except some HTML/Javascript applications using Ajax requests to perform sort on the server side).

---

Note: jQuery still does not have any "utility" function to support collation. It provides a ".sort()" method that requires a callback function in parameter to perform the actual compare(), but inefficiently as it still cannot use precomputed collation keys from actual table data, so the user-supplied compare function would recompute collations many times from the same data cell, while performing the sort).

I have asked to jQuery to extend its interface to allow precomputing collation keys in a first pass using a user-supplied callback, and then only supply these precomputed collation keys when comparing cells during the actual sort, but this is still a dead letter. I have also asked to jQuery to start a project for implementing a $.CollatorFactory object from which collators can be prepared and instantiated, and a common interface for these collators returned by the factory.

jQuery should then supply at least a very basic CollatorFactory (a stub implementing only binary sort). Then a separate subproject could start creating a more complete CollatorFactory that will fully support multiplevel UCA, that will integrate the DUCET table, and that will integrate support for locale-specific tailorings of the DUCET, before integrating a collection of tailorings (based on CLDR data, but not necessarily on the same format, and not necessarily by allowing a dynamic instantiation of any kind of tailoring using the CLDR syntax for them.

Note that the collection of tailorings (there are many locales in the CLDR) need not always be fully loaded into the browser Javascript: a server-side Ajax request could as well provide the necessary data to the browser (most probably in a JSON structure), for allowing tailored collators to be instantiated rapidly without having to parse the CLDR syntax).

The DUCET table dataset (most probably large) could as well be loaded to the browser in a delayed way as well through Ajax only on demand (and possibly only a part of it, under the control of the CollatorFactory sollicitated by specific collators that would just ask to the CollatorFactory to preload some segments of it, to save page load time and volume, and only when the Javascript will start running some collation function that requires this data to have been initialized and loaded.

I'd like to start my own project for that, and integrate it to jQuery, in a much better way than all what I've seen on the web (all the implementations I've seen for support of collation in Javascript are simply bogous, or at least very defective). Unfortunately, Javascript itself (or the most recent ECMAScript specification), has completely forgotten to develop the subject of collation, so it will take many years before we get correct browser-side sorting becoming standard in browsers: there's not even any ongoing project for that, or even just some preliminary specs, either at ECMA, or at W3C, Apache, jQuery... and even at Oracle, Microsoft, Apple, Mozilla and Opera).

Such support on the opposite is already present in Adobe Flash, in Microsoft's Silverlight (based on the .Net libray), and since many years in Java. This means that ALL web applications that want correct browser-side collation still need to run with browser plugins, and not with pure Javascript over the HTML/CSS DOM API. Given the ongoing efforts to deprecate these plugins in favor of HTML5, I don't understand why no such effort has ever been initiated.

Comment 217 Aryeh Gregor (not reading bugmail, please e-mail directly) 2011-07-18 18:25:02 UTC

The interesting code is here:

http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Collation.php?view=markup

That's the part that does the actual collation, which is really just handed off to the intl extension.  The part that actually hooks into category pages is in CategoryPage.php, but it's relatively boring.  I don't think there are any parser functions or anything else exposed to users right now.

If you have questions or comments, I strongly encourage you to post to wikitech-l, not this bug.  Bugzilla is used for bug tracking, not discussion, and this bug is fixed.  If you have further feature requests, like the addition of new parser functions that expose the sortkey functionality, file new bugs.  I am not going to answer any questions that are asked here.

Comment 218 Philippe Verdy 2011-07-20 08:54:53 UTC

OK, I immediately found a bug in the binary search function (which is defintely not beinary, has slower than expected convergence, and may even stale on testing the same index, due to incorrect handling of min/max indice after comparing):

291     function findLowerBound( $valueCallback, $valueCount, $comparisonCallback, $target ) {
292	                $min = 0;
293	                $max = $valueCount - 1;
294	                do {
295	                        $mid = $min + ( ( $max - $min ) >> 1 );
296	                        $item = call_user_func( $valueCallback, $mid );
297	                        $comparison = call_user_func( $comparisonCallback, $target, $item );
298	                        if ( $comparison > 0 ) {
299	                                $min = $mid;
300	                        } elseif ( $comparison == 0 ) {
301	                                $min = $mid;
302	                                break;
303	                        } else {
304	                                $max = $mid;
305	                        }
306	                } while ( $min < $max - 1 );
307	
308	                if ( $min == 0 && $max == 0 && $comparison > 0 ) {
309	                        // Before the first item
310	                        return false;
311	                } else {
312	                        return $min;
313	                }
314	        }

The authout should have better read how TESTED binary searches are implemented. This should really be:

    function findLowerBound( $valueCallback, $valueCount, $comparisonCallback, $target ) {
        $min = 0;
        $max = $valueCount - 1;
        while ( $min <= $max ) {
            $mid = ($min + $max) >> 1;
            $item = call_user_func( $valueCallback, $mid );
            $comparison = call_user_func( $comparisonCallback, $item, $target );
            if ( $comparison < 0 ) {
                $min = $mid + 1;
            } elseif ( $comparison > 0 ) {
                $max = $mid - 1;
            } else {
                return $mid; // or: $max = $mid; break;
            }
        }
        // $target not found, now $max < min (more exactly, $max = $min - 1).
        // $target is to insert between them: $max is the highest lower bound.

        if ( $max < 0 ) {
            // Before the first item
            return false; // Don't test with a simple if !
        } else {
            return $max; // May return 0 (lower bound on first item)
        }
    }

Note that the final test to return false is unnecessary and causes confusion on usage (if you had written this in C/C++ instead of PHP, you would not be able to make the distinction between false (no lower bound not found) and 0 (valid lower bound). Your code should not depend on those PHP hacks, which consists in returning values of distinct types (but with PHP's implicit promotion on basic types, it's really dangerous do do that) !

It's just simpler to let it return -1 (the content already in $max when we are before the first item), and then let the caller test the negative sign of the result, instead of comparing it then with false.

Comment 219 Philippe Verdy 2011-07-20 11:20:08 UTC

I note the following comment in r80443:
* Changed Title::getCategorySortkey() to separate its parts with a line break instead of a null character. All collations supported by the intl extension ignore the null character, i.e. "ab" == "a\0b". It would have required a lot of hacking to make it work.

If you had read my comments above, I had already spoken since long about the choice of the separator to use between the prefix and the rest of the full page name: this had to be a non-ignorable character, that is clearly lower than any character usable in the full pagename.

I had also stated which character to use, namely the TAB character that is the first non-ignorable (i.e. with non-all-zeroes weights). See my Comment #190 for the rationale (one year ago, much longer before your recent fix to use LF, because before yu had tried with NULL, which was wrong, and even without any separator which was also wrong).

You could have avoided these bugs, as I had analyzed it long before.

OK, LF (U+000A) is OK (because it is not allowed in full page names), even it is not the first one.

---

One word for performance and space: you have put too many fields in table "categorylinks". Only one essential think is needed there for referencial constraints and counting: the UNIQUE index on (cl_from, cl_to), which is architectural. As this index is unique, it should share the same store as the table itself. But SQL engines will never use more than one index per table to read it, because it requires additional random-access cursors to access the table. As the current unique index uses less fields than the table, it implicitly creates a random-access reference from this index to the table (basically, using internal rowids stored in the index, instead of reading directly from the table).

To avoid this problem, you have to make sure that the table is formatted so that its first fields match exactly the same order as the unique index. This is the case for the first unique index created just after that. So the table and the index share the same store, however the number of rows in the index per storage page will be limited (the table can use up to 1KB per row, this does not give a very compact and fast search in the index, as it will load too many storage pages in memory, even when using a "very" selective search).

You may optimize searches (at the price of storage space), by indicating that the unique index should still use a separate store. Normally this is the role of the specifier "PRIMARY" which indicates to not use a separate store (it is still UNIQUE). For now you don't have any PRIMARY index, only a UNIQUE index which only uses a separate index store.

Now comes the two main kind of requests performed by MediaWiki:

(1) getting the list of categories in which a page is currently listed (including when saving a page because this list must be flushed out and replaced by the categories found when parsing the page). Ideally, this should require at least an index on the cl_from. The UNIQUE index on (cl_from,cl_to) works perfectly there because it is in separate store. Other attributes in the table will not even be read, unless the SQL query wants to get these attributes (if this is the case, both the index and table store will be read, the index being read very rapidly and joined with the table using the internal table row id, stored in the index with the two keys). In all what I have seen in the MediaWiki code, this does not occur (including when rendering a page, to display the list of categories, it just needs this list but gets it from the page's wiki code parsing before saving it, and this gets saved in the page cache).

(2) getting the list of pages in a category. There you need to use collation as well because this list is sorted. Ideally, this should use an index and table that just contain the necessary fields, and that the SQL SORT BY clause will use those same fields, in the same order (and either all in ASCENDING, or all in DESCENDING order). This is the second index:

CREATE INDEX /*i*/cl_sortkey ON /*_*/categorylinks (cl_to, cl_type, cl_sortkey, cl_from);

Obviously, it cannot use the same index store as the table. The introduction of the "cl_type" field is recent. It supposes that you will be paging the category views by page type, in the order set by cl_type, which is:
ENUM ('page', 'subcat', 'file').

I wonder if this is correct and if it's the expected order. But anyway this design means that the current code will actually perform three separate requests, each one will be paged separately: one request to get the list of normal pages, another to get the list of subcats, and another to get the list of files. This will be inefficient, and hard to manage for the paging, unless these three types can be effectively paged separately in the MediaWiki HTML UI (this is not the case, as of today).

So if we can only do paging on all categorized pages, whatever their type, the "cl_type" field in the index causes the "SORT BY cl_sortkey" clause in the SQL query to build a clostly temporary index, instead of reading directly from the index where it will finally get the "cl_from" field to display.

So as long as we cannot have separate paging by type, I really suggest that you either drop the "cl_type" field from this index, or that you place it after "cl_sortkey" needed for fast "ORDER BY". I.e. :

CREATE INDEX /*i*/cl_sortkey ON /*_*/categorylinks (cl_to, cl_sortkey, cl_from);

My opinion if that "cl_type" is not even needed there, the page id stored in "cl_from" already has this information, because it will be always be used as a reference to the "pages" table to get the actual page name to display, as well as its namespace, none of them being stored in the "categorylinks" table itself.

Note that because this is a secondary index (in a separate store), it is still good to include 'cl_from' there, even if it's not needed for the selectibity of this index, just to avoid an additional random-access cursor to the "categorylinks" table to be used by the SQL engine using table row ids also stored in this index (SQL engines prefer to use referencial cursors, if possible, instead of creating temporary table/index stores for cross joins between tables and/or indexes, except if both joined parts are not very selective ; as this is SQL-engine specific to their internal query optimizer, I won't discuss more about the tricky details ; it's just simpler to avoid the selectivy problem here by including the small "cl_from" field in this index which does not fill up much space in the encoded index row and still allows a good compaction of many rows in the index pages).

So why "cl_type" is there (in the index) if we still cannot perform paging separately on these types when viewing a category content ? Even if you do so later, it will require separate SELECT requests, with separate sorts, and separate paging every 200 items. In this case only, "cl_type" will be specified as a constant and will become very selective, and probably better viewed with separate indexes (so even "cl_type" will not be needed if you have separate tables by splitting the "categorylinks" table for each type).

If you really want to maintain the "cl_type" field, for me it's just an informational denormalization of something really represented by "cl_from". Beware of the effects of such denormalizations (I hope that a page id will never change its type, including when a page is renamed to another namespace; but I won't bet it; I think this may be the cause of future database inconsistencies). For me the rest of the datashema is not prepared to handle such separation (notably between cl_type="page" and cl_type="file").

And in fact if you really want to maintain it in the index, as a convenience to simplify the MediaWiki client code perorming the SQL queries, it should better be, for now, only the last field of this index:

CREATE INDEX /*i*/cl_sortkey ON /*_*/categorylinks (cl_to, cl_sortkey, cl_from, cl_type /* denormalized: see cl_from */);

This will still avoid the ORDER BY clause of your queries to create an expensive temporary index when viewing and paging the (sorted) contents of a category (notably for very populated categories, notably on Wiktionnaries that contain categories for large collections of words per language, with 200 000 entries or more).

Comment 220 Tim Starling 2011-07-20 12:14:54 UTC

(In reply to comment #218)
>             if ( $comparison < 0 ) {
>                 $min = $mid + 1;
>             } elseif ( $comparison > 0 ) {
>                 $max = $mid - 1;

The function is not a classical binary search, which only determines equality, instead it finds the lower bound of a range. When the test item sorts below the target ($comparison < 0), it can't be ruled out as a prospective lower bound. That's why you need $min = $mid, not $min = $mid + 1. 

Maybe $max = $mid - 1 is possible in the $comparison > 0 case, with your suggested alteration to the test for the before-the-start case. But the existing code is tested and does work, and the improvement in convergence rate would be small. 

> OK, I immediately found a bug in the binary search function (which is defintely
> not beinary, has slower than expected convergence, and may even stale on
> testing the same index, due to incorrect handling of min/max indice after
> comparing):

I assume by "stale" you mean an infinite loop. This does not occur because the midpoint is always different from either of the two endpoints until the difference $max - $min is reduced to 1 or 0, at which point the loop exits. 

In a textbook binary search, offsetting the midpoint by 1 allows the loop to continue until $min > $max. But as I said, we can't offset the midpoint in the $comparison < 0 case, so a slightly earlier loop termination is required to avoid an infinite loop.

> Note that the final test to return false is unnecessary and causes confusion on
> usage (if you had written this in C/C++ instead of PHP, you would not be able
> to make the distinction between false (no lower bound not found) and 0 (valid
> lower bound). Your code should not depend on those PHP hacks, which consists in
> returning values of distinct types (but with PHP's implicit promotion on basic
> types, it's really dangerous do do that) !
> 
> It's just simpler to let it return -1 (the content already in $max when we are
> before the first item), and then let the caller test the negative sign of the
> result, instead of comparing it then with false.

It's conventional for PHP functions to return false to indicate an error or other unusual situation, see for instance http://php.net/array_search . It's not at all dangerous. If I wrote the function in C, I would indeed have used -1, and I would have called it an ugly hack to work around C's lack of weak typing.

If you're aware of some actual bug in findLowerBound(), you should file a separate bug report, or note it on <http://www.mediawiki.org/wiki/Special:Code/MediaWiki/80443>

Comment 221 Domas Mituzas 2011-07-20 12:40:40 UTC

@Philippe, InnoDB treats first UNIQUE index as clustered PRIMARY key, which stores all data with it (I assume you don't come from InnoDB background, please correct me otherwise ;-)

cl_sortkey structure allows having dataset split by type (e.g. subcategories first, files last, whatever), and still maintain decent sorted order. cl_from is included for cases where we need to page over multiple entries with same sortkey.

do note, as secondary keys would include primary key values anyway, it is not much of a cost. 

Back in the day both major reads used to be covering ones, without random lookups, I'm not sure if there's a regression nowadays with all the new collation code though.

Comment 222 Philippe Verdy 2011-07-20 13:18:31 UTC

You should know that a binary search for equality of for lower bound is completely the same algorithm. (If not convined, look at the source of a C++ standard library, which uses the same function for both: finding exact matches or finding a lower bound for inserting a new item).

The difference only occurs at end of the loop, about what you return when an exact equality is not found. In both cases, the binary search loop will terminate with min=max+1 exactly if no exact match is found, so that max is the lower bound you want.

My suggestion fixes a few things:
- it uses a while()... instead of a do...while() which is incorrect for the case where the search space opperates with count<2.
- it has faster convergence
- it contains no special case to handle (your code will be erroneous in those cases).
- even when searching with count=0, it starts with min=0 and max=-1, which terminates the loop immediately, with the same condition for not found (max < min, and max if the lower bound)
- it works with count=1 without creating an infinite loop like in your code when the target is higher that the only element [0] to check, because it ensures that the loop will ALWAYS reduce the search space at each interation.

Comment 223 Philippe Verdy 2011-07-20 13:35:21 UTC

> The function is not a classical binary search, which only determines equality,
instead it finds the lower bound of a range. When the test item sorts below the
target ($comparison < 0), it can't be ruled out as a prospective lower bound.

And here you're completely wrong : in that case the comparison says that the mid idem is certainly not the lower bound. Don't forget that the item at max (and below) will also be tested: max will decrease as well, up to the point that it will either find an exact match, or will fall 1 position below min.

> That's why you need $min = $mid, not $min = $mid + 1. 

That's why you don't need that !

Comment 224 Brion Vibber 2011-08-24 20:25:53 UTC

I'm reopening this based on notes from bug 29788 etc.

* The fixes applied for this bug apply only for category member sorting, but the problem exists in many other places as mentioned in the summary, such as [[Special:Allpages]], [[Special:ListUsers]], etc

* There doesn't appear to be any straightforward way to set the collation to a locale-specific one!

The only available options are "uppercase" and "uca-default" -- sorting in a locale-specific way (say, Swedish per bug 29788) appears to require writing additional code

* For some reason, site content language doesn't appear to be taken into account in any way; you must set $wgCategoryCollation manually.

This seems very strange to me, and is an obvious additional burden on site administrators to figure out how to code up (!) and then activate a language-specific collation.

Comment 225 Siebrand Mazeland 2011-08-24 20:55:21 UTC

(In reply to comment #224)
> I'm reopening this based on notes from bug 29788 etc.
> 
> * The fixes applied for this bug apply only for category member sorting, but
> the problem exists in many other places

Don't forget sortable table sorting.

I'm not sure if this bug with 224 (225 now) comments is the best bug to keep tracking this. How about creating a tracking bug, and slicing this issue into smaller, more manageable issues?

Comment 226 Bawolff (Brian Wolff) 2011-08-24 22:40:39 UTC

(In reply to comment #224)
> I'm reopening this based on notes from bug 29788 etc.
> 
> * The fixes applied for this bug apply only for category member sorting, but
> the problem exists in many other places as mentioned in the summary, such as
> [[Special:Allpages]], [[Special:ListUsers]], etc

Related bug 24574. I suppose we would have to add a sortkey field to other various table.

> * There doesn't appear to be any straightforward way to set the collation to a
> locale-specific one!
> 
> The only available options are "uppercase" and "uca-default" -- sorting in a
> locale-specific way (say, Swedish per bug 29788) appears to require writing
> additional code
> 
> * For some reason, site content language doesn't appear to be taken into
> account in any way; you must set $wgCategoryCollation manually.

I think that was partially out of concern for php installs without the intl extension installed (So uca collation would not work). Since changing the collation requires running a maintenance script, if we just checked that the relevant class existed to determine which collation to use, and then someone upgraded their php so intl was suddenly present, things would become screwed up. (Or that's what I assumed the reasoning was. I could be wrong)

> 
> This seems very strange to me, and is an obvious additional burden on site
> administrators to figure out how to code up (!) and then activate a
> language-specific collation.

Perhaps making the value $wgCategoryCollation consider anything past uca to be the locale. For example it would consider the value of "uca-foo" to use IcuCollation( 'foo' ) as the Collation class.

Comment 227 Purodha Blissenbach 2011-08-25 16:02:59 UTC

> (In reply to comment #224)
> 
> Don't forget sortable table sorting.

Sortable tables are sorted client-side. This might offer additional
flexibility. If clients were enabled to choose collations for sortable
tables at the time of accutal sorting, we could offer them options
above those available in server side database sorts.

Comment 228 Philippe Verdy 2011-08-26 21:49:23 UTC

I am currently working on a Javascript implementation of UCA (i.e. with full support of the DUCET, at least 4 collation levels, and support for language-based tailorings). This is a very difficult task if we need performance. The trick is to use efficient lookup tables, but the difficulty is being able to update them (for tailorings); also performance will highly depend on how the DUCET is represented in Javascript, so that it can be initialized sufficiently fast that you won't notice this init time, when a collator object gets instanciated.

I'm still not sure that performance will be good enough (even with today's much faster Javascript implementations), and may be some optional cooperation with a server-side script will be needed (or possibly by using a client-side native plugin interfaced with Javascript, such as Flash) could limit the number of JSON requests to a server in order to provide the tailored lookup tables. How to detect ans use client-side plugins in an open way.

I really hope that someday, the Javascript/ECMAScript standard will specify a common standard interface to collators. I've already looked for help from authors of jQuery, where such common interface would become part of its "tools" library (even if this has nothing in common with working in the HTML DOM, that is the main focus of jQuery).

For now there exists absolutely no reliable implementation of UCA in Javascript working only on the client-side (but there does exist already some Javascript libraries that in fact use JSON or XML requests to a server, in order to build or retreive the necessary lookup tables).

For practical reasons, a server-side JSON/XML request could be optimized to reduce the volume of data exchanged (for example by providing only some parts of the collation elements, these data being then requested on demand even after the collator has already been -partly- initialized, and then cached in the collator object). But this is a complication in the design that for now I don't want to investigate too early, before I get a good image of the performances effectively needed in practice.

May be it will be just enough to initialize only a part of the DUCET for a particular language and its specific tailoring, sorting all the other many characters with default weights (not necessarily from the large DUCET).

My implementation will provide at least 3 interfaced functions:
- instantiating a collator for a given language & sort order
- computing collation keys from strings, for more efficient sorts of large tables (even if we use a QuickSort, the same strings are compared multiple times with other strings, so it may help to compute their collation keys only once); in addition, it may not be necessary to compute the full collation key, but only keys up to the level that's sufficient to perform a specific compare; collation keys can then be computed partly, on demand, and cached, instead of being fully computed for the full string at all collation levels; in addition, not all table sorts may require the maximum collation level supported, so collation keys don't necessarily have to be long (in memory);
- the tradeoff for precomputing keys instead of compring keys on the fly, is highly dependant with the table sizes: for small number of rows to be sorted, you gain little with precomputed keys, you just waste memory and get lower performance. So the interface will also include a 3rd function: comparing strings using this collator, without necessarily having to compute the collation keys.

I estimate that the tradeoff limit between precomputed collation keys and direct compares is for tables of about 100 rows, for which it may be helpful to provide a very fast response time when sorting them, but I may be wrong, because these small tables will not require precomputing and storing a lot of collation keys (so this first step before the QuickSort loop may be insignificant). The real limit could be memory use in very large tables for precomputing, storing or caching the collation keys; but such very large tables will probably be very uncommon in tables generated from Wiki pages (most of these tables would not include more than 500 rows, and storing 500 collation keys is not really a problem in today's browsers, except possibly in smartphones with limited memory resources and slow processors, compared to laptops, notebooks and tablets).

Comment 229 Andrew Dunbar 2011-08-27 10:03:28 UTC

@Philippe Verdy: I think work on the bigger Unicode problems for JavaScript is pretty important. I tried to file a bug or feature request somewhere a couple of years ago and the sentiment from the js community was that Unicode was too big and  unneeded, would bloat js and be too slow. I couldn't convince whoever was my audience that you just need it and in any case the js implememtations should be hooking into Unicode APIs provided by their host OS or browser which must have them anyway.

I have implemented some character/script property stuff from Unidata in js and for my purposes found it very fast indeed to do everything with binary search. And all the js implementations are much faster now.

I was doing stuff like looking up the script of every character in an HTML page.

I've also implemented internationalized sorting including UCA of page titles in en.wiktionary using the toolserver as a back end. All work was done on a first generation netbook (-: You have my support. Is there somewhere I can watch your project perhaps?

Comment 230 Philippe Verdy 2011-08-27 11:06:18 UTC

Until recently, it was too slow to initialize a large table like the DUCET. But now that Javascript starts being used for 2D/3D rendering with interactive speed, demonstrating complex animations, I don't think this is a performance issue. The only issue is the code size that must be loaded once (that's why I spoke about caching the DUCET and loading it by parts, on demand). But note that the DUCET is not as large as the full Unicode. You can easily exclude the Hangul syllables, and the sinograms, and can autobuild some parts of it for blocks of symbols. The core of the implementation is already written for the DUCET-based collation, I need more work to instantiate tailorings. But I've not determined the best way to handle on-demand loading of the DUCET. In addition, if I can finalize it, I will need a way to regenerate with some automatic tool the representation of the DUCET for each release of Unicode. For now I just use a static table, created with lots of manual operations: a code generator will be needed (just like in all existing full implementations of UCA).

Comment 231 Bawolff (Brian Wolff) 2011-08-28 23:35:55 UTC

Re JS sortable table stuff (this bug is about too many things...): Well client side js sorting stuff would be nice, since these tables aren't dynamic, wouldn't it be easier to: on the server side add a sortkey as a data attribute for each cell of the table, and then just sort that on client side. (Only issue with this is it might be complicated to implement in the parser [table code in parser is scary imho, but I'm sure some other people find it less scary], and lang preferences, but I think that just using the content language [or the page's language I suppose now that we have that] would be sufficient).

Thoughts?

Comment 232 Philippe Verdy 2011-08-29 01:29:45 UTC

If table contents are static, it's true that their collation weights could be delivered as well with the visible table.

In a past message above, I had already proposed that the function that already is needed to sort categories, and that computes the UCA collation keys, would be exposed as a ParserFunction. This would then obviously allow any wiki page to compose web tables containing both the visible data and the associated collation key.

Many very complex templates, that are used to generate sortable pseudo-collation keys would no longer be needed or would be largely simplified. In fact, I even proposed initially to first expose this function before even modifying the schema of categories : there was no immediate mergency to modify this schema, as long as we could experiment assigning sort keys when categorizing a page, using that ParserFunction.

And probably we would have been able to experiment with a better schema, including one that would have allowed multiple collation keys (generated from different locale parameters), something that has been forgotten completely (and that will require a new change in the database schema in the future, to support multiple collations according to users localisation preferences (and expectations).

Anyway, my work on JS implementation of UCA is already well advanced, and in fact more advanced now on some points that MySQL still does not support correctly, notably contractions and expansions.

And very soon I'll add support for contexts (see UTR#35 to see what I mean with those terms), because it's a trivial use of contractions and expansions, combined

But it can be implemented also as a separate function, to generate shorter keys, I'm stil not very concerned by choosing the most compact format for representing collation keys, but I have already worked a lot in how to compact automatically the sets of collation weights for each level, suppressing all unnecessary gaps that are shown in the standard DUCET table, and only provided for convenience with very basic implementation of UCA not using tailorings, or only using very basic tailorings and no contraction/expansions).

For now, the most critical algorithm (that takes most time when computing collation keys, or when directly comparing two strings) is the computing the NFD: I can save time when comparing pair of strings (for example processing only the begining of strings ontil a collation difference is found, and this does not always require strings being fully converted to NFD), but not when computing any collation key (unless I also add a constraint limiting the length of the generated collation keys, this constraint allowing to stop early). I have looked at severla implementation of Unicode normalizers in Javascript, I'm not satisfied with them as they are clearly not optimal.

In fact both the NFD transform and the generation of collation weights for each level are linked: if we sort only on primary collation level, we loose too much time in computing the NFD transform, that provides too many unnecessary details that will be finally ignored: this means that I'm also trying to perform a NFD transform, that removes these details. Such transform is effectively what the Unicode standard calls a "collation mapping" (something more powerful than just a "case mapping", or even a "case folding").

Such "collation mapping" would be extremely useful for implementing the "first letter" classification in categories, or even to provide thiner classifications in very populated categories (for example allowing two letters). This needs is in fact exactly equivalent to searching for the smallest string that has the smallest collation keys containing only two collation weights per collation level, and with a collation set to only one level, and this also can be largely optimized so that the NFD transform will remove all collation-ignorable details that would never be needed to compute the collation key.

All this is an interesting subject of research (and even ICU does not provide such a general mechanism...).

I will be also innovative in how to provide a very compact representation of tailorings. I also have ideas on how to represent, store, query and cache the precomputed lookup tables for tailorings. And on a standard interface that will allow plugable implementations in native code (if possible and needed), or for use with other languages, but also with other non-UCA based collations (including dummy collations, such as binary sorting, or sorting by standard case folding, or by standard case mappings), or complex collations requiring a lot more mapping data than the DUCET and a set of collation orders and contractions/expansions (for example for sorting Han ideographs on radical/stroke, or for sorting based on romanisations and other translitterations, that are all language-dependant; but for which I have not developed anything for now about translitterators, not even for the standard romanizations of Chinese and Japanese, that require lookup in a large dictionary, such as the huge CJK properties file found in the Unicode database as it cannot be performed algorithmically like standard romanizations of Russian or Arabic).

If you think about it more closely, ALL the existing algorithms that transform any Unicode text into another can be thought as "collation mappings", based on a language-dependant definition of a multilevel collation order. Collation is the central concept, and the distinctions between algorithms are not different from distinctions between collation tailorings (a tailoring may be language-neutral, or be language-dependant).

I will expose my solutions later, first on the Unicode mailing list, on the ICU mailing list, on the jQuery maiing list, expecting that there will be interesting contributions (or useful checks and corrections), before it can become a generic library that can be reused in other projects like MediaWiki. I'll use a licence that will be compatible both with free licences (GPL-like) and open-source licences (OSI), as well as with commercial projects. It will probably be BSD-like.

Comment 233 Mark A. Hershberger 2011-09-01 03:11:07 UTC

This bug has been split into 4 separate bus to track the different issues raised:

    bug 30672 -- improve sorting on other pages than category pages (tracking)
    bug 30673 -- Support locale-specific, or tailored, sorting (tracking)
    bug 30674 -- Support better client-side sorting (tracking)
    bug 30675 -- Use allkeys_CLDR.txt - the CLDR tailored DUCET instead of allkeys.txt

Comment 234 Philippe Verdy 2011-09-03 16:52:30 UTC

Don't close this bug as long as it blocks other bugs that are not closed:

   bug 3969 -- UTF-8 compatibility (tracking)
   bug 6948 -- natural number sorting
   bug 29788 -- sort Swedish letters correctly (at end of alphabet)

These bugs effectively depend on UCA (or UCA improvements)

My opinion is that bug 3969 should be detached, but the other two should be reattached to one of the 4 new bugs, on which they depend (most probably bug 30673).

And the new bugs should still be linked to this bug 164, transformed only in a tracking number (that should not be closed as long as the new 4 bugs are open)

Comment 235 Phillip Patriakeas 2011-09-03 17:10:54 UTC

(In reply to comment #234)
> Don't close this bug as long as it blocks other bugs that are not closed

I thought it worked the other way around; bugs that are blocked by open bugs can't/shouldn't be closed...?

Comment 236 Siebrand Mazeland 2011-09-03 17:22:30 UTC

RESOLVED / FIXED per hexmode. Please leave this closed. We discussed this at length about this during the triage session.

Comment 237 Huji 2011-09-18 17:40:08 UTC

*** Bug 30287 has been marked as a duplicate of this bug. ***

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links

anon.hui
arthit
ayg
bawolff+wn
benno79
brion
bugzilla.wikimedia
bugzilla
catlow
chriskwonlewis
circeus
crangasi2001
dbl2010
dionys
domas.mituzas
dragonlordofxanther
Drilnoth
Eugene.Zelenko
f-ar
federicoleva
ForoaW
gangleri
greg
gtisza
hanberke7
he7d3r+bugs
hippytrail
huji.huji
hydrogen_cyanide
juaned
jutiphan
lars
lejonel
liangent
mah
mail
mashiah.davidson
michael
misoss
monkbel
mormegil
niklas.laxstrom
pinar.bugzilla
raylton.sousa
reza.energy
roan.kattouw
roozbeh
s.moaddeli
siebrand
some_777
spacebirdy
spm
studerby
thecentercannothold
timwi
to.aru.shiroi.neko
tomchiukc
tstarling
v85.wikipedia
verdy_p
webmaster