Last modified: 2008-05-19 20:22:34 UTC
Since written Korean (hangul) is syllabic, the pages listed on a category page are sectioned by their initial syllables rather than by letters or phonemes. As a result, almost every page ends up in a section of its own. Look at the URL, which is equivalent to Category:People in the English Wikipedia. In the Korean category page, many pages have their own sections, such as Category:Austrian_people, which falls in the "Au" section, Category:Polish_people, which falls in the "Pol" section, etc. (They can be recategorized to Category:People_by_nationality of course, but that's not the point of the discussion.)

Every hangul syllable can be decomposed into consonants and vowels, so a better index scheme for category pages would be to section by the initial consonant of each page's first syllable:

* articles starting with 가(U+AC00) to 낗(U+B097) under the section titled ㄱ(U+1100),
* from 나(U+B098) to 닣(U+B2E3) under ㄴ(U+1102),
* from 다(U+B2E4) to 띻(U+B77B) under ㄷ(U+1103),
* from 라(U+B77C) to 맇(U+B9C7) under ㄹ(U+1105),
* from 마(U+B9C8) to 밓(U+BC13) under ㅁ(U+1106),
* from 바(U+BC14) to 삫(U+C0AB) under ㅂ(U+1107),
* from 사(U+C0AC) to 앃(U+C543) under ㅅ(U+1109),
* from 아(U+C544) to 잏(U+C78F) under ㅇ(U+110B),
* from 자(U+C790) to 찧(U+CC27) under ㅈ(U+110C),
* from 차(U+CC28) to 칳(U+CE73) under ㅊ(U+110E),
* from 카(U+CE74) to 킿(U+D0BF) under ㅋ(U+110F),
* from 타(U+D0C0) to 팋(U+D30B) under ㅌ(U+1110),
* from 파(U+D30C) to 핗(U+D557) under ㅍ(U+1111),
* and from 하(U+D558) to 힣(U+D7A3) under ㅎ(U+1112).
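The ranges above follow from the layout of the Unicode Hangul Syllables block, where each of the 19 initial consonants covers 21 x 28 = 588 consecutive syllables. The ranges for ㄱ, ㄷ, ㅂ, ㅅ and ㅈ are twice as wide because they also take in the doubled consonant that follows each of them. A minimal sketch of that arithmetic, written in Python for illustration rather than in MediaWiki's PHP; the function name `section_consonant` and the constants are hypothetical, not taken from any patch:

```python
HANGUL_FIRST = 0xAC00            # first precomposed syllable
HANGUL_LAST = 0xD7A3             # last precomposed syllable
SYLLABLES_PER_INITIAL = 21 * 28  # 21 vowels x 28 finals = 588

# Index of each doubled initial consonant -> index of its plain counterpart,
# so that both end up in the same section, as in the proposed ranges.
DOUBLED_TO_PLAIN = {1: 0, 4: 3, 8: 7, 10: 9, 13: 12}


def section_consonant(title):
    """Return the section heading for a page title (hypothetical helper)."""
    if not title:
        return ""
    code_point = ord(title[0])
    if not (HANGUL_FIRST <= code_point <= HANGUL_LAST):
        # Anything outside the Hangul Syllables block keeps its first character.
        return title[0]
    index = (code_point - HANGUL_FIRST) // SYLLABLES_PER_INITIAL
    index = DOUBLED_TO_PLAIN.get(index, index)
    # Initial consonants start at U+1100 and are laid out in the same order.
    return chr(0x1100 + index)
```

With this mapping, U+AC00 and U+B097 both land under U+1100, and U+B098 is the first syllable to land under U+1102, matching the first two list items.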
A duplicate of bug 1984. *** This bug has been marked as a duplicate of 1984 ***
Created attachment 609 Patch for LanguageUtf8.php
The changes in LanguageKo.php work fine on the Korean Wikipedia, but multilingual projects like Meta-Wiki and Wikisource need to be updated too. I attached a patch file that only modifies firstChar() to treat the Hangul Syllables area (U+AC00 ~ U+D7A3) specially; for any other character it behaves as before. I'm not sure which file is the appropriate one to patch, though: Language.php or LanguageUtf8.php. Take this for a test - http://wikisource.org/wiki/Category:%ED%95%9C%EA%B5%AD%EC%96%B4 - which should have no more than 10 sections after commit.
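The behaviour described for the patched firstChar() could look roughly like the following. This is an illustrative Python sketch, not the attached patch: `first_char`, `group_by_section` and the `SECTIONS` table are hypothetical names, and the range boundaries are the ones listed in the original report.

```python
from bisect import bisect_right
from collections import defaultdict

# (first syllable of the range, section heading), per the ranges in the report.
SECTIONS = [
    (0xAC00, 0x1100), (0xB098, 0x1102), (0xB2E4, 0x1103), (0xB77C, 0x1105),
    (0xB9C8, 0x1106), (0xBC14, 0x1107), (0xC0AC, 0x1109), (0xC544, 0x110B),
    (0xC790, 0x110C), (0xCC28, 0x110E), (0xCE74, 0x110F), (0xD0C0, 0x1110),
    (0xD30C, 0x1111), (0xD558, 0x1112),
]
RANGE_STARTS = [start for start, _ in SECTIONS]


def first_char(title):
    """Return the index character for a title (assumed behaviour, not MediaWiki code)."""
    if not title:
        return ""
    code_point = ord(title[0])
    if 0xAC00 <= code_point <= 0xD7A3:
        # Hangul Syllables area: find the range this syllable falls into.
        position = bisect_right(RANGE_STARTS, code_point) - 1
        return chr(SECTIONS[position][1])
    # Every other character is handled as before.
    return title[0]


def group_by_section(titles):
    """Group page titles under their index character."""
    sections = defaultdict(list)
    for title in titles:
        sections[first_char(title)].append(title)
    return dict(sections)
```

Because all Hangul titles collapse into at most 14 headings, a category that previously had one section per distinct initial syllable ends up with far fewer sections, while titles in other scripts are grouped exactly as they were.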
It's now OK for Korean Wikisource ( http://ko.wikisource.org/wiki/%EB%B6%84%EB%A5%98:%EC%8B%9C%EC%A1%B0 ) but multilingual wikis like Meta-Wiki still have this issue ( http://meta.wikimedia.org/wiki/Category:KO ). My point is that this feature should be applied universally, on any wiki, to page names with Korean characters.
I second this. The ko firstChar() should apply to wikis in all languages, especially multilingual wikis, not just the ko wiki.
Another vote for support here.
Done in r35055. Also did a tiny bit of cleanup to use utf8ToCodepoint() func instead of the manual UTF-8 decomp code. (Could just use raw characters here instead of the hex positions, should one desire, but this isn't a performance-critical code path.)