Last modified: 2008-05-19 20:22:34 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T3701, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 1701 - special firstChar() routine for Korean characters
Product: MediaWiki
Classification: Unclassified
Component: Categories
Hardware: All All
Importance: High enhancement with 2 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: patch, patch-need-review
Depends on:
Blocks: 3950
Reported: 2005-03-16 05:02 UTC by Puzzlet Chung
Modified: 2008-05-19 20:22 UTC (History)

Patch for LanguageUtf8.php (1.88 KB, patch)
2005-06-14 04:46 UTC, Puzzlet Chung

Description Puzzlet Chung 2005-03-16 05:02:35 UTC
Since written Korean (hangul) is syllabic, pages on a category page are
sectioned by their initial syllables rather than by letters or phonemes.
As a result, almost every page ends up in its own section. Look at the URL,
which is the equivalent of Category:People in the English Wikipedia. In the
Korean category page, many pages have their own sections, such as
Category:Austrian_people, which falls in the "Au" section, and
Category:Polish_people, which falls in the "Pol" section. (They could be
recategorized under Category:People_by_nationality, of course, but that's not
the point of this discussion.)

Every hangul syllable can be divided into consonants and vowels, and a better
indexing scheme for category pages would be to section pages by the initial
consonant of their first syllable:
* articles starting with anything from 가(U+AC00) to 낗(U+B097) under the
section titled ㄱ(U+1100),
* from 나(U+B098) to 닣(U+B2E3) under ㄴ(U+1102),
* from 다(U+B2E4) to 띻(U+B77B) under ㄷ(U+1103),
* from 라(U+B77C) to 맇(U+B9C7) under ㄹ(U+1105),
* from 마(U+B9C8) to 밓(U+BC13) under ㅁ(U+1106),
* from 바(U+BC14) to 삫(U+C0AB) under ㅂ(U+1107),
* from 사(U+C0AC) to 앃(U+C543) under ㅅ(U+1109),
* from 아(U+C544) to 잏(U+C78F) under ㅇ(U+110B),
* from 자(U+C790) to 찧(U+CC27) under ㅈ(U+110C),
* from 차(U+CC28) to 칳(U+CE73) under ㅊ(U+110E),
* from 카(U+CE74) to 킿(U+D0BF) under ㅋ(U+110F),
* from 타(U+D0C0) to 팋(U+D30B) under ㅌ(U+1110),
* from 파(U+D30C) to 핗(U+D557) under ㅍ(U+1111),
* and from 하(U+D558) to 힣(U+D7A3) under ㅎ(U+1112).
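
The ranges above are consistent with the layout of the Unicode Hangul Syllables block, where each initial consonant covers 588 consecutive code points (21 vowels x 28 finals) and the list folds the doubled consonants (ㄲ, ㄸ, ㅃ, ㅆ, ㅉ) into their plain counterparts. A minimal Python sketch of that arithmetic follows; it is only an illustration, not the attached patch, and the names `SECTION_FOR_INITIAL` and `section_codepoint` are made up for this example.

```python
SYLLABLE_FIRST = 0xAC00   # 가, first code point of the Hangul Syllables block
SYLLABLE_LAST = 0xD7A3    # 힣, last code point of the block
PER_INITIAL = 21 * 28     # 588 syllables per initial consonant

# Section heading (as a jamo code point) for each of the 19 initial
# consonants, in Unicode order. Doubled consonants share the plain heading.
SECTION_FOR_INITIAL = [
    0x1100, 0x1100,  # ㄱ, ㄲ
    0x1102,          # ㄴ
    0x1103, 0x1103,  # ㄷ, ㄸ
    0x1105,          # ㄹ
    0x1106,          # ㅁ
    0x1107, 0x1107,  # ㅂ, ㅃ
    0x1109, 0x1109,  # ㅅ, ㅆ
    0x110B,          # ㅇ
    0x110C, 0x110C,  # ㅈ, ㅉ
    0x110E,          # ㅊ
    0x110F,          # ㅋ
    0x1110,          # ㅌ
    0x1111,          # ㅍ
    0x1112,          # ㅎ
]


def section_codepoint(syllable: str) -> int:
    """Return the code point of the section heading for one hangul syllable."""
    cp = ord(syllable)
    if not SYLLABLE_FIRST <= cp <= SYLLABLE_LAST:
        raise ValueError("not a precomposed hangul syllable")
    return SECTION_FOR_INITIAL[(cp - SYLLABLE_FIRST) // PER_INITIAL]
```

For example, `section_codepoint("낗")` and `section_codepoint("가")` both give 0x1100, while `section_codepoint("나")` gives 0x1102, matching the first two bullets.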
Comment 1 Ævar Arnfjörð Bjarmason 2005-04-27 04:04:05 UTC
A duplicate of bug 1984.

*** This bug has been marked as a duplicate of 1984 ***
Comment 2 Puzzlet Chung 2005-06-14 04:46:48 UTC
Created attachment 609
Patch for LanguageUtf8.php
Comment 3 Puzzlet Chung 2005-06-14 04:59:57 UTC
Changes in LanguageKo.php work fine in Korean Wikipedia, but multilingual
projects like Meta-wiki and Wikisource need to be updated too.  I attached the
patch file, which only modifies firstChar() to treat the Hangul Syllables area
(U+AC00 to U+D7A3) specially; for any other character it behaves as before.
But I'm not sure which file is the appropriate one to patch:
Language.php or LanguageUtf8.php.  Take this for a test - - which should
have no more than 10 sections after the commit.
Comment 4 Puzzlet Chung 2005-11-13 07:34:26 UTC
It's now OK for Korean Wikisource, but multilingual wikis like Meta-wiki
still have this issue.

My point is that this feature should be applied universally, wherever page
names with Korean characters are involved.
Comment 5 Anon Sricharoenchai 2008-04-28 09:43:25 UTC
I second this. The ko firstChar() should apply to all wiki languages, especially on multilingual wikis, not just on the ko wiki.
Comment 6 Kyungjoon Lee 2008-05-01 09:42:45 UTC
Another vote for support here.
Comment 7 Brion Vibber 2008-05-19 20:22:34 UTC
Done in r35055. Also did a tiny bit of cleanup to use utf8ToCodepoint() func instead of the manual UTF-8 decomp code.

(Could just use raw characters here instead of the hex positions, should one desire, but this isn't a performance-critical code path.)
