Last modified: 2014-11-17 09:55:43 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T22547, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 20547 - Update non-standard language codes in the projects
Status: NEW
Product: Wikimedia
Classification: Unclassified
Component: Language setup
Version: unspecified
Hardware: All All
Importance: Low enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2009-09-08 13:53 UTC by Gerard Meijssen
Modified: 2014-11-17 09:55 UTC (History)


Description Gerard Meijssen 2009-09-08 13:53:48 UTC
Hoi, Aryeh Gregor asked me to make these changes: http://meta.wikimedia.org/w/index.php?title=Www.wikipedia.org_template&diff=1632626&oldid=1630733 They fix errors indicated by a validator: http://validator.nu/?doc=http%3A%2F%2Fwww.wikipedia.org&profile=permissive

We can make our content comply with the standards when the language code is changed on the projects as well. I know these changes to be correct.

Please make these changes; they demonstrate that we are good Internet citizens. :)
Thanks,
    GerardM
Comment 1 This, that and the other (TTO) 2014-02-23 06:32:09 UTC
What exactly is meant by this bug? What code should be changed?
Comment 2 Andre Klapper 2014-02-24 11:09:27 UTC
The request is to change "zh-hak" to "hak" and others, but the list of "some language codes" is missing here so it's very unclear when this report would be "fixed". Gerard, could you clarify which language codes are affected?

http://www-01.sil.org/iso639-3/codes.asp
Comment 3 This, that and the other (TTO) 2014-02-25 08:42:23 UTC
> The request is to change "zh-hak" to "hak" and others

Assuming this refers to the www.wikipedia.org portal (which is my best guess), I should point out that this page is editable by Meta admins - so this is not a matter for Bugzilla.
Comment 4 Philippe Verdy 2014-02-25 23:52:39 UTC
Some of your "corrections" are not real corrections.

- The validator just complains about new HTML5 attributes (like srcset on images) or elements (like bdi) which do not cause any problem. They are not really errors.
- You corrected codes that are perfectly valid (note that it is NOT ISO 639-3 which is used in HTML, but BCP 47; many valid BCP 47 codes do not exist in ISO 639-3, and many codes valid in ISO 639-3 are invalid in BCP 47!)

Do not mix up the (unstable) ISO 639 **language** codes with the standard BCP 47 language tags, which have always been normative in HTML (including HTML4) and have been stable for decades!

Note that BCP 47 uses *some* codes from ISO 639-1 (not all), *some* codes from ISO 639-2 (not all), and only then *some* codes from ISO 639-3. It also appends *some* codes from ISO 3166-1, *some* codes from UN M.49, *some* codes from ISO 15924, and *some* codes whose origin is the BCP 47 standard track itself.

The reference database for BCP 47 is *not* held by any ISO MA; it is the IANA database for language subtags. BCP 47 documents which ISO codes may be imported into the IANA database as subtags, and how supplementary extension subtags may be registered (for language variants, or for locales, such as the Unicode locale extension subtags).
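
To make the layering described above concrete, here is a minimal Python sketch that splits a tag into its subtag positions by shape alone. It is an illustration under stated assumptions, not a validator: the pattern is a simplified reading of the BCP 47 grammar (one extlang at most, no irregular grandfathered tags), it does not consult the IANA registry, and the names `LANGTAG` and `split_tag` are made up for this example.

```python
import re

# Simplified BCP 47 "langtag" shape. Assumptions of this sketch: at most one
# extlang subtag, and irregular grandfathered tags are not handled.
LANGTAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"             # language, imported from ISO 639
    r"(?:-(?P<extlang>[A-Za-z]{3}))?"           # extended language, e.g. "hak" in "zh-hak"
    r"(?:-(?P<script>[A-Za-z]{4}))?"            # script, imported from ISO 15924
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"   # region, from ISO 3166-1 or UN M.49
    r"(?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)"
    r"(?P<extensions>(?:-[0-9A-WY-Za-wy-z](?:-[A-Za-z0-9]{2,8})+)*)"
    r"(?:-[xX](?P<private>(?:-[A-Za-z0-9]{1,8})+))?$"
)


def split_tag(tag):
    """Return the subtags of a well-formed tag as a dict, or None if it does not fit."""
    m = LANGTAG.match(tag)
    if m is None:
        return None
    return {
        "language": m["language"],
        "extlang": m["extlang"],
        "script": m["script"],
        "region": m["region"],
        "variants": [s for s in m["variants"].split("-") if s],
        "extensions": m["extensions"].lstrip("-") or None,
        "private": [s for s in (m["private"] or "").split("-") if s],
    }


for example in ("pa-Guru", "zh-hak", "roa-x-nrm", "be-tarask"):
    print(example, split_tag(example))
```

A tag that fits this shape is only well-formed; whether each subtag is actually registered is a separate question that only the IANA database can answer.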
Comment 5 This, that and the other (TTO) 2014-02-26 00:58:22 UTC
As I said in comment 2, if there are problems with the www.wikipedia.org portal page, please take the matter to [[m:Talk:www.wikipedia.org template]]. If you are concerned about the language codes somewhere else, please tell us exactly what you are referring to!
Comment 6 Philippe Verdy 2014-02-26 01:51:23 UTC
Yes, but your comment 2 restricts this to ISO 639-3 only, which is plain wrong!
Comment 7 Philippe Verdy 2014-02-26 01:53:51 UTC
And no, your comment 2 (or any other one) did NOT point to the talk page you suggest now.
Comment 8 Philippe Verdy 2014-02-26 02:01:25 UTC
For example, the change from "zh-hak" to "hak" is NOT required for conformance to the HTML standard; "zh-hak" remains fully conforming to BCP 47, even though it now has a "preferred" value and is deprecated (but not obsolete).

The real language tags that are violating BCP 47 are for example:
* "nrm" (it also violates ISO 639-3)
* "roa-tara" (it also violates ISO 15924)
* "simple"

The language tag "pa-Guru" you "corrected" by replacing it by "pa" was perfectly correct; now it is more ambiguous (and breaks some renderers unable to choose the appropriate font to use for this language written in multiple scripts).
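
The difference between these tags can be sketched as a lookup against registered subtags. The Python below is only an illustration: `LANGUAGES` and `SCRIPTS` are tiny hand-copied excerpts standing in for the full IANA database, and `problems` is a hypothetical helper that checks just the primary language subtag and any four-letter script subtag.

```python
# Hand-copied excerpts for illustration; a real check would load the whole
# IANA language subtag database instead of these few entries.
LANGUAGES = {
    "en": "English",
    "pa": "Panjabi",
    "roa": "Romance languages",
    "nrm": "Narom",
}
SCRIPTS = {"Guru": "Gurmukhi", "Latn": "Latin"}


def problems(tag):
    """List the subtags of `tag` that are missing from the excerpts above."""
    parts = tag.split("-")
    issues = []
    language = parts[0].lower()
    if language not in LANGUAGES:
        issues.append(f"language subtag {language!r} is not registered")
    for sub in parts[1:]:
        if sub.lower() == "x":
            break  # everything after "x" is private use and is not looked up
        if len(sub) == 4 and sub.isalpha() and sub.title() not in SCRIPTS:
            issues.append(f"script subtag {sub.title()!r} is not registered")
    return issues


for example in ("pa-Guru", "simple", "roa-tara", "nrm"):
    print(example, problems(example))
```

Note that "nrm" passes such a lookup because the subtag is registered, but for the Narom language rather than Norman, so a registry check alone cannot catch a tag that is valid yet names the wrong language.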
Comment 9 This, that and the other (TTO) 2014-02-26 11:21:26 UTC
(In reply to Philippe Verdy from comment #7)
> And no, your comment 2 (or any other one) did NOT point to the talk page you
> suggest now.

My apologies, I meant to point to comment 3. Sorry for the incorrect reference.

So I think I now understand the scope of this bug: you are stating that incorrect HTML lang attributes are being generated on the projects with language codes "nrm", "roa-tara", and "simple".

> The real language tags that are violating BCP 47 are for example:
> * "nrm" (it also violates ISO 639-3)

"nrm" refers to Narom language. However, IANA have not provided a language code for Norman, so I don't know what we're meant to do here. I notice that www.wikipedia.org uses the made-up code "roa-x-nrm" for this language.

> * "roa-tara" (it also violates ISO 15924)

[[roa-tara:]] has the nonsensical lang attribute value "roa-Tara", as if Tara is a script. Again, the Tarantino dialect lacks a unique code and will probably never get one. The www.wikipedia.org portal just uses "roa" for this language.
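
One plausible way a value like "roa-Tara" can arise is from applying the BCP 47 casing conventions purely by subtag length and position. The Python sketch below shows that convention; it is an assumption offered as illustration, not code taken from MediaWiki, and `conventional_case` is a made-up name.

```python
def conventional_case(tag):
    """Apply the BCP 47 casing conventions by subtag length and position only."""
    subtags = tag.lower().split("-")
    out = [subtags[0]]  # the first subtag is always lowercase
    for prev, sub in zip(subtags, subtags[1:]):
        if len(prev) == 1:
            out.append(sub)               # after a singleton such as "x": lowercase
        elif len(sub) == 2:
            out.append(sub.upper())       # two letters: treated as a region
        elif len(sub) == 4:
            out.append(sub.capitalize())  # four letters: treated as a script
        else:
            out.append(sub)
    return "-".join(out)


for example in ("roa-tara", "pa-guru", "EN-gb", "be-x-old"):
    print(example, "->", conventional_case(example))
```

Because the rule looks only at length, the four-letter "tara" is cased like a script subtag even though no such script exists.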

> * "simple"

[[simple:]] has the correct lang attribute value "en".

> The language tag "pa-Guru" you "corrected" by replacing it by "pa"

Not sure who you're talking to here, but it certainly wasn't me who did this. I doubt it was Gerard either.
Comment 10 Philippe Verdy 2014-02-26 16:23:30 UTC
> > The real language tags that are violating BCP 47 are for example:
> > * "nrm" (it also violates ISO 639-3)
> "nrm" refers to Narom language. However, IANA have not provided a language code for Norman, so I don't know what we're meant to do here. I notice that www.wikipedia.org uses the made-up code "roa-x-nrm" for this language.

The IANA database cannot reference this language if it's not even encoded in ISO 639 (so that one of the ISO 639 codes can be imported to the IANA database), and as long as there has not been any specific registration for the language in the IANA database.

"roa-x-nrm" would be conforming, but linguists still consider Norman to be a regional variant of French. "fr-x-norman" or just "fr-x-nrm" would be conforming and would make more sense than using the "roa" language family code.

(In BCP 47, the use of language family codes is not invalid but it is highly discouraged, as opposed to codes of macrolanguages like zh/Chinese or sh/Serbocroatian, which group several individual languages that have a large common base for mutual understanding, even if they are written with distinct scripts, because transliterators work quite well within the same individual language.)

Other examples:

"be-x-old" is perfectly conforming to BCP 47 (and so is also conforming to HTML or XML), even if this orthography now has a preferred language tag (but the association between "be-x-old" and "be-tarask" is private to Wikimedia projects, and not found in the IANA database), so for most software "be-x-old" and "be" alone cannot be distinguished.

By contrast, "zh-gan", "zh-hak" or "zh-yue" are also conforming, but they now have a documented preferred value in the IANA database without the "zh-" prefix of the macrolanguage.

"zh-cmn" is also conforming, just like "cmn" alone, but both have a preferred value which is "zh" (the code "zh" of the macrolanguage, because Mandarin is the default language assumed in many applications for the Chinese macrolanguage).

One of the purposes of BCP 47 tags is also to allow easy mapping of language/locale fallbacks (fallbacks are definitely not a goal in ISO 639), but also to preserve backward compatibility of tagged contents (not guaranteed by ISO 639 codes).
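
The fallback mapping and the preferred values mentioned above can be sketched in a few lines of Python. This is an illustration under assumptions: `fallback_chain` follows the truncation idea of the RFC 4647 "lookup" scheme (drop the last subtag, and drop a single-character subtag such as "x" together with it), and `PREFERRED` holds only the three pairs named in this comment, not the full IANA data. Both names are made up for the example.

```python
# Only the pairs named in the comment above; the IANA database has many more.
PREFERRED = {"zh-gan": "gan", "zh-hak": "hak", "zh-yue": "yue"}


def preferred(tag):
    """Return the preferred form of a deprecated tag, or the tag unchanged."""
    return PREFERRED.get(tag.lower(), tag)


def fallback_chain(tag):
    """Progressively truncate a tag from the end, most specific form first."""
    subtags = tag.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # A single-character subtag such as "x" cannot end a tag, so it is
        # removed together with the subtag that followed it.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


for example in ("zh-hak", "be-x-old", "fr-x-nrm", "pa-Guru"):
    print(example, fallback_chain(example), preferred(example))
```

Under this scheme "fr-x-nrm" falls back directly to "fr", while "roa-x-nrm" would fall back to the "roa" family code, which is one practical difference between the two private-use tags discussed above.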
