Last modified: 2013-07-25 09:00:06 UTC
'simple' isn't a valid language code, though we're outputting it for interlanguage links. We could add a simple hack here that makes 'simple' output lang="en" instead. I do have a more interesting idea, though: instead of that, how about we swap 'simple' for en-x-Simple and add code that lets us create aliases for language codes, so that simple: stays equivalent to en-x-Simple? Going by BCP 47 (https://www.rfc-editor.org/rfc/bcp/bcp47.txt), the code en-x-Simple is valid: it's an 'en' language code with a private-use subtag of 'Simple'. BCP 47 reserves x-* for private use, meaning things that wouldn't be registered, and that is essentially what we're talking about here.
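A rough sketch of the alias idea, written in Python rather than MediaWiki's PHP so it can be run standalone. The names LANGUAGE_ALIASES, resolve_alias and has_wellformed_private_use are made up for illustration and do not exist in MediaWiki. The check covers only the private-use part of a tag ("x" followed by subtags of 1 to 8 alphanumerics); it is not a full BCP 47 validator.

```python
import re

# Hypothetical alias table: interlanguage prefix -> language tag to emit.
LANGUAGE_ALIASES = {"simple": "en-x-Simple"}

# Private-use sequence per BCP 47: "x" followed by one or more
# subtags of 1 to 8 alphanumeric characters each.
PRIVATE_USE = re.compile(r"x(-[a-z0-9]{1,8})+", re.IGNORECASE)


def resolve_alias(code):
    """Return the aliased tag for code, or code itself if no alias exists."""
    return LANGUAGE_ALIASES.get(code, code)


def has_wellformed_private_use(tag):
    """Check only the private-use part of tag; not a full BCP 47 validator."""
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" not in lowered:
        return False
    # Everything from the "x" singleton onwards is the private-use sequence.
    private_part = "-".join(subtags[lowered.index("x"):])
    return PRIVATE_USE.fullmatch(private_part) is not None
```

With this, the simple: prefix keeps working as an alias while the emitted tag is en-x-Simple, and codes without an alias pass through unchanged.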
This issue is actually much wider. It applies to all languages listed here: http://en.wikipedia.org/wiki/List_of_Wikipedias#Wikipedia_edition_codes

The solution: we need to add a "language" tag normalization in core, through which we pass the ll_lang of the Langlinks table in the database before we actually generate a 'real' lang tag. Such a normalization table (language_mapping?) would have:
* ll_lang: the wiki-defined interlanguage code
* wiki_variant: a wiki language variant code
* iso 639-1
* iso 639-2
* bcp47 code (includes private codes, variant names, sign, transliteration etc)

It could probably be built upon 'Extension:CLDR'.
Slightly related: r103640
Created attachment 9591: Use $wgDummyLanguageCodes for getting the right language code

I think it is sufficient to use $wgDummyLanguageCodes (per r103640) for this, since it will contain all code mappings relevant for MediaWiki/Wikimedia. A database table like the one you propose seems like overkill to me.
This is weird, those attributes were only added just today: r104778. So, what did this bug report refer to? The class="interwiki-simple" on the <li> element?
Since 'simple' is in Language's list of language names, I think it'd be cleaner to have the logic for this living in Language. Maybe Language::normalizeCode( $code )? That could also normalize a number of fuzzy old things that we still have in our list for compatibility:
* simple -> en or en-x-simple
* bat-smg -> sgs
* roa-rup -> rup
* fiu-vro -> vro
etc.

Note that there are manual language links on [[en:Main_Page]] at the bottom (not in the sidebar) which have 'lang' attributes on spans surrounding the links. The one for 'Simple English' does use 'simple' as the value here, but this can be changed by editing the page or template.
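To make the proposed mapping concrete, here is a minimal sketch in Python (not MediaWiki's PHP, and not the eventual patch). LEGACY_TO_STANDARD, standard_code and lang_attribute are hypothetical names. The pairs are the ones listed above, and 'simple' -> 'en' is just one of the two options mentioned.

```python
from html import escape

# Hypothetical mapping from legacy wiki codes to standard codes.
# "simple" -> "en" is one of the two options floated ("en" or "en-x-simple").
LEGACY_TO_STANDARD = {
    "simple": "en",
    "bat-smg": "sgs",
    "roa-rup": "rup",
    "fiu-vro": "vro",
}


def standard_code(code):
    """Return the standard code for a legacy code, else the code unchanged."""
    return LEGACY_TO_STANDARD.get(code.lower(), code)


def lang_attribute(code):
    """Build a lang="..." attribute from the mapped code."""
    return 'lang="{}"'.format(escape(standard_code(code), quote=True))
```

The interlanguage prefix itself (and things like class="interwiki-simple") would stay as they are; only the value written into lang="" goes through the mapping.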
'normalize' would be ambiguous in this function name. It should be something that refers to getting a standards-compatible language code.
I was thinking about a Language function as well. Maybe getCorrectCode() or getActualCode()? We might also use it for other lang="" attributes, like the one on the html tag. I see that wgLanguageCode has been changed for several wikis (like 'alswiki' => 'gsw') but not for all of them (e.g. not for fiu-vro).
getBcp47Code()?
Created attachment 9656: Language.php patch, including first go at a mapping table...
Some comments:
1: We should probably have getBCP47LanguageTag( $code, [$variant] ).
2: My patch maps getCode() to use getBCP47LanguageTag(), but that was just to get some quick testing done, of course.
3: The table... I'm not entirely sure we want to use wgDummyLanguageCodes, or alternatively, whether that table should contain qqq and qqz in the way that it does now. Perhaps adapt wgDummyLanguageCodes into wgLanguageTagConversion() = wgDummyLanguageCodes ++ qqq + qqz; or something similar.
Other way around, of course: wgDummyLanguageCodes = wgLanguageTagConversionTable ++ qqq + qqz;
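A small sketch of that composition, in Python rather than PHP and purely illustrative. The table contents are sample entries taken from this thread, the pseudo-codes mapping to themselves is an assumption, and bcp47_language_tag / is_dummy_code are hypothetical helper names.

```python
# Hypothetical stand-in for the proposed wgLanguageTagConversionTable.
# Sample entries only; "simple" -> "en-x-simple" is one option from this thread.
LANGUAGE_TAG_CONVERSION = {
    "simple": "en-x-simple",
    "fiu-vro": "vro",
}

# Pseudo-codes that are not real languages. Mapping them to themselves
# is an assumption made for this sketch.
PSEUDO_CODES = {"qqq": "qqq", "qqz": "qqz"}

# wgDummyLanguageCodes = wgLanguageTagConversionTable ++ qqq + qqz
DUMMY_LANGUAGE_CODES = {**LANGUAGE_TAG_CONVERSION, **PSEUDO_CODES}


def bcp47_language_tag(code):
    """Convert using only the tag conversion table, never the pseudo-codes."""
    return LANGUAGE_TAG_CONVERSION.get(code, code)


def is_dummy_code(code):
    """True for any code in the combined dummy-code table."""
    return code in DUMMY_LANGUAGE_CODES
```

The point of the split is that tag conversion never has to know about qqq and qqz, while existing users of the dummy-code table still see them.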
See also r105812 and friends.
https://gerrit.wikimedia.org/r/22727
Changeset dropped.