Last modified: 2014-11-17 09:47:13 UTC

Bug 15161 - A generalized conversion engine
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Language converter
Importance: Low enhancement with 2 votes
Assigned To: C. Scott Ananian
Depends on:
Blocks: 26121
Reported: 2008-08-14 07:41 UTC by Milos Rancic
Modified: 2014-11-17 09:47 UTC


Description Milos Rancic 2008-08-14 07:41:27 UTC
(I am writing it here because, AFAIK, it is necessary to make changes inside of Edit.php for the implementation of this idea. Also, some DB changes are needed, too. However, feel free to move it wherever you think that it should stay.)

The present state of the conversion engine, designed by Zhengzhu, may be described as follows:

* There is one form archived in DB.
* Contributors have to know both (or more than two) scripts if they are willing to edit pages.
* Scripts have to be generally 1:1 in substituting elements.

Such an approach may work in the classical cases, such as the Chinese and Serbian engines. Every educated Serbian knows both the Cyrillic and Latin alphabets (Cyrillic is taught from the 1st year of primary school, Latin from the 2nd). AFAIK, it is not so hard for a Chinese reader to work out the meaning of a character from a non-native script. Also, in both cases the scripts correspond almost 100% 1:1 (there are some exceptions, but it is not hard to mark them with the exception markup: -{ ... }-). (There are perhaps up to 10 implementations of this principle across the MediaWiki languages.)

However, there are a number of very different situations in the world. Some scripts differ from each other a lot, and education issues may be significant. For example, while Tajik and Persian are structurally the same language system, it is not common to find a Tajik who is able to read the Perso-Arabic script or a Persian who is able to read the Cyrillic script. There are also complex issues relating to the "interpunctional behavior" of letters: the rules for using small and capital letters differ between the Cyrillic (Latin, Greek) and Arabic scripts.

So, the goals of the generalized conversion engine for MediaWiki are:

* Allow contributors to see and edit pages in their preferred script.
* Make an open set of rules which may be applied easily for different cases.
* Solve different kinds of "interpunctional behavior" problems in a generalized manner.
* Introduce dictionary-based conversions. (This was initially introduced into the Serbian engine for the Ekavian-Iyekavian paradigm, but it was abandoned because no work was done on it after the initial implementation.)
* A future goal, entirely possible once this engine is implemented: turn the conversion engine into a user-side feature. When script differences are great, some users find it easier to read the content in their preferred script (for example, a European may find it easier to read Chinese transcribed into Latin).

I was thinking about some of the approaches to this issue and I may guarantee that there are better ones :) However, I'll list some of them:

* There should be fields in database for different versions of the article. Or, inside of one field it should be possible to separate different versions. Here is the example for the second idea:
** There are many situations in which forms are exceptional. A classic example comes from the relation between the Latin and Arabic scripts. The Arabic script doesn't have capital letters (or has different rules for them).
** So, if a sentence begins with "Llll" in Latin, which is transcribed as "aaaa" in Arabic, the form in the database should be something like -{ Latin: Llll; Arabic: aaaa }-. However, such markup shouldn't be visible to the editor. The editor of the Latin text should see just "Llll" and the editor of the Arabic text should see just "aaaa".
** In this case, if the editor of the Latin text changes it, the general rules should be applied. If the editor of the Arabic text changes it, some specific rules should be applied (for example: if the previous word ends with a period, the letter should be capitalized in Latin; if not, it should stay lowercase). But if the result is not correct in Latin (for example, the word is a personal name in the middle of a sentence), then when the editor of the Latin text fixes "llll" (which corresponds to "aaaa") to "Llll" (which also corresponds to "aaaa"), the stored form should be changed to -{ Latin: Llll; Arabic: aaaa }-.
** Of course, both editors should be able to switch to a "meta mode", which would show them all of the markup and allow them to fine-tune it.
** When everything is changed (a major edit), the general and specific rules should be applied, but editors should also be allowed to fix errors.
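
The stored-form idea in the list above can be illustrated with a minimal Python sketch. It assumes the ad-hoc markup exactly as written in this description (script names as keys, ";" as separator); the helper names parse_variants and render_for are hypothetical and are not part of MediaWiki.

```python
import re

# Matches the inline markup used in the description: -{ Name: text; Name: text }-
VARIANT_BLOCK = re.compile(r"-\{(.*?)\}-", re.DOTALL)

def parse_variants(body):
    """Split 'Latin: Llll; Arabic: aaaa' into {'Latin': 'Llll', 'Arabic': 'aaaa'}."""
    forms = {}
    for part in body.split(";"):
        if ":" in part:
            name, text = part.split(":", 1)
            forms[name.strip()] = text.strip()
    return forms

def render_for(stored, script):
    """Return the text an editor of `script` would see, with the markup hidden."""
    def pick(match):
        forms = parse_variants(match.group(1))
        # Leave the raw block in place if this script has no recorded form.
        return forms.get(script, match.group(0))
    return VARIANT_BLOCK.sub(pick, stored)

stored = "-{ Latin: Llll; Arabic: aaaa }- bbbb."
print(render_for(stored, "Latin"))   # Llll bbbb.
print(render_for(stored, "Arabic"))  # aaaa bbbb.
```

The "meta mode" would simply show the stored string unchanged. The sketch only covers reading; writing an edit back into the stored form is the harder part discussed above.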

The main reason I am filing this as a bug is that I am not a PHP programmer (although I am able to program in PHP :) ), which means that I am not able to solve all of the complex programming issues this would require in MediaWiki. However, as a [formal] linguist I am willing to participate actively in working on this issue. I am willing to cover all of the linguistic work needed for this (including finding relevant people for problems related to different scripts).
Comment 1 Tim Starling 2009-12-31 02:55:44 UTC
I'm not sure what you mean by dictionary-based conversion, since the current converter already has dictionary-like tables. However there is a need for word segmentation, which the current converter does not have.

I'd like to know if there's a pre-existing domain-specific language we can use for rules, which users might be able to understand and edit, similar to the snowball language which is used for stemming.

I'm interested in transformations which are particularly challenging from a computational perspective, such as Arabic vowel marking. Life without challenges can be very dull. References would be appreciated.
Comment 2 Milos Rancic 2009-12-31 04:03:10 UTC
Dictionary-based conversion: The simplest example is the difference between British and American English (and a lot of languages or language variants have such a need): kilometer-kilometre. I haven't looked at the engine for more than a year, and I am not sure how hard it would be to implement such an engine efficiently (although Robert told me that it is possible). For example, Serbian has a class of ~10,000 lexemes * ~10 forms, which gives ~100,000 words to replace (Ekavian/Iyekavian differences), with probably two out of ten words in a text needing replacement.

I would use regular expressions as a domain-specific language + some simple syntax for variables, Perl-like or Python-like, or even XML-like. It is a part of contemporary education of linguists and there are a lot of programmers on wiki, who may help.

I would start with two or three "simple" tasks, so we may see where we are going:
* MediaWiki tasks:
** Editing Edit.php in such a way that the wikitext in the database doesn't need to correspond to the wikitext in the edit box.
** Making a simple, but extensible syntax for adding conversion rules; with on-wiki pages for conversion.
* Language-specific tasks:
** The problem of Arabic-Latin conversion: capital letters, vowel marking (when it is possible).
** Two paradigms problem in Serbian: Cyrillic/Latin and Ekavian/Iyekavian.

I'll start by finding references. The problem is that this field is relatively obscure: while regular expressions, some interpreted languages (AWK, Perl, Python) and markup languages (SGML and XML) have become part of the education of linguists, it is still too early for papers in that field. Also, syntax is now getting much more attention because of translation engines, and translation engines work on completely different models.

I suppose that the best which we may find is:
* Japanese script usage, which might cover a lot of problems. However, I am not sure how many of those papers are available in English.
* Some (probably very) general theories.
* Classical philological papers that descriptively list the rules. Such papers are usually written in the native language. For example, I can find good references for Ekavian/Iyekavian conversion in Serbian.
* Blogspot/Google has a Latin-Devanagari conversion engine. So, some papers on Hindi or similar languages can probably be found.

So, we should make a plan for what to do, step by step. Maybe the best idea is to set up a project and find interested linguists...
Comment 3 Tim Starling 2009-12-31 05:07:46 UTC
The current converter has a conversion table, which has entries of any length, they are not required to be single characters. Where multiple table entries match at a given location in the source string, the longest one is chosen. This works well for Chinese where groups of 2 or 3 distinctive characters (occasionally up to 6) need to be treated as a unit. But it's rather awkward for languages like English, since a rule "color -> colour" would cause colorado to be converted to colourado. A better algorithm for languages like English would be to split the string into words, and then to do a hashtable lookup for each word.
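
The difference between the two strategies described above can be sketched in Python. This is an illustration of the idea only, not the actual converter code; the one-entry table and the function names are assumptions made for the example.

```python
import re

# Toy table with a single entry, taken from the example in this comment.
TABLE = {"color": "colour"}

def convert_longest_match(text, table):
    """At each position, replace the longest table key that matches there."""
    keys = sorted(table, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No key matches at this position: copy one character through.
            out.append(text[i])
            i += 1
    return "".join(out)

def convert_by_word(text, table):
    """Split into word and non-word runs, then look each word up whole."""
    tokens = re.findall(r"\w+|\W+", text)
    return "".join(table.get(tok, tok) for tok in tokens)

sample = "the color of colorado"
print(convert_longest_match(sample, TABLE))  # the colour of colourado
print(convert_by_word(sample, TABLE))        # the colour of colorado
```

The word-level version depends on a usable definition of "word". The \w+ split is adequate for English but not for scripts written without spaces, which is presumably why the substring approach suits Chinese.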

Regarding the number of table entries: Chinese has 6500 for simplified and 9600 for traditional. As long as the table fits in memory, lookups are fast enough, but the per-request initialisation speed is already quite slow for Chinese and would be much worse if the table was 10 times bigger. Some optimisation work is needed. With the initialisation overhead removed, say by better caching, then we could do a table with millions of entries.
Comment 4 Milos Rancic 2009-12-31 19:40:57 UTC
Ah, as a linguist, I distinguish between a grapheme-level dictionary and a word-level dictionary. Only the word-level one is a "dictionary" for me :)

Yes, words should be extracted from text.

Also, I really think that the wikitext in the database should be some kind of meta-wikitext. It consumes much less processor power to do the conversion once, when the text is submitted, than to apply a bunch of different rules every time a page is read.

And once we have extracted the words, and if that is done only when text is submitted, we may be able to support even more language conversions and markup.
Comment 5 C. Scott Ananian 2013-08-15 15:14:18 UTC
Visual Editor already contains code to detect changed regions of pages and reserialize only the changed regions, in order to minimize dirty diffs. It seems like this is a good foundation for implementing a more intelligent language converter: the entire article can be language-converted in the editor, but only the changed regions will get resaved in the translated variant.

It would probably be worth marking the variant used in each changed region as well.  That might be easier to represent in the DOM than in wikitext.
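
A rough Python sketch of that idea follows. It is not how Visual Editor or Parsoid work internally. It assumes regions line up one-to-one between the stored and edited text, and the GB_TO_US table and the to_us and save helpers are hypothetical names made up for the example.

```python
# Toy en-gb -> en-us word table; a stand-in, not real LanguageConverter data.
GB_TO_US = {"colour": "color", "centre": "center"}

def to_us(text):
    """Convert a region for display by replacing whole space-separated words."""
    return " ".join(GB_TO_US.get(word, word) for word in text.split(" "))

def save(stored, shown, edited, editor_variant):
    """Merge an edit back into storage region by region.

    stored: list of (variant, text) pairs as currently saved.
    shown:  the converted text the editor was shown for each region.
    edited: the text the editor submitted for each region.
    """
    result = []
    for (variant, text), before, after in zip(stored, shown, edited):
        if after == before:
            # Untouched region: keep the original text and its variant tag.
            result.append((variant, text))
        else:
            # Changed region: save what was typed, tagged with the editor's variant.
            result.append((editor_variant, after))
    return result

stored = [("en-gb", "The colour is red."), ("en-gb", "The centre is empty.")]
shown = [to_us(text) for _, text in stored]
edited = [shown[0], "The center is full."]
print(save(stored, shown, edited, "en-us"))
# [('en-gb', 'The colour is red.'), ('en-us', 'The center is full.')]
```

Only the second region is resaved, so the first keeps its original spelling and produces no diff.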

I am interested in working on this problem.

See also: bug 26121, bug 31015, bug 52661, and
Comment 6 C. Scott Ananian 2013-08-15 16:34:36 UTC
Some discussion from IRC. Pig Latin would be a good English variant for exploring some of the non-reversible language variant pairs (like Arabic/Latin).

(12:03:51 PM) cscott: James_F: i've been talking to liangent about language converter.  it would be nice if VE could present the text to be edited in the proper language variant.  the way that VE/parsoid selser works makes this feasible I think.
(12:04:40 PM) cscott: that is, we convert all the article text, but we only re-save the edited bits (in the converted variant).  needs some thought wrt how diffs appear, etc.
(12:04:48 PM) James_F: cscott: That sounds totally feasible - your talking about VE requesting zh-hans or zh-hant (or whatever) from Parsoid and showing that?
(12:05:23 PM) cscott: James_F: something like that.  not sure where in the stack language conversion will live exactly.  gwicke_away is talking about it as a post-processing pass.
(12:05:42 PM) cscott: this would also allow language converter to work on portuguese and even en-gb/en-us.
(12:06:32 PM) cscott: ie, you always see 'color' in VE even if the source text was 'colour', but it doesn't get re-saved as 'color' unless you edit the sentence containing the word. (or paragraph?  or word?)
(12:06:34 PM) James_F: Like link target hinting.
(12:07:04 PM) James_F: Selser is paragraph-level right now, I think?
(12:07:39 PM) cscott: i'm not sure, but i think so.  html element-level.
(12:08:39 PM) cscott: it might be that we want to be more precise for better variant support -- or maybe not.  maybe element-level marking of lang= is right (it avoids adding spurious <span> tags just to record the language variant) and we just want to be smarter about how we present diffs.
(12:09:18 PM) cscott: ie, color->colour shouldn't appear as a diff.  (or for serbian, the change from latin to cyrillic alphabet shouldn't be treated as a diff, if the underlying content is the same)
(12:10:36 PM) cscott: there are some tricky issues -- for some language pairs one encoding has strictly more information than the other.  ie, in languages with arabic and latin orthographies, uppercase letters are specific to the latin script.  so if the user writes the text natively in arabic, we won't necessarily know the correct capitalization (and the capitalization of the rest of the paragraph might be lost).
(12:11:05 PM) cscott: so lots of details.  but we should be able to handle the 'easy' cases (where the languages convert w/o information loss) first.
(12:11:22 PM) ***cscott wonders if pig latin is a reversible transformation
(12:17:06 PM) MatmaRex: cscott: it's not, i'm afraid
(12:17:28 PM) MatmaRex: unless you rely on a dictionary
(12:17:39 PM) MatmaRex: as appleway might come from apple or wapple, i think
(12:17:41 PM) cscott: MatmaRex: well, i guess that makes it a great stand-in for the 'tricky' languages.
(12:18:20 PM) cscott: so much the better. ;)
(12:18:29 PM) MatmaRex: it's only the words starting with a vowel that are troublesome, though
(12:20:59 PM) cscott: i think the idea is that, if i edit in en-pig and type 'appleway' it should get saved as appleway and probably a default translation into en-us should be made?  (ie, in the latin/arabic pairs, assume lowercase).  There should be a specific UX affordance in VE to specify both sides of the variant, which serializes into -{en-pig:appleway,en-us:apple,en-gb:apple}-.
(12:24:45 PM) cscott: i guess when you edit text which was originally in en-us, it needs to be converted to -{en-us:apple,en-pig:appleway}- by the language converter so that information isn't lost when the edited en-pig text is saved back.
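
The collision mentioned in the log can be checked with a small Python sketch. It assumes one common Pig Latin rule set (dialects vary) and lowercase, non-empty input. The function names are hypothetical, and the markup string simply follows the comma-separated form quoted in the log.

```python
VOWELS = "aeiou"

def to_pig_latin(word):
    """Vowel-initial words get 'way'; otherwise the leading consonants move to the end."""
    if word[0] in VOWELS:
        return word + "way"
    i = 0
    while i < len(word) and word[i] not in VOWELS:
        i += 1
    return word[i:] + word[:i] + "ay"

def variant_markup(forms):
    """Serialize explicit per-variant forms in the style quoted in the log."""
    return "-{" + ",".join(f"{code}:{text}" for code, text in forms.items()) + "}-"

# Two different sources map to the same output, so the transform is not reversible.
print(to_pig_latin("apple"), to_pig_latin("wapple"))  # appleway appleway

# Recording both sides explicitly avoids having to invert the transform.
print(variant_markup({"en-pig": "appleway", "en-us": "apple", "en-gb": "apple"}))
# -{en-pig:appleway,en-us:apple,en-gb:apple}-
```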
Comment 7 Gabriel Wicke 2013-08-26 19:39:50 UTC
Variant conversion is not bijective, so we can't generally save automatically converted variants without information loss. Even manual conversion of entire sections is considered vandalism in the Chinese Wikipedia. Saving just edited text (down to the word level) would promote more mixed-variant text within the same section, which might not be desirable for wikitext editors.

So this is not easy, and a lot of issues need to be considered. IMO we should first make sure to have solid Parsoid and VE support for unconverted editing of variant-enabled wiki content.
Comment 8 C. Scott Ananian 2013-08-26 19:41:51 UTC
Please see the worked example at the end of comment 6 for how variant conversion can be accomplished without information loss.
Comment 9 Gabriel Wicke 2013-08-26 19:46:30 UTC
(In reply to comment #8)
> Please see the worked example at the end of comment 6 for how variant
> conversion can be accomplished without information loss.

Right, by storing the original text. Which was my point.
