Last modified: 2007-07-13 18:49:45 UTC
If I enter a term containing umlauts in the search field on the left, but transliterate the
umlauts, the lookup fails and I am shown the search page unless a redirect exists for
that title. On the English Wikipedia such redirects exist most of the time. On the German
Wikipedia it would make little sense to create them.
Goedel (for Gödel) fails on de.wikipedia.org; on en.wikipedia.org it resolves correctly.
Godel resolves on the English Wikipedia too.
Would it be possible to resolve transliterated umlauts automatically to the correct character? It surely
wouldn't break anything.
Automatically adding the reverse-transliterated umlauts to the search results would be
desirable in my opinion, in particular on de.wikipedia.org.
For example, entering "kuenstliche intelligenz" in the search box there
came up with the movie "A.I. – Künstliche Intelligenz", but not with the
main entry http://de.wikipedia.org/wiki/K%C3%BCnstliche_Intelligenz
which I was only able to find via the entry for the "AI" acronym.
It would be nice to cover more than just the umlauts and more than just the German Wikipedia. The same (or worse) problem occurs on any Wikipedia that uses the Latin alphabet with special characters: the Spanish, Portuguese, French, Scandinavian (...), Slavic (... ... ...), Turkish languages, to name just the largest groups (with obviously many subgroups).
I agree with #3, and would go further. It would be desirable to treat both the transliterated form of a special character and the plain Latin letter it is derived from (without accents or other marks) as possible occurrences of that special character. For example, oe (common in Germany) or o (common in Sweden) for ö, or aa / a for å. I would even extend this mechanism to treat some groups of punctuation characters as a single character in search, for example different quotation marks " „ “ ” « », different dashes - – —, and different apostrophes ' ’ (see the German article "Germany’s next topmodel"; there is a redirect from the simple version, though).
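A minimal sketch of the matching proposed above, in Python. The transliteration table, the punctuation classes, and the function names are all assumptions made up for illustration; they are not taken from MediaWiki or any search backend. Each term is expanded into its original, transliterated, and accent-stripped forms, and two terms match if they share any form.

```python
import unicodedata

# Assumed transliteration table (illustrative, far from exhaustive).
TRANSLITERATIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss", "å": "aa"}

# Assumed punctuation classes: every member is folded to the first character.
PUNCTUATION_CLASSES = ['"„“”«»', "-–—", "'’"]
PUNCTUATION_MAP = {
    ord(ch): cls[0] for cls in PUNCTUATION_CLASSES for ch in cls[1:]
}


def fold_punctuation(text):
    """Replace typographic quotes, dashes and apostrophes by plain ones."""
    return text.translate(PUNCTUATION_MAP)


def strip_accents(text):
    """Drop combining marks after canonical decomposition (ö -> o, å -> a)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_variants(term):
    """Return the lowercase forms under which a term should be findable."""
    base = fold_punctuation(unicodedata.normalize("NFC", term).lower())
    transliterated = "".join(TRANSLITERATIONS.get(ch, ch) for ch in base)
    plain = strip_accents(base)
    return {base, transliterated, plain}


def matches(query, title):
    """Two terms match if they share at least one variant."""
    return bool(search_variants(query) & search_variants(title))
```

With these assumptions, `matches("Goedel", "Gödel")` and `matches("godel", "Gödel")` both hold, and "Germany's next topmodel" matches the title with the typographic apostrophe without needing a redirect.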
*** Bug 7002 has been marked as a duplicate of this bug. ***
This also applies to pinyin (the romanization of Chinese characters): for example, "wuji" will not find "wújí" (as in the German Wikipedia's article "Taiji"). Both notations are common, the former especially in printed books.
Fixed in Lucene Search 2. Accents are always stripped, and common transliterations are added as aliases (see Bug 7002).
So, searching for Goedel should find Kurt Gödel as the first hit on both en and de wiki.
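For readers who want to see the idea in code, here is a small Python sketch of index-time accent stripping with transliteration aliases. It is not the Lucene Search 2 implementation; the alias table, the function names, and the sample titles are assumptions chosen for illustration, and ranking (which hit comes first) is not modeled.

```python
import unicodedata
from collections import defaultdict

# Assumed alias table for common German transliterations (illustrative only).
ALIASES = {"ä": "ae", "ö": "oe", "ü": "ue"}


def strip_accents(text):
    """Drop combining marks after canonical decomposition (ö -> o, ú -> u)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_keys(word):
    """Keys for one word: the accent-stripped form plus its alias form."""
    word = unicodedata.normalize("NFC", word).lower()
    alias = "".join(ALIASES.get(ch, ch) for ch in word)
    return {strip_accents(word), strip_accents(alias)}


def build_index(titles):
    """Map every key of every word to the titles that contain that word."""
    index = defaultdict(set)
    for title in titles:
        for word in title.split():
            for key in index_keys(word):
                index[key].add(title)
    return index


def search(index, query):
    """Return the titles that contain every (accent-stripped) query word."""
    words = [strip_accents(w.lower()) for w in query.split()]
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


index = build_index([
    "Kurt Gödel",
    "Künstliche Intelligenz",
    "A.I. – Künstliche Intelligenz",
])
print(search(index, "Goedel"))
print(search(index, "kuenstliche intelligenz"))
```

In this sketch "Goedel", "Godel" and "Gödel" all reach "Kurt Gödel", and "kuenstliche intelligenz" returns both the main entry and the movie. Because accents are stripped on both sides, a word such as "wújí" is indexed under "wuji" as well.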