Copied from the thread "MediaWiki API and Chinese language variants" on the mediawiki-api mailing list:

Summary: When called with the action=raw parameter the index.php API does not return the correct language variant specified using the variant parameter. index.php?title=西恩塔&action=raw&variant=zh-cn and index.php?title=西恩塔&action=raw&variant=zh-tw should return markup in the specified variant but currently both return the same (zh?) variant.

Full Thread:

On 19/01/2008, Jan Hecking <jhecking@yahoo-inc.com> wrote:
> To follow up on this somewhat old thread: I finally got around to actually
> testing Paolo's suggestion of using the index.php API instead of api.php.
> Turns out it doesn't actually work. While the index.php API does have a
> variant parameter that allows to select one of the Chinese language variants
> (e.g. zh, zh-hk, zh-tw) it does not actually honor this parameter when
> combined with action=raw. When returning the raw Wiki markup it always
> returns the same variant (zh?) no matter what variant is specified.
>
> $ curl -s "http://zh.wikipedia.org/w/index.php?title=%E8%A5%BF%E6%81%A9%E5%A1%94&action=raw&variant=zh-cn" > zh-cn
> $ curl -s "http://zh.wikipedia.org/w/index.php?title=%E8%A5%BF%E6%81%A9%E5%A1%94&action=raw&variant=zh-tw" > zh-tw
> $ diff zh-cn zh-tw
>
> diff shows that the markup returned is identical. With action=view (the
> default) the output is clearly different.
>
> So it looks like there is actually no way to get the raw markup in
> different language variants?

This is definitely a bug. I can verify that the variants work with action=render but not with action=raw. Please file a bug report on bugzilla.

Andrew Dunbar (hippietrail)

> Thanks,
> Jan
>
> On 12/13/2007 6:45 PM, Jan Hecking wrote:
>> On 12/13/2007 11:28 PM, Paolo Liberatore wrote:
>>> On Thu, 13 Dec 2007, Jan Hecking wrote:
>>>> On 12/13/2007 1:04 AM, Roan Kattouw wrote:
>>>>> Jan Hecking schreef:
>>>>>> Hi,
>>>>>>
>>>>>> Is it possible to retrieve content in different Chinese language
>>>>>> variants using the /w/api.php API? There doesn't seem to be a variant
>>>>>> or language parameter that would allow selecting a variant like
>>>>>> "zh-tw" or "zh-hk". Is there some other way to do this?
>>>>>
>>>>> How is this done in the regular user interface, then?
>>>>
>>>> I would like to know that as well. :)
>>>>
>>>> My suspicion is that the user interface, i.e. the frontend servers, do
>>>> the conversion. Which would mean that all users of the MediaWiki API
>>>> would have to replicate that work. That would severely limit the use of
>>>> the API for Chinese language content IMHO. But then I don't know much
>>>> about MediaWiki yet and maybe I have just missed something obvious.
>>>>
>>>> Thanks,
>>>> Jan
>>>
>>> There is a "variant" parameter in
>>> http://www.mediawiki.org/wiki/Manual:Parameters_to_index.php
>>> I believe it's only used for Chinese.
>>
>> Thanks for the reminder, Paolo! I hadn't considered index.php before
>> because I assumed it only returns the rendered HTML markup. But now I
>> saw that there is an action=raw parameter which returns the raw wiki
>> markup that I'm looking for.
>>
>> However this API has one other drawback: in contrast to api.php it
>> doesn't have an option to resolve redirects automatically. When calling
>> api.php I was using the redirects parameter to do so, but this doesn't
>> seem to be supported by index.php when using action=raw (only for
>> action=view). That means I would potentially have to make multiple calls
>> to resolve redirects manually. Or is there a way to avoid this?
>>
>> Thanks,
>> Jan
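For reference, a minimal Python sketch of the same check Jan ran with curl in the quoted mail: it fetches the article under both action=raw and action=render with different variant values and reports whether the two responses differ. The title and URLs are taken from the thread; everything else (the User-Agent string, the comparison logic) is illustration only.

# Minimal reproduction sketch (assumes network access to zh.wikipedia.org).
# Mirrors the curl/diff test from the thread: action=raw should differ per
# variant if the bug were fixed; action=render already does.
from urllib.request import urlopen, Request

BASE = "http://zh.wikipedia.org/w/index.php"
TITLE = "%E8%A5%BF%E6%81%A9%E5%A1%94"  # 西恩塔, percent-encoded as in the thread

def fetch(action, variant):
    # Build the same URLs used in the curl test above.
    url = f"{BASE}?title={TITLE}&action={action}&variant={variant}"
    req = Request(url, headers={"User-Agent": "variant-bug-check/0.1 (illustration)"})
    with urlopen(req) as resp:
        return resp.read()

for action in ("raw", "render"):
    cn = fetch(action, "zh-cn")
    tw = fetch(action, "zh-tw")
    # With the bug present, action=raw returns identical bytes for both
    # variants, while action=render differs, as reported in the thread.
    print(action + ":", "identical" if cn == tw else "different")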
Could someone please at least confirm whether the action=raw method of the index.php API is supposed to work with the variant parameter? If not, then we will have to look into either doing the transcoding client-side or scraping the rendered HTML markup from the wiki servers instead of using the raw markup. Neither option looks very good. :(
action=raw is by definition raw. No variant processing would be applied to output.
(In reply to comment #2)
> action=raw is by definition raw. No variant processing would be applied to
> output.

I said that initially as well. But considering the alternatives (scraping action=render HTML or translating variants on the client side, both of which are evil), I think it would be best to add a variant parameter to action=raw, however inconsistent that may be. If no variant parameter is supplied, action=raw will still return the raw wikitext straight from the DB/cache, so no one using the action=raw interface the way it currently works will notice the difference.
Variant processing is for output, not raw source text. Performing variant processing on unprocessed source code is meaningless and would simply corrupt the code.
Since Brion says it doesn't belong at the index.php level, I think there are a few options:

* Do we really need raw+variant? The variant stuff may well interact with templates or other wiki markup. Is there any reason people who want this can't get what they need from action=raw? What is the use case?
* If not index.php, we could add support to api.php.
* How independent is the variant processing code? Could we abstract a function for api.php that takes a string in one variant and transforms it into another variant?
* What about something like {{variant:xxx}}? We have similar wiki-functions already.
Hi Brion, Andrew,

Here is some background on the use case for which I think raw+variant is needed - if there is a better way to achieve the same, please let me know.

We have integrated Wikipedia content into our mobile search product. Wikipedia has very relevant content for a very large number of typical search queries. However, from the majority of the mobile devices that our users are using (particularly in emerging markets) this content is not easily accessible, since there is no mobile-specific version of the Wikipedia web site. So instead of linking directly to the Wikipedia article on <intl>.wikipedia.org, we use the raw markup to render a mobile-compatible version of the article within our mobile search product. We use the raw markup for this since we need to be able to render the content in different device-specific markup languages. Currently we only do this for articles from en.wikipedia.org, but we would like to apply this to other languages as well. However, if we cannot get the raw markup in the relevant language variant, we cannot do this.

Here are a few examples of what the integration looks like:

Sample search results pages featuring Wikipedia content:
http://us.m.yahoo.com/p/search?p=who+was+albert+einstein
http://us.m.yahoo.com/p/search?p=what+is+dynamite

Albert Einstein article in oneSearch:
http://us.m.yahoo.com/p/search/wiki?displayurl=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FAlbert_Einstein&.done=http%3A%2F%2Fus.m.yahoo.com%2Fp%2Fsearch%3Fp%3Dwho%2Bwas%2Balbert%2Beinstein

Hope this clarifies the intended use case. Let me know what your thoughts on this are.

Thanks,
Jan
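To make that pipeline concrete, here is a rough Python sketch of the kind of client described above: it fetches the raw wikitext for an article via action=raw and follows redirects by hand, since action=raw does not resolve them (as noted in the quoted thread). The redirect pattern, the retry limit, and the User-Agent string are illustrative assumptions, not part of any existing client code.

# Illustrative client sketch: fetch raw wikitext and resolve redirects manually,
# since index.php?action=raw does not follow them. The redirect pattern and
# retry limit here are assumptions for illustration only.
import re
from urllib.parse import quote
from urllib.request import urlopen, Request

REDIRECT_RE = re.compile(r"#(?:REDIRECT|重定向)\s*\[\[([^\]|#]+)", re.IGNORECASE)

def fetch_raw(host, title, variant=None, max_redirects=3):
    for _ in range(max_redirects + 1):
        url = f"http://{host}/w/index.php?title={quote(title)}&action=raw"
        if variant:
            url += f"&variant={variant}"  # currently ignored by action=raw (this bug)
        req = Request(url, headers={"User-Agent": "raw-fetch-sketch/0.1 (illustration)"})
        with urlopen(req) as resp:
            text = resp.read().decode("utf-8")
        m = REDIRECT_RE.match(text.strip())
        if not m:
            return text
        title = m.group(1).strip()  # follow the redirect target and retry
    raise RuntimeError("too many redirects")

# Example: print(fetch_raw("zh.wikipedia.org", "西恩塔", variant="zh-tw")[:200])

If the variant parameter were honored by action=raw, passing variant= here would be all such a client needs; today the returned wikitext is the same regardless of the value.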
Conversion applies to output, not to raw source code. Early conversion would corrupt the markup (particularly for Latin<->Cyrillic, Latin<->Arabic, etc. variants).
I'm not quite sure I follow: how would the conversion corrupt the wiki markup? The markup itself only consists of Latin characters, right? Those wouldn't be affected by the conversion, I assume. And if they are affected, then how come the conversion does not affect the final output, which is markup as well - HTML markup in this case? I guess I have to take a closer look at how the variant parameter works for Cyrillic/Arabic languages. Maybe it works differently than for Chinese. Thanks, Jan
Template names, magic keywords, tag names, HTML fragments, bla bla bla.
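To make that concrete with a toy example: if variant conversion were applied to the raw wikitext, vocabulary-level replacements would rewrite the text inside template invocations just as readily as the article prose. The two-entry conversion table and the template name below are made up for illustration; they are not the real zh conversion data or a real template.

# Toy illustration of why converting raw wikitext is unsafe: a vocabulary-level
# replacement (a tiny made-up table, not the real zh conversion data) also
# rewrites the template name inside {{...}}, so the transclusion breaks.
TOY_ZH_CN_TO_TW = {
    "软件": "軟體",   # "software": mainland vs. Taiwan vocabulary
    "信息": "資訊",   # "information"
}

def naive_convert(text):
    for src, dst in TOY_ZH_CN_TO_TW.items():
        text = text.replace(src, dst)
    return text

wikitext = "{{软件信息框|名称=示例}}一些正文，其中提到软件和信息。"
print(naive_convert(wikitext))
# The body text is converted as intended, but the template call has become
# {{軟體資訊框|...}}, which no longer matches the template defined on the wiki.

The prose converts as intended, but the transclusion now points at a name that does not exist, which is exactly the kind of corruption described above.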
As far as I can see, our alternatives are 1) scrape the HTML markup instead of the raw wiki markup, or 2) use the raw markup and duplicate the whole transcoding logic on our servers. Am I missing anything? Thanks, Jan
> 1) scrape the HTML markup instead of the raw wiki markup

This was always going to be easier anyway. From the URLs you posted earlier it looks like all you need is the plain text. There is plenty of existing free code to extract plain text from HTML. It's much easier than parsing wiki templates etc.
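As a rough sketch of that route (using only the Python standard library, not whatever existing extraction code the thread had in mind): fetch the rendered article via action=render with the desired variant, which the thread confirms is honored, then strip the tags down to plain text. The User-Agent string and the simple whitespace normalization are illustrative choices.

# Rough sketch of the scraping route: fetch rendered HTML with the desired
# variant (action=render honors it) and strip tags to plain text.
# Standard library only; real extraction would want smarter handling of
# tables, references, infoboxes, etc.
from html.parser import HTMLParser
from urllib.parse import quote
from urllib.request import urlopen, Request

class TextExtractor(HTMLParser):
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self.parts).split())

def article_text(host, title, variant):
    url = (f"http://{host}/w/index.php?title={quote(title)}"
           f"&action=render&variant={variant}")
    req = Request(url, headers={"User-Agent": "plaintext-sketch/0.1 (illustration)"})
    with urlopen(req) as resp:
        html = resp.read().decode("utf-8")
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()

# Example: print(article_text("zh.wikipedia.org", "西恩塔", "zh-tw")[:200])

A real integration would probably keep some structure (headings, paragraphs, lists) rather than flattening everything to a single string, but the basic approach is this simple.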