Last modified: 2014-07-21 07:08:59 UTC
Originally from: http://sourceforge.net/p/pywikipediabot/bugs/1295/
Reported by: ganz-ru
Created on: 2011-02-15 20:40:15
Subject: Problem with Tibetan script
Original description: There is a hard edit war here: http://en.wikipedia.org/w/index.php?title=Podolsk&action=history . Bots running an old Python version add an incorrect Tibetan interwiki link, while a bot running Python 2.7.1 does it correctly.
My crystal ball suggests:

- Wikitanvirbot is running from the toolserver, which has a patched 2.7.1 without the unicode bug
- TXiKiBoT is running an old version of pywikipediabot \(there is no python version in the edit summary\) on python 2.6.5+

Conclusion: TXiKiBoT should be blocked until its owner fixes his/her setup.
Many other bots have the same problem: my bot, LucienBOT, VolkovBot. All of them have versions newer than 2.6.5.
I'm sorry. All of them have versions older than 2.6.5.
I see, indeed. Could you post the output of `import query` followed by `print query.json.__file__` for your bot? I cannot reproduce the bug, so I suspect it might be due to a buggy json package. I'll do some package sniffing to check this further.
If I did it right, the output is 4 files: `__init__.pyc`, `decoder.pyc`, `encoder.pyc`, `scanner.pyc`. Do you need them? I'm sorry, I'm not a Python programmer.
I meant the output if you type those lines into the python interpreter - but I've been poking around some more. It does not seem to be JSON related - or maybe it is, or maybe it isn't. I think it has to do with some very old code called 'getall', which gets batches of pages. Sigh. I would very much like to say: "bad luck, try the rewrite" - I'm almost afraid to touch that piece of code. I'll see if I can whip up a test you can run, though, to confirm my suspicions. In the meantime, could you post the output of version.py? Thanks.
- **priority**: 5 --> 7
Version.py:

Pywikipedia \[http\] trunk/pywikipedia \(r8948, 2011/02/13, 09:19:56\)
Python 2.6.4 \(r264:75708, Oct 26 2009, 08:23:19\) \[MSC v.1500 32 bit \(Intel\)\]
config-settings:
use\_api = True
use\_api\_login = True
unicode test: ok

I'll be glad to help if you write the test code.
Ok. Three things play a role here:

1. The 'correct' page name ends in a 0x200B ZERO WIDTH SPACE. This makes no sense, other than to annoy people.
2. The XML parser strips spaces around titles, including the 0x200B ZERO WIDTH SPACE.
3. MediaWiki does \*not\* do this.

So, first of all, I will rename the article, so it no longer has the 0x200B ZERO WIDTH SPACE in the title. I will see if I can pinpoint the XML bug, so someone else may fix it. However, since bots are fighting each other over it, I suspect this is a small change somewhere in the Python APIs, or a default setting that changed.

Lastly, maybe we should broaden the discussion into mediawiki-tech -- should page titles be allowed to have unicode whitespace characters embedded, especially if they are invisible?
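To make the invisible character visible, a title can be scanned for Unicode format characters (general category `Cf`), which is where U+200B is classified in current Unicode data. The following is only a sketch in Python 3 syntax; the helper name `find_invisible_chars` and the sample title are hypothetical and not part of pywikipedia.

```python
import unicodedata


def find_invisible_chars(title):
    """Return (index, Unicode name) for every format (Cf) character in title."""
    return [
        (index, unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(title)
        if unicodedata.category(char) == "Cf"
    ]


# Hypothetical title ending in U+200B, similar to the case described above.
print(find_invisible_chars("Podolsk\u200b"))
```

A check like this could be run over interwiki targets before saving, to flag titles that look identical on screen but differ by an invisible character.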
Except I don't have move privileges on that wiki. Added a comment on the user talk page of the guy who created the page.
Stripping is done in xmlreader.py:194. Calling `strip()` does indeed seem to remove the U+200B character.
I agree that this symbol in titles is absolutely useless. Not only for bots, but for ordinary users too, since it can break their copy-paste operations. If you can start a discussion on mediawiki-tech, please do it. Thank you.
JAn Dudik moved the page, so the problem should be fixed for now. Keeping this open \(it's a bug in pywikipedia, after all\).

Related:

Python 2.6.5 \(r265:79063, Apr 16 2010, 13:09:56\) \[GCC 4.4.3\] on linux2: `u'\u200b'.strip()` returns `u''`

Python 2.7.1 \(r271:86832, Jan 4 2011, 13:57:14\) \[GCC 4.5.2\] on sunos5: `u'\u200b'.strip()` returns `u'\u200b'`

`\u200b` is technically not whitespace, so `strip()` probably should not delete it. Of course, pwb should not be stripping page titles in the first place.
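One way to make the result independent of the interpreter's idea of whitespace is to pass an explicit character set to `strip()`. This is only a sketch in Python 3 syntax, not the actual fix applied to xmlreader.py; the names `ASCII_WHITESPACE` and `strip_title` are hypothetical, and it assumes that trimming plain ASCII whitespace around the XML text is all that is wanted.

```python
# Characters that are safe to trim around a title read from XML.
# U+200B is deliberately not in this set.
ASCII_WHITESPACE = " \t\r\n"


def strip_title(raw_title):
    """Trim ASCII whitespace only, leaving U+200B and similar characters alone."""
    return raw_title.strip(ASCII_WHITESPACE)


# The trailing zero width space survives; the surrounding blanks do not.
print(repr(strip_title("  Podolsk\u200b \n")))
```

With an explicit set, the same title comes out of the parser regardless of which Unicode tables the running Python ships with.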
And http://bugs.python.org/issue10567 is related to that. In essence: bots running Python < 2.7 were technically doing the wrong thing, but this went unnoticed because no one used the interwiki link to the Tibetan Wikipedia, and all bots did the same wrong thing. Now there are bots running 2.7+ from the toolserver, and the bug has surfaced.
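A bot owner who wants to know which behaviour their interpreter has can test it directly instead of comparing version numbers. The sketch below uses Python 3 syntax (the sessions in this thread were Python 2, where the literal would be `u'\u200b'`); the function names are hypothetical.

```python
import sys


def strips_zero_width_space():
    """Return True if this interpreter's strip() removes U+200B."""
    return "\u200b".strip() == ""


def describe_interpreter():
    """Build a one-line report suitable for pasting into a bug comment."""
    verdict = "strips" if strips_zero_width_space() else "keeps"
    return "Python %d.%d.%d %s U+200B" % (sys.version_info[:3] + (verdict,))


print(describe_interpreter())
```

Testing the behaviour rather than the version number also covers patched builds, such as the toolserver's 2.7.1 mentioned earlier in this thread.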
Wikimedia bugzilla bug entry: https://bugzilla.wikimedia.org/show_bug.cgi?id=27446
- **Group**: --> confirmed - **Priority**: 7 --> 2
*** Bug 55227 has been marked as a duplicate of this bug. ***