Last modified: 2014-07-21 07:08:59 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T57246, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 55246 - Problem with 0x200B ZERO WIDTH SPACE in page titles
Status: NEW
Product: Pywikibot
Classification: Unclassified
Component: General
Version: unspecified
Hardware: All All
Importance: Low normal
Target Milestone: ---
Assigned To: Pywikipedia bugs
Duplicates: 55227
Reported: 2013-10-05 04:50 UTC by Kunal Mehta (Legoktm)
Modified: 2014-07-21 07:08 UTC

Description Kunal Mehta (Legoktm) 2013-10-05 04:50:21 UTC
Originally from: http://sourceforge.net/p/pywikipediabot/bugs/1295/
Reported by: ganz-ru
Created on: 2011-02-15 20:40:15
Subject: Problem with Tibetan script
Original description:
There is a heated edit war here: http://en.wikipedia.org/w/index.php?title=Podolsk&action=history . Bots running an old Python version add an incorrect Tibetan interwiki link, while a bot running Python 2.7.1 does it correctly.
Comment 1 Kunal Mehta (Legoktm) 2013-10-05 04:50:23 UTC
My crystal ball suggests:
\* Wikitanvirbot is running from the toolserver, which has a patched 2.7.1 without unicode bug
\* TXiKiBoT is running an old version of pywikipediabot \(there is no python version in the edit summary\) on python 2.6.5+

Conclusion: TXiKiBoT should be blocked until its owner fixes his/her setup.
Comment 2 Kunal Mehta (Legoktm) 2013-10-05 04:50:25 UTC
Many other bots have the same problem: my bot, LucienBOT, VolkovBot. All of them run versions newer than 2.6.5.
Comment 3 Kunal Mehta (Legoktm) 2013-10-05 04:50:27 UTC
I'm sorry. All of them run versions older than 2.6.5.
Comment 4 Kunal Mehta (Legoktm) 2013-10-05 04:50:29 UTC
I see, indeed.

Could you post the output of

import query
print query.json.\_\_file\_\_

for your bot? I cannot reproduce the bug, so I suspect it might be due to a buggy json package. I'll do some package sniffing to check this further.
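
A minimal sketch of the same kind of check on a current Python 3 interpreter, assuming the goal is only to see which file a module was loaded from. The helper name `module_origin` is hypothetical, and the standard-library `json` module stands in for the `query.json` attribute used above.

```python
import importlib


def module_origin(name):
    """Return the file a module was loaded from, or None if it has no file."""
    module = importlib.import_module(name)
    # Built-in modules such as sys have no __file__ attribute.
    return getattr(module, "__file__", None)


# Shows whether the standard-library json or some other package is picked up.
print(module_origin("json"))
```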
Comment 5 Kunal Mehta (Legoktm) 2013-10-05 04:50:30 UTC
If I did it right, the output is 4 files:  
\_\_init\_\_.pyc 
decoder.pyc 
encoder.pyc 
scanner.pyc

Do you need them?
I'm sorry, I'm not a Python programmer.
Comment 6 Kunal Mehta (Legoktm) 2013-10-05 04:50:32 UTC
I meant the output you get when you type those lines into the Python interpreter - but I've been poking around some more. It does not seem to be JSON related, although I cannot rule that out yet. I think it has to do with some very old code called 'getall', which gets batches of pages.

Sigh. I would very much like to say: "bad luck, try the rewrite" - I'm almost afraid to touch that piece of code. I'll see if I can whip up a test you can run, though, to confirm my suspicions.

In the meantime, could you post the output of version.py?

Thanks.
Comment 7 Kunal Mehta (Legoktm) 2013-10-05 04:50:34 UTC
- **priority**: 5 --> 7
Comment 8 Kunal Mehta (Legoktm) 2013-10-05 04:50:36 UTC
Version.py:
Pywikipedia \[http\] trunk/pywikipedia \(r8948, 2011/02/13, 09:19:56\)
Python 2.6.4 \(r264:75708, Oct 26 2009, 08:23:19\) \[MSC v.1500 32 bit \(Intel\)\]
config-settings:
use\_api = True
use\_api\_login = True
unicode test: ok

I'll be glad to help if you write the test code.
Comment 9 Kunal Mehta (Legoktm) 2013-10-05 04:50:38 UTC
Ok. There are three issues playing a role here.

1\) the 'correct' page name ends in a 0x200B ZERO WIDTH SPACE. This makes no sense, other than to annoy people.
2\) the XML parser strips spaces around titles, including the 0x200B ZERO WIDTH SPACE.
3\) Mediawiki does \*not\* do this

So, first of all, I will rename the article, so it no longer has the 0x200B ZERO WIDTH SPACE in the title. I will see if I can pinpoint the XML bug, so someone else may fix it. However, since bots are fighting each other over it, I suspect this is a small change somewhere in the Python APIs - or a default setting that changed.

Lastly, maybe we should broaden the discussion into mediawiki-tech -- should page titles be allowed to have unicode whitespace characters embedded, especially if they are invisible?
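
As a rough illustration of how such invisible characters could be spotted in a title, here is a minimal Python 3 sketch. The function name `find_invisible_chars` and the sample title are hypothetical; the check relies only on the standard `unicodedata` module and flags characters in the Unicode "Cf" (format) category, which is where U+200B is classified.

```python
import unicodedata


def find_invisible_chars(title):
    """Return (index, code point, name) for each format (Cf) character in title."""
    hits = []
    for index, char in enumerate(title):
        if unicodedata.category(char) == "Cf":
            name = unicodedata.name(char, "UNKNOWN")
            hits.append((index, "U+%04X" % ord(char), name))
    return hits


# Hypothetical title with a trailing ZERO WIDTH SPACE.
print(find_invisible_chars("Example\u200b"))
# prints [(7, 'U+200B', 'ZERO WIDTH SPACE')]
```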
Comment 10 Kunal Mehta (Legoktm) 2013-10-05 04:50:39 UTC
Except I don't have move privileges on that wiki. Added a comment on the user talk page of the guy who created the page.
Comment 11 Kunal Mehta (Legoktm) 2013-10-05 04:50:41 UTC
Stripping is done in xmlreader.py:194. Calling strip\(\) seems to remove the U+200B character indeed.
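
One way to make title cleanup independent of the interpreter's idea of whitespace is to pass an explicit character set to `strip()`. This is only a sketch in Python 3, not the actual code in xmlreader.py; the names `ASCII_WHITESPACE` and `strip_title` are hypothetical.

```python
# Only these characters are removed, whatever Unicode tables the interpreter ships.
ASCII_WHITESPACE = " \t\r\n"


def strip_title(raw):
    """Strip ASCII whitespace from both ends, leaving other characters alone."""
    return raw.strip(ASCII_WHITESPACE)


# The trailing U+200B survives; only the surrounding spaces and newline go.
print(repr(strip_title("  Example\u200b \n")))
```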
Comment 12 Kunal Mehta (Legoktm) 2013-10-05 04:50:43 UTC
I agree that this symbol in titles is absolutely useless. Not only for bots, but for ordinary users too, since it can break their copy-paste operations.

If you can start a discussion on mediawiki-tech, please do it.

Thank you.
Comment 13 Kunal Mehta (Legoktm) 2013-10-05 04:50:45 UTC
JAn Dudik moved the page, so the problem should be fixed for now. Keeping this open \(it's a bug in pywikipedia, after all\).

Related:

Python 2.6.5 \(r265:79063, Apr 16 2010, 13:09:56\)
\[GCC 4.4.3\] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u200b'.strip\(\)
u''

Python 2.7.1 \(r271:86832, Jan  4 2011, 13:57:14\)
\[GCC 4.5.2\] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u200b'.strip\(\)
u'\u200b'

\u200b is technically not whitespace, so strip\(\) probably should not delete it.

Of course, pwb should not be stripping page titles in the first place.
Comment 14 Kunal Mehta (Legoktm) 2013-10-05 04:50:47 UTC
Aaand http://bugs.python.org/issue10567 is related to that.

In essence: bots running < 2.7 were technically doing the wrong thing, but this went unnoticed as no-one used the interwiki to the Tibetan Wikipedia, and all bots did the same wrong thing. Now there are bots running 2.7+, from the toolserver, and the bug surfaced.
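
A small probe along these lines can show how a given interpreter treats the character. This is a sketch for Python 3, where U+200B is in the "Cf" (format) category and `strip()` leaves it in place; the function name `strip_removes_zwsp` is hypothetical.

```python
import sys
import unicodedata

ZWSP = "\u200b"


def strip_removes_zwsp():
    """Return True if this interpreter's str.strip() deletes U+200B."""
    return ZWSP.strip() == ""


# Report the interpreter version, the character's category and the strip behaviour.
print(sys.version_info[:2], unicodedata.category(ZWSP), strip_removes_zwsp())
```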
Comment 15 Kunal Mehta (Legoktm) 2013-10-05 04:50:49 UTC
Wikimedia bugzilla bug entry: https://bugzilla.wikimedia.org/show\_bug.cgi?id=27446
Comment 16 Kunal Mehta (Legoktm) 2013-10-05 04:50:50 UTC
- **Group**:  --> confirmed
- **Priority**: 7 --> 2
Comment 17 Merlijn van Deen (test) 2013-10-25 17:12:25 UTC
*** Bug 55227 has been marked as a duplicate of this bug. ***
