Last modified: 2014-10-20 15:26:29 UTC
Originally from: http://sourceforge.net/p/pywikipediabot/bugs/1427/
Reported by: yfdyh000
Created on: 2012-03-30 18:49:19
Subject: Link identify errors
Original description:
The script mistakenly identifies interwiki links; see:
http://en.wikipedia.org/wiki/User_talk:YFdyh-bot
http://en.wikipedia.org/w/index.php?title=Template%3ANon-free_video_sample&diff=483600197&oldid=476118424
At the time, I ran the command:
2012-03-24 04:32:09 r10024 (wikipedia.py) Python 2.7.2
interwiki.py "-warnfile:logs\warning-wikipedia-en.log" "-lang:en" "-cleanup" "-autonomous" "-async"
The problem here is that someone used [[en:blah]] to link somewhere instead of [[:en:blah]]. This one could have been spotted, because [[en:blah|fdafdsa]] is never an interwiki link.
textlib.getLanguageLinks() has always caught pipe characters, and https://www.mediawiki.org/wiki/Special:Code/pywikipedia/43 made clear that the first part should have been ignored. Adding '\|' to the regex would solve the problem for interwiki.py, but I'm afraid this would break other scripts. So maybe it is safer to add an optional argument to avoid catching piped links? I am tempted to raise the Severity...
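The optional-argument idea could look roughly like the sketch below. This is not the actual getLanguageLinks() code: the function name get_language_links, the include_piped parameter, the regex and the LANG_CODES set are all assumptions made for illustration. The default keeps the current behaviour, so other scripts would be unaffected, and interwiki.py could pass include_piped=False.

```python
import re

# Hypothetical set of language codes; the real scripts derive these from the family.
LANG_CODES = {"en", "de", "fr"}

# Matches [[prefix:title]] and [[prefix:title|label]]. A leading colon
# ([[:en:blah]]) never matches because the prefix must follow "[[" directly.
LINK_RE = re.compile(
    r"\[\[(?P<prefix>[a-z-]+)\s*:\s*(?P<title>[^\[\]\n|]+)(?P<pipe>\|[^\[\]\n]*)?\]\]"
)


def get_language_links(text, include_piped=True):
    """Return (prefix, title) pairs for language-style links found in text.

    include_piped is the hypothetical optional argument: when False,
    links such as [[en:blah|label]] are skipped.
    """
    links = []
    for match in LINK_RE.finditer(text):
        if match.group("prefix") not in LANG_CODES:
            continue
        if match.group("pipe") is not None and not include_piped:
            continue
        links.append((match.group("prefix"), match.group("title").strip()))
    return links
```

With the text "[[de:Foo]] [[en:blah|fdafdsa]] [[:fr:Bar]]", the default call returns both the de and the en link, while include_piped=False returns only the de link.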
In addition to the '|' being an indicator of non-interwiki-ness, the interwiki map may also be used to help solve this specific case, as en: and w: are never interwiki links and they are marked as localinterwiki="" in the interwikimap. Fabian has updated Site.interwiki() and Link.parse(), so this bug in interwiki.py might be automatically solved, but it would be good to inspect/debug interwiki.py to confirm this.
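A minimal sketch of the interwiki-map check, assuming each map entry is a dict that carries a "localinterwiki" key when the prefix points back at the local wiki, as described above. The sample data and both helper names are made up for illustration; nothing here is the actual Site.interwiki() code.

```python
# Hand-written sample shaped like the interwiki map entries described above;
# the URLs and the exact entry layout are assumptions.
SAMPLE_INTERWIKIMAP = [
    {"prefix": "en", "url": "https://en.wikipedia.org/wiki/$1", "localinterwiki": ""},
    {"prefix": "w", "url": "https://en.wikipedia.org/wiki/$1", "localinterwiki": ""},
    {"prefix": "de", "url": "https://de.wikipedia.org/wiki/$1"},
]


def local_prefixes(interwikimap):
    """Collect the prefixes flagged as pointing back at the local wiki."""
    return {entry["prefix"] for entry in interwikimap if "localinterwiki" in entry}


def is_local_prefix(prefix, interwikimap):
    """True if a link with this prefix stays on the local wiki."""
    return prefix.lower() in local_prefixes(interwikimap)
```

On the sample map, en and w are reported as local, so [[en:blah]] on the English Wikipedia would not be treated as a sidebar link.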
Link.parse() doesn't know whether it's a piped link. A link like [[en:Blah|Blah]] is treated like [[en:Blah]] or [[:en:Blah]]: all link to Blah on en (whatever en is linked to in the interwiki map). In fact, Link doesn't store whether it's an interwiki link; it just returns the Site (which might be a different site). Site.interwiki() also doesn't take "localinterwiki" into account, because it doesn't make a difference: based on the URL, it should return the Site itself anyway. It sounds to me that, before we even get that far, [[X:Y|Z]] shouldn't be recognized as a link to be parsed anyway.
Okay, with a little help from John, I think I know what the problem is, but I'm not sure how best to tackle it. Basically, if the first interwiki prefix is not preceded by a colon, the link appears in the sidebar only if it's not linking to its own site. So [[de:en:Foobar]] appears in the sidebar on the English Wikipedia, but [[en:Foobar]] does not. Now I'm toying with the idea of adding an "is_special" method to Link which says whether it's just a link or whether it's "special". So [[:File:Foobar.png]] would not be special, but [[File:Foobar.png]] would be. The same goes for [[:Category:Foobar]] and [[Category:Foobar]], and then for [[:interwiki:Foobar]] and [[interwiki:Foobar]], with the one exception that it's not a special link if the first interwiki prefix points to its own site. I don't think we need to interpret "localinterwiki" for that; we could add that later, independently, as a shortcut in Site.interwiki(), as I'm not sure when this was added (and if this is determined automatically, it doesn't really matter for this bug).
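The rule described here can be sketched as a standalone function. This is only an illustration, not the proposed Link method: the signature, the own_code and interwiki_prefixes parameters and the fixed namespace tuple are assumptions, and piped links are not considered.

```python
def is_special(link_text, own_code, interwiki_prefixes, namespaces=("file", "category")):
    """Return True if the wikilink does something other than render a plain link.

    own_code is the prefix of the wiki the page lives on; interwiki_prefixes
    is an assumed set of known interwiki prefixes.
    """
    inner = link_text.strip()
    if inner.startswith("[[") and inner.endswith("]]"):
        inner = inner[2:-2]
    # A leading colon always forces an ordinary link.
    if inner.startswith(":"):
        return False
    prefix, sep, _rest = inner.partition(":")
    if not sep:
        return False  # no prefix at all, e.g. [[Foobar]]
    prefix = prefix.strip().lower()
    if prefix in namespaces:
        return True
    if prefix in interwiki_prefixes:
        # Only the first prefix matters; a link to the own site is not special.
        return prefix != own_code
    return False
```

On the English Wikipedia (own_code="en"), [[de:en:Foobar]] and [[File:Foobar.png]] come out as special, while [[en:Foobar]] and [[:Category:Foobar]] do not.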
(In reply to Fabian from comment #5)
> .. to add a "is_special"

There are a few different types of 'special'; it would be good to have different names for each of them.

is_langlink (i.e. sidebar interlanguage links; git grep langlink shows Link already has some functionality about these)
is_transclude (for [[File:Foobar.png]])
is_category
is_metadata = is_langlink or is_category
is_special = is_langlink or is_category or is_transclude

Another type that comes to mind is [[/example/]], which is not special in the same sense as the above, but it does cause many problems (e.g. during page moves).
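As a rough sketch, the names proposed above could be expressed as properties on a small value object. LinkKind and its fields are hypothetical stand-ins, not part of the existing Link class, and the sketch assumes the first prefix and the leading colon have already been extracted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkKind:
    """Hypothetical stand-in for the proposed Link predicates."""

    prefix: str            # first prefix, lowercased; "" when there is none
    leading_colon: bool    # True for [[:prefix:...]]
    own_code: str          # language code of the wiki the page lives on
    lang_codes: frozenset  # assumed set of known language prefixes

    @property
    def is_langlink(self):
        # Sidebar interlanguage link: no leading colon, not the own site.
        return (not self.leading_colon
                and self.prefix in self.lang_codes
                and self.prefix != self.own_code)

    @property
    def is_transclude(self):
        return not self.leading_colon and self.prefix == "file"

    @property
    def is_category(self):
        return not self.leading_colon and self.prefix == "category"

    @property
    def is_metadata(self):
        return self.is_langlink or self.is_category

    @property
    def is_special(self):
        return self.is_langlink or self.is_category or self.is_transclude
```

Keeping the predicates separate lets a caller such as interwiki.py ask only for is_langlink, while page-move code could check is_special.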