Last modified: 2009-02-10 15:56:50 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T17261, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 15261 - Trimmed multibyte characters result in invalid XML
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: API
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Roan Kattouw
URL: http://hu.wikipedia.org/w/api.php?for...
Keywords:
Duplicates: 16101
Depends on:
Blocks: 16106
Reported: 2008-08-22 00:56 UTC by Daniel Tar
Modified: 2009-02-10 15:56 UTC (History)

Description Daniel Tar 2008-08-22 00:56:43 UTC
I have just started writing a statistics program for the Hungarian Wikipedia. While downloading the deletion log from January 2008, my program encountered an exception: the XML loaded from the API was badly encoded. I wondered why, so I checked, and there really is an error:

http://hu.wikipedia.org/w/api.php?format=xml&action=query&list=logevents&letype=delete&lestart=2008-01-25T22:12:03Z&lelimit=30

In the 'item' element with logid 142820, the comment ends with an unknown character. It was probably a two-byte UTF-8 character that has been trimmed. The problem is not serious for me, since I don't need the comment attribute and can drop it by adding &leprop= to the URL, but anyone who does need it won't be able to load the file.

The bad line (also visible at the link above):
<item logid="142820" pageid="0" ns="0" title="Borisz Szpasszkij" type="delete" action="delete" user="Bináris" timestamp="2008-01-25T21:19:30Z" comment="[[Wikipédia:Homokozó|teszt]]: a lap tartalma: „Boris Vasilievich Spassky [szerkesztés] A Wikipédiából, a szabad lexikonból. Ugrás: <small>NAVIGÁCIÓ</small>, <small>KERESÉS</small>  Boris V Spassky () szovjet később francia...” (és csak �"/>
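
To illustrate how a comment can end in a stray byte like this, here is a minimal Python sketch. It assumes the comment was cut at a fixed byte length somewhere upstream; the report does not confirm where the trimming happens. The sample string and both function names are made up for the illustration and are not MediaWiki code.

```python
def truncate_naive(raw: bytes, max_bytes: int) -> bytes:
    """Cut at a fixed byte count, ignoring character boundaries."""
    return raw[:max_bytes]


def truncate_on_boundary(raw: bytes, max_bytes: int) -> bytes:
    """Cut at or before max_bytes without splitting a UTF-8 character."""
    if len(raw) <= max_bytes:
        return raw
    end = max_bytes
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx. If the first
    # byte being cut off is one of them, the cut is inside a character,
    # so step back until the cut lands on the start of a character.
    while end > 0 and (raw[end] & 0xC0) == 0x80:
        end -= 1
    return raw[:end]


# Hypothetical sample: the quotation mark U+201E is three bytes (E2 80 9E).
sample = "csak \u201ex".encode("utf-8")

broken = truncate_naive(sample, 6)       # ends in a lone lead byte 0xE2
clean = truncate_on_boundary(sample, 6)  # drops the whole character instead
```

With a limit of 6 bytes, the naive cut keeps only the first byte of the three-byte character, which is the kind of dangling byte that makes the output undecodable; the boundary-aware cut returns a shorter but valid string.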
Comment 1 Max Semenik 2008-08-22 05:37:01 UTC
Related to bug 332
Comment 2 Roan Kattouw 2008-08-22 09:07:41 UTC
I don't see the problem. I opened the link in Firefox (which automatically parses XML and screams if there's something wrong with it), and I got no errors. I also confirmed that logid 142820 is in there, which it is. That means it's probably your XML parser's fault; closing as WORKSFORME.
Comment 3 Max Semenik 2008-08-23 19:03:21 UTC
http://validator.w3.org/check?uri=http%3A%2F%2Fhu.wikipedia.org%2Fw%2Fapi.php%3Fformat%3Dxmlfm%26action%3Dquery%26list%3Dlogevents%26letype%3Ddelete%26lestart%3D2008-01-25T22%3A12%3A03Z%26lelimit%3D30&charset=%28detect+automatically%29&doctype=Inline&group=0&user-agent=W3C_Validator%2F1.591

"Sorry, I am unable to validate this document because on line 44 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication. 

The error was: utf8 "\xE2" does not map to Unicode"

Many XML parsers choke on broken UTF-8 sequences. Of course, this is mostly a database problem, but the fact remains that the API returns ill-formed data.
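
A client can check for this before handing the response to a parser. The following Python sketch uses a shortened, made-up stand-in for the API response (not the real payload) and two hypothetical helper names; it locates the first undecodable byte and shows that a strict parser rejects the document.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def first_invalid_utf8_offset(payload: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def is_well_formed(payload: bytes) -> bool:
    """Return True if the standard-library XML parser accepts the payload."""
    try:
        ET.fromstring(payload)
    except ET.ParseError:
        return False
    return True


# Hypothetical, shortened stand-ins for an API response.
bad = b'<item logid="142820" comment="teszt \xe2"/>'
good = '<item logid="142820" comment="teszt \ufffd"/>'.encode("utf-8")

bad_offset = first_invalid_utf8_offset(bad)
```

Lenient parsers may behave differently, which would be consistent with the page loading in one tool and failing in another.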
Comment 4 Platonides 2008-10-08 21:41:10 UTC
Same problem found on several image comments, such as http://commons.wikimedia.org/w/api.php?format=xml&action=query&prop=imageinfo&iiprop=comment&titles=Image:Algeria-map.png (trailing 0xE2)
Comment 5 Roan Kattouw 2008-10-25 13:04:53 UTC
*** Bug 16101 has been marked as a duplicate of this bug. ***
Comment 6 Roan Kattouw 2009-01-14 21:24:36 UTC
Should be fixed in r45749: invalid UTF-8 sequences are replaced with the Unicode replacement character (U+FFFD).
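
The same idea can be sketched in Python. This is only an illustration of replacing invalid sequences with U+FFFD, not the code from r45749; the function name and the sample bytes are assumptions.

```python
def replace_invalid_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for anything invalid."""
    return raw.decode("utf-8", errors="replace")


# Hypothetical comment that ends in a lone lead byte (0xE2).
trimmed = b"csak \xe2"
repaired = replace_invalid_utf8(trimmed)

# Valid input passes through unchanged.
untouched = replace_invalid_utf8("Wikip\u00e9dia".encode("utf-8"))
```

The repaired string re-encodes to valid UTF-8, so it can be placed in an XML attribute without breaking well-formedness.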
Comment 8 Roan Kattouw 2009-02-10 15:56:50 UTC
(In reply to comment #7)
> Not fixed:
> http://en.wikipedia.org/w/api.php?action=query&format=xml&iiprop=comment&prop=imageinfo&titles=Image:Shakerredraider.jpg
> still outputs invalid UTF-8.
> 

Argh, array_walk_recursive() doesn't work the way I expected it to. Fixed in r47090.
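
The comment does not say what exactly went wrong with array_walk_recursive(). Independent of that, the general task is to clean every string in a nested result structure, including the ones several levels down. A rough Python sketch of that traversal, assuming a result made of nested dicts and lists with byte-string leaves (the function name and the sample data are hypothetical, and this is not the r47090 change):

```python
from typing import Any


def clean_value(value: Any) -> Any:
    """Return a copy of value with every byte string decoded as UTF-8,
    replacing invalid sequences with U+FFFD."""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    if isinstance(value, dict):
        # Keys are cleaned as well as values.
        return {clean_value(k): clean_value(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [clean_value(v) for v in value]
    # Numbers, None, and already-decoded strings are left as they are.
    return value


# Hypothetical nested result with one damaged comment.
result = {
    "query": {
        "logevents": [
            {"logid": 142820, "comment": b"csak \xe2"},
        ]
    }
}
cleaned = clean_value(result)
```

Returning a new structure instead of modifying the input in place avoids relying on how a walker passes values to its callback.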
