Last modified: 2013-06-18 14:43:33 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T2332, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 332 - Broken UTF-8 cutoff breaks display in some browsers


Summary:	Broken UTF-8 cutoff breaks display in some browsers

Status:	RESOLVED WORKSFORME

Product:	MediaWiki
Classification:	Unclassified
Component:	Parser
Version:	unspecified
Hardware:	All Windows XP

Importance:	Low minor with 4 votes
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:	testme

Duplicates:	5401 11087 19712
Depends on:
Blocks:	unicode 640

Reported:	2004-09-03 03:23 UTC by Timwi
Modified:	2013-06-18 14:43 UTC
CC List:	13 users

See Also:
Web browser:	Internet Explorer
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Partial screenshot of problem in IE 6/WinXP (7.68 KB, image/png) 2004-10-11 11:33 UTC, Mulukhiyya	Details
A naive code to solve similar problems (1.17 KB, text/plain) 2005-07-21 15:07 UTC, Mulukhiyya	Details
A naive code to solve similar problems (revised) (1.18 KB, text/plain) 2005-07-22 07:00 UTC, Mulukhiyya	Details
screen dump - special Recentchanges - deletion event text should truncate at UTF-8 character boundaries · 01.jpg (192.38 KB, image/jpeg) 2008-01-30 18:51 UTC, lɛʁi לערי ריינהארט	Details

Description Timwi 2004-09-03 03:23:03 UTC

BUG MIGRATED FROM SOURCEFORGE
http://sourceforge.net/tracker/index.php?func=detail&aid=855680&group_id=34373&atid=411192
Originally submitted by Nobody/Anonymous - nobody  2003-12-07 10:32


When someone writes a long summary comment, it 
messes up RecentChanges, History, and other texts. 

I think this is unique to languages using 2-byte 
characters - when a character is cut off in the middle, it 
turns into some weird character, and affects other parts 
of the page. 

As an example, please see the following history page in 
which the text (including the sidebar) is inappropriately 
italicized. 

http://ja.wikipedia.org/w/wiki.phtml?title=Wikipedia:%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2%E3%81%AE%E4%BB%B2%E9%96%93&action=history

When this happens at RecentChanges, it is quite difficult 
to read through it. 

As a fix, it would be nice to automatically detect a 
summary comment that is too long and ask the user to shorten it. 

Or there may be a way to properly cut 2-byte-character texts. 
That would be good, too. 

Or maybe some other solution is available. 

Thanks for the help,

Tomos ( wiki_tomos at hotmail dot com )

------------------------- Additional comments ------------------------
Date: 2003-12-07 11:31
Sender: SF user vibber

Confirmed; this seems to be a problem with how Internet
Explorer handles broken UTF-8 code; in at least some
circumstances it will eat the non-UTF-8-trail byte(s) that
follow the broken sequence. (I presume it's reading ahead
the entire number of bytes that the head byte specifies and
eating the false tail bytes instead of resynchronizing at
the break point. That's a real shame, since this ability is
one of the neatest things about UTF-8 compared with
traditional double-byte character sets.)

In the attached screenshot (from IE 6.0 on WinXP) this shows
it destroying the following ")" and even the "<" that starts
the closing </em> tag, so the rest of the page is left in
italics when the markup is incorrectly interpreted.

Most other browsers I have tested (Mozilla, Camino, Safari)
replace only the broken sequence with a placeholder 'broken'
glyph, and correctly restart the UTF-8 interpretation at the
next byte, which as ASCII is itself a valid UTF-8 character
sequence. Konqueror 3.1.2 seems to break the following ")"
but not the "<", so the tags at least are intact.
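
A minimal sketch of the two behaviors described above, written in Python for brevity rather than in MediaWiki's PHP. The function names are hypothetical, and `decode_greedy` is only a toy model of the behavior presumed for IE in this comment, not a reconstruction of the browser's actual decoder. `decode_resync` relies on Python's built-in decoder, which replaces only the broken sequence and resumes at the next byte.

```python
def expected_length(lead: int) -> int:
    """Sequence length announced by a UTF-8 lead byte (1 for ASCII or a stray byte)."""
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    if lead >= 0xC0:
        return 2
    return 1


def decode_greedy(data: bytes) -> str:
    """Toy model of the presumed IE behavior: trust the lead byte and swallow
    the announced number of bytes, even when they are not trail bytes."""
    out, i = [], 0
    while i < len(data):
        n = expected_length(data[i])
        chunk = data[i:i + n]
        try:
            out.append(chunk.decode("utf-8"))
        except UnicodeDecodeError:
            # The whole chunk is lost, including any false tail bytes.
            out.append("\ufffd")
        i += n
    return "".join(out)


def decode_resync(data: bytes) -> str:
    """Resynchronizing behavior: only the broken sequence is replaced."""
    return data.decode("utf-8", errors="replace")


# A 3-byte character cut after its lead byte, followed by ")" and a closing tag.
broken = b"\xe3)</em>"
print(decode_greedy(broken))  # the ")" and the "<" are swallowed with 0xE3
print(decode_resync(broken))  # the ")" and the closing tag survive
```

In the greedy model the closing tag loses its "<", which matches the runaway italics described for the screenshot.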

Text gets cut off at maximum lengths in a number of places;
titles as well as comments have a max size in the database,
which knows nothing of UTF-8 and treats our data as raw byte
strings. We should add a function to our code to perform a
UTF-8-safe max-byte-length string trimmer to keep the bad
ones out on general principle; since we can't fix IE from
choking on them we should also go through and eliminate any
remaining in the database.
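
A minimal sketch of such a UTF-8-safe max-byte-length trimmer, in Python rather than MediaWiki's PHP. The name `truncate_utf8` is hypothetical and this is not MediaWiki's own function; it assumes the input is valid UTF-8 to begin with.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Trim valid UTF-8 `data` to at most `max_bytes` bytes without splitting a character."""
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # Continuation bytes have the form 0b10xxxxxx. If the first dropped byte is
    # one, the cut falls inside a character, so back up to its lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


summary = "bał".encode("utf-8")       # b"ba\xc5\x82", 4 bytes
print(truncate_utf8(summary, 3))      # b'ba' rather than b'ba\xc5'
```

Because the cut only ever moves backwards, the result can be up to three bytes shorter than the limit but never longer.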

Impact: mostly a cosmetic annoyance, but because of the
ability to damage markup in some popular browsers it could
harm usability. It's unlikely that cross-site scripting
attacks are possible through this, but it's bad juju anyway.
Database should be cleaned of any broken strings there are
now, and code should be fixed to avoid putting them in in
the future.

Only affects UTF-8 wikis, but that's a large and growing
portion of the user base (and we want to switch everything
to UTF-8 at some point). Asian languages are particularly
affected because UTF-8 balloons to 3 bytes per character in
most Asian scripts, so the byte limits are reached with a
smaller number of characters.
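
The arithmetic can be checked with a short Python sketch. The 255-byte limit is an assumption chosen for illustration, and `utf8_len` and `fits` are hypothetical helper names.

```python
LIMIT_BYTES = 255  # assumed field size, for illustration only


def utf8_len(text: str) -> int:
    """Number of bytes `text` occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits(text: str, limit: int = LIMIT_BYTES) -> bool:
    """True if `text` can be stored in `limit` bytes without being cut."""
    return utf8_len(text) <= limit


print(utf8_len("wiki"))    # 4 bytes for 4 ASCII characters
print(utf8_len("ウィキ"))   # 9 bytes for 3 katakana characters
print(LIMIT_BYTES // 3)    # 85 three-byte characters fit in 255 bytes
```

A check like `fits` could also back the reporter's suggestion of asking the user to shorten an overlong summary before it is saved.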

Comment 1 Mulukhiyya 2004-10-11 11:33:42 UTC

Created attachment 90 [details]
Partial screenshot of problem in IE 6/WinXP

originally taken by Brion; copied from SF.net.

Comment 2 Mulukhiyya 2004-10-11 14:25:19 UTC

Another kind of effect of this bug also exists. I just remembered :)

Example
http://ja.wikipedia.org/w/wiki.phtml?title=%E4%BB%99%E5%8F%B0%E5%B8%82&action=edit&oldid=805477

For some reason the end of the wikitext in the edit box is broken, so there are no buttons etc., and 
it seems there is nothing to do but revert it. As I suspected, if the original edits had run into such 
trouble, they could not have been reverted without a sysop. Do, do, do, dōdeshō?

Comment 3 peter green 2004-12-28 01:31:14 UTC

im not convinced that the utf-8 tag is being used correctly here

utf8   	This keyword tags bugs that would automatically be fixed if all wikis
without exception would use UTF-8.

it seems that this bug is one that ONLY breaks utf-8 wikis the opposite of what
this tag is supposed to mean

another possible fix would be to parse for broken utf-8 at output time (Which
may be easier than trying to find all places where strings are chopped)

Comment 4 Mulukhiyya 2005-07-21 15:04:26 UTC

(In reply to comment #3)
> im not convinced that the utf-8 tag is being used correctly here
> 
> utf8   	This keyword tags bugs that would automatically be fixed if all wikis
> without exception would use UTF-8.
> 
> it seems that this bug is one that ONLY breaks utf-8 wikis the opposite of what
> this tag is supposed to mean

You are right; therefore I was TOO wrong and incomparably SLOW! I am sorry for
my poor comprehension, and thank you for correcting. Now I understand. Or, at
least, I hope so.

By the way, I have just tried to write a naive code for interest. But I cannot
guess how useful this is.

Comment 5 Mulukhiyya 2005-07-21 15:07:03 UTC

Created attachment 734 [details]
A naive code to solve similar problems

Comment 6 Mulukhiyya 2005-07-22 07:00:27 UTC

Created attachment 737 [details]
A naive code to solve similar problems (revised)

Comment 7 Rob Church 2006-04-06 22:21:41 UTC

*** Bug 5401 has been marked as a duplicate of this bug. ***

Comment 8 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-08-17 22:09:23 UTC

Is this still an issue?

Comment 9 Mulukhiyya 2007-09-27 08:00:54 UTC

Just a few months ago, an automated tool, on Wikimedia Toolserver, seemed to stumble at this bug (malformed XML whatever?). But sorry my memory about that case is a bit obscure...

Comment 10 lɛʁi לערי ריינהארט 2008-01-30 18:51:59 UTC

Created attachment 4597 [details]
screen dump - special Recentchanges - deletion event text should truncate at UTF-8 character boundaries · 01.jpg

(In reply to comment #8)
> Is this still an issue?

I just wanted to create a new report with the summary

[[special:Recentchanges]] - deletion event text should truncate at UTF-8 character boundaries

� the Unicode Character REPLACEMENT CHARACTER U+FFFD
http://www.fileformat.info/info/unicode/char/fffd/index.htm
HTML Entity (decimal)  	&#65533;
HTML Entity (hex) 	&#xfffd;
UTF-8 (hex) 	        0xEF 0xBF 0xBD (efbfbd)
shows up in [[yi:special:Recentchanges]].

It does not show up in [[yi:special:Logs/delete]].

How does this relate to
bug 12359 Deletion summary lengths problems ?

Best regards Reinhardt [[user:Gangleri]]

references:

[[yi:special:Version]] shows
    *  MediaWiki: 1.12alpha (r30286)
    * PHP: 5.1.4 (apache)
    * MySQL: 4.0.29-nightly-20070112-wikimedia-log

Comment 11 Marcin Cieślak 2008-08-17 00:21:07 UTC

Some recent examples:

Description cut-off at the first byte: http://commons.wikimedia.org/wiki/Image:Banka_mydlana.jpg?uselang=pl

hexdump:

00003970  44 20 2d 2d 3c 61 20 68  72 65 66 3d 22 2f 77 2f  |D --<a href="/w/|
00003980  69 6e 64 65 78 2e 70 68  70 3f 74 69 74 6c 65 3d  |index.php?title=|
00003990  57 69 6b 69 70 65 64 79  73 74 61 3a 4d 72 74 6e  |Wikipedysta:Mrtn|
000039a0  26 61 6d 70 3b 61 63 74  69 6f 6e 3d 65 64 69 74  |&amp;action=edit|
000039b0  26 61 6d 70 3b 72 65 64  6c 69 6e 6b 3d 31 22 20  |&amp;redlink=1" |
000039c0  63 6c 61 73 73 3d 22 6e  65 77 22 20 74 69 74 6c  |class="new" titl|
000039d0  65 3d 22 57 69 6b 69 70  65 64 79 73 74 61 3a 4d  |e="Wikipedysta:M|
000039e0  72 74 6e 20 28 6a 65 73  7a 63 7a 65 20 6e 69 65  |rtn (jeszcze nie|
000039f0  20 75 74 77 6f 72 7a 6f  6e 61 29 22 3e 4d 61 72  | utworzona)">Mar|
00003a00  63 69 6e 20 44 65 72 c4  99 67 6f 77 73 6b 69 3c  |cin Der..gowski<|
00003a10  2f 61 3e 20 32 31 3a 30  35 2c 20 32 36 20 73 69  |/a> 21:05, 26 si|
00003a20  65 20 32 30 30 34 20 28  43 45 53 54 29 20 20 5a  |e 2004 (CEST)  Z|
00003a30  64 6a c4 99 63 69 65 20  70 72 7a 65 64 73 74 61  |dj..cie przedsta|
00003a40  77 69 61 20 62 61 c5 29  3c 2f 73 70 61 6e 3e 3c  |wia ba.)</span><|

At byte 0x3a46 one can see 0xC5 byte standing alone.

Another one:

http://pl.wikipedia.org/w/index.php?useskin=monobook&title=Grafika%3ABialyMarszWloclawek.jpg&redirect=no

000026a0  9b 63 69 c5 82 20 77 20  31 39 39 31 20 72 6f 6b  |.ci.. w 1991 rok|
000026b0  75 20 4a 61 6e 20 50 61  77 65 c5 82 20 49 49 2c  |u Jan Pawe.. II,|
000026c0  20 64 6f 20 6f 64 64 61  6c 6f 6e 65 67 6f 20 31  | do oddalonego 1|
000026d0  32 6b 6d 20 70 6f 64 20  77 c5 82 6f 63 c5 82 61  |2km pod w..oc..a|
000026e0  77 73 6b 69 65 67 6f 20  6c 6f 74 6e 69 73 6b 61  |wskiego lotniska|
000026f0  20 4b 72 75 73 7a 79 6e  2c 20 67 64 7a 69 65 20  | Kruszyn, gdzie |
00002700  6f 64 62 79 c5 29 3c 2f  73 70 61 6e 3e 3c 2f 74  |odby.)</span></t|

And here this is byte 0x2704 - also cutoff character 0xc5
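
The two offsets can be reproduced from the last row of each dump with a short Python sketch. The helper name `first_invalid_offset` is hypothetical, and the sketch assumes the row itself does not begin in the middle of a character, which holds for both rows here.

```python
from typing import Optional


def first_invalid_offset(row_hex: str, base: int) -> Optional[int]:
    """Return the absolute offset of the first byte that breaks UTF-8
    decoding in a hexdump row, or None if the row decodes cleanly."""
    data = bytes.fromhex(row_hex)
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return base + err.start
    return None


row_commons = "77 69 61 20 62 61 c5 29 3c 2f 73 70 61 6e 3e 3c"
row_plwiki = "6f 64 62 79 c5 29 3c 2f 73 70 61 6e 3e 3c 2f 74"

print(hex(first_invalid_offset(row_commons, 0x3A40)))  # 0x3a46
print(hex(first_invalid_offset(row_plwiki, 0x2700)))   # 0x2704
```

In both rows the lone 0xC5 is a two-byte lead byte followed directly by 0x29, the ")" that closes the description.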

Comment 12 Marcin Cieślak 2008-08-17 00:39:24 UTC

Just found out that in the recordUpload2() function:

http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/filerepo/LocalFile.php?revision=38312&content-type=text%2Fplain

$comment parameter is inserted as-is into the 

image.img_description and oldimage.oi_description - both fields are TINYBLOBs, and they get cut off at the 255th byte. 

An additional check should be introduced there plus existing database entries should be cleaned up.
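
A minimal sketch of how existing entries could be checked for a cut-off tail during such a cleanup, in Python rather than PHP. The names `dangling_tail` and `clean_stored_value` are hypothetical, and the sketch only handles a sequence cut at the very end of the stored value, which is the case shown in the hexdumps above.

```python
import codecs


def dangling_tail(stored: bytes) -> bytes:
    """Return the incomplete UTF-8 sequence at the end of `stored`, or b"" if none."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    # With final=False the decoder buffers an unfinished trailing sequence
    # instead of treating it as an error.
    decoder.decode(stored, final=False)
    pending, _flags = decoder.getstate()
    return pending


def clean_stored_value(stored: bytes) -> bytes:
    """Drop a cut-off trailing sequence so the value ends on a character boundary."""
    tail = dangling_tail(stored)
    return stored[:len(stored) - len(tail)] if tail else stored


print(dangling_tail(b"odby\xc5"))       # b'\xc5'
print(clean_stored_value(b"odby\xc5"))  # b'odby'
```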

Comment 13 Dan Jacobson 2008-08-19 16:50:53 UTC

*** Bug 11087 has been marked as a duplicate of this bug. ***

Comment 14 Mormegil 2008-08-29 19:36:52 UTC

Removing testme – still present in the current trunk (see e.g. http://cs.wikipedia.org/w/index.php?diff=2989674), raising severity at least to minor (we are generating invalid UTF-8!), adding a tracking bug dependence.

Comment 15 Fran Rogers 2008-09-15 00:59:11 UTC

Fixed in r40837. All output to the browser will now be scanned for invalid forms per the rules in RFC 3629; invalid forms will be replaced with �. :)
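
For illustration only, an output scan of this kind can be sketched in Python as below. This is not the code from r40837, and the name `scrub_utf8` is hypothetical. It relies on Python's built-in decoder, which rejects truncated sequences, overlong forms and encoded surrogates.

```python
def scrub_utf8(output: bytes) -> bytes:
    """Replace every invalid UTF-8 form in `output` with U+FFFD (0xEF 0xBF 0xBD)."""
    return output.decode("utf-8", errors="replace").encode("utf-8")


page = b"przedstawia ba\xc5)</span>"
print(scrub_utf8(page))  # b'przedstawia ba\xef\xbf\xbd)</span>'
```

A pass like this decodes and re-encodes every byte of every response, and it leaves the truncated data in the database untouched.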

Comment 16 Mormegil 2008-09-15 09:42:05 UTC

Hold on for a second… OK, this is a solution to the “breaks display” part of this bug, and it is a nice improvement of the general behavior of MW. But still, shouldn’t we, in the first place, do the string cutoffs properly?

There is absolutely no reason why the history should display any � characters. Or should I open a new bug for that?

Comment 17 Brion Vibber 2008-09-15 17:44:07 UTC

a) the data should be stored correctly in the first place

b) this post-op scan on all output looks like it will perform abominably.

This needs to be reverted.

Comment 18 Brion Vibber 2008-09-15 17:52:04 UTC

And of course c) it duplicates existing code for UTF-8 fixups. :)

Reverted r40837, r40839, r40840 in r40861.

Comment 19 Siebrand Mazeland 2009-06-04 11:35:31 UTC

Adding testme. Please test with Internet Explorer 8 and note the result here.

Comment 20 Bryan Tong Minh 2009-07-14 19:38:12 UTC

*** Bug 19712 has been marked as a duplicate of this bug. ***

Comment 21 Ilmari Karonen 2010-04-06 19:39:36 UTC

It seems we're still generating upload summaries with badly truncated UTF-8.  See for example http://commons.wikimedia.org/wiki/File:NuclearMedicineImageOfAHandAfterShadowFilter-2.png

Comment 22 Fomafix 2011-08-28 12:05:42 UTC

Bug 28649 (fixed in r95456) is related to this bug. In r62387 also some truncate bugs are fixed. Are there still any truncate bugs?

Comment 23 Bawolff (Brian Wolff) 2011-11-16 19:52:46 UTC

(In reply to comment #21)
> It seems we're still generating upload summaries with badly truncated UTF-8. 
> See for example
> http://commons.wikimedia.org/wiki/File:NuclearMedicineImageOfAHandAfterShadowFilter-2.png

This was fixed in r103362 (well for new uploads anyways, uploads before this revision would still be affected).

I'm not aware of any more examples of this bug.

Comment 24 matanya 2012-07-25 21:34:21 UTC

Tested in IE, seems to have no issues any more. closing
