Last modified: 2013-06-18 14:43:33 UTC
BUG MIGRATED FROM SOURCEFORGE
http://sourceforge.net/tracker/index.php?func=detail&aid=855680&group_id=34373&atid=411192

Originally submitted by Nobody/Anonymous - nobody 2003-12-07 10:32

When someone writes a long summary comment, it messes up RecentChanges, History, and other pages. I think this is unique to languages using 2-byte characters: when a character is cut off in the middle, it turns into some weird character and affects other parts of the page.

As an example, please see the following history page, in which the text (including the sidebar) is inappropriately italicized.
http://ja.wikipedia.org/w/wiki.phtml?title=Wikipedia:%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2%E3%81%AE%E4%BB%B2%E9%96%93&action=history

When this happens on RecentChanges, the page is quite difficult to read.

As a fix, it would be nice to automatically detect a summary comment that is too long and ask the user to shorten it. Or there may be a way to properly cut 2-byte-character text. That would be good, too. Or maybe some other solution is available.

Thanks for the help,
Tomos ( wiki_tomos at hotmail dot com )

------------------------- Additional comments ------------------------
Date: 2003-12-07 11:31
Sender: SF user vibber

Confirmed; this seems to be a problem with how Internet Explorer handles broken UTF-8 code; in at least some circumstances it will eat the non-UTF-8-trail byte(s) that follow the broken sequence. (I presume it's reading ahead the entire number of bytes that the head byte specifies and eating the false tail bytes instead of resynchronizing at the break point. That's a real shame, since this ability is one of the neatest things about UTF-8 compared with traditional double-byte character sets.)

In the attached screenshot (from IE 6.0 on WinXP) this shows it destroying the following ")" and even the "<" that starts the closing </em> tag, so the rest of the page is left in italics when the markup is incorrectly interpreted.
Most other browsers I have tested (Mozilla, Camino, Safari) replace only the broken sequence with a placeholder 'broken' glyph, and correctly restart the UTF-8 interpretation at the next byte, which as ASCII is itself a valid UTF-8 character sequence. Konqueror 3.1.2 seems to break the following ")" but not the "<", so the tags at least are intact.

Text gets cut off at maximum lengths in a number of places; titles as well as comments have a max size in the database, which knows nothing of UTF-8 and treats our data as raw byte strings. We should add a UTF-8-safe max-byte-length string trimmer to our code to keep the bad ones out on general principle; since we can't stop IE from choking on them, we should also go through and eliminate any that remain in the database.

Impact: mostly a cosmetic annoyance, but because of the ability to damage markup in some popular browsers it could harm usability. It's unlikely that cross-site scripting attacks are possible through this, but it's bad juju anyway. The database should be cleaned of any broken strings that are there now, and the code should be fixed to avoid putting them in in the future. Only affects UTF-8 wikis, but that's a large and growing portion of the user base (and we want to switch everything to UTF-8 at some point). Asian languages are particularly affected because UTF-8 balloons to 3 bytes per character in most Asian scripts, so the byte limits are reached with a smaller number of characters.
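The UTF-8-safe max-byte-length trimmer proposed above could look roughly like the following. This is only a minimal sketch in Python, not MediaWiki's PHP code; the function name `truncate_utf8` is hypothetical. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so a cut point can be moved back until it no longer falls inside a character.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes.

    Hypothetical helper for illustration; not taken from MediaWiki.
    """
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    cut = max_bytes
    # If the first dropped byte is a continuation byte (10xxxxxx), the cut
    # would split a character, so back up to that character's lead byte.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut].decode("utf-8")


# A 255-byte limit, as for a TINYBLOB column, applied to 3-byte characters.
summary = "あ" * 100                      # 300 bytes in UTF-8
trimmed = truncate_utf8(summary, 255)
print(len(trimmed), len(trimmed.encode("utf-8")))
```

Trimming by bytes rather than by characters matters here because the database limit is a byte limit; the character count that fits depends on the script.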
Created attachment 90 [details] Partial screenshot of problem in IE 6/WinXP originally taken by Brion; copied from SF.net.
This bug has another kind of effect as well. I remembered :)

Example: http://ja.wikipedia.org/w/wiki.phtml?title=%E4%BB%99%E5%8F%B0%E5%B8%82&action=edit&oldid=805477

For some reason the end of the wikitext in the edit box is broken, so there are no buttons etc., and it seems there is nothing to do but revert it. As I suspected, if the original edit itself ran into this trouble, it could not be reverted without a sysop. Do, do, do, dōdeshō?
I'm not convinced that the utf-8 tag is being used correctly here.

utf8: This keyword tags bugs that would automatically be fixed if all wikis without exception would use UTF-8.

It seems that this bug is one that ONLY breaks UTF-8 wikis, the opposite of what this tag is supposed to mean.

Another possible fix would be to parse for broken UTF-8 at output time (which may be easier than trying to find all the places where strings are chopped).
(In reply to comment #3)
> im not convinced that the utf-8 tag is being used correctly here
>
> utf8 This keyword tags bugs that would automatically be fixed if all wikis
> without exception would use UTF-8.
>
> it seems that this bug is one that ONLY breaks utf-8 wikis the opposite of what
> this tag is supposed to mean

You are right; therefore I was TOO wrong and incomparably SLOW! I am sorry for my poor comprehension, and thank you for the correction. Now I understand. Or, at least, I hope so.

By the way, I have just tried to write some naive code out of interest, but I cannot guess how useful it is.
Created attachment 734 [details] A naive code to solve similar problems
Created attachment 737 [details] A naive code to solve similar problems (revised)
*** Bug 5401 has been marked as a duplicate of this bug. ***
Is this still an issue?
Just a few months ago, an automated tool on the Wikimedia Toolserver seemed to stumble over this bug (malformed XML or something similar?). Sorry, my memory of that case is a bit vague...
Created attachment 4597 [details]
screen dump - special Recentchanges - deletion event text should truncate at UTF-8 character boundaries · 01.jpg

(In reply to comment #8)
> Is this still an issue?

I just wanted to create a new report with the summary
[[special:Recentchanges]] - deletion event text should truncate at UTF-8 character boundaries

� the Unicode Character REPLACEMENT CHARACTER U+FFFD
http://www.fileformat.info/info/unicode/char/fffd/index.htm
HTML Entity (decimal) &#65533;
HTML Entity (hex) &#xfffd;
UTF-8 (hex) 0xEF 0xBF 0xBD (efbfbd)
shows up in [[yi:special:Recentchanges]]. It does not show up in [[yi:special:Logs/delete]].

How does this relate to bug 12359 "Deletion summary lengths problems"?

Best regards Reinhardt [[user:Gangleri]]

references: [[yi:special:Version]] shows
* MediaWiki: 1.12alpha (r30286)
* PHP: 5.1.4 (apache)
* MySQL: 4.0.29-nightly-20070112-wikimedia-log
Some recent examples:

Description cut off at the first byte: http://commons.wikimedia.org/wiki/Image:Banka_mydlana.jpg?uselang=pl

hexdump:
00003970 44 20 2d 2d 3c 61 20 68 72 65 66 3d 22 2f 77 2f |D --<a href="/w/|
00003980 69 6e 64 65 78 2e 70 68 70 3f 74 69 74 6c 65 3d |index.php?title=|
00003990 57 69 6b 69 70 65 64 79 73 74 61 3a 4d 72 74 6e |Wikipedysta:Mrtn|
000039a0 26 61 6d 70 3b 61 63 74 69 6f 6e 3d 65 64 69 74 |&amp;action=edit|
000039b0 26 61 6d 70 3b 72 65 64 6c 69 6e 6b 3d 31 22 20 |&amp;redlink=1" |
000039c0 63 6c 61 73 73 3d 22 6e 65 77 22 20 74 69 74 6c |class="new" titl|
000039d0 65 3d 22 57 69 6b 69 70 65 64 79 73 74 61 3a 4d |e="Wikipedysta:M|
000039e0 72 74 6e 20 28 6a 65 73 7a 63 7a 65 20 6e 69 65 |rtn (jeszcze nie|
000039f0 20 75 74 77 6f 72 7a 6f 6e 61 29 22 3e 4d 61 72 | utworzona)">Mar|
00003a00 63 69 6e 20 44 65 72 c4 99 67 6f 77 73 6b 69 3c |cin Der..gowski<|
00003a10 2f 61 3e 20 32 31 3a 30 35 2c 20 32 36 20 73 69 |/a> 21:05, 26 si|
00003a20 65 20 32 30 30 34 20 28 43 45 53 54 29 20 20 5a |e 2004 (CEST)  Z|
00003a30 64 6a c4 99 63 69 65 20 70 72 7a 65 64 73 74 61 |dj..cie przedsta|
00003a40 77 69 61 20 62 61 c5 29 3c 2f 73 70 61 6e 3e 3c |wia ba.)</span><|

At byte 0x3a46 one can see the byte 0xC5 standing alone.

Another one: http://pl.wikipedia.org/w/index.php?useskin=monobook&title=Grafika%3ABialyMarszWloclawek.jpg&redirect=no

000026a0 9b 63 69 c5 82 20 77 20 31 39 39 31 20 72 6f 6b |.ci.. w 1991 rok|
000026b0 75 20 4a 61 6e 20 50 61 77 65 c5 82 20 49 49 2c |u Jan Pawe.. II,|
000026c0 20 64 6f 20 6f 64 64 61 6c 6f 6e 65 67 6f 20 31 | do oddalonego 1|
000026d0 32 6b 6d 20 70 6f 64 20 77 c5 82 6f 63 c5 82 61 |2km pod w..oc..a|
000026e0 77 73 6b 69 65 67 6f 20 6c 6f 74 6e 69 73 6b 61 |wskiego lotniska|
000026f0 20 4b 72 75 73 7a 79 6e 2c 20 67 64 7a 69 65 20 | Kruszyn, gdzie |
00002700 6f 64 62 79 c5 29 3c 2f 73 70 61 6e 3e 3c 2f 74 |odby.)</span></t|

Here it is byte 0x2704, again a cut-off character whose lead byte is 0xc5.
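Finding such a stray lead byte does not require reading the dump by eye. As a rough sketch (in Python, with the hypothetical helper name `first_invalid_utf8_offset`), a strict UTF-8 decode can be attempted and the offset of the first failure reported. The two byte rows below are copied from the last row of each hexdump above.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8, or None if valid.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Rows at 0x3a40 and 0x2700 from the hexdumps above.
row_commons = bytes.fromhex("77 69 61 20 62 61 c5 29 3c 2f 73 70 61 6e 3e 3c")
row_plwiki = bytes.fromhex("6f 64 62 79 c5 29 3c 2f 73 70 61 6e 3e 3c 2f 74")

print(hex(0x3A40 + first_invalid_utf8_offset(row_commons)))
print(hex(0x2700 + first_invalid_utf8_offset(row_plwiki)))
```

In both rows 0xC5 announces a two-byte sequence but is followed by 0x29 (")"), which is not a continuation byte, so the decoder fails at the lead byte.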
Just found out that in the recordUpload2() function:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/filerepo/LocalFile.php?revision=38312&content-type=text%2Fplain
the $comment parameter is inserted as-is into image.img_description and oldimage.oi_description. Both fields are TINYBLOBs, so the value gets cut off after 255 bytes, which can fall in the middle of a multi-byte character. An additional check should be introduced there, and existing database entries should be cleaned up.
*** Bug 11087 has been marked as a duplicate of this bug. ***
Removing testme – still present in the current trunk (see e.g. http://cs.wikipedia.org/w/index.php?diff=2989674), raising severity at least to minor (we are generating invalid UTF-8!), adding a tracking bug dependence.
Fixed in r40837. All output to the browser will now be scanned for invalid forms per the rules in RFC 3629; invalid forms will be replaced with �. :)
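The idea of scanning output for invalid forms can be illustrated with a short sketch. This is Python, not the PHP committed in r40837, and the function name `sanitize_utf8` is hypothetical. It leans on Python's strict UTF-8 decoder, which rejects truncated sequences, overlong forms and surrogates, and replaces each invalid piece with U+FFFD.

```python
def sanitize_utf8(raw: bytes) -> bytes:
    """Replace invalid UTF-8 sequences in raw with U+FFFD (EF BF BD).

    Hypothetical helper for illustration; not the code from r40837.
    """
    return raw.decode("utf-8", errors="replace").encode("utf-8")


# A summary cut off after the lead byte 0xC5, followed by intact markup.
broken = b"ba\xc5)</em>"
print(sanitize_utf8(broken))
```

Only the dangling lead byte is replaced; the ")" and the closing tag that follow it are left intact, which is the resynchronizing behaviour described earlier in this report for Mozilla, Camino and Safari.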
Hold on for a second… OK, this is a solution to the “breaks display” part of this bug, and it is a nice improvement to the general behavior of MW. But still, shouldn’t we do the string cutoffs properly in the first place? There is absolutely no reason why the history should display any � characters. Or should I open a new bug for that?
a) the data should be stored correctly in the first place b) this post-op scan on all output looks like it will perform abominably. This needs to be reverted.
And of course c) it duplicates existing code for UTF-8 fixups. :) Reverted r40837, r40839, r40840 in r40861.
Adding testme. Please test with Internet Explorer 8 and note the result here.
*** Bug 19712 has been marked as a duplicate of this bug. ***
It seems we're still generating upload summaries with badly truncated UTF-8. See for example http://commons.wikimedia.org/wiki/File:NuclearMedicineImageOfAHandAfterShadowFilter-2.png
Bug 28649 (fixed in r95456) is related to this bug. Some truncation bugs were also fixed in r62387. Are there any truncation bugs left?
(In reply to comment #21)
> It seems we're still generating upload summaries with badly truncated UTF-8.
> See for example
> http://commons.wikimedia.org/wiki/File:NuclearMedicineImageOfAHandAfterShadowFilter-2.png

This was fixed in r103362 (well, for new uploads anyway; uploads before this revision would still be affected). I'm not aware of any more examples of this bug.
Tested in IE, seems to have no issues any more. closing