MediaWiki has a setting for blacklisting browsers that can't deal with Unicode properly. Currently this setting only lists IE for the Mac. Furthermore, all this blacklist does is show a warning, which is liable to be missed or ignored. This leads to fairly frequent bad edits that mangle Unicode characters. To deal with this we need two new features: 1: store the user agent with every request so that problem browsers can be identified; 2: provide an alternative means of editing (possibly based on entities or UTF-7 or something) for those browsers which are incapable of handling Unicode. (UTF-7 would probably be easy but would be ugly as hell.)
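To give an idea of what the detection side could look like, here's a rough sketch (the setting and function names are made up for illustration, not existing MediaWiki code):

<?php
// Illustrative only: a list of user-agent patterns for browsers with
// known-broken Unicode editing, and a check against it.
$wgNonUnicodeBrowsers = array(
	'/MSIE [0-9.]+; Mac_PowerPC/',   // IE for the Mac
);

function isUnicodeCompliantBrowser( $userAgent, $blacklist ) {
	foreach ( $blacklist as $pattern ) {
		if ( preg_match( $pattern, $userAgent ) ) {
			// Known-broken browser: log the user agent with the edit
			// and switch to the alternative editing path.
			return false;
		}
	}
	return true;
}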
OK, I have a plan. The process described below is only for browsers that are known to be non-Unicode-aware; the idea was inspired by the x-codo system used on the eo Wikipedia.
1: use strtr to add an extra leading 0 to existing hexadecimal entities in the page text
2: replace characters outside the 7-bit ASCII range with hexadecimal entities with no leading zeros
3: send the resulting text to the old browser for editing
4: get the edited text back
5: replace hexadecimal entities with no leading zeros with the characters they represent
6: remove a leading 0 from every hexadecimal entity in the page
Rationale: this process will 1: not affect parts of the text the user doesn't edit, and 2: keep the text in the edit box valid wiki code that can be copied/pasted to/from other wikis without issues. I'm going to try to implement this (a rough sketch of the output side is below), but I'm new to PHP and PHP doesn't seem too friendly to this type of processing work.
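Here's a minimal sketch of the output side (steps 1 and 2), just to show the shape of it; armorUnicode() is an illustrative name, and mb_ord() (from PHP's mbstring extension) stands in for whatever codepoint helper ends up being used:

<?php
function armorUnicode( $text ) {
	// Step 1: add an extra leading 0 after every existing "&#x", so
	// pre-existing hex entities can't collide with the ones added next.
	$text = strtr( $text, array( '&#x' => '&#x0' ) );
	// Step 2: replace each character outside the 7-bit ASCII range with
	// a hexadecimal entity that has no leading zeros.
	return preg_replace_callback( '/[^\x00-\x7f]/u', function ( $m ) {
		return sprintf( '&#x%x;', mb_ord( $m[0], 'UTF-8' ) );
	}, $text );
}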
That's kind of sick, but might work. :) Suggestion: use preg_replace_callback(), it tends to simplify these things nicely. Take a look also at the existing UTF-8 support code in includes/normal (and the Sanitizer code for interpreting character references to UTF-8). This includes some simple helper functions for translating between numeric codepoints and UTF-8 characters.
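For the return trip (steps 5 and 6), preg_replace_callback() does keep it short. Again a rough sketch with illustrative names; mb_chr() here could be swapped for the codepoint-to-UTF-8 helper in includes/normal:

<?php
function dearmorUnicode( $text ) {
	// Step 5: hex entities with no leading zero were produced by the
	// armoring pass; convert them back to the characters they represent.
	$text = preg_replace_callback( '/&#x([1-9a-fA-F][0-9a-fA-F]*);/', function ( $m ) {
		return mb_chr( hexdec( $m[1] ), 'UTF-8' );
	}, $text );
	// Step 6: strip the extra leading 0 added in step 1, restoring the
	// entities that were in the page to begin with.
	return strtr( $text, array( '&#x0' => '&#x' ) );
}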
OK, status update: the conversion routines are written and tested, and the text is converted on output when the user agent is Mac IE (since I don't have access to a Mac myself, I'm using Firefox's user agent selector extension to test). I still have to figure out how to do the conversion on save; once that's working, a patch will follow.
Created attachment 698 [details]
Implementation of the workaround for non-Unicode browsers as described
patch
will test this out...
Created attachment 752 [details]
Updated version of plugwash's patch

Here's plugwash's patch with my changes as I'm committing it:
* Reformatted to match other code
* Removed some duplicate code to use preexisting UTF-8 functions
* Arranged functions to extract checks from the mainline code
* Added phpdoc comments
* Adjusted the warning message a bit
Applied to CVS HEAD and installed on Wikimedia. Seems to work correctly with IE 5.2/Mac as tested on my machine.
I think you made a slight mistake in the comments:

+ * Filter an output field through a Unicode de-armoring process if it
+ * came from an old browser with known broken Unicode editing issues.

Shouldn't that be:

+ * Filter an output field through a Unicode armoring process if it is
+ * going to an old browser with known broken Unicode editing issues
Ah, the dangers of cut and paste ;) Comment typo fixed.
Also, not urgent since the browser is pretty uncommon now, but the blacklist entry for Netscape 4.x should cover all platforms and all 4.x versions, not just 4.78 for Linux.
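Something along these lines might do as a broader entry (just a sketch, reusing the illustrative setting name from above); the negative lookahead is there so IE, which also claims "Mozilla/4.0 (compatible; ...)", doesn't get caught:

// Netscape 4.x on any platform and any 4.x version, but not IE.
$wgNonUnicodeBrowsers[] = '/^Mozilla\/4\.\d+(?!.*compatible)/';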
This is great, but Unicode-unaware *browsers* aren't the only problem. A lot of people want to work in Unicode-unaware text editors as well, and this makes it difficult for them. They'd have to fake out the server into thinking they had an old browser or something. I have a different proposal:
1. Convert all HTML entities (named or Unicode numbers or whatever) into plain Unicode characters in the wikisource.
2. Provide an option in the editing interface to view the source in either "plain Unicode" format (with actual characters) or "plain text" format (with entities) on a per-edit basis.
2.a. When editing in "plain text" mode, all the bad characters (non-ASCII?) will be converted into named HTML entities if possible (&mdash; and the like), or into numbered HTML entities if not possible (&#8212; and the like). (A rough sketch of this conversion follows below.)
2.b. The default editing format will be selectable in preferences.
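Here's a rough sketch of what the conversion in 2.a could look like (entitifyNonAscii() is a hypothetical name; htmlentities' translation table covers the characters that have named entities, and mb_encode_numericentity() numbers the rest):

<?php
function entitifyNonAscii( $text ) {
	// Named entities where one exists, but leave &, <, > and quotes
	// alone so the wikitext markup itself isn't escaped.
	$table = get_html_translation_table( HTML_ENTITIES, ENT_QUOTES, 'UTF-8' );
	foreach ( $table as $char => $entity ) {
		if ( ord( $char[0] ) < 0x80 ) {
			unset( $table[$char] );
		}
	}
	$text = strtr( $text, $table );
	// Numbered entities for everything else outside ASCII.
	return mb_encode_numericentity( $text, array( 0x80, 0x10FFFF, 0, 0x1FFFFF ), 'UTF-8' );
}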
(In reply to comment #12)
Hi Omegatron! Your request should be covered by bug 4012: "feature request: add a flexible magic character conversion to the built-in editor". You request a kind of "convert all UTF-8 characters" setup described there.
Best regards, reinhardt [[user:gangleri]]
I don't think the current scheme protects Unicode non-breaking space characters (U+00A0). I often enter these into en.wikipedia by typing alt-space in Safari, and they work fine and survive most edits; the wikitext is ''much'' cleaner than when it's littered with a bunch of &nbsp;. But once in a while, some other editor's browser will convert them all to plain spaces. Should U+00A0 be added to the list of characters protected from old browsers?
"I don't think the current scheme protects Unicode non-breaking space characters (U+00A0)." it does, all non-ascii characters are protected, the problem is that as of right now the bad browser list is very limited (and if the "bad browser" is a plugin or similar that doesn't affect the headers in any way or the user is copy/pasting into a seperate editor there isn't much we can do). "I often enter these into en.wikipedia by typing alt-space in Safari, and they work fine and survive most edits—wikitext is ''much'' cleaner than littered with a bunch of  ." cleaner maybe but unless you are using a specialist editor that highlights non breaking spaces virtually impossible to edit correctly.
Thanks for the reply. I don't see non-breaking spaces as a problem. I only enter them where it's good practice and recommended by the MOS, e.g., in unit expressions such as "100 mm". If they need to be found for some reason, wikitext can be pasted into practically any text editor or word processor for more sophisticated processing.