Last modified: 2014-05-14 17:47:56 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T4676, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 2676 - dealing with non unicode aware browsers.
Product: MediaWiki
Classification: Unclassified
Page editing (Other open bugs)
Macintosh Mac System 9.x
Importance: High normal (1 vote)
Assigned To: peter green
Keywords: patch, patch-reviewed
Depends on:
Blocks: unicode
Reported: 2005-07-02 23:26 UTC by peter green
Modified: 2014-05-14 17:47 UTC (History)
3 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---

implementation of workaround for non-unicode browsers as described (6.48 KB, patch)
2005-07-11 21:41 UTC, peter green
Updated version of plugwash's patch (7.38 KB, patch)
2005-07-29 10:01 UTC, Brion Vibber

Description peter green 2005-07-02 23:26:32 UTC
MediaWiki has a setting for blacklisting browsers that can't deal with Unicode
properly. Currently this setting only lists IE for the Mac. Furthermore, all this
blacklist does is give a warning, which is liable to be missed or ignored. This
issue leads to fairly frequent bad edits messing up Unicode characters.

to deal with this issue we need two new features:
1: store the user agent with every request so that problem browsers can be
identified
2: provide an alternative means of editing (possibly based on entities or UTF-7
or something) for those browsers which are incapable of handling unicode. (UTF-7
would probably be easy but would be ugly as hell).
Comment 1 peter green 2005-07-11 01:55:29 UTC
ok i have a plan 

the process described below is only for browsers that are known to be non
unicode aware.

this idea was inspired by the x-codo system used on the eo wikipedia.
1: use strtr to add an extra leading 0 to existing hexadecimal entities in the
page text
2: replace characters outside the 7 bit ascii range with hexadecimal entities
with no leading zeros
3: send the resulting text to the old browser for editing
4: get the edited text back
5: replace hexadecimal entities with no leading 0s with the characters they
represent
6: remove a leading 0 from every html entity in the page.

this process will
1: not affect parts of the text the user doesn't edit
2: make the text in the edit box still valid wiki code that can be copy/pasted
to/from other wikis without issues.
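The armor/de-armor round trip described in the steps above can be illustrated with a short Python sketch. (The actual change was a PHP patch using MediaWiki's own UTF-8 helpers; the function names and regular expressions below are illustrative, not the committed code.)

```python
import re

def armor(text):
    # Step 1: add an extra leading 0 to existing hex entities so they can
    # be told apart from the entities generated in step 2.
    text = re.sub(r'&#x([0-9A-Fa-f]+);', r'&#x0\1;', text)
    # Step 2: replace characters outside the 7-bit ASCII range with hex
    # entities carrying no leading zeros ('%X' never emits a leading 0).
    return re.sub(r'[^\x00-\x7f]',
                  lambda m: '&#x%X;' % ord(m.group(0)), text)

def dearmor(text):
    # Step 5: entities whose first hex digit is not 0 were generated by
    # armor(); turn them back into the characters they represent.
    text = re.sub(r'&#x([1-9A-Fa-f][0-9A-Fa-f]*);',
                  lambda m: chr(int(m.group(1), 16)), text)
    # Step 6: strip the extra leading zero from pre-existing entities,
    # restoring the original wikitext.
    return re.sub(r'&#x0([0-9A-Fa-f]+);', r'&#x\1;', text)
```

The leading zero acts as the marker that keeps the two kinds of entity distinct, which is why `dearmor(armor(text))` returns the original text even when the page already contained hexadecimal entities.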

i'm going to try to implement this but i'm new to php and php doesn't seem too
friendly to this type of processing work.
Comment 2 Brion Vibber 2005-07-11 03:13:11 UTC
That's kind of sick, but might work. :)

Suggestion: use preg_replace_callback(), it tends to simplify these things nicely.

Take a look also at the existing UTF-8 support code in includes/normal (and the 
Sanitizer code for interpreting character references to UTF-8). This includes some 
simple helper functions for translating between numeric codepoints and UTF-8.
Comment 3 peter green 2005-07-11 20:06:27 UTC
ok status update.

the conversion routines are written and tested.

the text is converted when the user agent is mac ie (since i don't have access to a
mac myself i'm using firefox's user agent selector extension to test).

i still have to figure out how to make the conversion on save; once that's working
a patch will follow.
Comment 4 peter green 2005-07-11 21:41:12 UTC
Created attachment 698 [details]
implementation of workaround for non-unicode browsers as described
Comment 5 peter green 2005-07-12 21:59:20 UTC
Comment 6 Brion Vibber 2005-07-21 12:50:05 UTC
will test this out...
Comment 7 Brion Vibber 2005-07-29 10:01:41 UTC
Created attachment 752 [details]
Updated version of plugwash's patch

Here's plugwash's patch with my changes as I'm committing it:
* Reformatted to match other code
* Removed some duplicate code to use preexisting UTF-8 functions
* Arranged functions to extract checks from the mainline code.
* Added phpdoc comments.
* Adjusted the message warning a bit.
Comment 8 Brion Vibber 2005-07-29 10:16:17 UTC
Applied to CVS HEAD and installed on Wikimedia.

Seems to work correctly with IE 5.2/Mac as tested on my machine.
Comment 9 peter green 2005-07-30 01:00:04 UTC
i think you made a slight mistake in the comments 

+	 * Filter an output field through a Unicode de-armoring process if it
+	 * came from an old browser with known broken Unicode editing issues.

shouldn't that be

+	 * Filter an output field through a Unicode armoring process if it is
+	 * going to an old browser with known broken Unicode editing issues
Comment 10 Brion Vibber 2005-07-30 08:03:22 UTC
Ah, the dangers of cut and paste ;)

Comment typo fixed.
Comment 11 peter green 2005-07-30 22:43:18 UTC
also, not urgent since the browser is pretty uncommon now, but the blacklist entry
for netscape 4.x should cover all platforms and all 4.x versions, not just 4.78
for linux
Comment 12 Omegatron 2005-12-05 19:24:39 UTC
This is great, but Unicode-unaware *browsers* aren't the only problem.  A lot of
people want to work in Unicode-unaware text editors as well, and this makes it
difficult for them.  They'd have to fake out the server into thinking they had
an old browser or something.  I have a different proposal:

1. Convert all HTML entities (named or Unicode numbers or whatever) into plain
Unicode characters in the wikisource.

2. Provide an option in the editing interface to view the source in either
"plain Unicode" format (with actual characters) or "plain text" format (with
entities) on a per-edit basis.

2.a. When editing in "plain text" mode, all the bad characters (non-ASCII?) will
be converted into named HTML entities if possible (&amp;mdash; and the like), or
into numbered HTML entities if not possible (&amp;#8212; and the like).

2.b. The default editing format will be selectable in preferences.
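The "plain text" conversion proposed in step 2.a amounts to a character-by-character mapping, which can be sketched in Python using the standard library's entity table. (This is an illustration of the proposal only; MediaWiki itself is PHP, and the function name is hypothetical.)

```python
import html.entities

def to_entities(text):
    # Leave 7-bit ASCII alone; convert every other character to a named
    # HTML entity when one exists (e.g. &mdash;), otherwise to a numeric
    # entity (e.g. &#8212;).
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            name = html.entities.codepoint2name.get(ord(ch))
            out.append('&%s;' % name if name else '&#%d;' % ord(ch))
    return ''.join(out)
```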
Comment 13 lɛʁi לערי ריינהארט 2005-12-05 20:12:59 UTC
(In reply to comment #12)

Hi Omegatron!

Your request should be covered by
Bug 4012: feature request: add a flexible magic character conversion to the
built-in editor.
You request a kind of "convert all UTF-8 characters" setup described there.

Best regards reinhardt [[user:gangleri]]
Comment 14 Michael Zajac 2006-08-01 23:19:44 UTC
I don't think the current scheme protects Unicode non-breaking space 
characters (U+00A0).  I often enter these into en.wikipedia by typing 
alt-space in Safari, and they work fine and survive most edits—wikitext 
is ''much'' cleaner than littered with a bunch of &amp;nbsp;.

But once in a while, some other editor's browser will convert them all to 
plain spaces.  Should U+00A0 be added to the list of characters 
protected from old browsers?
Comment 15 peter green 2006-08-02 22:11:29 UTC
"I don't think the current scheme protects Unicode non-breaking space
characters (U+00A0)."
it does, all non-ascii characters are protected, the problem is that as of right
now the bad browser list is very limited (and if the "bad browser" is a plugin
or similar that doesn't affect the headers in any way or the user is
copy/pasting into a separate editor there isn't much we can do).

"I often enter these into en.wikipedia by typing
alt-space in Safari, and they work fine and survive most edits—wikitext
is ''much'' cleaner than littered with a bunch of &amp;nbsp;."
cleaner maybe, but unless you are using a specialist editor that highlights
non-breaking spaces it is virtually impossible to edit correctly.
Comment 16 Michael Zajac 2006-08-03 17:20:01 UTC
Thanks for the reply.  

I don't see non-breaking spaces as a problem.  I only enter them where it's good practice and recommended by the MOS, 
e.g., in unit expressions such as "100 mm".  If they need to be found for some reason, wikitext can be pasted into 
practically any text editor or word processor for more sophisticated processing.
