Last modified: 2012-04-19 21:42:59 UTC
Created attachment 6699 [details]
Screen print of the error
Reporting against Babaco Release : r57957
Steps to Reproduce:
Link : http://prototype.wikimedia.org/si.wikipedia.org/%E0%B6%B8%E0%B7%94%E0%B6%BD%E0%B7%8A_%E0%B6%B4%E0%B7%92%E0%B6%A7%E0%B7%94%E0%B7%80
1) Select a random page
2) Edit a section
3) Select a word and choose a replacement word for it
Actual Results: An extra character is added to the text.
Expected Results: No extra character should be added.
Browser (User-Agent): Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/532.0 (KHTML, like Gecko) Chrome/188.8.131.52 Safari/532.0
Browser (User-Agent): Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)
Browser (User-Agent): Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3
My gut says this is probably due to a bad interaction between regexes and multibyte strings; if that's the case, we can't do much about it.
Basically what I think is happening is that the [^ ] part of the regex is selecting one byte, but the character at that position is really two (or more) bytes long. That one byte will be matched and replaced, but the second (and any subsequent) bytes will stick around and be interpreted as a different character. I'll try to confirm this suspicion later.
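To make the hypothesis above concrete, here is a minimal Python sketch of the failure mode it describes: a byte-oriented [^ ] that replaces only the first byte of a multibyte UTF-8 character. It is an illustration only, not the toolbar's actual search and replace code. The sample text is the page title decoded from the URL in the report, and the replacement "X" is an arbitrary choice.

```python
import re

# Page title from the reported URL, written as escapes so the code points are explicit.
text = "\u0db8\u0dd4\u0dbd\u0dca \u0db4\u0dd2\u0da7\u0dd4\u0dc0"
raw = text.encode("utf-8")

# Byte-oriented pattern: [^ ] matches exactly one byte, so only the first
# byte (0xE0) of the three-byte first character is replaced.
broken = re.sub(rb"[^ ]", b"X", raw, count=1)

# Character-oriented pattern: [^ ] matches one whole code point.
fixed = re.sub(r"[^ ]", "X", text, count=1)

# The two leftover continuation bytes no longer form a valid character,
# so a lenient decode turns them into replacement characters.
decoded_broken = broken.decode("utf-8", errors="replace")

print(broken[:3])
print(decoded_broken)
print(fixed)
```

If the regex engine really worked on bytes, the leftover bytes would show up as stray or substituted characters right after the replacement, which is the kind of symptom the report describes.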
The suspicion in comment #1 doesn't seem to be right, so now I think this may have something to do with compound characters. Could you paste all of the text from the PDF (textarea contents before, search regex, replace string, textarea contents after) in a bug comment?
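A rough sketch of what "compound characters" could mean here, in Python for illustration: a visible Sinhala syllable is often a base letter followed by a combining sign, so a pattern that matches a single code point can replace the base letter and leave the sign attached to whatever replaces it. The helper `clusters` is hypothetical and deliberately simplified (it only groups combining marks with the preceding character and is not full grapheme cluster segmentation). The replacement letter U+0D9A is an arbitrary choice, and nothing here is confirmed to be the cause of this bug.

```python
import re
import unicodedata

# First word of the page title in the reported URL: two visible syllables,
# four code points (letter + vowel sign, letter + al-lakuna).
word = "\u0db8\u0dd4\u0dbd\u0dca"

def clusters(s):
    """Hypothetical helper: group each combining mark with the preceding character."""
    out = []
    for ch in s:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out

# Show how the word breaks down into code points.
for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

# [^ ] matches one code point, so only the base letter is replaced and the
# original vowel sign U+0DD4 stays behind, now attached to the new letter.
partial = re.sub(r"[^ ]", "\u0d9a", word, count=1)

# Replacing a whole cluster removes the sign together with its base letter.
parts = clusters(word)
whole = "\u0d9a" + "".join(parts[1:])

print(len(word), len(parts))
print(partial)
print(whole)
```

Comparing `partial` and `whole` before and after a replace would be one way to tell whether leftover combining signs explain the extra character.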
The underlying search and replace code is completely different now that we are using an iframe rather than a textarea.
(In reply to comment #3)
> The underlying search and replace code is completely different now that we are
> using an iframe rather than a textarea.
That doesn't necessarily mean that multibyte character handling is magically fixed. Reopening and asking Calcey to try and reproduce again; please close as FIXED or WORKSFORME if this can't be reproduced any more.
I've tested this with double-byte characters quite a bit now, and am sure it's fixed.
Note that Sinhala seems to be using three-byte characters.
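The byte counts are easy to check. The short Python sketch below compares the UTF-8 length of one Latin, one Greek, and one Sinhala letter; the sample characters are my own choice and the only point is the difference between two-byte and three-byte encodings.

```python
# One sample letter per script; the keys are just labels for this sketch.
samples = {
    "Latin a": "a",
    "Greek alpha": "\u03b1",
    "Sinhala mayanna": "\u0db8",
}

# Number of bytes each letter occupies in UTF-8.
utf8_len = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

# Number of 16-bit code units each letter occupies in UTF-16.
utf16_units = {name: len(ch.encode("utf-16-le")) // 2 for name, ch in samples.items()}

for name in samples:
    print(name, utf8_len[name], utf16_units[name])
```

Testing with two-byte characters therefore does not exercise the three-byte case in UTF-8, although all three samples are a single code unit in UTF-16.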
Verified and closed