Last modified: 2010-07-24 18:23:52 UTC
With the whitelisting of the <abbr> tag, the function doMagicLinks() in Parser.php confuses <a> and <abbr> elements.
Created attachment 7235
regular expression modification
1) Do you have a test case that demonstrates the problem? I.e., what's some markup that parses incorrectly because of this bug?
2) Your change doesn't seem quite right -- whitespace other than a simple space would be valid HTML here (although I haven't looked closely enough to see if it would actually be possible at this stage in the parsing). I would suggest (<a[^a-z0-9].*?</a>).
1) The wiki markup below gets parsed incorrectly. You can also check [[User:GuillaumeBeaudoin]] for more examples.
<abbr>(fr)</abbr> ISBN 2753300917 [http://bit.ly/bZAjtg La méthode Google]
The <abbr> tag is used extensively on the French Wikipedia, and the issue was first found on [[fr:Wikipedia]] by [[fr:User:Manu1400]].
2) You're right, a tab or any whitespace other than a simple space would not be matched by my regular expression. We could use \s to match any whitespace (option A), or something like what you proposed (option B).
Option A - <a[\s>].*?</a>
Option B - <a[^a-zA-Z0-9].*?</a>
Option C - <a[^[:alnum:]].*?</a>
Although I'm not sure what capital letters would do.
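For anyone following along, here is a rough sketch of how options A and B behave on the reported markup. It is written in Python rather than PHP, and it rests on several assumptions: the "loose" pattern is only a stand-in for the pre-fix behaviour and is not quoted from Parser.php, option A is written with \s as the comment text describes, and the sample text assumes the external link has already been rendered as an <a> element by the time doMagicLinks() runs. Option C is left out because Python's re module does not support POSIX classes such as [[:alnum:]].

```python
import re

# "loose" is an assumed stand-in for the pre-fix behaviour, not quoted from Parser.php.
# Option A is written with \s, as the comment text describes.
PATTERNS = {
    "loose": re.compile(r"<a.*?</a>"),
    "A": re.compile(r"<a[\s>].*?</a>"),
    "B": re.compile(r"<a[^a-zA-Z0-9].*?</a>"),
}

# Hypothetical intermediate text: the reported markup with the external link
# assumed to be already rendered as an <a> element.
TEXT = (
    '<abbr>(fr)</abbr> ISBN 2753300917 '
    '<a href="http://bit.ly/bZAjtg">La méthode Google</a>'
)


def skipped_span(name: str) -> str:
    """Return the text that the named pattern would treat as link text."""
    match = PATTERNS[name].search(TEXT)
    return match.group(0) if match else ""


for name in PATTERNS:
    span = skipped_span(name)
    print(name, "swallows ISBN:", "ISBN" in span)
```

Under these assumptions, only the loose pattern starts matching at <abbr> and swallows the ISBN; options A and B both start at the real <a href=...> element.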
Committed a modified version in r64113. I went with (<a[ \t\r\n>].*?</a>) in the end, matching the HTML5 spec as far as I'm reading it: <http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#before-attribute-name-state> Thanks for the patch!
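A quick way to sanity-check the committed pattern is to vary the character that follows "<a". The sketch below uses Python's re module instead of PHP's PCRE, tests the quoted fragment on its own rather than the full doMagicLinks() expression, and uses made-up sample strings, so treat it as an illustration only.

```python
import re

# Pattern as quoted in the comment above; the rest of the real regex is omitted.
COMMITTED = re.compile(r"(<a[ \t\r\n>].*?</a>)")

# Hypothetical tag openings, chosen to vary the character right after "<a".
SAMPLES = {
    "space": '<a href="x">y</a>',
    "tab": '<a\thref="x">y</a>',
    "newline": '<a\nhref="x">y</a>',
    "bare": "<a>y</a>",
    "abbr": "<abbr>y</abbr>",
}


def is_link(sample: str) -> bool:
    """Return True if the committed pattern treats the sample as link text."""
    return COMMITTED.search(sample) is not None


results = {name: is_link(value) for name, value in SAMPLES.items()}
print(results)
```

With these samples, every variant of a real <a> opening is recognised, while <abbr> is not.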
Thank you, Aryeh. Merci!
Since this is fixed, removing Bug #617 as a "blocks" dependency.
Whoops, typo. Corrected: Since this is fixed, removing Bug #671 as a "blocks" dependency.