Last modified: 2014-09-23 23:16:16 UTC
Hello, ISBN, RFC and PMID "magic links" might support non-breaking spaces ( ) too, like: ISBN 978-1-2345-6789-0 see also: http://www.mediawiki.org/wiki/Markup_spec/BNF/Magic_links
Might as well do it when fixing bug 29025 (a literal 0160 char, not when done with entities)
Created attachment 9894 [details] patch proposal for Bug 28950 and Bug 29025 First patch proposal for Bug 28950 and Bug 29025. Seems to be working great, nevertheless any suggestion would be very welcome. The benefits of this patch are: - (Bug 28950) non-breaking spaces (both literal char and HTML entities) support - (Bug 29025) no surprising link creation if several \n's (like "ISBN\n\n1234567890") The only limitation I am aware of is that \n isn't implemented (yet), so for example "ISBN\n1234567890" doesn't produce a link. But don't forget cases like "ISBN \n123...", "ISBN\n 123..." (<pre> insertion!), "ISBN\n 123...", and so on. \n support is feasible, I don't know if it would be that useful, however I'd like to be as close as possible to "normal" wikicode parsing.
I've added the "patch" and "need-review" tags so developers know to review this. Thanks for the patch! Next time you can do that yourself to speed things up.
Created attachment 9897 [details] patch proposal (v2) for Bug 28950 and Bug 29025 New patch proposal for Bug 28950 and Bug 29025. Now also perfectly manages \n's :-) Here's my approach: the \n's are captured by the regex, as does the production code, but no link is created if the spaces between "ISBN" and "123..." met any of these conditions: - several \n's (not necessarily consecutive); which forcibly leave the current <p> - \n followed by a normal space; which trigger <pre> creation Also, I removed the following lines I added in my previous patch: $spaces = preg_replace( '![ \t]+!', ' ', $spaces ); $spaces = preg_replace( '!&\#0?160;|\x{0160}!u', ' ', $spaces ); I noticed the parser usually doesn't do this cleanup (it only replaces entities with  ), so for consistency and performances I just let the spaces unmodified. Please inform me of anything I would have forgotten. Sumana, you're welcome; I'm keeping it in mind for the next time.
The entities look a bit like cluttering. Also, why do you need to replace the spaces?
I'm not sure to understand what you mean? I just support the different ways of providing a non-break space in the wikisource, which are the entities ,  ,  , as well as the literal \u00A0 char. It is important for consistency to recognize all of them and not just an arbitrary part. Also, I said I **don't** replace anymore the spaces ;-) I realized after my first patch that it was useless...
The latest patch would remove linking for five links in the English Wikipedia found by the following regexp: '(?:(?:RFC|PMID)[ \t\n\r\f]*(?:[\n\f\r][ \t\n\r\f]*){2,}([0-9]+)|ISBN[ \t\n\r\f]*(?:[\n\f\r][ \t\n\r\f]*){2,}(\b(?:97[89][ -]?)?(?:[0-9][ -]?){9}[0-9Xx]\b))' ---------------------------------------------------------- == Match: [[Ridda wars]] == on, Trident Press Ltd., 2001, p. 81-84. ISBN 1-900724-47-2.</ref> === Buzakha === On receiving i == Match: [[Hiroyuki Agawa]] == Imperial Navy<br>ISBN 9780870113550<br>ISBN 9784770025395 |[[Biography]]; translation by John Bes == Match: [[USS Fulton (AS-11)]] == Ships'', pg. 377. Amber Books, London. ISBN 9781905704439 </ref>, launched on 27 December 1940 by == Match: [[Leinster League Division Two]] == 1997/1998 Ashbourne 1998/1999 Coolmine RFC 1999/2000 [[Garda RFC|Garda]] 2000/2001 Ark == Match: [[Bluff Cove Air Attacks]] == sic primer''. Diane publishing, p. 235. ISBN 1585660914</ref> A total of 56 British servicemen ------------------------------------------------------- This is rare enough (and can easily be fixed manually). In the patch, the checks for the number of newlines and the leading space (pre) could also be folded into the regexp. Apart from that, it would be a good idea to benchmark the performance impact from switching the regexp to unicode mode.
Thank you for this deep research. These articles should be corrected manually. Also, your 4th match seems to be wrong ;) About the unicode regex mode, if the performance impact is worth it, we could remove that "\x{00A0}" part. It shouldn't be a big problem since directly using the nbsp char is a bad idea: not visible, and removed from textarea by some browsers.
I have not tested the performance impact yet, but have doubts about promoting manual non-breaking space insertion. An alternative to consider would be to insert non-breaking spaces automatically in these links, which would avoid cluttering the source with ugly HTML entities. The length of IDs and especially of RFC/ISBN/PMID prefixes is quite short, so a non-breaking space should not cause major display problems, even on small displays. ISBN links in narrow table columns (infoboxes maybe) could still be an issue, and would need to be checked. Independent of the non-breaking space issue I am in favor of restricting the number of newlines between link prefix and ID just as you did in your patch. Could you rework the patch to include the count directly in the regexp, similar to my regexp above?
The trunk version of this regexp already uses Unicode regexp mode, so your patch should not affect performance significantly. The current patch no longer applies on current trunk though. Could you refresh and tweak it?
(In reply to comment #10) > The current patch no longer applies on current trunk though. Could you refresh > and tweak it? Thus marking "reviewed" - thanks, Gabriel.
Of the 5 exceptions I have corrected 3, one was corrected already (but there was an instance of a tab character, and one was a false positive (Rugby Football Club, not Request For Comments).
Hi Vlakoff, thank you for the patch! As you may already know, MediaWiki is currently revamping its PHP-based parser into a "Parsoid" prototype component, to support the rich-text Visual Editor project: https://www.mediawiki.org/wiki/Parsoid https://www.mediawiki.org/wiki/Visual_editor Folks interested in enhancing the parser's capabilities are very much welcome to join the Parsoid project, and contribute patches as Git branches: https://www.mediawiki.org/wiki/Git/Tutorial#How_to_submit_a_patch Compared to .diff attachments in Bugzilla tickets, Git branches are much easier for us to review, refine and merge features together. Each change set has a distinct URL generated by the "git review" tool, which can be referenced in Bugzilla by pasting its gerrit.wikimedia.org URL as a comment. If you run into any issues with the patch process, please feel free to ask on irc.freenode.net #wikimedia-dev and the wikitext-l mailing list. Thank you!