Last modified: 2014-09-23 23:16:16 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T30950, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 28950 - ISBN, RFC and PMID "magic links" might support non-breaking spaces ( ) too


Summary:	ISBN, RFC and PMID "magic links" might support non-breaking spaces ( ) too

Status:	NEW

Product:	MediaWiki
Classification:	Unclassified
Component:	Parser (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal enhancement (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:	patch, patch-reviewed

Depends on:	29025
Blocks:	29473
	Show dependency tree / graph

Reported:	2011-05-13 00:45 UTC by vlakoff
Modified:	2014-09-23 23:16 UTC (History)
CC List:	6 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
patch proposal for Bug 28950 and Bug 29025 (3.61 KB, patch) 2012-01-23 13:28 UTC, vlakoff	Details
patch proposal (v2) for Bug 28950 and Bug 29025 (3.52 KB, patch) 2012-01-23 20:56 UTC, vlakoff	Details
Show Obsolete (1) Add an attachment (proposed patch, testcase, etc.)

Description vlakoff 2011-05-13 00:45:06 UTC

Hello,

ISBN, RFC and PMID "magic links" might support non-breaking spaces (&amp;nbsp;) too, like:

ISBN&amp;nbsp;978-1-2345-6789-0


see also:
http://www.mediawiki.org/wiki/Markup_spec/BNF/Magic_links

Comment 1 Platonides 2011-05-25 21:33:01 UTC

Might as well do it when fixing bug 29025
(a literal 0160 char, not when done with entities)

Comment 2 vlakoff 2012-01-23 13:28:37 UTC

Created attachment 9894 [details]
patch proposal for Bug 28950 and Bug 29025

First patch proposal for Bug 28950 and Bug 29025. Seems to be working great, nevertheless any suggestion would be very welcome.

The benefits of this patch are:
- (Bug 28950) non-breaking spaces (both literal char and HTML entities) support
- (Bug 29025) no surprising link creation if several \n's (like "ISBN\n\n1234567890")


The only limitation I am aware of is that \n isn't implemented (yet), so for example "ISBN\n1234567890" doesn't produce a link. But don't forget cases like "ISBN \n123...", "ISBN\n 123..." (<pre> insertion!), "ISBN\n&nbsp;123...", and so on.

\n support is feasible, I don't know if it would be that useful, however I'd like to be as close as possible to "normal" wikicode parsing.

Comment 3 Sumana Harihareswara 2012-01-23 19:37:21 UTC

I've added the "patch" and "need-review" tags so developers know to review this.  Thanks for the patch!  Next time you can do that yourself to speed things up.

Comment 4 vlakoff 2012-01-23 20:56:41 UTC

Created attachment 9897 [details]
patch proposal (v2) for Bug 28950 and Bug 29025

New patch proposal for Bug 28950 and Bug 29025.

Now also perfectly manages \n's :-)

Here's my approach: the \n's are captured by the regex, as does the production code, but no link is created if the spaces between "ISBN" and "123..." met any of these conditions:
- several \n's (not necessarily consecutive); which forcibly leave the current <p>
- \n followed by a normal space; which trigger <pre> creation


Also, I removed the following lines I added in my previous patch:
$spaces = preg_replace( '![ \t]+!', ' ', $spaces );
$spaces = preg_replace( '!&\#0?160;|\x{0160}!u', '&nbsp;', $spaces );

I noticed the parser usually doesn't do this cleanup (it only replaces entities with &#160;), so for consistency and performances I just let the spaces unmodified.


Please inform me of anything I would have forgotten.

Sumana, you're welcome; I'm keeping it in mind for the next time.

Comment 5 Platonides 2012-01-23 21:16:05 UTC

The entities look a bit like cluttering. Also, why do you need to
replace the spaces?

Comment 6 vlakoff 2012-01-23 22:18:05 UTC

I'm not sure to understand what you mean?

I just support the different ways of providing a non-break space in the wikisource, which are the entities &nbsp;, &#160;, &#0160;, as well as the literal \u00A0 char. It is important for consistency to recognize all of them and not just an arbitrary part.

Also, I said I **don't** replace anymore the spaces ;-) I realized after my first patch that it was useless...

Comment 7 Gabriel Wicke 2012-02-29 13:55:15 UTC

The latest patch would remove linking for five links in the English Wikipedia found by the following regexp:

'(?:(?:RFC|PMID)[ \t\n\r\f]*(?:[\n\f\r][ \t\n\r\f]*){2,}([0-9]+)|ISBN[ \t\n\r\f]*(?:[\n\f\r][ \t\n\r\f]*){2,}(\b(?:97[89][ -]?)?(?:[0-9][ -]?){9}[0-9Xx]\b))'

----------------------------------------------------------
== Match: [[Ridda wars]] ==
on, Trident Press Ltd., 2001, p. 81-84. ISBN
                 
1-900724-47-2.</ref>                                                                             

=== Buzakha ===

On receiving i
== Match: [[Hiroyuki Agawa]] ==
 Imperial Navy<br>ISBN 9780870113550<br>ISBN 
                 
9784770025395                                                                                    
|[[Biography]]; translation by John Bes
== Match: [[USS Fulton (AS-11)]] ==
 Ships'', pg. 377. Amber Books, London. ISBN 
                                                                                                 
9781905704439 </ref>, launched on 27 December 1940 by                                            
== Match: [[Leinster League Division Two]] ==
1997/1998 Ashbourne

1998/1999 Coolmine RFC
                 
1999/2000 [[Garda RFC|Garda]]                                                                    

2000/2001 Ark
== Match: [[Bluff Cove Air Attacks]] ==
sic primer''. Diane publishing, p. 235. ISBN
                 
1585660914</ref>                                                                                 

A total of 56 British servicemen
-------------------------------------------------------

This is rare enough (and can easily be fixed manually). In the patch, the checks for the number of newlines and the leading space (pre) could also be folded into the regexp. Apart from that, it would be a good idea to benchmark the performance impact from switching the regexp to unicode mode.

Comment 8 vlakoff 2012-02-29 18:49:20 UTC

Thank you for this deep research. These articles should be corrected manually. Also, your 4th match seems to be wrong ;)

About the unicode regex mode, if the performance impact is worth it, we could remove that "\x{00A0}" part. It shouldn't be a big problem since directly using the nbsp char is a bad idea: not visible, and removed from textarea by some browsers.

Comment 9 Gabriel Wicke 2012-03-08 18:59:28 UTC

I have not tested the performance impact yet, but have doubts about promoting manual non-breaking space insertion. 

An alternative to consider would be to insert non-breaking spaces automatically in these links, which would avoid cluttering the source with ugly HTML entities. The length of IDs and especially of RFC/ISBN/PMID prefixes is quite short, so a non-breaking space should not cause major display problems, even on small displays. ISBN links in narrow table columns (infoboxes maybe) could still be an issue, and would need to be checked.

Independent of the non-breaking space issue I am in favor of restricting the number of newlines between link prefix and ID just as you did in your patch. Could you rework the patch to include the count directly in the regexp, similar to my regexp above?

Comment 10 Gabriel Wicke 2012-03-09 10:00:49 UTC

The trunk version of this regexp already uses Unicode regexp mode, so your patch should not affect performance significantly. 

The current patch no longer applies on current trunk though. Could you refresh and tweak it?

Comment 11 Sumana Harihareswara 2012-03-10 19:53:09 UTC

(In reply to comment #10)
> The current patch no longer applies on current trunk though. Could you refresh
> and tweak it?

Thus marking "reviewed" - thanks, Gabriel.

Comment 12 Rich Farmbrough 2012-05-12 18:10:08 UTC

Of the 5 exceptions I have corrected 3, one was corrected already (but there was an instance of a tab character, and one was a false positive (Rugby Football Club, not Request For Comments).

Comment 13 au 2012-06-17 18:41:10 UTC

Hi Vlakoff, thank you for the patch!

As you may already know, MediaWiki is currently revamping its PHP-based parser
into a "Parsoid" prototype component, to support the rich-text Visual Editor
project:

   https://www.mediawiki.org/wiki/Parsoid
   https://www.mediawiki.org/wiki/Visual_editor

Folks interested in enhancing the parser's capabilities are very much welcome
to join the Parsoid project, and contribute patches as Git branches:

   https://www.mediawiki.org/wiki/Git/Tutorial#How_to_submit_a_patch

Compared to .diff attachments in Bugzilla tickets, Git branches are much easier
for us to review, refine and merge features together.

Each change set has a distinct URL generated by the "git review" tool, which
can be referenced in Bugzilla by pasting its gerrit.wikimedia.org URL as a
comment.

If you run into any issues with the patch process, please feel free to ask on
irc.freenode.net #wikimedia-dev and the wikitext-l mailing list. Thank you!

Note You need to log in before you can comment on or make changes to this bug.

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links