Last modified: 2014-09-23 23:16:16 UTC

Wikimedia Bugzilla is closed!

Wikimedia has migrated from Bugzilla to Phabricator. Bug reports should be created and updated in Wikimedia Phabricator instead. Please create an account in Phabricator and add your Bugzilla email address to it.
Wikimedia Bugzilla is read-only. If you try to edit or create any bug report in Bugzilla you will be shown an intentional error message.
In order to access the Phabricator task corresponding to a Bugzilla report, just remove "static-" from its URL.
You could still run searches in Bugzilla or access your list of votes but bug reports will obviously not be up-to-date in Bugzilla.
Bug 28950 - ISBN, RFC and PMID "magic links" might support non-breaking spaces ( ) too
ISBN, RFC and PMID "magic links" might support non-breaking spaces ( ) too
Status: NEW
Product: MediaWiki
Classification: Unclassified
Parser (Other open bugs)
unspecified
All All
: Normal enhancement (vote)
: ---
Assigned To: Nobody - You can work on this!
: patch, patch-reviewed
Depends on: 29025
Blocks: 29473
  Show dependency treegraph
 
Reported: 2011-05-13 00:45 UTC by vlakoff
Modified: 2014-09-23 23:16 UTC (History)
6 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
patch proposal for Bug 28950 and Bug 29025 (3.61 KB, patch)
2012-01-23 13:28 UTC, vlakoff
Details
patch proposal (v2) for Bug 28950 and Bug 29025 (3.52 KB, patch)
2012-01-23 20:56 UTC, vlakoff
Details

Description vlakoff 2011-05-13 00:45:06 UTC
Hello,

ISBN, RFC and PMID "magic links" might support non-breaking spaces ( ) too, like:

ISBN 978-1-2345-6789-0


see also:
http://www.mediawiki.org/wiki/Markup_spec/BNF/Magic_links
Comment 1 Platonides 2011-05-25 21:33:01 UTC
Might as well do it when fixing bug 29025
(a literal 0160 char, not when done with entities)
Comment 2 vlakoff 2012-01-23 13:28:37 UTC
Created attachment 9894 [details]
patch proposal for Bug 28950 and Bug 29025

First patch proposal for Bug 28950 and Bug 29025. Seems to be working great, nevertheless any suggestion would be very welcome.

The benefits of this patch are:
- (Bug 28950) non-breaking spaces (both literal char and HTML entities) support
- (Bug 29025) no surprising link creation if several \n's (like "ISBN\n\n1234567890")


The only limitation I am aware of is that \n isn't implemented (yet), so for example "ISBN\n1234567890" doesn't produce a link. But don't forget cases like "ISBN \n123...", "ISBN\n 123..." (<pre> insertion!), "ISBN\n&nbsp;123...", and so on.

\n support is feasible, I don't know if it would be that useful, however I'd like to be as close as possible to "normal" wikicode parsing.
Comment 3 Sumana Harihareswara 2012-01-23 19:37:21 UTC
I've added the "patch" and "need-review" tags so developers know to review this.  Thanks for the patch!  Next time you can do that yourself to speed things up.
Comment 4 vlakoff 2012-01-23 20:56:41 UTC
Created attachment 9897 [details]
patch proposal (v2) for Bug 28950 and Bug 29025

New patch proposal for Bug 28950 and Bug 29025.

Now also perfectly manages \n's :-)

Here's my approach: the \n's are captured by the regex, as does the production code, but no link is created if the spaces between "ISBN" and "123..." met any of these conditions:
- several \n's (not necessarily consecutive); which forcibly leave the current <p>
- \n followed by a normal space; which trigger <pre> creation


Also, I removed the following lines I added in my previous patch:
$spaces = preg_replace( '![ \t]+!', ' ', $spaces );
$spaces = preg_replace( '!&\#0?160;|\x{0160}!u', '&nbsp;', $spaces );

I noticed the parser usually doesn't do this cleanup (it only replaces entities with &#160;), so for consistency and performances I just let the spaces unmodified.


Please inform me of anything I would have forgotten.

Sumana, you're welcome; I'm keeping it in mind for the next time.
Comment 5 Platonides 2012-01-23 21:16:05 UTC
The entities look a bit like cluttering. Also, why do you need to
replace the spaces?
Comment 6 vlakoff 2012-01-23 22:18:05 UTC
I'm not sure to understand what you mean?

I just support the different ways of providing a non-break space in the wikisource, which are the entities &nbsp;, &#160;, &#0160;, as well as the literal \u00A0 char. It is important for consistency to recognize all of them and not just an arbitrary part.

Also, I said I **don't** replace anymore the spaces ;-) I realized after my first patch that it was useless...
Comment 7 Gabriel Wicke 2012-02-29 13:55:15 UTC
The latest patch would remove linking for five links in the English Wikipedia found by the following regexp:

'(?:(?:RFC|PMID)[ \t\n\r\f]*(?:[\n\f\r][ \t\n\r\f]*){2,}([0-9]+)|ISBN[ \t\n\r\f]*(?:[\n\f\r][ \t\n\r\f]*){2,}(\b(?:97[89][ -]?)?(?:[0-9][ -]?){9}[0-9Xx]\b))'

----------------------------------------------------------
== Match: [[Ridda wars]] ==
on, Trident Press Ltd., 2001, p. 81-84. ISBN
                 
1-900724-47-2.</ref>                                                                             

=== Buzakha ===

On receiving i
== Match: [[Hiroyuki Agawa]] ==
 Imperial Navy<br>ISBN 9780870113550<br>ISBN 
                 
9784770025395                                                                                    
|[[Biography]]; translation by John Bes
== Match: [[USS Fulton (AS-11)]] ==
 Ships'', pg. 377. Amber Books, London. ISBN 
                                                                                                 
9781905704439 </ref>, launched on 27 December 1940 by                                            
== Match: [[Leinster League Division Two]] ==
1997/1998 Ashbourne

1998/1999 Coolmine RFC
                 
1999/2000 [[Garda RFC|Garda]]                                                                    

2000/2001 Ark
== Match: [[Bluff Cove Air Attacks]] ==
sic primer''. Diane publishing, p. 235. ISBN
                 
1585660914</ref>                                                                                 

A total of 56 British servicemen
-------------------------------------------------------

This is rare enough (and can easily be fixed manually). In the patch, the checks for the number of newlines and the leading space (pre) could also be folded into the regexp. Apart from that, it would be a good idea to benchmark the performance impact from switching the regexp to unicode mode.
Comment 8 vlakoff 2012-02-29 18:49:20 UTC
Thank you for this deep research. These articles should be corrected manually. Also, your 4th match seems to be wrong ;)

About the unicode regex mode, if the performance impact is worth it, we could remove that "\x{00A0}" part. It shouldn't be a big problem since directly using the nbsp char is a bad idea: not visible, and removed from textarea by some browsers.
Comment 9 Gabriel Wicke 2012-03-08 18:59:28 UTC
I have not tested the performance impact yet, but have doubts about promoting manual non-breaking space insertion. 

An alternative to consider would be to insert non-breaking spaces automatically in these links, which would avoid cluttering the source with ugly HTML entities. The length of IDs and especially of RFC/ISBN/PMID prefixes is quite short, so a non-breaking space should not cause major display problems, even on small displays. ISBN links in narrow table columns (infoboxes maybe) could still be an issue, and would need to be checked.

Independent of the non-breaking space issue I am in favor of restricting the number of newlines between link prefix and ID just as you did in your patch. Could you rework the patch to include the count directly in the regexp, similar to my regexp above?
Comment 10 Gabriel Wicke 2012-03-09 10:00:49 UTC
The trunk version of this regexp already uses Unicode regexp mode, so your patch should not affect performance significantly. 

The current patch no longer applies on current trunk though. Could you refresh and tweak it?
Comment 11 Sumana Harihareswara 2012-03-10 19:53:09 UTC
(In reply to comment #10)
> The current patch no longer applies on current trunk though. Could you refresh
> and tweak it?

Thus marking "reviewed" - thanks, Gabriel.
Comment 12 Rich Farmbrough 2012-05-12 18:10:08 UTC
Of the 5 exceptions I have corrected 3, one was corrected already (but there was an instance of a tab character, and one was a false positive (Rugby Football Club, not Request For Comments).
Comment 13 au 2012-06-17 18:41:10 UTC
Hi Vlakoff, thank you for the patch!

As you may already know, MediaWiki is currently revamping its PHP-based parser
into a "Parsoid" prototype component, to support the rich-text Visual Editor
project:

   https://www.mediawiki.org/wiki/Parsoid
   https://www.mediawiki.org/wiki/Visual_editor

Folks interested in enhancing the parser's capabilities are very much welcome
to join the Parsoid project, and contribute patches as Git branches:

   https://www.mediawiki.org/wiki/Git/Tutorial#How_to_submit_a_patch

Compared to .diff attachments in Bugzilla tickets, Git branches are much easier
for us to review, refine and merge features together.

Each change set has a distinct URL generated by the "git review" tool, which
can be referenced in Bugzilla by pasting its gerrit.wikimedia.org URL as a
comment.

If you run into any issues with the patch process, please feel free to ask on
irc.freenode.net #wikimedia-dev and the wikitext-l mailing list. Thank you!

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links