Last modified: 2014-10-30 19:44:26 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T34717, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 32717 - Question: Bidi overrides and Unicode spaces removal from titles: why not zero-width space and horizontal tab?
Status: UNCONFIRMED
Product: MediaWiki
Classification: Unclassified
Component: Parser
Version: 1.20.x
Hardware: All All
Importance: Low trivial
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2011-11-30 09:20 UTC by Lupo
Modified: 2014-10-30 19:44 UTC

Description Lupo 2011-11-30 09:20:19 UTC
This is not a bug, just a question.

Looking at Title::secureAndSplit at
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Title.php?revision=104051&view=markup#l2722
(related to old bug 3696), I wonder why

1. the zero-width space (U+200B, or UTF-8 E2 80 8B) is not stripped?

2. the horizontal tab \t is not included in the whitespace regexp to be replaced by an underscore?

Oversights, or is there some reason? Are these stripped somewhere else already?
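
To make the two questions concrete, here is a minimal Python sketch of the kind of normalization being asked about: strip bidi control characters, then fold runs of space-like characters into a single underscore. The character sets, the regex names, and the function normalize_title are assumptions made for illustration; they are not copied from Title.php, and the real code may differ.

```python
import re

# Assumed character sets, for illustration only; not copied from Title.php.
# Bidi marks and overrides: U+200E, U+200F, U+202A to U+202E.
BIDI_CONTROLS = re.compile("[\u200e\u200f\u202a-\u202e]")
# Space-like characters that get folded into a single underscore.
# Note that neither U+200B (zero-width space) nor the tab is listed.
SPACE_LIKE = re.compile(
    "[ _\u00a0\u1680\u180e\u2000-\u200a\u2028\u2029\u202f\u205f\u3000]+"
)


def normalize_title(text: str) -> str:
    """Strip bidi controls, then fold space-like runs into one underscore."""
    text = BIDI_CONTROLS.sub("", text)
    text = SPACE_LIKE.sub("_", text)
    return text.strip("_")


# ascii() makes the invisible characters visible in the output.
print(ascii(normalize_title("Foo\u202e \u00a0Bar")))  # bidi stripped, spaces folded
print(ascii(normalize_title("Foo\u200bBar")))         # U+200B is left in place
print(ascii(normalize_title("Foo\tBar")))             # tab is left in place
```

Under these assumed sets, the zero-width space and the tab both pass through this step unchanged, which is the behavior the question is about.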
Comment 1 Mark A. Hershberger 2011-11-30 20:23:02 UTC
Adding Brion since he probably knows the answer and if this is a bug or not.
Comment 2 Brion Vibber 2011-11-30 21:44:15 UTC
Zero-width space is required for some scripts, to insert a break between letters that would otherwise form ligatures.

Tab ... SHOULD be stripped, lemme check. :)
Comment 3 Brion Vibber 2011-11-30 21:45:16 UTC
\t is just an outright forbidden char in titles.
Comment 4 Lupo 2011-12-01 08:12:26 UTC
(In reply to comment #2)
> Zero-width space is required for some scripts, to insert a break between
> letters that would otherwise form ligatures.

Maybe I misunderstand the purpose of these Unicode characters; I'm not a Unicode specialist. I thought that was the purpose of the zero-width non-joiner (U+200C)? Granted, I think the zero-width space (U+200B) also would need to have the same effect as the ZWNJ as it indicates an (invisible) word boundary, but I'd say that's just a side effect. Also, this normally invisible word boundary may be expanded into visible whitespace by text justification according to [[en:zero-width space]]. So right, stripping it would not be right, but maybe it should be treated as an underscore.

Anyway, thanks for the answer, I see the rationale now. Whether it's 100% correct is less important to me. And perhaps people are using U+200B where they should actually use U+200C, and it's thus more user-friendly to treat it that way. I was just trying to understand what the thoughts behind this were.
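
For readers who want to compare the characters discussed here, the following sketch uses Python's standard unicodedata module to print their names and general categories. It only reports Unicode character properties; it says nothing about how MediaWiki treats these characters. The helper name describe is made up for this example.

```python
import unicodedata


def describe(ch: str) -> str:
    """Summarize one character's code point, name, category, and isspace()."""
    # Control characters such as the tab have no Unicode name, hence the default.
    name = unicodedata.name(ch, "<no name>")
    return (
        f"U+{ord(ch):04X} {name}: "
        f"category={unicodedata.category(ch)}, isspace={ch.isspace()}"
    )


# Zero-width space, zero-width non-joiner, horizontal tab, ordinary space.
for ch in ("\u200b", "\u200c", "\t", " "):
    print(describe(ch))
```

In the Unicode data shipped with Python 3, both U+200B and U+200C are format characters (category Cf) and str.isspace() is false for them, while it is true for the tab and the ordinary space.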

> 
> Tab ... SHOULD be stripped, lemme check. :)

"Outright forbidden": do I see this right that this is rejected at line 2834
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Title.php?revision=104051&view=markup#l2834
and depends on the configuration of $wgLegalTitleChars?

So, is an installation allowed to define \t as a legal title character, and if so, what happens then? (Or what would make the most sense then?) Replace it with an underscore?
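
The allow-list check being asked about can be sketched as follows, assuming the configured value is the body of a regex character class and that any character outside it causes the title to be rejected. ASSUMED_LEGAL_CHARS is an abbreviated, hypothetical stand-in for $wgLegalTitleChars, not the real default, and first_illegal_char is a name invented for this example.

```python
import re

# Hypothetical, abbreviated stand-in for $wgLegalTitleChars (an assumption).
# It is the body of a regex character class; the tab is not in it.
ASSUMED_LEGAL_CHARS = r" %!$&'()*,\-./0-9:;=?@A-Z\\^_`a-z~\x80-\U0010FFFF+"


def first_illegal_char(title: str, legal_chars: str = ASSUMED_LEGAL_CHARS):
    """Return the first character outside the allow-list, or None if all pass."""
    match = re.search(f"[^{legal_chars}]", title)
    return match.group(0) if match else None


print(repr(first_illegal_char("Foo_Bar")))
print(repr(first_illegal_char("Foo\tBar")))
# A hypothetical installation that adds the tab to its allow-list:
print(repr(first_illegal_char("Foo\tBar", ASSUMED_LEGAL_CHARS + r"\t")))
```

With the tab added to the allow-list, this check no longer rejects the title. If the whitespace regexp does not include \t, as the original question observes, the tab would presumably then remain in the title as a literal tab unless some other step handles it. The thread does not confirm what actually happens in that configuration.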
Comment 5 Andre Klapper 2014-04-08 13:01:39 UTC
I see a question and discussion on behavior here, but not sure if there is a valid bug in this report...
