Last modified: 2014-10-30 19:44:26 UTC
This is not a bug, just a question. Looking at Title::secureAndSplit at http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Title.php?revision=104051&view=markup#l2722 (related to old bug 3696), I wonder why
1. the zero-width space (U+200B, or UTF-8 E2 80 8B) is not stripped?
2. the horizontal tab \t is not included in the whitespace regexp to be replaced by an underscore?
Oversights, or is there some reason? Are these stripped somewhere else already?
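To make the question concrete, here is a rough Python sketch of the kind of normalization being asked about. It is not the MediaWiki code (which is PHP): the function name `normalize_title` and both character sets are assumptions chosen for illustration, not copied from Title.php. It only shows that a cleanup of this shape leaves U+200B and \t untouched unless they are listed explicitly.

```python
import re

# Hypothetical character sets, for illustration only (not taken from Title.php).
# Directional marks and overrides that get removed outright.
BIDI_MARKS = re.compile("[\u200e\u200f\u202a-\u202e]")
# Space-like characters that get collapsed into a single underscore.
# Note: neither U+200B (zero-width space) nor \t is in this set.
SPACE_LIKE = re.compile(
    "[ _\u00a0\u1680\u180e\u2000-\u200a\u2028\u2029\u202f\u205f\u3000]+"
)


def normalize_title(text: str) -> str:
    """Strip directional marks, fold space-like runs to '_', trim underscores."""
    text = BIDI_MARKS.sub("", text)
    text = SPACE_LIKE.sub("_", text)
    return text.strip("_")


print(repr(normalize_title("  Foo \u00a0 bar_ ")))  # space-like runs folded
print(repr(normalize_title("Foo\u200bBar")))        # U+200B survives
print(repr(normalize_title("Foo\tBar")))            # tab survives
```

In this sketch the zero-width space and the tab both pass through unchanged, which is the behavior the two questions above are about.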
Adding Brion since he probably knows the answer and whether this is a bug or not.
Zero-width space is required for some scripts, to insert a break between letters that would otherwise form ligatures. Tab ... SHOULD be stripped, lemme check. :)
\t is just an outright forbidden char in titles.
(In reply to comment #2)
> Zero-width space is required for some scripts, to insert a break between
> letters that would otherwise form ligatures.

Maybe I misunderstand the purpose of these Unicode characters; I'm not a Unicode specialist. I thought that was the purpose of the zero-width non-joiner (U+200C)? Granted, I think the zero-width space (U+200B) also would need to have the same effect as the ZWNJ as it indicates an (invisible) word boundary, but I'd say that's just a side effect. Also, this normally invisible word boundary may be expanded into visible whitespace by text justification according to [[en:zero-width space]]. So right, stripping it would not be right, but maybe it should be treated as an underscore.

Anyway, thanks for the answer, I see the rationale now. Whether it's 100% correct is less important to me. And perhaps people are using U+200B where they should actually use U+200C, and it's thus more user-friendly to treat it that way. I was just trying to understand what the thoughts behind this were.

>
> Tab ... SHOULD be stripped, lemme check. :)

"Outright forbidden": do I see this right that this is rejected at line 2834 http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Title.php?revision=104051&view=markup#l2834 and depends on the configuration of $wgLegalTitleChars? So, is an installation allowed to define that \t was a legal title character, and if so, what happens then? (Or what would make most sense then?) Replace by underscore?
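The rejection step asked about here can be sketched in Python as a check against a configurable set of legal characters. This is only an illustration under assumptions: `DEFAULT_LEGAL` is a hypothetical value loosely modelled on a $wgLegalTitleChars-style setting, `is_legal_title` is an invented helper, and the non-ASCII range is expressed in code points rather than bytes.

```python
import re

# Hypothetical default, written as the inside of a regex character class.
# Loosely modelled on a $wgLegalTitleChars-style setting; not the real value.
DEFAULT_LEGAL = r" %!\"$&'()*,\-./0-9:;=?@A-Z\\^_`a-z~+\u0080-\U0010FFFF"


def is_legal_title(title: str, legal_chars: str = DEFAULT_LEGAL) -> bool:
    """Return True if every character of title is in the legal set."""
    illegal = re.compile("[^" + legal_chars + "]")
    return illegal.search(title) is None


print(is_legal_title("Foo_bar"))                         # plain title
print(is_legal_title("Foo\tbar"))                        # tab, default set
print(is_legal_title("Foo\tbar", DEFAULT_LEGAL + r"\t")) # tab declared legal
```

With the assumed default the tab is rejected because it is simply absent from the set. If a configuration added \t to the set, this sketch would accept the title with the tab intact, since nothing earlier folds it into an underscore; whether that is the desired outcome is exactly the open question above.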
I see a question and discussion on behavior here, but not sure if there is a valid bug in this report...