Last modified: 2011-04-14 15:13:16 UTC
A space before "»" (right-pointing double angle quotation mark) or a space after "«" (left-pointing double angle quotation mark) will be converted to a no-break space (&nbsp;). This may be appropriate for most French text, but it breaks line wrapping in languages where guillemets are used in the opposite order (»quote« instead of «quote» or « quote »). Compare http://en.wikipedia.org/wiki/Guillemets .
Agreed; e.g., the use of guillemets on the Czech Wikisource is quite problematic because of this. The conversion should be applied only if the content language is French. Or, more generally, we should probably have per-language rules. See bug #13619. See also bug #3158.
A workaround is to write the spaces as character references, e.g. text&#32;»quote«&#32;text. MediaWiki doesn't recognize &#32; as a space at the point where it replaces spaces with &nbsp;s.
Sounds like checking for word breaks should do the job reasonably well here. E.g.:

  ...quoted » outside    \s»\W -> break
  outside »quoted...     \s»\w -> no break

As long as nobody uses this form:

  outside » quoted...

in which case it would be much more difficult to distinguish which side the non-break space belongs on, requiring heuristics to try to see where the quote was started.
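The word-break check described above could look roughly like the following Python sketch. It reads "break" as "insert a no-break space", matches a literal space rather than \s, and the name `protect_closing_guillemet` is made up for illustration; none of this is MediaWiki code.

```python
import re

NBSP = "\u00a0"

def protect_closing_guillemet(text: str) -> str:
    """Turn the space in ' »' into a no-break space only when » is
    followed by a non-word character or the end of the text, i.e. when
    it looks like a closing (French-style) guillemet."""
    return re.sub(r" »(?=\W|$)", NBSP + "»", text)

print(protect_closing_guillemet("« oui » dit-il"))      # space before » is protected
print(protect_closing_guillemet("text »quote« text"))   # » opens a quote: left alone
```

The ambiguous form "outside » quoted..." is treated as a closing mark by this sketch, so the no-break space lands on the wrong side there.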
That would be better than the current behavior, assuming "break" means "nbsp" (i.e. "no break"). But it won't work for cases like "the sign »,« is a comma", citations starting or ending with an ellipsis or other punctuation (like »... text ...« or »[…] text!«), or Spanish-style »¿uh?« (but guillemets aren't common in Spanish).

It also doesn't work for most languages if the replacement operates on bytes instead of characters, as the code snippet in bug 13619 comment 3 suggests. The \w needs to match the appropriate Unicode classes.

BTW, I don't think these simple heuristics are useful at all. E.g., they cause code like <code>x = flag ? 0 : 1;</code> to be unusable after copying and break valid CSS like <span style="color : red ; background : yellow"/>.
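Both objections can be reproduced in a small Python sketch. It uses Python's `re` module only as a stand-in for a byte-oriented regex engine; the helper names are hypothetical.

```python
import re

def opens_quote_chars(text: str) -> bool:
    # str pattern: \w is Unicode-aware, so "é" counts as a word character
    return re.search(r" »(?=\w)", text) is not None

def opens_quote_bytes(text: str) -> bool:
    # bytes pattern: \w only matches ASCII letters, digits and "_";
    # \xc2\xbb is the UTF-8 encoding of »
    return re.search(rb" \xc2\xbb(?=\w)", text.encode("utf-8")) is not None

sample = "er sagte »également« dazu"
print(opens_quote_chars(sample), opens_quote_bytes(sample))

# Punctuation right after » defeats the heuristic even with Unicode-aware \w
print(opens_quote_chars("the sign »,« is a comma"))
```

In the byte variant the first byte of "é" is not a word character, so the opening guillemet is misread as a closing one.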
This should fix most occurrences in French without breaking much elsewhere:

  s/((?:[\s(]|^)«) /$1&nbsp;/
  s/ »(?=\.?\)|[.,]?(?:\s|<ref[\s>]|$))/&nbsp;»/

Should also work with raw UTF-8 bytes if « and » are written as \302\253 and \302\273.

BTW, the current code seems to have a bug: the replacement refers to \\2, but the lookahead (?=...) does not capture anything, so there is no second group.

  '/(.) (?=\\?|:|;|!|%|\\302\\273)/' => '\\1&nbsp;\\2'

should be either

  '/(.) (\\?|:|;|!|%|\\302\\273)/' => '\\1&nbsp;\\2'

or

  '/(.) (?=\\?|:|;|!|%|\\302\\273)/' => '\\1&nbsp;'
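The two corrected forms of the existing rule can be compared in Python. This is only an illustration: it assumes the replacement text is a no-break space, works on characters rather than UTF-8 bytes, and the function names are invented here.

```python
import re

NBSP = "\u00a0"

def fixed_with_capture(text: str) -> str:
    # Variant 1: capture the punctuation so that \2 really exists
    return re.sub(r"(.) (\?|:|;|!|%|»)", "\\1" + NBSP + "\\2", text)

def fixed_with_lookahead(text: str) -> str:
    # Variant 2: keep the lookahead and drop \2 from the replacement
    return re.sub(r"(.) (?=\?|:|;|!|%|»)", "\\1" + NBSP, text)

print(fixed_with_lookahead("Vraiment ?"))
print(fixed_with_lookahead("x = flag ? 0 : 1;"))  # code is altered as well
```

Python's `re` rejects the original combination (a lookahead plus a reference to group 2) with an "invalid group reference" error, so that form is not reproduced as a working function.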
I missed the common cases ''« text »'' vs »''text''«, and <ref/>s seem to be already expanded at that stage (judging by the code; I have no MediaWiki installation to test):

  s/((?:[\s(]|<[a-zA-Z]+>|^)«) /$1&nbsp;/
  s/ »(?=\.?\)|[.,]?(?:\s|<(?:\/|sup[\s>])|$))/&nbsp;»/

This also handles <blockquote>« citation »</blockquote> and similar (a line break isn't likely to occur at the beginning of a block element, but it makes a difference with text-align:justify in Unicode-compliant browsers). It doesn't handle start tags with attributes like <span style="...">« text »</span>, because that would be very expensive if done properly.

The better solution would be a configuration switch to apply these substitutions only for languages where they make sense. The only one of the current substitutions that makes some sense in most languages is s/ %/&nbsp;%/ (but it still destroys <code>x = y % z</code>).
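A rough Python transcription of the two revised substitutions, for trying them on sample input. The assumptions: the replacement is a no-break space, the substitution is applied globally to already-rendered HTML, and it works on characters rather than bytes. `french_guillemet_spaces` and `percent_rule` are hypothetical names, not MediaWiki functions.

```python
import re

NBSP = "\u00a0"

# « preceded by whitespace, "(", an attribute-less start tag, or start of text
OPENING = re.compile(r"((?:[\s(]|<[a-zA-Z]+>|^)«) ")
# » followed by ")", optional "." or ",", whitespace, an end tag, <sup, or end of text
CLOSING = re.compile(r" »(?=\.?\)|[.,]?(?:\s|<(?:/|sup[\s>])|$))")

def french_guillemet_spaces(html: str) -> str:
    html = OPENING.sub("\\1" + NBSP, html)
    return CLOSING.sub(NBSP + "»", html)

def percent_rule(text: str) -> str:
    # The one existing rule said to make sense in most languages
    return text.replace(" %", NBSP + "%")

print(french_guillemet_spaces("<blockquote>« citation »</blockquote>"))
print(french_guillemet_spaces("text »<i>quote</i>« text"))  # unchanged
print(percent_rule("x = y % z"))                            # code is altered
```

As noted above, a start tag with attributes is not recognized, so in <span style="...">« text »</span> only the space before » is converted.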