Last modified: 2010-05-15 15:33:46 UTC
MagicWord.php contains code for using "$1" in magic word definitions to capture a variable - as currently used by "$1px", which sets the rescaled width of an image. However, on investigation this seems to work only by a couple of lucky coincidences: the regex will not match correctly if there is more than one synonym, and the part returned is the whole string not the variable part (e.g. "100px" not "100"; PHP just happens to treat this as "100" if you ask it to turn it into an integer). After much gnashing of teeth and trial-and-error, I've come up with an implementation that works as originally intended. My patch will follow shortly.
Created attachment 92 [details] patch to fix broken regexes OK, here's my fix, diffed against the newest HEAD I could access on cvs.defau.lt, which included what looked like a special-case pseudo-fix for the same problem. I spent hours figuring out the best regexes for this, so I think this should be a more complete fix than that. Note that the patch applies cleanly, but I haven't actually tested it for a while, because I broke my test-install a few days ago. I see no reason for it to have become broken in the intervening time, however. This patch also disables the use of Title:legalChars() as a class for variable arguments, because I don't see why this should be the case for every conceivable use of magic word variables. If this check is necessary for a particular case, it should surely happen elsewhere - either before the MagicWord is checked, or after the variable has been returned.
Fixed in CVS HEAD. The patch given above would re-introduce the bug fixed by http://cvs.sourceforge.net/viewcvs.py/wikipedia/phase3/includes/MagicWord.php?rev=1.27&view=log#rev1.26
OK, that was an oversight on my part: my version creates a regex of the form "^foo$|^bor$" for the startToEndRegex; we could easily do the same for RegexStart. Alternatively, we could probably use "(?: ... )" rather than "( ... )" to group the alternatives but avoid creating an unnecessary extra captured match. I'll make an updated patch that does that. The main reason for creating the patch, however, remains (as far as I can see): if there are multiple synonyms for a magic word with a $1 capture, only the first will work correctly. This is because each synonym creates an extra captured value, hence the use of array_values(array_filter()) to remove empty matches. e.g. in the regex "/foo=(.*?)|bar=(.*?)/" there are 2 different captures * matching against the string "foo=100" would produce an array with $m[0]="foo=100" and $m[1]="100" (as expected) * but matching against "bar=100" would produce $m[0]="foo=100", $m[1]="", $m[2]="100", since the first set of brackets, after "foo=" was never matched
Created attachment 100 [details] patch that also deals with regexStart appropriately With the "(?: ... )" approach, we don't even need to do the str_replace() more than once, since mVariableStartToEndRegex is just mVariableRegex with anchors placed around it. Again, I'm afraid I can't test this at the moment, but I'm pretty sure this will fix *all* the bugs.
(In reply to comment #4) > With the "(?: ... )" approach, we don't even need to do the str_replace() more > than once, since mVariableStartToEndRegex is just mVariableRegex with anchors > placed around it. $this->mRegex = "/{$this->mBaseRegex}/{$case}"; ... $this->mVariableRegex = str_replace( '\$1', '(.*?)', $this->mRegex ); $this->mVariableStartToEndRegex = "/^(?:{$this->mVariableRegex})$/{$case}"; Mistakes like this love to creep into untested patches. ;) Also, I don't think any magic word contains a character not allowed in a title, so using "." rather than Title::legalChars() is asking for trouble for no good reason.
(In reply to comment #5) > $this->mRegex = "/{$this->mBaseRegex}/{$case}"; > ... > $this->mVariableRegex = str_replace( '\$1', '(.*?)', $this->mRegex ); > $this->mVariableStartToEndRegex = "/^(?:{$this->mVariableRegex})$/{$case}"; > > Mistakes like this love to creep into untested patches. ;) Ah, foo! I really must get that test install fixed, mustn't I. *sigh* Reading the code properly would have helped, too. Sorry about that. Of course, arguably, we should do the substitution right back at mBaseRegex anyway, since we never actually want to match a literal "$1". I won't post another untested patch, though, because who knows what I'll break next! :-/ > Also, I don't think any magic word contains a character not allowed in a > title, so using "." rather than Title::legalChars() is asking for trouble > for no good reason. The (.*?) / ([$variableclass]*?) doesn't apply to the magic words themselves, it applies to what $1 is recognised as. At the moment, there's only one magic word that *uses* $1, which is "$1px", but surely the point of the system is to be reusable, not defined for one specific instance. Confining the variable to Title::legalChars() just seemed to me a bit arbitrary: the one use we have so far should actually only contain ([0-9x]*?) - more specifically, the variable should match "/^([0-9]*)(?:x([0-9]*))?$/" - but quite rightly the code for checking that comes *after* the call to MagicWord. Of course, currently, we have nothing that breaks because of this either, but for instance a full implementation for bug 539 would need variables that could contain anything an internal or external link might contain, including things like '#', '+', etc (it was that bug that first got me hacking MagicWord in the first place). As far as I can see, a call to legalChars() gains us nothing, since it's far simpler to match (.*?) and check for validity on a case-by-case basis.
Created attachment 127 [details] fully tested patch OK, here at last is a properly tested patch that fixes the behaviour properly, using "(?:)" and actually guaranteeing that matched variables are returned. The most controversial change is eliminating $variable_class, but like I say, I really don't see why this check belongs here, rather than being made more thoroughly and specifically later in processing of each MagicWord. I even ran maintenance/parserTests.php, and only got unrelated FAILs. That's not to say I haven't made any more stupid mistakes, but I can't think what they'd be. And if I haven't, I would appreciate it if someone could commit this for me; thanks.
Patch applied to CVS HEAD. This fixes the cyrillic form of 'px' for eg [[image:foo|150пкс]] in Russian and some other localizations.