Last modified: 2010-05-15 15:33:46 UTC

Wikimedia Bugzilla is closed!

Wikimedia has migrated from Bugzilla to Phabricator. Bug reports should be created and updated in Wikimedia Phabricator instead. Please create an account in Phabricator and add your Bugzilla email address to it.
Wikimedia Bugzilla is read-only. If you try to edit or create any bug report in Bugzilla you will be shown an intentional error message.
In order to access the Phabricator task corresponding to a Bugzilla report, just remove "static-" from its URL.
You could still run searches in Bugzilla or access your list of votes but bug reports will obviously not be up-to-date in Bugzilla.
Bug 688 - MagicWord.php uses incorrect regex and logic for "$1" variable capturing
MagicWord.php uses incorrect regex and logic for "$1" variable capturing
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
General/Unknown (Other open bugs)
1.4.x
All All
: Normal normal with 1 vote (vote)
: ---
Assigned To: Rowan Collins [IMSoP]
: patch
Depends on:
Blocks: 539
  Show dependency treegraph
 
Reported: 2004-10-11 16:55 UTC by Rowan Collins [IMSoP]
Modified: 2010-05-15 15:33 UTC (History)
1 user (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
patch to fix broken regexes (1.98 KB, patch)
2004-10-11 20:32 UTC, Rowan Collins [IMSoP]
Details
patch that also deals with regexStart appropriately (1.83 KB, patch)
2004-10-12 19:50 UTC, Rowan Collins [IMSoP]
Details
fully tested patch (2.04 KB, patch)
2004-11-16 20:30 UTC, Rowan Collins [IMSoP]
Details

Description Rowan Collins [IMSoP] 2004-10-11 16:55:46 UTC
MagicWord.php contains code for using "$1" in magic word definitions to capture
a variable - as currently used by "$1px", which sets the rescaled width of an
image. However, on investigation this seems to work only by a couple of lucky
coincidences: the regex will not match correctly if there is more than one
synonym, and the part returned is the whole string not the variable part (e.g.
"100px" not "100"; PHP just happens to treat this as "100" if you ask it to turn
it into an integer). After much gnashing of teeth and trial-and-error, I've come
up with an implementation that works as originally intended.

My patch will follow shortly.
Comment 1 Rowan Collins [IMSoP] 2004-10-11 20:32:07 UTC
Created attachment 92 [details]
patch to fix broken regexes

OK, here's my fix, diffed against the newest HEAD I could access on
cvs.defau.lt, which included what looked like a special-case pseudo-fix for the
same problem. I spent hours figuring out the best regexes for this, so I think
this should be a more complete fix than that.

Note that the patch applies cleanly, but I haven't actually tested it for a
while, because I broke my test-install a few days ago. I see no reason for it
to have become broken in the intervening time, however.

This patch also disables the use of Title:legalChars() as a class for variable
arguments, because I don't see why this should be the case for every
conceivable use of magic word variables. If this check is necessary for a
particular case, it should surely happen elsewhere - either before the
MagicWord is checked, or after the variable has been returned.
Comment 2 JeLuF 2004-10-11 22:32:07 UTC
Fixed in CVS HEAD.

The patch given above would re-introduce the bug fixed by
http://cvs.sourceforge.net/viewcvs.py/wikipedia/phase3/includes/MagicWord.php?rev=1.27&view=log#rev1.26
Comment 3 Rowan Collins [IMSoP] 2004-10-12 19:36:53 UTC
OK, that was an oversight on my part: my version creates a regex of the form
"^foo$|^bor$" for the startToEndRegex; we could easily do the same for
RegexStart. Alternatively, we could probably use "(?: ... )" rather than "( ...
)" to group the alternatives but avoid creating an unnecessary extra captured
match. I'll make an updated patch that does that.

The main reason for creating the patch, however, remains (as far as I can see):
if there are multiple synonyms for a magic word with a $1 capture, only the
first will work correctly. This is because each synonym creates an extra
captured value, hence the use of array_values(array_filter()) to remove empty
matches.

e.g. in the regex "/foo=(.*?)|bar=(.*?)/" there are 2 different captures
* matching against the string "foo=100" would produce an array with
$m[0]="foo=100" and $m[1]="100" (as expected)
* but matching against "bar=100" would produce $m[0]="foo=100", $m[1]="",
$m[2]="100", since the first set of brackets, after "foo=" was never matched
Comment 4 Rowan Collins [IMSoP] 2004-10-12 19:50:25 UTC
Created attachment 100 [details]
patch that also deals with regexStart appropriately

With the "(?: ... )" approach, we don't even need to do the str_replace() more
than once, since mVariableStartToEndRegex is just mVariableRegex with anchors
placed around it. 

Again, I'm afraid I can't test this at the moment, but I'm pretty sure this
will fix *all* the bugs.
Comment 5 Wil Mahan 2004-10-13 05:55:42 UTC
(In reply to comment #4)
> With the "(?: ... )" approach, we don't even need to do the str_replace() more
> than once, since mVariableStartToEndRegex is just mVariableRegex with anchors
> placed around it. 

 $this->mRegex = "/{$this->mBaseRegex}/{$case}";
 ...
 $this->mVariableRegex = str_replace( '\$1', '(.*?)', $this->mRegex );
 $this->mVariableStartToEndRegex = "/^(?:{$this->mVariableRegex})$/{$case}";

Mistakes like this love to creep into untested patches. ;)

Also, I don't think any magic word contains a character not allowed in a
title, so using "." rather than Title::legalChars() is asking for trouble
for no good reason.
Comment 6 Rowan Collins [IMSoP] 2004-10-13 11:38:43 UTC
(In reply to comment #5)
>  $this->mRegex = "/{$this->mBaseRegex}/{$case}";
>  ...
>  $this->mVariableRegex = str_replace( '\$1', '(.*?)', $this->mRegex );
>  $this->mVariableStartToEndRegex = "/^(?:{$this->mVariableRegex})$/{$case}";
> 
> Mistakes like this love to creep into untested patches. ;)

Ah, foo! I really must get that test install fixed, mustn't I. *sigh* Reading
the code properly would have helped, too. Sorry about that. Of course, arguably,
we should do the substitution right back at mBaseRegex anyway, since we never
actually want to match a literal "$1". I won't post another untested patch,
though, because who knows what I'll break next! :-/

> Also, I don't think any magic word contains a character not allowed in a
> title, so using "." rather than Title::legalChars() is asking for trouble
> for no good reason.

The (.*?) / ([$variableclass]*?) doesn't apply to the magic words themselves, it
applies to what $1 is recognised as. At the moment, there's only one magic word
that *uses* $1, which is "$1px", but surely the point of the system is to be
reusable, not defined for one specific instance. Confining the variable to
Title::legalChars() just seemed to me a bit arbitrary: the one use we have so
far should actually only contain ([0-9x]*?) - more specifically, the variable
should match "/^([0-9]*)(?:x([0-9]*))?$/" - but quite rightly the code for
checking that comes *after* the call to MagicWord. 

Of course, currently, we have nothing that breaks because of this either, but
for instance a full implementation for bug 539 would need variables that could
contain anything an internal or external link might contain, including things
like '#', '+', etc (it was that bug that first got me hacking MagicWord in the
first place). As far as I can see, a call to legalChars() gains us nothing,
since it's far simpler to match (.*?) and check for validity on a case-by-case
basis.
Comment 7 Rowan Collins [IMSoP] 2004-11-16 20:30:21 UTC
Created attachment 127 [details]
fully tested patch

OK, here at last is a properly tested patch that fixes the behaviour properly,
using "(?:)" and actually guaranteeing that matched variables are returned. The
most controversial change is eliminating $variable_class, but like I say, I
really don't see why this check belongs here, rather than being made more
thoroughly and specifically later in processing of each MagicWord.

I even ran maintenance/parserTests.php, and only got unrelated FAILs. That's
not to say I haven't made any more stupid mistakes, but I can't think what
they'd be. And if I haven't, I would appreciate it if someone could commit this
for me; thanks.
Comment 8 Brion Vibber 2005-05-04 07:51:08 UTC
Patch applied to CVS HEAD.

This fixes the cyrillic form of 'px' for eg [[image:foo|150пкс]] in Russian and some other 
localizations.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links