Last modified: 2011-03-13 18:04:32 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T20844, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 18844 - Make all grouping non-capturing to save on processing
Make all grouping non-capturing to save on processing
Status: RESOLVED WONTFIX
Product: MediaWiki extensions
Classification: Unclassified
Spam Blacklist (Other open bugs)
unspecified
All All
: Lowest normal (vote)
: ---
Assigned To: Nobody - You can work on this!
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2009-05-19 15:08 UTC by Mike.lifeguard
Modified: 2011-03-13 18:04 UTC (History)
5 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Test script (2.49 KB, application/x-gzip)
2009-05-19 19:04 UTC, Victor Vasiliev
Details

Description Mike.lifeguard 2009-05-19 15:08:18 UTC
To make things faster, please change any capturing groups to non-capturing ones when compiling the regex from the blacklist page. This also eliminates the need for people to know and remember to use non-capturing groups.

I can't think of any use for capturing groups in the spam blacklist at all, but it's possible I'm not imaginative enough. At the least, we're not using that in any blacklist I know of, so it can't be terribly useful.
Comment 1 Happy-melon 2009-05-19 15:47:04 UTC
Each regex is a capturing group that captures the url that triggered the list, which is passed through to and displayed in the error message. Removing these would make it impossible to resolve spam matches, because you wouldn't know which url needed to be removed. I assume you mean automatically make any capturing groups defined *within* a regex line non-capturing?  That could certainly be done.
Comment 2 Mike.lifeguard 2009-05-19 15:54:04 UTC
(In reply to comment #1)
> Each regex is a capturing group that captures the url that triggered the list,
> which is passed through to and displayed in the error message. Removing these
> would make it impossible to resolve spam matches, because you wouldn't know
> which url needed to be removed. I assume you mean automatically make any
> capturing groups defined *within* a regex line non-capturing?  That could
> certainly be done.
> 

Yes, that's what I meant :D
Comment 3 Aryeh Gregor (not reading bugmail, please e-mail directly) 2009-05-19 15:56:18 UTC
WONTFIX as "premature optimization is the root of all evil"?
Comment 4 Mike.lifeguard 2009-05-19 16:15:45 UTC
(In reply to comment #3)
> WONTFIX as "premature optimization is the root of all evil"?
> 

Seth & VVV did some testing and determined it was a significant reduction to use non-capturing groups, which is why we use them (well, that and someone was very persistent in annoying people till they did it). It'd be better to have that done by the extension than by humans. Also, then when someone next whinges at us to use non-capturing groups for optimization we can tell them to stuff it because the extension does it internally (I wonder what the next thing well be that we urgently need to do to avoid a few ms 9.9)
Comment 5 Aryeh Gregor (not reading bugmail, please e-mail directly) 2009-05-19 16:33:54 UTC
Are we talking ms or us here?  I'm guessing us.

Besides, wouldn't capturing groups be necessary for something like

(['"]).*?\1

or whatever?  They aren't only used for replacements.  This reduces functionality for at *best* some dubious microoptimization effort that has no real-world benchmarks attached.  Real-world means saying how much user or real time it measurably adds to actual page loads on a real site, not "it takes 97% less time if I make a 50M regex solely out of capturing expressions".  Also not "some people did some testing and said it was significant".

Still say WONTFIX.
Comment 6 Mike.lifeguard 2009-05-19 16:38:25 UTC
(In reply to comment #5)
> (['"]).*?\1
> 

I've never seen backreferences used in any spam blacklist.

FWIW, VVV said he used the Meta blacklist, IRRC. CCed him to verify.

But, if there's definitely no benefit, then this should be WONTFIXED and I'll stop bothering to use non-capturing groups manually too, which will be lovely :D
Comment 7 Brion Vibber 2009-05-19 16:44:39 UTC
Guys, can you do a quick test and list out the *actual* speeds with the actual blacklist regexes here? This seems like it's an easy thing to resolve empirically. :)
Comment 8 Victor Vasiliev 2009-05-19 19:04:02 UTC
Created attachment 6136 [details]
Test script

I benchmarked it (benchmarking tools attached) and got following results (on Toolserver):
grouping: 0.0158297939301 s
nongrouping: 0.0151363241673 s
Comment 9 Brion Vibber 2009-05-19 20:12:56 UTC
So that's less than a 5% difference, absolute difference of 7ms in the test case (151 vs 158ms total). The 'spambl' file seems to be missing from the test case archive, so I can't reproduce results locally.

Is functionality unchanged with the switch to nongrouping, or would this require changing other components?
Comment 10 Aryeh Gregor (not reading bugmail, please e-mail directly) 2009-05-19 22:26:36 UTC
That's less than a 5% difference doing *nothing else*.  It's *less than a millisecond*.  It would probably sink to more like 0.5% of the entire page save process, a negligible gain.  On the other hand, you're removing functionality, whether or not it may be unused at present.  Still recommend WONTFIX.

(In reply to comment #9)
> So that's less than a 5% difference, absolute difference of 7ms in the test
> case (151 vs 158ms total).

That's 15.8 vs 15.1 ms total -- a difference of 0.7 ms, or 700 us.

> The 'spambl' file seems to be missing from the test
> case archive, so I can't reproduce results locally.

It's created by running run-tests, which also runs the benchmark.  On localhost I get about the same results (~1ms difference, 5% gain).

> Is functionality unchanged with the switch to nongrouping, or would this
> require changing other components?

Functionality is reduced -- it's no longer possible to use back-references.  It also gives the Principle of Least Surprise a severe smack on the head (who would expect something to take raw regexes and then change their functionality weirdly?).
Comment 11 Brion Vibber 2009-05-19 23:24:31 UTC
Woops, got a digit wrong there. :) Yes, 0.7ms not 1ms. Looks like there's not a compelling reason to poke this then for now. Resolving WONTFIX. Reopen later if the difference becomes more compelling and there's a cleaner way to do it without potentially breaking backref use.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links