Last modified: 2009-02-27 20:51:15 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T19677, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 17677 - $wgSpamRegex should be separated into summary- and page text-regex
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Aaron Schulz
Keywords: easy
Reported: 2009-02-26 14:52 UTC by Dan Jacobson
Modified: 2009-02-27 20:51 UTC


Description Dan Jacobson 2009-02-26 14:52:20 UTC
In 1.14.0 RELEASE-NOTES we see
* $wgSpamRegex now matches the edit summary and page move descriptions in
  addition to body text.

I'm sorry, but that's absolutely crazy, reckless, irresponsible. I'm
commenting it out in EditPage.php:

# Check for spam
$match = false; #JIDANNI turning OFF!!: $match = self::matchSpamRegex( $this->summary );

Please consider, e.g.:

$wgSpamRegex=array('/^\B$/',

This regular expression is what our wiki uses to prevent vicious page
blanking. (By the way, if one triggers it, oddly the function that
usually shows the user what the problem was doesn't say anything.)

Anyway, a blanked page is bad, but a blank comment is fine!
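
A minimal sketch of why this pattern cannot tell a blanked page from a blank summary. The helper isBlocked() is hypothetical and is not MediaWiki code; it only runs the quoted pattern through PHP's preg_match(). An empty string should match, whichever field it came from, while ordinary text should not.

```php
<?php
// Hypothetical helper (not part of MediaWiki): true when the text trips the pattern.
function isBlocked(string $pattern, string $text): bool {
    return preg_match($pattern, $text) === 1;
}

// Pattern quoted in the report; it is intended to catch blanked pages.
$blankPattern = '/^\B$/';

var_dump(isBlocked($blankPattern, ''));          // blanked page, or an empty summary
var_dump(isBlocked($blankPattern, 'Some text')); // ordinary content
```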

Now let's look at another regexp we use on our sites:

'/^[^{][[:ascii:]]*$/');

This regular expression means the user's edit must have at least one
Chinese character in it, because our wikis are all zh-tw language
wikis, and a pure ASCII post is surely spam.

However, a quick English, or NULL _summary_ is very common and
accepted on our wikis.
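
As a rough illustration, the snippet below runs the simplified form of this pattern (the one given in comment 1) against a few sample strings. The sample texts and labels are invented for the example. A pure-ASCII body is caught as intended, but so is a short English or empty summary once the same pattern is applied to summaries.

```php
<?php
// Simplified pattern from comment 1: matches text made up only of ASCII bytes.
$asciiOnly = '/^[[:ascii:]]*$/';

// Invented sample inputs, labelled by the role they would play on the wiki.
$samples = [
    'body with Chinese' => '台中公車 route 5',
    'pure ASCII body'   => 'Buy cheap pills now',
    'English summary'   => 'fix typo',
    'empty summary'     => '',
];

foreach ($samples as $label => $text) {
    $blocked = preg_match($asciiOnly, $text) === 1;
    printf("%-18s %s\n", $label, $blocked ? 'blocked' : 'allowed');
}
```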

Anyway, the rash decision to glue 'edit summary', 'page move descriptions',
and 'body text' together will have users banging down my door asking why
their postings are getting rejected now! ***Please let the
administrator glue them together if he wishes:
($wgSpamRegex['edit summary'] = $wgSpamRegex['page move descriptions'] =
$wgSpamRegex['body text'];). Don't arbitrarily glue them all together
for us!***

Please instead run each one as a separate test.
You (MediaWiki team) can have an array of arrays, and just do something like the PHP
version of foreach('edit summary', 'page move description', 'body
text' as $bla){ run the matcher of $wgSpamRegex[$bla] on $get->$bla}
or however you write it in PHP, which I am poor at.
And of course you need three different MediaWiki:Spamprotectiontext
now too. And please allow us to set them in LocalSettings.php:
$wgSpamProtectionText['body text']= and the other two too. Setting
them in MediaWiki:Spamprotectiontext is a big pain when you are making
a Wiki Family.
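
The per-field test requested above could look roughly like this in PHP. Everything here is hypothetical: the array keys, the variable $spamRegexByField, and the function findSpam() are made up for illustration and are not MediaWiki settings or APIs. The patterns are the ones quoted in this report, including the /<[Aa]/ link filter mentioned further down.

```php
<?php
// Hypothetical per-field configuration; these keys are not real MediaWiki settings.
$spamRegexByField = [
    'body text'             => ['/^\B$/', '/^[[:ascii:]]*$/', '/<[Aa]/'],
    'edit summary'          => ['/<[Aa]/'],
    'page move description' => ['/<[Aa]/'],
];

/**
 * For each submitted field, record the first pattern that matches it.
 * Returns an array of field name => matching pattern; empty means no spam found.
 */
function findSpam(array $rulesByField, array $input): array {
    $hits = [];
    foreach ($rulesByField as $field => $patterns) {
        if (!array_key_exists($field, $input)) {
            continue; // this field is not part of the current action
        }
        foreach ($patterns as $pattern) {
            if (preg_match($pattern, $input[$field]) === 1) {
                $hits[$field] = $pattern;
                break;
            }
        }
    }
    return $hits;
}

// A Chinese body with a short English summary is checked field by field.
$hits = findSpam($spamRegexByField, [
    'body text'    => '新增一個公車站',
    'edit summary' => 'add stop',
]);
var_dump($hits);
```

Keeping the rules keyed by field means an administrator who wants the old behaviour can still assign the same list to all three keys.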

By the way, we also have a rule
/{{[Cc]\|\d\d\d\.\d{0,3}}}/
that I mention in Spamprotectiontext:
Radio frequencies must have at least four digits after the decimal place.

What would be neat is if each regexp could have its own optional text
that gets printed out.

Ah, you might say I should stop complaining and use this mentioned in DefaultSettings.php:
   * For a complete example, have a look at the SpamBlacklist extension.
   */
  $wgFilterCallback = false;

Well I'll have you know that I did look at it, and it is all 100 times
overkill and un-understandable gobbledygook, so sorry. It didn't help
me one bit.

Anyway, I was doing fine until you glued all the tests together.
Next time I'll test while your release candidate is fresh. Sorry I
only discovered this (glue mess) now.

By the way, I also use /<[Aa]/, which stops attempted spam links. This
regexp I wish to use in all three places: summary, body text, etc.

I.e., I cannot live for long with no summary filtering (caused by my
above commenting out), as I know it is only a matter of time before they
attack, therefore I hope you will separate the three tests (and not
just toss in some var $ignoreEditSummary), by version 1.14.1. Thank
you.
Comment 1 Dan Jacobson 2009-02-26 15:12:47 UTC
P.S. the above example should be '/^[[:ascii:]]*$/');

(No need to show the "[^{]", which is our local (
http://taizhongbus.jidanni.org/index.php?title=Template:B
http://radioscanningtw.jidanni.org/index.php?title=Template:C )
jazz, meaning it is OK to not even have one Chinese character, if one is
entering a bus stop or police frequency via these templates.)
Comment 2 Dan Jacobson 2009-02-26 15:21:08 UTC
Or maybe an even fancier array is needed
[/REGEXP1/,0,1,0,"No xyz allowed"]
[/REGEXP2/,1,1,1,null]
...
The 0,1,0 values switch the three tests on or off, followed by an optional message, which, if null, just prints the /REGEXP/ that triggered.

I.e., instead of three arrays, which will probably have a lot of duplication, use one array... OK, anything is OK, except the current gluing with no way to unglue short of hacking the source.
Comment 3 Dan Jacobson 2009-02-26 15:24:33 UTC
Hmmm, [/REGEXP2/,1,1,1,null] doesn't look too expandable for the future with more tests added. Sorry.
Maybe [/REGEXP2/,[1,1,1],null] would be better, so if a fourth test was added, older LocalSettings would still work.
(By older, I mean older than 1.14.2, but younger than 1.14.0 :-) OK, bye.)
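
A sketch of how the nested format from comments 2 and 3 might be evaluated. The flag order (summary, move description, body text), the constant FIELD_ORDER, and the function firstViolation() are assumptions made for this example; the report does not specify them. A missing flag is treated as "off", so a rule written with three flags would keep working if a fourth test were added later.

```php
<?php
// Assumed flag order; the report does not say which position maps to which test.
const FIELD_ORDER = ['edit summary', 'page move description', 'body text'];

// Hypothetical rules in the [pattern, [flags], message] shape from comment 3.
$rules = [
    ['/{{[Cc]\|\d\d\d\.\d{0,3}}}/', [0, 0, 1],
        'Radio frequencies must have at least four digits after the decimal place.'],
    ['/<[Aa]/', [1, 1, 1], null],
];

/**
 * Return the message of the first enabled rule that matches, the pattern itself
 * when the rule has no message, or null when nothing matches.
 */
function firstViolation(array $rules, string $field, string $text): ?string {
    $index = array_search($field, FIELD_ORDER, true);
    if ($index === false) {
        return null; // unknown field: nothing to check
    }
    foreach ($rules as [$pattern, $flags, $message]) {
        // empty() treats a missing or zero flag as "do not test this field".
        if (!empty($flags[$index]) && preg_match($pattern, $text) === 1) {
            return $message ?? $pattern;
        }
    }
    return null;
}

var_dump(firstViolation($rules, 'body text', '{{C|123.45}}'));
var_dump(firstViolation($rules, 'edit summary', '{{C|123.45}}'));
```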

Comment 4 Aryeh Gregor (not reading bugmail, please e-mail directly) 2009-02-26 15:54:27 UTC
You make some fairly good points.  This change ignores some fairly reasonable use-cases for the spam regex.
Comment 5 Dan Jacobson 2009-02-26 16:03:47 UTC
(In reply to comment #4)
> You make some fairly good points.  This change ignores some fairly reasonable
> use-cases for the spam regex.
Thanks. By the way, the patterns I mentioned, and no more, have kept us 100% spam free for years!

Comment 6 Platonides 2009-02-26 22:14:53 UTC
Also consider that some regexes are not applicable to the summary, so checking them there is wasted work.
IMHO a different regex for summary is the way to go.
$wgSummarySpamRegex = $wgSpamRegex; is easy enough for people who like using the same one.
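
For illustration, a LocalSettings.php fragment along the lines of this suggestion might look as follows. It assumes a separate setting named $wgSummarySpamRegex, as proposed here, that accepts the same array form the reporter uses for $wgSpamRegex; both points are assumptions rather than something this report confirms.

```php
<?php
// Body-text patterns quoted earlier in this report.
$wgSpamRegex = ['/^\B$/', '/^[[:ascii:]]*$/', '/<[Aa]/'];

// Assumed setting name, taken from the proposal in this comment.
// Option 1: summaries are only checked for attempted links.
$wgSummarySpamRegex = ['/<[Aa]/'];

// Option 2: keep summaries and body text on the same rules.
// $wgSummarySpamRegex = $wgSpamRegex;
```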
Comment 7 Aaron Schulz 2009-02-27 20:51:15 UTC
Done in r47876
