Last modified: 2007-08-19 21:29:15 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator; bug reports are now handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T5632, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 3632 - Large blacklists fail due to PCRE regex size limits
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: Spam Blacklist
Version: unspecified
Platform: All
OS: All
Priority: High
Severity: major (2 votes)
Assigned To: Nobody - You can work on this!
URL: http://meta.wikimedia.org/wiki/SpamBl...
Depends on:
Blocks:
Reported: 2005-10-06 21:11 UTC by Amir Szekely
Modified: 2007-08-19 21:29 UTC
CC List: 0 users

See Also:
Web browser: ---
Mobile Platform: ---
Huggle Beta Tester: ---


Attachments
the proposed patch (2.84 KB, patch) - 2005-10-06 21:11 UTC, Amir Szekely
Modification to avoid size limit (5.68 KB, application/x-php) - 2005-10-10 20:43 UTC, Gregory Szorc
Patch to avoid "regular expression too large" error (2.78 KB, patch) - 2006-09-06 14:03 UTC, Clodoaldo Pinto Neto

Description Amir Szekely 2005-10-06 21:11:15 UTC
The SpamBlacklist extension uses preg_match to search for bad URLs. When used
along with a big blacklist like chongqed's[1], it fails with the following
warning message:

Warning: preg_match(): Compilation failed: regular expression too large at offset 0

According to the PCRE source code used in PHP[2], the compiled pattern limit is 64 KB
(2^16 bytes). That's pretty big, but chongqed's list is over 100 KB.
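
For context, the failure is easy to reproduce with any sufficiently large
alternation; here is a minimal sketch (the terms are made up for illustration):

<?php
// Build an alternation large enough to exceed PCRE's 64 KB
// compiled-pattern limit (the terms here are invented).
$terms = array();
for ( $i = 0; $i < 10000; $i++ ) {
	$terms[] = 'spam-domain-' . $i . '\.example\.com';
}
$regex = '/http:\/\/[a-z0-9\-.]*(' . implode( '|', $terms ) . ')/Si';

// Emits "Warning: preg_match(): Compilation failed: regular expression
// too large at offset 0" and returns false rather than 0 or 1.
var_dump( preg_match( $regex, 'some wiki text' ) ); // bool(false)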

It's possible to use eregi() instead, but it's very slow. On the SourceForge
servers it even dies because of the 10-second execution time limit on medium-sized pages.

Attached is a patch that first scans the text for each blacklisted URL with a
plain substring search, and only then applies a regular expression to the hit.
The regular expression is used just to confirm that a real URL was found and
not just the bare domain name.
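
In outline, the check becomes something like this (a minimal sketch with
illustrative names, assuming plain-string blacklist entries; the real change
is in the attached patch):

<?php
// Two-phase check: cheap substring scan first, then one small regex per
// hit to confirm a full URL was matched (names are illustrative).
function findBlacklistedUrl( $text, $blacklist ) {
	foreach ( $blacklist as $badurl ) {
		// Phase 1: fast case-insensitive substring search, no regex.
		if ( stristr( $text, $badurl ) === false ) {
			continue;
		}
		// Phase 2: a small per-entry regex, confirming an actual URL
		// and not just the bare domain name appearing in the text.
		$pattern = '/https?:\/\/[a-z0-9\-.]*' . preg_quote( $badurl, '/' ) . '/i';
		if ( preg_match( $pattern, $text ) ) {
			return $badurl;
		}
	}
	return false;
}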

Thinking about this again, I realized I'm forfeiting the advantages of a regular
expression for catching bad URLs. However, as actual regular expression usage in
the blacklist is very low, I believe this is a very reasonable trade-off,
especially considering that the other possibility is not having a blacklist at all.

It might be possible to split the regular expression into several regular
expressions, each smaller than 64 KB, and try them one by one, but I haven't
tried going down that road.

[1] http://blacklist.chongqed.org/
[2]
http://chora.php.net/co.php/php-src/ext/pcre/pcrelib/pcre_internal.h?php=ea47bb18dd995a11cb26cbb56196f3a4&r=1.1#198
Comment 1 Amir Szekely 2005-10-06 21:11:54 UTC
Created attachment 943
the proposed patch
Comment 2 Gregory Szorc 2005-10-10 20:43:23 UTC
Created attachment 967
Modification to avoid size limit

Modification of original extension so that the list obtained is sliced into
segments small enough to fit into preg_match without error.
Comment 3 Amir Szekely 2005-11-10 22:22:07 UTC
Comment on attachment 943
the proposed patch

>Index: SpamBlacklist/SpamBlacklist_body.php
>===================================================================
>RCS file: /cvsroot/wikipedia/extensions/SpamBlacklist/SpamBlacklist_body.php,v
>retrieving revision 1.11
>diff -u -r1.11 SpamBlacklist_body.php
>--- SpamBlacklist/SpamBlacklist_body.php	13 Jul 2005 22:41:14 -0000	1.11
>+++ SpamBlacklist/SpamBlacklist_body.php	6 Oct 2005 22:02:30 -0000
>@@ -3,7 +3,7 @@
> if ( defined( 'MEDIAWIKI' ) ) {
> 
> class SpamBlacklist {
>-	var $regex = false;
>+	var $blacklist = false;
> 	var $previousFilter = false;
> 	var $files = array();
> 	var $warningTime = 600;
>@@ -55,11 +55,11 @@
> 		}
> 
> 
>-		if ( $this->regex === false || $recache ) {
>+		if ( $this->blacklist === false || $recache ) {
> 			if ( !$recache ) {
>-				$this->regex = $wgMemc->get( "spam_blacklist_regex" );
>+				$this->blacklist = $wgMemc->get( "spam_blacklist_blacklist" );
> 			}
>-			if ( !$this->regex ) {
>+			if ( !$this->blacklist ) {
> 				# Load lists
> 				$lines = array();
> 				wfDebug( "Constructing spam blacklist\n" );
>@@ -100,31 +100,35 @@
> 				# Strip comments and whitespace, then remove blanks
> 				$lines = array_filter( array_map( 'trim', preg_replace( '/#.*$/', '', $lines ) ) );
> 
>-				# No lines, don't make a regex which will match everything
>+				# No lines, don't make a blacklist which will match everything
> 				if ( count( $lines ) == 0 ) {
> 					wfDebug( "No lines\n" );
>-					$this->regex = true;
>+					$this->blacklist = false;
> 				} else {
>-					# Make regex
>-					# It's faster using the S modifier even though it will usually only be run once
>-					$this->regex = 'http://[a-z0-9\-.]*(' . implode( '|', $lines ) . ')';
>-					$this->regex = '/' . str_replace( '/', '\/', preg_replace('|\\\*/|', '/', $this->regex) ) . '/Si';
>+					$this->blacklist = $lines;
> 				}
>-				$wgMemc->set( "spam_blacklist_regex", $this->regex, 3600 );
>+				$wgMemc->set( "spam_blacklist_blacklist", $this->blacklist, 3600 );
> 			}
> 		}
>-		if ( $this->regex{0} == '/' ) {
>+
>+		$retVal = false;
>+
>+		if ( $this->blacklist ) {
> 			# Do the match
>-			wfDebug( "Checking text against regex: {$this->regex}\n" );
>-			if ( preg_match( $this->regex, $text, $matches ) ) {
>-				wfDebug( "Match!\n" );
>-				EditPage::spamPage( $matches[0] );
>-				$retVal = true;
>-			} else {
>-				$retVal = false;
>+			wfDebug( "Checking text against blacklist\n" );
>+
>+			foreach ( $this->blacklist as $badurl ) {
>+				if ( stristr( $text, $badurl ) ) {
>+					$pattern  = '/' . preg_quote('http', '/') . 's?' . preg_quote('://', '/');
>+					$pattern .= '[a-z0-9\-\.]*' . preg_quote($badurl, '/') . '/i';
>+					if ( preg_match( $pattern, $text ) ) {
>+						wfDebug( "Match!\n" );
>+						EditPage::spamPage( $badurl );
>+						$retVal = true;
>+						break;
>+					}
>+				}
> 			}
>-		} else {
>-			$retVal = false;
> 		}
> 
> 		wfProfileOut( $fname );
Comment 4 Vaclav Vobornik 2006-06-21 12:47:00 UTC
This patch wasn't merged into the main tree, and it no longer applies to the
current SpamBlacklist extension version, so the problem with the regex size
limit persists. Could this improvement be merged into the main tree, or could
someone update the patch, please?
Comment 5 Clodoaldo Pinto Neto 2006-09-06 14:03:35 UTC
Created attachment 2319
Patch to avoid "regular expression too large" error

This patch still uses preg_match(). To avoid the "regular expression too large"
error, it breaks the regex into pieces using array_chunk().
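
The general shape of that approach, as a sketch (function names and chunk size
are assumptions, and the escaping is simplified relative to the real extension
code; attachment 2319 is the actual patch):

<?php
// Split the blacklist so each compiled pattern stays under PCRE's
// 64 KB limit, then try the resulting regexes one by one.
function buildChunkedRegexes( $lines, $chunkSize = 1000 ) {
	$regexes = array();
	foreach ( array_chunk( $lines, $chunkSize ) as $chunk ) {
		$regex = 'http://[a-z0-9\-.]*(' . implode( '|', $chunk ) . ')';
		$regexes[] = '/' . str_replace( '/', '\/', $regex ) . '/Si';
	}
	return $regexes;
}

// Returns the first offending URL found, or false if the text is clean.
function matchBlacklist( $text, $regexes ) {
	foreach ( $regexes as $regex ) {
		if ( preg_match( $regex, $text, $matches ) ) {
			return $matches[0];
		}
	}
	return false;
}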
Comment 6 Rob Church 2007-08-19 21:29:15 UTC
Fixed in r16537, with further improvements in r19197.
