Last modified: 2007-08-19 21:29:15 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator; bug reports are now handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T5632, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 3632 - Large blacklists fail due to PCRE regex size limits
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: Spam Blacklist
Version: unspecified
Platform: All
OS: All
Priority: High
Severity: major (2 votes)
Assigned To: Nobody - You can work on this!
URL: http://meta.wikimedia.org/wiki/SpamBl...
Depends on:
Blocks:
Reported: 2005-10-06 21:11 UTC by Amir Szekely
Modified: 2007-08-19 21:29 UTC
CC List: 0 users

See Also:
Web browser: ---
Mobile Platform: ---
Huggle Beta Tester: ---


Attachments
the proposed patch (2.84 KB, patch) - 2005-10-06 21:11 UTC, Amir Szekely
Modification to avoid size limit (5.68 KB, application/x-php) - 2005-10-10 20:43 UTC, Gregory Szorc
Patch to avoid "regular expression too large" error (2.78 KB, patch) - 2006-09-06 14:03 UTC, Clodoaldo Pinto Neto

Description Amir Szekely 2005-10-06 21:11:15 UTC
The SpamBlacklist extension uses preg_match to search for bad URLs. When used
along with a big blacklist like chongqed's[1], it fails with the following
warning message:

Warning: preg_match(): Compilation failed: regular expression too large at offset 0

According to the PCRE source code used in PHP[2], the compiled pattern limit is 64 KB
(2^16 bytes). That's pretty big, but chongqed's list is over 100 KB.
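
For context, the failure is easy to reproduce with any sufficiently large
alternation; here is a minimal sketch (the terms are made up for illustration):

<?php
// Build an alternation large enough to exceed PCRE's 64 KB
// compiled-pattern limit (the terms here are invented).
$terms = array();
for ( $i = 0; $i < 10000; $i++ ) {
	$terms[] = 'spam-domain-' . $i . '\.example\.com';
}
$regex = '/http:\/\/[a-z0-9\-.]*(' . implode( '|', $terms ) . ')/Si';

// Emits "Warning: preg_match(): Compilation failed: regular expression
// too large at offset 0" and returns false rather than 0 or 1.
var_dump( preg_match( $regex, 'some wiki text' ) ); // bool(false)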

It's possible to use eregi() instead, but it's very slow. On the SourceForge
servers it even dies because of the 10-second execution time limit on medium-sized pages.

Attached is a patch that first scans the text for each blacklisted URL with a
plain substring search, and only then applies a regular expression to the hit.
The regular expression is used just to confirm that a real URL was found and
not just the bare domain name.
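
In outline, the check becomes something like this (a minimal sketch with
illustrative names, assuming plain-string blacklist entries; the real change
is in the attached patch):

<?php
// Two-phase check: cheap substring scan first, then one small regex per
// hit to confirm a full URL was matched (names are illustrative).
function findBlacklistedUrl( $text, $blacklist ) {
	foreach ( $blacklist as $badurl ) {
		// Phase 1: fast case-insensitive substring search, no regex.
		if ( stristr( $text, $badurl ) === false ) {
			continue;
		}
		// Phase 2: a small per-entry regex, confirming an actual URL
		// and not just the bare domain name appearing in the text.
		$pattern = '/https?:\/\/[a-z0-9\-.]*' . preg_quote( $badurl, '/' ) . '/i';
		if ( preg_match( $pattern, $text ) ) {
			return $badurl;
		}
	}
	return false;
}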

Thinking about this again, I realized I'm forfeiting the advantages of a regular
expression for catching bad URLs. However, as actual regular expression usage in
the blacklist is very low, I believe this is a very reasonable trade-off,
especially considering that the other possibility is not having a blacklist at all.

It might be possible to split the regular expression into several regular
expressions, each smaller than 64 KB, and try them one by one, but I haven't
tried going down that road.

[1] http://blacklist.chongqed.org/
[2]
http://chora.php.net/co.php/php-src/ext/pcre/pcrelib/pcre_internal.h?php=ea47bb18dd995a11cb26cbb56196f3a4&r=1.1#198
Comment 1 Amir Szekely 2005-10-06 21:11:54 UTC
Created attachment 943
the proposed patch
Comment 2 Gregory Szorc 2005-10-10 20:43:23 UTC
Created attachment 967
Modification to avoid size limit

Modification of original extension so that the list obtained is sliced into
segments small enough to fit into preg_match without error.
Comment 3 Amir Szekely 2005-11-10 22:22:07 UTC
Comment on attachment 943
the proposed patch

>Index: SpamBlacklist/SpamBlacklist_body.php
>===================================================================
>RCS file: /cvsroot/wikipedia/extensions/SpamBlacklist/SpamBlacklist_body.php,v
>retrieving revision 1.11
>diff -u -r1.11 SpamBlacklist_body.php
>--- SpamBlacklist/SpamBlacklist_body.php	13 Jul 2005 22:41:14 -0000	1.11
>+++ SpamBlacklist/SpamBlacklist_body.php	6 Oct 2005 22:02:30 -0000
>@@ -3,7 +3,7 @@
> if ( defined( 'MEDIAWIKI' ) ) {
> 
> class SpamBlacklist {
>-	var $regex = false;
>+	var $blacklist = false;
> 	var $previousFilter = false;
> 	var $files = array();
> 	var $warningTime = 600;
>@@ -55,11 +55,11 @@
> 		}
> 
> 
>-		if ( $this->regex === false || $recache ) {
>+		if ( $this->blacklist === false || $recache ) {
> 			if ( !$recache ) {
>-				$this->regex = $wgMemc->get( "spam_blacklist_regex" );
>+				$this->blacklist = $wgMemc->get( "spam_blacklist_blacklist" );
> 			}
>-			if ( !$this->regex ) {
>+			if ( !$this->blacklist ) {
> 				# Load lists
> 				$lines = array();
> 				wfDebug( "Constructing spam blacklist\n" );
>@@ -100,31 +100,35 @@
> 				# Strip comments and whitespace, then remove blanks
> 				$lines = array_filter( array_map( 'trim', preg_replace( '/#.*$/', '', $lines ) ) );
> 
>-				# No lines, don't make a regex which will match everything
>+				# No lines, don't make a blacklist which will match everything
> 				if ( count( $lines ) == 0 ) {
> 					wfDebug( "No lines\n" );
>-					$this->regex = true;
>+					$this->blacklist = false;
> 				} else {
>-					# Make regex
>-					# It's faster using the S modifier even though it will usually only be run once
>-					$this->regex = 'http://[a-z0-9\-.]*(' . implode( '|', $lines ) . ')';
>-					$this->regex = '/' . str_replace( '/', '\/', preg_replace('|\\\*/|', '/', $this->regex) ) . '/Si';
>+					$this->blacklist = $lines;
> 				}
>-				$wgMemc->set( "spam_blacklist_regex", $this->regex, 3600 );
>+				$wgMemc->set( "spam_blacklist_blacklist", $this->blacklist, 3600 );
> 			}
> 		}
>-		if ( $this->regex{0} == '/' ) {
>+
>+		$retVal = false;
>+
>+		if ( $this->blacklist ) {
> 			# Do the match
>-			wfDebug( "Checking text against regex: {$this->regex}\n" );
>-			if ( preg_match( $this->regex, $text, $matches ) ) {
>-				wfDebug( "Match!\n" );
>-				EditPage::spamPage( $matches[0] );
>-				$retVal = true;
>-			} else {
>-				$retVal = false;
>+			wfDebug( "Checking text against blacklist\n" );
>+
>+			foreach ( $this->blacklist as $badurl ) {
>+				if ( stristr( $text, $badurl ) ) {
>+					$pattern  = '/' . preg_quote('http', '/') . 's?' . preg_quote('://', '/');
>+					$pattern .= '[a-z0-9\-\.]*' . preg_quote($badurl, '/') . '/i';
>+					if ( preg_match( $pattern, $text ) ) {
>+						wfDebug( "Match!\n" );
>+						EditPage::spamPage( $badurl );
>+						$retVal = true;
>+						break;
>+					}
>+				}
> 			}
>-		} else {
>-			$retVal = false;
> 		}
> 
> 		wfProfileOut( $fname );
Comment 4 Vaclav Vobornik 2006-06-21 12:47:00 UTC
This patch wasn't merged into the main tree, and it no longer applies to the
current SpamBlacklist extension version, so the problem with the regex size
limit persists. Could this improvement be merged into the main tree, or could
someone update the patch, please?
Comment 5 Clodoaldo Pinto Neto 2006-09-06 14:03:35 UTC
Created attachment 2319
Patch to avoid "regular expression too large" error

This patch still uses preg_match(). To avoid the "regular expression too large"
error, it breaks the regex into pieces using array_chunk().
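
The general shape of that approach, as a sketch (function names and chunk size
are assumptions, and the escaping is simplified relative to the real extension
code; attachment 2319 is the actual patch):

<?php
// Split the blacklist so each compiled pattern stays under PCRE's
// 64 KB limit, then try the resulting regexes one by one.
function buildChunkedRegexes( $lines, $chunkSize = 1000 ) {
	$regexes = array();
	foreach ( array_chunk( $lines, $chunkSize ) as $chunk ) {
		$regex = 'http://[a-z0-9\-.]*(' . implode( '|', $chunk ) . ')';
		$regexes[] = '/' . str_replace( '/', '\/', $regex ) . '/Si';
	}
	return $regexes;
}

// Returns the first offending URL found, or false if the text is clean.
function matchBlacklist( $text, $regexes ) {
	foreach ( $regexes as $regex ) {
		if ( preg_match( $regex, $text, $matches ) ) {
			return $matches[0];
		}
	}
	return false;
}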
Comment 6 Rob Church 2007-08-19 21:29:15 UTC
Fixed in r16537, with further improvements in r19197.
