Last modified: 2007-08-19 21:29:15 UTC
The SpamBlacklist extension uses preg_match() to search for bad URLs. When used with a big blacklist like chongqed's[1], it fails with the following warning:

Warning: preg_match(): Compilation failed: regular expression too large at offset 0

According to the PCRE source used in PHP[2], the pattern size limit is 64 KB (2^16 bytes). That's fairly large, but chongqed's list is over 100 KB. It's possible to use eregi() instead, but it's very slow; on the SourceForge servers it even dies on medium-sized pages because of the 10-second execution time limit.

Attached is a patch that first looks for each bad URL with a plain substring search, without using a regular expression, and only then applies a regular expression to the candidate it found. The regular expression is used just to make sure a real URL was found, and not merely the bare domain name.

Thinking about this again, I realized I'm forfeiting the advantages of a regular expression for catching bad URLs. However, since actual regular-expression usage in the blacklist is very low, I believe this is a very reasonable trade-off, especially considering that the alternative is having no blacklist at all.

It might also be possible to split the regular expression into several smaller regular expressions, each under 64 KB, and try them one by one, but I haven't tried going down that road.

[1] http://blacklist.chongqed.org/
[2] http://chora.php.net/co.php/php-src/ext/pcre/pcrelib/pcre_internal.h?php=ea47bb18dd995a11cb26cbb56196f3a4&r=1.1#198
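For illustration, here is a minimal sketch of the substring-first approach described above (the function name and signature are hypothetical, not taken from the attached patch):

<?php
// Hypothetical sketch: scan $text for each blacklisted domain with a cheap
// case-insensitive substring search, and only build a small per-entry regex
// to confirm the hit really is part of a URL and not just the domain name.
function findBlacklistedUrl( $text, $blacklist ) {
    foreach ( $blacklist as $badDomain ) {
        // stristr() is fast and has no pattern-size limit.
        if ( stristr( $text, $badDomain ) !== false ) {
            // One small regex per candidate stays far below PCRE's 64 KB limit.
            $pattern = '/https?:\/\/[a-z0-9\-.]*' . preg_quote( $badDomain, '/' ) . '/i';
            if ( preg_match( $pattern, $text, $matches ) ) {
                return $matches[0]; // the offending URL
            }
        }
    }
    return false;
}
?>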
Created attachment 943: the proposed patch
Created attachment 967: modification to avoid the size limit

Modification of the original extension so that the list obtained is sliced into segments small enough to fit into preg_match() without error.
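A minimal sketch of that slicing idea, assuming entries are packed by accumulated length (the helper name and the 60000-byte threshold are illustrative, not taken from the attachment):

<?php
// Hypothetical sketch: pack blacklist lines into alternation groups whose
// regex source stays safely below PCRE's 64 KB compiled-pattern limit.
function buildRegexSegments( $lines, $maxBytes = 60000 ) {
    $segments = array();
    $current = array();
    $size = 0;
    foreach ( $lines as $line ) {
        if ( $size + strlen( $line ) > $maxBytes && count( $current ) > 0 ) {
            $segments[] = '/https?:\/\/[a-z0-9\-.]*(' . implode( '|', $current ) . ')/Si';
            $current = array();
            $size = 0;
        }
        $current[] = $line;
        $size += strlen( $line ) + 1; // +1 for the '|' separator
    }
    if ( count( $current ) > 0 ) {
        $segments[] = '/https?:\/\/[a-z0-9\-.]*(' . implode( '|', $current ) . ')/Si';
    }
    return $segments; // each segment is then passed to preg_match() in turn
}
?>

This sketch assumes the lines are already valid regex fragments with '/' escaped, as the extension's loader normally ensures.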
Comment on attachment 943: the proposed patch

>Index: SpamBlacklist/SpamBlacklist_body.php
>===================================================================
>RCS file: /cvsroot/wikipedia/extensions/SpamBlacklist/SpamBlacklist_body.php,v
>retrieving revision 1.11
>diff -u -r1.11 SpamBlacklist_body.php
>--- SpamBlacklist/SpamBlacklist_body.php 13 Jul 2005 22:41:14 -0000 1.11
>+++ SpamBlacklist/SpamBlacklist_body.php 6 Oct 2005 22:02:30 -0000
>@@ -3,7 +3,7 @@
> if ( defined( 'MEDIAWIKI' ) ) {
>
> class SpamBlacklist {
>- var $regex = false;
>+ var $blacklist = false;
> var $previousFilter = false;
> var $files = array();
> var $warningTime = 600;
>@@ -55,11 +55,11 @@
> }
>
>
>- if ( $this->regex === false || $recache ) {
>+ if ( $this->blacklist === false || $recache ) {
> if ( !$recache ) {
>- $this->regex = $wgMemc->get( "spam_blacklist_regex" );
>+ $this->blacklist = $wgMemc->get( "spam_blacklist_blacklist" );
> }
>- if ( !$this->regex ) {
>+ if ( !$this->blacklist ) {
> # Load lists
> $lines = array();
> wfDebug( "Constructing spam blacklist\n" );
>@@ -100,31 +100,35 @@
> # Strip comments and whitespace, then remove blanks
> $lines = array_filter( array_map( 'trim', preg_replace( '/#.*$/', '', $lines ) ) );
>
>- # No lines, don't make a regex which will match everything
>+ # No lines, don't make a blacklist which will match everything
> if ( count( $lines ) == 0 ) {
> wfDebug( "No lines\n" );
>- $this->regex = true;
>+ $this->blacklist = false;
> } else {
>- # Make regex
>- # It's faster using the S modifier even though it will usually only be run once
>- $this->regex = 'http://[a-z0-9\-.]*(' . implode( '|', $lines ) . ')';
>- $this->regex = '/' . str_replace( '/', '\/', preg_replace('|\\\*/|', '/', $this->regex) ) . '/Si';
>+ $this->blacklist = $lines;
> }
>- $wgMemc->set( "spam_blacklist_regex", $this->regex, 3600 );
>+ $wgMemc->set( "spam_blacklist_blacklist", $this->blacklist, 3600 );
> }
> }
>- if ( $this->regex{0} == '/' ) {
>+
>+ $retVal = false;
>+
>+ if ( $this->blacklist ) {
> # Do the match
>- wfDebug( "Checking text against regex: {$this->regex}\n" );
>- if ( preg_match( $this->regex, $text, $matches ) ) {
>- wfDebug( "Match!\n" );
>- EditPage::spamPage( $matches[0] );
>- $retVal = true;
>- } else {
>- $retVal = false;
>+ wfDebug( "Checking text against blacklist\n" );
>+
>+ foreach ( $this->blacklist as $badurl ) {
>+ if ( stristr( $text, $badurl ) ) {
>+ $pattern = '/' . preg_quote('http', '/') . 's?' . preg_quote('://', '/');
>+ $pattern .= '[a-z0-9\-\.]*' . preg_quote($badurl, '/') . '/i';
>+ if ( preg_match( $pattern, $text ) ) {
>+ wfDebug( "Match!\n" );
>+ EditPage::spamPage( $badurl );
>+ $retVal = true;
>+ break;
>+ }
>+ }
> }
>- } else {
>- $retVal = false;
> }
>
> wfProfileOut( $fname );
This patch was never merged into the main tree, and it no longer applies to the current SpamBlacklist extension version. The problem with the oversized regex persists. Could this improvement be merged into the main tree, or could someone update the patch, please?
Created attachment 2319: patch to avoid the "regular expression too large" error

This patch still uses preg_match(). To avoid the "regular expression too large" error, it breaks the regex up using array_chunk().
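For reference, a minimal sketch of what an array_chunk()-based check could look like (the chunk size of 1000 and the variable names are assumptions, not taken from the attachment):

<?php
// Hypothetical sketch: split the entries into fixed-size chunks and build
// one regex per chunk, so no single compiled pattern exceeds PCRE's limit.
$retVal = false;
foreach ( array_chunk( $lines, 1000 ) as $chunk ) {
    $regex = '/https?:\/\/[a-z0-9\-.]*(' . implode( '|', $chunk ) . ')/Si';
    if ( preg_match( $regex, $text, $matches ) ) {
        $retVal = $matches[0]; // first offending URL found
        break;
    }
}
?>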
Fixed in r16537, with further improvements in r19197.