Last modified: 2006-06-22 20:11:32 UTC
This has been planned for a while but there was no bug for it. See whitelist-related material on http://meta.wikimedia.org/wiki/Spam_blacklist
Make that http://meta.wikimedia.org/wiki/Talk:Spam_blacklist, duh
Can't this already be done with regular expressions?
Not as far as I know.
Couldn't the huge spam list be broken up per domain, for much faster lookup? As far as I know, the valid TLDs are strictly limited and well known (their list is published by ICANN). So invalid TLDs (including commercial pseudo-TLDs that have not been approved by ICANN and that use separate DNS systems, or that require a client-side DNS patch like NewNet, which is most often stealing privacy, i.e. spyware) can be eliminated immediately. Keep just the ICANN list.

Then break the spam list up per valid TLD; that will also ease its management as the list grows huge. Each TLD list should also come in two parts: one using simple string equality (scanned first, sorted alphabetically for fast lookup), and a final section using regexps (regexps require too much memory on the server). For efficient lookup, it would be useful to reverse the order of the domain name parts: www.xyz.com becomes com.xyz.www, which is then split into physical file folders (or virtual ones in memory using arrays) if there are multiple exclusions: com/ xyz/ www. For example:

    blacklist = array( 0,          // block all other non-ICANN TLDs
        com => array( 1,           // pass all .com by default
            xyz => array( 1,       // pass "xyz.com" except the following subdomains:
                www => 0,          // block this host and its subdomains
                // the other hosts in ".xyz.com" pass as set in the parent rule
            ),
            spamsite => 0,         // block this domain and all its subdomains
            // other simple xxx.com block rules come here...
            "*" => array( 1,       // using regexps, pass by default
                "[a-z][0-9]{5,}" => 0, // block <letter><digits>.com with 5 digits or more
            ),
        ),
        net => array( 1,           // pass all .net by default
            // block rules for .net come here
        ),
        org => array( 1,           // pass all .org by default
            // block rules for .org come here
        ),
        de => array( 1,            // pass all .de by default
            // block rules for .de come here
        ),
        fr => array( 1,            // pass all .fr by default
            // block rules for .fr come here
        ),
        // other accepted TLDs come here...
    );

The match can then be performed by a simple table lookup, consuming one domain name part at a time (as sketched below):

* if the value is an integer, it gives the blocking rule (0 = block, 1 = pass) for the current domain and all its subdomains;
* if the value is an array, its first entry at index 0 gives the default blocking rule, and the other entries contain further domain name parts to scan for exceptions;
* if there's no entry for the scanned domain name part in the array, look for a "*" entry; if one exists, use regexp matching against its list, from first to last, to get the blocking rule.

This greatly reduces the use of regexps. The array above can easily be built by reading and parsing, once, a text file where these rules are summarized and maintained.
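A minimal, untested PHP sketch of that lookup, assuming the nested-array format above with the domain parts and patterns stored as quoted string keys (the function name isBlocked and the variable names are invented for illustration):

    // Returns true if $host is blocked, false if it passes.
    // $rules is a nested array: an integer (0 = block, 1 = pass) or
    // array( defaultRule, part => subrule, ..., "*" => regexp section ).
    function isBlocked( $host, $rules ) {
        // Reverse the dotted parts: "www.xyz.com" -> array( "com", "xyz", "www" )
        $parts = array_reverse( explode( '.', strtolower( $host ) ) );
        $node = $rules;
        foreach ( $parts as $part ) {
            if ( !is_array( $node ) ) {
                break;                    // an integer rule also covers all subdomains
            }
            if ( isset( $node[$part] ) ) {
                $node = $node[$part];     // exact match on this name part
            } elseif ( isset( $node['*'] ) ) {
                // No exact entry: scan the regexp section, first match wins
                $section = $node['*'];
                $node = $section[0];      // default rule of the "*" section
                foreach ( $section as $pattern => $rule ) {
                    if ( $pattern === 0 ) {
                        continue;         // skip the default-rule slot
                    }
                    if ( preg_match( '/^' . $pattern . '$/', $part ) ) {
                        $node = $rule;    // first matching regexp wins
                        break;
                    }
                }
            } else {
                $node = $node[0];         // no exception listed: default rule applies
            }
        }
        return ( is_array( $node ) ? $node[0] : $node ) === 0;
    }

With the example table above (keys quoted as strings), isBlocked( 'www.xyz.com', $blacklist ) returns true, isBlocked( 'mail.xyz.com', $blacklist ) returns false, and isBlocked( 'a12345.com', $blacklist ) returns true via the "*" regexp section.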
I've implemented a whitelist in r14912. It's editable by local admins at MediaWiki:Spam-whitelist, and is in the same format as the blacklist page.
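For illustration only (these domains are invented): since the whitelist page uses the same format as the blacklist, everything from a "#" to the end of a line is a comment, and each remaining line is a regex fragment matched against URLs being added, so a MediaWiki:Spam-whitelist page might contain:

    # Overrides for sites caught by the global blacklist:
    example-goodsite\.com
    docs\.example\.org/manual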