Last modified: 2013-04-30 17:33:23 UTC

Bug 14951 - Global spam blacklist should have less or no lag time before taking effect
Status: NEW
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Hardware: All All
Importance: Low enhancement with 2 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2008-07-28 00:07 UTC by Mike.lifeguard
Modified: 2013-04-30 17:33 UTC

Description Mike.lifeguard 2008-07-28 00:07:15 UTC
Currently the global blacklist at meta is cached (and the cache is invalidated every 10-15 minutes) - this allows spammers to abuse the lag time to continue spamming even after they have been caught in the act. Some method of reducing or eliminating this lag time is needed.

I am unclear why we cannot specify a database (currently a URL is specified) as the source of the global blacklist for all WMF wikis, since we host our wikis on the same set of machines. This would allow the cache to be invalidated upon edit, just as it is for the local blacklists; global blacklisting would then take effect as fast as local blacklisting.
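
The lag described above can be illustrated with a minimal Python sketch. This is not the SpamBlacklist extension's code: `TtlCache`, the simulated `page` list, and the 15-minute timeout are assumptions made for illustration. A time-based cache keeps serving the old rule list until the timeout passes, whereas an explicit invalidation on edit makes a new rule effective at once.

```python
import time


class TtlCache:
    """Caches the result of a loader for ttl seconds (hypothetical helper)."""

    def __init__(self, loader, ttl, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl
        self._clock = clock
        self._value = None
        self._expires = None  # None means nothing is cached yet

    def get(self):
        now = self._clock()
        if self._expires is None or now >= self._expires:
            self._value = self._loader()
            self._expires = now + self._ttl
        return self._value

    def invalidate(self):
        """Drop the cached value, e.g. when the blacklist page is edited."""
        self._expires = None


# Simulated blacklist page and clock so the example runs instantly.
page = ["spam\\.example"]
now = [0.0]
cache = TtlCache(lambda: list(page), ttl=15 * 60, clock=lambda: now[0])

cache.get()                      # cache filled at t = 0
page.append("pills\\.example")   # a new domain is blacklisted
now[0] = 5 * 60                  # five minutes later
stale = cache.get()              # still the old list: the lag window

cache.invalidate()               # what an on-edit hook could do
fresh = cache.get()              # the new rule is visible immediately
```

Here `stale` does not contain the new rule although it was added five minutes earlier, while `fresh` does.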
Comment 1 Siebrand Mazeland 2008-08-11 09:32:47 UTC
Product: Wikimedia. Component: General/Unknown (caching)
Comment 2 billinghurst 2011-12-04 01:52:17 UTC
Adding an aspect to management of the spam blacklist.  Currently we have

and this file is now well over 15k lines long.  Adding to it is now a slow task, and it seems that we are over-processing by maintaining just the one spam blacklist, especially as we mostly add to the file and seldom remove from it.

Would it be possible to have a longer, slower file that isn't edited very often, and a smaller, quicker file that can be updated quickly and possibly more often?  Then we could merge the smaller file into the bigger one on occasion.

I would also comment that we need quick responsive additions, whereas removals can be slower and occur when able.  Thanks.
Comment 3 Nemo 2012-08-23 22:38:56 UTC
Adding to csteipp's list at least for analysis.
Comment 4 Chris Steipp 2012-08-24 17:02:36 UTC
There are several options available:

1) The 15 minute cache only applies to remote blacklists, like the one fetched from meta on most of the other wikis. We could reduce the timeout setting, but that would directly affect performance.

2) Pulling from a database article is supported, so each wiki is free to define a database (wiki) + article that contains a spam blacklist. The article text will be read each time from a slave database, without caching. So even re-defining the blacklist location to be "DB: metawiki Spam_blacklist" instead of "" would keep the list from being cached.

3) Multiple blacklists are supported, so it would be easy for each wiki to define another wiki-specific list, or an article shared on meta, for rules that are updated quickly or frequently.

It sounds to me like you probably want to keep the cached version of meta's Spam_blacklist, so that all 15k rules that we want to keep are stored and cached. But then define another page on meta that all of the wikis can share, which will hold a small list of rules that can be updated frequently and that take effect as soon as the page is updated.
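
The two-list arrangement suggested above could look roughly like the following Python sketch. It is an illustration only, not how the extension is implemented: `compile_rules`, `is_blocked` and `read_fast_lines` are hypothetical names, and the sketch assumes that rules are regular-expression fragments, one per line, with `#` starting a comment (splitting on `#` is a simplification). The large list is compiled once and reused, while the small list is re-read on every check.

```python
import re


def compile_rules(lines):
    """Compile blacklist lines into one regex; blank lines and # comments are skipped."""
    patterns = []
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if line:
            patterns.append(line)
    if not patterns:
        return None
    return re.compile("|".join("(?:%s)" % p for p in patterns), re.IGNORECASE)


def is_blocked(url, cached_regex, read_fast_lines):
    """Check a URL against the big cached list, then the small uncached list."""
    if cached_regex is not None and cached_regex.search(url):
        return True
    fast_regex = compile_rules(read_fast_lines())  # re-read on every check
    return fast_regex is not None and bool(fast_regex.search(url))


# The slow list is compiled once; the fast list is a page that can change at any time.
slow = compile_rules(["spam\\.example  # long-standing rule", ""])
fast_page = []

before = is_blocked("http://pills.example/buy", slow, lambda: fast_page)
fast_page.append("pills\\.example")  # rule added to the small shared page
after = is_blocked("http://pills.example/buy", slow, lambda: fast_page)
```

In this sketch `before` is `False` and `after` is `True`: the rule on the small page applies on the very next check, at the cost of reading and compiling that page each time, which is why it should stay short.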
Comment 5 billinghurst 2012-08-25 07:58:52 UTC
Working with that, how about a simple measure: a quick-cache and a slow-cache blacklist, both based on the main blacklist? Once a day, append the quick list to the slow one and recache. It is generally only the very occasional removal that would occur in the blacklist, and if it takes 24 hours to clear a long-standing entry, so be it.  If we need to improve a filter, we can remove it from the old list and add it to the new one.  If we need an immediate override of the slow blacklist, then we can whitelist, not that I have seen this need.

With this scheme, I would suggest that the bulk of the data in the spam blacklist goes into another file, which would make additions a lot quicker; adding is slowish now. We would need to look at the existing tools either way, as they all search in and add to the same page.
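
The once-a-day append described above might be sketched in Python as follows. `merge_fast_into_slow` is a hypothetical helper and the rule strings are made up; how the pages would actually be read and saved is left out.

```python
def merge_fast_into_slow(slow_lines, fast_lines):
    """Return (new_slow, new_fast): fast rules appended to slow, fast list emptied.

    Rules already present in the slow list are not duplicated; order is preserved.
    """
    seen = {line.strip() for line in slow_lines}
    merged = list(slow_lines)
    for line in fast_lines:
        rule = line.strip()
        if rule and rule not in seen:
            merged.append(rule)
            seen.add(rule)
    return merged, []


slow = ["spam\\.example", "casino\\.example"]
fast = ["pills\\.example", "spam\\.example", ""]
slow, fast = merge_fast_into_slow(slow, fast)
```

After the merge, `slow` holds the two old rules followed by `pills\.example`, and `fast` is empty and ready for the next day's additions.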
