Last modified: 2014-10-07 21:04:16 UTC
As seen at https://tendril.wikimedia.org/report/, we have a bunch of crawlers of various types hitting non-existent pages. We do a move/delete log query on each such page view... which is fine except when lots of these queries come in at once; they end up taking 16s to 18s each.

A possible solution is to avoid calling the LogEventList method in showMissingArticle by first checking a Bloom filter in Redis. This would be updated on the fly. Not sure how to estimate the set size needed to keep the false-hit rate down.
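For sizing, the standard Bloom filter formulas give the bit-array size and hash count directly from the expected member count and a target false-hit rate; a false hit here only means running the log query unnecessarily, so a fairly loose rate should be fine. A minimal sketch (the function name and the example numbers are illustrative, not from any merged code):

```
<?php
// m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions,
// for n expected members and a target false-hit rate p.
function bloomFilterParams( $n, $p ) {
	$m = (int)ceil( -$n * log( $p ) / ( M_LN2 * M_LN2 ) ); // filter size in bits
	$k = (int)round( ( $m / $n ) * M_LN2 ); // number of hash functions
	return array( $m, $k );
}

// E.g. ~10 million titles with move/delete log entries at a 1% false-hit
// rate needs ~96M bits (~11.4MB of Redis memory) and 7 hash functions.
list( $m, $k ) = bloomFilterParams( 10000000, 0.01 );
printf( "%d bits (%.1fMB), %d hashes\n", $m, $m / 8 / 1048576, $k );
```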
(In reply to Aaron Schulz from comment #0)
> A possible solution is to avoid calling the LogEventList method in
> showMissingArticle by first checking a Bloom filter in Redis. This would
> be updated on the fly. Not sure how to estimate the set size needed to
> keep the false-hit rate down.

Of course, a Bloom filter requires an initial scan of all of `logging`, plus add() calls for new deletions as they happen. This is problematic if the Redis server is not durable or goes down, since repopulation cannot happen on the fly. Maybe the rebuild could be automatic and batched, switching the filter on only once the rebuild completes.
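A hedged sketch of what that batched rebuild might look like, using 2014-era MediaWiki DB helpers; the `BloomFilter` type and its add()/markReady() methods are hypothetical stand-ins, not the merged API:

```
// Scan `logging` in log_id batches, add each affected title to a fresh
// filter, and only mark the filter usable once the full scan is done.
function rebuildTitleLogFilter( BloomFilter $filter ) {
	$dbr = wfGetDB( DB_SLAVE );
	$lastId = 0;
	do {
		$res = $dbr->select(
			'logging',
			array( 'log_id', 'log_namespace', 'log_title' ),
			array(
				'log_type' => array( 'delete', 'move' ),
				'log_id > ' . (int)$lastId
			),
			__METHOD__,
			array( 'ORDER BY' => 'log_id', 'LIMIT' => 1000 )
		);
		foreach ( $res as $row ) {
			$filter->add( $row->log_namespace . ':' . $row->log_title );
			$lastId = $row->log_id;
		}
	} while ( $res->numRows() == 1000 );
	$filter->markReady(); // consult the filter only once fully built
}
```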
Also, it might help to route non-user-based logging queries to all DBs rather than just db1055 (db1055's partitioning of that table by user is only needed for user-based queries, not for a title-based one like this).
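For reference, a minimal sketch of how such routing looks in $wgLBFactoryConf; the 'logpager' group name, section, hostnames, and weights here are assumptions for illustration, not the actual production config:

```
// Spread the 'logpager' query group over several replicas instead of
// pinning it to one host (names and weights below are hypothetical).
$wgLBFactoryConf['groupLoadsBySection']['s1'] = array(
	'logpager' => array(
		'db1055' => 1,
		'db10XX' => 1, // additional replicas for title-based log queries
	),
);
```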
Change 143802 merged by jenkins-bot:
Added BloomCache classes

https://gerrit.wikimedia.org/r/143802
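Presumably showMissingArticle() now consults the cache before running the log extract, along these lines; the BloomCache::get()/check() names and the 'TitleHasLogs' key are assumptions based on the proposal above, not verified against change 143802:

```
// Skip the expensive log query when the filter definitely has no entry
// for this title; a hit may be a false positive, so fall through to it.
$cache = BloomCache::get( 'main' ); // hypothetical accessor
if ( $cache->check( wfWikiId(), 'TitleHasLogs', $title->getPrefixedDBkey() ) ) {
	LogEventList::showLogExtract(
		$outputPage,
		array( 'delete', 'move' ),
		$title,
		'',
		array( 'lim' => 10, 'showIfEmpty' => false )
	);
}
```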
Deployed and populated (on enwiki, mostly automatically).