Last modified: 2014-05-18 13:06:26 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from the pages displaying bug reports and their history, links might be broken. See T16114, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 14114 - Duplication of blacklisted links already in page text is possible


Summary:	Duplication of blacklisted links already in page text is possible

Status:	NEW

Product:	MediaWiki extensions
Classification:	Unclassified
Component:	Spam Blacklist
Version:	unspecified
Hardware:	All All

Importance:	Low minor with 1 vote
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:	http://svn.wikimedia.org/viewvc/media...
Whiteboard:
Keywords:

Depends on:	1505
Blocks:

Reported:	2008-05-14 09:14 UTC by Daniel Friesen
Modified:	2014-05-18 13:06 UTC
CC List:	1 user

See Also:	65447
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Description Daniel Friesen 2008-05-14 09:14:11 UTC

In his commit Brion noted that when a spam link is already on the page, it can be duplicated and thus added in again.

I'm just posting this bug to note that behavior, and as a reminder to myself to come back, as I believe I may be able to eliminate this behavior.

Comment 1 Brion Vibber 2008-05-15 20:25:28 UTC

The trick is that we don't currently keep track of how _many_ times a given link is used on the page, either in the parser or the externallinks table. Without a count record, we can't easily track duplications.
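To illustrate the point about counts, here is a minimal sketch in Python (not the extension's actual PHP code). The function names and the example URL are hypothetical. It assumes the links of the old and new revisions are available as plain lists, and shows that a comparison without counts cannot see a duplicated link, while a counted comparison can.

```python
from collections import Counter

def new_links_by_set(before, after):
    """Links present after the edit but absent before (no counts kept)."""
    return set(after) - set(before)

def added_links_by_count(before, after):
    """Links whose number of occurrences went up, with the size of the increase."""
    # Counter subtraction keeps only positive differences.
    return Counter(after) - Counter(before)

# Hypothetical link lists for the old and new revision of a page.
before = ["http://spam.example/buy"]
after = ["http://spam.example/buy", "http://spam.example/buy"]

print(new_links_by_set(before, after))      # set()
print(added_links_by_count(before, after))  # Counter({'http://spam.example/buy': 1})
```

The set-based check reports nothing new, which is the duplication behavior this bug describes; the counted check reports one extra occurrence.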

Comment 2 Daniel Friesen 2008-05-16 08:24:10 UTC

I thought we were filtering the spam URLs with the EditFilter hook, or whatever it was named? That's basically what the hook is for.

The [[mw:Extension:ProtectSection|ProtectSection]] extension actually makes interesting use of that filter. It checks the sets of protect tags before and after the edit and makes sure that they are all still in the page with exactly the same content. This is basically what we're trying to do with the spam blacklist, except that we would use the URL regex and ensure that no spam URLs appear in the after text that don't appear in the before text. Because that approach matches each occurrence individually, extra copies of the same spam URL would be counted as extras.
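The before/after check proposed here could be sketched roughly as follows. This is Python rather than the extension's PHP, and the `BLACKLIST` pattern, the function name, and the sample text are all hypothetical. It works on raw wikitext only, so it shares the limitations raised later in the thread.

```python
import re

# Hypothetical blacklist entry, wrapped so that it captures the whole URL.
BLACKLIST = re.compile(r"https?://[^\s\]]*spam\.example[^\s\]]*", re.IGNORECASE)

def extra_spam_urls(old_text, new_text):
    """Return blacklisted URL occurrences in new_text not accounted for by old_text."""
    remaining = BLACKLIST.findall(new_text)
    # Cancel out one new occurrence for every occurrence that already existed.
    for url in BLACKLIST.findall(old_text):
        if url in remaining:
            remaining.remove(url)
    return remaining

old_text = "See [http://spam.example/buy cheap stuff]."
new_text = old_text + " Also http://spam.example/buy again."

print(extra_spam_urls(old_text, new_text))  # ['http://spam.example/buy']
print(extra_spam_urls(old_text, old_text))  # []
```

Because each occurrence is matched individually, the second copy of an already-present URL is left over and would be rejected, while an edit that leaves the existing link untouched passes.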

Comment 3 Brion Vibber 2008-05-16 18:51:53 UTC

To avoid reparsing the page fifty times or getting false negatives/false positives, we check the actual parser results.

Comment 4 Daniel Friesen 2008-05-16 19:15:46 UTC

Ok, then I'm just confused as to what Spam Blacklist is actually trying to do.

I thought all we were trying to do was stop users from saving a page with extra spam links, which the EditFilter hook would do.

Comment 5 Brion Vibber 2008-05-19 16:45:30 UTC

Define "with" -- that's the hard part! If you just do a regex on the raw text, you'll miss templates, things split over comments, bla bla bla. That's why it's become more and more complex over time, reparsing the text, then pulling data out of preparsed text to reduce the complexity and performance hit and increase reliability.
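A small sketch of the "split over comments" case, in Python and with a hypothetical blacklist entry. The comment-stripping step is only a toy stand-in for real parsing, and it assumes HTML comments are dropped before links are recognised, as the comment above implies.

```python
import re

SPAM = re.compile(r"spam\.example", re.IGNORECASE)  # hypothetical blacklist entry
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)      # HTML comments in wikitext

def matches_raw(wikitext):
    """Check the blacklist against the raw, unparsed text."""
    return bool(SPAM.search(wikitext))

def matches_after_comment_strip(wikitext):
    """Check the blacklist after removing comments (toy stand-in for parsing)."""
    return bool(SPAM.search(COMMENT.sub("", wikitext)))

text = "Visit http://spam.<!-- hidden -->example/buy today"

print(matches_raw(text))                  # False
print(matches_after_comment_strip(text))  # True
```

The raw-text regex misses the URL because the comment interrupts it, which is one reason to check parser output instead. Templates would need actual expansion and are not covered by this sketch.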

Comment 6 Mike.lifeguard 2009-03-10 03:40:05 UTC

changed summary
