Last modified: 2014-05-18 13:06:26 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T16114, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 14114 - Duplication of blacklisted links already in page text is possible
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Spam Blacklist (Other open bugs)
Hardware: All  OS: All
Importance: Low minor with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on: 1505
Reported: 2008-05-14 09:14 UTC by Daniel Friesen
Modified: 2014-05-18 13:06 UTC (History)
1 user

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Description Daniel Friesen 2008-05-14 09:14:11 UTC
In his commit, Brion noted that when a spam link is already on the page, it can be duplicated and thus added again.

I'm just posting this bug to note that behavior, and as a reminder to myself to come back, as I believe I may be able to eliminate this behavior.
Comment 1 Brion Vibber 2008-05-15 20:25:28 UTC
The trick is that we don't currently keep track of how _many_ times a given link is used on the page, either in the parser or the externallinks table. Without a count record, we can't easily track duplications.
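The missing piece Brion describes is a per-link occurrence count. A minimal sketch of what count-based duplicate detection could look like, assuming the link lists have already been extracted from each parsed revision (the function name and inputs are hypothetical, not part of MediaWiki):

```python
from collections import Counter

def newly_added_links(before_links, after_links):
    """Return links whose occurrence count grew between revisions.

    before_links / after_links are lists of external URLs as extracted
    from the parsed old and new revisions. With plain sets (as in the
    externallinks table, which has no count column), duplicating an
    existing link would go unnoticed; counting catches it.
    """
    before = Counter(before_links)
    after = Counter(after_links)
    # Counter subtraction keeps only links whose count increased,
    # even if the link already appeared on the page before the edit.
    added = after - before
    return list(added.elements())
```

For example, duplicating a link that was already present once is reported, because its count rises from one to two, while an unchanged page reports nothing.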
Comment 2 Daniel Friesen 2008-05-16 08:24:10 UTC
I thought we were filtering the spam urls with the EditFilter hook, or whatever it was named? That's basically what the hook is for.

The [[mw:Extension:ProtectSection|ProtectSection]] extension actually makes interesting use of that filter. It compares the set of protect tags before and after the edit and makes sure they are all still present in the page with exactly the same content. This is basically what we're trying to do with the spam blacklist, except we would use the url regex and ensure that no spam urls appear in the after-text that aren't already in the before-text; and because matches are collected individually, extra spam links with the same url would be counted as extras.
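The before/after comparison Daniel describes can be sketched roughly as follows. This is only an illustration of the idea, not the extension's actual code: the regex is a made-up stand-in for the pattern the real Spam Blacklist extension builds from its blacklist pages, and the function name is hypothetical.

```python
import re

# Hypothetical blacklisted-domain pattern; the real extension compiles
# its regex from the configured blacklist pages.
SPAM_RE = re.compile(r"https?://[\w.-]*spam-domain\.example\S*")

def edit_adds_spam(before_text, after_text):
    """EditFilter-style check on the raw wikitext of the two revisions.

    Reject the edit when the new text contains more blacklisted-url
    matches than the old text did. Each match is counted individually,
    so duplicating a url that was already on the page still trips the
    filter, which is exactly the case this bug is about.
    """
    before_matches = SPAM_RE.findall(before_text)
    after_matches = SPAM_RE.findall(after_text)
    return len(after_matches) > len(before_matches)
```

As Brion points out in the next comments, matching the raw text like this misses links produced via templates or split across comments, which is why the extension moved toward checking actual parser output instead.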
Comment 3 Brion Vibber 2008-05-16 18:51:53 UTC
To avoid reparsing the page fifty times or getting false negatives/false positives, we check the actual parser results.
Comment 4 Daniel Friesen 2008-05-16 19:15:46 UTC
Ok, then I'm just confused as to what Spam Blacklist is actually trying to do.

I thought all we were trying to do was stop users from saving a page with extra spam links, which the EditFilter would do.
Comment 5 Brion Vibber 2008-05-19 16:45:30 UTC
Define "with" -- that's the hard part! If you just do a regex on the raw text, you'll miss templates, things split over comments, bla bla bla. That's why it's become more and more complex over time, reparsing the text, then pulling data out of preparsed text to reduce the complexity and performance hit and increase reliability.
Comment 6 Mike.lifeguard 2009-03-10 03:40:05 UTC
changed summary
