Last modified: 2014-05-18 13:06:26 UTC
In his commit Brion noted that when a spam link is already on the page, it can be duplicated, i.e. added in again, without being caught.
I'm just posting this bug to note that behavior, and as a reminder to myself to come back, as I believe I may be able to eliminate this behavior.
The trick is that we don't currently keep track of how _many_ times a given link is used on the page, either in the parser or the externallinks table. Without a count record, we can't easily track duplications.
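To make the counting idea concrete, here's a minimal sketch (in Python, with hypothetical names; the real code would live in the parser/externallinks layer) of what a per-URL count record would look like:

```python
from collections import Counter

def count_external_links(links):
    """Count how many times each external link appears on a page.

    `links` stands in for the flat list of URLs the parser extracted;
    the externallinks table only records *which* URLs appear, so this
    per-URL count is the piece that is currently missing.
    """
    return Counter(links)

# Example: the same spam URL used twice on one page.
counts = count_external_links([
    "http://example.com/spam",
    "http://example.org/ok",
    "http://example.com/spam",
])
# counts["http://example.com/spam"] is 2, so a duplication is detectable.
```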
I thought we were filtering the spam urls with the EditFilter hook, or whatever it was named? That's basically what the hook is for.
The [[mw:Extension:ProtectSection|ProtectSection]] extension actually makes interesting use of that filter. It compares the set of protect tags before and after the edit and makes sure they are all still inside the page with exactly the same content. That's basically what we're trying to do with the spam blacklist, except we would use the url regex and ensure there are no spam urls in the after set that don't appear in the before set. And because the check matches individual occurrences, extra copies of a spam url that was already on the page would still be counted as extras.
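Sketching that before/after comparison (again Python, hypothetical names; `is_spam` stands in for the blacklist regex match): treating the link lists as multisets means a duplicated spam url registers as new even when the url itself was already present.

```python
from collections import Counter

def new_spam_links(before_links, after_links, is_spam):
    """Return spam URLs whose occurrence count increased between revisions.

    Treats the links as a multiset, so adding a second copy of a URL
    that was already on the page still shows up as an extra.
    `is_spam` is a hypothetical stand-in for the blacklist regex.
    """
    before = Counter(u for u in before_links if is_spam(u))
    after = Counter(u for u in after_links if is_spam(u))
    extras = after - before  # multiset difference: only increases survive
    return list(extras.elements())

blacklist = lambda url: "badsite" in url
old = ["http://badsite.example/x", "http://ok.example/"]
new = ["http://badsite.example/x", "http://badsite.example/x"]
# The duplicated spam link is reported even though the URL was already
# on the page before the edit.
result = new_spam_links(old, new, blacklist)
```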
To avoid reparsing the page fifty times or getting false negatives/false positives, we check the actual parser results.
Ok, then I'm just confused as to what Spam Blacklist is actually trying to do.
I thought all we were trying to do was stop users from saving a page with extra spam links. Which the EditFilter would do.
Define "with" -- that's the hard part! If you just run a regex over the raw text, you'll miss templates, things split over comments, bla bla bla. That's why it's become more and more complex over time: first reparsing the text, then pulling data out of the preparsed text to reduce complexity, cut the performance hit, and increase reliability.
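A toy illustration of the "split over comments" case (not the real parser; just a regex and a comment-stripping pass standing in for a parse):

```python
import re

# Hypothetical, simplified URL matcher: stops at whitespace, ']' and '<'.
URL_RE = re.compile(r"https?://[^\s\]<]+")

wikitext = "See http://bad<!-- x -->site.example/page for details."

# A naive regex over the raw text only sees the truncated prefix:
raw_hits = URL_RE.findall(wikitext)

# Checking after comment stripping (a tiny stand-in for a real parse)
# recovers the full URL the rendered page will actually contain:
stripped = re.sub(r"<!--.*?-->", "", wikitext, flags=re.DOTALL)
parsed_hits = URL_RE.findall(stripped)
```

Here `raw_hits` is `['http://bad']` while `parsed_hits` is `['http://badsite.example/page']` -- which is why the check has to work from parser output rather than raw wikitext.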