Last modified: 2010-01-15 18:01:20 UTC
Hi all! I could not find this bug by searching the DB, so I am filing it here. I hope it is not just noise. ------- Package: mediawiki1.7 Version: 1.7.1-1 Severity: normal I noticed that when using the "recent changes" mediawiki RSS feed with liferea, it keeps showing duplicate entries. According to the liferea documentation[1], this appears to be a problem with mediawiki[2]. It would be nice if this could be fixed. References: [1] file:///usr/share/liferea/doc/html/faq_en.html "Q: Why do feed items keep being displayed as new? A: This is usually due to a bad feed which associated a particular ID to multiple items. You should check your feed against a feed validator such as feedvalidator.org. If the validator does not report any error, please submit a bug report including the URL of the problem feed to the Liferea bugtracker. Note: If you experience this problem with a planet feed the reason might be that the planet feed does not provide unique item ids for one or all of its source feeds. If this is the case Liferea has no chance to match identical items." [2] http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fmeta.wikimedia.org%2Fw%2Findex.php%3Ftitle%3DSpecial%3ANewpages%26feed%3Drss "line 67, column 203: item should contain a guid element (50 occurrences)" -- System Information: Debian Release: 3.1 Architecture: i386 (i686) Kernel: Linux 2.6.8-3-k7 Locale: LANG=en_AU.UTF-8, LC_CTYPE=en_AU.UTF-8 (charmap=UTF-8)
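For anyone who wants to reproduce the validator warning locally, here is a minimal sketch that lists the RSS items lacking a <guid> child. The sample feed, its URLs, and the function name are made up for illustration; only the Python standard library is used.

```python
import xml.etree.ElementTree as ET


def items_missing_guid(rss_xml: str) -> list[str]:
    """Return the titles of all <item> elements that have no <guid> child."""
    root = ET.fromstring(rss_xml)
    missing = []
    for item in root.iter("item"):
        if item.find("guid") is None:
            missing.append(item.findtext("title", default="(untitled)"))
    return missing


# Hypothetical two-item feed: only the second item carries a guid.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example wiki - New pages</title>
<link>http://wiki.example.org/</link>
<description>Hypothetical feed for illustration</description>
<item><title>Page A</title><link>http://wiki.example.org/Page_A</link></item>
<item><title>Page B</title><link>http://wiki.example.org/Page_B</link>
<guid>http://wiki.example.org/Page_B</guid></item>
</channel></rss>"""

if __name__ == "__main__":
    print(items_missing_guid(SAMPLE_FEED))
```

This only checks for the presence of the element; it does not check that the guid values are unique across items.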
The spec explicitly allows this. Please contact liferea authors and inform them that this is legit. :)
Resolving this bug as "invalid" is not correct. Yes, the <guid> tag is not required, but then, only the <title>, <link> and <description> tags are listed as required by the specification. "This “Global Unique Identifier” allows you to republish or update specific items without duplicating these items in an aggregator. If you change an item without using the <guid> element, then the aggregator has no way of determining that the new item is replacing an old item. In that case, the aggregator will retain the old item and the new item, forcing the user to read it twice. If the <guid> element exists (and is the same as a previous item’s <guid>) then the aggregator can (at the user's option) replace the old item with the new one. If the user has not read the item yet, then all they will see is the updated item. If they have read the old item already, then they can optionally read the update or ignore it." -- http://www.feederreader.com/TechnicalGuides/RSS_Basic.html That is referenced from the current spec, which is apparently housed here: http://cyber.law.harvard.edu/rss/rss.html In other words, without <guid> an "aggregator" has no way to determine whether an individual <item> is new or has been seen before. So while the spec does not require it, omitting it is effectively a bug. No major feed that I have been able to find, and that claims to be RSS 2, omits the <guid> tag.
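The behaviour quoted above can be sketched roughly as follows. This is not how Liferea or any particular aggregator is implemented; the function name, the dictionary layout, and the fallback to the link when no guid is present are all assumptions made for illustration.

```python
def merge_items(seen: dict, fetched: list) -> list:
    """Merge fetched items into `seen` and return the ones shown as new.

    Items are keyed by guid. Falling back to the link when there is no
    guid is an assumption of this sketch, not documented reader behaviour.
    """
    new_items = []
    for item in fetched:
        key = item.get("guid") or item.get("link")
        if key not in seen:
            new_items.append(item)
        # A known key replaces the stored copy instead of adding a duplicate.
        seen[key] = item
    return new_items


if __name__ == "__main__":
    # With a stable guid, an updated item is not reported as new.
    seen = {}
    merge_items(seen, [{"guid": "rev-1", "link": "http://wiki.example.org/a", "title": "Edit"}])
    again = merge_items(seen, [{"guid": "rev-1", "link": "http://wiki.example.org/b", "title": "Edit (updated)"}])
    print(len(again), len(seen))

    # Without a guid, a changed link makes the same entry look new.
    seen = {}
    merge_items(seen, [{"link": "http://wiki.example.org/a", "title": "Edit"}])
    again = merge_items(seen, [{"link": "http://wiki.example.org/b", "title": "Edit"}])
    print(len(again), len(seen))
```

The second case is the duplication reported in this bug: with nothing stable to key on, the reader keeps both copies.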
And I thought it somehow had to do with how I manage my wiki. All those bytes (bug 17058) and no guid!
If not implementing for Special:RecentChanges&feed=rss, then at least implement for Special:NewPages&feed=rss, whose URLs are more robust.
Created attachment 6828 [details] Patch to add <guid> element to RSS items. Small patch to solve issue. I tested the result with <URI:http://www.feedvalidator.org/>. Note that for debugging I had to clear objectcache, otherwise the output remained the same :-).
Created attachment 6829 [details] add a <guid> element to RSS feeds Created this when I noticed that just using the URL still had duplicate items appearing. With the time of the edit added to the URL and the guid flagged as not being a permalink this works extremely well. A much better way to manage things would be to provide a perma-link URL for the feed code - but failing that this should work well.
I don't quite understand that. First, adding the time of the edit should make no difference at all, since it is redundant to the revision ID already contained in the URL. Second, by my reading of the specification, isPermaLink (with a guid of solely the URL) should be true (or omitted) as the diff link is unique and stable. In any case, in RSSFeed::outItem $item->getDate() does not seem to be guaranteed to exist, so if it is to be added to the guid, it should be checked for properly.
Neither did I when it started happening. Apparently the URL is constructed with two IDs so that the unique diff can be pulled up. If a change was made after that, the link changed; however, I didn't try it under a newer version of the code base. I also didn't know that ->getDate() wasn't guaranteed to exist, since it always seems to exist in my install. In any case, as I said, it would be better to use as the guid a link to that specific revision rather than to a diff, because then it is guaranteed not to change. It has also just hit me that relying on the specific date is stupid, so I withdraw my proposed patch. If the URL is changing (or was; I've since updated to a much more recent code base), then the duplication would be seen regardless, and I have been seeing it. So I'm going to work on a more in-depth fix that changes the GUID to a URL for that specific revision without the diff contents. I should have a patch for that by Monday.
I don't see how the guid pointing to the revision itself would be an improvement. Keep in mind that the purpose of the feed is to point to the changes, i.e. the diffs, not to the revisions. If a guid pointed to the revision, it could collide with other feeds that, for example, list new pages. Regarding RSSFeed::getDate(), I'm just deducing from the if-clause three lines above that it is not guaranteed to exist. Maybe someone would like to overhaul the entire process, as many functions and structures (abstract base class with rather rigid structure, selectors that silently escape to XML, etc.) look very hackish.
The RSS feeds for article history are also affected by this bug (at least in Liferea). I am subscribed to a few RSS feeds for article history and am sorry to say that they don't work that well. Here is a screenshot which clearly shows duplicated entries (please note that some of them are not duplicated!): http://img403.imageshack.us/img403/518/zrzutekranuliferea.png The Feedvalidator shows the same: "item should contain a guid element": http://feedvalidator.org/check.cgi?url=http://en.wikipedia.org/w/index.php?title=1Q84&feed=rss&action=history I hope this will be fixed sometime in the future, as it is pretty annoying... Thanks, Tomasz
Created attachment 6939 [details] patch to add guid and permalink support to feeds. Despite it technically being OK not to have a guid, it is an annoyance. A quote from the RSS spec says "In all cases, it's recommended that you provide the guid, and if possible make it a permalink. This enables aggregators to not repeat items, even if there have been editing changes." In fact, this is the main problem I am having: I have an extension that uses the RSS feed system, and if I make a change to an item, it will be repeated, as the RSS software cannot tell that it is an old item. I have made a patch that not only allows adding a guid but also allows you to set it as a permalink for RSS (this is not needed for Atom). It also makes Atom use the new guid, which by default is set to the URL but can be changed with a setUniqueId call on the item. Please can we get this sorted as soon as possible!
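To make the described behaviour concrete, here is a rough Python sketch of the idea: an item whose unique ID defaults to its URL, can be overridden, and is emitted as <guid> in RSS and <id> in Atom. This is not the patch itself and not MediaWiki's Feed.php; the class, method names, example URLs, and the choice to default the permalink flag to false are all hypothetical.

```python
from xml.sax.saxutils import escape


class FeedItem:
    """Hypothetical feed item whose unique ID falls back to its URL."""

    def __init__(self, title: str, url: str):
        self.title = title
        self.url = url
        self.unique_id = None
        self.is_permalink = False  # assumed default for this sketch

    def set_unique_id(self, unique_id: str, is_permalink: bool = False) -> None:
        self.unique_id = unique_id
        self.is_permalink = is_permalink

    def get_unique_id(self) -> str:
        # Fall back to the item URL when no explicit ID was set.
        return self.unique_id if self.unique_id is not None else self.url


def rss_guid(item: FeedItem) -> str:
    """Render the RSS <guid> element with an explicit isPermaLink flag."""
    flag = "true" if item.is_permalink else "false"
    return f'<guid isPermaLink="{flag}">{escape(item.get_unique_id())}</guid>'


def atom_id(item: FeedItem) -> str:
    """Render the Atom <id> element; Atom has no permalink flag."""
    return f"<id>{escape(item.get_unique_id())}</id>"


if __name__ == "__main__":
    item = FeedItem("Example", "http://wiki.example.org/index.php?title=Example&diff=2")
    print(rss_guid(item))
    item.set_unique_id("http://wiki.example.org/index.php?oldid=2", is_permalink=True)
    print(rss_guid(item))
    print(atom_id(item))
```

The ID is XML-escaped on output because wiki URLs typically contain "&", which is not valid unescaped inside an element.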
Created attachment 6950 [details] patch to add guid and permalink support to feeds. Feed.php hadn't been updated in over a year. I make my patch and then someone cleans the file up! Here is a new patch that applies to the latest SVN. Perhaps someone can have a look at this before Feed.php is changed again? The patch also includes a couple of minor cleanups/fixes to the file.
Created attachment 6951 [details] patch to add guid and permalink support to feeds. Oops, there were a couple of indentation mistakes in that one, and I made the ordering more logical.
Created attachment 6959 [details] 6951: patch to add guid and permalink support to feeds. Changed some parameter names in the new setUniqueId function and removed cosmetic changes from the patch to make it more readable.
Last patch committed as r61090 after discussion in #mediawiki.