Last modified: 2009-12-30 05:05:14 UTC
Google has released a protocol for sitemaps. It is in an experimental stage right now and Google does not guarantee anything. The announcement is at http://googleblog.blogspot.com/2005/06/webmaster-friendly.html. The protocol is explained at https://www.google.com/webmasters/sitemaps/docs/en/protocol.html and the FAQ for this project is at https://www.google.com/webmasters/sitemaps/docs/en/faq.html.

Currently, Wikipedia's pages do not get updated by Google "fast enough". I monitored the article [[Sarah Kane]] at de.wikipedia when it was first written and well linked. It took several weeks until the URL appeared and several weeks more until the content was indexed. With more than 2 million Wikipedia articles and many more pages such as user pages, talk pages and other namespaces, simply crawling the site might not be the best solution. MediaWiki could automatically provide sitemaps.

The protocol supports gzip compression. A single file may not be larger than 10 MB (uncompressed) and may not contain more than 50,000 URLs. It is allowed to have more than one sitemap file and to link them in a sitemap index file; a sitemap index file may not list more than 1,000 sitemaps. Under these limits, it seems unlikely that we would run into the protocol's restrictions any time soon, even for the large MediaWiki installations (such as en.wikipedia and de.wikipedia).

The XML DTD contains several tags which would have to be filled with content:

* changefreq: Enumerated list. Valid values are "always", "hourly", "daily", "weekly", "monthly", "yearly" and "never". Suggestion: take the article's age (the current date minus the date on which the article was created) and divide it by the number of revisions to get the average time between edits (a rough mapping to the enumerated values is sketched below). A finer solution might be to just monitor the frequency of edits within the last two months of that article, to reflect "current event" articles better.
* lastmod: Time that the URL was last modified. We already have that information in the cur table.
* loc: URL for that page. Obvious.
* priority: Optional. The priority of a particular URL relative to other pages on the same site. The value for this tag is a number between 0.0 and 1.0, where 0.0 identifies the lowest priority page(s) on your site and 1.0 identifies the highest priority page(s). It would be simple to give all articles a 0.7 and other namespaces significantly lower priorities. People might consider using a more sophisticated approach based on the number of backlinks or whatever.
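A minimal sketch of how such an entry could be generated from those fields (Python, purely for illustration; the real generator would presumably be a PHP maintenance script, the inputs here - page URL, creation date, last modification, revision count - are stand-ins for whatever the database actually provides, and the thresholds used for mapping the average edit interval to the enumerated values are arbitrary):

    from datetime import datetime, timedelta
    from xml.sax.saxutils import escape

    def guess_changefreq(created, num_revisions):
        """Average time between edits = article age / number of revisions."""
        interval = (datetime.utcnow() - created) / max(num_revisions, 1)
        if interval <= timedelta(days=1):
            return "daily"
        if interval <= timedelta(days=7):
            return "weekly"
        if interval <= timedelta(days=31):
            return "monthly"
        return "yearly"

    def url_entry(loc, lastmod, created, num_revisions, priority=0.7):
        """One <url> element with the four tags described above."""
        return ("<url>"
                f"<loc>{escape(loc)}</loc>"
                f"<lastmod>{lastmod.strftime('%Y-%m-%d')}</lastmod>"
                f"<changefreq>{guess_changefreq(created, num_revisions)}</changefreq>"
                f"<priority>{priority:.1f}</priority>"
                "</url>")

    # e.g. an article created in May 2005 that has seen 40 revisions since then
    print(url_entry("http://de.wikipedia.org/wiki/Sarah_Kane",
                    lastmod=datetime(2005, 6, 10),
                    created=datetime(2005, 5, 1),
                    num_revisions=40))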
I like the idea, and I believe this could be easily done by extending the code that creates the RSS feed for the recent changes page. However, this would probably be helpful mostly for small wikis, as the large number of edits on the big Wikipedias is likely to be more than Google is willing to handle. But it's worth a try... Maybe it would be a good idea to talk to Google directly about it - they may be willing to handle Wikipedia specially (they already do, for their "definition" feature).
So, basically a frigg'n huge RC feed for search engines ;) It might be good to just set changefreq to "always" to avoid the performance penalty of computing it (you'd have to load the whole revision timestamp history, or at least a recent window like the last 10 revisions, rather than just look at the recentchanges table), or at least make it optional. It's probably best to skip priority, or make it optional and compute it from backlinks like Mathias suggested.
(In reply to comment #2)
> So, basically a frigg'n huge RC feed for search engines ;)

There is a proverb in German, "Seit wann kommt der Berg zum Prophet?" (does "Since when does the mountain come to the prophet?" make any sense?). I would rather not call this an RC feed. It is simply a "standardized" and slightly more informative sitemap of the kind found on some web sites. An RC feed would look more like trackbacks/pings.

> It might be good to just set changefreq to "always" to avoid the performance
> penalty of computing it (you'd have to load the whole revision timestamp
> history, or at least a recent window like the last 10 revisions, rather than
> just look at the recentchanges table), or at least make it optional.

Setting changefreq to "always" would not be a good idea. The point of this sitemap is to make crawler and spider bots work more efficiently. An "always" for all pages would eliminate all the advantages over stupid crawling of the web site. "Always" is meant for pages that change every time you check them. Most standard Wikipedia articles do not change fast. I see that there is some computation effort involved in getting useful results, but putting changefreq="always" on articles like [[Mucocutaneous boundary]] is simply wrong. That article has looked the same for 18 months. If you really want a fixed setting for all articles, I would recommend nothing more frequent than "weekly". A possible solution would be a default of "weekly" for all articles while running a log of the recentchanges channel on IRC. A script could find all the articles that have changed more than twice within two days (just a thought) and do a simple search-and-replace to "daily" in the sitemap files. This would reduce the computing power needed, I guess. I don't know whether this idea would survive contact with reality.

> It's probably best to skip priority, or make it optional and compute it from
> backlinks like Mathias suggested.

Actually, priority *is* optional according to the Google web site. Computing it from the backlinks might not really reflect the "real" priority of an article, whatever that may be. I guess we could all agree that namespace 0 has a higher priority in any case than all the other namespaces. The base-4 logarithm of the number of backlinks, divided by 10 (capped at 0.5), plus 0.5 would give a priority between 0.5 and 1.0 (are there articles with more than 1024 backlinks on en right now?). This might be better than nothing. The backlink information is in the MySQL tables anyway, right?
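Just to make that backlink formula concrete (Python again, only a sketch; this assumes the base-4 logarithm reading described above, and the fixed 0.2 for non-article namespaces is made up):

    import math

    def sitemap_priority(backlinks, namespace):
        """0.5 + log4(backlinks)/10, capped so that 1024 backlinks reach 1.0;
        everything outside the article namespace gets a low fixed value."""
        if namespace != 0:
            return 0.2  # arbitrary low priority for non-article namespaces
        if backlinks < 1:
            return 0.5
        return round(0.5 + min(0.5, math.log(backlinks, 4) / 10), 2)

    print(sitemap_priority(2000, 0))  # heavily linked article -> 1.0
    print(sitemap_priority(12, 0))    # obscure article -> 0.68
    print(sitemap_priority(500, 3))   # a talk page -> 0.2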
(In reply to comment #3)
> (In reply to comment #2)
> > So, basically a frigg'n huge RC feed for search engines ;)
>
> [snip]
>
> I would rather not call this an RC feed. It is simply a "standardized" and
> slightly more informative sitemap of the kind found on some web sites. An RC
> feed would look more like trackbacks/pings.

I was under the impression that this index would only list a subset of the pages on the wiki, but just so that we're clear, would it be a complete index (though perhaps not totally up to date) of them all?

> > It might be good to just set changefreq to "always" to avoid the performance
> > penalty of computing it (you'd have to load the whole revision timestamp
> > history, or at least a recent window like the last 10 revisions, rather than
> > just look at the recentchanges table), or at least make it optional.
>
> Setting changefreq to "always" would not be a good idea. The point of this
> sitemap is to make crawler and spider bots work more efficiently. An "always"
> for all pages would eliminate all the advantages over stupid crawling of the
> web site. "Always" is meant for pages that change every time you check them.
> Most standard Wikipedia articles do not change fast. I see that there is some
> computation effort involved in getting useful results, but putting
> changefreq="always" on articles like [[Mucocutaneous boundary]] is simply
> wrong. That article has looked the same for 18 months. If you really want a
> fixed setting for all articles, I would recommend nothing more frequent than
> "weekly". A possible solution would be a default of "weekly" for all articles
> while running a log of the recentchanges channel on IRC. A script could find
> all the articles that have changed more than twice within two days (just a
> thought) and do a simple search-and-replace to "daily" in the sitemap files.
> This would reduce the computing power needed, I guess. I don't know whether
> this idea would survive contact with reality.

I agree, not using "always" is probably best.

> > It's probably best to skip priority, or make it optional and compute it from
> > backlinks like Mathias suggested.
>
> Actually, priority *is* optional according to the Google web site. Computing it

I know, I read the spec ;)

> from the backlinks might not really reflect the "real" priority of an article,
> whatever that may be. I guess we could all agree that namespace 0 has a higher
> priority in any case than all the other namespaces. The base-4 logarithm of the
> number

Yeah, NS_MAIN should be highest.

> of backlinks, divided by 10 (capped at 0.5), plus 0.5 would give a priority
> between 0.5 and 1.0 (are there articles with more than 1024 backlinks on en
> right now?).

Definitely, stuff like [[Europe]] gets linked a lot.

> This might be better than nothing. The backlink information
> is in the MySQL tables anyway, right?

Indeed, but something like this is way too heavy to be dynamically generated; it would have to be done by a cronjob or made from a database dump.
(In reply to comment #4)
> I was under the impression that this index would only list a subset of the
> pages on the wiki, but just so that we're clear, would it be a complete index
> (though perhaps not totally up to date) of them all?

This is what this sitemap is meant for (as far as I understand it): a complete list of all pages of a given web site (those that are to be found by a search engine).

> > This might be better than nothing. The backlink information
> > is in the MySQL tables anyway, right?
>
> Indeed, but something like this is way too heavy to be dynamically generated;
> it would have to be done by a cronjob or made from a database dump.

While I do not consider an up-to-date version of this sitemap impossible, a sitemap would only have to be made once in a while (even a weekly output would still mean an improvement for Google and, in the medium and long run, for us).
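To make the cronjob/dump idea a bit more concrete, here is a rough sketch (Python, only illustrative; the file names, base URL and the namespace URI - the later sitemaps.org one, not necessarily Google's original schema - are assumptions) of what such a weekly job would have to produce: the full URL list split into gzipped sitemap files of at most 50,000 entries each, tied together by a sitemap index file:

    import gzip
    from datetime import date

    MAX_URLS = 50000  # protocol limit per file (10 MB uncompressed is the other limit)
    XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"

    def write_sitemaps(url_entries, prefix="sitemap"):
        """Split ready-made <url>...</url> strings into gzipped files of at
        most MAX_URLS entries each; return the generated file names."""
        names = []
        for i in range(0, len(url_entries), MAX_URLS):
            name = f"{prefix}-{i // MAX_URLS:03d}.xml.gz"
            with gzip.open(name, "wt", encoding="utf-8") as f:
                f.write(f'<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="{XMLNS}">\n')
                f.write("\n".join(url_entries[i:i + MAX_URLS]))
                f.write("\n</urlset>\n")
            names.append(name)
        return names

    def write_index(names, base_url, out="sitemap-index.xml"):
        """Write the sitemap index file pointing at the per-chunk sitemaps."""
        today = date.today().isoformat()
        with open(out, "w", encoding="utf-8") as f:
            f.write(f'<?xml version="1.0" encoding="UTF-8"?>\n<sitemapindex xmlns="{XMLNS}">\n')
            for name in names:
                f.write(f"  <sitemap><loc>{base_url}{name}</loc>"
                        f"<lastmod>{today}</lastmod></sitemap>\n")
            f.write("</sitemapindex>\n")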
All I did was add the recent changes RSS feeds to my Google Sitemaps account, see http://www.wikicities.com/wiki/Community_portal#Google_Sitemaps
I don't know if it's having much effect, but it seems to be retrieving the pages about every 12 hours. Perhaps it could be done more efficiently in batch.
See http://wiki.case.edu/misc/googleSiteMap.phps
We now have a sitemap generator in CVS HEAD, marking this as FIXED.