Last modified: 2010-01-02 03:22:01 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T22818, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 20818 - urls of the form wiki/article?curid=something should be indexable by robots on wikinews
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: Site requests
Version: unspecified
Hardware/OS: All / All
Importance: Normal priority, critical severity (11 votes)
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://en.wikinews.org/wiki/Airliner_...
Keywords: code-update-regression
Depends on: 21302
Blocks:
Reported: 2009-09-26 03:42 UTC by Bawolff (Brian Wolff)
Modified: 2010-01-02 03:22 UTC
CC: 12 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Google News SiteMap: an Atom/RSS feed Special page extension. (30.00 KB, application/x-tar) - 2009-10-26 05:15 UTC, Amgine
Google News SiteMap: an Atom/RSS feed Special page extension. (30.00 KB, application/x-tar) - 2009-10-27 00:21 UTC, Amgine
Google News SiteMap: an Atom/RSS/SiteMap feed Special page extension. (30.00 KB, application/x-tar) - 2009-11-01 04:57 UTC, Amgine

Description Bawolff (Brian Wolff) 2009-09-26 03:42:59 UTC
In the last code update, the way the robots meta tag works appears to have changed: pages with the curid parameter are now set to noindex,follow. On English Wikinews, for various reasons (see [[n:Wikinews:Google news]] for the gory details), we use page URLs with the curid appended at the end so that Google News will syndicate us. This change has stopped us from being syndicated by Google News. I would like to request, with urgency, that pages of the form http://en.wikinews.org/wiki/Some_article?curid=some_numb have the robots policy index,follow, or simply have no meta robots tag at all.

Thanks
-bawolff
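
A minimal sketch of the requested behaviour for a single wiki, assuming a LocalSettings.php hook and OutputPage::setRobotPolicy() as in contemporary MediaWiki; this is an illustration only, not the change that was eventually deployed on Wikimedia sites:

$wgHooks['BeforePageDisplay'][] = 'wnCuridRobotPolicy';

/**
 * Force index,follow on page views requested with ?curid=NNN so that
 * Google News can index those URLs. Runs inside a MediaWiki environment.
 */
function wnCuridRobotPolicy( $out, $skin ) {
	global $wgRequest;
	if ( $wgRequest->getInt( 'curid' ) ) {
		// Override the default noindex,follow applied to curid requests.
		$out->setRobotPolicy( 'index,follow' );
	}
	return true;
}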
Comment 1 Brian McNeil 2009-09-26 10:02:41 UTC
This is rather annoying, as the ability to provide DPL display links to pages with the curid was *precisely* for en.wikinews to get listed in Google News.
Comment 2 Brion Vibber 2009-09-28 19:12:46 UTC
I don't think those get squid-cached properly...
Comment 3 Jon Davis 2009-10-01 23:44:32 UTC
I realize that having these pages not be cached is a Bad Thing (TM), and there is logic behind adding the noindex/follow to these essentially duplicate pages... but for Wikinews, this is a real problem.  We worked very hard to get included in Google News, and now we're... off, and it is completely out of our hands.  Is there any possibility that this change (adding the noindex/follow) could PLEASE be rolled back, at least until a better solution can be found?
Comment 4 Platonides 2009-10-03 21:40:34 UTC
> I don't think those get squid-cached properly...

curid requests should provide a Content-Location header. Then the Squids could serve that URL from their cache. Currently Squid seems to take Content-Location into account only for purging (and that on Squid 3; WM is using 2.7.STABLE6), but it seems like a sensible feature. It's probably dependent on http://bugs.squid-cache.org/show_bug.cgi?id=1631 though.
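
A hedged sketch of the header Platonides describes, using the same assumed hook point as above; whether Squid would actually use it for cache hits is exactly the open question in this comment:

$wgHooks['BeforePageDisplay'][] = 'wnCuridContentLocation';

/**
 * Advertise the canonical article URL for ?curid= requests via a
 * Content-Location header. Illustration only; the deployed Squid 2.7
 * would not necessarily honour it.
 */
function wnCuridContentLocation( $out, $skin ) {
	global $wgRequest, $wgTitle;
	if ( $wgRequest->getInt( 'curid' ) && $wgTitle ) {
		header( 'Content-Location: ' . $wgTitle->getFullURL() );
	}
	return true;
}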

Comment 5 Amgine 2009-10-21 03:55:04 UTC
As an alternative solution, I've built an RSS/Atom feed which can be served to Google News as a sitemap, so the URLs do not need to have curid. The script is sensitive to FlaggedRevs, uses DynamicPageList-style URL parameters to sort out news articles, includes configurable maximum/minimum returns, and limits the number of category/notcategory parameters to search on.

It's just beginning to be tested, and I'm looking for beta volunteers.
Comment 6 Amgine 2009-10-26 05:15:03 UTC
Created attachment 6714 [details]
Google News SiteMap: an Atom/RSS feed Special page extension.

This Special page extension creates an Atom/RSS feed based on categories/notcategories/namespace and other URL-passed criteria, à la DynamicPageList (Wikimedia).

It's a bit crufty at the moment, including stuff from DPL which isn't relevant to an xml feed, but it does work. It is not fully tested. And it would get Wikinews back onto Google News.
Comment 7 Platonides 2009-10-26 14:19:53 UTC
Doesn't look ready for WMF deployment, IMHO. Too much code, most of it likely unneeded. And there's no usage description, so it's hard to understand even what it is expected to do.

Moreover, I don't see how this extension could fix the problem. Wikinews:Google_news states that the usage of curid= is due to Google News only following links which contain numbers.

The issue could be fixed by having DPL add a dummy parameter and tricking the Squids into ignoring it, adding invalidations for titles with curid=...
Comment 8 Bawolff (Brian Wolff) 2009-10-26 14:27:59 UTC
Google news gives us two options for allowing them to index us.
*Treat any article linked from the main page with a number in the URL as news. (This has the problem that we don't like numbers in our article titles, hence the curid. Plus it doesn't let us put developing articles on the main page, lest one of the titles have a number in it.)
*Option two: treat anything in a Google News sitemap as recently published. (This is slightly different from a normal sitemap; essentially an XML document listing pub date, categories, title, and URL. On their website they say they want anything published in the last three days to be in the sitemap; a rough sketch of such an entry appears below.)

cheers.
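
To make the second option concrete, here is a rough sketch of the kind of <url> entry a Google News sitemap contains, written in PHP. The element names follow the sitemap-news schema as generally documented; they are assumptions here, not output taken from the attached extension:

/**
 * Build one Google News sitemap <url> entry. Sketch only: real output must
 * be wrapped in a <urlset> declaring the sitemap and sitemap-news namespaces.
 */
function wnSitemapEntry( $loc, $pubDate, $keywords ) {
	return "<url>\n"
		. '  <loc>' . htmlspecialchars( $loc ) . "</loc>\n"
		. "  <news:news>\n"
		. '    <news:publication_date>' . htmlspecialchars( $pubDate ) . "</news:publication_date>\n"
		. '    <news:keywords>' . htmlspecialchars( $keywords ) . "</news:keywords>\n"
		. "  </news:news>\n"
		. "</url>\n";
}

echo wnSitemapEntry( 'http://en.wikinews.org/wiki/Some_article', '2009-12-21', 'Wikinews' );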
Comment 9 Amgine 2009-10-26 15:48:49 UTC
Platonides: I agree, it's probably not ready for deployment. I'm looking for feedback on that, which I haven't been able to find via other communication routes. Unfortunately, most of the code is valid, because the feed is also designed for additional uses. What isn't required are the display-related elements (wgUser).

A Google News SiteMap is registered with and polled by GN. They prefer this method over spidering a website because they only get the latest links. In effect it's an API, one which can also be used by more than just Google News.

Comment 10 Amgine 2009-10-27 00:21:39 UTC
Created attachment 6723 [details]
Google News SiteMap: an Atom/RSS feed Special page extension.

Update to GNSM Special page extension

- decrufted DPL parameters
- Tested most parameters
-- Remaining untested: suppress errors, usecurid, usenamespace (not relevant at this point)
- added brief usage notes
Comment 11 Amgine 2009-10-28 00:48:04 UTC
Changes on the Google News side now require sitemap XML feeds only. I'm writing an additional feed class to produce this; however, it may require a different feed item as well, as the URL containers hold slightly different values.
Comment 12 Amgine 2009-11-01 04:57:04 UTC
Created attachment 6745 [details]
Google News SiteMap: an Atom/RSS/SiteMap feed Special page extension.

Version 0.2.something

This version provides complete Atom/RSS/SiteMap xml output.

- support for Sitemap <news:keywords>
- support for Sitemap <lastmod>
- support for Sitemap <url> tweaked
- support for Sitemap <news:pubdate> tweaked
Comment 13 Amgine 2009-11-02 14:50:40 UTC
There are a couple of minor fixes pending (removing some debug code, adding a number-of-days parameter and an error message), but I won't be able to update for a few hours at least. Platonides said xe'd be reviewing this today, so I just wanted to make it known that there's a slight lag.
Comment 14 Amgine 2009-11-14 00:11:09 UTC
Comment on attachment 6745 [details]
Google News SiteMap: an Atom/RSS/SiteMap feed Special page extension.

Concedo. (I concede.)
Comment 15 Brian McNeil 2009-12-16 22:54:34 UTC
This has *again* become very urgent for Wikinews.

A hack was introduced to have a hidden list of URLs with numbers in them on the main page. Google is no longer picking these up and insists on the URL containing a minimum of 3 digits. Redirects are not working.

Can this be addressed as a matter of urgency, or Amgine's proposed patch/extension be seriously reviewed with a view to putting it into use?

Unless I am unable to, I wish to reopen that for a proper review.

/me mutters "Where's Brion?"
Comment 16 Brian McNeil 2009-12-16 23:09:43 UTC
Comment on attachment 6745 [details]
Google News SiteMap: an Atom/RSS/SiteMap feed Special page extension.

I do not know if this is obsolete, but it is urgently needed for enWN to maintain a listing in Google News.

I understand it was previously given a half-assed review and the vital internal SQL was not security-audited. This *needs* to be done. Wikinews must be listed in Google News to generate user contributions from the competition that's coming up, and Jimmy Wales is doing WikiVoices next week on Wikinews, where we want him to write an article and get it listed in Google News.

I do not want a Skypecast recorded where I'm telling Jimmy, "Uh, yes, you need to think of a title with a three-digit number in it so your article will appear in Google News".
Comment 17 iain.macdonald 2009-12-17 16:34:04 UTC
Based on the fact that it is now considered fairly key that WN is listed on GNews to keep its readership and contributions up, I have upgraded this from major to critical. This needs fixing. Now.
Comment 18 Platonides 2009-12-17 21:37:42 UTC
> /me mutters "Where's Brion?"

Working for identi.ca?
He's not even CCed for this bug, so I don't think he's going to read you.
Note that the caching concerns were addressed at bug 21302.
Comment 19 Kim Bruning 2009-12-18 00:09:40 UTC
* I'm looking for the latest copy of the patch. I'd fix any issues found under code review.
* Gigs (and I) were wondering if there is a test server with this code running someplace already. Otherwise I'll set one up.

Comment 20 Brian McNeil 2009-12-18 01:32:00 UTC
I believe a test is running on wiki.enwn.net. That's ShakataGaNai on-wiki you'd need to speak to. (wiki@consoletek.com).

I didn't pay much attention to them setting up the code management system for that though, so I don't know what the state of things is.

Comment 21 Jon Davis 2009-12-18 05:43:28 UTC
That would be me (ShakataGaNai, that is).  For some unknown reason the test environment is hosed in all sorts of spectacular ways.  Oddly enough, the only part that does work is GNSM, as seen here: http://wiki.enwn.net/index.php/Special:SpecialGNSM .  I can set up a new demo environment if need be (one that actually works).  Ping me off-bug at this address if you want me to.
Comment 22 Max Semenik 2009-12-18 07:06:15 UTC
Note that I committed it to SVN yesterday (r60172) in order to facilitate more collaboration.
Comment 23 Amgine 2009-12-18 15:13:30 UTC
You may want the description page as well: http://en.wikinews.org/wiki/User:Amgine/Google_News_Sitemap
Comment 24 Tim Starling 2009-12-21 07:05:16 UTC
Why are you all just waiting patiently instead of emailing me? This is a serious issue.
Comment 25 Amgine 2009-12-21 07:09:40 UTC
We aren't waiting for you. There's been at least one working solution for more than a month. A temporary hack was working for most of that time.
Comment 26 Tim Starling 2009-12-21 07:40:23 UTC
The relevant breaking change was made in r45360, in January 2009. I don't think it would be good to revert it. The use of curid by DPL is incorrect, any random string would have done just as well to fool the google bot, and any other random string wouldn't have had the undesired side-effect. I've committed and deployed a change to set a dpl_id parameter on links when the DPL parameter "googlehack" is set. As far as I can see, this will fix the issue as reported. Just change your templates to use googlehack instead of showcurid.

Requests to review and enable a mostly unrelated site map extension should be made on a separate bug report.
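
For readers following along, the shape of the fix Tim describes is roughly the following. This is only an illustration of the dpl_id idea, not the actual r60172 diff, and the function name here is invented:

/**
 * Illustration of the "googlehack" behaviour described above: when the DPL
 * parameter is set, link to the article with a harmless numeric dpl_id
 * query parameter (the page id) instead of curid, so the URL contains the
 * digits Google News wants without triggering the noindex policy on curid.
 */
function dplArticleLinkUrl( Title $title, $googleHack ) {
	$query = $googleHack ? 'dpl_id=' . $title->getArticleID() : '';
	return $title->getLocalURL( $query );
}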
Comment 27 Brian McNeil 2009-12-21 10:56:10 UTC
The problem is that, precisely as the parameter name reveals, this is a HACK.

For initial listing in Google News Wikinews had to completely cease listing any developing stories on the main page. This is a significant disincentive to attracting new contributors. (If any developing story simply had "1234" in the title it would automatically be indexed by Google News).

I do not want to disparage the key MediaWiki developers; they are stretched thin and any issue on Wikipedia is far, far more visible than on Wikinews. However, Amgine's extension is a powerful, and flexible general-purpose solution to the issue of RSS, Atom, and Google News feeds from any MediaWiki install.

Can we have assurances that the permanent solution (and addition of a gallery option to DPL) will be reviewed seriously and - hopefully - implemented in the near future? The gallery option is of great interest to Commons where they would like to de-emphasise lower resolution Featured Images which are generally older.
Comment 28 Bawolff (Brian Wolff) 2009-12-21 20:42:12 UTC
Per Tim's suggestion, I have filed a separate bug for GNSM, Bug 21919. While the googlehack parameter of the DPL is certainly a step in the right direction, we would still really appreciate having a Google sitemap.
