
Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T23919, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 21919 - Write and implement Google News SiteMaps (GNSM) extension for Wikinews
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All / OS: All
Importance: Low enhancement with 7 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: patch, patch-reviewed
Depends on:
Blocks:
 
Reported: 2009-12-21 20:39 UTC by Bawolff (Brian Wolff)
Modified: 2011-06-23 20:24 UTC
CC: 19 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
diff to add support for mapping between category schemas. (2.83 KB, patch)
2009-12-21 22:37 UTC, Bawolff (Brian Wolff)
Details
file google wants (53 bytes, text/html)
2011-06-13 22:01 UTC, Mark A. Hershberger
Details

Description Bawolff (Brian Wolff) 2009-12-21 20:39:35 UTC
See discussion on Bug 20818. (I'm not sure if this should be filed under a different product than 'Wikimedia'; it didn't really seem to fit anywhere.) Per Tim's suggestion I'm refiling this as a separate bug from Bug 20818.

Basically, a variety of hacks are currently used to get Wikinews onto Google News. Google News spiders our main page and uses a variety of techniques to try to determine which links point to news articles and which do not. These techniques do not always work. (For example, the following articles were not listed despite the hack: http://en.wikinews.org/wiki/Payment_pending;_Canadian_recording_industry_set_for_six_billion_penalties%3F , http://en.wikinews.org/wiki/Boeing_787_%22Dreamliner%22_makes_maiden_flight , http://en.wikinews.org/wiki/Israel_%22illegally_annexing%22_east_Jerusalem,_EU_reports , and we really have no idea why.)

Google News also allows news websites to submit their articles via a "Google News Sitemap", a custom XML format that describes pages on a website with a minimal amount of metadata. When submitting news via a sitemap, Google relaxes its standards for detecting what a news URL is, and for anything it does not index, it gives a reason why ( http://www.google.com/support/news_pub/bin/answer.py?hl=en&answer=93994 ). This would hopefully reduce the above-mentioned issue of articles not appearing on Google News, and in any case where an article still did not appear, we would at least know why.

To that end, as many of you know, Amgine has developed an extension called GNSM (for Google News SiteMap), which more or less takes the intersection extension and allows the results to be output as a sitemap accessed via a special page. (The extension is currently at http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/GNSM/ ; there is a demo page at http://wiki.enwn.net/index.php/Special:SpecialGNSM .)

This has the additional benefit (should it be enabled) of also giving Wikinews on-site RSS feeds. Currently we rely on off-wiki methods of generating our RSS feeds. (Someone [CSpurrier?] runs a script which generates an RSS feed that is forwarded to FeedBurner for the main feed; some of the secondary feeds come from the Toolserver, which has a tendency to break.) It would be nice to have feeds generated on-wiki, if only because it would look more professional to have the RSS feed on the same domain as the site.

To conclude, as has been said before: while we understand that in some people's opinion GNSM may not be fully ready to be deployed yet, we would really, really appreciate it if it were developed to the point where it is ready, and reviewed/deployed on Wikinews as soon as possible.

Thank you.
-bawolff
Comment 1 Bawolff (Brian Wolff) 2009-12-21 22:37:51 UTC
Created attachment 6899 [details]
diff to add support for mapping between category schemas.

I've added a diff that does the following:
*Fixes a syntax error in the i18n file
*Provides a method to map categories to keywords for the sitemap (Google uses different category names than we do: http://www.google.com/support/news_pub/bin/answer.py?hl=en&answer=116037 ). It adds a new message, gnsm_categorymap, which takes a list of items of the form "*mediawiki_category_name_without_leading_ns_prefix|keyword_to_output", or "*mediawiki_cat_name_no_ns|__MASK__" to suppress outputting that category. (The extension should perhaps automatically mask hidden categories; I wasn't sure.) An example is shown below.

Cheers.
Comment 2 Jon Davis 2009-12-21 23:25:48 UTC
Someone munged the FR translations in SpecialGNSM.i18n.php. Among the bad characters in almost every line, 'gnsm_noincludecats' specifically contains an extra apostrophe (') in "despace". Could someone please fix this?

Secondarily, I've updated http://wiki.enwn.net/index.php/Special:Google_News_SiteMap to SVN (as seen at http://wiki.enwn.net/index.php/Special:Version ).
Comment 3 Bawolff (Brian Wolff) 2009-12-21 23:37:00 UTC
/me notes patch above that fixes the apostrophe issue :P
Comment 4 Kim Bruning 2009-12-22 13:20:43 UTC
I've created a VM with the latest versions of MediaWiki and GNSM from SVN
( http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/GNSM )

Please send me mail with your pubkey if you would like access.
Comment 5 Priyanka Dhanda 2009-12-22 19:57:33 UTC
Just as a reference it may be good to look at these:
http://www.google.com/support/news_pub/bin/answer.py?hl=en&answer=74288
and
http://www.google.com/support/webmasters/bin/answer.py?answer=82299

I'll also apply Bawolff's patch and test all the changes. 
Comment 6 Priyanka Dhanda 2009-12-23 00:52:55 UTC
Amgine,
I noticed that the parameters work as part of the url
http://wiki.enwn.net/index.php/Special:Google_News_SiteMap/feed=rss&order=descending
but not as request parameters
http://wiki.enwn.net/index.php/Special:Google_News_SiteMap?feed=rss&order=descending

http://www.mediawiki.org/wiki/Manual:Special_pages#The_Body_File
It may be better to use $wgRequest->getValues() instead of $par.

Bawolff,
Do you think the category mapping may be better as part of LocalSettings? It may be easier to configure per site.
-p
Comment 7 Bawolff (Brian Wolff) 2009-12-23 04:15:21 UTC
Note, Amgine is currently on a road trip, and might have limited internet connectivity, and might take longer to respond. (however that might have just been for today.)

I am fairly certain that Amgine would not be opposed to changing the feed to use the request parameter method.

As for category mapping, I think it would be easier doing it as a system message, since:
*The categories Google accepts might change or be expanded
*The category scheme that we use might change
*Each language would use a different scheme
Since we can modify a system message without bothering/waiting for the devs, I feel that using a system message would be easier.

However, with that being said, if you think it'd be better to do it in some other fashion, please don't hesitate. I'm more concerned that such functionality exists than with how it is implemented.

Cheers,
Bawolff
Comment 8 Brian McNeil 2009-12-23 11:47:48 UTC
I'm going to ask Josh at Google if there's someone there who can take a look at this. We're effectively developing a general-purpose add-on, one main use of which will be to get a MediaWiki install indexed by Google using a sitemap.

The category matching tables that would be needed on-wiki to give Google appropriate categories are one thing, but I noticed on one of the other highlighted Google help pages that company stock ticker listings can be provided. If a table mapping [[Category:Microsoft]] to a stock ticker can be indicated by a parameter, this is another very useful thing to set up on Wikimedia wikis.
Comment 9 Roan Kattouw 2009-12-23 13:14:21 UTC
(In reply to comment #6)
> Amgine,
> I noticed that the parameters work as part of the url
> http://wiki.enwn.net/index.php/Special:Google_News_SiteMap/feed=rss&order=descending
> but not as request parameters
> http://wiki.enwn.net/index.php/Special:Google_News_SiteMap?feed=rss&order=descending
> 
> http://www.mediawiki.org/wiki/Manual:Special_pages#The_Body_File
> It may be better to use $wgRequest->getValues() instead of the $par
> 
For stuff like this, you should really use $wgRequest->getVal('feed') instead of $par. You should really only use $par for one parameter, like Special:Listusers/sysop (alias for Special:Listusers?group=sysop); otherwise you end up splitting and parsing $par yourself, which is kind of unnecessary. If you want to use $par for something, the feed parameter makes most sense IMO.
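A rough sketch of what that looks like in a special page (hypothetical class and parameter names, not the extension's actual code):

    class SpecialGoogleNewsSitemap extends SpecialPage {
        public function execute( $par ) {
            global $wgRequest;
            // Reserve $par for a single "pretty URL" parameter, e.g. Special:GNSM/rss
            $feed = ( $par !== null && $par !== '' )
                ? $par
                : $wgRequest->getVal( 'feed', 'sitemap' );
            // Everything else comes straight from the query string
            $order = $wgRequest->getVal( 'order', 'descending' );
            $hourCount = $wgRequest->getInt( 'hourcount', 48 );
            // ... build and output the feed ...
        }
    }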
Comment 10 Priyanka Dhanda 2009-12-23 19:50:38 UTC
Bawolff, makes sense.

The fix for the request params is pretty simple, I can make it if we don't hear back from Amgine soon.

Thanks for the feedback everyone.
-p
Comment 11 Jason Giglio 2009-12-23 19:54:21 UTC
I was planning on doing that as the first stage in cleaning it up.

I've got the test environment going and plan to take a look at it tonight.
Comment 12 Amgine 2009-12-23 22:06:41 UTC
Sorry for the delays: I'm road tripping into a winter storm here, but I should have internet access fairly regularly for the next two days (or longer if we get snowed in.)

$par vs. $wgRequest: I used the first/fastest tool I found - start to finish here was about 3 weeks. After 5 weeks of trying to get exactly this kind of feedback, I gave up on the project. It's GPL, so fix/expand as you see fit. I don't own it, I just tried to fix a problem Wikinews asked me to address.
Comment 13 Priyanka Dhanda 2009-12-30 22:29:54 UTC
Brian, Jason,
I did a little cleanup. Let me know when you have verified it and whether you are satisfied with it now.
-p
Comment 14 Brian McNeil 2009-12-31 01:23:45 UTC
Priyanka,

I'm looking at the changes you checked in (I'm a noob to PHP).

$par is used by dpl_param on line 305 thus:

$feedType = explode('/',$par,2)

What is actually in $par? It appears the result of the operation on line 305 should produce an array based on items separated by '/'.

If $par is the URL path after "Special:GNSM/", then line 305 will only work if the "?" and the parameters after it are stripped before it gets there. That would also mean that "Special:GNSM/<feedtype>/<junk path and name>?option=xxx" would work.

Of course, "Special:GNSM/<feedtype>/<url-escaped-feed-title>?option=xxx" would be nice and allow for more descriptive URLs.
Comment 15 Jason Giglio 2009-12-31 03:17:17 UTC
I don't have access to SVN right now (on vacation), but I have already done most of the changes to migrate to $wgRequest and get away from $par entirely. I have not implemented the multiple categories yet, and it's still got some bugs. I'll try to get some time in on it tomorrow and see if I can integrate the new changes that were checked in.
Comment 16 Kim Bruning 2010-01-17 02:56:22 UTC
What's current status on this? Has activity died down again?
Comment 17 Amgine 2010-02-09 20:08:02 UTC
(In reply to comment #16)
> What's current status on this? Has activity died down again?

Told you so.
Comment 18 Priyanka Dhanda 2010-10-23 20:40:31 UTC
Assigning this to Rob Lanphier so that he can figure out how to proceed.
Comment 19 Rob Lanphier 2010-10-26 16:17:54 UTC
I'm going to unassign this to signal that we'd like volunteer help on this.  Mark Hershberger and I will both remain on the cc list, and can help a volunteer get this through the process.
Comment 20 Amgine 2010-10-26 18:25:00 UTC
Let me see, written by volunteers, begged for feedback for months, rewritten by staff, left on hold for review for 10 months...

What do you want, Rob Lanphier?
Comment 21 Rob Lanphier 2010-10-26 18:44:14 UTC
Amgine, this extension only came to my attention this weekend.  I think you'll do best by asking nicely for help on wikitech-l.
Comment 22 Amgine 2010-10-26 18:54:32 UTC
No thanks. I've washed my hands of it.
Comment 23 Kim Bruning 2010-10-26 19:22:36 UTC
This extension is awaiting review, approval, and roll-out to en.wikinews, with what was critical blocker priority for the en.wikinews community at the time. 

I believe there have been some rubber bands, sellotape and matchsticks holding things together in the mean time. 

I understand that we have some capacity to review and roll out?

I would like to get this rolled out as quickly as possible. What would you like me to do to proceed?
Comment 24 Rob Lanphier 2010-10-26 19:45:41 UTC
Hi Kim, this is being discussed on IRC on #mediawiki: http://toolserver.org/~mwbot/logs/%23mediawiki/20101026.txt  

Roan had a few concerns (see log) that would need to be addressed before deploying something like this.  Mark Hershberger (hexmode on IRC) can also help with the review process.

We still need someone to volunteer as the main developer to make the fixes.  Kim, is that what you're signing up for?
Comment 25 Jason Giglio 2010-10-26 21:06:19 UTC
I will try to find some time to look at it again.  I had a version that was partially rewritten.

I concur with Roan's observations about the code quality.  It's really not up to modern PHP standards.

Don't count this as "volunteering to be the main developer", but count it as an "I'll try to look at it".
Comment 26 Kim Bruning 2010-11-07 00:43:10 UTC
(In reply to comment #25)
> I will try to find some time to look at it again.  I had a version that was
> partially rewritten.
> 

Cool, Can you show me?
Comment 27 Mark A. Hershberger 2010-11-09 18:49:09 UTC
Kim, Jason,

I'm contacting you via email to see if we can work with you to get this fixed up and deployed.

Mark.
Comment 28 Mark A. Hershberger 2010-11-12 03:22:36 UTC
Copying this from my notes, will put on CR if it doesn't format here.

* From RK:
** Detection for FlaggedRevs presence is fragile and I'm not sure it'll even work any more
** Also, there's an arbitrary, unlimited number of joins against categorylinks, that's a no-go (has potential to make the DB servers sweat)
** It's doing DIY category intersection
** Also, at least one of the two sort modes is likely to be very DB-inefficient. Can't tell for sure cause I can't read the query too well, the code is not very pretty and the query is complex and very dynamic
** Amgine: That's fine; if you can give me a general idea of what the query is gonna look like, I can poke at it (to optimize the above)
** Amgine: What I've given you just now should keep a dev busy for a few hours already. Am of course available for general help, but don't think it'd be useful to do an exhaustive review before those things have been taken care of
* From me:
** Echo?!?! REALLY? Don't do it. That and direct output (e.g. closing ?>) should not be done
*** To read: http://www.mediawiki.org/wiki/Manual:Special_pages#OutputPage.php
** Don't try to create XML yourself; it doesn't work. Use XMLWriter or DOM to create XML. That way you don't have to do xmlEncode(). (A minimal XMLWriter sketch follows this list.)
** Check  http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=74288 it looks like there are some things in the GNSM format that should be handled.
*** should limit number of urls (1000?)
*** is there a way to integrate multiple site maps
*** document robots.txt?
** Hate the use of params as member variable.
*** You should just copy things from wgRequest after checking them.
*** Having a member variable named params seems like it is the beginning of confusion.
*** Also put wgRequest directly in the foreach, or at the very least closer to the point of use to avoid confusion with the others. 
*** If you don't intend for the result of $wgRequest to be used, then it shouldn't be in $params at all.
** Code indentation looks lacking.
** Incorrect use of wfEmptyMsg(??). Not sure about this, but the call doesn't match the function declaration in GlobalFunctions, while examples elsewhere abound.
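A minimal sketch of the XMLWriter approach suggested above (element names and values are illustrative, not the extension's actual output code):

    $writer = new XMLWriter();
    $writer->openMemory();
    $writer->startDocument( '1.0', 'UTF-8' );
    $writer->startElement( 'urlset' );
    $writer->writeAttribute( 'xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9' );
    $writer->writeAttribute( 'xmlns:news', 'http://www.google.com/schemas/sitemap-news/0.9' );

    $writer->startElement( 'url' );
    $writer->writeElement( 'loc', 'http://en.wikinews.org/wiki/Some_article' );
    $writer->startElement( 'news:news' );
    $writer->writeElement( 'news:publication_date', '2010-11-12' );
    $writer->writeElement( 'news:keywords', 'Politics' );
    $writer->endElement(); // news:news
    $writer->endElement(); // url

    $writer->endElement(); // urlset
    $writer->endDocument();
    echo $writer->outputMemory(); // escaping is handled by XMLWriter, no hand-rolled xmlEncode()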
Comment 29 Mark A. Hershberger 2010-11-12 03:45:15 UTC
See r76544 for an easier-to-read review.
Comment 30 Jason Giglio 2010-11-12 19:03:42 UTC
Thanks Mark.   That will be helpful. 

As an update, Kim and I are trying to find my partially refactored version.  I'm sure we have a copy somewhere. :)
Comment 31 Kim Bruning 2010-11-15 04:53:19 UTC
And I have now found the partially refactored version.
Comment 32 Mark A. Hershberger 2010-11-23 18:33:33 UTC
Kim, Could you commit that code?
Comment 33 Kim Bruning 2010-11-25 01:27:56 UTC
(In reply to comment #32)
> Kim, Could you commit that code?

Sure. One small snag. 

Going through the steps to commit,

While attempting to merge and compare with code that was already there,

the location:
    http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/GNSM 

no longer appears to exist.

Further, the google query:
    GNSM site:http://svn.wikimedia.org/

returns no hits.

Has this been deleted or moved?  What is the current location?
Comment 35 Kim Bruning 2010-11-25 01:43:31 UTC
Ok, there are some pretty major discrepancies. All the files have been renamed, for starters. I'm going to have to take a better look this coming week. ^^;;
Comment 36 Jason Giglio 2010-11-29 05:12:58 UTC
I know I merged two of the files.  The renaming, I don't remember if I did it or if it's happened in SVN since my development started.  

As far as my changes, mostly from memory:

Started to clean up the parameter handling to use $wgRequest instead of the home-rolled thing. This means the documentation is going to be wrong about the naked options (i.e. you need to use var=option instead of just /option).

Major cleanup of the incorrect indent levels.  All spaces should be tab chars now per the MW standards.

Merged the XML generation into the main file.  Still need to move that to XMLWriter or something similar, but the structure of what it's doing is similar now.  Moved some of it out of a class where it didn't really belong.

Cut out some sections of cruft, and simplified many of the deeply nested "if" statements to use early return logic.

Broke the "notcategories" handling, I don't remember why but there was a good reason.  I think it was part of the params conversion.
Comment 37 Sam Reed (reedy) 2010-11-29 08:35:43 UTC
For an "easy" changelog, please see http://www.mediawiki.org/w/index.php?path=/trunk/extensions/GoogleNewsSitemap&title=Special:Code/MediaWiki/path

There haven't been many functional changes (bar people going through and removing stuff while doing other general cleanup along the way).

I split the code files.

A version of Mark Hershberger's review above is also on Code Review: http://www.mediawiki.org/wiki/Special:Code/MediaWiki/76544#c11017
Comment 38 Jason Giglio 2010-11-29 15:02:45 UTC
I think the easiest thing would be to check my copy in, losing Reedy's changes.

Mine seems to be further along, and it should include all the whitespace fixing that Reedy did as well, since we duplicated effort there.

I don't care about the naming if the SVN name makes more sense.
Comment 39 Sam Reed (reedy) 2010-11-29 15:30:08 UTC
Feel free to put it over the top
Comment 40 Jason Giglio 2011-02-01 22:34:40 UTC
So at this point, most of the major code style problems have been addressed.  There's still a lot of overlap between this and DPL though.  I don't really have a clear way forward because I don't know which DPL features we want to keep in this and which ones can be stripped out.
Comment 41 Mark A. Hershberger 2011-02-11 22:22:47 UTC
Jason:

Sorry I didn't see your update earlier.  The most recent changes I see are from you about a month ago (2010-12-31).  Let me know if those are your most recent changes and I or someone else will look at what more needs to happen.
Comment 42 Amgine 2011-03-09 05:48:39 UTC
There is a renewed need for this functionality on en.WN. Is there any way WMF devs could take a look at this bug?
Comment 43 Jason Giglio 2011-03-09 16:37:18 UTC
The code style issues raised before should be addressed now.   SQL is parameterized, and the output is using an XML library.

The only remaining problem is the large overlap between this and DPL. Without a clear statement of what functionality can be removed and what is required, no one can go forward with stripping out the unnecessary DPL-based functions.
Comment 44 Sam Reed (reedy) 2011-03-09 17:01:31 UTC
It's annoying. Though, if it's not an easy task to strip out the duplication, IMHO it's not really a blocker for deployment, as long as the code is up to standard.
Comment 45 Sam Reed (reedy) 2011-03-09 17:15:13 UTC
Though, the echo calls need tidying up...
Comment 46 Jason Giglio 2011-03-09 19:18:47 UTC
I don't think we ever want to output the full skinned HTML here.
Comment 47 Bawolff (Brian Wolff) 2011-03-17 04:51:39 UTC
I poked at this last weekend, and a little bit this week.

I think most of the issues brought up in previous reviews have been addressed, and that this might be ready for another review.

Some remaining issues that I'm not sure and would really like some input on:
Major ones:
*Caching. Currently it caches the feed in memcache for a maximum of 12 hours. (It also checks whether the feed is still valid, so the moment someone adds something to the relevant categories the cache is no longer considered valid, so everything is nice. A rough sketch of the check-then-use pattern follows this list.)
**Is 12 hours an appropriate maximum time limit (given that cache entries are checked to still be good before being used)?
**Should we be sending maxage/s-maxage headers to cache on the client side and/or in the Squids? (Currently it's sent with Cache-Control: private, I believe.)
***If we did do caching at that level, the longest (s-)maxage that would be acceptable would be very short (like half an hour), since we want these things to be up to date.
**Should cache times differ between the Sitemap and RSS/Atom formats?

*It checks whether cl_timestamp is null, and returns the current timestamp if so. This seems very wrong. I'm not sure under what circumstances a null cl_timestamp can happen (there was one but it looked accidental and is now removed). I'm not sure if checking for a null cl_timestamp and throwing an exception if that happens is the right way to go. (Is there any circumstance where cl_timestamp can be null?)

*The flagged revision support is flaky.
This may be a valid concern. It's outside my knowledge of FlaggedRevs what the best approach is. It appears that DPL also detects FlaggedRevs in the same way.

Minor ones:
*The default parameters could be tweaked. I'm not sure what the ideal defaults would be; a lot depends on which of the use cases we want to optimize for.
*[bikeshed issue] This extension does RSS feeds based on category membership. That feature may be of general interest to some people, totally independent of the sitemap aspect. Perhaps a more general name should be chosen, like IntersectionFeed. But that's really not important.
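A rough sketch of the check-then-use caching pattern described in the first point above (everything here except getCacheInvalidationInfo() is a made-up name for illustration, not the extension's actual code):

    global $wgMemc;

    $key = wfMemcKey( 'gnsm', md5( serialize( $params ) ) ); // hypothetical cache key
    $maxCacheTime = 12 * 3600; // the 12-hour maximum discussed above

    // Cheap summary of the relevant categories (page counts / latest cl_timestamp);
    // if this differs from what was stored, the cached feed is treated as stale.
    $invalidationInfo = $this->getCacheInvalidationInfo();

    $cached = $wgMemc->get( $key );
    if ( $cached && $cached['check'] === $invalidationInfo ) {
        $xml = $cached['xml']; // still valid, reuse it
    } else {
        $xml = $this->generateFeed(); // hypothetical
        $wgMemc->set( $key, array( 'check' => $invalidationInfo, 'xml' => $xml ), $maxCacheTime );
    }
    echo $xml;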




Response to some of the review comments:
*"It's doing DIY category intersection"

That is sort of the feature request. (The intersection part; the DIY part I'm not sure what was meant.)

*"Also, at least one of the two sort modes is likely to be very DB-inefficient. Can't tell for sure cause I can't read the query too well, the code is not very pretty and the query is complex and very dynamic"

The lastedit sort order (which is really ORDER BY last_touched, which doesn't correspond to an edit) is rather scary. (I'm not very familiar with DB scaling, but I think it does a full table scan of the page table and a filesort, which is very bad from my understanding.) Anyways, lastedit is not very important to the use case and could be removed.

The categoryadd sort order is not as bad (still not great). It might filesort if the smallest category specified is not the first category specified. It is no worse than the already deployed intersection (DynamicPageList) extension (which has many scary queries ;). Actually it's the same query as intersection.
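For context, the category intersection being discussed boils down to a query of roughly this shape, with one categorylinks join per requested category (a sketch using the DB abstraction layer; the columns and category names are illustrative, not the extension's exact query):

    $dbr = wfGetDB( DB_SLAVE );
    $res = $dbr->select(
        array( 'page', 'c1' => 'categorylinks', 'c2' => 'categorylinks' ),
        array( 'page_namespace', 'page_title', 'c1.cl_timestamp' ),
        array(
            'c1.cl_to' => 'Published',
            'c2.cl_to' => 'Politics_and_conflicts',
        ),
        __METHOD__,
        array( 'ORDER BY' => 'c1.cl_timestamp DESC', 'LIMIT' => 50 ),
        array(
            'c1' => array( 'INNER JOIN', 'c1.cl_from = page_id' ),
            'c2' => array( 'INNER JOIN', 'c2.cl_from = page_id' ),
        )
    );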

*"General hating on direct output/echo/$wgOut->disable()".

This is done the same way as it's done in core. In an ideal world, this extension might be better done as an API module, but I don't think that's a critical issue. Most special pages that output an XML document call $wgOut->disable(). All the feed stuff in core calls $wgOut->disable() and outputs with echo.

*"is there a way to integrate multiple site maps?"
We are nowhere near outputting more than 1000 articles over a two-day period, so I don't feel this is a near- or long-term concern.


Thanks,
-bawolff
Comment 48 Bawolff (Brian Wolff) 2011-03-17 15:37:30 UTC
The other thing I wanted to mention is the order parameter (order ascending or descending). Do we really need that? Is there any valid use case for wanting an RSS feed of the earliest 20 articles?
Comment 49 Mark A. Hershberger 2011-04-19 19:58:36 UTC
(In reply to comment #47)

> **Is 12 hours an appropriate max time limit (Given that cache entries are
> checked that they're still good before being used).

IMO, yes.  Someone who really knows about caching would have to give you a more definitive answer.  See Roan or Tim.

How are the cache entries checked? (I'm about to look at the code, so I may find out.)

> **Should we be sending maxage/s-maxage headers to cache on the client side
> and/or in squids. (Currently its sent with cache-control private i believe)

I think so.  Again, ask Tim or Roan to be sure.

> **Should cache times differ between Sitemap and RSS/ATOM formats.

No idea.  Why would you want them to be different?

> * (Is there any circumstance cl_timestamp can be null?)

Dunno.

> *The flagged revision support is flaky.

We should definitely get Aaron Schultz to comment.

(More in a bit.)
Comment 50 Mark A. Hershberger 2011-04-19 21:23:54 UTC
Adding  Aaron to get his input on FlaggedRevisions support.
Comment 51 Mark A. Hershberger 2011-04-19 21:41:19 UTC
(In reply to comment #48)
> The other thing I wanted to mention is the order parameter (order ascending or
> descending). Do we really need that. Is there any valid use case for wanting an
> rss feed of the earliest 20 articles?

I don't think so, no.

(In reply to comment #47)
> *"General hating on direct output/echo/$wgOut->disable()".
>
> This is done the same way as its done in core. In ideal world, this extension
> might be better done as an api module, but I don't think thats a critical
> issue. Most special pages that output an xml document call $wgOut->disable().
> All the feed stuff in core calls $wgOut->disable() and outputs with echo.

I'm pretty sure (but could be wrong) that using the excuse, "core works that way" is the wrong reasoning for an extension to use.  If you know of another widely deployed extension that does output this way, then I think you have a case.

Now, going to look at the code.
Comment 52 Bawolff (Brian Wolff) 2011-04-20 02:57:24 UTC
>I'm pretty sure (but could be wrong) that using the excuse, "core works that
>way" is the wrong reasoning for an extension to use.  If you know of another
>widely deployed extension that does output this way, then I think you have a
>case.

I'm pretty sure that if we didn't do that, we'd need to re-implement most of the RSS stuff in the extension, which seems pointless given the classes in core. As for other extensions, we've got SiteMatrix (when you do ?action=raw), OpenID, OAI, and some others.
Comment 53 Jason Giglio 2011-04-20 15:55:24 UTC
I don't see any point in not using direct output either, and I don't see a reasonable alternative to it in this situation.
Comment 54 Roan Kattouw 2011-04-22 11:22:14 UTC
(In reply to comment #49)
> (In reply to comment #47)
> 
> > **Is 12 hours an appropriate max time limit (Given that cache entries are
> > checked that they're still good before being used).
> 
> IMO, yes.  Someone who really knows about caching would have to give you a more
> definitive answer.  See Roan or Tim.
> 
12 hours seems good to me. Given that you're checking the validity using a query before using the cached data, I guess a higher value like 24 or 48 might be better, but I have no strong feelings either way.

> How are the cache entries checked? (I'm about to look at the code, so I may
> find out.)
> 
Basically it queries cat_pages for every category (using a single query) and MAX(cl_timestamp) for every category (using a big UNION). This is in getCacheInvalidationInfo() .
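In other words, the invalidation data comes from queries shaped roughly like the following (a simplified sketch, not the extension's exact code; $categoryNames stands in for the categories in the request):

    $dbr = wfGetDB( DB_SLAVE );

    // Page counts for all relevant categories in a single query
    $counts = $dbr->select(
        'category',
        array( 'cat_title', 'cat_pages' ),
        array( 'cat_title' => $categoryNames ),
        __METHOD__
    );

    // Latest cl_timestamp per category, combined into one big UNION
    $sqls = array();
    foreach ( $categoryNames as $cat ) {
        $sqls[] = $dbr->selectSQLText(
            'categorylinks',
            'MAX(cl_timestamp) AS latest',
            array( 'cl_to' => $cat ),
            __METHOD__
        );
    }
    $timestamps = $dbr->query( $dbr->unionQueries( $sqls, false ), __METHOD__ );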

> > **Should we be sending maxage/s-maxage headers to cache on the client side
> > and/or in squids. (Currently its sent with cache-control private i believe)
> 
> I think so.  Again, ask Tim or Roan to be sure.
> 
Yes, you should set Cache-Control: public, max-age=1800, s-maxage=1800 to get the half-hour caching you suggested. If you do that and ensure stable URLs (i.e. always use the same URL for the same data) there'll only be one app server regenerating each sitemap once every half hour at most, which sounds very reasonable to me.

> > **Should cache times differ between Sitemap and RSS/ATOM formats.
> 
> No idea.  Why would you want them to be different?
> 
I agree with Mark here, unless you have a good reason for them to be different, make them the same.
Comment 55 Bawolff (Brian Wolff) 2011-04-28 20:26:27 UTC
>> > **Should cache times differ between Sitemap and RSS/ATOM formats.
>> 
>> No idea.  Why would you want them to be different?
>> 
>I agree with Mark here, unless you have a good reason for them to be different,
>make them the same.

The reasoning for maybe doing that is that people would want the published articles appearing on Google the moment they're published, whereas the general RSS feeds are less important. Thus it might make sense to cache the RSS/Atom output in Squid for half an hour or whatnot, but have the GNSM output not be Squid-cached.
Comment 56 Bawolff (Brian Wolff) 2011-04-29 22:11:19 UTC
It should now have an s-maxage of 1800 in r87140.

I was originally going to output the header directly in order to set max-age to 1800 as well, but it appears that the underlying RSS feed class in core already uses $wgOut->sendCacheControl, so I just used $wgOut's methods for setting s-maxage. This means it doesn't set the max-age header for client-side caching, and it won't set the s-maxage header if the user is logged in, which is not ideal, but I don't think it's really a problem.
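In OutputPage terms, the relevant bit presumably amounts to something like this (sketch, not the exact committed code):

    global $wgOut;
    // Ask Squid / shared caches to keep the response for half an hour.
    // OutputPage::sendCacheControl() only emits s-maxage for anonymous requests,
    // and it does not add a client-side max-age, as noted above.
    $wgOut->setSquidMaxage( 1800 );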
Comment 57 Chad H. 2011-04-29 23:57:52 UTC
I've deployed this to testwiki. I encourage people to test it out a bit then we can look at a push to the live site.
Comment 58 p858snake 2011-04-30 00:09:09 UTC
*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*
Comment 59 Bawolff (Brian Wolff) 2011-04-30 01:54:29 UTC
(In reply to comment #57)
> I've deployed this to testwiki. I encourage people to test it out a bit then we
> can look at a push to the live site.

Could you also deploy r86451 of FlaggedRevs? The flagged revs part requires at least that revision. Thanks.

I've also noticed some messages seem to be missing (?)
Comment 60 Bawolff (Brian Wolff) 2011-05-02 06:20:06 UTC
Ok, there were some serious issues with how it did flagged revs (as in it caused a DB error or otherwise didn't work), but I believe I fixed them in r87236. They appear to have been in the extension since the get-go, so I guess I somehow managed to never test the FlaggedRevs part of this extension until now :s. Whoops.

On a related note, the extension won't recognize that the cache (the part stored in $wgMemc that lasts 12 hours) is outdated if a page suddenly becomes includable because it got sighted (or had all its revisions unsighted). I do not think this is a significant issue for Wikinews, since any change in sighting also involves adding (respectively removing) Category:Published on the article, and if in doubt the extension respects ?action=purge. However, it could probably be addressed by storing and checking how many articles are sighted in namespace X (which I believe is stored somewhere in the DB, since it's shown on Special:ValidationStatistics), similar to what it currently does for categories.
Comment 61 Chad H. 2011-05-18 17:53:29 UTC
Has been deployed to enwikinews -> FIXED
Comment 62 Amgine 2011-05-18 22:26:17 UTC
<coughs> Temporarily undeployed. I hope.
Comment 63 Kim Bruning 2011-05-18 23:58:19 UTC
Somewhat befuddled, I went information-hunting on IRC.

GNSM is temporarily undeployed due to a CPU usage spike on the WN server. Several people are on planes at the moment. A new deployment will be tried on Friday.
Comment 64 Kim Bruning 2011-05-19 00:14:05 UTC
The CPU spike might have been coincidental with the activation of GNSM. The next try will probably be Monday (not Friday). The actual cause of the CPU spike has not yet been determined.
Comment 65 Bawolff (Brian Wolff) 2011-05-23 06:33:56 UTC
btw, before I forget: when this does get re-enabled, enwikinews should have the following config setting:
$wgGNSMcommentNamespace = 102;

Since enwikinews uses namespace 102 instead of the normal talk namespace for "comments". (Not that important, since it's only output in RSS mode and no one uses it, but might as well.)
Comment 66 Bawolff (Brian Wolff) 2011-05-26 23:27:31 UTC
Alrighty, the sitemap appears to work.

In order to actually submit it to Google, we need someone with access to "Google Webmaster Tools" for the en.wikinews domain. Hopefully someone in ops has such access? See http://www.google.com/support/webmasters/bin/answer.py?answer=74289
Comment 67 Sam Reed (reedy) 2011-05-27 00:04:07 UTC
It'll just require someone to verify authenticity and set it up.

There are 3 ways to do this, from memory.
Comment 68 Amgine 2011-05-27 14:15:10 UTC
(In reply to comment #66)
> In order to actually submit it to google, we need someone with access to
> "Google Webmaster Tools" for the en.wikinews domain.

Bawolff: have you verified with User:Brian_McNeil that this step has not already been accomplished?
Comment 69 Platonides 2011-05-27 15:11:21 UTC
I think "someone" already has Google Webmaster Tools access.
Comment 70 Bawolff (Brian Wolff) 2011-05-27 18:57:10 UTC
(In reply to comment #68)
> (In reply to comment #66)
> > In order to actually submit it to google, we need someone with access to
> > "Google Webmaster Tools" for the en.wikinews domain.
> 
> Bawolff: have you verified with User:Brian_McNeil that this step has not
> already been accomplished?

Yes, I discussed it with him on IRC. It has not been done yet. He mentioned that he had some contact info for Google on an old computer that would be difficult to find. I don't think he has access to Google Webmaster Tools.

-----

btw, just to clarify, the relevant URL we'd want to submit to Google would be: http://en.wikinews.org/wiki/Special:NewsFeed?hourcount=48&notcategories=AutoArchived|Archived|No_publish|disputed
Comment 72 Bawolff (Brian Wolff) 2011-05-27 19:06:38 UTC
(In reply to comment #71)
> Shouldn't adding
> 
>  Sitemap:
> http://en.wikinews.org/wiki/Special:NewsFeed?hourcount=48&notcategories=AutoArchived|Archived|No_publish|disputed
> 
> to http://en.wikinews.org/wiki/MediaWiki:Robots.txt work too?

According to the docs, if you've previously submitted one and just need to change the URL, you can do that, but the first time you submit a news sitemap you need to do it through Webmaster Tools. See http://www.google.com/support/webmasters/bin/answer.py?answer=74289
Comment 73 p858snake 2011-05-27 23:29:37 UTC
NeilK might be worth talking to; iirc he was the WMF employee talking to Google about our sitemaps support.

(CCing NeilK).
Comment 74 Neil Kandalgaonkar 2011-05-28 17:01:16 UTC
I don't know anything special about the Google News Sitemap format. I was looking into the Image extensions for Commons. Then I got interested in the question of why we don't do sitemaps generally. I've collected some info about that which I can share here, but I think that WikiNews is thinking somewhat differently about Sitemaps.

It seems that you're really interested in getting recent items indexed fast, so GNSM works (as far as I can tell) more like an RSS feeder -- there's a single URL you hit to get the latest stuff, and if you miss an hour or two, too bad.

But for Wikipedia and Commons, we want the entire collection indexed, and hopefully by more than just Google, and with richer metadata. This is going to take some kind of incremental approach, long term, although we can do some of it even with the old batch-oriented sitemap scripts.
Comment 75 Bawolff (Brian Wolff) 2011-05-28 23:32:03 UTC
(In reply to comment #74)
> I don't know anything special about the Google News Sitemap format. I was
> looking into the Image extensions for Commons. Then I got interested in the
> question of why we don't do sitemaps generally. I've collected some info about
> that which I can share here, but I think that WikiNews is thinking somewhat
> differently about Sitemaps.
> 
> It seems that you're really interested in getting recent items indexed fast, so
> GNSM works (as far as I can tell) more like an RSS feeder -- there's a single
> URL you hit to get the latest stuff, and if you miss an hour or two, too bad.
> 
> But for Wikipedia and Commons, we want the entire collection indexed, and
> hopefully by more than just Google, and with richer metadata. This is going to
> take some kind of incremental approach, long term, although we can do some of
> it even with the old batch-oriented sitemap scripts.

Yes, quite correct. Normal sitemaps and "news" sitemaps serve rather different purposes. At this point we just need someone with access to Google Webmaster Tools.
Comment 76 Amgine 2011-05-31 22:31:48 UTC
As per hexmode in IRC: I'd like to ask pdhanda to sign up for and verify Google's Webmaster Tools for en.Wikinews, and then force Bawolff to be responsible for it (since xe is both a community member and the current maintainer of the extension).
Comment 77 Amgine 2011-06-09 15:01:46 UTC
Is this bug still waiting for someone to verify for Google's Webmaster Tools? This is a 5-minute fix.
Comment 78 Mark A. Hershberger 2011-06-13 21:46:14 UTC
I now have the account webmaster@wikimedia.org and will stash the password with Robla and CTWoo.  After that, updating this will take just as long as it takes, but no longer.
Comment 79 Mark A. Hershberger 2011-06-13 22:01:36 UTC
Created attachment 8649 [details]
file google wants

For not-too-distant future reference.
Comment 80 p858snake 2011-06-13 22:24:46 UTC
+patch, we don't remove this keyword ;p
Comment 81 Ryan Lane 2011-06-13 23:05:35 UTC
Pushed the file.
Comment 82 Mark A. Hershberger 2011-06-13 23:57:28 UTC
finally closing!


