Last modified: 2009-11-25 14:04:24 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T22481, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 20481 - Wget and Perl get stale pages
Status: CLOSED INVALID
Product: MediaWiki
Classification: Unclassified
Component: Database
Version: 1.16.x
Hardware: All All
Importance: Normal major
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://en.wikipedia.org/wiki/Doria's_...
Depends on:
Blocks:
Reported: 2009-09-03 09:03 UTC by Lee G
Modified: 2009-11-25 14:04 UTC
CC List: 2 users

Description Lee G 2009-09-03 09:03:55 UTC
http://en.wikipedia.org/wiki/Doria's_Tree-kangaroo

Wget and Perl get stale pages

perl -w -MLWP::UserAgent -e "print LWP::UserAgent->new->get('http://en.wikipedia.org/wiki/Doria\'s_Tree-kangaroo')->decoded_content"

Both wget and Perl get a version of the page from August, not the version I fixed in September.
Comment 1 Max Semenik 2009-11-24 18:10:28 UTC
Encoding problem? The URL should be http://en.wikipedia.org/wiki/Doria%27s_Tree-kangaroo
Comment 2 Lee G 2009-11-25 07:09:07 UTC
Thanks for the suggestion, but it is not an encoding problem: the UTF-8 URI is fine. Try it in your browser: http://en.wikipedia.org/wiki/Doria's_Tree-kangaroo

You can see my edit on the history.

Note that the backslash-escaped apostrophe within the URI in the Perl example is there simply because the Perl code quotes the URI in single quotes. Using a different quote operator would have been clearer:

perl -w -MLWP::UserAgent -e "print LWP::UserAgent->new->get(qq(http://en.wikipedia.org/wiki/Doria's_Tree-kangaroo))->decoded_content" 
Comment 3 Tim Starling 2009-11-25 07:21:12 UTC
Can't reproduce. Did someone purge it?
Comment 4 Tim Starling 2009-11-25 07:43:05 UTC
It's not important. MediaWiki will only purge canonical article URLs, in this case the one with the %27 not the one with the '. It can't purge every possible variant URL, there are thousands. That is not a bug. So if you request non-canonical URLs you can expect to get stale pages.

You could argue that it's a bug that it responds to the bad URL with a 200, when it should redirect. But that's a topic for another bug report; I'm marking this one invalid.
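
A rough sketch of why purging every variant is impractical, assuming the only variation is whether each character of the title is written literally or percent-encoded (real variants also include lowercase hex digits and other spellings). The names `TITLE`, `variants` and `variant_count` are hypothetical and exist only for this illustration; this is not MediaWiki code.

```python
from itertools import islice, product
from urllib.parse import unquote

# Assumption: an ASCII-only title, so each character maps to one %XX escape.
TITLE = "Doria's_Tree-kangaroo"


def variants(title):
    """Lazily yield every spelling of `title` in which each character is
    either left literal or replaced by its percent-encoded form."""
    choices = [(ch, "%{:02X}".format(ord(ch))) for ch in title]
    for combo in product(*choices):
        yield "".join(combo)


def variant_count(title):
    """Number of spellings produced by `variants`: two choices per character."""
    return 2 ** len(title)


# Look at only the first few spellings; the full set is far too large to list.
first_few = list(islice(variants(TITLE), 4))

for spelling in first_few:
    # Every spelling decodes back to the same title.
    print(spelling, "->", unquote(spelling))

print("total spellings under this assumption:", variant_count(TITLE))
```

Even under this narrow assumption the count grows exponentially with the title length, which is consistent with the point that only one agreed spelling can realistically be purged.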
Comment 5 Lee G 2009-11-25 08:26:06 UTC
(In reply to comment #4)
> It's not important. MediaWiki will only purge canonical article URLs, in this
> case the one with the %27 not the one with the '. It can't purge every possible
> variant URL, there are thousands. That is not a bug. So if you request
> non-canonical URLs you can expect to get stale pages.

Can you demonstrate to me why the non-URI-encoded URI is not canonical? 

The URI seems to return 200/OK, with full content, which does not mention a redirect, nor a canonical URI.

Are the two URIs not identical in effect, just differently encoded?
Comment 6 Roan Kattouw 2009-11-25 10:49:46 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > It's not important. MediaWiki will only purge canonical article URLs, in this
> > case the one with the %27 not the one with the '. It can't purge every possible
> > variant URL, there are thousands. That is not a bug. So if you request
> > non-canonical URLs you can expect to get stale pages.
> 
> Can you demonstrate to me why the non-URI-encoded URI is not canonical? 
> 
Because the URL-encoded one is the URL MediaWiki gives you when you search for or link to this article, I guess.

> The URI seems to return 200/OK, with full content, which does not mention a
> redirect, nor a canonical URI.
> 
Like Tim said, that should be fixed, but it's a separate bug.

> Are the two URIs not identical in effect, just differently encoded?
> 
Squid doesn't seem to think so, as it seems to cache the two separately.
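
A toy model of the behaviour described here, assuming a cache that keys entries on the exact URL string it receives. `ToyCache`, `backend`, `RAW` and `ENCODED` are hypothetical names for illustration; this is not how Squid is implemented, only a sketch of the observable effect.

```python
class ToyCache:
    """A cache keyed on the literal URL string, with no normalization."""

    def __init__(self):
        self.store = {}

    def get(self, url, fetch):
        # Serve from the cache if this exact string was seen before.
        if url not in self.store:
            self.store[url] = fetch()
        return self.store[url]

    def purge(self, url):
        # Only the exact key is removed; other spellings are untouched.
        self.store.pop(url, None)


page = {"text": "August version"}


def backend():
    """Stand-in for the wiki rendering the current revision."""
    return page["text"]


RAW = "/wiki/Doria's_Tree-kangaroo"
ENCODED = "/wiki/Doria%27s_Tree-kangaroo"

cache = ToyCache()
cache.get(RAW, backend)      # both spellings get cached in August
cache.get(ENCODED, backend)

page["text"] = "September version"  # the article is edited
cache.purge(ENCODED)                # only the canonical spelling is purged

fresh = cache.get(ENCODED, backend)
stale = cache.get(RAW, backend)
print(fresh)
print(stale)
```

In this model the request for the raw-apostrophe spelling keeps returning the old copy until that entry expires or is purged by its own key.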
Comment 7 Lee G 2009-11-25 13:15:21 UTC
It might help if you link to the "separate bug" you mention, or if you are the first to identify it, create the ticket and link to it.

As for Squid's opinion, is that relevant to MediaWiki?

Your use of "canonical" seems at odds with that in the mark-up of the Wiki pages, where both the encoded and non-encoded URIs are canonical, as opposed to a different page which redirects to the one in question. 

I'm surprised I had to say this.
Comment 8 Roan Kattouw 2009-11-25 13:30:16 UTC
(In reply to comment #7)
> It might help if you link to the "separate bug" you mention, or if you are the
> first to identify it, create the ticket and link to it.
> 
That would be bug 21027.

> As for Squid's opinion, is that relevant to MediaWiki?
> 
It is relevant to Wikimedia wikis, as they run a Squid caching layer on top of MediaWiki.

> Your use of "canonical" seems at odds with that in the mark-up of the Wiki
> pages, where both the encoded and non-encoded URIs are canonical, as opposed to
> a different page which redirects to the one in question. 
> 
> I'm surprised I had to say this.
> 
Do wiki pages contain different URLs for the same page? I would expect all URLs generated by MediaWiki to be canonicalized somehow, i.e. that all links to the same page use the same URL.
Comment 9 Lee G 2009-11-25 13:48:19 UTC
Thank you for providing the link to the ticket.

Do you not think the use of Squid is separate from MediaWiki? Should not both implement the same standards for URIs/URLs? Ideally, the URI should be treated the same whether encoded or not, by all software. 

The term "canonical", in terms of URLs/URIs in general, is reasonably well defined on Wikipedia: http://en.wikipedia.org/wiki/URL_normalization

In terms of MediaWiki end-users, I think the term is used to refer to the ultimate resource for a term. If page A instantly redirects to page B, because an editorial decision has been made that term A is a synonym for term B, then page A will contain a JavaScript variable with the canonical term.

In both cases, URI-encoding is dropped prior to the creation of the canonical ID.

I am sorry I do not have an example to hand.

Cheers
Lee
Comment 10 Roan Kattouw 2009-11-25 13:54:48 UTC
(In reply to comment #9)
> Thank you for providing the link to the ticket.
> 
> Do you not think the use of Squid is separate from MediaWiki? Should not both
> implement the same standards for URIs/URLs? Ideally, the URI should be treated
> the same whether encoded or not, by all software. 
> 
Of course. A suggestion made on the linked bug is that Squid should be fixed to purge alternates as well.

> The term "canonical", in terms of URLs/URIs in general, is reasonably well
> defined on Wikipedia: http://en.wikipedia.org/wiki/URL_normalization
> 
> In terms of MediaWiki end-users, I think the term is used to refer to the
> ultimate resource for a term. If page A instantly redirects to page B, because
> an editorial decision has been made that term A is a synonym for term B, then
> page A will contain a JavaScript variable with the canonical term.
> 
> In both cases, URI-encoding is dropped prior to the creation of the canonical
> ID.
> 
> I am sorry I do not have an example to hand.
> 
I understand what you mean by 'canonical', and I agree that each page should have exactly one canonical URL. It's my impression that MediaWiki enforces this correctly apart from not redirecting on under- or over-encoded URLs, as outlined in this bug and the one I linked to.
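
A minimal sketch of the redirect behaviour discussed here, assuming the canonical spelling is whatever Python's `urllib.parse.quote` produces after fully decoding the path. MediaWiki's own encoding rules may differ in which characters they leave unescaped; `canonical_path` and `respond` are hypothetical helpers, not MediaWiki functions.

```python
from urllib.parse import quote, unquote


def canonical_path(path):
    """Decode the path, then re-encode it in one fixed way.

    Assumption: quote() with safe="/" stands in for the wiki's canonical
    encoding. It leaves letters, digits and "_.-~" literal and escapes the
    apostrophe as %27.
    """
    return quote(unquote(path), safe="/")


def respond(path):
    """Return (status, location): 301 to the canonical spelling when the
    request is under- or over-encoded, otherwise 200 for the path itself."""
    target = canonical_path(path)
    if target != path:
        return 301, target
    return 200, path


for requested in (
    "/wiki/Doria's_Tree-kangaroo",      # under-encoded
    "/wiki/%44oria%27s_Tree-kangaroo",  # over-encoded
    "/wiki/Doria%27s_Tree-kangaroo",    # already canonical
):
    print(requested, "->", respond(requested))
```

With a redirect like this in place, a cache in front of the wiki would only ever store page content under the canonical spelling, so purging that one URL would be sufficient. A real implementation would need extra care with escapes such as %2F, which this sketch decodes into a path separator.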
