Last modified: 2012-04-12 13:53:48 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T27718, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 25718 - Title->getLocalURL does not respect $wgArticlePath when given a query
Status: RESOLVED WONTFIX
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: 1.16.x
Hardware: All
OS: All
Priority: Lowest
Severity: minor
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: patch, patch-need-review
Reported: 2010-10-30 16:45 UTC by Kiran Jonnalagadda
Modified: 2012-04-12 13:53 UTC
CC: 4 users



Attachments
Patch to add handling for $wgArticlePath with query parameters (384 bytes, patch)
2010-10-30 17:10 UTC, Kiran Jonnalagadda

Description Kiran Jonnalagadda 2010-10-30 16:45:15 UTC
MediaWiki's includes/Title.php defines a Title class with a getLocalURL function. If $wgArticlePath is defined (for pretty URLs) when this function is called, it returns a pretty URL. However, if a query parameter is given (without an action), $wgArticlePath is not considered. Compare the two code paths:

Without a query parameter:
http://svn.wikimedia.org/viewvc/mediawiki/tags/REL1_16_0/phase3/includes/Title.php?view=markup#l801

With a query parameter:
http://svn.wikimedia.org/viewvc/mediawiki/tags/REL1_16_0/phase3/includes/Title.php?view=markup#l824

This shows up with extensions that add query parameters to some pages. Here's an example from the Semantic Result Formats calendar:

Calendar with current month: http://discoursedb.org/wiki/Opinion_calendar
Calendar with specific month: http://discoursedb.org/w/index.php?title=Opinion_calendar&month=11&year=2010

Expected URL: http://discoursedb.org/wiki/Opinion_calendar?month=11&year=2010
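
For illustration, a minimal sketch of the branching described above (not the actual Title.php code; the function and helper usage here are simplified):

  // Minimal sketch of the 1.16 behaviour described above; the real
  // method also handles $wgActionPaths, variants, and interwiki links.
  function getLocalUrlSketch( $dbkey, $query = '' ) {
      global $wgArticlePath, $wgScript;
      if ( $query == '' ) {
          // No query: $wgArticlePath is respected (line 801).
          return str_replace( '$1', wfUrlencode( $dbkey ), $wgArticlePath );
      }
      // Any query: the long form is used (line 824) and
      // $wgArticlePath is ignored entirely.
      return "{$wgScript}?title=" . wfUrlencode( $dbkey ) . '&' . $query;
  }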
Comment 1 Kiran Jonnalagadda 2010-10-30 17:10:58 UTC
Created attachment 7767
Patch to add handling for $wgArticlePath with query parameters

The patch fixes the described problem.

It does not handle the situation where the query contains a second title parameter (example: http://example.com/wiki/Page_title?title=Page_title&month=11&year=2010). This does not seem to be a problem, as MediaWiki ignores the second title parameter.
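
For readers of this archive without the attachment, a hypothetical reconstruction of what such a change could look like (attachment 7767 remains the authoritative version):

  // Hypothetical reconstruction, not the attachment itself: when
  // $wgArticlePath is set, append the query to the pretty URL rather
  // than falling back to index.php.
  if ( $wgArticlePath != '' ) {
      $url = str_replace( '$1', wfUrlencode( $dbkey ), $wgArticlePath )
           . '?' . $query;
  } else {
      $url = "{$wgScript}?title=" . wfUrlencode( $dbkey ) . '&' . $query;
  }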
Comment 2 Niklas Laxström 2010-10-30 19:05:32 UTC
I think this is by design. If you want prettier URLs you can use $wgActionPaths.
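
For context, $wgActionPaths maps individual actions to their own URL templates; a LocalSettings.php sketch (the paths are illustrative):

  # Illustrative only; any path scheme with a $1 placeholder works.
  $wgActionPaths = array(
      'edit'    => '/edit/$1',
      'history' => '/history/$1',
  );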
Comment 3 Bryan Tong Minh 2010-10-30 19:13:09 UTC
This breaks URLs like action=raw, which can only be used in ugly mode.
Comment 4 Daniel Friesen 2010-10-30 19:38:44 UTC
Additionally, I believe this is by design so that action=edit URLs on wikis configured like Wikipedia always take the /w/index.php style and can be blanket-blocked by robots.txt.

/wiki/Article?action=edit can't adequately be covered in robots.txt, and while noindex is on the page itself, it's much better if search engines can be prevented from wasting bandwidth loading URLs they are just going to get a noindex on.

So you'll either have to use $wgActionPaths or file a bug asking for some sort of $wgShortActions array that can be loaded with an explicit list of actions that should take the /wiki/Article?action= form instead of the default index.php?title=&action= form.
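
For reference, the blanket block described above relies on all dynamic URLs sharing a prefix; a minimal robots.txt sketch of that layout:

  # All index.php URLs live under /w/, so one rule covers every
  # action=edit, action=history, etc. URL; /wiki/ pages stay crawlable.
  User-agent: *
  Disallow: /w/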
Comment 5 Kiran Jonnalagadda 2010-10-30 23:03:04 UTC
getLocalURL() already has special handling for when it sees 'action' in the query parameters. This bug report and patch are for queries that do not specify an action. Therefore, $wgActionPaths does not apply here and action=raw isn't affected.

Here is where 'action' is checked for: http://svn.wikimedia.org/viewvc/mediawiki/tags/REL1_16_0/phase3/includes/Title.php?view=markup#l804

The patch assumes $wgArticlePath is set up for pretty URLs. This may or may not be handled by the wfAppendQuery() function called elsewhere from within getLocalURL(). I'm not sure where wfAppendQuery() is defined.
Comment 6 Platonides 2010-10-30 23:15:16 UTC
If you change a URL like /w/index.php?title=Opinion_calendar&month=11&year=2010
to /wiki/Opinion_calendar?month=11&year=2010

then robots.txt blocking /w/ won't apply, and a spider will crawl the whole calendar.

I think this is a WONTFIX.
Comment 7 Kiran Jonnalagadda 2010-10-30 23:30:18 UTC
I've confirmed that the patch does the right thing even when using ugly URLs.

Platonides has a very good point. Maybe rel=nofollow applies there?
Comment 8 Daniel Friesen 2010-10-31 02:43:32 UTC
Even if you rel=nofollow one URL, someone may use wikitext to link to the same URL without rel=nofollow, and the link could also come from another website. So there is no way to ensure that there are never any non-nofollow links sitting around pointing to URLs that you don't want search engines to spider.

rel=nofollow also does not stop spiders from following a URL. Spiders have lately taken the stance of reading rel=nofollow as "I don't endorse this, don't count it towards PageRank" rather than "never use this link to reach that URL". Even if you rel=nofollow something, they may still follow the link to that page, so the only reliable way to keep a spider from ever crawling a URL is robots.txt. (We know noindex is used on the pages themselves, but who wants search engines crawling every noindexed page they can see? It's a waste of requests on a high-traffic wiki.)
Comment 9 Kiran Jonnalagadda 2010-10-31 11:47:34 UTC
Thank you, Daniel. This is a great explanation.

However, the exact same problem exists without the patch, in MediaWiki's default configuration with ugly URLs. Consider the URLs it would generate for an installation in /wiki:

/wiki/index.php?title=Example_page
/wiki/index.php?title=Example_page&month=11&year=2010

There is no way to use robots.txt to disallow the second URL without blocking the first. This, therefore, is a problem for the Semantic Result Formats extension to fix. It is not introduced by this patch.

As far as I can tell, there is no use case within MediaWiki's default configuration that touches line 824 of Title.php -- there is never a URL that includes a query but not an action parameter (apart from the default 'view' action, which also bypasses line 824).

Does MediaWiki come with unit tests to verify this? I'm new to the source.

It appears that the patch touches an area of code that is rarely used and hence has been long overlooked. The indexing problem it potentially causes with SRF Calendar already exists without the patch. The patch only makes the function behave consistently.
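
For what it's worth, a hypothetical PHPUnit-style check of the behaviour in question might look like this (not taken from MediaWiki's test suite; the test name is made up):

  // Hypothetical test: under the default configuration, a query
  // should force the long index.php form.
  public function testLocalUrlWithQueryUsesScriptPath() {
      $title = Title::newFromText( 'Example page' );
      $url = $title->getLocalURL( 'month=11&year=2010' );
      $this->assertContains( 'index.php', $url );
  }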
Comment 10 Kiran Jonnalagadda 2010-10-31 12:05:31 UTC
One possible solution is to define a new 'null' or 'none' action in $wgActionPaths that gets looked up when no action is called for (this being different from the default 'view' action).

getLocalURL can then use the installation's preferred URL syntax instead of assuming the use of pretty or ugly URLs.
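
Sketched as configuration, the idea might look like this (the 'none' key is hypothetical; no such setting exists in MediaWiki):

  # Hypothetical: consulted when getLocalURL receives a query that
  # carries no action, instead of assuming the long index.php form.
  $wgActionPaths['none'] = '/wiki/$1';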
Comment 11 Kiran Jonnalagadda 2010-10-31 12:49:33 UTC
> As far as I can tell, there is no use case within MediaWiki's default
> configuration that touches line 824 of Title.php -- there is never a URL that
> includes a query but not an action parameter (apart from the default 'view'
> action, which also bypasses line 824).

Whoops. Completely wrong here. History browsing uses query parameters without an action.
Comment 12 Kiran Jonnalagadda 2010-10-31 13:06:54 UTC
> Whoops. Completely wrong here. History browsing uses query parameters without
> an action.

... and those URLs are generated without touching line 824 of Title.php, so the presence of this patch makes no difference.

From the documentation for getLocalURL:
http://svn.wikimedia.org/doc/classTitle.html#a75d9fae7aabf6187318c5a298b01c5ef

> $query 	Mixed: an optional query string; if not specified, $wgArticlePath will be used.

So the documentation is specific: $wgArticlePath is only used when no query is specified. This patch does the logically expected thing, but violates the documented expectation.
Comment 13 Daniel Friesen 2010-10-31 14:38:31 UTC
I just took an actual look at the patch and a larger look at the code, and I see two things staring at me...

Firstly, something feels wrong to me about the $variant handling... I'm not quite up to speed on variants, so it's either a MediaWiki bug where variant handling that should be in the second block isn't there; or variant handling is supposed to be done only for /short/Urls and the patch adds a new place where /short/Urls show up but is missing the variant handling code; or it's only meant to happen when $query is empty and I'm just too far behind.

Though that's probably less of an issue compared to this.

Looking at line 821, I see what seems to be part of a feature that allows you to pass "-" to getLocalURL as a way of saying "I don't have any query, but I want a long index.php-style URL, because I have a case where getting a short URL will cause chaos, so NO short URLs" (although it does look like a tweak would be nice so there is no trailing &).

The way the patch is written seems to break that: if it is applied, it looks like $query = "-" will start outputting short URLs when getLocalURL was explicitly asked for long ones, and additionally this will be done without the variant handling that would otherwise have run.
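
In call form, that convention looks roughly like this (illustrative):

  // Callers that must never get a short URL pass a dash as a faux
  // query; under the patch this guarantee would be lost.
  $longUrl = $title->getLocalURL( '-' );  // forces /w/index.php?title=...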

Also, now that I notice the comment, the &title= showing up in a short URL does not look like a proper implementation of a feature; even without the other issues, I'd reject the patch until that was fixed.


Re-reading the comments again, it looks to me like you have a lingering misunderstanding of $wgArticlePath, $wgActionPaths, and what those blocks of code actually do.

Firstly, of course, the note about robots.txt being unusable in the default config is moot: the robots.txt trick is part of an advanced, robots-optimized short-URL configuration, and for the reasons you describe it requires short URLs to function. Just because a "default" configuration doesn't support it doesn't mean it should be broken universally (this patch) when it is only supported in an advanced configuration. I should also point out that if MediaWiki detects that the server can support it, it actually sets up $wgArticlePath in the /index.php/Article form, so robots.txt IS usable by default on the right server by disallowing "/index.php?".

Now for the actual misunderstanding. You commented:

> As far as I can tell, there is no use case within MediaWiki's default
> configuration that touches line 824 of Title.php -- there is never a URL that
> includes a query but not an action parameter (apart from the default 'view'
> action, which also bypasses line 824).

As I understand it -- even taking your later comment on history into account -- you believe that calls to getLocalURL containing a query with an action parameter are always caught by the chunk of code between lines 807-819.
That is incorrect. That block of code is ONLY ever applied if the non-default $wgActionPaths (NOT $wgArticlePath, but $wgActionPaths) is configured, which in fact very, very few MediaWiki installations actually do. So in actuality the code in lines 810-817, which handles getLocalURL calls that have an action= in the query, is almost NEVER run, at least not unless the wiki is one of the rare few that have specially configured $wgActionPaths. The line 824 you believe is being bypassed by default is in fact NEVER bypassed by default. In the default config, line 824 handles every single call to getLocalURL that contains anything at all in $query.

And to reiterate: the expected behavior for getLocalURL and the like is that URLs with no query, and not marked with a faux "-" query to disallow short URLs, will have a short URL returned if possible, while any URL with a query is ALWAYS returned in long index.php form. Changing things that currently output long URLs so they output short URLs will break things. I forgot about it earlier, but there is another reason why that /w/ robots.txt trick is used. Most pages that use queries are dynamic; in other words, they likely contain links that could cause search engines to endlessly spider dynamically generated content that the wiki does not want spidered (i.e., search queries). So changing these to short URLs may cause dynamic things to suddenly start being spidered by search engines when they were originally intended not to be. Because of this, I believe that ANY feature we add that allows getLocalURL to return short URLs with queries instead of long URLs MUST be an explicit opt-in, where getLocalURL is passed an extra optional argument saying that short URLs with queries are OK, and any code changed to use it must be changed by someone who understands the potential ramifications.
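
As a signature, such an opt-in might read like this (hypothetical; not MediaWiki's actual API):

  // Hypothetical third parameter; the default keeps today's behaviour.
  public function getLocalURL( $query = '', $variant = false,
      $allowShortUrlWithQuery = false
  ) {
      // ...short URLs with a query only when the caller opts in...
  }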

To be frank, I actually think the original use case shouldn't even be using such a feature. The original request has to do with calendar pages, where /wiki/Calendar is the normal calendar page, /index.php?title=Calendar&year=&month= is the calendar page for ANY other month in ANY year, and both pages contain forward/back links pointing to other months. In other words, this looks like a prime example of exactly what we do NOT want search engines spidering. If we were to drop /index.php?title=Calendar&year=##&month=## to /wiki/Calendar?year=####&month=## on a wiki with a properly set up robots.txt, the search engine would begin to spider starting with /wiki/Calendar. Currently, because the long URL format is used, the spider is not allowed to continue to other months. But with short URLs, it WILL continue. After spidering /wiki/Calendar it may find a next-month link to /wiki/Calendar?year=2010&month=11, where it will find a link to /wiki/Calendar?year=2010&month=12, where it will find a link to /wiki/Calendar?year=2011&month=1, and so on... In other words, because of the dynamic nature of the calendar, if we allow it to link to other short URLs with year/month queries, search engines may dive into an endless game of trying to index years and months that will never have any actual content on the wiki (and pound the wiki with requests they shouldn't be making in doing so).


So to finish up my tl;dr post: the patch DOES introduce bugs. The SRF issue is not one that already exists and that the patch leaves unchanged; the problem is that the patch breaks existing "advanced" robots.txt configurations that expect all dynamic URLs with queries to use the full script URL, by making configurations with an explicit $wgArticlePath (or the automatic /index.php/$1 set up by default MediaWiki installation on some servers) emit short URLs for dynamic pages instead of the long URLs they are supposed to.
Additionally, because line 842 IS handled on all default wikis without $wgActionPaths["raw"], the patch DOES break MediaWiki by causing ->getLocalURL('action=raw') to output /wiki/$1?action=raw-style URLs instead of the required long form (i.e., if you apply this patch, you'll probably break Wikipedia).
^_^ And finally... I'm typing this out while semi-half-asleep so *disclaimer, disclaimer, disclaimer*...

...so I think we should mark this WONTFIX... or is it INVALID?
Comment 14 Kiran Jonnalagadda 2010-10-31 14:56:45 UTC
Thank you for the most excellent comment, Daniel. getLocalURL is far too core to MediaWiki to make changes without understanding the implications thoroughly, which I clearly don't.

I propose this ticket be marked INVALID.
