Last modified: 2011-11-14 09:28:33 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T31854, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 29854 - External link search and protocol-relative URLs
External link search and protocol-relative URLs
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
General/Unknown (Other open bugs)
1.20.x
All All
: High normal (vote)
: ---
Assigned To: Roan Kattouw
: patch, patch-need-review
: 31721 (view as bug list)
Depends on:
Blocks: 20342
  Show dependency treegraph
 
Reported: 2011-07-13 00:22 UTC by Roan Kattouw
Modified: 2011-11-14 09:28 UTC (History)
6 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Possible patch to implement Brion's solution (6.87 KB, patch)
2011-11-01 16:10 UTC, Brad Jorsch
Details

Description Roan Kattouw 2011-07-13 00:22:43 UTC
I'm pretty sure that *something* is gonna break when searching for protocol-relative external links, at least through the API. I haven't actually tried and I'm not sure what would break exactly, but the API code is scary enough to convince me that it won't make it through unscathed.

This is mostly just a reminder to myself to fix this soonish.
Comment 1 Brion Vibber 2011-07-13 17:32:07 UTC
To my great pleasure, wfMakeUrlIndex does not appear to explode on these on trunk:

> return wfMakeUrlIndex('http://foobar.com/baz/quux');
http://com.foobar./baz/quux

> return wfMakeUrlIndex('//foobar.com/baz/quux');
//com.foobar./baz/quux

However it *does* explode on them in 1.17 and 1.18:

> return wfMakeUrlIndex('//foobar.com/baz/quux');
PHP Notice:  Undefined index: scheme in /var/www/rel1.18/includes/GlobalFunctions.php on line 2716
PHP Stack trace:
PHP   1. {main}() /var/www/rel1.18/maintenance/eval.php:0
PHP   2. eval() /var/www/rel1.18/maintenance/eval.php:79
PHP   3. wfMakeUrlIndex() /var/www/rel1.18/maintenance/eval.php(79) : eval()'d code:1
PHP   4. wfParseUrl() /var/www/rel1.18/includes/GlobalFunctions.php:2740
PHP Notice:  Undefined index: scheme in /var/www/rel1.18/includes/GlobalFunctions.php on line 2718
PHP Stack trace:
PHP   1. {main}() /var/www/rel1.18/maintenance/eval.php:0
PHP   2. eval() /var/www/rel1.18/maintenance/eval.php:79
PHP   3. wfMakeUrlIndex() /var/www/rel1.18/maintenance/eval.php(79) : eval()'d code:1
PHP   4. wfParseUrl() /var/www/rel1.18/includes/GlobalFunctions.php:2740
./

Probably need to backport the updates to that function (I recall fixes related to file: URLs, which probably fixed this as a side effect).


However a bigger problem is that even if you can get them past the UI on the external links search page, this forces you to *know* whether you're looking for an http-only, an https-only, or a protocol-independent URL.


A possible way to make these work 'naturally' might be to actually change how we save external URLs into the table... instead of saving one entry without a protocol, we could save *two* entries -- one http and one https.

This would allow either to be searched for, but would also list both if fetching lists of external links for a given page. Which seems legit, to be honest. ;)
Comment 2 Roan Kattouw 2011-07-15 01:22:07 UTC
(In reply to comment #1)
> Probably need to backport the updates to that function (I recall fixes related
> to file: URLs, which probably fixed this as a side effect).
> 
No, I deliberately fixed wfParseUrl() for prot rel URLs in r92024.

> However a bigger problem is that even if you can get them past the UI on the
> external links search page, this forces you to *know* whether you're looking
> for an http-only, an https-only, or a protocol-independent URL.
> 
> 
> A possible way to make these work 'naturally' might be to actually change how
> we save external URLs into the table... instead of saving one entry without a
> protocol, we could save *two* entries -- one http and one https.
> 
> This would allow either to be searched for, but would also list both if
> fetching lists of external links for a given page. Which seems legit, to be
> honest. ;)
I guess that would work, yeah.
Comment 3 LordAndrew 2011-10-18 22:55:17 UTC
*** Bug 31721 has been marked as a duplicate of this bug. ***
Comment 4 Brad Jorsch 2011-11-01 15:12:53 UTC
This is very much a bug: at the moment neither Special:LinkSearch nor the API's list=exturlusage can find protocol-relative links at all, because trying to search for a link with an empty protocol defaults to searching for "http".

What is the desired solution?
1. Allow [[Special:LinkSearch///example.com]] and prop=exturlusage&euquery=example.com&euprotocol=// (or something similar) to search for protocol-relative links explicitly. This is easy to do, but as noted above requires the user to perform additional searches.
2. When searching for "http", also search for protocol-relative. This is really hacky.
3. Brion's solution, save entries with el_index "http" and with "https". The major drawback here is that it needs to fix the existing entries in externallinks when upgrading.

Personally, I'd prefer #3 if someone will actually do it in a timely manner (i.e. unlike bug 27480). #1 would be ok otherwise, at least until #3 is actually done.
Comment 5 Roan Kattouw 2011-11-01 15:18:12 UTC
(In reply to comment #4)
> Personally, I'd prefer #3 if someone will actually do it in a timely manner
> (i.e. unlike bug 27480). #1 would be ok otherwise, at least until #3 is
> actually done.
#3 shouldn't require actually running refreshlinks, we should be able to find all protocol-relative entries and convert them pretty easily, without even accessing the page text.

I'll hopefully do this on Thursday or Friday.
Comment 6 Brad Jorsch 2011-11-01 16:10:09 UTC
Created attachment 9340 [details]
Possible patch to implement Brion's solution

(In reply to comment #5)
> #3 shouldn't require actually running refreshlinks, we should be able to find
> all protocol-relative entries and convert them pretty easily, without even
> accessing the page text.

The complaint was just that that other bug requiring database updates seems to be being ignored. It could take just two SQL queries to fix the DB:

    INSERT INTO externallinks (el_from,el_to,el_index) SELECT el_from, el_to, CONCAT('http:',el_index) FROM externallinks WHERE el_index LIKE '//%';
    UPDATE externallinks SET el_index = CONCAT('https:',el_index) WHERE el_index LIKE '//%';

Avoiding killing the site while running those queries on Wikipedia is probably the hard part.

I tried my hand at writing the necessary patch; I copied doTemplatelinksUpdate for the MySQL database updating. There are probably things that could be done better. Even if the patch is good, someone would still have to write updaters for the non-MySQL databases.
Comment 7 Sumana Harihareswara 2011-11-09 20:48:16 UTC
Adding the "patch" and "need-review" keywords for a patch that needs review by developers.  Brad, thank you for your patch.
Comment 8 Roan Kattouw 2011-11-14 09:28:10 UTC
(In reply to comment #6)
> Created attachment 9340 [details]
> Possible patch to implement Brion's solution
> 
Thanks for the patch. I modified it a bit and applied it in r101951 and r101954.

> I tried my hand at writing the necessary patch; I copied doTemplatelinksUpdate
> for the MySQL database updating. There are probably things that could be done
> better. Even if the patch is good, someone would still have to write updaters
> for the non-MySQL databases.
I did end up changing that part quite a bit, yeah. I took the slow replication-aware code and moved it into a maintenance script. I also changed the queries a bit, they now use INSERT IGNORE and there's now two INSERT queries followed by a DELETE.
Comment 9 Roan Kattouw 2011-11-14 09:28:33 UTC
(In reply to comment #8)
> Thanks for the patch. I modified it a bit and applied it in r101951 and
> r101954.
> 
Of course I meant r102951 and r102954.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links