Last modified: 2008-07-26 12:55:42 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T12931, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 10931 - Wikimedia redirect tables are missing many entries
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal normal with 8 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: patch, patch-need-review, shell
Duplicates: 9799 12182
Depends on:
Blocks:
Reported: 2007-08-14 21:37 UTC by Robert Stojnic
Modified: 2008-07-26 12:55 UTC (History)

Attachments
Proposed patch (11.21 KB, patch)
2008-04-09 19:07 UTC, Roan Kattouw

Description Robert Stojnic 2007-08-14 21:37:38 UTC
The redirect table for en.wiki has about 1 million entries; however,
parsing the XML dump gives about 2 million redirects. So we are
missing a lot of redirects. Having all the redirects in the
table is, among other things, needed for the lucene incremental
updater.
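
For illustration, a dump-side count like the one described above could be obtained roughly as follows. This is a minimal sketch, not the method actually used for the figures in this report: it assumes a dump containing only the current revision of each page, treats any revision text starting with the English keyword #REDIRECT as a redirect (localized keywords are ignored), and the function name count_redirects and the SAMPLE document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def count_redirects(xml_stream):
    """Count <text> elements whose content starts with #REDIRECT (case-insensitive).

    Assumption: one revision per page, so each redirect page is counted once.
    """
    count = 0
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        # Dump files declare an XML namespace; keep only the local tag name.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "text":
            if (elem.text or "").lstrip().upper().startswith("#REDIRECT"):
                count += 1
        elif tag == "page":
            # Free the finished page so large dumps do not accumulate in memory.
            elem.clear()
    return count


# Tiny hand-written stand-in for a dump file (hypothetical data).
SAMPLE = b"""<mediawiki>
  <page><title>Foo</title><revision><text>Article body</text></revision></page>
  <page><title>Bar</title><revision><text>#REDIRECT [[Foo]]</text></revision></page>
  <page><title>Baz</title><revision><text>#redirect [[Foo]]</text></revision></page>
</mediawiki>"""

sample_count = count_redirects(io.BytesIO(SAMPLE))
print(sample_count)
```

Comparing such a count with SELECT COUNT(*) FROM redirect on the same wiki would give the kind of discrepancy reported here.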

To fix this, run maintenance/refreshLinks.php --redirects-only
on all wikis. However, I urge someone smarter than me to review
whether this script is doing what it ought to be doing :)
Comment 1 Rob Church 2007-08-14 21:49:29 UTC
*** Bug 9799 has been marked as a duplicate of this bug. ***
Comment 2 John Lehmann 2007-10-24 16:47:01 UTC
I can confirm that the 2 million redirects from the XML dump are valid in the sense that they do resolve to existing, normal pages in the page table.
Comment 3 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-10-25 00:59:09 UTC
It would possibly be best to contact Brion on IRC about this, if you can find a rare moment when he's not busy.  Or e-mail him, brion@pobox.com.  He's (AFAIK) the sole maintainer of the DB dumps, as of so many other things.
Comment 4 Brion Vibber 2007-12-03 20:59:07 UTC
The redirect table is updated on demand. Redirects which have not been changed since its introduction won't be listed in it.
Comment 5 Brian Jason Drake 2007-12-04 08:02:39 UTC
What's wrong? Why is this bug still open?
Comment 6 Brion Vibber 2007-12-04 21:38:43 UTC
Probably because it would be nice if the table were up to date. :)
Comment 7 Broken Arrow 2008-01-05 13:07:46 UTC
*** Bug 12182 has been marked as a duplicate of this bug. ***
Comment 8 Pietrodn 2008-01-20 16:23:46 UTC
*** Bug 12507 has been marked as a duplicate of this bug. ***
Comment 9 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-03-04 16:29:55 UTC
It seems like refreshLinks.php could use a "fix only old redirects" mode.  It's currently assuming that it has to check every single page for redirect-ness.  If a mode were added so it just checked pages from

SELECT page_id FROM page LEFT JOIN redirect ON page_id=rd_from WHERE page_is_redirect=1 AND rd_from IS NULL

then it would be considerably faster, I imagine.  It would still have to scan the page table, but at least it wouldn't have to do it a row at a time and send the page text for every row over the wire.
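
As a rough illustration of what that query selects, here is a minimal sketch that runs it against a toy in-memory SQLite database. The two-table schema is a cut-down assumption containing only the columns the query touches (the real MediaWiki tables have many more), and the helper name pages_missing_redirect_rows and the sample rows are hypothetical.

```python
import sqlite3

# Toy schema (assumption): only the columns used by the query above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (
        page_id INTEGER PRIMARY KEY,
        page_title TEXT,
        page_is_redirect INTEGER
    );
    CREATE TABLE redirect (
        rd_from INTEGER PRIMARY KEY,
        rd_namespace INTEGER,
        rd_title TEXT
    );
    -- Page 2 is a redirect with no redirect row; page 3 already has one.
    INSERT INTO page VALUES (1, 'Foo', 0), (2, 'Old_redirect', 1), (3, 'New_redirect', 1);
    INSERT INTO redirect VALUES (3, 0, 'Foo');
""")


def pages_missing_redirect_rows(conn):
    """Return IDs of pages flagged as redirects that have no redirect table row."""
    rows = conn.execute(
        "SELECT page_id FROM page LEFT JOIN redirect ON page_id = rd_from "
        "WHERE page_is_redirect = 1 AND rd_from IS NULL ORDER BY page_id"
    )
    return [row[0] for row in rows]


missing = pages_missing_redirect_rows(conn)
print(missing)
```

Only the IDs returned here would need their page text fetched and parsed, which is the saving over checking every page.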

Maybe this should even be added to update.php.
Comment 10 Brian Jason Drake 2008-03-05 02:21:25 UTC
Bug 12507 was reopened with a comment that included the statement "Bug 10931 is about missing entries in the redirect table in a particular wiki ..." The summary for this bug says that a redirect table is missing half its entries, without stating which table (though the description says it's en.wiki).

However, the most recent comments are about the software and not any particular wiki.

What's this bug really about?
Comment 11 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-03-05 02:36:41 UTC
The most recent comments were about Wikimedia wikis, which is what this bug is about (note Product=Wikimedia).  My comment remarked that currently the available scripts are not ideal for this and they should be improved for this to be done reasonably.  I also added offhandedly that if that was done, update.php should maybe run that script, but that was a side point and not really related to this bug.
Comment 12 Brian Jason Drake 2008-03-05 02:46:48 UTC
I've updated the summary to remove the ambiguity, but I still think that the most recent "nontrivial" comments (comments 4 and 9) appear to be purely about the software. Given that the summary, description and product fields all state that this bug is about Wikimedia wikis, shouldn't the software issues be addressed in another bug?
Comment 13 Brian Jason Drake 2008-03-05 02:50:33 UTC
Why is the component field set to "Downloads"? This bug doesn't seem to have anything to do with downloads.
Comment 14 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-03-05 02:52:26 UTC
That's the only user-visible effect.
Comment 15 Brian Jason Drake 2008-03-05 02:59:37 UTC
According to bug 9799, this affects Special:DoubleRedirects. Is viewing a special page considered a "download" (which is apparently defined here as anything to do with "public data dumps at download.wikimedia.org")?
Comment 16 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-03-05 03:02:17 UTC
Okay, so it has other (indirect) user-visible effects too.  Is any of this really relevant to fixing the problem?
Comment 17 Brian Jason Drake 2008-03-05 03:18:22 UTC
(In reply to comment #12)
> I've updated the summary to remove the ambiguity, but I still think that the
> most recent "nontrivial" comments (comments 4 and 9) appear to be purely about
> the software. Given that the summary, description and product fields all state
> that this bug is about Wikimedia wikis, shouldn't the software issues be
> addressed in another bug?

Bugzilla's guidelines state, as one of their 5 principles, "one bug per report." I always took this to mean that issues with Wikimedia wikis and associated MediaWiki issues should be covered by separate reports.
Comment 18 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-03-05 03:28:57 UTC
Spamming ten people's e-mail boxes with this discussion is not very productive.  If you really care, please e-mail any followups directly to me.  Yes, a bug could be created for the software issue (blocking this) if someone wanted.  It doesn't have to be.
Comment 19 Roan Kattouw 2008-04-08 11:14:44 UTC
We should be discussing solutions here, not Bugzilla guidelines.

Anyway, why don't we change the procedure for viewing redirects? Currently it's:

1. Check the page_is_redirect field
2. If it's 1, fetch the page's content
3. See if it contains #REDIRECT
4. If so, check whether the redirect actually points somewhere
5. If so, go there.

We could change that to:

1. Check the page_is_redirect field
2. If it's 1, query the redirect table (faster than fetching the page's content)
3. If there is no redirect table entry, do the old page content thingy AND ADD A ROW TO THE REDIRECT TABLE
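
The proposed lookup can be sketched as follows. This is only an illustration against a toy SQLite schema, not MediaWiki code: the page_text column stands in for the real revision/text tables, the target namespace is hard-coded to 0, the regular expression is a simplified stand-in for the real redirect parser, and get_redirect_target and make_db are hypothetical names.

```python
import re
import sqlite3

# Simplified redirect syntax (assumption): "#REDIRECT [[Target]]", English keyword only.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)


def make_db():
    """Build a toy database; page_text replaces the real revision/text lookup."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_is_redirect INTEGER, page_text TEXT);
        CREATE TABLE redirect (rd_from INTEGER PRIMARY KEY, rd_namespace INTEGER, rd_title TEXT);
        INSERT INTO page VALUES (1, 0, 'Plain article'), (2, 1, '#REDIRECT [[Foo]]');
    """)
    return conn


def get_redirect_target(conn, page_id):
    # Step 1: check the page_is_redirect flag without fetching the content.
    flag = conn.execute(
        "SELECT page_is_redirect FROM page WHERE page_id = ?", (page_id,)
    ).fetchone()
    if flag is None or not flag[0]:
        return None
    # Step 2: try the redirect table first.
    hit = conn.execute(
        "SELECT rd_title FROM redirect WHERE rd_from = ?", (page_id,)
    ).fetchone()
    if hit is not None:
        return hit[0]
    # Step 3: fall back to the page content and backfill the redirect table.
    text = conn.execute(
        "SELECT page_text FROM page WHERE page_id = ?", (page_id,)
    ).fetchone()[0]
    match = REDIRECT_RE.match(text or "")
    if match is None:
        return None
    title = match.group(1).strip().replace(" ", "_")
    conn.execute(
        "INSERT INTO redirect (rd_from, rd_namespace, rd_title) VALUES (?, 0, ?)",
        (page_id, title),
    )
    conn.commit()
    return title
```

With this shape, the first view of an old redirect pays for the content fetch and the INSERT, and every later view needs only the redirect table query.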

Thoughts?

Comment 20 Pietrodn 2008-04-08 13:34:00 UTC
I approve this idea. This will resolve all these redirect problems in the future. Are there any problems in implementing it?
Comment 21 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-04-08 13:35:31 UTC
(In reply to comment #19)

That sounds like an excellent idea.
Comment 22 Roan Kattouw 2008-04-08 14:05:43 UTC
Thanks for the praise so far. I'll submit a patch tomorrow (I think it's best that I write this thing, 'cause I wanna integrate it into the API too, see bug 13651).
Comment 23 aaron brick 2008-04-08 20:00:23 UTC
sounds fine to me, and i look forward to your patch, roan.
Comment 24 Roan Kattouw 2008-04-09 19:07:47 UTC
Created attachment 4798 [details]
Proposed patch

The attached patch does quite a range of things:

Stuff related to this bug:
* Introduce Article::getRedirectTarget(), which queries the redirect table to get the article's redirect target (if it has one). If the article doesn't have an entry in the redirect table, insertRedirect() is called
* Introduce Article::insertRedirect(), which obtains the redirect target from the page text and inserts a row into the redirect table. Not meant to be called directly
* Use getRedirectTarget() in some places in Article.php (except those 
* In ApiPageSet::getRedirectTargets(), call Article::insertRedirect() for every redirect that wasn't found in the redirect table

Stuff not related to this bug (sorry for these, but it's nothing complicated):
* Kill a redundant DB query in ApiPageSet::getRedirectTargets()
* Introduce ApiMain::scheduleCommit() which schedules a $dbw->immediateCommit() just before the end of ApiMain::execute()
* Use scheduleCommit() in all API edit modules and ApiPageSet::getRedirectTargets()

Numbers of DB queries in various situations:
If the page is in the redirect table already: 1 SELECT
If the page isn't in the redirect table yet: 3 SELECTs, 1 INSERT
If the page isn't in the redirect table yet and is queried through the API: 6 SELECTs and 1 INSERT per page + 1 SELECT (yes, that sucks, caused by crappy code in the Article class which is full of FIXMEs)

If the page is in the redirect table already:
SELECT /* Article::getRedirectTarget */  rd_namespace,rd_title  FROM `redirect`  WHERE rd_from = '123'  


If the page isn't in the redirect table yet:
SELECT /* Article::getRedirectTarget */  rd_namespace,rd_title  FROM `redirect`  WHERE rd_from = '123'
/* Revision::fetchRow */ SELECT  rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,page_namespace,page_title,page_latest  FROM `page`,`revision`  WHERE (page_id=rev_page) AND rev_id = '466'  LIMIT 1
/* Revision::loadText 127.0.0.1 */ SELECT  old_text,old_flags  FROM `text`  WHERE old_id = '256'  LIMIT 1
INSERT /* Database::insert 127.0.0.1 */  INTO `redirect` (rd_from,rd_namespace,rd_title) VALUES ('123','0','Foo')

If the page isn't in the redirect table yet and is queried through the API:
SELECT /* ApiPageSet::getRedirectTargets 127.0.0.1 */  rd_from,rd_namespace,rd_title  FROM `redirect`  WHERE rd_from = '123'
/* LinkCache::addLinkObj */ SELECT  page_id,page_len,page_is_redirect  FROM `page`  WHERE page_namespace = '0' AND page_title = 'Foo'  LIMIT 1
/* Article::pageData */ SELECT  page_id,page_namespace,page_title,page_restrictions,page_counter,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len  FROM `page`  WHERE page_namespace = '0' AND page_title = 'Foo'  LIMIT 1
SELECT /* Title::loadRestrictions */  *  FROM `page_restrictions`  WHERE pr_page = '123'
/* Title::loadRestrictionsFromRow */ SELECT  page_restrictions  FROM `page`  WHERE page_id = '123'  LIMIT 1
/* Revision::fetchRow */ SELECT  rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,page_namespace,page_title,page_latest  FROM `page`,`revision`  WHERE (page_id=rev_page) AND rev_id = '123'  LIMIT 1
/* Revision::loadText */ SELECT  old_text,old_flags  FROM `text`  WHERE old_id = '102'  LIMIT 1
INSERT /* Database::insert */  INTO `redirect` (rd_from,rd_namespace,rd_title) VALUES ('123','345','Foo')
Comment 25 Roan Kattouw 2008-04-11 15:20:56 UTC
Patch committed in r33133
Comment 26 Jelte (WebBoy) 2008-05-05 17:26:08 UTC
Fix disabled by r33381, reopening bug
Comment 27 Lejonel 2008-07-26 12:55:42 UTC
This seems to be fixed.

https://wikitech.leuksman.com/index.php?title=Server_admin_log&diff=15383&oldid=15382 ("Tim: fixing redirect table on all wikis")

