Last modified: 2008-07-26 12:55:42 UTC
The redirect table for en.wiki has about ~1m entries, however, parsing the xml dump gives around ~2m redirects. So we are missing a lot of redirects. Having all the redirects in the table is, among other things, needed for the lucene incremental updater. To fix this, run maintenance/refreshLinks.php --redirects-only on all wikis. However, I urge someone smarter than me to review if this script is doing what it's ought to be doing :)
*** Bug 9799 has been marked as a duplicate of this bug. ***
I can confirm that the 2 million redirects from the XML dump are valid in the sense that they do resolve to existing, normal pages in the page table.
It would possibly be best to contact Brion on IRC about this, if you can find a rare moment when he's not busy. Or e-mail him, brion@pobox.com. He's (AFAIK) the sole maintainer of the DB dumps, as of so many other things.
The redirect table is updated on demand. Redirects which have not been changed since its introduction won't be listed in it.
What's wrong? Why is this bug still open?
Probably because it would be nice if the table were up to date. :)
*** Bug 12182 has been marked as a duplicate of this bug. ***
*** Bug 12507 has been marked as a duplicate of this bug. ***
It seems like refreshLinks.php could use a "fix only old redirects" mode. It's currently assuming that it has to check every single page for redirect-ness. If a mode were added so it just checked pages from SELECT page_id FROM page LEFT JOIN redirect ON page_id=rd_from WHERE page_is_redirect=1 AND rd_from IS NULL then it would be considerably faster, I imagine. It would still have to scan the page table, but at least it wouldn't have to do it a row at a time and send the page text for every row over the wire. Maybe this should even be added to update.php.
Bug 12507 was reopened with a comment that included the statement "Bug 10931 is about missing entries in the redirect table in a particular wiki ..." The summary for this bug says that a redirect table is missing half its entries, without stating which table (though the description says it's en.wiki). However, the most recent comments are about the software and not any particular wiki.1 What's this bug really about?
The most recent comments were about Wikimedia wikis, which is what this bug is about (note Product=Wikimedia). My comment remarked that currently the available scripts are not ideal for this and they should be improved for this to be done reasonably. I also added offhandedly that if that was done, update.php should maybe run that script, but that was a side point and not really related to this bug.
I've updated the summary to remove the ambiguity, but I still think that the most recent "nontrivial" comments (comments 4 and 9) appear to be purely about the software. Given that the summary, description and product fields all state that this bug is about Wikimedia wikis, shouldn't the software issues be addressed in another bug?
Why is the component field set to "Downloads"? This bug doesn't seem to have anything to do with downloads.
That's the only user-visible effect.
According to bug 9799, this affects Special:DoubleRedirects. Is viewing a special page considered a "download" (which is apparently defined here as anything to do with "public data dumps at download.wikimedia.org")?
Okay, so it has other (indirect) user-visible effects too. Is any of this really relevant to fixing the problem?
(In reply to comment #12) > I've updated the summary to remove the ambiguity, but I still think that the > most recent "nontrivial" comments (comments 4 and 9) appear to be purely about > the software. Given that the summary, description and product fields all state > that this bug is about Wikimedia wikis, shouldn't the software issues be > addressed in another bug? Bugzilla's guidelines state, as one of their 5 principles, "one bug per report." I always took this to mean that issues with Wikimedia wikis and associated MediaWiki issues should be covered by separate reports.
Spamming ten people's e-mail boxes with this discussion is not very productive. If you really care, please e-mail any followups directly to me. Yes, a bug could be created for the software issue (blocking this) if someone wanted. It doesn't have to be.
We should be discussing solutions here, not BugZilla guidelines. Anyway, why don't we change the procedure for viewing redirects? Currently it's: 1. Check the page_is_redirect field 2. If it's 1, fetch the page's content 3. See if it contains #REDIRECT 4. If so, check whether the redirect actually points somewhere 5. If so, go there. We could change that to: 1. Check the page_is_redirect field 2. If it's 1, query the redirect table (faster than fetching the page's content) 3. If there is no redirect table entry, do the old page content thingy AND ADD A ROW TO THE REDIRECT TABLE Thoughts?
I approve this idea. This will resolve all these redirect problems in the future. Are there any problems in implementing it?
(In reply to comment #19) That sounds like an excellent idea.
Thanks for the praise so far. I'll submit a patch tomorrow (I think it's best that I wrote this thing, 'cause I wanna integrate it into the API too, see bug 13651)
sounds fine to me, and i look forward to your patch, roan.
Created attachment 4798 [details] Proposed patch The attached patch does quite a range of things: Stuff related to this bug: * Introduce Article::getRedirectTarget(), which queries the redirect table to get the article's redirect target (if it has one). If the article doesn't have an entry in the redirect table, insertRedirect() is called * Introduce Article::insertRedirect(), which obtains the redirect target from the page text and inserts a row into the redirect table. Not meant to be called directly * Use getRedirectTarget() in some places in Article.php (except those * In ApiPageSet::getRedirectTargets(), call Article::insertRedirect() for every redirect that wasn't found in the redirect table Stuff not related to this bug (sorry for these, but it's nothing complicated): * Kill a redundant DB query in ApiPageSet::getRedirectTargets() * Introduce ApiMain::scheduleCommit() which schedules a $dbw->immediateCommit() just before the end of ApiMain::execute() * Use scheduleCommit() in all API edit modules and ApiPageSet::getRedirectTargets() Numbers of DB queries in various situations: If the page is in the redirect table already: 1 SELECT If the page isn't in the redirect table yet: 3 SELECTs, 1 INSERT If the page isn't in the redirect table yet and is queried through the API: 6 SELECTs and 1 INSERT per page + 1 SELECT (yes, that sucks, caused by crappy code in the Article class which is full of FIXMEs) If the page is in the redirect table already: SELECT /* Article::getRedirectTarget */ rd_namespace,rd_title FROM `redirect` WHERE rd_from = '123' If the page isn't in the redirect table yet: SELECT /* Article::getRedirectTarget */ rd_namespace,rd_title FROM `redirect` WHERE rd_from = '123' /* Revision::fetchRow */ SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,page_namespace,page_title,page_latest FROM `page`,`revision` WHERE (page_id=rev_page) AND rev_id = '466' LIMIT 1 /* Revision::loadText 127.0.0.1 */ SELECT old_text,old_flags FROM `text` WHERE old_id = '256' LIMIT 1 INSERT /* Database::insert 127.0.0.1 */ INTO `redirect` (rd_from,rd_namespace,rd_title) VALUES ('123','0','Foo') If the page isn't in the redirect table yet and is queried through the API: SELECT /* ApiPageSet::getRedirectTargets 127.0.0.1 */ rd_from,rd_namespace,rd_title FROM `redirect` WHERE rd_from = '123' /* LinkCache::addLinkObj */ SELECT page_id,page_len,page_is_redirect FROM `page` WHERE page_namespace = '0' AND page_title = 'Foo' LIMIT 1 /* Article::pageData */ SELECT page_id,page_namespace,page_title,page_restrictions,page_counter,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len FROM `page` WHERE page_namespace = '0' AND page_title = 'Foo' LIMIT 1 SELECT /* Title::loadRestrictions */ * FROM `page_restrictions` WHERE pr_page = '123' /* Title::loadRestrictionsFromRow */ SELECT page_restrictions FROM `page` WHERE page_id = '123' LIMIT 1 /* Revision::fetchRow */ SELECT rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user_text,rev_user,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,page_namespace,page_title,page_latest FROM `page`,`revision` WHERE (page_id=rev_page) AND rev_id = '123' LIMIT 1 /* Revision::loadText */ SELECT old_text,old_flags FROM `text` WHERE old_id = '102' LIMIT 1 INSERT /* Database::insert */ INTO `redirect` (rd_from,rd_namespace,rd_title) VALUES ('123','345','Foo')
Patch committed in r33133
Fix disabled by r33381, reopening bug
This seems to be fixed. https://wikitech.leuksman.com/index.php?title=Server_admin_log&diff=15383&oldid=15382 ("Tim: fixing redirect table on all wikis")