Last modified: 2013-06-27 20:34:11 UTC
I imported the page, pagelinks and redirect tables from the 20120702 dump in order to analyze the pagelink structure. For this i need to resolve redirects and found some weirdness in the process: There are some pages in the redirect table which don't have the page.page_is_redirect flag set: select * from redirect join page on rd_from=page_id where page_is_redirect != 1 limit 100; Results in 23 rows: http://p.defau.lt/?smGJYbXXlqwLPhreqbe8VA Another weirdness: there are pages marked as page.page_is_redirect but don't appear in the redirect table, even though that might be explained here: http://www.mediawiki.org/wiki/Manual:Redirect_table (see the NOTE as of Aug 2007) select * from page left join redirect on page_id=rd_from where page_is_redirect != 0 and rd_from is null limit 100; Results in 5 rows: http://p.defau.lt/?pwJZMIVtTGGZYb69o8YWxg Matthew_ from #wikimedia-tech ran these queries on the toolserver and seems the same problems exist in production. Yet another weirdness can be found in the pagelinks table: there are ~25000 pagelinks for redirect pages which don't just link to the redirected page but to other pages alone in the article namespace: select count(1) from page inner join redirect on page_id = rd_from inner join pagelinks on rd_from=pl_from and rd_namespace=pl_namespace where pl_namespace = 0 and rd_title != pl_title; +----------+ | count(1) | +----------+ | 25168 | +----------+ 1 row in set (1 hour 44 min 35.16 sec) Would be cool if someone could have a look.
(In reply to comment #0) > Matthew_ from #wikimedia-tech ran these queries on the toolserver and seems the > same problems exist in production. s/production/the TS replica/
It's going to be a bit longer before Sam can get to this. Unassigning, though Sam is primarily responsible for "shell" bugs and will still have this implicitly on his list.
Hi, looks like the second point has been fixed in the prod db: mysql:giovanni@db1047.eqiad.wmnet [enwiki]> select count(*) from page left join redirect on page_id = rd_from where page_is_redirect != 0 and rd_from is null; +----------+ | count(*) | +----------+ | 0 | +----------+ 1 row in set (39.33 sec) mysql:giovanni@db1047.eqiad.wmnet [enwiki]>
(In reply to comment #3) > Hi, > > looks like the second point has been fixed in the prod db: > > mysql:giovanni@db1047.eqiad.wmnet [enwiki]> select count(*) from page left > join > redirect on page_id = rd_from where page_is_redirect != 0 and rd_from is > null; > > > +----------+ > | count(*) | > +----------+ > | 0 | > +----------+ > 1 row in set (39.33 sec) > > mysql:giovanni@db1047.eqiad.wmnet [enwiki]> Agreed. First one done too now. It only had 5 rows now. 1 was an erroneous redirect, presumably just never deleted. 1 was a redirect in user js files which I don't think ever worked. 2 were non existent pages. mysql:wikiadmin@db60 [enwiki]> select * from redirect join page on rd_from=page_id where page_is_redirect != 1 limit 100; Empty set (19.58 sec)
What actually wants doing for the 3rd query? I'm currently running it in a screen session against a quiet slave (selecting all fields, not just counting), so will have a resultset to actually look at. I will see about attaching it to this bug when done, though I guess it's going to take upto a couple of hours to run