Last modified: 2014-11-17 10:36:32 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T18112, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 16112 - Run "refreshLinks.php --dfn-only" on all wikis periodically
Status: NEW
Product: Wikimedia
Classification: Unclassified
Component: Site requests
Version: unspecified
Hardware: All All
Importance: Normal minor with 9 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: ops
Duplicates: 15152 16603 16895 33817
Depends on: 36195 42180
Blocks: 29782 23816 24480
Reported: 2008-10-25 16:46 UTC by Beau
Modified: 2014-11-17 10:36 UTC
CC List: 23 users

Description Beau 2008-10-25 16:46:39 UTC
[[Special:Wantedfiles]] and [[Special:Wantedtemplates]] were added recently. However, like all other Special:Wanted* pages they are pretty useless, because the *links tables contain lots of stale rows (some pages and categories are listed although nothing links to them). Dead links (those from deleted or nonexistent pages) should be removed from the database.

What is more, [[Special:Wantedfiles]] lists files that exist in the shared repository. Refreshing such pages is a waste of resources.
Comment 1 P.Copp 2008-10-25 17:56:05 UTC
Same on dewiki and probably other projects. Changed summary to reflect that and added shell-Keyword.
Comment 2 P.Copp 2008-12-09 23:00:38 UTC
*** Bug 16603 has been marked as a duplicate of this bug. ***
Comment 3 Brion Vibber 2008-12-09 23:20:49 UTC
Trevor -- couple quick notes on this:

This action can be done via an existing maintenance script:
php maintenance/refreshLinks.php --dfn-only

However, the implementation (deleteLinksFromNonexistent() in refreshLinks.inc) isn't currently feasible for Wikimedia's use because it's a potentially very slow query, which can mess with our DB replication and disrupt the site for users until the slave databases catch up.

Currently it's doing a single DELETE per table to clear out all matching rows. I'd recommend breaking this out into two parts for each table:

1) SELECT the relevant page ID numbers (those for which no page record exists).

2) DELETE the matching rows from the link table, preferably in batches. (One at a time means it'll be very slow if there are many results; doing all at once means we might disrupt replication or hit SQL limits.)

Once the function's cleaned up we can go ahead and run it on the live sites.
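
The two-step approach described in this comment can be sketched roughly as follows. This is only an illustration, not the code in refreshLinks.inc: it uses Python with an in-memory SQLite database, the function name `delete_links_from_nonexistent` is borrowed from the comment purely as a label, and the batch size and toy table contents are assumptions.

```python
import sqlite3


def delete_links_from_nonexistent(conn, table, from_col, batch_size=100):
    """Delete rows of `table` whose `from_col` has no matching page row.

    Step 1 selects the orphaned page IDs; step 2 deletes them in batches.
    `table` and `from_col` must come from trusted code, not user input.
    """
    cur = conn.execute(
        f"SELECT DISTINCT {from_col} FROM {table} "
        f"LEFT JOIN page ON {from_col} = page_id WHERE page_id IS NULL"
    )
    orphan_ids = [row[0] for row in cur]

    deleted = 0
    for start in range(0, len(orphan_ids), batch_size):
        batch = orphan_ids[start:start + batch_size]
        placeholders = ",".join("?" * len(batch))
        result = conn.execute(
            f"DELETE FROM {table} WHERE {from_col} IN ({placeholders})", batch
        )
        conn.commit()  # one short transaction per batch
        deleted += result.rowcount
    return deleted


def demo():
    # Hypothetical toy data: pages 1 and 2 exist, pages 7 and 9 do not.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE pagelinks (pl_from INTEGER, pl_title TEXT)")
    conn.executemany("INSERT INTO page VALUES (?)", [(1,), (2,)])
    conn.executemany(
        "INSERT INTO pagelinks VALUES (?, ?)",
        [(1, "A"), (2, "B"), (7, "C"), (7, "D"), (9, "E")],
    )
    removed = delete_links_from_nonexistent(
        conn, "pagelinks", "pl_from", batch_size=1
    )
    remaining = conn.execute("SELECT COUNT(*) FROM pagelinks").fetchone()[0]
    return removed, remaining
```

Committing after each batch keeps every write small, which is the property the comment asks for with respect to replication; a suitable batch size for a production database is not something this sketch can say.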
Comment 4 P.Copp 2009-01-05 21:40:16 UTC
*** Bug 16895 has been marked as a duplicate of this bug. ***
Comment 5 Merlijn van Deen (test) 2009-01-06 02:13:25 UTC
This should now be resolvable, as the function has been rewritten in
r45431 ( http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=45431 ).

Citing Brion:
03:11 < brion> w00t
03:11 < brion> i'll poke over it tomorrow
Comment 6 Merlijn van Deen (test) 2009-02-12 16:18:59 UTC
An updated (and actually working) version was committed about four weeks ago [1]. The Wikimedia wikis have been updated, so this commit is available on the servers.
Has Brion, or any other administrator, had time to run the script?

[1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/45721
Comment 7 Mike.lifeguard 2009-03-20 18:09:24 UTC
for r45721:
    * 23:59, 14 January 2009 Brion VIBBER (Talk | contribs | block) changed the status of this revision [removed: new added: ok]

So, it just needs to be run on shell. I don't know if Trevor will do that - could be assigned to wikibugs or someone who does shell requests.
Comment 8 Melancholie 2009-09-18 12:58:29 UTC
Just want to give two examples:

* [[wikt:de:Spezial:Linkliste/Template:Französisch|Template:Französisch]]
* [[wikt:de:Spezial:Linkliste/Template:Substantiv-Tabelle (Englisch)|Template:Substantiv-Tabelle (Englisch)]]

are both listed at [[wikt:de:Special:WantedTemplates]], and have been for a long time now. "Substantiv-Tabelle (Englisch)" is an extremely old example, like some others, just stuck. Each entry claims that "1 link" or more exists, but there are none. Latest update: 06:24, 18. Sep 2009. We have been watching this for quite some time now on dewikt.

So the script mentioned above should either be run again (maybe regularly) or for the first time.
Please see Mike's comment #7:
> So, it just needs to be run on shell. I don't know if Trevor will do that -
> could be assigned to wikibugs or someone who does shell requests.
Comment 9 Merlijn van Deen (test) 2010-01-02 02:51:30 UTC
*** Bug 21962 has been marked as a duplicate of this bug. ***
Comment 10 Krinkle 2010-01-02 03:09:02 UTC
Can this please be executed any time soon?
For all wikis, with a special request for nl.wikipedia =)

--
Krinkle
Comment 11 Antoine "hashar" Musso (WMF) 2011-01-22 13:03:40 UTC
I have refreshed links on nlwiki:

$ php refreshLinks.php --wiki nlwiki --dfn-only
Retrieving illegal entries from pagelinks... 0..100..200..300..312
Retrieving illegal entries from imagelinks... 0..100..110
Retrieving illegal entries from categorylinks... 0..100..200..243
Retrieving illegal entries from templatelinks... 0..100..200..300..400..500..600..700..800..900..1000..1100..1200..1300..1400..1500..1600..1700..1800..1900..2000..2100..2195
Retrieving illegal entries from externallinks... 0..10
$

I also updated Wantedcategories, Wantedfiles and Wantedtemplates:

 Wantedfiles                    got 5000 rows in 1m 33.69s
 Wantedcategories               got 12 rows in 22.64s
 Wantedtemplates                got 33 rows in 2m 34.86s

We still have to run the link refresher on all wikis.
Comment 12 Nemo 2011-04-19 21:06:43 UTC
This probably blocks bug 24480.
Comment 13 Priyanka Dhanda 2011-07-05 22:32:20 UTC
I think this was a request to run refreshLinks.php and updateSpecialPages.php more regularly on all wikis. Re-assigning to default assignee for someone to pick up.
Comment 14 Nemo 2011-07-05 22:39:10 UTC
(In reply to comment #13)
> I think this was a request to run refreshLinks.php and updateSpecialPages.php
> more regularly on all wikis. Re-assigning to default assignee for someone to
> pick up.

It wouldn't be bad to do this once, though. Some errors have persisted for years, e.g. bug 24480.
Comment 16 Umherirrender 2011-11-13 17:28:57 UTC
An alternative is to exclude non existing pages from the count: bug 32395
Comment 17 Umherirrender 2011-11-27 10:54:36 UTC
(In reply to comment #16)
> An alternative is to exclude non existing pages from the count: bug 32395

Better than that bug would be to run the maintenance script "refreshLinks.php --dfn-only" regularly (always/weekly/monthly) before "updateSpecialPages.php" in the same cron job, because bug 32395 only fixed the Wanted* special pages. In theory these ghost entries also affect other query pages like Special:MostLinkedPages, but with counts that large the ghost entries cannot be verified.

Is it possible to add that script to the existing cron job? Thanks.
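
The ordering proposed in this comment could look roughly like the sketch below. The `--wiki` and `--dfn-only` options for refreshLinks.php appear elsewhere in this report; passing `--wiki` to updateSpecialPages.php, the helper names `build_commands` and `run_in_order`, and stopping after a failed first step are all assumptions made for illustration. This is not the actual cron job.

```python
import subprocess


def build_commands(wiki):
    """Return the two maintenance commands in the proposed order."""
    return [
        ["php", "refreshLinks.php", "--wiki", wiki, "--dfn-only"],
        # Assumption: updateSpecialPages.php accepts --wiki the same way.
        ["php", "updateSpecialPages.php", "--wiki", wiki],
    ]


def run_in_order(commands, runner=subprocess.call):
    """Run commands one by one; stop at the first non-zero exit code."""
    for command in commands:
        code = runner(command)
        if code != 0:
            return code
    return 0
```

The `runner` parameter exists only so the ordering can be checked without PHP installed; whether a failed cleanup should really block the special page update is a policy choice this sketch does not settle.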
Comment 18 MZMcBride 2011-11-28 05:35:04 UTC
(In reply to comment #17)
> (In reply to comment #16)
>> An alternative is to exclude non existing pages from the count: bug 32395
> 
> Better than that bug would be to run the maintenance script
> "refreshLinks.php --dfn-only" regularly (always/weekly/monthly) before
> "updateSpecialPages.php" in the same cron job, because bug 32395 only fixed
> the Wanted* special pages. In theory these ghost entries also affect other
> query pages like Special:MostLinkedPages, but with counts that large the
> ghost entries cannot be verified.
> 
> Is it possible to add that script to the existing cron job? Thanks.

That sounds like the subject of a separate bug/ticket.
Comment 19 Umherirrender 2011-11-28 18:50:04 UTC
(In reply to comment #18)
> (In reply to comment #17)
> > (In reply to comment #16)
> >> An alternative is to exclude non existing pages from the count: bug 32395
> > 
> > Better than that bug would be to run the maintenance script
> > "refreshLinks.php --dfn-only" regularly (always/weekly/monthly) before
> > "updateSpecialPages.php" in the same cron job, because bug 32395 only fixed
> > the Wanted* special pages. In theory these ghost entries also affect other
> > query pages like Special:MostLinkedPages, but with counts that large the
> > ghost entries cannot be verified.
> > 
> > Is it possible to add that script to the existing cron job? Thanks.
> That sounds like the subject of a separate bug/ticket.

I am not sure: either you run "refreshLinks.php --dfn-only" once on each wiki to fix this bug, or you add it to the cron job and wait until the cron job has run on each wiki, and then both this bug and my comment are resolved. But feel free to clone this bug if necessary.
Comment 20 Umherirrender 2012-01-19 20:11:32 UTC
No action for weeks? Is it possible to get a decision on running it once versus running it in a cron job, and then to either run it or update the cron job?

Thanks.
Comment 21 Beta16 2012-01-27 10:00:13 UTC
[[:w:it:Special:WantedTemplates]] has the same issue.
Today, 24 of the 50 most wanted templates have 0 (zero!) transclusions in live pages:
* Template:Geobox statoNoQuadre (58 links)
* Template:Geobox ISTAT (58 links)
* Template:Geobox festivo (58 links)
* Template:Geobox patrono (58 links)
* Template:Geobox catasto (58 links)
* Template:Geobox coordinate comuni (55 links)
* Template:Geobox comuniSmall (55 links)
* Template:Da aiutare mese (47 links)
* Template:Da aiutare (46 links)
* Template:Wik (44 links)
* Template:Stub comuni (25 links)
* Template:Geografia/colorestub (25 links)
* Template:Musica (23 links)
* Template:Letteratura (18 links)
* Template:URSSPD (18 links)
* Template:Da tradurre (16 links)
* Template:Link esterni (15 links)
* Template:Qif (15 links)
* Template:Stub bio (14 links)
* Template:Film/rinvio (13 links)
* Template:Cinema/rinvio (13 links)
* Template:Da wikificare (12 links)
* Template:NavigazioneSport (11 links)
* Template:Trama (9 links)
And most of the others have a wrong count.

Please run the script periodically on all wikis.

Thanks.
Comment 22 Nemo 2012-01-27 10:29:32 UTC
Reducing scope of this bug; let's open a separate one for different requests and try to solve at least part of the problems at last.
Comment 23 Nemo 2012-01-27 10:30:12 UTC
*** Bug 15152 has been marked as a duplicate of this bug. ***
Comment 24 Nemo 2012-01-27 10:33:37 UTC
*** Bug 27480 has been marked as a duplicate of this bug. ***
Comment 25 Umherirrender 2012-01-27 16:17:09 UTC
I am requesting a periodic run, because the ghost entries sometimes come back. Thanks.
Comment 26 Rob Lanphier 2012-01-28 00:21:14 UTC
Turning into an ops request, raising priority, and filing a ticket with ops (RT #2355)
Comment 27 MZMcBride 2012-02-09 23:06:04 UTC
(In reply to comment #26)
> Turning into an ops request, raising priority, and filing a ticket with ops (RT
> #2355)

What's the status of this?
Comment 28 Sam Reed (reedy) 2012-02-09 23:35:26 UTC
(In reply to comment #27)
> (In reply to comment #26)
> > Turning into an ops request, raising priority, and filing a ticket with ops (RT
> > #2355)
> 
> What's the status of this?

Status: NEW
Comment 29 Beta16 2012-03-19 09:07:31 UTC
Is there any news on RT #2355?
Comment 30 Mark A. Hershberger 2012-03-19 17:22:09 UTC
robla and CT have been communicating on this, most recently on 2-13.  I've pinged CT.
Comment 31 Mark A. Hershberger 2012-03-19 17:35:04 UTC
CT has said mutante can work on it
Comment 32 Umherirrender 2012-03-30 20:31:45 UTC
What is the status of this? Thanks.
Comment 33 Umherirrender 2012-04-08 18:41:13 UTC
Status?
Comment 34 Umherirrender 2012-04-15 19:33:14 UTC
Another week is over; what is the status of the RT? Thanks.
Comment 35 Mark A. Hershberger 2012-04-16 15:13:47 UTC
Sam and Mutante just handled something on this.  I've asked them to update the ticket.
Comment 36 Mark A. Hershberger 2012-04-16 16:52:05 UTC
(In reply to comment #35)
> Sam and Mutante

Should have said "Daniel"...
Comment 37 Daniel Zahn 2012-04-16 17:22:20 UTC
Here's a suggestion (via Puppet, of course):

https://gerrit.wikimedia.org/r/#patch,sidebyside,5104,3,manifests/mediawiki.pp

Feel free to comment directly on the code in Gerrit if you like.
Comment 38 Umherirrender 2012-04-23 16:14:10 UTC
Gerrit change #5104 was successfully merged. Thanks!

Is the cron job run every day, at the hour given by the number in the cluster name?

How is that cron job monitored, in case it fails or does not run?
Comment 39 Beta16 2012-04-24 07:27:38 UTC
See bug 36195 for enwiki
Comment 40 Nemo 2012-04-24 07:52:40 UTC
Now that this has (almost) been fixed, someone may want to look into the similar bug 27480 to see what's needed there.
Comment 41 Daniel Zahn 2012-04-24 11:57:22 UTC
@Umherirrender: The cron job ran for all clusters except s1. Yes, initially at the hour given by the number in the cluster name. But since s1 failed, I disabled the others (they had just refreshed successfully anyway), and I am now running it on s1 again manually in a screen session. The cron jobs write log files to the local filesystem in /home/mwdeploy/refreshLinks.
Comment 42 Umherirrender 2012-04-24 15:59:54 UTC
(In reply to comment #40)
> Now that this has (almost) been fixed, someone may want to look into the similar
> bug 27480 to see what's needed there.

It is not a similar request, because this bug requests a periodic run.

refreshLinks with --dfn-only is *only* an SQL query, which runs on the cluster (and may be too slow or too heavy for enwiki).

refreshLinks without --dfn-only means reparsing all pages; doing that periodically does not sound like a good idea ...
Comment 43 Umherirrender 2012-04-24 16:07:20 UTC
In my opinion it is enough to run this script once a month, or at least right before updateSpecialPages runs (every 3 days), because only there will you see the ghost entries. But if an enwiki run takes hours, that will also delay updateSpecialPages, which is not a good idea.
Comment 44 Daniel Zahn 2012-05-03 12:12:11 UTC
After repeatedly trying the run on cluster s1 and having it fail, for now I did this:

The refreshes on clusters s2-s7 all seem to be working fine, so the cron jobs are now changed to run automatically just once a month. To keep it simple: s2 on day 2 of the month, s3 on day 3 and so on, always at midnight. That would resolve the ticket, except that:

s1 stays deactivated in the automatic crons for now.
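
For readers unfamiliar with crontab syntax, the monthly schedule described in this comment maps onto the five standard time fields (minute, hour, day of month, month, day of week) as sketched below. The helper names are hypothetical and the actual Puppet manifest is not reproduced here; only the "cluster number equals day of month, at midnight, s1 excluded" rule comes from the comment.

```python
def monthly_cron_fields(cluster):
    """Return crontab time fields for a cluster name such as 's2'.

    Fields are: minute hour day-of-month month day-of-week.
    Only s2 to s7 are scheduled; s1 is excluded, as in the comment.
    """
    number = int(cluster[1:])
    if not cluster.startswith("s") or not 2 <= number <= 7:
        raise ValueError(f"{cluster} is not scheduled automatically")
    # Midnight (minute 0, hour 0) on day `number` of every month.
    return f"0 0 {number} * *"


def monthly_schedule():
    """Map each automatically scheduled cluster to its time fields."""
    return {f"s{n}": monthly_cron_fields(f"s{n}") for n in range(2, 8)}
```

Staggering the clusters across different days means no two refreshes start at the same time.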
Comment 45 Sam Reed (reedy) 2012-05-03 19:18:50 UTC
*** Bug 33817 has been marked as a duplicate of this bug. ***
Comment 46 matanya 2012-07-23 09:03:05 UTC
Daniel, anything new with s1?
Comment 47 Sam Reed (reedy) 2012-07-23 11:42:44 UTC
(In reply to comment #46)
> Daniel, anything new with s1?

The problem is with the script (due to the sheer size of the enwiki database, I guess), so it most likely isn't Daniel's problem to fix.

I'm running it again manually to try and see what was wrong with it (I can't remember), but I guess the fix is going to be to query the highest page ID and do batches of X (100,000? 1M?) up to the page count.
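
The batching idea floated here could be sketched as follows. This is an illustration with an in-memory SQLite database rather than the eventual fix to the script; the helper names, the toy data and the use of `BETWEEN` are assumptions, and a batch size suited to enwiki is left open, as in the comment.

```python
import sqlite3


def id_batches(max_id, batch_size):
    """Yield inclusive (low, high) ID ranges covering 1..max_id."""
    low = 1
    while low <= max_id:
        high = min(low + batch_size - 1, max_id)
        yield low, high
        low = high + 1


def orphaned_ids(conn, max_id, batch_size):
    """Collect pl_from values with no page row, one ID range at a time."""
    found = []
    for low, high in id_batches(max_id, batch_size):
        rows = conn.execute(
            "SELECT DISTINCT pl_from FROM pagelinks "
            "LEFT JOIN page ON pl_from = page_id "
            "WHERE page_id IS NULL AND pl_from BETWEEN ? AND ? "
            "ORDER BY pl_from",
            (low, high),
        )
        found.extend(row[0] for row in rows)
    return found


def demo():
    # Hypothetical toy data: pages 1 and 2 exist, pages 7 and 9 do not.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE pagelinks (pl_from INTEGER, pl_title TEXT)")
    conn.executemany("INSERT INTO page VALUES (?)", [(1,), (2,)])
    conn.executemany(
        "INSERT INTO pagelinks VALUES (?, ?)",
        [(1, "A"), (2, "B"), (7, "C"), (7, "D"), (9, "E")],
    )
    upper = conn.execute("SELECT MAX(pl_from) FROM pagelinks").fetchone()[0]
    return orphaned_ids(conn, upper, batch_size=4)
```

One caveat: the demo takes its upper bound from the link table rather than the page table, because a deleted page can have an ID above the highest surviving page ID, and its link rows would otherwise fall outside every batch.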
Comment 48 Sam Reed (reedy) 2012-07-23 13:40:59 UTC
The original queries take an age, and loading the whole result at once isn't going to work.

mysql> explain select DISTINCT pl_from from pagelinks LEFT JOIN page ON pl_from=page_id;
+----+-------------+-----------+--------+---------------+---------+---------+--------------------------+-----------+------------------------------+
| id | select_type | table     | type   | possible_keys | key     | key_len | ref                      | rows      | Extra                        |
+----+-------------+-----------+--------+---------------+---------+---------+--------------------------+-----------+------------------------------+
|  1 | SIMPLE      | pagelinks | index  | NULL          | pl_from | 265     | NULL                     | 624327870 | Using index; Using temporary |
|  1 | SIMPLE      | page      | eq_ref | PRIMARY       | PRIMARY | 4       | enwiki.pagelinks.pl_from |         1 | Using index; Distinct        |
+----+-------------+-----------+--------+---------------+---------+---------+--------------------------+-----------+------------------------------+
2 rows in set (0.01 sec)

Removing the DISTINCT would make things simpler. If we kept a client-side count and removed the DISTINCT, would this work for us?
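
A rough sketch of what "removing the DISTINCT and counting on the client" might mean is given below, again using an in-memory SQLite database. The `WHERE page_id IS NULL` filter is an assumption added to restrict the result to orphaned rows (the query shown in the EXPLAIN output has no WHERE clause), and the function name is hypothetical. Whether this performs better on the real database is exactly the open question in the comment.

```python
import sqlite3
from collections import Counter


def count_orphans_client_side(conn):
    """Stream pl_from values without DISTINCT and tally them in Python."""
    cur = conn.execute(
        "SELECT pl_from FROM pagelinks "
        "LEFT JOIN page ON pl_from = page_id "
        "WHERE page_id IS NULL"  # assumption: keep only rows without a page
    )
    counts = Counter()
    for (pl_from,) in cur:
        counts[pl_from] += 1  # de-duplication happens here, not in SQL
    return counts


def demo():
    # Hypothetical toy data: pages 1 and 2 exist, pages 7 and 9 do not.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE pagelinks (pl_from INTEGER, pl_title TEXT)")
    conn.executemany("INSERT INTO page VALUES (?)", [(1,), (2,)])
    conn.executemany(
        "INSERT INTO pagelinks VALUES (?, ?)",
        [(1, "A"), (2, "B"), (7, "C"), (7, "D"), (9, "E")],
    )
    return dict(count_orphans_client_side(conn))
```

The keys of the returned mapping are the distinct orphaned page IDs, and the values are the number of stale link rows for each.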
Comment 49 Vishnu Nk 2013-12-21 04:36:31 UTC
Can anyone tell me how to get assigned to a bug?
Please mail me at: mails2vichu@gmail.com
Comment 50 Andre Klapper 2013-12-23 14:43:34 UTC
Vishnu: By adding an "I plan to work on this" comment here. However, this might be a harder one to contribute to.
