Last modified: 2013-12-02 22:43:39 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T44715, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 42715 - Duplicate translation notifications are generated
Duplicate translation notifications are generated
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
JobQueue (Other open bugs)
1.21.x
All All
: Highest major with 1 vote (vote)
: 1.20.x release
Assigned To: Aaron Schulz
https://meta.wikimedia.org/w/index.ph...
: code-update-regression
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2012-12-05 10:09 UTC by Marcin Cieślak
Modified: 2013-12-02 22:43 UTC (History)
18 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Marcin Cieślak 2012-12-05 10:09:22 UTC
09:31 < Seddon> Nemo_bis, unless I am waking up in the middle of my sleep to send translation notifications.......... which would be pretty damn weird, the 
                notification extension is broken
09:32 < saper> Seddon: wrong timestamp?
09:33 < Seddon> saper: people are getting multiple notifications. one guy I saw got 4 notifications spread over a period of days
09:33 < saper> Seddon: do you have some links?
09:33 < Seddon> yeah one sec
09:34 < Platonides> Seddon, the "you get notifications with the wrong email content when someone reviews your translation" bug ?
09:34 < Seddon> no, talk page
09:34 < Seddon> http://meta.wikimedia.org/wiki/User_talk:Vladimir_Penov
09:34 < Seddon> this guy have 4 versions of one notification, 3 of another and two of the most recent
09:35 < Seddon> I checked my contributions and I apparently sent out two batches of notifications
09:39 < Seddon> I wonder...........
09:44 < saper> Seddon: https://meta.wikimedia.org/w/index.php?title=User_talk:Vladimir_Penov&diff=4751313&oldid=4748846 and 
               https://meta.wikimedia.org/w/index.php?title=User_talk:Vladimir_Penov&diff=4754445&oldid=4751313 ?
09:45 < Seddon> saper: Thats the latest instance of it yeah.... I couldn't have sent the second version cause at the time I was asleep
09:45 < saper> byte count is also identical, module one or two digit day
09:46 < Seddon> https://meta.wikimedia.org/w/index.php?title=Special%3ALog&type=notifytranslators&user=&page=&year=&month=-1&tagfilter=&hide_patrol_log=1
09:46 < Seddon> only two requests here as well
Comment 1 Siebrand Mazeland 2012-12-05 10:21:54 UTC
Suspecting recent changes to the job queue system, as there were no recent changes to TranslationNotifications. Will investigate in next sprint, starting 11/12, running through 27/12.
Comment 2 Joseph Seddon 2012-12-05 10:36:40 UTC
https://bugzilla.wikimedia.org/show_bug.cgi?id=42614 is the bug currently related to the job queue issues
Comment 3 Siebrand Mazeland 2012-12-05 10:41:39 UTC
The code for TranslationNotifications hasn't changed, the job queue has...

Adding keyword "code-update-regression", and adding Aaron and Tim to CC, because they reworked the job queue recently, as well as RobLa for being the platform engineering manager.

Niklas is telling me that he doesn't want to touch job related issues anymore, because things have gotten "way too complicated", something I'm not really used hearing from him, so something must be off there... He also refers to bug 42614.
Comment 4 Tilman Bayer 2012-12-05 13:47:44 UTC
Note that this occurs not just on Meta, but also - with User:Translation Notification Bot - on other wikis. See e.g. https://mg.wiktionary.org/w/index.php?title=Dinika_amin%27ny_mpikambana:Jagwar&dir=prev&offset=20121127000000&limit=11&action=history&uselang=en (sending the same notification four times over the course of seven days).
Comment 5 Siebrand Mazeland 2012-12-05 14:05:15 UTC
(In reply to comment #4)
> Note that this occurs not just on Meta, but also - with User:Translation
> Notification Bot - on other wikis.

That's not surprising. These edits are triggered from the same code. It's a good thing (relatively speaking :P) that everything is going wrong.
Comment 6 Nemo 2012-12-06 10:07:50 UTC
This is not highest priority. There's another bug with higher priority in this component which is more severe than this (and "Highest" now requires an assignee, but Aaron is not working on this bug AFAIK).
Comment 7 Siebrand Mazeland 2012-12-06 11:10:45 UTC
Here's an update that Joseph made on translators-l: "The extension still is sending out repeating notifications a day after the last request. As as a result we have blocked my meta staff account and locked out the translation notification bot that delivers local notifications to you guys in hope that this will at least stem more duplicates"

Not sure how much this helps, as the delivery process is active on all Wikimedia wikis, not only on Meta-Wiki.
Comment 8 Nemo 2012-12-06 11:30:30 UTC
(In reply to comment #7)
> Not sure how much this helps, as the delivery process is active on all
> Wikimedia wikis, not only on Meta-Wiki.

It's a lock, not a block, for the bot; Jhs said it should work to prevent edits.
Comment 9 Jon Harald Søby 2012-12-06 11:34:15 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > Not sure how much this helps, as the delivery process is active on all
> > Wikimedia wikis, not only on Meta-Wiki.
> 
> It's a lock, not a block, for the bot; Jhs said it should work to prevent
> edits.

Yeah, if I understand the code from line 62 correcly here [1], the extension checks if the bot is logged in before posting, and returns an error if it can't log in. Since locking (which is global) logs the user out and prevents logging in, I'm pretty sure this will fix the problem with multiple user talk messages at least – e-mails I don't know about, but this should fix half the problem.


[1] https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/extensions/TranslationNotifications.git;a=blob;f=TranslationNotificationJob.php;h=da900499081bdf8b57166a86499d1ddcede104e1;hb=HEAD
Comment 10 Joseph Seddon 2012-12-06 11:46:10 UTC
<hashar>	metawiki has 2213 translationNotificationJobs pending
<hashar>	I get three translationNotificationJob jobs for it
<hashar>	from Nov 27th, Dec 04 and Dec 01.
<hashar>	each of them have job_attempts = 3
<Seddon>	looking over the edits, it seems like the job gets half done and then stops and starts all over again. 
<hashar>	the jobs do end up with an error
<hashar>	but that is an empty error
Comment 11 Antoine "hashar" Musso (WMF) 2012-12-06 13:08:06 UTC
I have patched the notification job to always return a boolean: https://gerrit.wikimedia.org/r/37211

runJobs.php always logged the job as being in error because the job run() returned null.  Probably cause the new Job queue to reattempt the job run.
Comment 12 Siebrand Mazeland 2012-12-06 13:42:31 UTC
Thanks a bunch for looking into this, Antoine!

This makes me fear for the other gazillion extensions that have jobs. Is this a regression in the job queue that must be fixed?
Comment 13 Antoine "hashar" Musso (WMF) 2012-12-06 13:44:48 UTC
I dont think the patch solve the issue though.  I am unfamiliar with the JobQueue and how it handle deduplicate or the removal of the jobs once they are finished.  At least Gerrit change #37211 will give let us log an error message :)
Comment 14 Andre Klapper 2012-12-06 17:20:52 UTC
I'm bumping this up as this seems to affected a lot of people whose notification volumes suddenly increase.
Comment 15 Aaron Schulz 2012-12-07 02:48:42 UTC
(In reply to comment #12)
> Thanks a bunch for looking into this, Antoine!
> 
> This makes me fear for the other gazillion extensions that have jobs. Is
> this a
> regression in the job queue that must be fixed?

runJobs always calls ack() even if run() returned false. The only way it won't is if:
a) the runner dies in the middle of the job (e.g. some serious fatal or exception)
b) ack() fails for some reason
Comment 16 Aaron Schulz 2012-12-07 05:17:38 UTC
(In reply to comment #15)
> (In reply to comment #12)
> > Thanks a bunch for looking into this, Antoine!
> > 
> > This makes me fear for the other gazillion extensions that have jobs. Is
> > this a
> > regression in the job queue that must be fixed?
> 
> runJobs always calls ack() even if run() returned false. The only way it
> won't
> is if:
> a) the runner dies in the middle of the job (e.g. some serious fatal or
> exception)
> b) ack() fails for some reason

And it's (b). The translation notification job constructor is missing the $id parameter (it should have one like all the under job classes under /job). I've made https://gerrit.wikimedia.org/r/#/c/37372/2 to work around this problem, but the extension should be fixed.
Comment 17 Andre Klapper 2012-12-07 13:10:11 UTC
(In reply to comment #16)
> > ack() fails for some reason
> I've made https://gerrit.wikimedia.org/r/#/c/37372/2 to work around 
> this problem, but the extension should be fixed.

Is there a ticket against "MediaWiki extensions > TranslationNotifications" to follow yet?
Comment 18 Nemo 2012-12-07 13:17:42 UTC
(In reply to comment #17)
> Is there a ticket against "MediaWiki extensions > TranslationNotifications"
> to
> follow yet?

The situation is still being analysed.
Comment 19 Nemo 2012-12-07 18:07:15 UTC
Gerrit change #37459 was merged
Comment 20 Nemo 2012-12-11 14:39:26 UTC
(In reply to comment #18)
> (In reply to comment #17)
> > Is there a ticket against "MediaWiki extensions > TranslationNotifications"
> > to
> > follow yet?
> 
> The situation is still being analysed.

https://wikitech.wikimedia.org/view/TranslationNotifications_issues_2012-12#Timeline was published.
Comment 22 Tilman Bayer 2012-12-11 22:16:31 UTC
(In reply to comment #20)
> (In reply to comment #18)
> > (In reply to comment #17)
> > > Is there a ticket against "MediaWiki extensions > TranslationNotifications"
> > > to
> > > follow yet?
> > 
> > The situation is still being analysed.
> 
> https://wikitech.wikimedia.org/view/TranslationNotifications_issues_2012-
> 12#Timeline
> was published.

Since that report is not on a generally editable wiki, let me note here instead that while it gives the impression that the first duplicate notifications were observed at the end of November, this has in fact happened at least twice before, in July and August:

https://meta.wikimedia.org/wiki/Talk:Translation_requests#Double_notification

Examples: https://meta.wikimedia.org/w/index.php?diff=3964877&oldid=3964809 , https://meta.wikimedia.org/w/index.php?diff=4049992&oldid=4049949


Back then, the time between the redundant notifications was much shorter, so the reason may have been a different one.

Thanks to everyone who worked to resolve the present bug! I have a small notification (Portuguese only) to send out soon, which should make a good test case.
Comment 23 Nemo 2012-12-11 22:28:48 UTC
(In reply to comment #22)
> Back then, the time between the redundant notifications was much shorter, so
> the reason may have been a different one.

Yes, it was definitely another problem; from what I remember, it was also not systematic but related to some ephemeral conditions.
Comment 24 Tilman Bayer 2013-02-13 16:09:15 UTC
Duplicate talk page notification messages have just occurred again. Examples: 

https://meta.wikimedia.org/w/index.php?diff=prev&oldid=5240374

https://es.wikipedia.org/w/index.php?diff=prev&oldid=63778964

The second round of notifications occurred more than 13 hours after the first. There is no duplicate entry in the notification log (https://meta.wikimedia.org/w/index.php?title=Special%3ALog&type=notifytranslators ), and fortunately email notifications do not seem to have been duplicated (at least I haven't received a second one).
Comment 25 Aaron Schulz 2013-02-13 17:50:05 UTC
Last time the bug was TranslationNotificationJob not overriding the parent constructor properly. This time it is the fact that TranslationNotificationJob::run() doesn't return true/false like the docs say to.
Comment 26 Siebrand Mazeland 2013-02-13 18:17:31 UTC
(In reply to comment #25)
> Last time the bug was TranslationNotificationJob not overriding the parent
> constructor properly. This time it is the fact that
> TranslationNotificationJob::run() doesn't return true/false like the docs say
> to.

Gerrit change #48852 should fix this. Please review asap. It's almost midnight here in India. Needs to be backported and deployed, too.

Niklas has already deleted the failed jobs, but when TN is used again, the issue may occur again...
Comment 27 Sumana Harihareswara 2013-02-13 19:07:55 UTC
Gerrit change #48852 is now merged but is there a backport?
Comment 28 Niklas Laxström 2013-02-13 19:11:16 UTC
48852 is for Translate and it does not need to be backported.
48848 is for TN and it has been backported and deployed.
Comment 29 Sumana Harihareswara 2013-02-13 19:12:27 UTC
Thanks all. Marking as closed.
Comment 30 Andre Klapper 2013-02-13 19:19:58 UTC
Thanks, and please reopen if there are still issues.
Comment 31 Siebrand Mazeland 2013-02-13 20:04:43 UTC
Timeline:

16:09 UTC Reported by Tilman in comment 24
18:09 UTC Report noticed by Siebrand (22:39 IST local time)
19:03 UTC First patch set with fix, Gerrit change #48852
19:07 UTC Second patch set based on Siebrand's review
19:17 UTC Aaron reviews and approves
19:18 UTC Patch merged
19:24 UTC Patch backported to 1.21wmf9 by Aaron, Gerrit change #48853
19:26 UTC Patch merged in 1.21wmf9
19:30 UTC Patch deployed on Wikimedia wikis running 1.21wmf9 (http://hexm.de/victory )

Based on this issue, I have also checked the return values of other Job::run() functions in the 188 extensions I have on my local machine. Luckily, jobs aren't used all over the place, so I was only able to locate 3 extensions with bugs, next to Translate which was already fixed by Niklas in Gerrit change #48852.

I reported bug 44960 (AFTv5), bug 44962 (Wikidata) and bug 44963 (TMH).

To minimise damage from incorrect jobs, Aaron and I have also made changes to job queue:
* Gerrit change #48860 treats a non-boolean return value of Job::run() as true
* Gerrit change #48858 makes sure MediaWiki starts yelling if a job's run() method does not return a boolean.

1h11m as time to fix from "issue noticed" to deployment, while the report was noticed at 22:39 IST isn't bad at all, I think. Thanks to all involved!
Comment 32 Nemo 2013-04-29 06:13:59 UTC
This happened again on March 14, with first notification sent after all Wikipedias were switched to 1.21wmf11.

https://it.wikipedia.org/w/index.php?title=Discussioni_utente:Davidbottan&diff=prev&oldid=57453634
https://it.wikipedia.org/w/index.php?title=Discussioni_utente:Davidbottan&diff=prev&oldid=57453811

I don't remember if it was expected that the update would bring some breakage; but if it was, apparently the translation administrators didn't know.
Comment 33 Tilman Bayer 2013-07-16 02:52:35 UTC
Duplicate notifications were just sent out again (both by email and on talk pages), and on MEta, Translation Notification Bot was briefly blocked because of this. 
https://meta.wikimedia.org/w/index.php?diff=5654816&oldid=5654815
Comment 34 Andre Klapper 2013-07-16 10:07:22 UTC
Meh. Might be worth a new bug report.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links