Last modified: 2013-12-02 22:43:39 UTC
09:31 < Seddon> Nemo_bis, unless I am waking up in the middle of my sleep to send translation notifications.......... which would be pretty damn weird, the notification extension is broken 09:32 < saper> Seddon: wrong timestamp? 09:33 < Seddon> saper: people are getting multiple notifications. one guy I saw got 4 notifications spread over a period of days 09:33 < saper> Seddon: do you have some links? 09:33 < Seddon> yeah one sec 09:34 < Platonides> Seddon, the "you get notifications with the wrong email content when someone reviews your translation" bug ? 09:34 < Seddon> no, talk page 09:34 < Seddon> http://meta.wikimedia.org/wiki/User_talk:Vladimir_Penov 09:34 < Seddon> this guy have 4 versions of one notification, 3 of another and two of the most recent 09:35 < Seddon> I checked my contributions and I apparently sent out two batches of notifications 09:39 < Seddon> I wonder........... 09:44 < saper> Seddon: https://meta.wikimedia.org/w/index.php?title=User_talk:Vladimir_Penov&diff=4751313&oldid=4748846 and https://meta.wikimedia.org/w/index.php?title=User_talk:Vladimir_Penov&diff=4754445&oldid=4751313 ? 09:45 < Seddon> saper: Thats the latest instance of it yeah.... I couldn't have sent the second version cause at the time I was asleep 09:45 < saper> byte count is also identical, module one or two digit day 09:46 < Seddon> https://meta.wikimedia.org/w/index.php?title=Special%3ALog&type=notifytranslators&user=&page=&year=&month=-1&tagfilter=&hide_patrol_log=1 09:46 < Seddon> only two requests here as well
Suspecting recent changes to the job queue system, as there were no recent changes to TranslationNotifications. Will investigate in next sprint, starting 11/12, running through 27/12.
https://bugzilla.wikimedia.org/show_bug.cgi?id=42614 is the bug currently related to the job queue issues
The code for TranslationNotifications hasn't changed, the job queue has... Adding keyword "code-update-regression", and adding Aaron and Tim to CC, because they reworked the job queue recently, as well as RobLa for being the platform engineering manager. Niklas is telling me that he doesn't want to touch job related issues anymore, because things have gotten "way too complicated", something I'm not really used hearing from him, so something must be off there... He also refers to bug 42614.
Note that this occurs not just on Meta, but also - with User:Translation Notification Bot - on other wikis. See e.g. https://mg.wiktionary.org/w/index.php?title=Dinika_amin%27ny_mpikambana:Jagwar&dir=prev&offset=20121127000000&limit=11&action=history&uselang=en (sending the same notification four times over the course of seven days).
(In reply to comment #4) > Note that this occurs not just on Meta, but also - with User:Translation > Notification Bot - on other wikis. That's not surprising. These edits are triggered from the same code. It's a good thing (relatively speaking :P) that everything is going wrong.
This is not highest priority. There's another bug with higher priority in this component which is more severe than this (and "Highest" now requires an assignee, but Aaron is not working on this bug AFAIK).
Here's an update that Joseph made on translators-l: "The extension still is sending out repeating notifications a day after the last request. As as a result we have blocked my meta staff account and locked out the translation notification bot that delivers local notifications to you guys in hope that this will at least stem more duplicates" Not sure how much this helps, as the delivery process is active on all Wikimedia wikis, not only on Meta-Wiki.
(In reply to comment #7) > Not sure how much this helps, as the delivery process is active on all > Wikimedia wikis, not only on Meta-Wiki. It's a lock, not a block, for the bot; Jhs said it should work to prevent edits.
(In reply to comment #8) > (In reply to comment #7) > > Not sure how much this helps, as the delivery process is active on all > > Wikimedia wikis, not only on Meta-Wiki. > > It's a lock, not a block, for the bot; Jhs said it should work to prevent > edits. Yeah, if I understand the code from line 62 correcly here [1], the extension checks if the bot is logged in before posting, and returns an error if it can't log in. Since locking (which is global) logs the user out and prevents logging in, I'm pretty sure this will fix the problem with multiple user talk messages at least – e-mails I don't know about, but this should fix half the problem. [1] https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/extensions/TranslationNotifications.git;a=blob;f=TranslationNotificationJob.php;h=da900499081bdf8b57166a86499d1ddcede104e1;hb=HEAD
<hashar> metawiki has 2213 translationNotificationJobs pending <hashar> I get three translationNotificationJob jobs for it <hashar> from Nov 27th, Dec 04 and Dec 01. <hashar> each of them have job_attempts = 3 <Seddon> looking over the edits, it seems like the job gets half done and then stops and starts all over again. <hashar> the jobs do end up with an error <hashar> but that is an empty error
I have patched the notification job to always return a boolean: https://gerrit.wikimedia.org/r/37211 runJobs.php always logged the job as being in error because the job run() returned null. Probably cause the new Job queue to reattempt the job run.
Thanks a bunch for looking into this, Antoine! This makes me fear for the other gazillion extensions that have jobs. Is this a regression in the job queue that must be fixed?
I dont think the patch solve the issue though. I am unfamiliar with the JobQueue and how it handle deduplicate or the removal of the jobs once they are finished. At least Gerrit change #37211 will give let us log an error message :)
I'm bumping this up as this seems to affected a lot of people whose notification volumes suddenly increase.
(In reply to comment #12) > Thanks a bunch for looking into this, Antoine! > > This makes me fear for the other gazillion extensions that have jobs. Is > this a > regression in the job queue that must be fixed? runJobs always calls ack() even if run() returned false. The only way it won't is if: a) the runner dies in the middle of the job (e.g. some serious fatal or exception) b) ack() fails for some reason
(In reply to comment #15) > (In reply to comment #12) > > Thanks a bunch for looking into this, Antoine! > > > > This makes me fear for the other gazillion extensions that have jobs. Is > > this a > > regression in the job queue that must be fixed? > > runJobs always calls ack() even if run() returned false. The only way it > won't > is if: > a) the runner dies in the middle of the job (e.g. some serious fatal or > exception) > b) ack() fails for some reason And it's (b). The translation notification job constructor is missing the $id parameter (it should have one like all the under job classes under /job). I've made https://gerrit.wikimedia.org/r/#/c/37372/2 to work around this problem, but the extension should be fixed.
(In reply to comment #16) > > ack() fails for some reason > I've made https://gerrit.wikimedia.org/r/#/c/37372/2 to work around > this problem, but the extension should be fixed. Is there a ticket against "MediaWiki extensions > TranslationNotifications" to follow yet?
(In reply to comment #17) > Is there a ticket against "MediaWiki extensions > TranslationNotifications" > to > follow yet? The situation is still being analysed.
Gerrit change #37459 was merged
(In reply to comment #18) > (In reply to comment #17) > > Is there a ticket against "MediaWiki extensions > TranslationNotifications" > > to > > follow yet? > > The situation is still being analysed. https://wikitech.wikimedia.org/view/TranslationNotifications_issues_2012-12#Timeline was published.
Permanent link: https://wikitech.wikimedia.org/index.php?title=TranslationNotifications_issues_2012-12&oldid=54502#Timeline
(In reply to comment #20) > (In reply to comment #18) > > (In reply to comment #17) > > > Is there a ticket against "MediaWiki extensions > TranslationNotifications" > > > to > > > follow yet? > > > > The situation is still being analysed. > > https://wikitech.wikimedia.org/view/TranslationNotifications_issues_2012- > 12#Timeline > was published. Since that report is not on a generally editable wiki, let me note here instead that while it gives the impression that the first duplicate notifications were observed at the end of November, this has in fact happened at least twice before, in July and August: https://meta.wikimedia.org/wiki/Talk:Translation_requests#Double_notification Examples: https://meta.wikimedia.org/w/index.php?diff=3964877&oldid=3964809 , https://meta.wikimedia.org/w/index.php?diff=4049992&oldid=4049949 Back then, the time between the redundant notifications was much shorter, so the reason may have been a different one. Thanks to everyone who worked to resolve the present bug! I have a small notification (Portuguese only) to send out soon, which should make a good test case.
(In reply to comment #22) > Back then, the time between the redundant notifications was much shorter, so > the reason may have been a different one. Yes, it was definitely another problem; from what I remember, it was also not systematic but related to some ephemeral conditions.
Duplicate talk page notification messages have just occurred again. Examples: https://meta.wikimedia.org/w/index.php?diff=prev&oldid=5240374 https://es.wikipedia.org/w/index.php?diff=prev&oldid=63778964 The second round of notifications occurred more than 13 hours after the first. There is no duplicate entry in the notification log (https://meta.wikimedia.org/w/index.php?title=Special%3ALog&type=notifytranslators ), and fortunately email notifications do not seem to have been duplicated (at least I haven't received a second one).
Last time the bug was TranslationNotificationJob not overriding the parent constructor properly. This time it is the fact that TranslationNotificationJob::run() doesn't return true/false like the docs say to.
(In reply to comment #25) > Last time the bug was TranslationNotificationJob not overriding the parent > constructor properly. This time it is the fact that > TranslationNotificationJob::run() doesn't return true/false like the docs say > to. Gerrit change #48852 should fix this. Please review asap. It's almost midnight here in India. Needs to be backported and deployed, too. Niklas has already deleted the failed jobs, but when TN is used again, the issue may occur again...
Gerrit change #48852 is now merged but is there a backport?
48852 is for Translate and it does not need to be backported. 48848 is for TN and it has been backported and deployed.
Thanks all. Marking as closed.
Thanks, and please reopen if there are still issues.
Timeline: 16:09 UTC Reported by Tilman in comment 24 18:09 UTC Report noticed by Siebrand (22:39 IST local time) 19:03 UTC First patch set with fix, Gerrit change #48852 19:07 UTC Second patch set based on Siebrand's review 19:17 UTC Aaron reviews and approves 19:18 UTC Patch merged 19:24 UTC Patch backported to 1.21wmf9 by Aaron, Gerrit change #48853 19:26 UTC Patch merged in 1.21wmf9 19:30 UTC Patch deployed on Wikimedia wikis running 1.21wmf9 (http://hexm.de/victory ) Based on this issue, I have also checked the return values of other Job::run() functions in the 188 extensions I have on my local machine. Luckily, jobs aren't used all over the place, so I was only able to locate 3 extensions with bugs, next to Translate which was already fixed by Niklas in Gerrit change #48852. I reported bug 44960 (AFTv5), bug 44962 (Wikidata) and bug 44963 (TMH). To minimise damage from incorrect jobs, Aaron and I have also made changes to job queue: * Gerrit change #48860 treats a non-boolean return value of Job::run() as true * Gerrit change #48858 makes sure MediaWiki starts yelling if a job's run() method does not return a boolean. 1h11m as time to fix from "issue noticed" to deployment, while the report was noticed at 22:39 IST isn't bad at all, I think. Thanks to all involved!
This happened again on March 14, with first notification sent after all Wikipedias were switched to 1.21wmf11. https://it.wikipedia.org/w/index.php?title=Discussioni_utente:Davidbottan&diff=prev&oldid=57453634 https://it.wikipedia.org/w/index.php?title=Discussioni_utente:Davidbottan&diff=prev&oldid=57453811 I don't remember if it was expected that the update would bring some breakage; but if it was, apparently the translation administrators didn't know.
Duplicate notifications were just sent out again (both by email and on talk pages), and on MEta, Translation Notification Bot was briefly blocked because of this. https://meta.wikimedia.org/w/index.php?diff=5654816&oldid=5654815
Meh. Might be worth a new bug report.