Last modified: 2014-10-21 10:49:16 UTC
From 2014-09-23T18:xx:xx onwards, no partitions were marked successful [1]. What happened?

[1]
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 00:48:38 // exit code: 0
cwd: ~
~/cluster-scripts/dump_webrequest_status.sh
+---------------------+--------+--------+--------+--------+
| Date                | bits   | text   | mobile | upload |
+---------------------+--------+--------+--------+--------+
[...]
| 2014-09-23T18:xx:xx |    X   |    X   |    X   |    X   |
| 2014-09-23T19:xx:xx |    X   |    X   |    X   |    X   |
| 2014-09-23T20:xx:xx |    X   |    X   |    X   |    X   |
| 2014-09-23T21:xx:xx |    X   |    X   |    X   |    X   |
+---------------------+--------+--------+--------+--------+
Today's refinery deployment came with Ie557acff61b907e0a43c45f0ca82b5bf43a800d6, which adds a new mandatory parameter "mark_directory_done_workflow_file" to "oozie/webrequest/partition/add". It seems that this Oozie job was not resubmitted after the deployment. It therefore kept running with the old properties file, which lacks a setting for the newly added parameter. To not disturb Oozie too much, I rolled back /wmf/refinery/current/oozie/webrequest/partition/add on the cluster to ebc92c1, so the directory now contains XML files that work with the old properties file.
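The failure mode described above (a workflow that gains a mandatory parameter while the running job still uses an older properties file) can be caught before deployment with a simple check. The following is only a minimal sketch, not part of refinery: the helper names, the sample properties content, and the list of required parameters are all hypothetical, and only "mark_directory_done_workflow_file" is taken from the change above. It also assumes plain "key = value" properties lines without continuation lines.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blank lines and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_parameters(properties_text, required):
    """Return the required parameter names that the properties text does not set."""
    props = parse_properties(properties_text)
    return [name for name in required if name not in props]


# Hypothetical stand-in for the properties the job was originally submitted with.
OLD_PROPERTIES = """\
# illustrative content only, not the real refinery properties file
name_node = hdfs://example
workflow_file = ${name_node}/example/workflow.xml
"""

# Hypothetical list of parameters the newly deployed workflow expects.
REQUIRED = ["workflow_file", "mark_directory_done_workflow_file"]

if __name__ == "__main__":
    print(missing_parameters(OLD_PROPERTIES, REQUIRED))
```

A non-empty result would indicate that the job needs to be resubmitted with an updated properties file before the new workflow files go live.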
I started rerunning the affected jobs. The first few have already finished, and the corresponding partitions are now marked successful. The last jobs should finish in a few hours. Waiting to close the bug until then.
The jobs reran just fine. All affected webrequest partitions are now marked successful. Pagecount generation automatically waited for the webrequest partitions to be marked successful, and continued on its own once they were. So we now have good data for each of the affected partitions/hours.