Last modified: 2014-10-21 10:54:41 UTC
For the hour 2014-10-10T15:xx:xx, the upload partition [1] was not marked successful. What happened?

[1]
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 10:42:42 // exit code: 0
cwd: ~
cluster-scripts/dump_webrequest_status.sh
  +---------------------+--------+--------+--------+--------+
  | Date                |  bits  |  text  | mobile | upload |
  +---------------------+--------+--------+--------+--------+
[...]
  | 2014-10-10T13:xx:xx |   .    |   .    |   .    |   .    |
  | 2014-10-10T14:xx:xx |   .    |   .    |   .    |   .    |
  | 2014-10-10T15:xx:xx |   .    |   .    |   .    |   X    |
  | 2014-10-10T16:xx:xx |   .    |   .    |   .    |   .    |
  | 2014-10-10T17:xx:xx |   .    |   .    |   .    |   .    |
[...]
  +---------------------+--------+--------+--------+--------+

Statuses:
  . --> Partition is ok
  X --> Partition is not ok (duplicates, missing, or nulls)
The Oozie job for checking that partition has status KILLED [1], and seems to have been killed by user hdfs at 17:28 [2]. A few minutes later, the bundles were restarted, so I assume the partition check was killed deliberately.

However, since the job's sequence statistics had not been fully computed (killed at 95% of the reduce step), I started the recomputation job by hand. The sequence stats recomputation is done, and the partition has neither missing entries nor duplicates. Hence, I manually marked the partition good.

[1]
qchris@analytics1027:~$ oozie job -verbose -info 0037425-140725140105408-oozie-oozi-W
Job ID : 0037425-140725140105408-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : hive_add_partition-wmf_raw.webrequest-upload,2014,10,10,15-wf
App Path      : hdfs://analytics-hadoop/wmf/refinery/current/oozie/webrequest/partition/add/workflow.xml
Status        : KILLED
Run           : 0
User          : hdfs
Group         : -
Created       : 2014-10-10 17:04:54 GMT
Started       : 2014-10-10 17:04:54 GMT
Last Modified : 2014-10-10 17:28:15 GMT
Ended         : 2014-10-10 17:28:13 GMT
CoordAction ID: 0003812-140725140105408-oozie-oozi-C@2060

Actions
------------------------------------------------------------------------------------------------------------------------------------
ID Console URL Error Code Error Message External ID External Status Name Retries Tracker URI Type Started Status Ended
------------------------------------------------------------------------------------------------------------------------------------
0037425-140725140105408-oozie-oozi-W@:start: - - - - OK :start: 0 - :START: 2014-10-10 17:04:54 GMT OK 2014-10-10 17:04:54 GMT
------------------------------------------------------------------------------------------------------------------------------------
0037425-140725140105408-oozie-oozi-W@add_partition http://analytics1027.eqiad.wmnet:11000/oozie?job=0037426-140725140105408-oozie-oozi-W - - 0037426-140725140105408-oozie-oozi-W SUCCEEDED add_partition 0 local sub-workflow 2014-10-10 17:04:54 GMT OK 2014-10-10 17:05:11 GMT
------------------------------------------------------------------------------------------------------------------------------------
0037425-140725140105408-oozie-oozi-W@generate_sequence_statistics http://analytics1010.eqiad.wmnet:8088/proxy/application_1409078537822_38526/ - - job_1409078537822_38526 KILLED generate_sequence_statistics 0 resourcemanager.analytics.eqiad.wmnet:8032 hive 2014-10-10 17:05:11 GMT KILLED 2014-10-10 17:28:15 GMT
------------------------------------------------------------------------------------------------------------------------------------

[2] See HDFS's /var/log/hadoop-yarn/apps/hdfs/logs/application_1409078537822_38526/analytics1029.eqiad.wmnet_8041 line 607:
:2014-10-10 17:28:13,907 INFO [IPC Server handler 0 on 36062] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Kill job job_1409078537822_38526 received from hdfs (auth:SIMPLE) at 10.64.36.127
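
For readers unfamiliar with the "missing" and "duplicates" criteria used in this thread, here is a rough Python sketch of the kind of check a sequence-statistics job might perform. It is not the actual Hive query run by generate_sequence_statistics. The function names, the per-host grouping, and the host names in the usage note are assumptions made for illustration only, and null handling is left out.

```python
from collections import Counter


def sequence_stats(seqs):
    """Summarise one host's sequence numbers for a single hourly partition.

    Hypothetical helper: returns (expected, actual, missing, duplicates).
    Assumes every sequence number is an integer (no nulls).
    """
    if not seqs:
        return (0, 0, 0, 0)
    distinct = len(Counter(seqs))            # number of distinct sequence numbers
    expected = max(seqs) - min(seqs) + 1     # size of a gap-free range
    actual = len(seqs)                       # lines actually received
    duplicates = actual - distinct           # lines seen more than once
    missing = expected - distinct            # gaps in the range
    return (expected, actual, missing, duplicates)


def partition_ok(seqs_by_host):
    """A partition counts as ok only if no host has missing or duplicate lines."""
    for seqs in seqs_by_host.values():
        _, _, missing, duplicates = sequence_stats(seqs)
        if missing or duplicates:
            return False
    return True
```

With made-up host names, partition_ok({"host-a": [1, 2, 3], "host-b": [7, 8, 9]}) is true, while a host reporting [1, 2, 2, 5] has one duplicate and two missing entries, so the partition would be flagged.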