Last modified: 2014-11-14 15:47:48 UTC
Three of the webrequest partitions [1] for 2014-11-13T20/1H have not been marked successful.

What happened?

[1] _________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 14:37:13 // exit code: 0
cwd: ~
~/cluster-scripts/dump_webrequest_status.sh
  +------------------+--------+--------+--------+--------+
  | Date             |  bits  | mobile |  text  | upload |
  +------------------+--------+--------+--------+--------+
[...]
  | 2014-11-13T18/1H |   .    |   .    |   .    |   X    |
  | 2014-11-13T19/1H |   .    |   .    |   .    |   .    |
  | 2014-11-13T20/1H |   X    |   .    |   X    |   X    |
  | 2014-11-13T21/1H |   .    |   .    |   .    |   .    |
  | 2014-11-13T22/1H |   .    |   .    |   .    |   X    |
[...]
  +------------------+--------+--------+--------+--------+

Statuses:
  . --> Partition is ok
  M --> Partition manually marked ok
  X --> Partition is not ok (duplicates, missing, or nulls)
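The status table above can also be scanned mechanically instead of by eye. The following is only a minimal sketch, assuming the output keeps the pipe-delimited layout shown in this ticket; the names `parse_status_table` and `not_ok` are hypothetical helpers and are not part of `dump_webrequest_status.sh`.

```python
# Sketch: list partitions flagged "X" in pasted dump_webrequest_status.sh output.
# Assumption: rows are pipe-delimited exactly as in the excerpt quoted in this ticket.

SAMPLE = """\
+------------------+--------+--------+--------+--------+
| Date             |  bits  | mobile |  text  | upload |
+------------------+--------+--------+--------+--------+
[...]
| 2014-11-13T18/1H |   .    |   .    |   .    |   X    |
| 2014-11-13T19/1H |   .    |   .    |   .    |   .    |
| 2014-11-13T20/1H |   X    |   .    |   X    |   X    |
| 2014-11-13T21/1H |   .    |   .    |   .    |   .    |
| 2014-11-13T22/1H |   .    |   .    |   .    |   X    |
[...]
+------------------+--------+--------+--------+--------+
"""


def parse_status_table(text):
    """Return {date: {source: status_char}} for every data row."""
    header = None
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skips borders, "[...]" elisions, and legend lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first pipe-delimited row is the header
            continue
        rows[cells[0]] = dict(zip(header[1:], cells[1:]))
    return rows


def not_ok(rows):
    """Return {date: [sources]} restricted to partitions with status 'X'."""
    flagged = {}
    for date, statuses in rows.items():
        bad = [source for source, status in statuses.items() if status == "X"]
        if bad:
            flagged[date] = bad
    return flagged


if __name__ == "__main__":
    for date, sources in not_ok(parse_status_table(SAMPLE)).items():
        print(date, ", ".join(sources))
```

Partitions marked `M` (manually marked ok) are deliberately not reported, since the legend treats them as ok.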
The three jobs for 2014-11-13T20/1H were in SUSPENDED state. Some internal workflows got stuck with exceptions about RM issues [1]. This nicely matches yesterday's restart of the resourcemanager after upgrading the JVMs.

Resuming the 3 jobs did not work, so I killed and restarted them.

[1] JA009 JA009: org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1409078537822_77051' doesn't exist in RM.
The restarted jobs have now succeeded, and the partitions got marked ok.