Details
-
Type: Bug
-
Status: Resolved
-
Priority: Major
-
Resolution: Fixed
-
Affects Version/s: CDH4.2.1
-
Fix Version/s: None
-
Component/s: MapReduce
-
Labels:None
-
Environment:JobtrackerHA, HDFS HA, CentOS 6.3
Description
We encountered a strange JT state where a job appears to have mappers running on the toplevel JT page, as well as in job details, but when looking at task details, the task is stuck at 0%, there's an error condition 'Error launching task', all while when consulting logs TT-side, the task has run.
It looks like a race condition on task launch: the JT timesout, but the task as actually launched on the TT. When the JT asks for task killing, which it does, the TT does not answer correctly.
Traces support this; on the JT:
0 6c-3b-e5-a8-11-fc:/home/jb.note# grep 'attempt_201305240701_25131_m_000247' /var/log/rsyslog/6c-3b-e5-be-d5-9c.hpc.criteo.prod/jobtracker.log.1 22:12:23,096 INFO org.apache.hadoop.mapred.JobTracker: attempt_201305240701_25131_m_000247_0 is 607352 ms debug. 22:12:23,096 INFO org.apache.hadoop.mapred.JobTracker: Launching task attempt_201305240701_25131_m_000247_0 timed out. 22:12:23,097 ERROR org.apache.hadoop.mapred.TaskStatus: Trying to set illegal startTime for task : attempt_201305240701_25131_m_000247_0.Stack trace is : java.lang.Exception at org.apache.hadoop.mapred.TaskStatus.setStartTime(TaskStatus.java:246) at org.apache.hadoop.mapred.JobInProgress.failedTask(JobInProgress.java:3118) at org.apache.hadoop.mapred.JobTracker$ExpireLaunchingTasks.run(JobTracker.java:415) at java.lang.Thread.run(Thread.java:662) 22:12:23,097 ERROR org.apache.hadoop.mapred.TaskStatus: Trying to set finish time for task attempt_201305240701_25131_m_000247_0 when no start time is set, stackTrace is : java.lang.Exception at org.apache.hadoop.mapred.TaskStatus.setFinishTime(TaskStatus.java:185) at org.apache.hadoop.mapred.JobInProgress.failedTask(JobInProgress.java:3119) at org.apache.hadoop.mapred.JobTracker$ExpireLaunchingTasks.run(JobTracker.java:415) at java.lang.Thread.run(Thread.java:662) 22:12:23,097 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201305240701_25131_m_000247_0: Error launching task 22:12:23,097 ERROR org.apache.hadoop.mapred.TaskStatus: Trying to set finish time for task attempt_201305240701_25131_m_000247_0 when no start time is set, stackTrace is : java.lang.Exception at org.apache.hadoop.mapred.TaskStatus.setFinishTime(TaskStatus.java:185) at org.apache.hadoop.mapred.TaskInProgress.incompleteSubTask(TaskInProgress.java:670) at org.apache.hadoop.mapred.JobInProgress.failedTask(JobInProgress.java:2893) at org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:1178) at org.apache.hadoop.mapred.JobInProgress.failedTask(JobInProgress.java:3121) at org.apache.hadoop.mapred.JobTracker$ExpireLaunchingTasks.run(JobTracker.java:415) at java.lang.Thread.run(Thread.java:662)
while on the tasktracker:
22:12:31,720 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201305240701_25131_m_000247_0 which needs 1 slots 22:12:31,720 INFO org.apache.hadoop.mapred.TaskTracker: TaskLauncher : Waiting for 1 to launch attempt_201305240701_25131_m_000247_0, currently we have 0 fre e slots 22:13:00,233 INFO org.apache.hadoop.mapred.TaskController: Writing commands to /hdfs/9/mapred/ttprivate/taskTracker/k.filali/jobcache/job_201305240701_25131/ attempt_201305240701_25131_m_000247_0/taskjvm.sh 22:13:01,689 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201305240701_25131_m_1328936242 given task: attempt_201305240701_25131_m_000247_0 22:13:11,998 INFO org.apache.hadoop.mapred.TaskMemoryManagerThread: Memory usage of ProcessTree 24717 for task-id attempt_201305240701_25131_m_000247_0 : Vir tual 2638241792 bytes, limit : -1048576 bytes; Physical 232701952 bytes, limit 8589934592 bytes 22:13:17,618 INFO org.apache.hadoop.mapred.TaskMemoryManagerThread: Memory usage of ProcessTree 24717 for task-id attempt_201305240701_25131_m_000247_0 : Vir tual 2638241792 bytes, limit : -1048576 bytes; Physical 258260992 bytes, limit 8589934592 bytes ... 22:20:40,543 INFO org.apache.hadoop.mapred.TaskMemoryManagerThread: Memory usage of ProcessTree 24717 for task-id attempt_201305240701_25131_m_000247_0 : Virtual 2638241792 bytes, limit : -1048576 bytes; Physical 521768960 bytes, limit 8589934592 bytes 22:20:47,124 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: attempt_201305240701_25131_m_000247_0 22:20:51,701 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: attempt_201305240701_25131_m_000247_0 22:21:23,863 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: attempt_201305240701_25131_m_000247_0 22:21:25,439 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: attempt_201305240701_25131_m_000247_0