Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: CDH4.2.1
- Fix Version/s: None
- Component/s: MapReduce
- Labels:
- Environment: JobTracker HA, HDFS HA, CDH 4.2.1, CentOS 6.3
Description
We've seen an issue that looks like a race condition on job completion (environment context above).
The JobTracker (HA version) gets stuck, with the following message repeating ad nauseam in the logs:
07:00:43,582 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/dataloader-user-enginejoins-glup-content_displayUS/_logs/history/job_201305150720_11628_1369333820422_enginejoins_Import+content_display+US+until+Thu+May+23+17%3A00%3A0 retrying...
To my untrained eye, it looks like the following race condition happens (see the sketch after this list):
- The job completes, and job completion is reported to the client.
- History is written to the job's output dir (we're using the default here, and we like it).
- In the meantime, some completion handler on the client side runs and deletes the job's output directory.
- History cannot be written correctly.
- The JT retries forever to write history and never gets out of that loop.
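A minimal sketch of the client-side sequence that seems to trigger the race, assuming a standard MR1 client; the class name, output path, and cleanup step below are illustrative, not our actual code:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class CompletionHandlerRace {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(CompletionHandlerRace.class);
            Path outputDir = new Path("/tmp/dataloader-output");  // illustrative path

            // ... job setup omitted ...
            RunningJob job = new JobClient(conf).submitJob(conf);

            // Returns as soon as the JT reports the job done, but the JT may
            // still be writing job history into <outputDir>/_logs/history.
            job.waitForCompletion();

            // Client-side cleanup deletes the output dir right away; if the
            // JT's history write has not finished, DFSClient keeps logging
            // "Could not complete ... retrying" and the JT appears stuck.
            FileSystem fs = outputDir.getFileSystem(conf);
            fs.delete(outputDir, true);
        }
    }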
The JT is not resilient to failures when writing history to the job output dir, which is not a privileged directory and where problems (quotas, rm -R) can happen at any time.
Looks like a great way to DoS the service.
It seems to me that writing history to the job's output dir should be best-effort, and it should definitely not stop the JobTracker service.
Note: we've worked around the problem by defaulting 'hadoop.job.history.user.location' to 'none', which has so far successfully prevented the issue from cropping up again.
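For reference, a sketch of what the workaround looks like when applied per job through the MR1 API (we set it as a cluster-wide default; the class name here is purely illustrative):

    import org.apache.hadoop.mapred.JobConf;

    public class DisableUserHistory {
        public static void main(String[] args) {
            JobConf conf = new JobConf();

            // Setting hadoop.job.history.user.location to "none" tells the JT
            // not to write a per-job history copy into the job's output
            // directory, which sidesteps the race described above.
            conf.set("hadoop.job.history.user.location", "none");

            System.out.println(conf.get("hadoop.job.history.user.location"));
        }
    }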
Note2: unfortunately, we can't easily reproduce this now, as the cluster has been switched to production.