Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: CDH4.2.1
- Fix Version/s: None
- Component/s: MapReduce
- Labels:
- Environment: JobTracker HA, HDFS HA, CDH 4.2.1, CentOS 6.3
Description
We've seen an issue that looks like a race condition on job completion (environment context above).
The JobTracker (HA version) gets stuck, with the following message repeating ad nauseam in the logs:
07:00:43,582 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/dataloader-user-enginejoins-glup-content_displayUS/_logs/history/job_201305150720_11628_1369333820422_enginejoins_Import+content_display+US+until+Thu+May+23+17%3A00%3A0 retrying...
To my untrained eye, it looks like the following race condition happens (see the sketch after this list):
- The job completes, and job completion is reported to the client.
- History is written to the job's output dir (we're using the default here, and we like it).
- In the meantime, some completion handler on the client side runs and deletes the job's output directory.
- History cannot be written correctly.
- The JT retries forever to write history and never gets out of that loop.
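A minimal sketch of the client-side sequence that seems to trigger the race, assuming a standard MR1 client; the class name, output path, and cleanup step below are illustrative, not our actual code:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class CompletionHandlerRace {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(CompletionHandlerRace.class);
            Path outputDir = new Path("/tmp/dataloader-output");  // illustrative path

            // ... job setup omitted ...
            RunningJob job = new JobClient(conf).submitJob(conf);

            // Returns as soon as the JT reports the job done, but the JT may
            // still be writing job history into <outputDir>/_logs/history.
            job.waitForCompletion();

            // Client-side cleanup deletes the output dir right away; if the
            // JT's history write has not finished, DFSClient keeps logging
            // "Could not complete ... retrying" and the JT appears stuck.
            FileSystem fs = outputDir.getFileSystem(conf);
            fs.delete(outputDir, true);
        }
    }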
The JT is not resilient to failures when writing history to the job output dir, which is not a privileged directory and where problems (quotas, rm -R) can happen at any time.
Looks like a great way to DoS the service.
It seems to me that writing history to the job's output dir should be best-effort, and it should definitely not stop the JobTracker service.
Note: we've worked around the problem by defaulting 'hadoop.job.history.user.location' to 'none', which has so far successfully prevented the issue from cropping up again.
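For reference, a sketch of what the workaround looks like when applied per job through the MR1 API (we set it as a cluster-wide default; the class name here is purely illustrative):

    import org.apache.hadoop.mapred.JobConf;

    public class DisableUserHistory {
        public static void main(String[] args) {
            JobConf conf = new JobConf();

            // Setting hadoop.job.history.user.location to "none" tells the JT
            // not to write a per-job history copy into the job's output
            // directory, which sidesteps the race described above.
            conf.set("hadoop.job.history.user.location", "none");

            System.out.println(conf.get("hadoop.job.history.user.location"));
        }
    }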
Note2: unfortunately, we can't easily reproduce this now, as the cluster has been switched to production.