CDH (READ-ONLY)
DISTRO-484

JobtrackerHA history writing is not resilient to HDFS datanodes going down


    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: CDH4.2.1
    • Fix Version/s: None
    • Component/s: MapReduce
    • Labels:
    • Environment:
      JobtrackerHA, HDFS HA, CDH 4.2.1, CentOS 6.3


      We've seen an issue that looks like a race condition on job completion (environment context above).
      The jobtracker (HA version) is found stuck, with the following repeating ad nauseam in the logs:

      07:00:43,582 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/dataloader-user-enginejoins-glup-content_displayUS/_logs/history/job_201305150720_11628_1369333820422_enginejoins_Import+content_display+US+until+Thu+May+23+17%3A00%3A0 retrying...

      To my untrained eye, it looks like the following race condition occurs:

      • The job completes, job completion is reported to the client
      • history is written to the job's output dir (we're using the default here, and we like it)
      • in the meantime, some completion handler client-side runs and deletes the job's output directory
      • history cannot be written correctly
      • JT retries forever, failing each time to write history

      The JT is not resilient to failures when writing history to the job output dir, which is not a privileged directory, and where problems (quotas, rm -R) can happen at any time.
      This looks like a great way to DoS the service.

      It seems to me that history writing to the job's output dir should be best-effort, and should definitely not stop the jobtracker service.
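
To illustrate what "best-effort" could mean here, the following is a minimal sketch only, not the JobTracker's actual code. `BestEffortHistory`, `HistoryWrite`, and `tryWrite` are hypothetical names invented for this example; the idea is simply to bound the number of attempts and report failure to the caller instead of retrying forever.

```java
import java.io.IOException;

public final class BestEffortHistory {

    /** Hypothetical stand-in for "write the history file to the job's output dir". */
    @FunctionalInterface
    public interface HistoryWrite {
        void run() throws IOException;
    }

    private BestEffortHistory() {}

    /**
     * Runs the write at most maxAttempts times, sleeping backoffMillis between
     * attempts. Returns true on success and false once attempts are exhausted,
     * so the caller can log the loss and carry on serving other jobs.
     */
    public static boolean tryWrite(HistoryWrite write, int maxAttempts, long backoffMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                write.run();
                return true;
            } catch (IOException e) {
                System.err.println("history write attempt " + attempt + "/" + maxAttempts
                        + " failed: " + e.getMessage());
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException ie) {
                    // Preserve the interrupt and give up rather than block shutdown.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulates an output directory that was deleted before history could be written.
        boolean ok = tryWrite(() -> {
            throw new IOException("Could not complete (output dir deleted)");
        }, 3, 100L);
        System.out.println("history written: " + ok);
    }
}
```

With a bounded loop like this, a deleted or over-quota output directory costs one lost history file rather than a stuck service.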

      Note: we've worked around the problem by defaulting 'hadoop.job.history.user.location' to 'none', which has so far successfully prevented the issue from cropping up again.
      Note2: we unfortunately can't easily reproduce this now, as the cluster has since been switched to production.
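
For reference, the workaround from the first note might look like the fragment below. This is a sketch that assumes the default is set in the client-side mapred-site.xml; the report does not say where the value was actually configured. The property name and the 'none' value are taken from the note above.

```xml
<!-- Assumed location: mapred-site.xml on the submitting clients -->
<property>
  <name>hadoop.job.history.user.location</name>
  <!-- 'none' disables writing job history to the job's output directory -->
  <value>none</value>
</property>
```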




            • Assignee:
              jbnote Jean-Baptiste Note
            • Votes:
              1
            • Watchers:
              1


              • Created: