CDH (READ-ONLY) / DISTRO-575

[Job history server] some serving thread can OOM if history for a large enough MR job is requested, leaving JHS in a broken state

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: CDH5b2
    • Fix Version/s: None
    • Component/s: MapReduce
    • Labels:
    • Environment:
      CDH5 beta 2
      YARN

      Description

      We have been testing CDH5 beta 2 YARN on a decently sized cluster (97 beefy nodes) and ran into an issue with the Job History Server (JHS).

      If history (or counters, or any other job data) is requested for a completed MR job and the job was large enough, a thread in the JHS can hit an OutOfMemoryError and leave the server in a broken state. The server process stays up (so simple process-monitoring tools cannot detect the failure and auto-restart it), but it does not respond to any requests, whether from the web UI or from the command line (with 'mapred job -status job_id').
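
Because the process stays alive, a check that only looks for the PID will not notice the hang. As a rough sketch of a workaround on the monitoring side (not a fix for the OOM itself), a probe can issue a real request with a timeout and treat a missing answer as failure. The class name `JhsProbe`, the 5-second timeout, and the default URL are assumptions for illustration; the URL assumes the JHS web UI listens on its usual port 19888 and exposes the History Server REST endpoint `/ws/v1/history/info`, which should be verified against the actual deployment.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of Hadoop: probes a URL for responsiveness.
public class JhsProbe {

    /** Returns true only if the URL answers with HTTP 200 within timeoutMs. */
    public static boolean isResponsive(String url, int timeoutMs) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(url).openConnection();
            // Bound both connection setup and the wait for a response,
            // so a hung server shows up as a timeout instead of blocking forever.
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            conn.setRequestMethod("GET");
            return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
        } catch (IOException e) {
            // Connection refused, timeout, or malformed URL: treat as unresponsive.
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        // Assumed default; pass the real JHS address as the first argument.
        String url = args.length > 0
                ? args[0]
                : "http://localhost:19888/ws/v1/history/info";
        boolean ok = isResponsive(url, 5000);
        System.out.println(ok ? "JHS responsive" : "JHS unresponsive");
        // A non-zero exit code lets a supervisor script decide to restart the server.
        System.exit(ok ? 0 : 1);
    }
}
```

A monitoring tool could run this periodically and restart the JHS when the exit code is non-zero; whether a restart is safe in a given deployment is left to the operator.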

      We were able to hit this with the default JVM heap size and a pretty large MR job (over 100k mappers, a few thousand reducers). We were also able to reproduce it with a much smaller MR job after lowering the JVM heap of the Job History Server to 128 MB.

            People

            • Assignee: Unassigned
            • Reporter: Ilya Maykov (ilyam)
            • Votes: 0
            • Watchers: 2
