Description
We have been testing CDH5 beta 2 YARN on a decently sized cluster (97 beefy nodes) and ran into an issue with the Job History Server.
Basically, if history (or counters, or anything else) is requested for a completed MR job and the job was large enough, a thread in the JHS can hit an OutOfMemoryError and leave the server in a broken state. The server process stays up (so simple process-monitoring tools cannot auto-restart it), but it does not respond to any requests, either from the web UI or from the command line (with 'mapred job -status job_id').
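Because the process stays alive while hung, a monitor would have to probe the server over HTTP rather than check the PID. The following is a minimal sketch of such a probe, not something we ran on the cluster. It assumes the JHS web endpoint is reachable on its default port 19888 and that the `/ws/v1/history/info` REST path is available in this build; the function name `jhs_is_responsive` and the 10-second timeout are made up for illustration.

```python
import http.client
import urllib.request


def jhs_is_responsive(base_url, timeout_s=10.0):
    """Return True if the Job History Server answers an HTTP GET in time.

    base_url is assumed to look like "http://jhs-host:19888" (hypothetical host).
    A hung JHS typically accepts no requests, so a timeout, a refused
    connection, or a non-200 status are all treated as "not responsive".
    """
    url = base_url.rstrip("/") + "/ws/v1/history/info"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, socket timeouts and connection resets
        # are all OSError subclasses in Python 3.
        return False
```

A monitoring script could call `jhs_is_responsive("http://jhs-host:19888")` periodically and restart the service when it returns False several times in a row. Requiring several consecutive failures is a design choice to avoid restarting on a single slow response.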
We were able to hit this with the default JVM heap size and a pretty large MR job (over 100k mappers, a few thousand reducers). We were also able to reproduce it with a much smaller MR job when we lowered the JVM heap of the Job History Server to 128MB.