[DISTRO-575] [Job history server] some serving thread can OOM if history for a large enough MR job is requested, leaving JHS in a broken state - Cloudera Open Source

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: CDH5b2
Fix Version/s: None
Component/s: MapReduce
Labels:
- hadoop
- yarn
Environment:
CDH5 beta 2
YARN

Description

While testing YARN on CDH5 beta 2 on a decently sized cluster (97 beefy nodes), we ran into an issue with the Job History Server (JHS).

If the history (or counters, or any other per-job data) of a completed MR job is requested and the job was large enough, a serving thread in the JHS can run out of memory (OOM) and leave the server in a broken state. The server process stays up, so simple process-monitoring tools will not restart it, but it no longer responds to any requests, whether from the web UI or from the command line ('mapred job -status job_id').
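
Because the process stays alive in this state, a monitor has to probe the service itself rather than check for the PID. Below is a minimal sketch of such a probe in Python. It assumes the JHS web service is reachable at the stock address (port 19888, REST path /ws/v1/history/info); the function name jhs_is_responsive and the 5-second default timeout are illustrative choices, not part of Hadoop or CDH.

```python
import http.client
import urllib.request

# Assumed default JHS web address; adjust to the cluster's actual
# mapreduce.jobhistory.webapp.address setting.
DEFAULT_JHS_INFO_URL = "http://localhost:19888/ws/v1/history/info"


def jhs_is_responsive(url=DEFAULT_JHS_INFO_URL, timeout=5.0):
    """Return True only if the server answers the request with HTTP 200.

    A refused connection, a non-200 status, a malformed reply, or no
    reply within `timeout` seconds all count as "not responsive". The
    last case is the one that matters here: the process is up and may
    still accept connections, but never answers.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

A cron job or supervisor could call this periodically and restart the JHS after several consecutive failures. This only works around the hang; it does not address the underlying memory problem.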

We hit this with the default JVM heap size and a fairly large MR job (over 100k mappers and a few thousand reducers). We also reproduced it with a much smaller MR job after lowering the Job History Server's JVM heap to 128 MB.

People

Assignee:

Unassigned

Reporter:

Ilya Maykov

Votes:

0

Watchers:

2

Dates

Created:

20/Mar/14 10:30 PM

Updated:

20/Mar/14 10:34 PM