Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: YARN
    • Labels: None

Description

We are testing a CDH 5.11.1 upgrade and are seeing what looks like a leak in the YARN RM: reserved vcores climb without bound (into the millions) until the RM eventually dies.
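
For anyone trying to confirm the same symptom, here is a minimal monitoring sketch; "rm-host" and the default web port 8088 are placeholders for your ResourceManager address. It polls the RM Cluster Metrics REST endpoint and logs reservedVirtualCores so the runaway growth is visible over time:

    # Minimal sketch: poll the RM Cluster Metrics REST API and log reserved resources.
    # "rm-host" and port 8088 are placeholders; adjust to your ResourceManager address.
    import json
    import time
    import urllib.request

    RM_METRICS_URL = "http://rm-host:8088/ws/v1/cluster/metrics"

    while True:
        with urllib.request.urlopen(RM_METRICS_URL) as resp:
            metrics = json.load(resp)["clusterMetrics"]
        print(time.strftime("%Y-%m-%d %H:%M:%S"),
              "reservedVirtualCores =", metrics["reservedVirtualCores"],
              "reservedMB =", metrics["reservedMB"])
        time.sleep(60)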

I haven't been able to create a reproducible test case, but looking through the release notes I began to suspect YARN-6432, which was included in 5.11.1. I have downgraded to CDH 5.11.0 to see whether the problem disappears (it usually takes a day or two to show up on this 4-node, ~100-vcore cluster).

Meanwhile, I found https://issues.apache.org/jira/browse/YARN-6895, which was fixed a couple of weeks ago and sounds like exactly what we are observing: "If the container released is smaller than the preemption request, a node reservation is created that is never deleted."

This sounds fairly critical, so I'm surprised it wasn't noticed before 5.11.2 was released. To reproduce, you would likely need a fairly small cluster with the fair scheduler and preemption enabled, and then throw large jobs of various sizes at it.
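
For reference, a minimal configuration sketch of what I mean by "fair scheduler and preemption enabled"; the property names are the standard Hadoop ones, but the queue layout and timeout/threshold values are only illustrative, not our exact settings:

    <!-- yarn-site.xml: use the Fair Scheduler and turn preemption on -->
    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>
    <property>
      <name>yarn.scheduler.fair.preemption</name>
      <value>true</value>
    </property>

    <!-- fair-scheduler.xml (allocation file): example preemption timeout/threshold -->
    <allocations>
      <defaultFairSharePreemptionTimeout>30</defaultFairSharePreemptionTimeout>
      <defaultFairSharePreemptionThreshold>0.5</defaultFairSharePreemptionThreshold>
      <queue name="default">
        <weight>1.0</weight>
      </queue>
    </allocations>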


People

    • Assignee: Unassigned
    • Reporter: Vitaliy Fuks (vitaliy.fuks)
    • Votes: 0
    • Watchers: 1
