Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: YARN
    • Labels: None

Description

We are testing a CDH 5.11.1 upgrade and are seeing what looks like a leak in the YARN RM: reserved vcores climb without bound (into the millions) until the RM eventually dies.
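
For anyone trying to confirm the same symptom, here is a minimal monitoring sketch; "rm-host" and the default web port 8088 are placeholders for your ResourceManager address. It polls the RM Cluster Metrics REST endpoint and logs reservedVirtualCores so the runaway growth is visible over time:

    # Minimal sketch: poll the RM Cluster Metrics REST API and log reserved resources.
    # "rm-host" and port 8088 are placeholders; adjust to your ResourceManager address.
    import json
    import time
    import urllib.request

    RM_METRICS_URL = "http://rm-host:8088/ws/v1/cluster/metrics"

    while True:
        with urllib.request.urlopen(RM_METRICS_URL) as resp:
            metrics = json.load(resp)["clusterMetrics"]
        print(time.strftime("%Y-%m-%d %H:%M:%S"),
              "reservedVirtualCores =", metrics["reservedVirtualCores"],
              "reservedMB =", metrics["reservedMB"])
        time.sleep(60)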

I haven't been able to create a reproducible test case, but looking through the release notes I began to suspect YARN-6432, which was included in 5.11.1. I have downgraded to CDH 5.11.0 to see whether the problem disappears (it usually takes a day or two to show up on this 4-node, ~100-vcore cluster).

Meanwhile, I found https://issues.apache.org/jira/browse/YARN-6895, which was fixed a couple of weeks ago and sounds like exactly what we are observing: "If the container released is smaller than the preemption request, a node reservation is created that is never deleted."

This sounds fairly critical, so I'm surprised it wasn't noticed before 5.11.2 was released. To reproduce, you would likely need a fairly small cluster with the fair scheduler and preemption enabled, and then throw large jobs of various sizes at it.
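
For reference, a minimal configuration sketch of what I mean by "fair scheduler and preemption enabled"; the property names are the standard Hadoop ones, but the queue layout and timeout/threshold values are only illustrative, not our exact settings:

    <!-- yarn-site.xml: use the Fair Scheduler and turn preemption on -->
    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>
    <property>
      <name>yarn.scheduler.fair.preemption</name>
      <value>true</value>
    </property>

    <!-- fair-scheduler.xml (allocation file): example preemption timeout/threshold -->
    <allocations>
      <defaultFairSharePreemptionTimeout>30</defaultFairSharePreemptionTimeout>
      <defaultFairSharePreemptionThreshold>0.5</defaultFairSharePreemptionThreshold>
      <queue name="default">
        <weight>1.0</weight>
      </queue>
    </allocations>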


People

    • Assignee: Unassigned
    • Reporter: Vitaliy Fuks (vitaliy.fuks)
    • Votes: 0
    • Watchers: 1
