DISTRO-843

Premounted cgroups don't work with NodeManager and Ubuntu 14.04

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Cloudera Manager
    • Labels: None
    • Environment:
      OS: Ubuntu 14.04
      Linux kernel: 3.13.0-108-generic #155-Ubuntu SMP Wed Jan 11 16:58:52 UTC 2017 x86_64 GNU/Linux
      Cloudera Manager: 5.8.4-1
      Cloudera agent: 5.8.4-1
      CDH parcel: 5.8.2

      Description

      After cgroup-lite is installed (docker.io and libvirt-bin, for example, depend on it), Ubuntu 14.04 automatically mounts cgroups under /sys/fs/cgroup/ on startup, like this:

      cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,relatime,cpuset)
      cgroup on /sys/fs/cgroup/cpu type cgroup (rw,relatime,cpu)
      cgroup on /sys/fs/cgroup/cpuacct type cgroup (rw,relatime,cpuacct)
      cgroup on /sys/fs/cgroup/memory type cgroup (rw,relatime,memory)
      cgroup on /sys/fs/cgroup/devices type cgroup (rw,relatime,devices)
      cgroup on /sys/fs/cgroup/freezer type cgroup (rw,relatime,freezer)
      cgroup on /sys/fs/cgroup/blkio type cgroup (rw,relatime,blkio)
      cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,relatime,perf_event)
      cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,relatime,hugetlb)
      

      cloudera-scm-agent successfully detects these mounts and reports them in its log:

      [03/Feb/2017 16:45:15 +0000] 1903 MainThread agent        INFO     Agent starting as pid 1903 user root(0) group root(0).
      [21/Feb/2017 10:09:05 +0000] 14054 MainThread agent        INFO     At least one outstanding cgroup; retaining cgroup mounts
      [21/Feb/2017 10:09:08 +0000] 20837 MainThread agent        INFO     Re-using pre-existing directory: /run/cloudera-scm-agent/cgroups
      [21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups      INFO     Found existing subsystem cpu at /sys/fs/cgroup/cpu
      [21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups      INFO     Found existing subsystem cpuacct at /sys/fs/cgroup/cpuacct
      [21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups      INFO     Found existing subsystem memory at /sys/fs/cgroup/memory
      [21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups      INFO     Found existing subsystem blkio at /sys/fs/cgroup/blkio
      [21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups      INFO     Found cgroups subsystem: cpu
      [21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups      INFO     cgroup pseudofile /sys/fs/cgroup/cpu/cpu.rt_runtime_us does not exist, skipping
      [21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups      INFO     Found cgroups subsystem: cpuacct
      [21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups      INFO     Found cgroups subsystem: memory
      [21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups      INFO     Found cgroups subsystem: blkio
      [21/Feb/2017 10:09:08 +0000] 20837 MainThread agent        INFO     Found cgroups capabilities: {'has_memory': True, 'default_memory_limit_in_bytes': -1, 'default_memory_soft_limit_in_bytes': -1, 'writable_cgroup_dot_procs': True, 'default_cpu_rt_runtime_us': -1, 'has_cpu': True, 'default_blkio_weight': 1000, 'default_cpu_shares': 1024, 'has_cpuacct': True, 'has_blkio': True}
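
      That detection amounts to scanning /proc/mounts for cgroup entries. For cross-checking, here is a minimal sketch of such detection (an illustration only, not the agent's actual code):

      # Illustrative sketch (not the agent's code): find mounted cgroup v1
      # subsystems by parsing /proc/mounts.
      SUBSYSTEMS = frozenset(['cpu', 'cpuacct', 'memory', 'blkio'])

      def find_cgroup_mounts():
          mounts = {}
          with open('/proc/mounts') as f:
              for line in f:
                  device, mountpoint, fstype, options = line.split()[:4]
                  if fstype != 'cgroup':
                      continue
                  for option in options.split(','):
                      if option in SUBSYSTEMS:
                          mounts[option] = mountpoint
          return mounts

      for subsystem, path in sorted(find_cgroup_mounts().items()):
          print('Found existing subsystem %s at %s' % (subsystem, path))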
       

      Ubuntu's default policies automatically place processes in a per-user hierarchy under /user/0.user/:

       1862 ?        Ss     0:27 /usr/lib/cmf/agent/build/env/bin/python /usr/lib/cmf/agent/build/env/bin/supervisord
       1872 ?        S      0:00  \_ python2.7 /usr/lib/cmf/agent/build/env/bin/cmf-listener -l /var/log/cloudera-scm-agent/cmf_listener.log /run/cloudera-scm-agent/events
       2676 ?        Sl     2:46  \_ /usr/lib/jvm/java-8-oracle//bin/java -Dproc_datanode -Xmx1000m -Dhdfs.audit.logger=INFO,RFAAUDIT -Dsecurity.audit.logger=INFO,RFAS -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop-hdfs -Dhadoop
      22245 ?        Sl     0:00  \_ python2.7 /usr/lib/cmf/agent/build/env/bin/flood
      21957 ?        Ssl    0:01 python2.7 /usr/lib/cmf/agent/build/env/bin/cmf-agent --package_dir /usr/lib/cmf/service --agent_dir /var/run/cloudera-scm-agent --lib_dir /var/lib/cloudera-scm-agent --logfile /var/log/cloudera-scm-agent/cloudera-r
      
      #cat /proc/21957/cgroup
      11:name=systemd:/user/0.user/5.session
      10:hugetlb:/user/0.user/5.session
      9:perf_event:/user/0.user/5.session
      8:blkio:/user/0.user/5.session
      7:freezer:/user/0.user/5.session
      6:devices:/user/0.user/5.session
      5:memory:/user/0.user/5.session
      4:cpuacct:/user/0.user/5.session
      3:cpu:/user/0.user/5.session
      2:cpuset:/
      
      #cat /proc/1862/cgroup
      11:name=systemd:/user/0.user/c1.session
      10:hugetlb:/user/0.user/c1.session
      9:perf_event:/user/0.user/c1.session
      8:blkio:/user/0.user/c1.session
      7:freezer:/user/0.user/c1.session
      6:devices:/user/0.user/c1.session
      5:memory:/cloudera
      4:cpuacct:/user/0.user/c1.session
      3:cpu:/cloudera
      2:cpuset:/
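
      Each line of /proc/<pid>/cgroup has the form hierarchy-id:controller-list:cgroup-path. To cross-check where a process landed, a minimal parser (illustrative only) could look like this:

      def parse_proc_cgroup(pid='self'):
          # Map each cgroup controller to the cgroup path of the given
          # process, parsing lines of the form hierarchy-id:controllers:path.
          paths = {}
          with open('/proc/%s/cgroup' % pid) as f:
              for line in f:
                  _, controllers, path = line.rstrip('\n').split(':', 2)
                  for controller in controllers.split(','):
                      paths[controller] = path
          return paths

      # For the supervisord above: parse_proc_cgroup(1862)['cpu'] -> '/cloudera'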
      

      The corresponding cpu folder structure looks like this after the HDFS DataNode has started:

      # ll /sys/fs/cgroup/cpu/user/0.user/c1.session/
      total 0
      drwxr-xr-x 3 root root 0 Feb 21 09:21 ./
      drwxr-xr-x 5 root root 0 Feb 20 16:59 ../
      drwxr-xr-x 2 root root 0 Feb 20 15:10 757-hdfs-DATANODE/
      -rw-r--r-- 1 root root 0 Feb 20 15:10 cgroup.clone_children
      --w--w--w- 1 root root 0 Feb 20 15:10 cgroup.event_control
      -rw-r--r-- 1 root root 0 Feb 20 15:10 cgroup.procs
      -rw-r--r-- 1 root root 0 Feb 20 15:10 cpu.cfs_period_us
      -rw-r--r-- 1 root root 0 Feb 20 15:10 cpu.cfs_quota_us
      -rw-r--r-- 1 root root 0 Feb 20 15:10 cpu.shares
      -r--r--r-- 1 root root 0 Feb 20 15:10 cpu.stat
      -rw-r--r-- 1 root root 0 Feb 20 15:10 notify_on_release
      -rw-r--r-- 1 root root 0 Feb 20 15:10 tasks
      

      Next, when I try to start the YARN NodeManager from Cloudera Manager, it fails during initialization:

      Feb 21, 9:21:49.447 AM	INFO	org.apache.hadoop.service.AbstractService	
      Service NodeManager failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
      org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
      	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:221)
      	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
      	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:514)
      	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:561)
      Caused by: java.io.IOException: Not able to enforce cpu weights; cannot write to cgroup at: /sys/fs/cgroup/cpu
      	at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.initializeControllerPaths(CgroupsLCEResourcesHandler.java:502)
      	at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.init(CgroupsLCEResourcesHandler.java:154)
      	at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.init(CgroupsLCEResourcesHandler.java:137)
      	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:215)
      	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:219)
      	... 3 more
      Feb 21, 9:21:49.486 AM	DEBUG	org.apache.hadoop.service.AbstractService	
      Service: NodeManager entered state STOPPED
      

      In Cloudera Manager, yarn.nodemanager.linux-container-executor.cgroups.hierarchy is set to '/hadoop-yarn'.
      I created the /sys/fs/cgroup/cpu/hadoop-yarn cgroup manually and gave it 777 permissions, but got the same error again.
      I straced the NodeManager java process; its last relevant system call was this:

       [pid 11431] access("/sys/fs/cgroup/cpu/u/s/e/r/0/./u/s/e/r/c/4/./s/e/s/s/i/o/n/hadoop-yarn", W_OK) = -1 ENOENT (No such file or directory)
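
      The access(2) probe with W_OK appears to be how the NodeManager checks whether it can write to the cgroup hierarchy; in Python, the same failing check can be reproduced like this (path copied verbatim from the strace output):

      import os
      # Equivalent of the access(2) call in the strace above: returns False
      # because the mangled path does not exist (ENOENT).
      path = '/sys/fs/cgroup/cpu/u/s/e/r/0/./u/s/e/r/c/4/./s/e/s/s/i/o/n/hadoop-yarn'
      print(os.access(path, os.W_OK))  # -> False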
       

      This path looks mangled; something is clearly wrong with the placeholder replacement.
      Here is the stderr of yarn/yarn.sh ["nodemanager"]:

      + echo CONF_DIR=/run/cloudera-scm-agent/process/797-yarn-NODEMANAGER
      + echo CMF_CONF_DIR=/etc/cloudera-scm-agent
      + EXCLUDE_CMF_FILES=('cloudera-config.sh' 'httpfs.sh' 'hue.sh' 'impala.sh' 'sqoop.sh' 'supervisor.conf' '*.log' '*.keytab' '*jceks')
      ++ printf '! -name %s ' cloudera-config.sh httpfs.sh hue.sh impala.sh sqoop.sh supervisor.conf '*.log' yarn.keytab '*jceks'
      + find /run/cloudera-scm-agent/process/797-yarn-NODEMANAGER -type f '!' -path '/run/cloudera-scm-agent/process/797-yarn-NODEMANAGER/logs/*' '!' -name cloudera-config.sh '!' -name httpfs.sh '!' -name hue.sh '!' -name impala.sh '!' -name sqoop.sh '!' -name supervisor.conf '!' -name '*.log' '!' -name yarn.keytab '!' -name '*jceks' -exec perl -pi -e 's#{{CMF_CONF_DIR}}#/run/cloudera-scm-agent/process/797-yarn-NODEMANAGER#g' '{}' ';'
      Can't open /run/cloudera-scm-agent/process/797-yarn-NODEMANAGER/container-executor.cfg: Permission denied.
      + perl -pi -e 's#{{CGROUP_GROUP_CPU}}#u/s/e/r///0/./u/s/e/r///4/./s/e/s/s/i/o/n#g' /run/cloudera-scm-agent/process/797-yarn-NODEMANAGER/yarn-site.xml
      

      I checked further and found that the bug is in agent.py (/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.8.4-py2.7.egg/cmf/agent.py), in the method update_process_environment_for_cgroups, at line 3318:

       group = '/'.join(group)
      

      which does the following to the group string:

      [21/Feb/2017 11:23:41 +0000] 33551 MainThread agent        INFO     Set ENV from agent cgroups before '/'.join(group) CPU user/0.user/4.session
      [21/Feb/2017 11:23:41 +0000] 33551 MainThread agent        INFO     Set ENV from agent cgroups after '/'.join(group) CPU u/s/e/r///0/./u/s/e/r///4/./s/e/s/s/i/o/n
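
      This is the classic Python pitfall of calling str.join on a string instead of a list of components: iterating over a string yields its characters, so a '/' is inserted between every pair of characters, and the existing '/' characters turn into the empty '//' components. A minimal reproduction, with a defensive sketch of a fix (hypothetical helper, not the actual agent code):

      group = 'user/0.user/4.session'
      print('/'.join(group))
      # -> u/s/e/r///0/./u/s/e/r///4/./s/e/s/s/i/o/n

      # The join is only correct when group is a list of path components:
      print('/'.join(['user', '0.user', '4.session']))
      # -> user/0.user/4.session

      def join_cgroup_path(group):
          # Defensive sketch: pass strings through unchanged and join only
          # real sequences of components (Python 2, as the agent runs on).
          if isinstance(group, basestring):
              return group
          return '/'.join(group)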
       

      Is it intended to work that way? Can you fix it? This potentially affects not only the NodeManager but Impala as well.

      People

      • Assignee: Unassigned
      • Reporter: Alexander Yasnogor