Uploaded image for project: 'CDH (READ-ONLY)'
  1. CDH (READ-ONLY)
  2. DISTRO-844

hdfs slow due to unexpected DNS queries

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: CDH 5.8.2
    • Fix Version/s: None
    • Component/s: HDFS
    • Labels:
      None
    • Environment:
      SUSE SLES11 SP4
      JAVA 1.8.0-45
      Kerberos disabled

      Description

      Description:
      ---------------

      The cloudera dashboard shows the diferent services as unestable and not available.
      This include HDFS, Hive, Yarn....

      An strace on the hdfs command shows the commands takes 2 minutes to complete, and most of the time its not performing any activity.

      An analysis of the scenario shows the root cause is related to timeouts in DNS.
      This has been verified with an strace.

      The issue can be reproduced with hdfs or enabling the strace to dump the system calls.

      HADOOP_ROOT_LOGGER=DEBUG,console hdfs dfs -ls /
      HADOOP_ROOT_LOGGER=DEBUG,console strace -f hdfs dfs -ls /

      Kerberos is disabled in this deployment, but the hdfs tries to DNS QUERY different hosts that are not in the cloudera manager, and (Ex: kerberos.site.com, kerberos.com)
      It seem to add kerberos to each domain of the hosts.

      The issue is worst if the DNS is not reachable or the queries take time to answer.

      Workaround:
      -----------------
      One possible and working workarround is: disable DNS queries or to add the non-existing servers to /etc/hosts. After this the DNS

      Logs:

      17/02/27 19:22:31 DEBUG util.Shell: setsid exited with exit code 0
      17/02/27 19:22:31 DEBUG conf.Configuration: parsing URL jar:file:/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/jars/hadoop-common-2.6.0-cdh5.8.2.jar!/core-default.xml
      17/02/27 19:22:31 DEBUG conf.Configuration: parsing input stream sun.net.www.protocol.jar.JarURLConnection$JarURLInputStream@239963d8
      17/02/27 19:22:31 DEBUG conf.Configuration: parsing URL file:/etc/hadoop/conf.cloudera.hdfs/core-site.xml
      17/02/27 19:22:31 DEBUG conf.Configuration: parsing input stream java.io.BufferedInputStream@184f6be2
      17/02/27 19:22:31 DEBUG core.Tracer: sampler.classes = ; loaded no samplers
      17/02/27 19:22:31 DEBUG core.Tracer: span.receiver.classes = ; loaded no span receivers
      17/02/27 19:22:31 DEBUG lib.MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with annotation @org.apache.hadoop.metrics2.annotation.Metric(always=false, sampleName=Ops, about=, type=DEFAULT, valueName=Time, value=[Rate of successful kerberos logins and latency (milliseconds)])
      17/02/27 19:22:31 DEBUG lib.MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with annotation @org.apache.hadoop.metrics2.annotation.Metric(always=false, sampleName=Ops, about=, type=DEFAULT, valueName=Time, value=[Rate of failed kerberos logins and latency (milliseconds)])
      17/02/27 19:22:31 DEBUG lib.MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with annotation @org.apache.hadoop.metrics2.annotation.Metric(always=false, sampleName=Ops, about=, type=DEFAULT, valueName=Time, value=[GetGroups])
      17/02/27 19:22:31 DEBUG impl.MetricsSystemImpl: UgiMetrics, User and group related metrics
      17/02/27 19:22:31 DEBUG security.SecurityUtil: Setting hadoop.security.token.service.use_ip to true

              • No logs here, note the timeout at 19:24:32, thats 2 minutes after previous log *****

      17/02/27 19:24:32 DEBUG util.KerberosName: Kerberos krb5 configuration not found, setting default realm to empty
      17/02/27 19:24:32 DEBUG security.Groups: Creating new Groups object
      17/02/27 19:24:32 DEBUG security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000; warningDeltaMs=5000
      17/02/27 19:24:32 DEBUG security.UserGroupInformation: hadoop login
      17/02/27 19:24:32 DEBUG security.UserGroupInformation: hadoop login commit
      17/02/27 19:24:32 DEBUG security.UserGroupInformation: using local user:UnixPrincipal: root
      17/02/27 19:24:32 DEBUG security.UserGroupInformation: Using user: "UnixPrincipal: root" with name root
      17/02/27 19:24:32 DEBUG security.UserGroupInformation: User entry: "root"
      17/02/27 19:24:32 DEBUG security.UserGroupInformation: UGI loginUser:root (auth:SIMPLE)
      17/02/27 19:24:32 DEBUG core.Tracer: sampler.classes = ; loaded no samplers
      17/02/27 19:24:32 DEBUG core.Tracer: span.receiver.classes = ; loaded no span receivers
      17/02/27 19:24:32 DEBUG hdfs.BlockReaderLocal: dfs.client.use.legacy.blockreader.local = false
      17/02/27 19:24:32 DEBUG hdfs.BlockReaderLocal: dfs.client.read.shortcircuit = false
      17/02/27 19:24:32 DEBUG hdfs.BlockReaderLocal: dfs.client.domain.socket.data.traffic = false
      17/02/27 19:24:32 DEBUG hdfs.BlockReaderLocal: dfs.domain.socket.path = /var/run/hdfs-sockets/dn

      strace
      ---------

      The strace shows the only activity during the timout are DNS queries.

      [pid 3869] getsockname(162,

      {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}

      , [16]) = 0
      [pid 3869] bind(162,

      {sa_family=AF_INET, sin_port=htons(43441), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
      [pid 3869] connect(162, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.200.67.41")}, 16) = 0
      [pid 3869] ioctl(162, FIONREAD, [0]) = 0
      [pid 3869] sendto(162, "\275\235\1\0\0\1\0\0\0\0\0\0\t_kerberos\3com\2es\0\0\20"..., 34, 0, NULL, 0) = 34
      [pid 3869] poll([{fd=162, events=POLLIN|POLLERR}], 1, 1000 <unfinished ...>
      [pid 3869] connect(162, {sa_family=AF_UNSPEC, sa_data="\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16) = 0
      [pid 3869] getsockname(162, {sa_family=AF_INET, sin_port=htons(43441), sin_addr=inet_addr("0.0.0.0")}

      , [16]) = 0
      [pid 3869] connect(162,

      {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.200.75.108")}, 16) = 0
      [pid 3869] ioctl(162, FIONREAD, [0]) = 0
      [pid 3869] sendto(162, "\275\235\1\0\0\1\0\0\0\0\0\0\t_kerberos\3com\2es\0\0\20"..., 34, 0, NULL, 0) = 34
      [pid 3869] poll([{fd=162, events=POLLIN|POLLERR}], 1, 2000 <unfinished ...>
      [pid 3869] connect(162, {sa_family=AF_UNSPEC, sa_data="\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16) = 0
      [pid 3869] getsockname(162, {sa_family=AF_INET, sin_port=htons(43441), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
      [pid 3869] connect(162, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.200.67.41")}, 16) = 0
      [pid 3869] ioctl(162, FIONREAD, [0]) = 0
      [pid 3869] sendto(162, "\275\235\1\0\0\1\0\0\0\0\0\0\t_kerberos\3com\2es\0\0\20"..., 34, 0, NULL, 0) = 34
      [pid 3869] poll([{fd=162, events=POLLIN|POLLERR}]
      , 1, 2000 <unfinished ...>
      [pid 3869] connect(162, {sa_family=AF_UNSPEC, sa_data="\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16) = 0
      [pid 3869] getsockname(162, {sa_family=AF_INET, sin_port=htons(43441), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
      [pid 3869] connect(162, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.200.75.108")}

      , 16) = 0
      [pid 3869] ioctl(162, FIONREAD, [0]) =

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              jrubio Jose Rubio
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: