Details
-
Type:
Bug
-
Status: Closed
-
Priority:
Major
-
Resolution: Incomplete
-
Affects Version/s: CDH 5.8.2
-
Fix Version/s: None
-
Component/s: HDFS
-
Labels:None
-
Environment:SUSE SLES11 SP4
JAVA 1.8.0-45
Kerberos disabled
Description
Description:
---------------
The cloudera dashboard shows the diferent services as unestable and not available.
This include HDFS, Hive, Yarn....
An strace on the hdfs command shows the commands takes 2 minutes to complete, and most of the time its not performing any activity.
An analysis of the scenario shows the root cause is related to timeouts in DNS.
This has been verified with an strace.
The issue can be reproduced with hdfs or enabling the strace to dump the system calls.
HADOOP_ROOT_LOGGER=DEBUG,console hdfs dfs -ls /
HADOOP_ROOT_LOGGER=DEBUG,console strace -f hdfs dfs -ls /
Kerberos is disabled in this deployment, but the hdfs tries to DNS QUERY different hosts that are not in the cloudera manager, and (Ex: kerberos.site.com, kerberos.com)
It seem to add kerberos to each domain of the hosts.
The issue is worst if the DNS is not reachable or the queries take time to answer.
Workaround:
-----------------
One possible and working workarround is: disable DNS queries or to add the non-existing servers to /etc/hosts. After this the DNS
Logs:
17/02/27 19:22:31 DEBUG util.Shell: setsid exited with exit code 0
17/02/27 19:22:31 DEBUG conf.Configuration: parsing URL jar:file:/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/jars/hadoop-common-2.6.0-cdh5.8.2.jar!/core-default.xml
17/02/27 19:22:31 DEBUG conf.Configuration: parsing input stream sun.net.www.protocol.jar.JarURLConnection$JarURLInputStream@239963d8
17/02/27 19:22:31 DEBUG conf.Configuration: parsing URL file:/etc/hadoop/conf.cloudera.hdfs/core-site.xml
17/02/27 19:22:31 DEBUG conf.Configuration: parsing input stream java.io.BufferedInputStream@184f6be2
17/02/27 19:22:31 DEBUG core.Tracer: sampler.classes = ; loaded no samplers
17/02/27 19:22:31 DEBUG core.Tracer: span.receiver.classes = ; loaded no span receivers
17/02/27 19:22:31 DEBUG lib.MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with annotation @org.apache.hadoop.metrics2.annotation.Metric(always=false, sampleName=Ops, about=, type=DEFAULT, valueName=Time, value=[Rate of successful kerberos logins and latency (milliseconds)])
17/02/27 19:22:31 DEBUG lib.MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with annotation @org.apache.hadoop.metrics2.annotation.Metric(always=false, sampleName=Ops, about=, type=DEFAULT, valueName=Time, value=[Rate of failed kerberos logins and latency (milliseconds)])
17/02/27 19:22:31 DEBUG lib.MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with annotation @org.apache.hadoop.metrics2.annotation.Metric(always=false, sampleName=Ops, about=, type=DEFAULT, valueName=Time, value=[GetGroups])
17/02/27 19:22:31 DEBUG impl.MetricsSystemImpl: UgiMetrics, User and group related metrics
17/02/27 19:22:31 DEBUG security.SecurityUtil: Setting hadoop.security.token.service.use_ip to true
-
-
-
-
- No logs here, note the timeout at 19:24:32, thats 2 minutes after previous log *****
-
-
-
17/02/27 19:24:32 DEBUG util.KerberosName: Kerberos krb5 configuration not found, setting default realm to empty
17/02/27 19:24:32 DEBUG security.Groups: Creating new Groups object
17/02/27 19:24:32 DEBUG security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000; warningDeltaMs=5000
17/02/27 19:24:32 DEBUG security.UserGroupInformation: hadoop login
17/02/27 19:24:32 DEBUG security.UserGroupInformation: hadoop login commit
17/02/27 19:24:32 DEBUG security.UserGroupInformation: using local user:UnixPrincipal: root
17/02/27 19:24:32 DEBUG security.UserGroupInformation: Using user: "UnixPrincipal: root" with name root
17/02/27 19:24:32 DEBUG security.UserGroupInformation: User entry: "root"
17/02/27 19:24:32 DEBUG security.UserGroupInformation: UGI loginUser:root (auth:SIMPLE)
17/02/27 19:24:32 DEBUG core.Tracer: sampler.classes = ; loaded no samplers
17/02/27 19:24:32 DEBUG core.Tracer: span.receiver.classes = ; loaded no span receivers
17/02/27 19:24:32 DEBUG hdfs.BlockReaderLocal: dfs.client.use.legacy.blockreader.local = false
17/02/27 19:24:32 DEBUG hdfs.BlockReaderLocal: dfs.client.read.shortcircuit = false
17/02/27 19:24:32 DEBUG hdfs.BlockReaderLocal: dfs.client.domain.socket.data.traffic = false
17/02/27 19:24:32 DEBUG hdfs.BlockReaderLocal: dfs.domain.socket.path = /var/run/hdfs-sockets/dn
strace
---------
The strace shows the only activity during the timout are DNS queries.
[pid 3869] getsockname(162,
{sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
[pid 3869] bind(162,
[pid 3869] connect(162, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.200.67.41")}, 16) = 0
[pid 3869] ioctl(162, FIONREAD, [0]) = 0
[pid 3869] sendto(162, "\275\235\1\0\0\1\0\0\0\0\0\0\t_kerberos\3com\2es\0\0\20"..., 34, 0, NULL, 0) = 34
[pid 3869] poll([{fd=162, events=POLLIN|POLLERR}], 1, 1000 <unfinished ...>
[pid 3869] connect(162, {sa_family=AF_UNSPEC, sa_data="\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16) = 0
[pid 3869] getsockname(162, {sa_family=AF_INET, sin_port=htons(43441), sin_addr=inet_addr("0.0.0.0")}
, [16]) = 0
[pid 3869] connect(162,
[pid 3869] ioctl(162, FIONREAD, [0]) = 0
[pid 3869] sendto(162, "\275\235\1\0\0\1\0\0\0\0\0\0\t_kerberos\3com\2es\0\0\20"..., 34, 0, NULL, 0) = 34
[pid 3869] poll([{fd=162, events=POLLIN|POLLERR}], 1, 2000 <unfinished ...>
[pid 3869] connect(162, {sa_family=AF_UNSPEC, sa_data="\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16) = 0
[pid 3869] getsockname(162, {sa_family=AF_INET, sin_port=htons(43441), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
[pid 3869] connect(162, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.200.67.41")}, 16) = 0
[pid 3869] ioctl(162, FIONREAD, [0]) = 0
[pid 3869] sendto(162, "\275\235\1\0\0\1\0\0\0\0\0\0\t_kerberos\3com\2es\0\0\20"..., 34, 0, NULL, 0) = 34
[pid 3869] poll([{fd=162, events=POLLIN|POLLERR}], 1, 2000 <unfinished ...>
[pid 3869] connect(162, {sa_family=AF_UNSPEC, sa_data="\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16) = 0
[pid 3869] getsockname(162, {sa_family=AF_INET, sin_port=htons(43441), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
[pid 3869] connect(162, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.200.75.108")}
, 16) = 0
[pid 3869] ioctl(162, FIONREAD, [0]) =