Uploaded image for project: 'CDH (READ-ONLY)'
  1. CDH (READ-ONLY)
  2. DISTRO-631

Hive map reduce not working with YARN - 'java.io.IOException(Could not find status of job:)'

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: CDH 5.1.0
    • Fix Version/s: None
    • Component/s: Hive
    • Labels:
    • Environment:
      CentOS 6.5, hive 0.12.0-cdh5.1.0,

      Description

      We have troubles running any map reduce jobs from Hive when using YARN instead of MRv1.
      We have old MRv1 framework enabled by default a want to start using YARN instead. To test it, we do:

      export HADOOP_CONF_DIR=/etc/hadoop/conf.cloudera.yarn/
      and then run hive console

      When running a query like "SELECT COUNT from table", a map reduce job is submitted to YARN and it completes fine according to Resource manager web gui:

      User Name: horky
      Queue: root.horky
      State: SUCCEEDED
      Uberized: false
      Submitted: Thu Aug 14 09:01:45 CEST 2014
      Started: Thu Aug 14 09:02:01 CEST 2014
      Finished: Thu Aug 14 09:17:51 CEST 2014

      In Hive console, the end of a job looks always like this:

      2014-08-14 09:17:50,306 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 180775.31 sec
      2014-08-14 09:17:51,381 Stage-1 map = 100%, reduce = 80%, Cumulative CPU 180776.23 sec
      2014-08-14 09:17:52,450 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 180776.98 sec
      java.io.IOException: Could not find status of job:job_1406787755131_0048
      at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:297)
      at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:541)
      at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:431)
      at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
      at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
      at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
      at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1485)
      at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1263)
      at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1091)
      at org.apache.hadoop.hive.ql.Driver.run(Driver.java:931)
      at org.apache.hadoop.hive.ql.Driver.run(Driver.java:921)
      at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
      at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
      at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:422)
      at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:790)
      at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:684)
      at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:623)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
      Ended Job = job_1406787755131_0048 with exception 'java.io.IOException(Could not find status of job:job_1406787755131_0048)'
      FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

      It seems to me that Hive keeps checking for status of the job for which it needs to read job's xml configuration file over and over. But when the job truly finishes, the xml conf file is deleted and Hive crashes because of that.

      The audit log of active Namenode looks like this (grepped for job number):

      2014-08-14 09:17:51,364 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.114 cmd=getfileinfo src=/user/horky/.staging/job_1406787755131_0048/job.xml dst=null perm=null
      2014-08-14 09:17:51,716 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=create src=/user/horky/.staging/job_1406787755131_0048/COMMIT_STARTED dst=null perm=horky:tycho:rw-r-r-
      2014-08-14 09:17:51,723 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=create src=/user/horky/.staging/job_1406787755131_0048/COMMIT_SUCCESS dst=null perm=horky:tycho:rw-r-r-
      2014-08-14 09:17:51,769 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=create src=/user/history/done_intermediate/horky/job_1406787755131_0048.summary_tmp dst=null perm=horky:hadoop:rw-r-r-
      2014-08-14 09:17:51,789 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=setPermission src=/user/history/done_intermediate/horky/job_1406787755131_0048.summary_tmp dst=null perm=horky:hadoop:rwxrwx---
      2014-08-14 09:17:51,810 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=getfileinfo src=/user/horky/.staging/job_1406787755131_0048/job_1406787755131_0048_1.jhist dst=null perm=null
      2014-08-14 09:17:51,811 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=getfileinfo src=/user/history/done_intermediate/horky/job_1406787755131_0048-1407999705247-horky-SELECT+COUNT%28*%29+FROM+urlinfodump_2014_0...10%28Stage-140800067
      2014-08-14 09:17:51,816 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=getfileinfo src=/user/horky/.staging/job_1406787755131_0048/job_1406787755131_0048_1.jhist dst=null perm=null
      2014-08-14 09:17:51,817 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=getfileinfo src=/user/history/done_intermediate/horky/job_1406787755131_0048-1407999705247-horky-SELECT+COUNT%28*%29+FROM+urlinfodump_2014_0...10%28Stage-140800067
      2014-08-14 09:17:51,818 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=open src=/user/horky/.staging/job_1406787755131_0048/job_1406787755131_0048_1.jhist dst=null perm=null
      2014-08-14 09:17:51,832 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=create src=/user/history/done_intermediate/horky/job_1406787755131_0048-1407999705247-horky-SELECT+COUNT%28*%29+FROM+urlinfodump_2014_0...10%28Stage-140800067
      2014-08-14 09:17:51,871 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=setPermission src=/user/history/done_intermediate/horky/job_1406787755131_0048-1407999705247-horky-SELECT+COUNT%28*%29+FROM+urlinfodump_2014_0...10%28Stage-1
      2014-08-14 09:17:51,872 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=getfileinfo src=/user/horky/.staging/job_1406787755131_0048/job_1406787755131_0048_1_conf.xml dst=null perm=null
      2014-08-14 09:17:51,873 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=getfileinfo src=/user/history/done_intermediate/horky/job_1406787755131_0048_conf.xml_tmp dst=null perm=null
      2014-08-14 09:17:51,874 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=getfileinfo src=/user/horky/.staging/job_1406787755131_0048/job_1406787755131_0048_1_conf.xml dst=null perm=null
      2014-08-14 09:17:51,875 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=getfileinfo src=/user/history/done_intermediate/horky/job_1406787755131_0048_conf.xml_tmp dst=null perm=null
      2014-08-14 09:17:51,876 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=open src=/user/horky/.staging/job_1406787755131_0048/job_1406787755131_0048_1_conf.xml dst=null perm=null
      2014-08-14 09:17:51,879 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=create src=/user/history/done_intermediate/horky/job_1406787755131_0048_conf.xml_tmp dst=null perm=horky:hadoop:rw-r-r-
      2014-08-14 09:17:51,910 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=setPermission src=/user/history/done_intermediate/horky/job_1406787755131_0048_conf.xml_tmp dst=null perm=horky:hadoop:rwxrwx---
      2014-08-14 09:17:51,916 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=rename src=/user/history/done_intermediate/horky/job_1406787755131_0048.summary_tmp dst=/user/history/done_intermediate/horky/job_1406787755131_0048.summar
      2014-08-14 09:17:51,922 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=rename src=/user/history/done_intermediate/horky/job_1406787755131_0048_conf.xml_tmp dst=/user/history/done_intermediate/horky/job_1406787755131_0048_conf.x
      2014-08-14 09:17:51,925 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=rename src=/user/history/done_intermediate/horky/job_1406787755131_0048-1407999705247-horky-SELECT+COUNT%28*%29+FROM+urlinfodump_2014_0...10%28Stage-140800067
      2014-08-14 09:17:52,412 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.114 cmd=open src=/user/horky/.staging/job_1406787755131_0048/job.xml dst=null perm=null
      2014-08-14 09:17:52,423 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.114 cmd=getfileinfo src=/user/horky/.staging/job_1406787755131_0048/job.xml dst=null perm=null
      2014-08-14 09:17:52,423 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.114 cmd=getfileinfo src=/user/horky/.staging/job_1406787755131_0048/job.xml dst=null perm=null
      2014-08-14 09:17:52,424 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.114 cmd=getfileinfo src=/user/horky/.staging/job_1406787755131_0048/job.xml dst=null perm=null
      2014-08-14 09:17:52,425 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.114 cmd=getfileinfo src=/user/horky/.staging/job_1406787755131_0048/job.xml dst=null perm=null
      2014-08-14 09:17:52,426 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.114 cmd=getfileinfo src=/user/horky/.staging/job_1406787755131_0048/job.xml dst=null perm=null
      2014-08-14 09:17:52,426 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.114 cmd=getfileinfo src=/user/horky/.staging/job_1406787755131_0048/job.xml dst=null perm=null
      2014-08-14 09:17:52,436 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.114 cmd=getfileinfo src=/user/horky/.staging/job_1406787755131_0048/job.xml dst=null perm=null
      2014-08-14 09:17:52,437 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.114 cmd=getfileinfo src=/user/horky/.staging/job_1406787755131_0048/job.xml dst=null perm=null
      2014-08-14 09:17:52,973 INFO FSNamesystem.audit: allowed=true ugi=horky (auth:SIMPLE) ip=/A.B.C.97 cmd=delete src=/user/horky/.staging/job_1406787755131_0048 dst=null perm=null

      So the whole directory gets deleted by application master running on A.B.C.97 node, where as hive console running on .114 node still tries to read the job.xml file. According to hadoop-cmf-hdfs-NAMENODE-node.domain.log.out, hive indeed tried to open the deleted file:

      2014-08-14 09:17:53,474 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:horky (auth:SIMPLE) cause:java.io.FileNotFoundException: File does not exist: /user/horky/.staging/job_1406787755131_0048/job.xml
      2014-08-14 09:17:53,474 INFO org.apache.hadoop.ipc.Server: IPC Server handler 192 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from A.B.C.114:42539 Call#439414 Retry#0: error: java.io.FileNotFoundException: File does not exist: /user/horky/.staging/job_1406787755131_0048/job.xml

      I am out of ideas of what to try to tune. I tend to believe this is probably a configuration issue as I don't want to believe nobody would notice it before us.

      Any help would be appreciated.

      Regards
      Jiri Horky

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              jirihorky Jiri Horky
            • Votes:
              3 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated: