CDH (READ-ONLY) / DISTRO-538

Container erases temporary files and the shell script immediately after execution; they need to be kept in case of failure.

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: CDH5b1
    • Fix Version/s: None
    • Component/s: MapReduce
    • Labels: None
    • Environment: Client: Windows, Server: CentOS Linux

      Description

      If the container is misconfigured and the shell script fails, there is no information left after execution about why it failed. All temporary files are removed immediately without copying any of them to the log; in particular, "stderr" is not copied anywhere.

      Examples of the files I suggest keeping:

      /yarn/nm/usercache/hadoop/appcache/application_1383178353227_0063/container_1383178353227_0063_01_000001/launch_container.sh
      /yarn/nm/usercache/hadoop/appcache/application_1383178353227_0063/filecache/12_tmp/job.xml
      /var/log/hadoop-yarn/container/application_1383178353227_0063/container_1383178353227_0063_01_000001/stdout
      /var/log/hadoop-yarn/container/application_1383178353227_0063/container_1383178353227_0063_01_000001/stderr
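
      These locations are not fixed; they are derived from the NodeManager settings yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs. A quick way to find the equivalent directories on another node (the config path below is only an assumption for a typical CDH layout):

      # Show the configured container work and log directories
      # (config location assumed; adjust to your installation).
      grep -A1 -E 'yarn.nodemanager.(local|log)-dirs' /etc/hadoop/conf/yarn-site.xml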

      I suggest that, at least in the case of failure, these files not be removed. I had to resort to the following trick to intercept them before they were removed:

      # Keep polling the container work and log directories and append every
      # file that appears there to /tmp/log before the NodeManager deletes it.
      rm -f /tmp/log
      while true; do
         find /yarn/nm/usercache/hadoop/appcache/ /var/log/hadoop-yarn/container/ -type f |
         while read -r i; do
            echo "$i"
            echo "====================$i================" >> /tmp/log
            cat "$i" >> /tmp/log
         done
      done
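
      A less intrusive workaround may be to delay the NodeManager's cleanup instead of racing it. This is only a sketch and assumes that the yarn.nodemanager.delete.debug-delay-sec property is available in this CDH build and that the configuration lives in the usual CDH location:

      # Assumption: yarn.nodemanager.delete.debug-delay-sec is supported by this build.
      # Set it in yarn-site.xml on every NodeManager and restart the NodeManagers:
      #
      #   <property>
      #     <name>yarn.nodemanager.delete.debug-delay-sec</name>
      #     <value>600</value>  <!-- keep container directories for 10 minutes -->
      #   </property>
      #
      # Then verify the setting is in place (config path assumed for CDH):
      grep -A1 delete.debug-delay /etc/hadoop/conf/yarn-site.xml

      With such a delay, launch_container.sh, job.xml, stdout and stderr from the listing above stay on disk for the configured time instead of disappearing as soon as the container exits.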
      

      In my case some environment variables were not set, but the only message I found in the log was:

      2013-11-06 08:55:33,635 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1383178353227_0063 failed 2 times due to AM Container for appattempt_1383178353227_0063_000002 exited with  exitCode: 1 due to: Exception from container-launch:
      org.apache.hadoop.util.Shell$ExitCodeException:
              at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
              at org.apache.hadoop.util.Shell.run(Shell.java:379)
              at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
              at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
              at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
              at java.util.concurrent.FutureTask.run(FutureTask.java:166)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              at java.lang.Thread.run(Thread.java:724)
      
      

      This did not provide enough information. The actual error message was found in the "stderr" file:

      ====================/var/log/hadoop-yarn/container/application_1383178353227_0063/container_1383178353227_0063_01_000001/stderr================
      Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
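
      For completeness: when log aggregation is enabled on the cluster (yarn.log-aggregation-enable=true, which is an assumption here), the stdout/stderr of finished containers can also be retrieved with the yarn CLI after the application ends, which would have surfaced the message above without the polling trick:

      # Requires log aggregation to be enabled on the cluster.
      yarn logs -applicationId application_1383178353227_0063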
      

            People

            • Assignee: Unassigned
            • Reporter: pganelin (Pavel Ganelin)
            • Votes: 1
            • Watchers: 2
