Uploaded image for project: 'CDH (READ-ONLY)'
  1. CDH (READ-ONLY)
  2. DISTRO-678

Resource allocation failure for jobs when a node becomes unavailable

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: CDH 5.3.0
    • Fix Version/s: None
    • Component/s: MapReduce
    • Labels:
      None
    • Environment:
      Centos6

      Description

      A node going from RUNNING to UNHEALTHY or LOST makes subsequent jobs (for a limited time) unable to be allocated resources, throwing this error:

      java.io.IOException: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=1, maxVirtualCores=0
      

      It seems the RM is losing something in its config when refreshing it on noticing the node state change, and believes wrongly the configured max vcores per application is 0.

      I can reproduce it pretty easily by just killing a nodemanager and waiting for 10 minutes, while executing various applications.

      For a hive query, here is the full stack trace:

      hive> select count(*) from dw_reservation;
      Total jobs = 1
      Launching Job 1 out of 1
      Number of reduce tasks determined at compile time: 1
      In order to change the average load for a reducer (in bytes):
        set hive.exec.reducers.bytes.per.reducer=<number>
      In order to limit the maximum number of reducers:
        set hive.exec.reducers.max=<number>
      In order to set a constant number of reducers:
        set mapreduce.job.reduces=<number>
      java.io.IOException: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=1, maxVirtualCores=0
          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:205)
          at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateResourceRequest(RMAppManager.java:387)
          at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:347)
          at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:271)
          at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:538)
          at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:188)
          at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:323)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:415)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
      
          at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:305)
          at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:528)
          at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
          at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:415)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
          at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
          at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:564)
          at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:559)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:415)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
          at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:559)
          at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:550)
          at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:420)
          at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
          at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
          at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
          at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:72)
      Caused by: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=1, maxVirtualCores=0
          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:205)
          at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateResourceRequest(RMAppManager.java:387)
          at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:347)
          at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:271)
          at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:538)
          at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:188)
          at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:323)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:415)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
      
          at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
          at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
          at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
          at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
          at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
          at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
          at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication(ApplicationClientProtocolPBClientImpl.java:211)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:606)
          at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
          at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
          at com.sun.proxy.$Proxy27.submitApplication(Unknown Source)
          at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:225)
          at org.apache.hadoop.mapred.ResourceMgrDelegate.submitApplication(ResourceMgrDelegate.java:282)
          at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:289)
          ... 19 more
      Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException): Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=1, maxVirtualCores=0
          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:205)
          at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateResourceRequest(RMAppManager.java:387)
          at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:347)
          at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:271)
          at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:538)
          at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:188)
          at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:323)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:415)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
      
          at org.apache.hadoop.ipc.Client.call(Client.java:1411)
          at org.apache.hadoop.ipc.Client.call(Client.java:1364)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
          at com.sun.proxy.$Proxy26.submitApplication(Unknown Source)
          at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication(ApplicationClientProtocolPBClientImpl.java:208)
          ... 29 more
      Job Submission failed with exception 'java.io.IOException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=1, maxVirtualCores=0
          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:205)
          at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateResourceRequest(RMAppManager.java:387)
          at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:347)
          at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:271)
          at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:538)
          at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:188)
          at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:323)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:415)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
      )'
      FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
      

        Attachments

          Activity

            People

            • Assignee:
              adhoot Anubhav Dhoot
              Reporter:
              dmorel David Morel
            • Votes:
              1 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: