Details
Description
Seeing an issue where Livy appears unable to kill a PySpark session after losing its connection to the session's ProcessInterpreter.
User observation: a PySpark session is launched from a Hue Notebook, the session is initiated and goes to running. The user's job fails with errors and the session goes to idle, but it remains in the running state for 24+ hours. Normally an idle session is killed automatically after 1 hour.
In the YARN task log, the AM starts OK and the SparkContext comes up; the user's job runs with errors and the SparkContext goes idle. The YARN job then stays idle for 1 hour, at which point PythonInterpreter initiates shutdown:
INFO PythonInterpreter: Shutting down process
Nothing further appears in the YARN log, and the YARN job remains running.
In the Livy log we see the following timeout exception during the shutdown attempt:
INFO com.cloudera.livy.Logging$class.info(40): Stopping InteractiveSession 0...
WARN com.cloudera.livy.rsc.RSCClient.stop(220): Exception while waiting for end session reply.
java.util.concurrent.TimeoutException
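To illustrate how this exception arises, here is a minimal sketch (not Livy's actual code) of a client waiting on an "end session" reply future that never completes, for example because the connection to the remote driver was lost:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class EndSessionTimeout {
    public static void main(String[] args) {
        // Stands in for the "end session" reply the RSC client waits on.
        CompletableFuture<Void> endSessionReply = new CompletableFuture<>();
        try {
            // If the remote side never answers, the bounded wait times out
            // instead of ever completing.
            endSessionReply.get(100, TimeUnit.MILLISECONDS);
            System.out.println("reply received");
        } catch (TimeoutException e) {
            // Mirrors the WARN "Exception while waiting for end session reply".
            System.out.println(e.getClass().getName());
        } catch (Exception e) {
            System.out.println("other: " + e);
        }
    }
}
```

This prints `java.util.concurrent.TimeoutException`, matching the stack trace above: the stop call itself succeeds in sending the request, but the bounded wait for the reply expires.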
The Livy call trace suggests the following path:
-> repl/ProcessInterpreter.scala close() - Yarn log showing "Shutting down process"
-> repl/PythonInterpreter.scala sendShutdownRequest()
-> livy/server/interactive/InteractiveSession.scala stopSession()
-> livy/rsc/RSCClient.java stop() - the client getting the timeout error:
livy.rsc.RSCClient.stop(220): Exception while waiting for end session reply
It is unclear what happened; it appears the client lost its reference to its session/ProcessInterpreter and can no longer complete the session close, leaving the YARN application running indefinitely.
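One defensive pattern that would avoid the orphaned YARN application is to escalate when the graceful stop times out, e.g. by killing the YARN application directly. A hypothetical sketch (the `forceKill` hook and timeout value are assumptions, not Livy's current behavior):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StopWithFallback {
    // Attempt a graceful stop; escalate to a forced kill if no reply arrives.
    static String stop(CompletableFuture<Void> endSessionReply, Runnable forceKill) {
        try {
            endSessionReply.get(100, TimeUnit.MILLISECONDS);
            return "graceful";
        } catch (TimeoutException e) {
            // No reply from the driver: fall back to killing the app outright,
            // e.g. via the YARN client API, instead of leaving it running.
            forceKill.run();
            return "forced";
        } catch (InterruptedException | ExecutionException e) {
            forceKill.run();
            return "forced";
        }
    }

    public static void main(String[] args) {
        // Reply never arrives: the client has lost the driver connection.
        String result = stop(new CompletableFuture<>(),
                () -> System.out.println("force-killing YARN application"));
        System.out.println(result); // prints "forced"
    }
}
```

With this pattern, the timeout seen in the Livy log would still be logged, but the session's YARN application would be torn down instead of staying in the RUNNING state for 24+ hours.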