Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Not A Bug
- Affects Version/s: CDH4.2.1
- Fix Version/s: None
- Component/s: HDFS
- Labels: None
- Environment: CDH4.2.1, HA
Description
During checkpointing on the standby NN, the checkpointer thread holds a lock that prevents almost anything else from running.
This is especially bad because the lock is held during image compression and writeback to disk, and these operations take a long time on non-trivial setups.
As a reminder, fresh clients will connect to the standby and expect it to either fail the connection or redirect them to the active NN.
While the lock is held, which can last for tens of seconds, the client is stalled waiting for an answer, which slows down operations for newly started tasks.
A JMX thread dump showing the problem is attached.
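
To make the contention pattern described above concrete, here is a minimal, self-contained Java sketch. It is not HDFS code: `CheckpointLockSketch`, `nsLock`, and `clientCanProceedDuringSave` are hypothetical names, and the slow compress-and-write phase is simulated by a latch rather than real I/O. It only illustrates the general idea that a request needing the lock cannot proceed while a checkpointer holds it across the slow phase, and can proceed if the lock is released once a consistent view has been captured.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class CheckpointLockSketch {
    // Hypothetical stand-in for the namesystem lock; not the real HDFS class.
    static final ReentrantReadWriteLock nsLock = new ReentrantReadWriteLock();

    // Waits on a latch, converting interruption into an unchecked exception.
    static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /**
     * Runs a simulated checkpoint on a background thread and reports whether
     * a client could take the read lock while the slow save phase was running.
     */
    static boolean clientCanProceedDuringSave(boolean holdLockWhileSaving) {
        CountDownLatch saveStarted = new CountDownLatch(1);
        CountDownLatch clientDone = new CountDownLatch(1);

        Thread checkpointer = new Thread(() -> {
            nsLock.writeLock().lock();
            boolean locked = true;
            try {
                // Fast phase: capture a consistent view of the namespace (omitted).
                if (!holdLockWhileSaving) {
                    nsLock.writeLock().unlock();
                    locked = false;
                }
                // Slow phase: stands in for image compression and disk writeback.
                saveStarted.countDown();
                await(clientDone);
            } finally {
                if (locked) {
                    nsLock.writeLock().unlock();
                }
            }
        });
        checkpointer.start();

        // The "client" arrives while the slow phase is in progress.
        await(saveStarted);
        boolean acquired = nsLock.readLock().tryLock();
        if (acquired) {
            nsLock.readLock().unlock();
        }
        clientDone.countDown();

        try {
            checkpointer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return acquired;
    }

    public static void main(String[] args) {
        System.out.println("lock held during save, client proceeds: "
                + clientCanProceedDuringSave(true));
        System.out.println("lock released before save, client proceeds: "
                + clientCanProceedDuringSave(false));
    }
}
```

In the sketch the client uses a non-blocking `tryLock()` so the outcome is observable without timing assumptions; a real client request would instead block for as long as the lock is held, which matches the stall reported here.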