Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Bug
Affects Version/s: CDH4.4.0
Fix Version/s: None
Component/s: HDFS
Labels:
Description
Hi Guys,
We are currently running a CDH 4.4.0 cluster with HA enabled. Lately the NameNode goes down about once a week, and from the logs it appears to be caused by the quorum JournalNodes:
2014-12-13 07:44:51,212 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [172.16.30.122:8485, 172.16.30.123:8485, 172.16.30.124:8485], stream=QuorumOutputStream starting at txid 1342968975))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
The flush to the quorum JournalNodes failed, and after the timeout the NameNode went down.
In our case we are running three JournalNodes and nine ZooKeeper instances.
Debugging steps I have taken so far:
1. At the time the NameNode went down there was no unusual machine load and no memory-related issue; I verified this with monitoring tools.
2. At the same time, nothing was written to the JournalNode logs.
Before the NameNode went down, the Journal Manager state on the dfshealth page showed the same txid for all three nodes (as I understand it, matching txids mean the nodes are syncing properly).
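To line up each failure with the JournalNode logs and the monitoring data, it may help to pull the exact timeout timestamps out of the NameNode log. The following is only a minimal sketch, assuming the log lines follow the format of the excerpt above; the function name `find_quorum_timeouts` and the regular expressions are illustrative and not part of Hadoop.

```python
import re
from datetime import datetime

# Both patterns are derived from the log excerpt in this report;
# other Hadoop versions or log4j layouts may format lines differently.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) ")
QUORUM_TIMEOUT = re.compile(
    r"Timed out waiting (\d+)ms for a quorum of nodes to respond"
)


def find_quorum_timeouts(lines):
    """Return (timestamp, timeout_ms) pairs for each quorum timeout found.

    The exception line itself carries no timestamp, so the timestamp of
    the most recent timestamped line before it is used.
    """
    events = []
    last_ts = None
    for line in lines:
        ts_match = TIMESTAMP.match(line)
        if ts_match:
            base = datetime.strptime(ts_match.group(1), "%Y-%m-%d %H:%M:%S")
            last_ts = base.replace(microsecond=int(ts_match.group(2)) * 1000)
        timeout_match = QUORUM_TIMEOUT.search(line)
        if timeout_match:
            events.append((last_ts, int(timeout_match.group(1))))
    return events
```

Passing the open NameNode log file to `find_quorum_timeouts` gives a list of timestamps that can be compared against the JournalNode logs and the monitoring graphs for the same moments.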
Please let me know if you need any further information; I am happy to help.
Please guide me on how to debug this.