Details
- Type: Bug
- Status: Open
- Priority: Critical
- Resolution: Unresolved
- Affects Version/s: v0.9.3
- Fix Version/s: None
- Component/s: Node
- Labels: None
Description
Single master
agent: syslogTcp | agentE2EChain
collector: collectorSource | collectorSink("hdfs://...")
From reading through various logs, this is, I believe, the order of events:
- NameNode crashed
- This caused the collector to fail writes to HDFS
- Which in turn caused agents to start backing up and buffering on disk (correct so far)
- WatchDog caught a crash and restarted the Flume Master
- Eventually the DFO stopped writing to disk but kept trying to pass messages
- ACKs continued to fail and eventually nothing was passed
Disk space was fine throughout. Another agent node continued to operate normally during this period and buffered all messages as expected. Here is a snippet of the relevant sections of the log files:
http://pastie.org/pastes/1883087/text?key=ouxaqhuodprfrmsailunw
I can provide the full log files if they will be of use.