Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: v0.9.3
Fix Version/s: v0.9.4
Component/s: Sinks+Sources
Labels: None
Environment: Ubuntu 8.04
Description
I have a problem where about a third of my events are duplicates. I run a 3-master/3-collector configuration in which each agent has a syslogTcp source feeding an agentE2EChain sink, and each collector uses a collectorSink that writes to S3.
My config looks like this (only with about 60 more agent nodes, all identically configured):
log1 : collectorSource | collectorSink("s3n://bucket/aarontest/dt=%Y-%m-%d","ue",3600000);
log2 : collectorSource | collectorSink("s3n://bucket/aarontest/dt=%Y-%m-%d","ue",3600000);
log3 : collectorSource | collectorSink("s3n://bucket/aarontest/dt=%Y-%m-%d","ue",3600000);
node1 : syslogTcp(5140) | agentE2EChain("log1","log2","log3");
node2 : syslogTcp(5140) | agentE2EChain("log1","log2","log3");
node3 : syslogTcp(5140) | agentE2EChain("log1","log2","log3");
One day, out of 6.5 million events, 2.5 million were duplicates. As you can see from my config above, the roll time is set to 1 hour (3600000 ms), and my flume.agent.logdir.retransmit value is set to 8 hours (28800000 ms).
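For reference, the figures above can be checked with a few lines of Python. This is only an arithmetic sketch using the numbers quoted in this report; it assumes the 6.5 million total already includes the duplicates, and the variable names are illustrative rather than Flume settings.

```python
# Values quoted in this report; the names are illustrative only.
ROLL_MS = 3_600_000          # third argument of collectorSink in the config above
RETRANSMIT_MS = 28_800_000   # flume.agent.logdir.retransmit

MS_PER_HOUR = 60 * 60 * 1000

roll_hours = ROLL_MS / MS_PER_HOUR
retransmit_hours = RETRANSMIT_MS / MS_PER_HOUR

# Assumption: the 6.5 million total includes the 2.5 million duplicates.
total_events = 6_500_000
duplicate_events = 2_500_000
duplicate_fraction = duplicate_events / total_events

print(f"roll: {roll_hours:g} h, retransmit: {retransmit_hours:g} h")
print(f"duplicates: {duplicate_fraction:.1%} of all events")
```

Under that assumption the duplicate share is a little over a third, which matches the rough estimate at the top of this description.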
I understand that with E2E there is a possibility of duplication, but this seems excessive. The problem does not occur with DFO chains.
There is also a thread on this topic at https://groups.google.com/a/cloudera.org/group/flume-user/browse_thread/thread/78af6c9cff03c42c#
I am attempting to determine when the retransmits happen, but it is proving somewhat difficult due to the large number of events.
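One way to narrow this down might be to bucket the duplicates by hour. The following is a minimal sketch, not a Flume tool: it assumes the events have already been read back from S3 and parsed into (timestamp, body) pairs, and that two events with the same timestamp and body are duplicates. The function name and the sample data are hypothetical. Note that it buckets by each event's own timestamp, so it shows which hours' events were duplicated, not when the retransmit itself occurred.

```python
from collections import Counter
from datetime import datetime


def duplicates_per_hour(events):
    """Count repeated events per hour bucket.

    `events` is an iterable of (timestamp, body) pairs, where timestamp is a
    datetime. The first occurrence of a pair is treated as the original;
    every later occurrence is counted as a duplicate.
    """
    seen = set()
    per_hour = Counter()
    for ts, body in events:
        key = (ts, body)
        if key in seen:
            # Truncate the event timestamp to the start of its hour.
            per_hour[ts.replace(minute=0, second=0, microsecond=0)] += 1
        else:
            seen.add(key)
    return per_hour


# Made-up sample data, for illustration only.
sample = [
    (datetime(2011, 1, 1, 10, 5), "a"),
    (datetime(2011, 1, 1, 10, 5), "a"),
    (datetime(2011, 1, 1, 11, 30), "b"),
    (datetime(2011, 1, 1, 11, 30), "b"),
    (datetime(2011, 1, 1, 11, 30), "b"),
    (datetime(2011, 1, 1, 12, 0), "c"),
]

for hour, count in sorted(duplicates_per_hour(sample).items()):
    print(hour.isoformat(), count)
```

With millions of events, holding every (timestamp, body) pair in memory may be too expensive; storing a hash of each pair instead would be one possible adjustment.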