Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: v0.9.1
- Fix Version/s: v0.9.2
- Component/s: Sinks+Sources
- Labels: None
Description
Collector configured as:
exec config auctionlogsink 'collectorSource(35853)' '{ gunzip => collectorSink( "hdfs://clmaster01/bidder_data/raw/auction_logs/%Y%m%d/%H/", "auctionLog-", 300000 ) }'
Agent configured as:
exec config nym7-bidlog 'syslogTcp(5140)' '{ gzip => agentDFOSink( "clmaster01", 35853 ) }'
We first observed this problem in production when our collector server went down. I've since observed it in a test environment too. If you simply stop the collector process, the agent immediately notices and starts writing events to disk:
2010-10-19 17:40:09,549 INFO com.cloudera.flume.handlers.debug.InsistentOpenDecorator: open attempt 0 failed, backoff (1000ms): Failed to open thrift event sink at 192.168.1.43:35855 : java.net.ConnectException: Connection refused
However, when the network fails, or the machine stops responding in any way (as happened in our production scenario), the agent node continues as if nothing has gone wrong. I simulated this by pulling the Ethernet cable out of the machine.
In my test scenario, when I plugged the cable back in, some of the events were received, presumably because they had been held in a TCP buffer. At no point, however, did the agent detect the situation, write anything to disk, or attempt to retransmit.
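One plausible reading of this behaviour, offered as an assumption rather than a confirmed diagnosis: a write to a TCP socket returns as soon as the bytes reach the local send buffer, so a peer that has silently vanished produces no error, whereas a stopped process produces an immediate "Connection refused". The Java sketch below is not Flume code; `SilentPeerDemo`, `sendWithAck`, and the 200 ms timeout are hypothetical names and values chosen for illustration. It connects to a local listener that never replies and shows that only a bounded wait for an application-level acknowledgement exposes the failure.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

public class SilentPeerDemo {

    /**
     * Writes one event, then waits up to ackTimeoutMs for a single byte back.
     * Returns true if the peer answered, false if it stayed silent or closed.
     * (Hypothetical helper; the one-byte ack protocol is an assumption.)
     */
    static boolean sendWithAck(Socket socket, byte[] event, int ackTimeoutMs) throws IOException {
        OutputStream out = socket.getOutputStream();
        out.write(event);   // succeeds: the bytes only need to fit in the local send buffer
        out.flush();
        socket.setSoTimeout(ackTimeoutMs);   // bound the wait for the acknowledgement
        try {
            return socket.getInputStream().read() != -1;
        } catch (SocketTimeoutException e) {
            return false;   // no reply in time: treat the peer as unreachable
        }
    }

    public static void main(String[] args) throws IOException {
        // A listener that never accepts or replies stands in for the unresponsive collector.
        try (ServerSocket silentPeer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket agent = new Socket(InetAddress.getLoopbackAddress(), silentPeer.getLocalPort())) {
            byte[] event = "event\n".getBytes(StandardCharsets.UTF_8);
            boolean acked = sendWithAck(agent, event, 200);
            System.out.println(acked ? "acknowledged" : "no ack: spill to disk and retry");
        }
    }
}
```

In this sketch the write completes without any exception even though nothing ever reads the event; only the timed-out read signals a problem. A sender that relies on write errors alone would therefore keep running, which matches the symptom described above.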
Issue Links
- relates to FLUME-313: Reconcile semantics differences between Avro RPC and Thrift RPC exceptions. (Open)