[FLUME-286] DFO mode does not detect network failure - Cloudera Open Source

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: v0.9.1
Fix Version/s: v0.9.2
Component/s: Sinks+Sources
Labels:
None

Description

Collector configured as:

exec config auctionlogsink 'collectorSource(35853)' '{ gunzip => collectorSink( "hdfs://clmaster01/bidder_data/raw/auction_logs/%Y%m%d/%H/", "auctionLog-", 300000 ) }'

Agent configured as:

exec config nym7-bidlog 'syslogTcp(5140)' '{ gzip => agentDFOSink( "clmaster01", 35853 ) }'

We first observed this problem in production when our collector server went down. I've since observed it in a test environment too. If you simply stop the collector process, the agent immediately notices and starts writing events to disk:

2010-10-19 17:40:09,549 INFO com.cloudera.flume.handlers.debug.InsistentOpenDecorator: open attempt 0 failed, backoff (1000ms): Failed to open thrift event sink at 192.168.1.43:35855 : java.net.ConnectException: Connection refused

However, when the network fails, or the machine stops responding in any way (as happened in our production scenario), the agent node continues as if nothing has gone wrong. I simulated this by pulling the Ethernet cable out of the machine.

In my test scenario, when I plugged the cable back in, some of the events were received, presumably because they had been held in a TCP buffer. At no point, however, did the agent detect the situation, write anything to disk, or attempt to retransmit.
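The difference between the two failure modes can be reproduced outside Flume. The sketch below is not the FLUME-286 patch and does not use Flume's Thrift RPC; `send_with_ack`, `demo_silent_collector`, the one-byte `ACK`, and the in-memory `spool` list are all hypothetical names chosen for illustration. It assumes a sender that treats an event as delivered only after an application-level acknowledgement arrives within a timeout, and otherwise fails over. A listener that never accepts stands in for the unplugged collector: the connect and the write both succeed, and only the missing reply exposes the failure.

```python
import socket

ACK = b"\x06"  # hypothetical one-byte acknowledgement; not Flume's wire format


def send_with_ack(host, port, event, timeout_s, spool):
    """Send one event and wait up to timeout_s for the collector's ack.

    Returns True when acked. On refusal, timeout, or a closed connection the
    event is appended to `spool` (standing in for the on-disk failover log).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            # sendall() returns once the kernel has buffered the bytes,
            # so success here says nothing about the remote end.
            sock.sendall(event)
            # Only a reply, or its absence within timeout_s, reveals the
            # state of the peer.
            if sock.recv(1) == ACK:
                return True
    except OSError:
        # Covers ConnectionRefusedError and socket.timeout alike.
        pass
    spool.append(event)
    return False


def demo_silent_collector(timeout_s=0.5):
    """Simulate an unresponsive collector: a listener that never accepts."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    spool = []
    try:
        port = listener.getsockname()[1]
        acked = send_with_ack("127.0.0.1", port, b"event-1\n", timeout_s, spool)
    finally:
        listener.close()
    return acked, spool
```

Calling `demo_silent_collector()` spools the event once the timeout expires, whereas a sender that only checks the result of the write would report success and leave the event sitting in a TCP buffer.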

Attachments

0001-FLUME-286-DFO-mode-does-not-detect-network-failure.patch
04/Nov/10 3:23 AM
7 kB
Jonathan Hsieh

Issue Links

relates to

FLUME-313 Reconcile semantics differences between Avro RPC and Thrift RPC exceptions.

Open

People

Assignee:

Jonathan Hsieh

Reporter:

James Gurney

Votes:

0

Watchers:

1

Dates

Created:

20/Oct/10 10:45 PM

Updated:

13/Dec/10 2:37 PM

Resolved:

05/Nov/10 4:34 PM