Flume (READ-ONLY) / FLUME-286

DFO mode does not detect network failure

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: v0.9.1
    • Fix Version/s: v0.9.2
    • Component/s: Sinks+Sources
    • Labels:
      None

      Description

      Collector configured as:

      exec config auctionlogsink 'collectorSource(35853)' '{ gunzip => collectorSink( "hdfs://clmaster01/bidder_data/raw/auction_logs/%Y%m%d/%H/", "auctionLog-", 300000 ) }'

      Agent configured as:

      exec config nym7-bidlog 'syslogTcp(5140)' '{ gzip => agentDFOSink( "clmaster01", 35853 ) }'

      We first observed this problem in production when our collector server went down. I've since observed it in a test environment too. If you simply stop the collector process, the agent immediately notices and starts writing events to disk:

      2010-10-19 17:40:09,549 INFO com.cloudera.flume.handlers.debug.InsistentOpenDecorator: open attempt 0 failed, backoff (1000ms): Failed to open thrift event sink at 192.168.1.43:35855 : java.net.ConnectException: Connection refused
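      To illustrate why this case is easy to detect: when the host is up but nothing is listening on the port, the connect attempt is rejected straight away and Java surfaces it as a java.net.ConnectException, the same exception seen in the log line above. The following is a minimal sketch and not Flume code; the class name RefusedConnectDemo and its helper methods are hypothetical and exist only for this illustration.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class, not part of Flume.
public class RefusedConnectDemo {

    /** Returns true if connecting to host:port fails with "connection refused". */
    static boolean isRefused(String host, int port, int timeoutMs) throws IOException {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return false; // something accepted the connection
        } catch (ConnectException e) {
            return true;  // the peer's TCP stack actively rejected the attempt
        }
    }

    /** Binds an ephemeral local port and releases it, so nothing is listening there. */
    static int closedLocalPort() throws IOException {
        try (ServerSocket ss = new ServerSocket(0)) {
            return ss.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        int port = closedLocalPort();
        System.out.println("refused=" + isRefused("127.0.0.1", port, 1000));
    }
}
```

      The point of the sketch is that a stopped process produces an explicit error the sender can react to, which is presumably what triggers the backoff shown in the log.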

      However, in the event of a network failure (or the machine failing to respond in any way, as happened in our production scenario), which I simulated by pulling the Ethernet cable out of the machine, the agent node continues as if nothing has gone wrong.

      In my test scenario, when I plugged the cable back in, some of the events were received, presumably because they were still sitting in a TCP buffer. At no point, however, did the agent detect the situation, write anything to disk or attempt to re-transmit.
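      A possible explanation for the behaviour described above is that a plain socket write returns as soon as the bytes are in the local send buffer, so a peer that has silently disappeared produces no error for some time. The sketch below illustrates one generic way to notice this, an application-level acknowledgement with a read timeout. It is an assumption-laden illustration only: it is not Flume's implementation and not necessarily how the issue was fixed in v0.9.2. The class SilentPeerDemo, the method sendWithAck and the one-byte ack convention are all hypothetical. A listening socket that never calls accept() stands in for the unreachable collector.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Flume.
public class SilentPeerDemo {

    /**
     * Writes the payload, then waits up to ackTimeoutMs for a one-byte
     * acknowledgement. Returns true if a byte arrived, false if the peer
     * stayed silent or closed the stream.
     */
    static boolean sendWithAck(Socket s, byte[] payload, int ackTimeoutMs) throws IOException {
        OutputStream out = s.getOutputStream();
        out.write(payload); // returns once the bytes are buffered locally
        out.flush();
        s.setSoTimeout(ackTimeoutMs); // bound the wait for the ack
        try {
            return s.getInputStream().read() != -1;
        } catch (SocketTimeoutException e) {
            return false; // no ack in time: treat the event as undelivered
        }
    }

    public static void main(String[] args) throws IOException {
        // The OS completes the TCP handshake from the listen backlog, but
        // nothing ever reads the data or replies.
        try (ServerSocket silent = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silent.getInetAddress(), silent.getLocalPort())) {
            byte[] event = "event".getBytes(StandardCharsets.UTF_8);
            System.out.println("acked=" + sendWithAck(client, event, 500));
        }
    }
}
```

      In this sketch the write itself never fails; only the missing acknowledgement tells the sender that the event may not have arrived, at which point a sender could fall back to writing to disk and retrying later.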

              People

              • Assignee: Jonathan Hsieh (jon)
              • Reporter: James Gurney (jamesg)