  1. Flume (READ-ONLY)
  2. FLUME-629

DFO failure, stops buffering to disk, messages lost

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: v0.9.3
    • Fix Version/s: None
    • Component/s: Node
    • Labels: None

      Description

      Single master
      agent: syslogTcp | agentE2EChain
      collector: collectorSource | collectorSink("hdfs://...")

      From reading through various logs, this is, I believe, the order of events:

      • NameNode crashed
      • This caused collector to fail writes to hdfs
      • Which in turn caused agents to start backing up and buffering on disk (correct so far)
      • WatchDog caught a crash and restarted the Flume Master
      • Eventually the DFO stops writing to disk but keeps trying to pass messages
      • ACKs continue to fail and eventually nothing is passed
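
      To make the expected behaviour concrete, here is a toy model of what a disk-failover sink should do during an outage: every event that is not ACKed downstream is appended to a local buffer and replayed in order once the downstream recovers. This is only a sketch, not Flume's actual code; the class, function, and variable names are all hypothetical, and an in-memory queue stands in for the on-disk buffer. The failure reported here corresponds to the buffering step silently stopping while the send attempts continue.

```python
from collections import deque


class ToyFailoverSink:
    """Toy model of a disk-failover sink (hypothetical, not Flume's code)."""

    def __init__(self, send):
        self.send = send        # callable(event) -> True if ACKed downstream
        self.buffer = deque()   # stands in for the on-disk buffer

    def append(self, event):
        # Preserve ordering: if anything is already buffered, queue behind it.
        # Otherwise try to send, and buffer the event if no ACK comes back.
        if self.buffer or not self.send(event):
            self.buffer.append(event)

    def retry(self):
        # Replay buffered events in order; stop at the first failed ACK.
        while self.buffer and self.send(self.buffer[0]):
            self.buffer.popleft()


def simulate_outage():
    """Send three events while downstream is down, then recover and replay."""
    delivered = []
    downstream_up = [False]  # mutable flag shared with the send() closure

    def send(event):
        if downstream_up[0]:
            delivered.append(event)
            return True
        return False

    sink = ToyFailoverSink(send)
    for i in range(3):
        sink.append(f"msg-{i}")
    buffered_during_outage = len(sink.buffer)

    downstream_up[0] = True  # downstream (collector/HDFS) comes back
    sink.retry()
    return buffered_during_outage, delivered, len(sink.buffer)
```

      In this model no event is ever dropped: it is either delivered or still sitting in the buffer. The behaviour described in this report breaks that invariant, since events were neither written to disk nor ACKed.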

      Disk space was fine throughout. Another agent node continued to operate normally during this period and buffered all messages as expected. Here is a snippet of the relevant sections of the log files:

      http://pastie.org/pastes/1883087/text?key=ouxaqhuodprfrmsailunw

      I can provide the full log files if they will be of use.

            People

            • Assignee: Jonathan Hsieh (jon)
            • Reporter: Chetan Sarva (csarva)
            • Votes: 0
            • Watchers: 1
