CDH (READ-ONLY) / DISTRO-290

DataBlockScanner spewing log messages for blocks not found

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: CDH2u3
    • Fix Version/s: CDH4.0.0
    • Component/s: HDFS
    • Labels:
      None

      Description

      A spurious block verification message is filling up the data node logs whenever DataBlockScanner can't find a block ID:
      INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification failed for blk_xx_xx. Its ok since it not in datanode dataset anymore.
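
      To gauge how much of a data node log this message takes up, the lines can be counted per block. The following is a minimal sketch, not part of Hadoop: the function name `count_missing_block_messages` is hypothetical, and the block-name shape (`blk_<id>_<generation stamp>`, with a possibly negative id) is an assumption based on the `blk_xx_xx` placeholder in the message above.

```python
import re
from collections import Counter

# Pattern built from the message quoted in this report. The block-name
# shape (blk_<id>_<generation stamp>, id possibly negative) is an assumption.
PATTERN = re.compile(
    r"DataBlockScanner: Verification failed for (blk_-?\d+(?:_\d+)?)\. "
    r"Its ok since it not in datanode dataset anymore"
)


def count_missing_block_messages(lines):
    """Return a Counter mapping block name -> number of matching log lines."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

      Passing an open log file (any iterable of lines) and calling `.most_common()` on the result would show which blocks are reported most often.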

      On 4/17, this cluster experienced a name node outage. The root volume had some I/O errors severe enough to require a reboot. Unfortunately, we were temporarily running in a configuration that wasn't dual-writing edits to an NFS mount. We wound up with a corrupt edits file on the root volume, and we had to restore from the secondary name node's snapshot, which is up to 5 minutes old. The bottom line is that we lost all inodes that were created after that last snapshot.

      The timing lines up pretty well on a 3-week boundary, which is also the block scanner's default verification period, so maybe what we're seeing is something like:

      1. A new block gets created on a data node and is enqueued for block verification 3 weeks later.
      2. Name node dies.
      3. Name node recovers from a stale snapshot, so a few inodes are lost, including the inode corresponding to the block in step 1.
      4. Data node doesn't know that name node lost track of this block, so 3 weeks later, it tries to verify it.
      5. Error handling logic doesn't quite handle this edge case, so data node freaks out.
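
      The hypothesis in steps 1-5 can be played through in a toy model. This is not Hadoop code: `ToyDataNode`, its fields, the `forget_missing` switch, and `simulate` are hypothetical names invented for illustration. The sketch assumes that the scanner keeps its own queue separate from the data node's dataset, and that the block leaves the dataset some time after the name node restore (the log message says it is no longer in the dataset). Neither assumption is confirmed by this report.

```python
SCAN_PERIOD_DAYS = 21  # the 3-week verification interval from the report


class ToyDataNode:
    """Toy stand-in for a data node; not the real DataBlockScanner."""

    def __init__(self, forget_missing):
        self.dataset = set()     # blocks actually stored on this node
        self.scan_queue = {}     # block -> day its next verification is due
        self.forget_missing = forget_missing
        self.log = []

    def add_block(self, block, day):
        # Step 1: store the block and schedule verification 3 weeks out.
        self.dataset.add(block)
        self.scan_queue[block] = day + SCAN_PERIOD_DAYS

    def delete_block(self, block):
        # Assumption: deletion touches the dataset only, not the scan queue.
        self.dataset.discard(block)

    def run_scanner(self, day):
        for block, due in list(self.scan_queue.items()):
            if due > day:
                continue
            if block in self.dataset:
                self.scan_queue[block] = day + SCAN_PERIOD_DAYS
            else:
                self.log.append(f"Verification failed for {block}")
                if self.forget_missing:
                    del self.scan_queue[block]
                # Otherwise the block stays due and fails again next pass.


def simulate(forget_missing, days=25):
    """Return how many failure messages are logged over `days` daily passes."""
    node = ToyDataNode(forget_missing)
    node.add_block("blk_1", day=0)
    node.delete_block("blk_1")  # block dropped after the name node restore
    for day in range(1, days + 1):
        node.run_scanner(day)
    return len(node.log)
```

      In this model, a scanner that never drops the missing block from its queue logs the failure on every pass once the block comes due, while one that forgets the block logs it once.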

      BTW, we're back to dual-writing edits to an NFS mount now, after correction of some issues with the NFS servers in our infrastructure.

            People

            • Assignee: Unassigned
            • Reporter: kate Kathleen Ting