Description
A bad data node block verification message is filling up the data node logs. It is logged whenever DataBlockScanner can't find a block ID:
INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification failed for blk_xx_xx. Its ok since it not in datanode dataset anymore.
On 4/17, this cluster experienced a name node outage. The root volume had some I/O errors severe enough to require a reboot. Unfortunately, we were temporarily running in a configuration that wasn't dual-writing edits to an NFS mount. We wound up with a corrupt edits file on the root volume, and we had to restore from the secondary name node's snapshot, which is up to 5 minutes old. The bottom line is that we lost all inodes that were created after that last snapshot.
The timing of that outage lines up pretty well on a 3-week boundary with these messages, so maybe what we're seeing is something like:
1. New block gets created on a data node and gets enqueued for block verification 3 weeks later.
2. Name node dies.
3. Name node recovers from a stale snapshot, so a few inodes are lost, including the inode corresponding to the block in step 1.
4. Data node doesn't know that name node lost track of this block, so 3 weeks later, it tries to verify it.
5. Error handling logic doesn't quite handle this edge case, so data node keeps logging the verification failure over and over.
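
To make the hypothesized sequence above concrete, here is a minimal toy model in Java. It is not Hadoop's DataBlockScanner code: the class StaleBlockScanSketch, its methods, and the dropMissingBlocks flag are all hypothetical names invented for illustration. How the block actually ends up missing from the data node's dataset is not established in this report, so the sketch simply removes it directly. It only shows why a scan queue that is never pruned would produce the same failure message on every pass, and how dropping the missing block would limit that to a single message.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the hypothesis only; this is NOT Hadoop's DataBlockScanner. */
class StaleBlockScanSketch {
    // Blocks the (simulated) data node still has in its dataset.
    private final Set<Long> dataset = new HashSet<>();
    // Blocks waiting for their periodic verification.
    private final Deque<Long> scanQueue = new ArrayDeque<>();
    // Collected log lines, so the behavior can be inspected.
    final List<String> log = new ArrayList<>();
    // Hypothetical switch: drop a block from the queue once it is found missing.
    private final boolean dropMissingBlocks;

    StaleBlockScanSketch(boolean dropMissingBlocks) {
        this.dropMissingBlocks = dropMissingBlocks;
    }

    // Step 1: a new block is stored and enqueued for later verification.
    void addBlock(long id) {
        dataset.add(id);
        scanQueue.addLast(id);
    }

    // Steps 2-3 (assumed effect): the block leaves the dataset,
    // but nothing removes it from the scan queue.
    void forgetBlock(long id) {
        dataset.remove(id);
    }

    // Steps 4-5: one verification pass over everything currently queued.
    void runScanPass() {
        int pending = scanQueue.size();
        for (int i = 0; i < pending; i++) {
            long id = scanQueue.removeFirst();
            if (dataset.contains(id)) {
                scanQueue.addLast(id); // verified; schedule the next periodic scan
                continue;
            }
            log.add("Verification failed for blk_" + id + ": not in dataset");
            if (!dropMissingBlocks) {
                scanQueue.addLast(id); // edge case unhandled: retried on every pass
            }
        }
    }

    public static void main(String[] args) {
        for (boolean drop : new boolean[] {false, true}) {
            StaleBlockScanSketch dn = new StaleBlockScanSketch(drop);
            dn.addBlock(42L);
            dn.forgetBlock(42L);
            for (int pass = 0; pass < 3; pass++) {
                dn.runScanPass();
            }
            System.out.println("dropMissingBlocks=" + drop
                    + " -> " + dn.log.size() + " log line(s)");
        }
    }
}
```

With the flag off, three passes produce three failure lines for the same block. With it on, the block is reported once and then forgotten. Whether the real scanner re-enqueues missing blocks this way is an assumption of the sketch, not something confirmed here.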
BTW, we're back to dual-writing edits to an NFS mount now, after correction of some issues with the NFS servers in our infrastructure.