  1. CDH (READ-ONLY)
  2. DISTRO-491

Namenode stops responding to all IPC, due to Infinite loop in LeaseManager/checkLease

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: CDH4.2.1
    • Fix Version/s: None
    • Component/s: HDFS
    • Labels: None
    • Environment: CDH4.2.1, namenode in HA

      Description

      Our Namenode gets into a state where the checkLeases() function in LeaseManager.java never returns. This is a problem because the function runs under the write lock, which stalls all IPC. The NN is therefore as good as dead.

      Attaching a debugger to the NN shows an entry in sortedLeases which is cleared neither by fsnamesystem.internalReleaseLease() nor by the safety path through IOException.

      More precisely, we have an inode whose lease has passed the hard limit but which seems to violate the assumptions in FSNamesystem/internalReleaseLease (see file attached, from JDB). The code path reaches the assertion:
      assert false : "Already checked that the last block is incomplete";
      The assertion does not kill the NN (Java assertions have no effect unless the JVM is started with -ea), and the code path after it does not clear the lease.

      What is specific about this inode?

      • the inode is made of 2 blocks
      • index 0 of the block array holds a BlockInfoUnderConstruction in state COMMITTED
      • index 1 of the block array holds a plain BlockInfo in state COMPLETE

      When the loop computes nrCompleteBlocks, it exits on the first block and counts zero blocks as complete, which is clearly not the case. The code that follows relies on the same kind of assumption and behaves oddly, including passing through the assertion, which does not trigger. It looks like the loop assumes that every block after the first incomplete block is also incomplete, which does not hold for our inode.
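
      The miscount can be reproduced with a small, self-contained Java sketch. This is not the FSNamesystem code: the State enum and both method names are hypothetical stand-ins, and the only assumption carried over from the description is that the loop stops at the first block that is not complete.

```java
import java.util.Arrays;
import java.util.List;

class CompleteBlockCount {
    // Hypothetical stand-in for the block states named in this report.
    enum State { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

    // Counts the leading run of COMPLETE blocks and stops at the first
    // incomplete one, as the loop in the report appears to do.
    static int leadingCompleteBlocks(List<State> blocks) {
        int n = 0;
        while (n < blocks.size() && blocks.get(n) == State.COMPLETE) {
            n++;
        }
        return n;
    }

    // Counts every COMPLETE block regardless of its position.
    static int totalCompleteBlocks(List<State> blocks) {
        int n = 0;
        for (State s : blocks) {
            if (s == State.COMPLETE) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // The inode from this report: block 0 COMMITTED, block 1 COMPLETE.
        List<State> inode = Arrays.asList(State.COMMITTED, State.COMPLETE);
        System.out.println("leading complete blocks: " + leadingCompleteBlocks(inode));
        System.out.println("total complete blocks:   " + totalCompleteBlocks(inode));
    }
}
```

      For the block layout reported here, the leading count is 0 while one block is in fact complete, so any later logic that infers "the last block is incomplete" from the leading count is working from a false premise.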

      A first, dumb fix would be to throw an IOException after the assertion. This would at least make the function resilient to cases that violate the assertion, as the lease would then be garbage-collected by the checkLeases() caller.
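
      A minimal sketch of why throwing would unblock the caller, under stated assumptions: the Releaser interface, the drain() method and the maxIterations guard are hypothetical and exist only for illustration, and the model treats every lease in the set as expired. It is a simplified picture of the caller loop as described in this report, not the LeaseManager code.

```java
import java.io.IOException;
import java.util.TreeSet;

class LeaseLoopSketch {
    // Hypothetical outcome of one release attempt on an expired lease.
    interface Releaser {
        // Returns true if the lease could be released, false otherwise.
        boolean release(String lease) throws IOException;
    }

    // Simplified model of the caller loop: take the oldest lease, try to
    // release it, and drop it when the release succeeds or throws.
    // maxIterations is a guard added for this sketch so that a stuck lease
    // is reported instead of spinning forever.
    static int drain(TreeSet<String> sortedLeases, Releaser releaser, int maxIterations) {
        int iterations = 0;
        while (!sortedLeases.isEmpty() && iterations < maxIterations) {
            iterations++;
            String oldest = sortedLeases.first();
            boolean remove;
            try {
                remove = releaser.release(oldest);
            } catch (IOException e) {
                remove = true; // safety path: an unreleasable lease is dropped
            }
            if (remove) {
                sortedLeases.remove(oldest);
            }
        }
        return iterations;
    }

    public static void main(String[] args) {
        TreeSet<String> leases = new TreeSet<>();
        leases.add("lease-1");

        // Current behaviour as reported: the release neither succeeds nor throws.
        int spins = drain(leases, lease -> false, 1000);
        System.out.println("no throw: iterations=" + spins + ", remaining=" + leases.size());

        // Proposed fix: throw, so that the caller removes the lease.
        int fixedSpins = drain(leases, lease -> {
            throw new IOException("last block is already COMPLETE");
        }, 1000);
        System.out.println("throwing: iterations=" + fixedSpins + ", remaining=" + leases.size());
    }
}
```

      In this model the non-throwing release exhausts the guard with the lease still present, while the throwing release is removed on the first pass.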

      A second fix would probably involve understanding whether this inode state is expected at all, and then either preventing the problematic inode from appearing or reworking the release function to take the case into account. I'm willing to help, but I'm completely clueless.

      I would love to report this directly to Apache; however, it is not clear to me how to categorize CDH4.2.1 over there.

        Attachments

        1. error_logging.patch
          2 kB
          Jean-Baptiste Note
        2. trace_on_bigger_file.txt
          1 kB
          Jean-Baptiste Note
        3. variables.txt
          3 kB
          Jean-Baptiste Note

            People

            • Assignee: Unassigned
            • Reporter: jbnote Jean-Baptiste Note
            • Votes: 0
            • Watchers: 3
