  1. CDH (READ-ONLY)
  2. DISTRO-492

Namenode stops responding to all IPC, due to very long loop in LeaseManager/checkLease (DOS of NN)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: CDH4.2.1, CDH4.4.0
    • Fix Version/s: None
    • Component/s: HDFS
    • Labels:
      None
    • Environment:
      CDH4.2.1, namenode in HA, HIVE

      Description

      Our Namenode is getting into a state where the checkLeases() function in LeaseManager.java spends an inordinate amount of time processing expired leases.

      Context: some of our Hive jobs are creating huge leases on startup (due to partitioning).

      When the job dies or takes a long time to run, its lease(s), which expire after one hour, are released, and releasing them takes a very long time. Because checkLeases() holds a write lock for the whole loop, the Namenode stops answering all IPC requests. After 45s in this loop, the ZKFC fails over automatically.

      We don't know yet whether a single lease is created with a huge number of paths in it, or whether multiple leases are expiring at the same time.
      In either case, our Hive job is effectively performing a DoS on the Namenode.

      A possible fix seems quite simple:
      1) Exit from lease expiration after a certain number of leases/paths have been processed
      2) Or implement finer-grained locking on the data structure to avoid holding such an important lock during the loop
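
      To make option 1) concrete, here is a minimal sketch of a bounded expiration pass. It is a simplified stand-in, not the actual LeaseManager code: the class BoundedLeaseChecker, the maxPathsPerPass budget and the Lease model are all hypothetical, and it assumes leases are queued in expiry order. The idea is only that each pass releases at most a fixed number of paths while holding the write lock, then returns so that other lock users can get in before the next pass.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Simplified model of a lease manager; NOT the real HDFS LeaseManager. */
class BoundedLeaseChecker {

    /** A lease: one holder plus the paths it has open for write. */
    static final class Lease {
        final String holder;
        final long expiresAtMillis;
        final Deque<String> paths;

        Lease(String holder, long expiresAtMillis, List<String> paths) {
            this.holder = holder;
            this.expiresAtMillis = expiresAtMillis;
            this.paths = new ArrayDeque<>(paths);
        }
    }

    // Fair lock (an assumption of this sketch) so waiting readers are not starved.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
    // Assumption: leases are appended in expiry order, oldest first.
    private final Deque<Lease> leasesByExpiry = new ArrayDeque<>();
    private final List<String> released = new ArrayList<>();
    private final int maxPathsPerPass; // hypothetical tuning knob

    BoundedLeaseChecker(int maxPathsPerPass) {
        this.maxPathsPerPass = maxPathsPerPass;
    }

    void addLease(Lease lease) {
        lock.writeLock().lock();
        try {
            leasesByExpiry.addLast(lease);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /**
     * Releases paths of expired leases, but at most maxPathsPerPass per call.
     * Returns true if the budget ran out while expired paths were still pending.
     */
    boolean checkLeases(long nowMillis) {
        lock.writeLock().lock();
        try {
            int processed = 0;
            while (!leasesByExpiry.isEmpty()) {
                Lease oldest = leasesByExpiry.peekFirst();
                if (oldest.expiresAtMillis > nowMillis) {
                    return false; // remaining leases have not expired yet
                }
                while (!oldest.paths.isEmpty()) {
                    if (processed >= maxPathsPerPass) {
                        return true; // budget spent: give up the lock
                    }
                    released.add(oldest.paths.pollFirst());
                    processed++;
                }
                leasesByExpiry.pollFirst(); // lease fully released
            }
            return false;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Snapshot of the paths released so far, in release order. */
    List<String> releasedPaths() {
        lock.readLock().lock();
        try {
            return new ArrayList<>(released);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

      In this sketch the monitor thread would call checkLeases() again after a short pause whenever it returns true, so a single huge lease is drained over several short lock holds instead of one long one. The right value for the budget is unknown to us and would need to be measured.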

      In addition, the loop is quite unoptimized: the NN spends most of its time there assembling Strings (about 75% of the time, according to JMX profiling during the event).

      Because 2) requires a good understanding of the codebase, we'll be limiting ourselves to 1), but any advice would be appreciated.


            People

            • Assignee:
              Unassigned
            • Reporter:
              jbnote Jean-Baptiste Note
            • Votes:
              2
            • Watchers:
              5
