Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: CDH4.1.2
    • Fix Version/s: None
    • Component/s: ZooKeeper
    • Labels:
      None
    • Environment:
      CentOS 6.3, small 5-node cluster used as a development environment, virtualized OS (KVM), CDH4.1.2 (MRv1, HA Quorum-based Storage)

      Description

      The discussion began at https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/aWsgI8ivp24

      The last two posts follow.
      Mine:
      Hi,

      Although I increased syncLimit, the problem occurred again.

      These are the current settings in zoo.cfg:
      maxClientCnxns=50

      # The number of milliseconds of each tick
      tickTime=2000
      # The number of ticks that the initial
      # synchronization phase can take
      initLimit=20
      # The number of ticks that can pass between
      # sending a request and getting an acknowledgement
      syncLimit=10
      # the directory where the snapshot is stored.
      dataDir=/var/lib/zookeeper
      # the port at which the clients will connect
      clientPort=2181
      # it is recommended that number of zookeeper servers is odd
      server.1=centos1:2888:3888
      server.2=centos2:2888:3888
      server.3=centos5:2888:3888
      # When enabled, ZooKeeper auto purge feature retains the autopurge.snapRetainCount most recent snapshots ...
      # ... and the corresponding transaction logs in the dataDir and dataLogDir respectively and deletes the rest.
      autopurge.snapRetainCount=10
      # The time interval in hours for which the purge task has to be triggered.
      # Set to a positive integer (1 and above) to enable the auto purging. Defaults to 0.
      autopurge.purgeInterval=1

      As you can see, I also increased autopurge.snapRetainCount to retain more snapshots.
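
For reference, the tick-based limits in the settings above translate into wall-clock timeouts. The sketch below is only an illustration: the function names are hypothetical, and it assumes that minSessionTimeout and maxSessionTimeout are not set in zoo.cfg, so the documented ZooKeeper defaults of 2 and 20 ticks apply.

```python
def parse_zoo_cfg(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def derived_timeouts_ms(settings):
    """Convert tick-based limits into milliseconds."""
    tick = int(settings["tickTime"])
    return {
        # Time a follower has to connect and sync to the leader.
        "init_timeout": tick * int(settings["initLimit"]),
        # Time a follower may lag behind the leader before it is dropped.
        "sync_timeout": tick * int(settings["syncLimit"]),
        # Assumption: defaults of 2 and 20 ticks when the keys are absent.
        "min_session_timeout": int(settings.get("minSessionTimeout", 2 * tick)),
        "max_session_timeout": int(settings.get("maxSessionTimeout", 20 * tick)),
    }


# Only the timing-related values from the zoo.cfg quoted above.
ZOO_CFG = """\
# The number of milliseconds of each tick
tickTime=2000
initLimit=20
syncLimit=10
"""

timeouts = derived_timeouts_ms(parse_zoo_cfg(ZOO_CFG))
for name, value in timeouts.items():
    print(f"{name}: {value} ms")
```

Under these assumptions, the server clamps any session timeout a client requests to the range between the minimum and maximum values computed here, which is relevant to the question about the client session timeout in the reply.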

      /usr/lib/zookeeper/bin/zkCli.sh
      ...
      get /hadoop-ha/touk-cluster-dev/ActiveStandbyElectorLock

      zk1: ephemeralOwner = 0x13af9601c0c000c
      zk2: ephemeralOwner = 0x13b515f811514f9
      zk5: ephemeralOwner = 0x13b515f811514f9
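
The three values above should be identical on a healthy ensemble, since ephemeralOwner is the session id that holds the lock znode. A minimal sketch of that comparison follows; the helper names are hypothetical, and the input is simply the values copied from the zkCli.sh output above rather than anything read from a live server.

```python
from collections import defaultdict


def group_by_owner(owners):
    """Map each distinct ephemeralOwner session id to the servers reporting it."""
    groups = defaultdict(list)
    for server, owner_hex in owners.items():
        groups[int(owner_hex, 16)].append(server)
    return dict(groups)


def is_consistent(owners):
    """True when every server reports the same ephemeralOwner."""
    return len(group_by_owner(owners)) <= 1


# Values as reported for /hadoop-ha/touk-cluster-dev/ActiveStandbyElectorLock.
OBSERVED = {
    "zk1": "0x13af9601c0c000c",
    "zk2": "0x13b515f811514f9",
    "zk5": "0x13b515f811514f9",
}

groups = group_by_owner(OBSERVED)
for owner, servers in groups.items():
    print(f"{owner:#x}: {', '.join(servers)}")
print("consistent:", is_consistent(OBSERVED))
```

Here zk1 reports a different owner than zk2 and zk5, so the servers disagree about which session holds the lock.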

      I wonder what to do next.
      I have all the logs and the ZooKeeper data directory ready to send; together they make a 24 MB compressed file.
      If you'd like to look at it, I can send it to you.

      Regards,

      • Marcin Smialek

      ########################################################
      And Yours:
      Sorry to hear that Marcin, can you create a CDH jira here and attach
      the logs and the datadir?
      https://issues.cloudera.org/secure/CreateIssue!default.jspa
      Nothing jumped out at me with your previous attachments. I'm hoping
      this recent one has something that allows me to determine the issue.

      I see that the following issue is in 3.4.4 upstream at Apache, however
      it's not included in CDH4.1.2 (which is 3.4.3 based with some 3.4.4
      fixes, but not all)
      https://issues.apache.org/jira/browse/ZOOKEEPER-1496
      There was a recent report (this morning) of a similar sounding issue
      upstream on the apache zk mailing list, however that user was using
      3.4.4. (so that rules out 1496 for that issue, so it might be
      something new?)

      Did you increase the client session timeout?

      I could try providing you a 3.4.5 (coming in cdh4.2) based ZooKeeper
      jar file, would it be possible for you to try that out? (it has 1496)
      i.e., replace your zookeeper.jar files on the servers and restart the
      servers?

      Patrick

            People

            • Assignee:
              phunt Patrick Hunt
              Reporter:
              msm Marcin Smialek