[DISTRO-448] HA - no active namenode, both standby - Cloudera Open Source

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: CDH4.1.2
Fix Version/s: None
Component/s: ZooKeeper
Labels:
None
Environment:
Centos 6.3, small 5-node cluster used as development environment, virtualized OS (KVM), CDH4.1.2 (MRv1, HA Quorum-based Storage)

Description

The discussion started at https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/aWsgI8ivp24

Last 2 posts
Mine:
Hi,

Although I increased syncLimit, the problem occurred again.

These are current settings for zoo.cfg:
maxClientCnxns=50

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=20
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=10
# the directory where the snapshot is stored.
dataDir=/var/lib/zookeeper
# the port at which the clients will connect
clientPort=2181
# it is recommended that number of zookeeper servers is odd
server.1=centos1:2888:3888
server.2=centos2:2888:3888
server.3=centos5:2888:3888

# When enabled, ZooKeeper auto purge feature retains the autopurge.snapRetainCount most recent snapshots ...
# ... and the corresponding transaction logs in the dataDir and dataLogDir respectively and deletes the rest.
autopurge.snapRetainCount=10

# The time interval in hours for which the purge task has to be triggered.
# Set to a positive integer (1 and above) to enable the auto purging. Defaults to 0.
autopurge.purgeInterval=1

As you can see, autopurge.snapRetainCount is increased to retain more snapshots.
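For reference, initLimit and syncLimit are counted in ticks, so the values above translate into wall-clock windows of tickTime * initLimit and tickTime * syncLimit. The following is a minimal sketch of that arithmetic, not part of the original report; the helper names `parse_zoo_cfg` and `limits_in_ms` are hypothetical, and `ZOO_CFG` is an abbreviated copy of the settings quoted above.

```python
def parse_zoo_cfg(text):
    """Parse key=value lines from zoo.cfg text, skipping comments and blanks."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def limits_in_ms(settings):
    """Convert the tick-based limits into milliseconds (hypothetical helper)."""
    tick = int(settings["tickTime"])
    return {
        "initLimit_ms": tick * int(settings["initLimit"]),
        "syncLimit_ms": tick * int(settings["syncLimit"]),
    }


# Abbreviated copy of the timing settings from the zoo.cfg above.
ZOO_CFG = """\
# The number of milliseconds of each tick
tickTime=2000
initLimit=20
syncLimit=10
"""

limits = limits_in_ms(parse_zoo_cfg(ZOO_CFG))
print(limits)  # {'initLimit_ms': 40000, 'syncLimit_ms': 20000}
```

With these settings a follower has 40 seconds to connect and sync with the leader initially, and may fall at most 20 seconds behind afterwards.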

/usr/lib/zookeeper/bin/zkCli.sh
...
get /hadoop-ha/touk-cluster-dev/ActiveStandbyElectorLock

zk1: ephemeralOwner = 0x13af9601c0c000c
zk2: ephemeralOwner = 0x13b515f811514f9
zk5: ephemeralOwner = 0x13b515f811514f9
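The point of the three values above is that zk1 reports a different owning session for the lock znode than zk2 and zk5 do. A minimal sketch of that comparison follows; it only parses the three lines as quoted here, and the helper name `group_by_owner` is hypothetical.

```python
from collections import defaultdict


def group_by_owner(report):
    """Group server labels by the ephemeralOwner session id each one reported."""
    owners = defaultdict(list)
    for line in report.splitlines():
        line = line.strip()
        if not line:
            continue
        server, _, rest = line.partition(":")
        key, _, value = rest.partition("=")
        if key.strip() != "ephemeralOwner":
            continue
        # Session ids are printed as hex, e.g. 0x13af9601c0c000c.
        owners[int(value.strip(), 16)].append(server.strip())
    return dict(owners)


# The values reported above, one line per ZooKeeper server.
REPORT = """\
zk1: ephemeralOwner = 0x13af9601c0c000c
zk2: ephemeralOwner = 0x13b515f811514f9
zk5: ephemeralOwner = 0x13b515f811514f9
"""

groups = group_by_owner(REPORT)
consistent = len(groups) == 1  # all servers should agree on one owner
for owner, servers in sorted(groups.items()):
    print(hex(owner), servers)
print("consistent:", consistent)  # consistent: False
```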

I wonder what to do next.
I have all the logs and the ZooKeeper data directory ready to send; compressed, they come to 24 MB.
If you'd like to look at it, I can send it to you.

Regards,

Marcin Smialek

########################################################
And yours:
Sorry to hear that Marcin, can you create a CDH jira here and attach
the logs and the datadir?
https://issues.cloudera.org/secure/CreateIssue!default.jspa
Nothing jumped out at me with your previous attachments. I'm hoping
this recent one has something that allows me to determine the issue.

I see that the following issue is in 3.4.4 upstream at Apache, however
it's not included in CDH4.1.2 (which is 3.4.3 based with some 3.4.4
fixes, but not all)
https://issues.apache.org/jira/browse/ZOOKEEPER-1496
There was a recent report (this morning) of a similar sounding issue
upstream on the apache zk mailing list, however that user was using
3.4.4. (so that rules out 1496 for that issue, so it might be
something new?)

Did you increase the client session timeout?

I could try providing you with a ZooKeeper jar file based on 3.4.5
(coming in CDH4.2); would it be possible for you to try that out? (It
has 1496.) I.e., replace your zookeeper.jar files on the servers and
restart the servers?

Patrick
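On the session timeout question: a ZooKeeper server does not accept the client's requested timeout as-is but clamps it between minSessionTimeout and maxSessionTimeout, which default to 2 and 20 times tickTime. The sketch below illustrates that clamping for the tickTime=2000 used in this cluster. The function name `negotiate_session_timeout` and the requested values are illustrative assumptions, not taken from the reporter's configuration. In a Hadoop HA setup the failover controller's requested timeout is, as far as I know, set through ha.zookeeper.session-timeout.ms.

```python
def negotiate_session_timeout(requested_ms, tick_time_ms, min_ms=None, max_ms=None):
    """Clamp a client's requested session timeout to the server's bounds.

    When minSessionTimeout / maxSessionTimeout are not configured, the
    documented defaults of 2 * tickTime and 20 * tickTime are assumed.
    """
    if min_ms is None:
        min_ms = 2 * tick_time_ms
    if max_ms is None:
        max_ms = 20 * tick_time_ms
    return max(min_ms, min(requested_ms, max_ms))


TICK_TIME_MS = 2000  # tickTime from the zoo.cfg in this report

# Illustrative requested timeouts (assumptions, not values from this cluster).
for requested in (1000, 5000, 60000):
    print(requested, "->", negotiate_session_timeout(requested, TICK_TIME_MS))
# 1000 -> 4000
# 5000 -> 5000
# 60000 -> 40000
```

So with this tickTime, asking for more than 40 seconds has no effect unless maxSessionTimeout is raised on the servers as well.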

Attachments

zk1_logs_and_data_dir_20121204.tar.gz, 8.02 MB, 04/Dec/12 7:04 PM
zk2_logs_and_data_dir_20121204.tar.gz, 7.92 MB, 04/Dec/12 7:06 PM
zk5_logs_and_data_dir_20121204.tar.gz, 7.95 MB, 04/Dec/12 7:10 PM

People

Assignee:

Patrick Hunt

Reporter:

Marcin Smialek

Votes:

0

Watchers:

2

Dates

Created:

04/Dec/12 6:58 PM

Updated:

11/Dec/12 1:26 AM