Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: CDH 5.4.7
Fix Version/s: None
Component/s: HBase
Labels: None
Environment: CDH 5.4.7, 70 nodes, 60,000 regions on HBase
Description
We had to restart our HBase cluster, and we ran into problems during startup:
The web UI (port 60010) shows that HMaster assigns regions very slowly, and the assignment seems to pause at some point.
On the backend, the HMaster instance exits after several minutes (about 10). The backup master then becomes active and continues the initialization, but it soon exits in the same way, so it is hard to complete the startup.
Other information:
1) CDH 5.4.7, 70 nodes, 60,000 regions on HBase, 16 GB HMaster heap size
2) When HMaster starts, the number of assigned regions shown on the web UI does not change for the first several minutes. The delay is very serious.
3) This is part of the HMaster exception log:
"FATAL org.apache.hadoop.hbase.master.HMaster: Failed to become active master
java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned"
See the attachment for the detailed log.
4) During startup, the region server web UI shows compaction tasks running on some tables.
5) The major compaction period had been set to a large value (30 days), so major compaction is effectively disabled in normal operation, and the HFiles of some tables have become huge (10 TB).
We guess this may be causing problems.
###########################################################################
After checking some references, we tried the following.
First, we changed some parameters:
1)hbase.master.executor.openregion.threads=10
2)hbase.master.executor.closeregion.threads=10
3)hbase.master.namespace.init.timeout>2400000
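For reference, the three settings above might be written in hbase-site.xml roughly as follows. This is only a sketch: the report gives the timeout as ">2400000", so the value 2400000 (milliseconds) below is a placeholder lower bound, not the exact value that was used.

```xml
<!-- Sketch of the hbase-site.xml entries for the parameters listed above. -->
<property>
  <name>hbase.master.executor.openregion.threads</name>
  <value>10</value>
</property>
<property>
  <name>hbase.master.executor.closeregion.threads</name>
  <value>10</value>
</property>
<property>
  <name>hbase.master.namespace.init.timeout</name>
  <!-- Assumed placeholder: the report only says the value was greater than 2400000 ms. -->
  <value>2400000</value>
</property>
```

The log in item 3 shows the master giving up after 300000 ms, so the last property is the one that directly addresses the "waiting for namespace table to be assigned" failure.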
Then we did some environment cleanup:
1) ZooKeeper: rmr /hbase
2) HDFS: hadoop fs -rm -r /hbase/WALs/ and hadoop fs -rm -r /hbase/oldWALs/
Things looked better, but the master failover still occurred: the active master worked for about 1 hour and then exited, and the backup master took over. After several such switches, the master startup finally completed, taking 6 hours in total.
###########################################################################
So, can anyone help? Thanks.