The following post is from Nicolas Liochon (Scaled Risk) and Devaraj Das (Hortonworks) with thanks to all members of the HBase team.
HBase is an always-available service and remains available in the face of machine failures and rack failures. Machines in the cluster runs RegionServer daemons. When a RegionServer crashes or the machine goes offline, the regions it was hosting goes offline as well. The focus of the MTTR work in HBase is to be able to detect abnormalities and to be able to restore access to (failed) offlined regions as early as possible.
In talking with customers and users, it turned out that MTTR for HBase regions is one of the significant concerns. A lot of improvements were implemented recently. In this blog post and a couple after this one, we will go over the work the HBase team in Hortonworks, and the community at large, has done, in the area of MTTR. We will also talk about some of them at HBaseCon 2013 in June.
This blog explains how HBase manages the MTTR. In this blog, we introduce some of the settings available in the released versions of HBase and HDFS.
How HBase is resilient to failures while being consistent
HBase ensures consistency by having a single server responsible for a subset of data. Namely, a region is managed by a single region server at a time.
The resiliency to failures comes from HDFS, as data written in HDFS is replicated on several nodes:
- HBase writes the data in HFiles, stored in HDFS. HDFS replicates the blocks of these files, by default 3 times.
- HBase uses a commit log (or Write-Ahead-Log, WAL), and this commit log is as well written in HDFS, and as well replicated, again 3 times by default.
Steps in the failure detection and recovery process
- Identifying that a node is down: a node can cease to respond simply because it is overloaded or as well because it is dead.
- Recovering the writes in progress: that’s reading the commit log and recovering the edits that were not flushed.
- Reassigning the regions: the region server was previously handling a set of regions. This set must be reallocated to other region servers, depending on their respective workload.
What are the pain points? Until the detection and recovery steps have happened, the client is blocked – a single major pain point! Expediting the process, so that clients see less downtime of their data while preserving data consistency is what MTTR is all about.
Detecting node failures
There are multiple ways for a region server to die: It can be a clean stop, i.e., the administrator calls the stop function on the region server. This allows the region server to properly close the regions and tell the HMaster that the shutdown is in progress. In this case the commit log is purged and the HMaster starts the assignment of the regions immediately.
Another way for the region server to stop, is the silent death of the computer, for example if the network card dies or if the ethernet cable is unplugged. In this case, the region server cannot raise an alarm. This is handled in HBase with the help of ZooKeeper: each region server is connected to ZooKeeper, and the master watches these connections. ZooKeeper itself manages an heartbeat with a timeout. So, on a timeout, the HMaster declares the region server as dead, and starts the recovery process.
Recovering in-progress writes
There is a single semantic commit log consisting of multiple files for all the user regions in a region server. When a region server dies, the recovery of the commit logs happens. The recovery is done in parallel, and as a first step, random region servers picks up commit logs (from the well known commit log directory), and splits them by edits-per-region into separate files on the HDFS. The regions are then reassigned to random region servers, and each regionServer then reads the edits from the respective log split file(s) to recover the correct region state. The difficulty arises when it’s not a simple process crash, but a node failure. The region server on the crashed node would have written the blocks locally on to the local DataNode (the standard HDFS client behavior). Assuming a replication factor of three, when a box is lost, you are losing not only a region server, but as well one of the three replicas. Doing the split means reading the block. As 33% of the replicas are dead, it means that for each block you’ve got 33% chance to be directed to the wrong replica. Moreover, the split process writes new files. Each of these files will be replicated 3 times: any of these replicas can be assigned to the dead datanode: the write will fail after a timeout, and will go to another datanode, slowing the recovery.
Assigning the regions
Here, the job is to reassign as fast as possible. Assignment relies on ZooKeeper, and requires synchronisation between the master and the region servers through ZooKeeper.
The MTTR improvements
Detecting node failures
First, it’s possible to lower the default timeout value. By default, HBase is configured with a 3 minutes ZooKeeper (ZK) timeout. This ensures that the Garbage Collection (GC) won’t interfere (GC pauses leads to ZK timeouts and lead to false failure detections). For production system, it’s more sensible to configure to one minute, or 30 seconds if you do care about MTTR. A reasonable minimum is around 20 seconds, even if there are some users who reported less. So you can change hbase.zookeeper.timeout to 60000 in hbase-site.xml. You’d also need to tweak your GC settings appropriately (incremental, generational GC with good figures for the young and old generations, etc., this is a topic by itself) so that you do not have the GC pauses longer than the ZK timeout.
Recovering in-progress writes
In standard cases, there are enough surviving region servers to split in parallel all the files of the commit log. So the issue is really to get directed to the only the live HDFS replicas. The solution for this is to configure HDFS in order to have a faster failure detection in HDFS than in HBase. That is, if in HBase you have a timeout of 60s, HDFS should consider a node as dead after 20 seconds. Here we must detail how HDFS handles dead nodes: HDFS failure detection relies as well on a heartbeat and timeout, managed by the NameNode. In HDFS, when a node is declared as dead, the replicas it contained are duplicated to the surviving datanodes. It’s an expensive process, and, when multiple nodes dies simultaneously, it can trigger “replication storms”: all the replicas are replicated again, leading to an overloaded system, then to non responding nodes, then to nodes being declared dead as well, then to new blocks being replicated, and so on. For this reason, HDFS waits a long time before starting this recovery process: a little bit more than 10 minutes. This is an issue for a low latency software such as HBase: going to dead datanodes means hitting timeouts. In the last HDFS versions 1.0.4 or 1.2, and branches 2 and 3, it’s possible to use a special state: ‘stale’. An HDFS node is stale when it has not sent a heartbeat for more than a configurable amount of time. A node in this state is used only as a last resort for reads, and excluded for writes. So activating these settings will make the recovery much faster. In HDFS 1.1, only the read path takes into account the stale status, but versions 1.2, 2.0 and 3.0 use it for both reads and writes.
<!-- stale mode - 1.2+ --> <property> <name>dfs.namenode.avoid.read.stale.datanode</name> <value>true</value> </property> <property> <name>dfs.namenode.avoid.write.stale.datanode</name> <value>true</value> </property> <property> <name>dfs.namenode.write.stale.datanode.ratio</name> <value>1.0f</value> </property> <!-- stale mode - branch 1.1.1+ --> <property> <name>dfs.namenode.check.stale.datanode</name> <value>true</value> </property>
Assigning the regions
This is pure HBase internals. In HBase 0.94+, the assignment process has been improved to allow to assign more regions with less internal synchronisation, especially in the master [example – Apache jira HBASE-7247].
There are no global failures in HBase: if a region server fails, all the other regions are still available. For a given data-subset, the MTTR was often considered as around ten minutes. This rule of thumb was actually coming from a common case where the recovery was taking time because it was trying to use replicas on a dead datanode. Ten minutes would be the time taken by HDFS to declare a node as dead. With the new stale mode in HDFS, it’s not the case anymore, and the recovery is now bounded by HBase alone. If you care about MTTR, with the settings mentioned here, most cases will take less than 2 minutes between the actual failure and the data being available again in another region server.