Tutorial on ZooKeeper – Part 1: Concepts and Terminologies

Overview

Apache ZooKeeper is a highly reliable distributed coordination service for distributed applications to coordinate with each other through a shared hierarchical name space, which is organized similarly to a standard file system path. The name space consists of data registers - called znodes, in ZooKeeper parlance - and these are similar to files and directories which can provide strictly ordered access to the znodes. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in-memory, which means ZooKeeper can achieve high throughput and low latency numbers. It is especially fast in “read-dominant” workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.

ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions. Subsequent operations can use the order to implement higher-level abstractions, such as synchronization primitives.

The performance aspects of ZooKeeper allow it to be used in large distributed systems. The reliability aspects prevent it from becoming the single point of failure in big systems. Its strict ordering allows sophisticated synchronization primitives to be implemented at the client.

From this part on, I will write a series of tutorials on ZooKeeper. Some concepts and terminologies are introduced here first.

What is ZooKeeper?

As the Official Site said,

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them ,which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.

ZooKeeper runs in Java and has bindings for both Java and C.

ZooKeeper is very fast and very simple, which is designed to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees. These are:

  • Sequential Consistency - Updates from a client will be applied in the order that they were sent.
  • Atomicity - Updates either succeed or fail. No partial results.
  • Single System Image - A client will see the same view of the service regardless of the server that it connects to.
  • Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
  • Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.

Learn more about ZooKeeper on the ZooKeeper Official Site.

Some Concepts and Terminology

Data model and the hierarchical namespace

Just as mentioned in the Overview, the name space provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (“/“). Every node in ZooKeeper’s name space is identified by a path, which always needs to start with the root znode ( “/“ ).

ZooKeeper's Hierarchical Namespace

Znode

Unlike standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children. You can create some sub-znodes/children znodes in the znode. It is like having a file-system that allows a file to also be a directory. (ZooKeeper was designed to store coordination data: status information, configuration, location information, etc., so the data stored at each node is usually small, in the byte to kilobyte range.) We use the term znode to make it clear that we are talking about ZooKeeper data nodes.

Note: Every znode must has a parent whose path is a prefix of the znode with one less element; the exception to this rule is root znode (“/“) which has no parent. Also, exactly like standard file systems, a znode cannot be deleted if it has any children.

Znodes maintain a stat structure that includes version numbers for data changes, acl changes. The stat structure also has timestamps. The version number, together with the timestamp allow ZooKeeper to validate the cache and to coordinate updates. Each time a znode‘s data changes, the version number increases. For instance, whenever a client retrieves data, it also receives the version of the data. And when a client performs an update or a delete, it must supply the version of the data of the znode it is changing. If the version it supplies doesn’t match the actual version of the data, the update will fail.

The command syntax for creating a znode is as follows:

1
create -[options] /[znode-name] [znode-data]

Example 1: Create a new znode named “znode_test” with data “znode_test_data”

The path consists of the root znode (“/“) and the name of the znode you want to create. Here you can write /znode_test for its path.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
[zk: localhost:2181(CONNECTED) 0] ls /
[zookeeper]
[zk: localhost:2181(CONNECTED) 1] create /znode_test znode_test_data
Created /znode_test
[zk: localhost:2181(CONNECTED) 2] get /newznode
znode_test_data
cZxid = 0x200000002
ctime = Wed Oct 14 05:19:00 EDT 2015
mZxid = 0x200000002
mtime = Wed Oct 14 05:19:00 EDT 2015
pZxid = 0x200000009
cversion = 1
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 6
numChildren = 1

The command about operations on the znodes, including creating, getting, deleting and etc, will be introduced in the following tutorials.

Example 2: Create a recursive znode named “znode_rtest3” with data “znode_rtest_data”

Only one important thing you have to keep it in mind, when you create a new recursive znode, the znodes/paths along have to be already created. Otherwise, exceptions will be thrown.

1
2
3
4
5
6
7
8
9
10
11
12
[zk: localhost:2181(CONNECTED) 3] ls /
[zookeeper, znode_test]
[zk: localhost:2181(CONNECTED) 4] create /znode_rtest1/znode_rtest2/znode_rtest3 znode_rtest_data3
Node does not exist: /znode_rtest1/znode_rtest2/znode_rtest3
[zk: localhost:2181(CONNECTED) 5] create /znode_rtest1 znode_rtest_data1
Created /znode_rtest1
[zk: localhost:2181(CONNECTED) 6] create /znode_rtest1/znode_rtest2/znode_rtest3 znode_rtest_data3
Node does not exist: /znode_rtest1/znode_rtest2/znode_rtest3
[zk: localhost:2181(CONNECTED) 7] create /znode_rtest1/znode_rtest2 znode_rtest_data2
Created /znode_rtest1/znode_rtest2
[zk: localhost:2181(CONNECTED) 8] create /znode_rtest1/znode_rtest2/znode_rtest3 znode_rtest_data3
Created /znode_rtest1/znode_rtest2/znode_rtest3

Different Types of Znodes

In ZooKeeper there are 3 types of znodes: persistent, ephemeral, and sequential.

  1. Persistent Znodes (Default)

    These are the default znodes in ZooKeeper. They will stay in the ZooKeeper server permanently, as long as any other clients (including the creator) leave it alone.

    1
    create /znode mydata
  2. Ephemeral Znodes

    Ephemeral znodes (also referred as session znodes) are temporary znodes. Unlike the persistent znodes, they are destroyed as soon as the creator client logs out of the ZooKeeper server. For example, let’s say client1 created eznode1. Once client1 logs out of the ZooKeeper server, the eznode1 gets destroyed.

    1
    createe /eznode mydata
  3. Sequential Znodes

    Sequential znode is given a 10-digit number in a numerical order at the end of its name. Let’s say client1 created a sznode1. In the ZooKeeper server, the sznode1 will be named like this:

    sznode0000000001

    If client1 creates another sequential znode, it would bear the next number in a sequence. So the next sequential znode will be called [znode-name]0000000002.

    1
    create –s /sznode mydata

ACL

ACL (Access Control List) is basically an authentication mechanism implemented in ZooKeeper. It makes znodes accessible to users, depending on how it is set. This part will be introduced in the following tutorials.

Ensemble and Quorum

ZooKeeper service can be replicated over a sets of hosts called an ensemble. As long as a majority of the ensemble are up, the service will be available. For the ZooKeeper service to be active, there must be a majority of non-failing machines that can communicate with each other. Failure in this context means a machine crash, or some error in the network that partitions a server off from the majority. To create a deployment that can tolerate the failure of F machines, you should count on deploying 2xF+1 machines. Thus, a deployment that consists of three machines can handle one failure, and a deployment of five machines can handle two failures. Note that a deployment of six machines can only handle two failures since three machines is not a majority. For this reason, ZooKeeper deployments are usually made up of an odd number of machines. Three ZooKeeper servers is the minimum recommended size for an ensemble, and we also recommend that they run on separate machines.

A replicated group of servers in the same application is called a quorum. All servers in the quorum have copies of the same configuration file. QuorumPeers will form a ZooKeeper ensemble. A quorum is represented by a strict majority of nodes. You can have one node in your ensemble, but it won’t be a highly available or reliable system. If you have two nodes in your ensemble, you would need both to be up to keep the service running because one out of two nodes is not a strict majority. If you have three nodes in the ensemble, one can go down, and you would still have a functioning service (two out of three is a strict majority).

If a quorum of nodes are not available in an ensemble, the ZooKeeper service is nonfunctional.

Reference