Synchronizing Clocks In a Cassandra Cluster Pt. 1 - The Problem

Cassandra is a highly-distributable NoSQL database with tunable consistency. What makes it highly distributable makes it also, in part, vulnerable: the whole deployment must run on synchronized clocks.

It’s quite surprising that, given how crucial this is, it is not covered sufficiently in literature. And, if it is, it simply refers to installation of a NTP daemon on each node which – if followed blindly – leads to really bad consequences. You will find blog posts by users who got burned by clock drifting.

In the first installment of this two part series, I’ll cover how important clocks are and how bad clocks can be in virtualized systems (like Amazon EC2) today. In the next installment, coming out next week, I’ll go over some disadvantages of off-the-shelf NTP installations, and how to overcome them.

About clocks in Cassandra clusters

Cassandra serializes write operations by time stamps you send with a query. Time stamps solve an important serialization problem with inherently loosely coupled nodes in large clusters. At the same time however, time stamp are its Achilles’ heel. If system clocks runs off each other, so does time stamps of write operations and you are about to experience inexplicable data inconsistencies. It is crucial for Cassandra to have clocks right.

Boot-time system clock synchronization is not enough unfortunately. No clock is the same and you will eventually see clock drifting, i.e. growing difference among clocks in the system. You have to maintain clock synchronization continually.

It is a common misconception that clocks on virtual machines are somewhat resistant against clock drifting. In fact, virtual instances are especially prone to, even in a dramatic way, if the system is under heavy load. On Amazon EC2, you can easily observe drift about 50ms per day on unloaded instance and seconds per day on a loaded instance.

How much clocks need to be synchronized? It depends on your type of work load. If you run read-only or append-only queries, you are probably fine with modestly synchronization. However if you run concurrent read-update queries it’s starting to be serious. And if you do so because of API calls or concurrent job processing, it’s critical down to milliseconds.

Unfortunately, there is great, off-the-shelf ready solution. Why unfortunately?

Network Time Protocol

Network Time Protocol (NTP) gets the time from external time source in the network and propagates it further down the network. NTP uses hierarchical tree-like topology, where each layer is referred to as “clock strata”, starting with Stratum 0 as the authoritative time source, and continuing with Strata 1, 2, etc. Nodes which synchronizes clocks with nodes on Stratum n become nodes on Stratun n+1. NTP daemon sends time queries periodically to specified servers, adjusts the value to network latency associated with the message transmission, and re-adjusts the local clock to the calculated time. Running NTP daemon will help to avoid clock drifting especially on loaded machines.

In order to make NTP work you need to specify a set of servers where the current time will be pulled from. NTP servers may be provided by your network supplier, or you can use publicly available NTP servers. The best list of available public NTP servers is NTP pool project where you can also find best options for you geographical region. It is a good thing to use this pool. You should not use NTP servers without consent of the provider.

How to install NTP daemon

Installing NTP daemon is as simple as:

aptitude install ntpd

aptitude install ntpd

and it works immediately. That’s because it is pre-configured to use a standard pool of NTP servers. If you look at /etc/ntp.conf you will see servers defined with the server parameter, for example:

server 0.debian.pool.ntp.org iburst
server 1.debian.pool.ntp.org iburst
server 2.debian.pool.ntp.org iburst
server 3.debian.pool.ntp.org iburst

server 0.debian.pool.ntp.org iburst
server 1.debian.pool.ntp.org iburst
server 2.debian.pool.ntp.org iburst
server 3.debian.pool.ntp.org iburst

This is default for Debian systems, you may see a slightly different list in your distribution. The iburst parameter is there for optimization. If you want to check how NTP daemon works, run the following command: ntpq -p. You will get a list similar to this one:

remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*dns1.dns.imagin 213.130.44.252   3 u   17   64    7    1.979    0.035   0.235
-eu-m01.nthweb.c 193.1.219.116    2 u   19   64    7    1.064    9.067   0.094
+tshirt.heanet.i .PPS.            1 u   15   64    7    3.276   -0.193   0.066
+ns0.fredprod.co 193.190.230.65   2 u   15   64    7    0.818   -0.699   8.112

remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*dns1.dns.imagin 213.130.44.252   3 u   17   64    7    1.979    0.035   0.235
-eu-m01.nthweb.c 193.1.219.116    2 u   19   64    7    1.064    9.067   0.094
+tshirt.heanet.i .PPS.            1 u   15   64    7    3.276   -0.193   0.066
+ns0.fredprod.co 193.190.230.65   2 u   15   64    7    0.818   -0.699   8.112

It shows you the list of servers it synchronizes to, its reference, stratum,
synchronization periods, response delay, offset from the current time and jitter.

NTP uses optimizing algorithm which selects the best source of current clock as well as a working set of servers it takes into account. Node marked with “*” is the current time source. Nodes marked with “+” are used in the final set. Nodes marked with “-” are discarded by the algorithm.

You can restart NTP daemon with

service ntpd restart

service ntpd restart

and watch grabbing a different set of servers, selecting the best source and gradually increasing the period when servers are contacted when the clock gets stabilized.
Works like a charm.

Why not to just install NTP daemon on each node

If NTP works so great out of the box, why not simply install it on all boxes? In fact, this is exactly the advice you commonly get for cluster setup.

With respect to Cassandra, it’s the relative difference among clocks that matters, not their absolute accuracy. By default NTP will sync against a set of random NTP servers on the Internet which will result in synchronization of absolute clocks. Therefore the relative difference of clocks in the C* cluster will depend on how clocks are synchronized to absolute values from several randomly chosen public servers.

Look at the (real) example output from the ntpq command, the offset column. The difference among clocks is about 0.1ms, 0.5ms, but there is also an outlier with 9ms difference. Synchronization to the millisecond is a reasonable requirement, which requires one to synchronize absolute clocks to 0.5ms after/before boundary.

How precise, in absolute values, are public NTP servers? We ran a quick check of 560 randomly chosen public NTP servers from the public pool. The statistics are:

11% are below 0.5ms drift
15% are below 1ms drift
62% are below 10ms drift
11% are below 100ms drift

There are also outliers, with one being off by multiple hours.

Assuming: (1) our checks are representative, (2) each NTP daemon picks up 4 random NTP servers, and (3) synchronizing to the second best option (this is optimistic) these are the probabilities of our cluster clocks being off:

Nodes       5         10        25        100
95%          2.489     5.180     9.349    19.723
50%          7.122    10.892    18.872    44.394
25%         10.917    16.969    30.855    54.197
10%         18.584    30.291    45.311    66.942

Nodes       5         10        25        100
95%          2.489     5.180     9.349    19.723
50%          7.122    10.892    18.872    44.394
25%         10.917    16.969    30.855    54.197
10%         18.584    30.291    45.311    66.942

How to read it: assume a cluster of 25 nodes, then with the probability of 50% there will be two nodes with clock difference of more than 18.8ms.

The results may be surprising – even in a small cluster of 10 nodes they will be off by more than 10.9ms half of the time, and with a probability of 10% it will be off by more than 30ms.