Saturday, June 28, 2014

CLLElectedUnlink - A Lock-Free List with Relaxed Atomics (part 1)

In this post we cover a new type of concurrent linked list with lock-free operations for searching, insertion, and removal, i.e. a new Lock-Free Linked List. We named it CLLElectedUnlink (Concurrent Linked List with Elected Unlink), and you can get a Java implementation from github here:

Main components

This linked list uses two main techniques:

1) The first one is the Election Pattern described in the previous post.
In the CLLElectedUnlink, the remove() function marks nodes as removed but doesn't physically unlink them from the linked list immediately afterwards. The unlinking is done lazily by a single thread chosen to perform this task, using the Election Pattern. This approach greatly simplifies the unlinking because, while the operation is ongoing, the elected thread knows that no other thread can be modifying any of the nodes involved in the operation, thanks to the Election mechanism.

2) The second one is Relaxed Atomics.
If you don't know what those are, then start by taking a look at this great presentation from Herb Sutter on the C++11 and C11 Memory Model and Atomics (a lot of it also applies to Java, and almost all applies to D).
The usage of relaxed atomics allows for the traversal of the linked list with O(1) number of acquire barriers, assuming new insertions on the list are nearly zero.
On all lock-free linked lists I've seen so far, doing the traversal of the linked list requires (at least) one acquire barrier per node that is traversed.
For example, Java's CLQ needs to do a volatile load every time it reads Node.item or Node.next.
If you saw Herb Sutter's presentation, then I'm sure you will remember that on x86 architectures the usage of relaxed atomics for loads brings no gain in terms of performance (see the presentation to understand why), but on other architectures, like ARM, there may well be a difference in performance.
I haven't had a chance to test this on ARM yet, but just in case, don't expect any miraculous performance improvements. Remember that the cache-coherence system still has to do its job; it's just that there is more flexibility in how and when it does it.

Ok, so let us start from the beginning, using Java code as the example, but you can easily port this algorithm to other languages using our "Rosetta Stone of Atomics".


The first thing to look at is the Node of the CLLElectedUnlink. The Node has three members named "item", "next" and "state":

static class Node<E> {
    final E item;
    Node<E> next;          // relaxed atomic (not volatile)
    volatile int state;
    Node(E item) {
        this.item = item;
        this.next = null;
        this.state = INUSE;
    }
}
- "item" is a reference to the object that is associated with the node. In Java's ConcurrentLinkedQueue (CLQ) it is a volatile instead of a final (or const in C++1x or C11), the reason being that setting it to null marks the node as logically removed. In the CLLElectedUnlink we use a final qualifier because once a node has been assigned to an object it will never be assigned to a different one (no node re-usage). The difference from the CLQ is that we don't want an acquire barrier when we traverse the linked list and, on each node, check whether the item matches the one we're looking for. What this means is that in CLQ.contains() and CLQ.remove(), for each node that is traversed, two acquire barriers are done: one when Node.item is read to compare against the object being searched for, and another when reading the next node from Node.next.

- "next" is a reference to the next node on the list, or null if this is the last node. On CLQ.Node the "next" is a volatile because it can change at any time, if the next node in the list is removed and unlinked. In the CLLElectedUnlink it is a relaxed atomic because we want to avoid doing the acquire barrier.
Notice that on Java all references have an atomicity guarantee and, therefore, work (kind of) like the C++1x and C11 relaxed atomics. In a C++1x implementation of the CLLElectedUnlink we would have to use the atomic<> qualifier on the next member and then read it using std::memory_order_relaxed.
The trick with using a non-volatile next in the CLLElectedUnlink is that whenever we read next and it is null, we have to re-read it with an acquire barrier, or put in Java terminology, do a volatile load. This guarantees that we get the most up-to-date value of next (and implicitly for any of the following nodes, if they exist), and if the newly read value is again null, then we know for sure that this node is indeed the last node on the list.

- "state" is a two state volatile variable that indicates whether this node has been logically removed or not. In CLQ this two-state is done by setting the item to null or non-null, but in the CLLElectedUnlink we don't want to have change item to a volatile because it would add an acquire barrier, so we have to create a new variable which we named state. This variable starts by being "INUSE" when the node is first created, and is set to "REMOVED" when the node is marked as removed.


Traversing the linked list is very simple. It's just a while loop that goes through the nodes searching for a matching item.
The traversal starts from the beginning of the linked list at head. The head variable is a volatile (in C++1x terminology, it is read with an acquire barrier), and this guarantees that all globally visible modifications in the linked list will be seen, at least up until the moment head was read.
The real trick is done when reading the next field, which is not volatile. We read it with a relaxed load and, if it is null, we read it one more time, this time as a volatile load (with an acquire barrier), to make sure we get the most up-to-date value. If it is still null, then we know this is the last node in the list; otherwise we keep traversing.

One thing to keep in mind is that, if the item on a node matches the item we're searching for, then we need to read the state variable, which implies an acquire barrier. If we do a lot of removal/insertion of the same item, then for a certain time window there can be multiple nodes with that item, where all (except maybe one) are removed. Searching for that item then means reading the state variable of each of those nodes to check whether it is removed, thus doing an acquire barrier each time this happens. This is a very rare situation and, even then, the number of acquire barriers should not be too high.
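The traversal described above can be sketched as follows. This is an illustration, not the actual github code: it models next as a Java 9+ AtomicReference, where getPlain() stands in for the relaxed load and get() for the volatile (acquire) load; the original 2014 code used a plain field plus sun.misc.Unsafe for the volatile re-read, and all names here are ours.

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified sketch of the CLLElectedUnlink traversal. getPlain() stands in
// for a relaxed load, get() for a volatile (acquire) load.
class TraversalSketch<E> {
    static final int INUSE = 0, REMOVED = 1;

    static class Node<E> {
        final E item;                 // final: read with no acquire barrier
        final AtomicReference<Node<E>> next = new AtomicReference<>(null);
        volatile int state = INUSE;   // read only when the item matches
        Node(E item) { this.item = item; }
    }

    final AtomicReference<Node<E>> head;

    TraversalSketch(E firstItem) {
        head = new AtomicReference<>(new Node<>(firstItem));
    }

    public boolean contains(E item) {
        Node<E> node = head.get();    // one acquire barrier to start
        while (node != null) {
            // "state" is read (acquire barrier) only on a matching item
            if (item.equals(node.item) && node.state == INUSE) return true;
            Node<E> next = node.next.getPlain();   // relaxed load
            if (next == null) {
                next = node.next.get();            // re-read with acquire
                if (next == null) return false;    // truly the last node
            }
            node = next;
        }
        return false;
    }
}
```

Notice that a non-null relaxed read of next never needs the acquire barrier: a stale-but-non-null value still points somewhere forward in the list.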


Now comes the tricky part. The major difficulty in implementing lock-free lists lies in the many corner cases that may occur when trying to physically remove a node from the list. Much of the complexity of Harris's linked list comes from this.

In the CLLElectedUnlink, we bypass most of these difficulties by using the Election Pattern to do the physical unlinking from the list. The main trick is that the remove() will only mark the node as logically removed, but it will be up to the unlink() function to do the physical unlinking.
The unlink() function contains an election mechanism so that only one thread at a time can execute unlinking procedures.
In our implementation, we call the unlink() from the remove(), but it can be called directly as well. This is useful if your application does removals in batches and once they finish you can call the unlink() directly, thus physically unlinking all removed nodes in a single traversal of the list.

One thing to notice about next is that it has the following state machine:

When a node is first created, its next is null, and the only way to change it is via an add() operation when it inserts a new node. The insertion of the new node is done with a CAS, guaranteeing that it is done exactly once. Notice that after next transitions from null to a reference to a node in the list, it will never be null again... ever! The only way to change a non-null next is to assign it to another non-null node, and this is done by the unlinking operation. Thanks to the Election Pattern, one and only one thread at a time can execute the unlinking operation, which means the store can be done with a relaxed atomic.
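A minimal add() following this null-to-non-null state machine might look like the sketch below. It is a self-contained illustration with our own names (sentinel head, AtomicReference next), not necessarily the actual github implementation; compareAndSet() provides the CAS.

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative insertion sketch: "next" only ever transitions from null to
// non-null, exactly once, enforced by the CAS.
class InsertSketch<E> {
    static class Node<E> {
        final E item;
        final AtomicReference<Node<E>> next = new AtomicReference<>(null);
        Node(E item) { this.item = item; }
    }

    final Node<E> head = new Node<>(null);  // sentinel node, never removed

    public void add(E item) {
        final Node<E> newNode = new Node<>(item);
        Node<E> node = head;
        while (true) {
            Node<E> next = node.next.get();   // volatile load
            if (next != null) {
                node = next;                  // not the last node yet
            } else if (node.next.compareAndSet(null, newNode)) {
                return;                       // null -> newNode, done once
            }
            // CAS failed: another thread inserted first; the loop re-reads
            // next (now non-null) and advances towards the new tail
        }
    }
}
```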

One of the tricks of the unlink() is that it will never try to unlink the last node on the list, i.e. the node whose next is null. Doing so would risk unlinking a node right after another thread has inserted a new node after it, and the new node would disappear, thus giving errors, i.e. the happens-before relation would no longer hold for insertions.

Another trick of the unlink is that it always sets next to reference a node that is further forward on the list. Due to the relaxed atomics, different threads may see different values of next, but it will always be a node further ahead on the list, and only nodes that have been marked as removed can be skipped, so there is no risk of a list traversal bypassing a valid node.
For example, suppose we have 5 nodes on the list, from A to E:

and now three different threads have marked nodes B, C and D to be removed, and one thread then goes and unlinks them all, first by making A.next point to C, then A.next point to D, and then A.next point to E. Notice that there is no guarantee on ordering or visibility for relaxed atomics, so these stores may be seen in any order, by some threads and not by others:

The invariant still holds that next can only reference nodes that are further forward on the list.
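Putting the two tricks together, a remove() plus elected unlink() could be sketched like this. All names (electedGuard, setPlain(), the sentinel head) are our illustrative choices, not necessarily the actual github code, and for simplicity the traversals here use plain volatile loads rather than the relaxed-load optimization:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of remove() plus the elected unlink(). Only the thread that wins
// the CAS on electedGuard runs the unlinking loop, so its stores to "next"
// can be relaxed (setPlain). It never unlinks the last node (next == null),
// and every store moves "next" forward, skipping only REMOVED nodes.
class UnlinkSketch<E> {
    static final int INUSE = 0, REMOVED = 1;

    static class Node<E> {
        final E item;
        final AtomicReference<Node<E>> next = new AtomicReference<>(null);
        volatile int state = INUSE;
        Node(E item) { this.item = item; }
    }

    final Node<E> head = new Node<>(null);   // sentinel, never removed
    final AtomicBoolean electedGuard = new AtomicBoolean(false);

    public boolean remove(E item) {
        for (Node<E> n = head.next.get(); n != null; n = n.next.get()) {
            if (item.equals(n.item) && n.state == INUSE) {
                n.state = REMOVED;   // logical removal only
                unlink();            // lazy physical unlink (may do nothing)
                return true;
            }
        }
        return false;
    }

    public void unlink() {
        if (!electedGuard.compareAndSet(false, true)) return;  // election
        try {
            Node<E> prev = head;
            Node<E> node = prev.next.get();
            while (node != null) {
                Node<E> next = node.next.get();
                if (node.state == REMOVED && next != null) {
                    prev.next.setPlain(next);  // relaxed store, skips node
                } else {
                    prev = node;   // keep INUSE nodes and the last node
                }
                node = next;
            }
        } finally {
            electedGuard.set(false);
        }
    }
}
```

Calling unlink() directly (not just from remove()) is what enables the batch usage mentioned above: mark many nodes as REMOVED, then pay for a single unlinking traversal.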

Next week we will continue talking about the CLLElectedUnlink, so stay tuned.

Saturday, June 21, 2014

Elected Pattern

Enough about locks for the time being. Today we're going to talk about a concurrency technique named the "Elected Pattern".
It is based on a very simple idea that is known in distributed systems as "Leader Election", where one node is assigned a special "role", like for example, the leader or server in a network of nodes.

When applied to concurrency, we use threads instead of nodes, and tasks instead of roles, but the idea is pretty much the same.
Up until Andreia showed me one of her data structures using it more than a year ago, I hadn't seen this being applied to a concurrent data structure, but since then, we found it is used on Java 8's LongAdder, so we're not the only ones using this technique. If you're interested in the details, then take a look at Striped64.longAccumulate() (which is called from LongAdder), and search for cellsBusy and casCellsBusy().

In the Elected pattern, a single thread is selected to perform a special task. Here is some sample code in Java:

final AtomicBoolean electedGuard = new AtomicBoolean(false);

public void functionA() {
    doSomethingThreadSafe();
    if (electedGuard.compareAndSet(false, true)) {
        specialTask();
        electedGuard.set(false);
    }
}

And yes, I know we can use a try/finally block, but forget about that kind of stuff for a moment. If functionA() is called by multiple threads, all those threads will be calling doSomethingThreadSafe(), which hopefully, as the name indicates, is actually thread-safe to call. Only one of those threads will be able to set electedGuard to true using the CAS operation and, therefore, there will be a single thread calling specialTask().

I know what you're thinking:
You're a liar because you said you would stop blabbering about locks, and yet, here is one, staring us right in the face!
Indeed, the Election pattern uses a kind of lock, and the code above could even be replaced with a tryLock() of a mutex.

There is one very very important thing to be aware of though, and that is, that the functionA() is non-blocking!

Let me repeat it, because most people miss it completely: The Election pattern, by itself, is Wait-Free Population Oblivious.

Once again, I know what you're thinking:
Wait, what?!?
How can something have "a lock" and still be non-blocking?
That doesn't make any sense!

Progress Conditions are all about the (maximum) number of operations it takes to complete a task. For example, if you say that the function doSomethingThreadSafe() is lock-free, then what will the progress condition be for functionA() ?
The answer is, lock-free as well, under the assumption that specialTask() is only called from this particular code path, and that it completes when called in isolation.
When a thread calls functionA(), it will execute doSomethingThreadSafe() and then try to CAS the electedGuard from false to true. If it fails the CAS, then its job is done and it will return. If it succeeds in the CAS, then it will be the only thread to call specialTask(), at least until the call to specialTask() returns and it sets electedGuard back to false. Notice that there is only one thread at a time calling specialTask() so, by itself, it doesn't imply a blockage.
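Here is a self-contained version of that idea, with counters added purely to make the behavior observable; doSomethingThreadSafe() and specialTask() are stand-ins for whatever your application does:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Self-contained Elected pattern sketch. The counters are only here to make
// the behavior observable; the method bodies are illustrative stand-ins.
class ElectedExample {
    final AtomicBoolean electedGuard = new AtomicBoolean(false);
    final AtomicInteger calls = new AtomicInteger();
    final AtomicInteger specialCalls = new AtomicInteger();

    void doSomethingThreadSafe() { calls.incrementAndGet(); }

    void specialTask() { specialCalls.incrementAndGet(); }

    public void functionA() {
        doSomethingThreadSafe();       // every caller makes progress here
        if (electedGuard.compareAndSet(false, true)) {
            try {
                specialTask();         // at most one thread at a time in here
            } finally {
                electedGuard.set(false);
            }
        }
        // A thread that loses the CAS simply returns: no blocking, no waiting
    }
}
```

A thread that finds electedGuard already true still completes functionA() in a finite number of steps, which is exactly why the pattern itself is non-blocking.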

However (and this is a big "however"), even though functionA() will have the same progress condition as doSomethingThreadSafe(), it can happen that if a thread blocks/sleeps/dies after setting electedGuard to true, during specialTask(), then no other thread will be able to execute the specialTask() method for the remainder of the lifetime of the application.
This implies that the Elected pattern is not fault tolerant.

By this time, you may be thinking:
Whoooaaaa, wait a minute!
Are you saying that functionA() is lock-free but not fault tolerant?
I thought that lock-freeness implied some kind of resilience to faults!

For most lock-free and wait-free algorithms, the assumption that some kind of "fault tolerance" is provided is correct, but this is not a universal guarantee. Some lock-free algorithms are not tolerant to faults, one example being this one, the Elected Pattern.

It is not obvious why this is wait-free, and a lot of people get this one wrong. I've seen researchers with years of experience in concurrency say that the Election pattern is "Blocking", which is not correct.
For those of you not convinced, my suggestion is to look at "The Art of Multiprocessor Programming", page 99, and read very carefully the first two lines of the third paragraph, where the definition of wait-free and lock-free are provided. You will notice that nowhere does it mention anything about faults, or fault-tolerance, or crashes. Here's what it says:
(...) A concurrent object implementation is wait-free if each method call finishes in a finite number of steps. A method is lock-free if it guarantees that infinitely often some method call finishes in a finite number of steps. (...)

In the example above, functionA() will complete as soon as doSomethingThreadSafe() completes, regardless of any thread that may have suspended, or blocked, or crashed while "holding the lock" on electedGuard. If doSomethingThreadSafe() is wait-free, then no matter what is going on with the electedGuard or inside specialTask(), the method functionA() will return after executing a finite number of steps, assuming that specialTask() doesn't have any infinite loop or similar bug.

I know most people despise locks (mutual exclusion or rw-locks), but the truth is that they are simple and powerful concurrency constructs. No one can deny that locks can be nightmarish to use in a lock-based application design, but if you use a lock at just the right place, for example, as part of a more sophisticated synchronization mechanism or data structure, then the results can be surprising.
Good examples of this are the Double Instance Locking pattern, which is basically one mutex and two rw-locks "stitched together" to provide lock-free read accesses, and this one, the Elected Pattern, where with a simple mutex we can get wait-free properties.

I know this doesn't look like much, but on the next post we will show a lock-free linked list that uses the Elected Pattern to solve a complicated problem: the unlinking of removed nodes.

Sunday, June 15, 2014

A Rosetta Stone for Atomics

C11 and C++11 programmers know it as "atomic", D programmers know it as "shared", and Java programmers know it as "volatile", but semantically speaking, they are very similar. The inner workings of the memory model and the atomics in each of these languages can differ significantly, especially between Java and the other ones, but if you know the semantics of one of them, then you know the semantics of all of them.

Let's look at the atomics API for the languages C11, C++1x, D, and Java. This is a kind of "Rosetta Stone" of the different languages, which should come in handy for those of you wishing to translate a piece of concurrent code from one language to another.

Due to lack of space, we start by comparing the basic API for C11, C++1x and D:

Two things to notice on this board are that D has more "boiler plate" code than C11 or C++1x, and that there is no atomic addition in D (am I missing something?), so it has to be implemented using a CAS loop.
As is to be expected, there are a lot of similarities between these three languages, but when we get to Java it starts to get a bit more complicated.
First, there is no single API for atomics, but kind of three APIs:

The first thing that pops out is that using relaxed atomics with the Atomic* classes or volatile is currently not possible without resorting to sun.misc.Unsafe.
Another detail of Java is that I'm not sure what the memory model defines in terms of race conditions for "relaxed atomics" (it doesn't even mention such a thing), while in C++1x, C11 and D it is well defined, and this is important, as we will see in a few posts.
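To make some of the Java flavors concrete, the sketch below drives the same kind of counter through a volatile field, an AtomicIntegerFieldUpdater on that field, and an AtomicInteger. This is only an illustration of standard APIs, not a substitute for the elided comparison table, and it deliberately leaves out the sun.misc.Unsafe relaxed accesses since that is not a public API:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Three standard ways to get atomic behavior in Java: a volatile field
// (loads/stores only, no CAS), a field updater layered on that volatile
// field (adds CAS and fetch-and-add), and an AtomicInteger object.
class JavaAtomicsFlavors {
    volatile int v = 0;   // volatile field: barriers on load/store, no CAS

    static final AtomicIntegerFieldUpdater<JavaAtomicsFlavors> V_UPDATER =
        AtomicIntegerFieldUpdater.newUpdater(JavaAtomicsFlavors.class, "v");

    final AtomicInteger a = new AtomicInteger(0);

    void demo() {
        v = 1;                                  // volatile store
        int observed = v;                       // volatile load
        V_UPDATER.compareAndSet(this, observed, 2);  // CAS on the field
        V_UPDATER.getAndAdd(this, 10);          // atomic fetch-and-add

        a.set(1);                               // volatile store
        a.compareAndSet(1, 2);                  // CAS
        a.getAndAdd(10);                        // atomic fetch-and-add
    }
}
```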

I'm not an expert in all of these languages, so if you see anything missing or wrong, please let me know in the comments section.

If you need more info you can probably find it on one of these links: