Saturday, June 28, 2014

CLLElectedUnlink - A Lock-Free List with Relaxed Atomics (part 1)

In this post we are going to cover a new type of concurrent linked list with lock-free operations for searching, insertion, and removal, i.e. a new Lock-Free Linked List. We named it CLLElectedUnlink (Concurrent Linked List with Elected Unlink), and a Java implementation is available on GitHub.

Main components

This linked list uses two main techniques:

1) The first one is the Election Pattern described in the previous post.
In the CLLElectedUnlink, the remove() function marks nodes to be removed but doesn't physically unlink them from the linked list immediately afterwards. The unlinking process is done in a lazy way by a thread chosen to perform this task, using the Election Pattern. This approach greatly simplifies the process of unlinking because while this operation is ongoing, it knows that no other thread can be modifying any of the nodes involved in the operation, thanks to the Election mechanism.

2) The second one is Relaxed Atomics.
If you don't know what those are, then start by taking a look at this great presentation from Herb Sutter on the C++11 and C11 Memory Model and Atomics (a lot of it also applies to Java, and almost all applies to D).
The usage of relaxed atomics allows traversal of the linked list with an O(1) number of acquire barriers, assuming the rate of new insertions on the list is close to zero.
In all lock-free linked lists I've seen so far, traversing the list requires at least one acquire barrier per node that is traversed.
For example, Java's CLQ needs to do a volatile load every time it reads Node.item or Node.next.
If you saw Herb Sutter's presentation, you will remember that on x86 architectures the usage of relaxed atomics for loads will have no gain in terms of performance (see the presentation to understand why), but on other architectures, like ARM, there may well be a difference in performance.
I haven't had a chance to test this on ARM yet, but just in case, don't expect any miraculous performance improvements. Remember that the cache-coherence system still has to do its job; it's just that there is more flexibility in how and when it does it.

OK, so let us start from the beginning, using the Java code as an example; you can easily port this algorithm to other languages using our "Rosetta Stone of Atomics".


The first thing to look at is the Node of the CLLElectedUnlink. The Node has three members named "item", "next" and "state":

static class Node<E> {
    final E item;
    Node<E> next;
    volatile int state;

    Node(E item) {
        this.item = item;
        this.next = null;
        this.state = INUSE;
    }
}
- "item" is a reference to the object that is associated with the node. In Java's ConcurrentLinkedQueue (CLQ) it is a volatile instead of a final (or const in C++1x or C11), the reason being that setting it to null marks the node as logically removed. In the CLLElectedUnlink we use a final qualifier because once a node has been assigned to an object it will never be assigned to a different one (no node re-usage). The difference from the CLQ is that we don't want an acquire barrier when we traverse the linked list and check, on each node, whether the item matches the one we're looking for. What this means is that for CLQ.contains() and CLQ.remove(), for each node that is traversed, two acquire barriers are done: one when Node.item is read to compare against the object being searched for, and another when reading the next node from Node.next.

- "next" is a reference to the next node on the list, or null if this is the last node. On CLQ.Node the "next" is a volatile because it can change at any time, if the next node in the list is removed and unlinked. In the CLLElectedUnlink it is a relaxed atomic because we want to avoid doing the acquire barrier.
Notice that in Java all references have an atomicity guarantee and therefore work like the C++1x and C11 relaxed atomics (kind of). In a C++1x implementation of the CLLElectedUnlink we would have to use the atomic<> qualifier on next and then read it using std::memory_order_relaxed.
The trick with using a non-volatile next in the CLLElectedUnlink is that whenever we read it and it is null, we have to re-read it with an acquire barrier, or put in Java terminology, do a volatile load. This guarantees that we get the most up-to-date value of next (and implicitly for any of the following nodes, if they exist), and if the newly read value is again null, then we know for sure that this node is indeed the last node on the list.
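This re-read pattern can be sketched as follows. The class and method names here are hypothetical, not the ones in the actual implementation; VarHandle.getVolatile is one modern way in Java to do a volatile load of a plain field:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class RelaxedNextDemo {
    static class Node<E> {
        final E item;
        Node<E> next;           // plain field: normal reads are relaxed
        Node(E item) { this.item = item; }
    }

    static final VarHandle NEXT;
    static {
        try {
            NEXT = MethodHandles.lookup()
                    .findVarHandle(Node.class, "next", Node.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Relaxed read of next; only when it looks null do we pay for a
    // volatile (acquire) re-read to confirm this really is the tail.
    @SuppressWarnings("unchecked")
    static <E> Node<E> readNext(Node<E> node) {
        Node<E> n = node.next;                        // relaxed load
        if (n == null) {
            n = (Node<E>) NEXT.getVolatile(node);     // acquire re-read
        }
        return n;
    }
}
```

In steady state the acquire barrier is paid at most once per traversal, at the tail, no matter how many nodes are crossed.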

- "state" is a two-state volatile variable that indicates whether this node has been logically removed or not. In CLQ these two states are encoded by setting the item to null or non-null, but in the CLLElectedUnlink we don't want to change item to a volatile because it would add an acquire barrier, so we create a new variable, which we named state. This variable starts as "INUSE" when the node is first created, and is set to "REMOVED" when the node is marked as removed.


Traversing the linked list is very simple. It's just a while loop that goes through the nodes searching for a matching item.
Traversal starts from the beginning of the linked list, at head. The head variable is a volatile (in C++1x terminology, it is read with an acquire barrier), which guarantees that all globally visible modifications to the linked list will be seen, at least up until the moment head was read.
The real trick is done when reading next, which is not volatile. We read it until we see it is null, and then we read it one more time, this time as a volatile (with an acquire barrier), to make sure we get the most up-to-date value. If it is still null, then we know this is the last node in the list; otherwise we need to keep traversing.

One thing to keep in mind: if the item on a node matches the item we're searching for, we need to read the state variable, which implies an acquire barrier. If the same item is removed and re-inserted frequently, then for a certain time window there can be multiple nodes with that item, all of them (except maybe one) removed, and a search for that item will read the state variable of each of those nodes to check whether it is removed, thus doing an acquire barrier each time. This is a rare situation, and even then the number of acquire barriers should not be too high.
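A hedged sketch of such a traversal is shown below. The class name is hypothetical, and for brevity the volatile re-read of a null next described above is omitted, so this is a sketch of the barrier accounting rather than a complete implementation:

```java
class TraversalDemo<E> {
    static final int INUSE = 0, REMOVED = 1;

    static final class Node<E> {
        final E item;                 // final: no barrier needed to read it
        Node<E> next;                 // plain: traversed with relaxed reads
        volatile int state = INUSE;
        Node(E item) { this.item = item; }
    }

    volatile Node<E> head;            // the volatile load of head is the
                                      // one acquire barrier per traversal

    public boolean contains(E key) {
        Node<E> node = head;          // acquire barrier happens here
        while (node != null) {
            if (key.equals(node.item)) {
                // only on a match do we pay an extra acquire barrier
                // to read the volatile state
                if (node.state == INUSE) return true;
            }
            node = node.next;         // relaxed read (tail re-read omitted)
        }
        return false;
    }
}
```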


Now comes the tricky part. The major difficulty in implementing lock-free lists lies in the many corner cases that can occur when trying to physically remove a node from the list; much of the complexity of Harris' linked list comes from handling them.

In the CLLElectedUnlink, we bypass most of these difficulties by using the Election Pattern to do the physical unlinking from the list. The main trick is that the remove() will only mark the node as logically removed, but it will be up to the unlink() function to do the physical unlinking.
The unlink() function contains an election mechanism so that only one thread at a time can execute unlinking procedures.
In our implementation, we call the unlink() from the remove(), but it can be called directly as well. This is useful if your application does removals in batches and once they finish you can call the unlink() directly, thus physically unlinking all removed nodes in a single traversal of the list.
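The interplay between remove() and the elected unlink() can be sketched as follows. The names are hypothetical, next is volatile here for brevity, the election is reduced to a simple CAS on a guard, and a production remove() would CAS the state transition rather than use a plain write:

```java
import java.util.concurrent.atomic.AtomicInteger;

class ElectedUnlinkDemo<E> {
    static final int INUSE = 0, REMOVED = 1;

    static final class Node<E> {
        final E item;
        volatile Node<E> next;
        volatile int state = INUSE;
        Node(E item) { this.item = item; }
    }

    final Node<E> head = new Node<>(null);            // sentinel node
    final AtomicInteger electionGuard = new AtomicInteger(0);

    // Logical removal: just mark the node, then try to get elected
    // to do the physical unlinking lazily.
    public boolean remove(E key) {
        for (Node<E> n = head.next; n != null; n = n.next) {
            if (n.state == INUSE && key.equals(n.item)) {
                n.state = REMOVED;                    // logical removal only
                unlink();                             // lazy physical unlink
                return true;
            }
        }
        return false;
    }

    // Election: only the thread that wins the CAS walks the list and
    // unlinks marked nodes; every other thread returns immediately.
    public void unlink() {
        if (!electionGuard.compareAndSet(0, 1)) return;
        try {
            Node<E> prev = head;
            while (prev != null) {
                Node<E> n = prev.next;
                // never unlink the last node (the one whose next is null)
                if (n != null && n.state == REMOVED && n.next != null) {
                    prev.next = n.next;  // single unlinker: no CAS needed
                } else {
                    prev = n;
                }
            }
        } finally {
            electionGuard.set(0);                     // resign the election
        }
    }
}
```

Because only the elected thread ever modifies next during the unlinking, the stores inside unlink() do not race with each other, which is what lets the real implementation use relaxed stores there.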

One thing about the next field is that it has the following state machine: it starts at null, transitions once to a non-null node, and from then on only ever changes to another non-null node.

When a node is first created, its next is null, and the only way to change it is through an add() operation inserting a new node. The insertion of the new node is done with a CAS, guaranteeing it is done exactly once. Notice that after next transitions from null to a reference to a node in the list, it will never be null again... ever! The only way to change a non-null next is to assign it to another non-null node, and this is done by the unlinking operation. Thanks to the Election Pattern, one, and only one, thread at a time can execute the unlinking operation, which means the store can be done as a relaxed atomic.
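An insertion along these lines could look like the sketch below. The names are hypothetical; VarHandle.compareAndSet is used here to perform the one-shot null-to-node transition on a plain field:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class InsertDemo<E> {
    static final class Node<E> {
        final E item;
        Node<E> next;     // null -> node happens exactly once, via CAS
        Node(E item) { this.item = item; }
    }

    static final VarHandle NEXT;
    static {
        try {
            NEXT = MethodHandles.lookup()
                    .findVarHandle(Node.class, "next", Node.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    final Node<E> head = new Node<>(null);            // sentinel node

    // Walk to the tail with relaxed reads; a null next is confirmed
    // with a volatile re-read before the CAS attempts the insertion.
    @SuppressWarnings("unchecked")
    public void add(E item) {
        final Node<E> newNode = new Node<>(item);
        Node<E> node = head;
        while (true) {
            Node<E> nxt = node.next;                  // relaxed read
            if (nxt == null) {
                nxt = (Node<E>) NEXT.getVolatile(node);  // acquire re-read
                if (nxt == null) {
                    if (NEXT.compareAndSet(node, null, newNode)) return;
                    continue;  // lost the race: re-read and keep walking
                }
            }
            node = nxt;
        }
    }
}
```

If the CAS fails, some other thread appended a node first; we simply re-read next, advance past the new tail, and try again.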

One of the tricks of the unlink() is that it will never try to unlink the last node on the list, i.e. the node whose next is null. Doing so would risk unlinking that node just after another thread had inserted a new node after it, and the newly inserted node would disappear from the list, giving errors; the happens-before relation for insertions would no longer hold.

Another trick of the unlink is that it will always set next to reference a node that is further ahead on the list. Due to the relaxed atomics, different threads may see different values of next, but it will always be a node further ahead on the list, and it can only skip nodes that have been marked as removed, so there is no risk of a list traversal bypassing a valid node.
For example, suppose we have 5 nodes on the list, from A to E:

head -> A -> B -> C -> D -> E

and now three different threads have marked nodes B, C and D to be removed, and one thread then goes and unlinks them all: first making A.next point to C, then to D, and then to E. Notice that there is no guarantee on ordering or visibility for relaxed atomics, so these stores may be seen in any order, by some threads and not others:

A -> C -> D -> E    (first store seen)
A -> D -> E         (second store seen)
A -> E              (all stores seen)

Either way, the invariant still holds that next can only reference nodes further ahead on the list.

Next week we will continue talking about the CLLElectedUnlink, so stay tuned.
