Thursday, September 24, 2015

Poor Man's URCU

This post is about a new way of implementing the core API of RCU using only locks.
You can get the detailed paper here:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/poormanurcu-2015.pdf
and benchmark code (which I'm not particularly proud of, but it works) here:
https://github.com/pramalhe/ConcurrencyFreaks/tree/master/CPP/papers/poormansurcu

RCU is many things, but at its essence it allows you to introduce quiescent points in your program. As a subset of this, we can say that "RCU allows a thread to make a modification and then wait until it is certain that this modification is visible to all other threads".
This last feature may seem trivial, but it is incredibly useful, from safe memory reclamation to building synchronization mechanisms like Left-Right.

RCU has three core APIs, namely rcu_read_lock(), rcu_read_unlock(), and synchronize_rcu(). Typical implementations provide wait-free progress for rcu_read_lock() and rcu_read_unlock(), and are blocking by design for synchronize_rcu().
You can see the main implementations in the userspace RCU library (liburcu), and several toy implementations in Section 9.3.5, "Toy" RCU Implementations, of Paul McKenney's perfbook:
https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html

To use RCU in your application you need to link with the URCU library, or implement one of the toy implementations mentioned in perfbook. Either way, this requires a compiler and language with access to atomic operations. An alternative is a lock-based implementation of RCU, like the one described in Section 9.3.5.1 of perfbook:
// Global lock shared by all readers and updaters (perfbook uses a
// Linux-kernel-style spinlock here).
static DEFINE_SPINLOCK(rcu_gp_lock);

static void rcu_read_lock(void) {
    spin_lock(&rcu_gp_lock);
}

static void rcu_read_unlock(void) {
    spin_unlock(&rcu_gp_lock);
}

// Acquiring and then releasing the lock guarantees that all pre-existing
// readers (each of which holds the lock) have completed.
void synchronize_rcu(void) {
    spin_lock(&rcu_gp_lock);
    spin_unlock(&rcu_gp_lock);
}

The approach above is blocking for rcu_read_lock() and does not scale at all. A slightly better approach would be to use a reader-writer lock instead of a spinlock, but it would still be blocking.
We propose an alternative that also uses only locks, yet is lock-free for rcu_read_lock().

Threads that call rcu_read_lock() and rcu_read_unlock() (Arrivers, or Readers) attempt to acquire the read-lock on the first of the reader-writer locks (rwlock1) with a trylock() operation; if that fails, they do a trylock() on the second reader-writer lock (rwlock2), and if that also fails, they try the first rwlock again, and so on.
This procedure could go on indefinitely, but a trylock() can only fail while another thread is making progress, namely a thread acquiring the write-lock on one of the reader-writer locks, which means the operation is lock-free.

A thread that calls synchronize_rcu() (a Toggler, or Writer) starts by acquiring the mutual exclusion lock, to guarantee that only one Toggler is present at a time.
The Toggler then acquires the write-lock on the second reader-writer lock, releasing it as soon as it has been obtained, and does the same with the first reader-writer lock. This is enough to introduce a linearization point: it is now safe for the Toggler to free/delete the memory of the object that is meant to be reclaimed (or perform whatever other task it needs to do), because it has the guarantee that no ongoing Reader can still hold a pointer to that object.

Here's what it looks like when using pthreads:

typedef struct {
  pthread_mutex_t  togglerMutex;  // serializes Togglers
  pthread_rwlock_t rwlock1;       // first lock tried by Readers
  pthread_rwlock_t rwlock2;       // fallback lock for Readers
} poorman_urcu_t;

// Lock-free. Returns which rwlock was read-locked (1 or 2); the caller
// must pass this value to rcu_read_unlock() as the 'whichone' argument.
int rcu_read_lock(poorman_urcu_t * self) {
  while (1) {
    // A trylock() can only fail while a Toggler holds (or is acquiring)
    // the write-lock, i.e., while some other thread is making progress.
    if (pthread_rwlock_tryrdlock(&self->rwlock1) == 0) {
      return 1;
    }
    if (pthread_rwlock_tryrdlock(&self->rwlock2) == 0) {
      return 2;
    }
  }
}

// Releases the read-lock acquired by the matching rcu_read_lock() call.
void rcu_read_unlock(poorman_urcu_t * self, int whichone) {
  if (whichone == 1) {
    pthread_rwlock_unlock(&self->rwlock1);
  } else {    // whichone == 2
    pthread_rwlock_unlock(&self->rwlock2);
  }
}

// Blocking by design: returns only after all pre-existing Readers have
// finished their read-side critical sections.
void synchronize_rcu(poorman_urcu_t * self) {
  pthread_mutex_lock(&self->togglerMutex);  // one Toggler at a time
  // Once the write-lock on rwlock2 has been held, no pre-existing Reader
  // can still be holding a read-lock on rwlock2...
  pthread_rwlock_wrlock(&self->rwlock2);
  pthread_rwlock_unlock(&self->rwlock2);
  // ...and once the write-lock on rwlock1 has been held as well, every
  // pre-existing Reader has released whichever read-lock it was holding.
  pthread_rwlock_wrlock(&self->rwlock1);
  pthread_rwlock_unlock(&self->rwlock1);
  pthread_mutex_unlock(&self->togglerMutex);
}
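
One detail not shown above is initialization. Assuming a platform that provides static initializers for rwlocks (glibc does, as PTHREAD_RWLOCK_INITIALIZER), a minimal sketch could be:

poorman_urcu_t urcu = {
  PTHREAD_MUTEX_INITIALIZER,   // togglerMutex
  PTHREAD_RWLOCK_INITIALIZER,  // rwlock1
  PTHREAD_RWLOCK_INITIALIZER   // rwlock2
};

Otherwise, call pthread_mutex_init() and pthread_rwlock_init() on each field before first use.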


Notice that in this implementation rcu_read_lock() returns an integer indicating which reader-writer lock was acquired, and this value is then passed to rcu_read_unlock() so that it knows which reader-writer lock to release. An alternative implementation could use a thread-local variable to pass this information, as sketched below.
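A minimal sketch of that thread-local alternative (the wrapper names and the __thread storage class are my assumptions; note that this simple version does not support nested read-side critical sections):

static __thread int tl_whichone;  // which rwlock this thread read-locked

void rcu_read_lock_tl(poorman_urcu_t * self) {
  tl_whichone = rcu_read_lock(self);
}

void rcu_read_unlock_tl(poorman_urcu_t * self) {
  rcu_read_unlock(self, tl_whichone);
}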
An example of how to use this kind of API for memory reclamation can be seen here: https://lwn.net/Articles/262464/#Wait%20For%20Pre-Existing%20RCU%20Readers%20to%20Complete
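Following the pattern in that article, here is a hypothetical sketch of pointer-swap-then-free with this API (foo_t, gptr, reader() and updater() are illustrative names, not part of the algorithm):

#include <stdlib.h>

typedef struct { int value; } foo_t;
foo_t * gptr;  // shared pointer, read by many threads

void reader(poorman_urcu_t * urcu) {
  int whichone = rcu_read_lock(urcu);
  foo_t * p = gptr;  // 'p' cannot be freed while we hold the read-lock
  // ... use p->value ...
  rcu_read_unlock(urcu, whichone);
}

void updater(poorman_urcu_t * urcu, foo_t * newobj) {
  foo_t * old = gptr;
  gptr = newobj;          // publish the new version (a full implementation
                          // would use rcu_assign_pointer()-style ordering)
  synchronize_rcu(urcu);  // wait for all pre-existing Readers to finish
  free(old);              // no Reader can still hold a pointer to 'old'
}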

We compared it against the Bullet-Proof URCU implementation and, unsurprisingly, ours is not as good, but that's not too bad: in most scenarios the bottleneck won't be the RCU calls themselves but some other piece of code in your application.

One interesting detail is that if the underlying reader-writer lock is starvation-free for Writers, then synchronize_rcu() will also be starvation-free, which is a desirable property in low-latency settings.
Alternatively, a reader-writer lock with writer preference can be used, to avoid starving the threads calling synchronize_rcu() (the Togglers).
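On glibc, for instance, writer preference can be requested through a non-portable rwlock attribute (this initialization sketch is mine, not part of the paper; note the _np suffix):

int poorman_urcu_init_wpref(poorman_urcu_t * self) {
  pthread_rwlockattr_t attr;
  pthread_rwlockattr_init(&attr);
  // The NONRECURSIVE variant is the one that actually gives writers
  // priority in glibc.
  pthread_rwlockattr_setkind_np(&attr,
      PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
  int err = pthread_rwlock_init(&self->rwlock1, &attr) ||
            pthread_rwlock_init(&self->rwlock2, &attr) ||
            pthread_mutex_init(&self->togglerMutex, NULL);
  pthread_rwlockattr_destroy(&attr);
  return err;
}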

Another interesting detail is that the order of acquisition/release of the rwlocks in synchronize_rcu() is not important for correctness. The algorithm would be correct regardless of whether the exclusive lock is taken first on rwlock1 and then on rwlock2, or vice versa.
Although this doesn't matter for correctness, it may matter for performance. Consider a scenario where read operations are typically very short, and the first lock the Readers try is the last lock the Togglers take (as in the code shown above). After a Toggler releases rwlock1, by the time it releases the togglerMutex and the next waiting Toggler acquires it, most of the old Readers will have finished their work, released rwlock2, and arrived at rwlock1 (the first one they try). This means the next Toggler can acquire the exclusive lock on rwlock2 almost immediately and, probabilistically, ends up waiting only for rwlock1.
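For illustration, the equally correct alternative ordering would be the following, which for the short-reader scenario just described would tend to make the Toggler wait on both rwlocks instead of just one:

void synchronize_rcu_alt(poorman_urcu_t * self) {
  pthread_mutex_lock(&self->togglerMutex);
  pthread_rwlock_wrlock(&self->rwlock1);  // rwlock1 first this time
  pthread_rwlock_unlock(&self->rwlock1);
  pthread_rwlock_wrlock(&self->rwlock2);  // then rwlock2
  pthread_rwlock_unlock(&self->rwlock2);
  pthread_mutex_unlock(&self->togglerMutex);
}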

Speaking of performance, this technique's throughput is obviously bounded by the underlying rwlock implementation. If a pthread_rwlock_t is used, don't expect much from it, but if something like C-RW-WP (a scalable reader-writer lock) is used instead, then Readers will scale very well, and I would expect this technique to come close to the kind of scalability provided by the mainstream URCU implementations.
Performance wasn't our goal here anyway. We were more concerned with an RCU that is easy to deploy, has a low memory footprint, and gives low-latency guarantees, and that works on top of pthreads, because we have some old, small embedded systems where RCU can be useful (think Internet of Things).

A different use-case is scenarios where linking with an RCU library isn't easy, like on the JVM. It's true that most use-cases for RCU are related to safe memory reclamation, and on the JVM we have a GC, so it's somewhat pointless for that; but there are special use-cases, for example, when we need to track the lifetime of complex objects and want to safely re-use them in a concurrent setting, with the certainty that no threads are still referencing them... but using RCU in this way is a topic for a post of its own.


To wrap up, although not particularly efficient, and with a limited API, our "Poor Man's URCU" algorithm is highly portable across operating systems and programming languages, does not require linking with an external library, does not require kernel/OS support, and has no licensing constraints.
We hope that the simplicity and ease-of-use of our algorithm can foster the usage of RCU in domains where it was previously impractical.
