Wednesday, September 14, 2016

A simple Userspace RCU in Java

In this post we're going to talk about a simple RCU implementatio, and provide an example in Java, but it's easy to port it to C++.

So, what is RCU again?
Yeah, well, RCU is many things and it has been talked about it detail by one of its original authors here.

... no, I mean, like, what is RCU, like _really_?
In its most basic form, RCU is a concurrency construct to provide on-demand quiescence.

By concurrency construct I mean it's a building block to use in a concurrency setting, the same way that a mutual exclusion lock is a concurrency construct, or a semaphore, or a reader-writer lock, etc. And by on-demand quiescence I mean that you can introduce quiescence anywhere in your application, as needed (more on that later).

RCU's basic API is:
  • rcu_read_lock()
  • rcu_read_unlock()
  • synchronize_rcu()

with the first two being called in a block, i.e. for every rcu_read_lock() there must be a later rcu_read_unlock(). Also, we can not call synchronize_rcu() from within a block of code enclosed by rcu_read_lock() / rcu_read_unlock().


Now, unlike the examples I gave of locks and semaphores, RCU does not provide any kind of mutual exclusivity property, at least not directly, but it does provides two very interesting guarantees:
First, the effects of any operation that happens before synchronize_rcu() in one thread (T1), will for sure be visible in another thread (T2) after rcu_read_lock(), if the call to synchronize_rcu() in T1 returned before the call to rcu_read_lock() in T2;
Second, the effects of any operation that happens after synchronize_rcu() in one thread (T1), will for sure not be visible in another thread (T2) before rcu_read_unlock(), if the call to rcu_read_lock() in T2 started before the call to synchronize_rcu() in T1.

... yes, it's confusing when you read it for the first time, but the idea is pretty simple. Here is a schematic that may help to understand:


T1 did some stuff (stuff before) then it called synchronize_rcu() and then it did some other stuff (stuff after). Time flows from top to bottom, i.e. execution order.
The other threads, T2, T3, T4 and T5 are doing rcu_read_lock()/rcu_read_unlock() and we show which events are seen in each situation.

The trickier one here is T5 for which rcu_read_lock() started during the call of synchronize_rcu() (if there was an absolute clock), but because synchronize_rcu() terminated before the call for rcu_read_unlock(), it means that T1 considered that rcu_read_lock() occurred after and so it did have to wait for the call to rcu_read_unlock(), and therefore, T5 will see the stuff that happened before on T1 and may or may not see that stuff that happened after.


Still confused?
That's ok, let's try another way then: The semantics of rcu_read_lock()/rcu_read_unlock() work just like the semantics of a read-lock in a reader-writer lock (only the semantics are similar, the effects are very different). And the semantics for synchronize_rcu(), well, there is no good analogy but you can think of it as having a lock()/unlock() of a writer lock... it's the closest thing I can come up with.

What does an RCU implementation looks like in Java?



static class RCU {

    final static long NOT_READING = Long.MAX_VALUE;

    final static int MAX_THREADS = 128;

    final AtomicLong reclaimerVersion = new AtomicLong(0);

    final AtomicLongArray readersVersion = new AtomicLongArray(MAX_THREADS);

   

    public RCU() {

        for (int i=0; i < MAX_THREADS; i++) readersVersion.set(i, NOT_READING);

    }

   

    public static int getTID() {

        return (int)(Thread.currentThread().getId() % MAX_THREADS);

    }

   

    public void read_lock(final int tid) {  // rcu_read_lock()

        final long rv = reclaimerVersion.get();

        readersVersion.set(tid, rv);

        final long nrv = reclaimerVersion.get();

        if (rv != nrv) readersVersion.lazySet(tid, nrv);

    }

   

    public void read_unlock(final int tid) { // rcu_read_unlock()

        readersVersion.set(tid, NOT_READING);

    }

   

    public void synchronize_rcu() {

        final long waitForVersion = reclaimerVersion.incrementAndGet();

        for (int i=0; i < MAX_THREADS; i++) {

            while (readersVersion.get(i) < waitForVersion) { } // spin

        }

    }

}


In this algorithm above, each thread calling synchronize_rcu() does a incrementAndGet() on an atomic variable named reclaimersVersion and then spins while waiting for all the ongoing readers to complete by scanning the array of their versions named readersVersion.
Upon calling rcu_read_lock(), a (reader) thread will read the current value in reclaimersVersion and set its uniquely assigned entry in the readersVersion array to the same value, and then re-check that the version has not changed and if so, update the readersVersion with the new entry.
When the read operation is done, it will call rcu_read_unlock() which will set its entry in readersVersion back to NOT_READING, a constant with a special value that means that the reader is not currently active.

It's hard to get any simpler than this  ;)

The advantages of this algorithm are:
- Very light in synchronization on the reader's side: Only two sequentially consistent loads and one seq-cst store on rcu_read_lock() and a release store in rcu_read_unlock();
- Wait-free for readers: No loops or retries on either rcu_read_lock() or rcu_read_unlock();
- Starvation-free for reclaimers: Calls to synchronize_rcu() are free from starvation... non-obvious why, but true;
- Multiple reclaimers can make progress simultaneously: Unlike some of the other Userspace RCU implementations that have a mutual exclusion lock inside synchronize_rcu(), this one does not, which means it scales well for multiple reclaimers;
- Code is short and sweet: There can't be many hidden bugs in something that takes less than 20 lines of code;

The only true subtle detail about this algorithm is in the fact that rcu_read_lock() does a load, a store, a load and then a relaxed store. The first load and store would suffice for the algorithm itself, but the thing about volatile/seq-cst stores is that they allow regular code to be reordered from below to above. This means that we could have code being executed before the readersVersion.set(tid, rv) which would be incorrect from the point of view of the expected semantics of RCU, and therefore, we need some kind of barrier to prevent code from traveling up. The easiest way to accomplish this is to do a volatile/seq-cst load, and since we're doing that, we might as well check if it changed and if it did, then we might as well update it with a lazy (or relaxed) store.
Like I said, it's subtle and not too important to understand how the algorithm works, it's more of an implementation detail.



Why would anyone want use RCU in Java? RCU is for memory reclamation, and Java has a GC, right?
Like I mentioned before, RCU is not (just) about memory reclamation, it's about using quiescence in your application to do whatever you want with quiescence, and as amazing as it may seem, there is a non-negligible amount of things you can do with it!

For example, suppose you have a pool of complex objects that are too costly to create, so you want to re-use by returning to the pool. But this pool is concurrent, so you have to make sure that some thread that is lagging behind isn't holding a reference to the object anymore.
After returning the object to the pool, when is it safe to re-use the object?
The answer is RCU: Protect all accesses to the objects from that pool with an rcu_read_lock()/rcu_read_unlock() and only reuse from the pool after calling synchronize_rcu().
Another example is Relativistic Programming which you can see more of in this paper:
http://web.cecs.pdx.edu/~walpole/papers/haskell2015.pdf

A funny thing about RCU is that one single RCU instance can be used for as many purposes as desired, for example, one RCU instance for all the pools (if there are more than one), or one per pool, as you prefer.


We'll prepare a C++ implementation for a future post.

2 comments:

  1. Does the version comparison require modulo arithmetic to handle wraparound? Or does this implementation assume its safe to not worry about that with a 64-bit value? It would take a very, very long time.

    // spin with modulo comparison
    while ((readersVersion.get(i) - waitForVersion) < 0) { }

    ReplyDelete
    Replies
    1. Hi Brian,
      We assume there won't be a wraparound of the 64bit variable, like you said, it would take a very long time (thousands of years maybe).
      It is possible to adapt the algorithm to deal with such case, but we didn't see the point of doing it.

      Cheers,
      Pedro

      Delete