Wednesday, December 31, 2014

Tidex Mutex - No Pthread_Self variant

Before we close up on the Tidex Lock, there is one extra technique that may be useful.
When we use pthread_self() to get the thread id, it should be noticed that using a pthread_t for anything else than comparison is undefined behavior. In practice, this is not much of a problem because we just want to use the negative value of it, so even for the cases where pthread_t is a pointer to a struct, casting it to a long long and using the negative of it should provide a unique value that is never returned by pthread_self().
It is true that most of us get a bit stressed when the words undefined behavior are part of a mutual exclusion lock algorithm, therefore, in order to give some peace of mind to those troubled by it, here is a technique which doesn't need ptread_self() and provides the same performance.
We call it the No Pthread_Self() variant, or Tidex NPS for short.

We start by adding a sequentially consistent atomic variable named globalThreadIndex which is incremented monotonically with atomic_fetch_add() by each new thread accessing an instance of the lock. This increment is done once and only once for each thread and it is the equivalent to the thread id.  To store this thread id for later usage, we have a thread-local variable (non-atomic) to save it.
Here's what the extra code looks like in C11:

 * This variable can even be an 'atomic_short' because it is unlikely that your
 * application will create more than 32767 threads. This also means that
 * both ingress and egress can be of type 'atomic_short', which can save memory.
 * We start at '1' because we want to use the negative of the value as well.
 * Alternatively, we could start at zero but then we would have to advance
 * this index 2 at a time.
 * This is shared by all tidex_nps_mutex_t instances to save memory.
static atomic_long globalThreadIndex = ATOMIC_VAR_INIT(1);

 * The index of the thread is stored in a thread-local variable that is
 * shared by all instances of tidex_nps_mutex_t.
 * If the value is the initialized of INVALID_TID (zero) then we need to
 * get a value from globalThreadIndex using atomic_fetch_add(), once and
 * only once per thread.
static _Thread_local long tlThreadIndex = INVALID_TID;

Notice that both these variables are shared among all instances of the locks so there is no increase in memory usage, and in fact, with this technique we can make the ingress and egress be 32bit integers or event 16 bit integers, which would reduce memory usage.

The modifications to lock() are minimal, and unlock() remains unchanged:

void tidex_nps_mutex_lock(tidex_nps_mutex_t * self)
    long mytid = tlThreadIndex;
    if (mytid == INVALID_TID) {
        tlThreadIndex = atomic_fetch_add(&globalThreadIndex, 1);
        mytid = tlThreadIndex;
    if (atomic_load_explicit(&self->egress, memory_order_relaxed) == mytid) mytid = -mytid;
    long prevtid = atomic_exchange(&self->ingress, mytid);
    while (atomic_load(&self->egress) != prevtid) {
        sched_yield();  // Replace this with thrd_yield() if you use <threads.h>
    // Lock has been acquired
    self->nextEgress = mytid;

Conceptually, there is not much of a difference between this approach and the one using pthread_self() because pthread_self() returns itself a value from a thread-local variable which contains the thread's id.

The performance numbers remain unchanged, at least on our usual benchmark.

As usual, the C11 code is available on github:

Sunday, December 28, 2014

Tidex Mutex in C11

We've made a C11 version of the Tidex Lock and made it available on github:

We ran some benchmarks to compare the Tidex Lock versus a pthread_mutex_t and our own Ticket Lock implementation.

--- Opteron 32 cores ---
1 thread:
pthread_mutex_t =  5583126 ops/second
Ticket Lock     = 18606375 ops/second
Tidex Lock      = 17374496 ops/second

16 threads:
pthread_mutex_t = 1418309 ops/second
Ticket Lock     = 5348964 ops/second
Tidex Lock      = 5322226 ops/second

32 threads:
pthread_mutex_t = 1338859 ops/second
Ticket Lock     = 4004952 ops/second
Tidex Lock      = 3775166 ops/second

--- Intel i7 ---
1 thread:
pthread_mutex_t = 13720671 ops/second
Ticket Lock     = 37627282 ops/second
Tidex Lock      = 39492774 ops/second

4 threads:
pthread_mutex_t =  3418376 ops/second
Ticket Lock     = 18810728 ops/second
Tidex Lock      = 24807009 ops/second

8 threads:
pthread_mutex_t =  4679020 ops/second
Ticket Lock     = 17078066 ops/second
Tidex Lock      = 18310183 ops/second

First thing to notice is that the results for the Tidex Lock are very similar to the Ticket Lock (at least on this benchmark). Another thing is that there is a small disadvantage on the AMD Opteron but a small advantage on the Intel i7-3740QM for the Tidex Lock. This benchmark was done without any spinning.

So what does it look like on C11 ?

void tidex_mutex_lock(tidex_mutex_t * self)
    long long mytid = (long long)pthread_self();
    if (atomic_load_explicit(&self->egress, memory_order_relaxed) == mytid) mytid = -mytid;
    long long prevtid = atomic_exchange(&self->ingress, mytid);
    while (atomic_load(&self->egress) != prevtid) {
        sched_yield();  // Replace this with thrd_yield() if you use <threads.h>
    // Lock has been acquired
    self->nextEgress = mytid;

void tidex_mutex_unlock(tidex_mutex_t * self)
    atomic_store(&self->egress, self->nextEgress);

as you can see, this lock is very easy to implement (and understand).