Thursday, February 28, 2013

Optimizing Reentrant calls 2

We can apply to the ScalableReentrantRWLock the same technique described in this post where we add an integer to the ReadersEntry class and increment/decrement that int instead of the AtomicInteger, this way the number of synchronizations is reduced.
Here is a plot of the performance comparison:


With this approach we see a constant performance improvement of the order of 10% to 15%.
The new implementation will be available on version 2.0 of the Concurrency Freaks library.



What's the minimum number of steps to do a read-only lock?

Let's start by remembering what the read-lock looks like for the ScalableReentrantRWLock:




    public void sharedLock() {
        // Initialize a new Reader-state for this thread if needed        
        if (entry.get() == null) addState();     
       
        final AtomicInteger readersCount = entry.get().state;
       
        while (true) {
            if (writerOwner.get() == SRWL_INVALID_TID) {
                readersCount.set(1);
               if (writerOwner.get() == SRWL_INVALID_TID) {
                   // Acquired lock in read-only mode
                   return;
               } else {
                   // A Writer has acquired the lock, must reset to 0 and wait
                   readersCount.set(0);
               }
            }
            Thread.yield();
        }        
    }



Looking into the while() loop, we can see that the minimum number of operations needed to acquire the read-lock are three: writerOwner.get(), readersCount.set(1), writerOwner.get().
This means that in order to obtain a read-lock, you will need to do at least the following Atomic operations: two reads (loads) and one write (store). An interesting detail about this, is that the second writerOwner.get() should be a very quick operation because if there was no cache-line invalidation of writerOwner the value will still be in cache L1 due to the first writerOwner.get(), and not only that, on x86 the instruction is a simple mov (plus the compiler-level acquire-barrier). If the previous sentence didn't make sense, just go and listen to this starting at minute 31.

And now you ask: Dude, that's just 3 operations! Can we still improved it?
Yes we can!

Here's how: Instead of reading the state of the Writer and then setting the Reader's state to 1, we simply set the Reader's state to 1, and then if there is a Writer waiting, we rollback to state 0.
The problem with that approach is that if you repeat that logic then you might get into a livelock state between a Reader and a Writer, which is not good. We can only use this technique once (for the first time) and then we default to the regular technique that uses three operations. Sounds confusing? Let's look at some code:



    public void sharedLock() {
        // Initialize a new Reader-state for this thread if needed        
        if (entry.get() == null) addState();     
       
        final AtomicInteger readersCount = entry.get().state;

        // Optimistic approach: we hope that most of the time
        // the writerOwner will be SRWL_INVALID_TID and that will
        // save us one extra .get()
        readersCount.set(1);
        if (writerOwner.get() == SRWL_INVALID_TID) {
            // Acquired lock in read-only mode
            return;
        } else {
            // A Writer has acquired the lock, must set the reader's state
            // to 0 and go into the "regular" execution
            readersCount.set(0);
        }

        while (true) {
            if (writerOwner.get() == SRWL_INVALID_TID) {
                readersCount.set(1);
               if (writerOwner.get() == SRWL_INVALID_TID) {
                   // Acquired lock in read-only mode
                   return;
               } else {
                   // A Writer has acquired the lock, must reset to 0 and wait
                   readersCount.set(0);
               }
            }
            Thread.yield();
        }        
    }

Do we gain a performance improvement by doing this?
Not on x86, but maybe yes on architectures where the get() forces a full synchronization and saving a single get() may increase the performance noticeably. Anyone has a PowerPC or ARMv7 system that runs Java?

By the way, this whole idea came from Andreia's head... and I'm offering dinner to whoever can do a read-lock with less operations (than one get() and one set())   :)





The problem of not having a Garbage Collector

As mentioned on a previous post, one of the problems of not having a GC in C or C++ is that some algorithms are very tricky to implement. Sure, you can use a Hazard-Pointer-Manager like the one described here, and I even have made one that is Wait-Free (a subject for a future post), but then you're just replacing a nasty problem with a slightly more complex one.
It's always simpler to modify the algorithm (when possible) to get rid of the issue, than to try to implement your own GC or HPM, and that's what we did.

My pain comes from the algorithm described in this post not being applicable to C or C++, mostly because the helper array can be removed at any time by a new thread or one that is about to terminate.
Any call to addState() or to removeState() will create a new helper array for readersArray and release the memory for the old one.

Andreia had the idea of replacing the array with a ConcurrentLinkedQueue (or a similar lock-free implementation in C++) which would make it safer. Still, in C++ we will  need to call exclusiveLock()/exclusiveUnlock() in removeState() to prevent other threads from having a pointer to the current state of the reader that we are trying to delete. The reason is that, we know that the only place in the code where a pointer for that entry may be kept by another thread is in exclusiveLock(), so in order to prevent that from happening, we call exclusiveLock() from the thread that is about to terminate, which will set the writerOwner to tid_self and prevent the other thread from taking the pointer. By the time the other thread acquires the lock on writerOwner, the ConcurrentLinkedQueue will no longer have that entry and, therefore,  there is no way for it to grab that pointer:


    public void exclusiveLock() {
        final long tid_self = Thread.currentThread().getId();
            
        // Try to acquire the lock in write-mode
        while (!writerOwner.compareAndSet(SRWL_INVALID_TID, tid_self))
            Thread.yield();


        // Transform the ConcurrentLinkedQueue into an array to make it 
        // easier to scan
        AtomicInteger[] readersArray =  
            readersCount.toArray(new AtomicInteger[readersCount.size()]);
        int[] waitForReaders = new int[readersArray.length];
        int numWaitReaders = 0;       
       
        // Write-Lock was acquired, now wait for any running Readers to finish
        for (int i = 0; i < readersArray.length; i++) {
            if (readersArray[i].get() > 0)
                waitForReaders[numWaitReaders++] = i;           
        }
       
        for (int i = 0; i < numWaitReaders; i++) {
            while (readersArray[waitForReaders[i]].get() > 0)
                Thread.yield();
        }      
    }


Truly devious!