More lockfree techniques: Lightweight Synchronization (RCU and RLU), Transactional Memory.

* Suppose we know that a certain data structure has mostly read accesses and very few writes/updates. Can we optimize the synchronization logic in such a way that reads can happen without waiting for writes? Two solutions: read-copy-update (RCU) and read-log-update (RLU).

* In RCU, readers do not take any locks when entering a critical section, but can optionally notify the kernel/RCU library when they are entering or leaving a critical section, e.g., via rcu_read_lock and rcu_read_unlock. A writer that wishes to modify a part of the data structure makes a copy of that part and applies the update to the copy, leaving the old version intact. Note that some readers may still be reading the old version of the data structure. The writer therefore waits until all such readers have left the old version, e.g., by making a blocking call to synchronize_rcu. After all old readers have left, the writer can reclaim the memory of the old copy.

* For example, consider an RCU-capable linked list. The writer makes a copy of the linked list node it wishes to edit, and atomically swings the next pointer of the previous node from the old node to the new copy (using a memory barrier instruction if needed). Once all readers have left the old copy of the node, the old node can be safely deleted. Note that new readers that start after the update will only see the new version of the list. (A code sketch of this pattern appears after these points.)

* Until when should a writer wait in synchronize_rcu? The library can either ask readers to explicitly signal when they leave a critical section, or the writer can infer a "grace period". For example, if all running tasks on all CPUs have undergone a voluntary context switch, then all old readers must have exited their critical sections (a reader inside an RCU critical section is not allowed to block, so a voluntary context switch implies it has left). Many such heuristics can be used to infer when all old readers have finished their critical sections.

* What about synchronization between writers? RCU doesn't specify any mechanism; writer-writer conflicts can be managed with locks, for example.

* What about data structures other than singly linked lists, where multiple operations are needed to switch from the old version to the new version? Not all data structures can be ported to the RCU technique.

* Is RCU the same as reader-writer locks? No. With reader-writer locks, a writer can block readers. With RCU, readers and writers can coexist.
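Below is a minimal sketch of this RCU pattern for a linked list, written against the userspace RCU library (liburcu). The calls rcu_read_lock, rcu_read_unlock, rcu_dereference, rcu_assign_pointer, and synchronize_rcu are the library's standard names; the node type, the lookup/update functions, and the external writer lock are assumptions of this sketch, and thread registration and error handling are omitted.

    #include <urcu.h>      /* userspace RCU (liburcu); the kernel API is analogous */
    #include <stdlib.h>
    #include <string.h>

    struct node {
        int key, value;
        struct node *next;
    };

    struct node *head;     /* RCU-protected list head */

    /* Reader: walks the list without taking any lock.
       (Per-thread setup, rcu_register_thread(), is omitted.) */
    int lookup(int key, int *value_out)
    {
        rcu_read_lock();                            /* mark entry into read-side section */
        for (struct node *n = rcu_dereference(head); n != NULL;
             n = rcu_dereference(n->next)) {
            if (n->key == key) {
                *value_out = n->value;
                rcu_read_unlock();
                return 1;
            }
        }
        rcu_read_unlock();                          /* mark exit */
        return 0;
    }

    /* Writer: replace the node holding `key` with an updated copy.
       Assumes writers are serialized externally (e.g., by a mutex). */
    void update(int key, int new_value)
    {
        struct node *prev = NULL;
        struct node *old = head;
        while (old != NULL && old->key != key) {
            prev = old;
            old = old->next;
        }
        if (old == NULL)
            return;
        struct node *copy = malloc(sizeof(*copy));
        memcpy(copy, old, sizeof(*copy));           /* old version stays intact for readers */
        copy->value = new_value;
        if (prev != NULL)
            rcu_assign_pointer(prev->next, copy);   /* atomically swing the next pointer,
                                                       with the needed memory barrier */
        else
            rcu_assign_pointer(head, copy);
        synchronize_rcu();                          /* block until all pre-existing
                                                       readers leave their sections */
        free(old);                                  /* now safe to reclaim the old version */
    }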
* A better idea: RLU (read-log-update; see the reference paper). With RLU, a writer doesn't update a copy in place, but instead maintains a log of changes to be made to the data structure. Any reader that has already started sees the original data structure, while new readers that start after the changes have been logged see the logged new copies. Once all old readers have left, the writer propagates the changes from the log, overwriting the old version of the data structure, and clears its log. All subsequent readers that start after the changes have been flushed once again see the latest version of the data structure directly.

* Once again, RLU doesn't do anything about writer-writer conflicts; writers must coordinate using coarse-grained or per-object fine-grained locks.

* Metadata maintained in RLU:
  - A global clock.
  - A data structure consists of objects (e.g., nodes in a linked list) that are allocated dynamically, e.g., via malloc. Every object has a header with a copy pointer that indicates whether the object can be accessed directly, or whether a more recent version of the object exists in the write log of some thread. This copy pointer is set by a writer when it locks and logs the object.
  - Every thread maintains a write log and two clocks: a local clock and a write clock. The local clock of a thread is updated from the global clock when the thread becomes active and starts to run. The write clock of a thread indicates the time from which its logged changes take effect; it is infinity when the thread has no pending commit.

* Protocol for readers: readers mark their entry and exit to critical sections via the functions rlu_reader_lock and rlu_reader_unlock. Inside the critical section, objects are accessed via the rlu_dereference function, which looks at the object's header and decides whether the object should be accessed directly, or whether a more recent copy should be fetched from the log of the locking thread. Threads whose local clock is behind the write clock of the writer (old threads) access the original object, while new threads (whose local clock is at or ahead of the writer's write clock) "steal" the updated copy of the object from the writer's log. Note that once a writer has cleared its log, it sets its write clock back to infinity, so future readers directly get the (now updated) object.

* Protocol for writers: the writer uses an atomic operation to lock an object, makes a copy of it in its write log, and makes the object's header point to this copy. It then makes its changes to this logged copy. Any other writer wishing to modify the object has to wait. Once the changes are done, the writer starts its commit phase: it sets its write clock to global clock + 1, and then atomically increments the global clock, in that order. (What happens if the order of these two operations is reversed?) This is the "defining moment" at which the write is said to have occurred, because the update to the write clock enables new readers to steal the updated copies from the log. The writer then waits for old readers to finish, propagates the changes from its log into the original data structure, sets its write clock back to infinity (to disable stealing), and clears its log.

* What versions will readers see? Readers that started before the "defining moment" of the commit continue to see the old version of the data structure: their local clock is behind the writer's write clock, so they never learn about the updated copies. Readers that arrive during the commit see the new write clock value and realize they have to steal the updated copies from the write log. Once the commit has finished and the log has been written back, readers once again stop stealing and read directly from the (now updated) data structure. (A usage sketch follows.)
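To make the protocol concrete, here is a sketch of an RLU reader and writer for a linked list. It uses the rlu_reader_lock / rlu_reader_unlock / rlu_dereference functions named above, plus rlu_try_lock and rlu_abort from the reference paper; the exact signatures, the per-thread context argument self, and the node type are assumptions of this sketch and vary across implementations.

    /* self is the per-thread RLU context (holds the write log and clocks). */

    int reader_lookup(rlu_thread_t *self, struct node **head, int key, int *out)
    {
        int found = 0;
        rlu_reader_lock(self);
        for (struct node *n = rlu_dereference(self, *head); n != NULL;
             n = rlu_dereference(self, n->next)) {
            if (n->key == key) { *out = n->value; found = 1; break; }
        }
        rlu_reader_unlock(self);   /* for a pure reader, this is all the bookkeeping */
        return found;
    }

    void writer_update(rlu_thread_t *self, struct node **head, int key, int new_value)
    {
    restart:
        rlu_reader_lock(self);     /* writers also run inside an RLU section */
        struct node *n = rlu_dereference(self, *head);
        while (n != NULL && n->key != key)
            n = rlu_dereference(self, n->next);
        if (n != NULL) {
            if (!rlu_try_lock(self, &n)) {  /* lock the object: copy it into our write
                                               log, point its header at the copy */
                rlu_abort(self);            /* another writer holds it; retry */
                goto restart;
            }
            n->value = new_value;           /* n now refers to the logged copy */
        }
        rlu_reader_unlock(self);   /* commit: set write clock, bump global clock,
                                      wait for old readers, write log back, clear log */
    }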
-----------

Transactional Memory

* Transactional memory (TM): memory changes in a critical section are treated as atomic transactions that all commit or abort together. If the underlying memory provides such an abstraction, then one can simply write critical sections as atomic transactions, without requiring locks for mutual exclusion.

* The idea of hardware transactional memory has been around for a few decades, but is not widely available. Software transactional memory support is available in some languages, but it is slow due to high overhead.

* TM vs locks: TM is optimistic. If there is no contention, there is almost no overhead, while locks incur overhead even under low contention. TM also composes easily (a transaction can enclose subprocedures that themselves contain transactions), while composing lock-based code in this way can lead to deadlocks. In general, TM is considered easier to program with.

* Let's start with the simple LL/SC atomic instructions.
  - load_linked(var &x): return the value of x, and mark x's cache line.
  - store_conditional(var &x, value v): if the mark is still set, write value v to x and return true; else return false.
  The combination of load linked and store conditional (LL/SC) lets us detect whether any other process has changed the value of the variable x. If SC succeeds, we know that no one else has updated the variable. If SC fails, it means that another core wrote to or invalidated the marked cache line, or that something like a context switch occurred (which clears the mark). CAS and LL/SC are equivalent in power, and usually only one of the two is supported in any given architecture. LL/SC does atomic updates to only one memory location; TM extends this idea to multiple locations. (A small example follows.)
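As a small example of how LL/SC is used, here is an atomic counter increment. Real ISAs expose these instructions under different names (e.g., ldxr/stxr on AArch64, lr/sc on RISC-V); load_linked and store_conditional below are the pseudo-instructions defined above, not a real C API.

    /* Increment *x atomically: retry until nothing intervenes between
       our load_linked and store_conditional. */
    void atomic_increment(int *x)
    {
        int v;
        do {
            v = load_linked(x);                  /* read x, mark its cache line */
        } while (!store_conditional(x, v + 1));  /* fails if the mark was cleared */
    }

This retry-until-SC-succeeds loop over a single location is the same shape as the transaction retry loops in the TM code below.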
* New hardware instructions for TM: LTX (load a memory location in a transaction, with exclusive access), ST (store to a memory location in a transaction), VALIDATE (is the transaction still running fine, or has someone else interfered?), COMMIT (try to commit the changes). A critical section consists of multiple LTX and ST operations, followed by a final COMMIT. The LTX and ST operations can fail silently if some other process has concurrently modified the same memory locations; in fact, the values returned by LTX can be garbage if there are concurrent updates. The VALIDATE instruction is used to check for this condition and guard against unsafe memory accesses. The COMMIT instruction propagates the changes to the memory locations if the transaction has completed successfully; otherwise none of the changes are visible in memory, and the transaction must be retried (possibly after a delay). Note that there is also an LT instruction for non-exclusive (read-only) access in a transaction.

* Consider a simple stack implemented with TM. Note how the code is very similar to sequential single-threaded code, but with extra instructions for TM access:

    push(node *n) {
        while (1) {
            n->next = LTX(&top);
            ST(&top, n);
            if (COMMIT())
                return;
            /* else delay, then retry */
        }
    }

    node *pop() {
        while (1) {
            node *result = LTX(&top);
            if (VALIDATE()) {
                if (result != NULL)
                    ST(&top, result->next);
                if (COMMIT())
                    return result;
            }
            /* delay, as VALIDATE or COMMIT failed, then retry */
        }
    }

* How is hardware TM implemented? It is envisioned as a small transactional cache alongside the regular L1d cache. When an LTX instruction is issued, it behaves much like a read on a normal cache: if a miss occurs in the transactional cache, the value is fetched from other caches (if they have an updated copy) or from main memory, and an invalidate message is sent to the other caches. Note that the LTX operation can fail if the cache line is in use by a transaction on another core, which refuses to respond to the invalidate message. Also, after you load a value using LTX, another core can invalidate your copy when it starts a concurrent transaction; this is detected during VALIDATE and COMMIT.

* What happens during ST? The value is loaded into the transactional cache, and two copies of the cache line are made. One copy is modified with the new value and marked XABORT (discard this value if the transaction aborts). The old value is retained in the other cache line, marked XCOMMIT (discard on commit).

* What happens during COMMIT? The slots marked XCOMMIT are discarded, and the slots marked XABORT become normal transactional cache slots (much like the M state in a regular cache). Note that modified values need not be written back to memory immediately.

* Cache evictions happen from normal slots (as in regular caches, when a new line must be fetched and the cache has no empty slots). Note that XABORT entries must not be discarded (evicting them would abort the transaction).

* How do TM and atomic operations/locks compare with respect to bus traffic? If only one core is modifying a cache line, then TM generates almost no bus traffic, while atomic operations/locks still generate bus traffic and force the writes to be visible everywhere. So in the optimistic case of low contention, TM is better. However, TM can lead to lots of retries and transaction failures under high contention.

* Software TM (STM) emulates TM in a programming language by providing language constructs for atomic operations. The implementation is much like RLU. For every set of read/write operations that is tagged as atomic, the changes are not propagated into memory immediately. Instead, every transaction maintains a log of the values read and written, along with a version number (derived from a global clock). During commit, if any conflicting transaction is detected, the commit fails; otherwise, the commit succeeds and the changes are propagated into memory. Obviously, STM incurs significant overhead, especially under high contention. (A sketch of this scheme appears at the end of these notes.)

* Other ways to avoid locks besides lock-free data structures and transactional memory: some programming languages like Go have the notion of ownership of memory. Every thread owns certain memory, and communication happens via message passing. Design philosophy of Go: "Do not communicate by sharing memory; instead, share memory by communicating."
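To make the STM description concrete, here is a minimal sketch of the version-clock scheme above (loosely in the style of TL2). Everything in it, stm_word, stm_tx, and the fixed-size logs, is an illustrative assumption rather than a real STM API, and the per-word locking that a real STM uses to make commit atomic is elided.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Each shared word carries a version stamp (a simplification of the
       per-location versioned locks a real STM would use). */
    struct stm_word { unsigned long value; unsigned long version; };

    atomic_ulong global_clock;               /* the global version clock */

    struct rentry { struct stm_word *w; unsigned long seen_version; };
    struct wentry { struct stm_word *w; unsigned long new_value; };

    struct stm_tx {
        struct rentry rlog[64]; int nreads;  /* read log: versions we depend on */
        struct wentry wlog[64]; int nwrites; /* write log: buffered updates */
    };

    void stm_begin(struct stm_tx *tx) { tx->nreads = tx->nwrites = 0; }

    unsigned long stm_read(struct stm_tx *tx, struct stm_word *w)
    {
        tx->rlog[tx->nreads++] = (struct rentry){ w, w->version };
        return w->value;                     /* memory itself is untouched */
    }

    void stm_write(struct stm_tx *tx, struct stm_word *w, unsigned long v)
    {
        tx->wlog[tx->nwrites++] = (struct wentry){ w, v };  /* buffer in the log */
    }

    /* Commit: if any word we read was re-versioned by a conflicting
       transaction, fail; otherwise take a new version from the global
       clock and propagate the write log into memory. */
    bool stm_commit(struct stm_tx *tx)
    {
        for (int i = 0; i < tx->nreads; i++)
            if (tx->rlog[i].w->version != tx->rlog[i].seen_version)
                return false;                /* conflict detected: caller retries */
        unsigned long cv = atomic_fetch_add(&global_clock, 1) + 1;
        for (int i = 0; i < tx->nwrites; i++) {
            tx->wlog[i].w->value = tx->wlog[i].new_value;
            tx->wlog[i].w->version = cv;     /* stamp with the commit version */
        }
        return true;
    }

A transaction is then a retry loop: stm_begin, a series of stm_read/stm_write calls, and a retry if stm_commit returns false, which is exactly the structure of the TM stack code above.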