Locking

* The most basic type of lock is the spinlock. The kernel can build other types of locks (e.g., mutexes that put threads to sleep while they wait) on top of the basic spinlock. We will now see how spinlocks are implemented.

* Atomic instructions available in modern CPUs (not all are supported on all architectures):
  test_and_set(var &x): writes 1 to variable x and returns its previous value.
  swap(var &x, value v): replaces the contents of x with v and returns the previous value of x. Equivalent to the xchg instruction in x86 that we have seen earlier.
  fetch_and_add(var &x, value v): replaces the contents of x with (old value of x + v) and returns the old value of x. Similar instructions exist for other operations as well.
  compare_and_swap(var &x, value old, value new), or CAS: if(x == old) {x = new; return true;} else return false;

* The above atomic instructions can be used to implement the following simple test-and-set family of locks.

- Test and Set Lock (TAS)
  int lock = 0;
  Acquire: while(test_and_set(&lock) == 1);
  Release: lock = 0;
  Problem with this lock: all threads trying to acquire the lock keep updating the same memory location over the system bus, leading to cache invalidations.

- Test and Test and Set Lock (TTAS): this lock executes the atomic instruction only when a normal memory read indicates that the lock is free. The normal memory read spins on a local cached copy and does not generate bus traffic until the lock is released. However, after the lock is released, significant bus traffic is still generated.
  Acquire: while(1) {
             while(lock == 1); //regular memory read
             if(test_and_set(&lock) == 0) return; //exit the loop, lock acquired
           }

- Test and Set Lock with exponential backoff: if we failed to acquire the lock in the atomic instruction above, it indicates high contention, so the thread goes to sleep for a certain duration. This duration keeps increasing exponentially with every failure to acquire the lock.
  Acquire: while(1) {
             while(lock == 1); //regular memory read
             if(test_and_set(&lock) == 0) return; //exit the loop, lock acquired
             else exponential_backoff();
           }
  But how long to sleep? What if threads sleep too much? What if some threads sleep much more than others, leading to starvation?

* The above locks do not guarantee any fairness or FIFO order. To fix this, ticket locks are used.
  int next_ticket = 0;
  int now_serving = 0;
  Acquire: my_ticket = fetch_and_increment(&next_ticket); //i.e., fetch_and_add(&next_ticket, 1)
           while(now_serving != my_ticket); //busy wait
  Release: now_serving++;
  The busy spinning can be replaced with sleeping, where the sleep duration is based on the distance between my_ticket and now_serving. Note that only one atomic operation happens per acquire, so the amount of bus traffic generated is much lower than with the test-and-set lock. However, all contending threads read the same variable, so when the lock is released, many cached copies have to be invalidated. So, while this is more scalable than TAS, it still generates significant cache coherence traffic.

* All the above locks are non-scalable: as the number of contending cores increases, the overhead of acquiring/releasing the lock also increases due to increased cache coherence traffic. To fix this, scalable locks have been proposed, in which contending threads spin on separate memory locations. A popular example of a scalable lock is the MCS lock.

* MCS locks: every contending thread creates a qnode variable, local to itself. A qnode consists of a locked flag and a pointer to the next qnode in the list. The lock is a linked list of waiting qnodes, represented by a tail pointer to the last qnode in the list; the qnode at the head of the list belongs to the thread currently holding the lock. To acquire the lock, a thread atomically appends its qnode to the end of the list and spins as long as its own locked flag is true. To release the lock, the thread sets the locked flag of its next qnode to false, thereby waking it up.
  Acquire(lock, qnode):
    qnode.next = NULL;
    prev_tail = swap(lock.tail, qnode); //atomically append to the list
    if(prev_tail != NULL) {
      qnode.locked = true;
      prev_tail.next = qnode;
      while(qnode.locked); //busy wait, until the previous thread sets it to false
    }
  Release(lock, qnode):
    if(qnode.next != NULL)
      qnode.next.locked = false;
  However, the above release code has a race condition. What if some other thread has added itself to the list behind the releasing thread, and has updated tail but not yet the next pointer? The code below fixes this.
  Release(lock, qnode):
    if(qnode.next == NULL) //I am possibly the last in the list
      if(compare_and_swap(lock.tail, qnode, NULL)) //tail pointed to me, and is now set to NULL
        return; //lock released, queue is empty
      else //my next pointer is NULL, but tail doesn't point to me: someone is in the middle of an update
        while(qnode.next == NULL); //wait until the other update finishes
    //at this point, qnode.next is not NULL, so wake up the next thread
    qnode.next.locked = false;
---------
* What about software-based locking? One popular algorithm is Peterson's algorithm. Such algorithms no longer work correctly on modern multicore systems, but they are still useful for understanding the concept of mutual exclusion. Below is an intuitive way of understanding Peterson's algorithm.

* Suppose two threads (self and other) are trying to acquire a lock. Consider the following version of the locking algorithm.
  Acquire: flag[self] = true;
           while(flag[other] == true); //busy wait
  Release: flag[self] = false;
  This algorithm works fine when the executions of the two threads do not overlap. However, if both threads set their flags before either checks the other's, both spin forever.

* Now consider another locking algorithm.
  Acquire: turn = other;
           while(turn == other); //busy wait
  This algorithm works fine when the executions of the two threads overlap.

* So we get a complete algorithm (that works whether or not the executions overlap) by putting the above two together.
  Acquire: flag[self] = true;
           turn = other;
           while(flag[other] == true && turn == other); //busy wait
  Release: flag[self] = false;

* Peterson's algorithm works correctly only if all the writes to the flags and turn happen atomically and become visible to both threads in the same order. Modern multicore systems reorder stores, and the algorithm is not meant to work correctly in such cases. Modern systems instead implement locks using atomic hardware instructions.