From 682e57fd3af8decaf230139f58429fa76f9ec5a0 Mon Sep 17 00:00:00 2001
From: Nadav Har'El <nyh@cloudius-systems.com>
Date: Sat, 25 May 2013 18:05:20 +0300
Subject: [PATCH] todo: add todo/mutex

Things we still need to do to use the lockfree mutex
---
 todo/mutex | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)
 create mode 100644 todo/mutex

diff --git a/todo/mutex b/todo/mutex
new file mode 100644
index 000000000..b1f2eb4b5
--- /dev/null
+++ b/todo/mutex
@@ -0,0 +1,68 @@
+Replace spinlock-based mutex by lockfree mutex
+==============================================
+
+<lockfree/mutex.hh> seems functional, but to replace the spinlock-based mutex
+with it, we should do the following:
+
+1. Make the structure smaller
+-----------------------------
+It can be 36 bytes if we also make condvar (which contains a mutex) smaller;
+if we get it down to 32 bytes, condvar does not need to change at all.
+
+Some ideas on how to make the structure smaller:
+1. Make sequence, handoff, and/or depth 16-bit (in osv/mutex.h depth is
+   already 16-bit).
+2. Make queue_mpsc's two pointers 32-bit. On thread creation, give each
+   thread a 32-bit pointer (or a recycled thread_id - see below - indexing
+   a global array) which can be used instead of putting the wait struct on
+   the stack. Or perhaps we can put all stacks in the low 32 bit?
+3. Do the same to the two pointers in condvar to make condvar smaller too.
+4. Have a new recycled low (32-bit) numeric "threadid" and use it for owner
+   instead of a 64-bit pointer.
+
+2. More testing
+---------------
+
+Write more tests for the lockfree mutex. The most difficult part of the
+algorithm, the "handoff", happens only when the queue is empty, so the
+best chance to see this in action would probably be to test with only two
+pinned threads.
+
+3. Memory ordering
+------------------
+
+Using sequentially consistent ordering for every atomic operation is
+definitely not needed, and it significantly slows down the mutex. I started
+relaxing the memory ordering and saw a significant improvement in the
+uncontended case, but this work still needs to be completed.
+
+4. Benchmark
+------------
+
+Write a benchmark for the uncontended case (done) and for some sort of
+contended case, and compare the performance to that of the old spinlock
+and the spinlock-based mutex.
+
+5. Clean up the code
+--------------------
+
+Don't put everything in the .h; see how much of the code we can move to
+the .cc without hurting performance.
+
+Also make the lockfree mutex usable from C. Consider whether we can do this
+with the same type, as we did in <osv/mutex.h>. Perhaps we'll need to switch
+from using the atomic<int> type to using just int and global std::atomic
+functions.
+
+6. "Fishy" things to look at again
+----------------------------------
+
+Think about - and *test* - the issue of spurious wake() calls coming from
+other code. Replace the "lock guard" with an explicit prepare_wait(), and
+later replace the schedule() call with a loop that does a new prepare_wait()
+every time schedule() returns while we are still not the owner.
+
+Think about and test: write a "half lock" which increments count but doesn't
+add anything to the queue. This forces every lock()/unlock() to use the
+handoff protocol, allowing us to 1. test it, and 2. see how much performance
+drops. Also consider the interesting theoretical problem: why should one
+uncompleted (hung) lock slow down all subsequent lock()/unlock() calls?
+Can't there be a better way?
-- 
GitLab