Changes in v3, following Avi's review:

* Use WITH_LOCK(migration_lock) instead of migrate_disable()/enable().
* Make the global RCU "generation" counter a static class variable, instead of a static function variable. Rename it "next_generation" (the name "generation" was grossly overloaded previously).
* In rcu_synchronize(), use migration_lock to be sure we wake up the thread to which we just added work.
* Use thread_handle, instead of thread*, for percpu_quiescent_state_thread. This is safer (an atomic variable, so we can't see it half-set on some esoteric CPU) and cleaner (no need to check t != 0). thread_handle is a bit of an overkill here, but this is not a performance-sensitive area.

The existing rcu_defer() used a global list of deferred work, protected by a global mutex. It also woke up the cleanup thread on every call. These decisions made rcu_dispose() noticeably slower than a regular delete, to the point that when commit 70502950 introduced an rcu_dispose() into every poll() call, we saw the performance of UDP memcached, which calls poll() on every request, drop by as much as 40%.

The slowness of rcu_defer() was even more apparent in an artificial benchmark which repeatedly calls new and rcu_dispose from one or several concurrent threads. While on my machine a new/delete pair takes 24 ns, a new/rcu_dispose from a single thread (on a 4-cpu VM) takes a whopping 330 ns, and worse: when we have 4 threads on 4 cpus in a tight new/rcu_dispose loop, the mutex contention, the fact that we free the memory on the "wrong" cpu, and the excessive context switches all bring the measurement to as much as 12,000 ns. With this patch the new/rcu_dispose numbers are down to 60 ns on a single thread (on 4 cpus) and 111 ns on 4 concurrent threads (on 4 cpus). This is a x5.5 - x120 speedup :-)

This patch replaces the single list of functions with a per-cpu list. rcu_defer() can add more callbacks to this per-cpu list without a mutex, and instead of a single "garbage collection" thread running these callbacks, the per-cpu RCU thread, which we already had, is the one that runs the work deferred on this cpu's list. This per-cpu work is particularly effective for free() work (i.e., rcu_dispose()), because it is faster to free memory on the same CPU where it was allocated. This patch also eliminates the single "garbage collection" thread which the previous code needed.

The per-CPU work queue has a fixed size, currently set to 2000 functions. It is actually a double buffer, so we can continue to accumulate more work while cleaning up; if rcu_defer() is used so quickly that it outpaces the cleanup, rcu_defer() will wait until the buffer is no longer full. The choice of buffer size is a tradeoff between speed and memory: a larger buffer means fewer context switches (between the thread doing rcu_defer() and the RCU thread doing the cleanup), but also more memory temporarily held by unfreed objects.
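To make the scheme concrete, here is a rough sketch of such a per-cpu, double-buffered work queue. It is only an illustration of the idea described above, not the patch's actual code: percpu_deferred, wake_rcu_thread(), wait_for_buffer_space() and wait_for_grace_period() are hypothetical helpers standing in for the real per-cpu and quiescent-state machinery.

    // Sketch only: per-cpu, double-buffered deferred-work queue.
    constexpr size_t max_callbacks = 2000;            // fixed per-cpu capacity

    struct deferred_work {
        std::function<void()> buf[2][max_callbacks];  // double buffer
        size_t n[2] = {0, 0};
        int active = 0;           // half currently being filled by rcu_defer()
    };

    void rcu_defer(std::function<void()> func)
    {
        WITH_LOCK(migration_lock) {                   // stay on the current cpu
            auto& dw = *percpu_deferred;              // this cpu's queue (hypothetical)
            while (dw.n[dw.active] == max_callbacks) {
                wake_rcu_thread();                    // buffer full: wake the cleanup
                wait_for_buffer_space(dw);            // and wait for it to drain
            }
            dw.buf[dw.active][dw.n[dw.active]++] = std::move(func);
        }
    }

    // Run by this cpu's RCU thread: swap the halves so rcu_defer() can keep
    // accumulating work, wait for all cpus to pass a quiescent state, then run
    // the callbacks locally - freeing memory on the cpu that allocated it.
    void drain_deferred(deferred_work& dw)
    {
        int prev = dw.active;
        dw.active = 1 - prev;
        wait_for_grace_period();
        for (size_t i = 0; i < dw.n[prev]; ++i) {
            dw.buf[prev][i]();
        }
        dw.n[prev] = 0;
    }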
Unlike the previous code, we do not wake up the cleanup thread after every rcu_defer(). When the RCU cleanup work is frequent but still small relative to the main work of the application (e.g., a memcached server), the RCU cleanup thread would always have a low runtime, which meant we suffered a context switch on almost every wakeup of this thread by rcu_defer(). In this patch, we only wake up the cleanup thread when the buffer becomes full, so we have far fewer context switches. This means that currently rcu_defer() may delay the cleanup for an unbounded amount of time. This is normally not a problem, and when it is, namely in rcu_synchronize(), we wake up the thread immediately.
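For illustration, a sketch of how rcu_synchronize() can force that immediate wakeup. The semaphore-based wait and wake_rcu_thread_on_this_cpu() are assumptions here, not the verbatim implementation; the point is that the migration lock ties the deferred callback and the woken RCU thread to the same cpu.

    // Sketch only: force an immediate wakeup instead of waiting for the
    // per-cpu buffer to fill.
    void rcu_synchronize()
    {
        semaphore s{0};
        WITH_LOCK(migration_lock) {
            // Holding the migration lock keeps us on one cpu, so the queue we
            // just appended to and the RCU thread we wake belong to that cpu.
            rcu_defer([&] { s.post(); });
            wake_rcu_thread_on_this_cpu();   // hypothetical: wake without waiting
        }
        s.wait();    // returns only after a full grace period has elapsed
    }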
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>