Changes in v3, following Avi's review:

* Use WITH_LOCK(migration_lock) instead of migrate_disable()/enable().
* Make the global RCU "generation" counter a static class variable, instead of a static function variable. Rename it "next_generation" (the name "generation" was grossly overloaded previously).
* In rcu_synchronize(), use migration_lock to be sure we wake up the thread to which we just added work.
* Use thread_handle, instead of thread*, for percpu_quiescent_state_thread. This is safer (an atomic variable, so we can't see it half-set on some esoteric CPU) and cleaner (no need to check t != 0). thread_handle is a bit of an overkill here, but this is not a performance-sensitive area.

The existing rcu_defer() used a global list of deferred work, protected by a global mutex. It also woke up the cleanup thread on every call. These decisions made rcu_dispose() noticeably slower than a regular delete, to the point that when commit 70502950 introduced an rcu_dispose() into every poll() call, we saw the performance of UDP memcached, which calls poll() on every request, drop by as much as 40%.

The slowness of rcu_defer() was even more apparent in an artificial benchmark which repeatedly calls new and rcu_dispose from one or several concurrent threads. While on my machine a new/delete pair takes 24 ns, a new/rcu_dispose from a single thread (on a 4-cpu VM) takes a whopping 330 ns, and worse: when we have 4 threads on 4 cpus in a tight new/rcu_dispose loop, the mutex contention, the fact that we free the memory on the "wrong" cpu, and the excessive context switches all bring the measurement to as much as 12,000 ns. With this patch the new/rcu_dispose numbers are down to 60 ns on a single thread (on 4 cpus) and 111 ns on 4 concurrent threads (on 4 cpus). This is a x5.5 - x120 speedup :-)

This patch replaces the single list of functions with a per-cpu list. rcu_defer() can add more callbacks to this per-cpu list without a mutex, and instead of a single "garbage collection" thread running these callbacks, the per-cpu RCU thread, which we already had, is the one that runs the work deferred on this cpu's list. This per-cpu work is particularly effective for free() work (i.e., rcu_dispose()), because it is faster to free memory on the same CPU where it was allocated. This patch also eliminates the single "garbage collection" thread which the previous code needed.

The per-CPU work queue has a fixed size, currently set to 2000 functions. It is actually a double buffer, so we can continue to accumulate more work while cleaning up; if rcu_defer() is used so quickly that it outpaces the cleanup, rcu_defer() will wait until the buffer is no longer full. The choice of buffer size is a tradeoff between speed and memory: a larger buffer means fewer context switches (between the thread doing rcu_defer() and the RCU thread doing the cleanup), but also more memory temporarily held by unfreed objects.
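To make the scheme concrete, here is a rough sketch of such a per-cpu, double-buffered work queue. It is only an illustration of the idea described above, not the patch's actual code: percpu_deferred, wake_rcu_thread(), wait_for_buffer_space() and wait_for_grace_period() are hypothetical helpers standing in for the real per-cpu and quiescent-state machinery.

    // Sketch only: per-cpu, double-buffered deferred-work queue.
    constexpr size_t max_callbacks = 2000;            // fixed per-cpu capacity

    struct deferred_work {
        std::function<void()> buf[2][max_callbacks];  // double buffer
        size_t n[2] = {0, 0};
        int active = 0;           // half currently being filled by rcu_defer()
    };

    void rcu_defer(std::function<void()> func)
    {
        WITH_LOCK(migration_lock) {                   // stay on the current cpu
            auto& dw = *percpu_deferred;              // this cpu's queue (hypothetical)
            while (dw.n[dw.active] == max_callbacks) {
                wake_rcu_thread();                    // buffer full: wake the cleanup
                wait_for_buffer_space(dw);            // and wait for it to drain
            }
            dw.buf[dw.active][dw.n[dw.active]++] = std::move(func);
        }
    }

    // Run by this cpu's RCU thread: swap the halves so rcu_defer() can keep
    // accumulating work, wait for all cpus to pass a quiescent state, then run
    // the callbacks locally - freeing memory on the cpu that allocated it.
    void drain_deferred(deferred_work& dw)
    {
        int prev = dw.active;
        dw.active = 1 - prev;
        wait_for_grace_period();
        for (size_t i = 0; i < dw.n[prev]; ++i) {
            dw.buf[prev][i]();
        }
        dw.n[prev] = 0;
    }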
Unlike the previous code, we do not wake up the cleanup thread after every rcu_defer(). When the RCU cleanup work is frequent but still small relative to the main work of the application (e.g., a memcached server), the RCU cleanup thread would always have a low runtime, which meant we suffered a context switch on almost every wakeup of this thread by rcu_defer(). In this patch, we only wake up the cleanup thread when the buffer becomes full, so we have far fewer context switches. This means that currently rcu_defer() may delay the cleanup for an unbounded amount of time. This is normally not a problem, and when it is, namely in rcu_synchronize(), we wake up the thread immediately.
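For illustration, a sketch of how rcu_synchronize() can force that immediate wakeup. The semaphore-based wait and wake_rcu_thread_on_this_cpu() are assumptions here, not the verbatim implementation; the point is that the migration lock ties the deferred callback and the woken RCU thread to the same cpu.

    // Sketch only: force an immediate wakeup instead of waiting for the
    // per-cpu buffer to fill.
    void rcu_synchronize()
    {
        semaphore s{0};
        WITH_LOCK(migration_lock) {
            // Holding the migration lock keeps us on one cpu, so the queue we
            // just appended to and the RCU thread we wake belong to that cpu.
            rcu_defer([&] { s.post(); });
            wake_rcu_thread_on_this_cpu();   // hypothetical: wake without waiting
        }
        s.wait();    // returns only after a full grace period has elapsed
    }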
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>