    commit e5fc1f1b
    v3 RCU: Per-CPU rcu_defer()
    Nadav Har'El authored
    
    Changes in v3, following Avi's review:
    * Use WITH_LOCK(migration_lock) instead of migrate_disable()/enable().
    * Make the global RCU "generation" counter a static class variable,
      instead of static function variable. Rename it "next_generation"
      (the name "generation" was grossly overloaded previously)
    * In rcu_synchronize(), use migration_lock to be sure we wake up the
      thread to which we just added work.
    * Use thread_handle, instead of thread*, for percpu_quiescent_state_thread.
      This is safer (atomic variable, so we can't see it half-set on some
      esoteric CPU), and cleaner (no need to check t!=0). Thread_handle is
      a bit of an overkill here, but it's not in a performance sensitive area.
    
    The existing rcu_defer() used a global list of deferred work, protected by
    a global mutex. It also woke up the cleanup thread on every call. These
    decisions made rcu_dispose() noticeably slower than a regular delete, to the
    point that when commit 70502950 introduced
    an rcu_dispose() to every poll() call, we saw performance of UDP memcached,
    which calls poll() on every request, drop by as much as 40%.
    
    The slowness of rcu_defer() was even more apparent in an artificial benchmark
    which repeatedly calls new and rcu_dispose from one or several concurrent
    threads. While on my machine a new/delete pair takes 24 ns, a new/rcu_dispose
    from a single thread (on a 4-cpu VM) takes a whopping 330 ns, and worse -
    when we have 4 threads on 4 cpus in a tight new/rcu_dispose loop, the mutex
    contention, the fact we free the memory on the "wrong" cpu, and the excessive
    context switches all bring the measurement to as much as 12,000 ns.
    
    With this patch the new/rcu_dispose numbers are down to 60 ns on a single
    thread (on 4 cpus) and 111 ns on 4 concurrent threads (on 4 cpus). This is
    a 5.5x-120x speedup :-)
    
    This patch replaces the single list of functions with a per-cpu list.
    rcu_defer() can add more callbacks to this per-cpu list without a mutex,
    and instead of a single "garbage collection" thread running these callbacks,
    the per-cpu RCU thread, which we already had, is the one that runs the work
    deferred on this cpu's list. This per-cpu work is particularly effective
    for free() work (i.e., rcu_dispose()) because it is faster to free memory
    on the same CPU where it was allocated. This patch also eliminates the
    single "garbage collection" thread which the previous code needed.
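The per-CPU scheme can be sketched in a few lines. This is a minimal standalone illustration, not the OSv code: the names (rcu_defer_on, run_deferred_on) and the explicit cpu parameter are invented for this sketch, whereas in the real patch the current CPU is implicit and pinned with WITH_LOCK(migration_lock). The point it shows is why no mutex is needed: each list is only touched by its own CPU's rcu_defer() calls and its own RCU thread.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical sketch: one deferred-work list per CPU. rcu_defer()
// appends only to the list of the CPU it runs on (migration disabled),
// so the list has a single producer and needs no mutex.
constexpr int ncpus = 4;
std::vector<std::function<void()>> percpu_deferred[ncpus];

void rcu_defer_on(int cpu, std::function<void()> func) {
    // In the real patch "cpu" is always the current CPU; it is an
    // explicit parameter here only so the sketch is runnable.
    percpu_deferred[cpu].push_back(std::move(func));
}

int run_deferred_on(int cpu) {
    // Done by that CPU's RCU thread once a grace period has elapsed;
    // runs and discards everything deferred on this CPU.
    auto& list = percpu_deferred[cpu];
    int n = list.size();
    for (auto& f : list) {
        f();
    }
    list.clear();
    return n;
}
```

Running the freeing work on the CPU that deferred it is what makes rcu_dispose() cheap: the memory being freed was typically allocated on that same CPU.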
    
    The per-CPU work queue has a fixed size, currently set to 2000 functions.
    It is actually a double-buffer, so we can continue to accumulate more work
    while cleaning up; if rcu_defer() is used so quickly that it outpaces the
    cleanup, rcu_defer() will wait until the buffer is no longer full.
    The choice of buffer size is a tradeoff between speed and memory: a larger
    buffer means fewer context switches (between the thread doing rcu_defer()
    and the RCU thread doing the cleanup), but also more memory temporarily
    being used by unfreed objects.
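The double-buffer idea can be sketched as follows. This is an illustrative single-threaded skeleton under assumed names (deferred_work_buffer, try_defer, collect), not the patch's actual data structure: new work goes into the active buffer while collect() swaps buffers and drains the old one, so deferring and cleaning up can overlap.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch of a bounded double-buffer of deferred callbacks.
class deferred_work_buffer {
public:
    static constexpr size_t max_size = 2000; // same size as in the patch
    // Returns false when the active buffer is full; the caller would
    // then wake the cleanup thread and wait, as the commit describes.
    bool try_defer(std::function<void()> func) {
        auto& buf = buffers_[active_];
        if (buf.size() >= max_size) {
            return false;
        }
        buf.push_back(std::move(func));
        return true;
    }
    // Called by the per-CPU RCU thread: swap buffers, then run the
    // callbacks accumulated in the previously active buffer while new
    // rcu_defer() calls keep filling the other one.
    size_t collect() {
        auto& done = buffers_[active_];
        active_ ^= 1;
        size_t n = done.size();
        for (auto& f : done) {
            f();
        }
        done.clear();
        return n;
    }
private:
    std::vector<std::function<void()>> buffers_[2];
    int active_ = 0;
};
```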
    
    Unlike the previous code, we do not wake up the cleanup thread after
    every rcu_defer(). When the RCU cleanup work is frequent but still small
    relative to the main work of the application (e.g., memcached server),
    the RCU cleanup thread would always have low runtime which meant we suffered
    a context switch on almost every wakeup of this thread by rcu_defer().
    In this patch, we only wake up the cleanup thread when the buffer becomes
    full, so we have far fewer context switches. This means that currently
    rcu_defer() may delay the cleanup an unbounded amount of time. This is
    normally not a problem, and when it is, namely in rcu_synchronize(),
    we wake up the thread immediately.
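The wake-up policy above can be sketched with a condition variable. This standalone sketch (class and method names lazy_cleanup, defer, flush, run_batch are invented) uses a mutex to coordinate its two threads, unlike the mutex-free per-CPU path in the patch; what it demonstrates is only the policy: notify the cleanup thread when the buffer fills, plus an immediate wake-up for callers like rcu_synchronize() that cannot tolerate the delay.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch of "wake only when full" deferred cleanup.
class lazy_cleanup {
public:
    explicit lazy_cleanup(size_t capacity) : capacity_(capacity) {}
    void defer(std::function<void()> f) {
        std::unique_lock<std::mutex> lk(mtx_);
        work_.push_back(std::move(f));
        if (work_.size() >= capacity_) {
            cv_.notify_one(); // wake only on a full buffer:
        }                     // far fewer context switches
    }
    void flush() {
        // Immediate wake-up, as rcu_synchronize() needs.
        std::unique_lock<std::mutex> lk(mtx_);
        flush_requested_ = true;
        cv_.notify_one();
    }
    // Body of the cleanup thread: waits for a full buffer (or a flush),
    // runs one batch of callbacks, and returns the batch size.
    size_t run_batch() {
        std::vector<std::function<void()>> batch;
        {
            std::unique_lock<std::mutex> lk(mtx_);
            cv_.wait(lk, [&] {
                return work_.size() >= capacity_ || flush_requested_;
            });
            flush_requested_ = false;
            batch.swap(work_);
        }
        for (auto& f : batch) {
            f();
        }
        return batch.size();
    }
private:
    std::mutex mtx_;
    std::condition_variable cv_;
    std::vector<std::function<void()>> work_;
    size_t capacity_;
    bool flush_requested_ = false;
};
```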
    
    Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
    Signed-off-by: Avi Kivity <avi@cloudius-systems.com>