  1. Apr 02, 2014
    • v3 RCU: Per-CPU rcu_defer() · e5fc1f1b
      Nadav Har'El authored
      
      Changes in v3, following Avi's review:
      * Use WITH_LOCK(migration_lock) instead of migrate_disable()/enable().
      * Make the global RCU "generation" counter a static class variable,
        instead of a static function variable. Rename it "next_generation"
        (the name "generation" was grossly overloaded previously).
      * In rcu_synchronize(), use migration_lock to be sure we wake up the
        thread to which we just added work.
      * Use thread_handle, instead of thread*, for percpu_quiescent_state_thread.
        This is safer (atomic variable, so we can't see it half-set on some
        esoteric CPU) and cleaner (no need to check t!=0). thread_handle is
        a bit of overkill here, but it's not in a performance-sensitive area.
      
      The existing rcu_defer() used a global list of deferred work, protected by
      a global mutex. It also woke up the cleanup thread on every call. These
      decisions made rcu_dispose() noticeably slower than a regular delete, to the
      point that when commit 70502950 introduced
      an rcu_dispose() to every poll() call, we saw performance of UDP memcached,
      which calls poll() on every request, drop by as much as 40%.
      
      The slowness of rcu_defer() was even more apparent in an artificial benchmark
      which repeatedly calls new and rcu_dispose from one or several concurrent
      threads. While on my machine a new/delete pair takes 24 ns, a new/rcu_dispose
      from a single thread (on a 4-CPU VM) takes a whopping 330 ns, and worse:
      when we have 4 threads on 4 CPUs in a tight new/rcu_dispose loop, the mutex
      contention, the fact that we free the memory on the "wrong" CPU, and the
      excessive context switches all bring the measurement to as much as 12,000 ns.
      
      With this patch the new/rcu_dispose numbers are down to 60 ns on a single
      thread (on 4 CPUs) and 111 ns on 4 concurrent threads (on 4 CPUs). This is
      a 5.5x to 120x speedup :-)
      
      This patch replaces the single list of functions with a per-cpu list.
      rcu_defer() can add more callbacks to this per-cpu list without a mutex,
      and instead of a single "garbage collection" thread running these callbacks,
      the per-cpu RCU thread, which we already had, is the one that runs the work
      deferred on this cpu's list. This per-cpu work is particularly effective
      for free() work (i.e., rcu_dispose()) because it is faster to free memory
      on the same CPU where it was allocated. This patch also eliminates the
      single "garbage collection" thread which the previous code needed.
      
      The per-CPU work queue has a fixed size, currently set to 2000 functions.
      It is actually a double-buffer, so we can continue to accumulate more work
      while cleaning up; if rcu_defer() is used so quickly that it outpaces the
      cleanup, rcu_defer() will wait until the buffer is no longer full.
      The choice of buffer size is a tradeoff between speed and memory: a larger
      buffer means fewer context switches (between the thread doing rcu_defer()
      and the RCU thread doing the cleanup), but also more memory temporarily
      being used by unfreed objects.
      
      Unlike the previous code, we do not wake up the cleanup thread after
      every rcu_defer(). When the RCU cleanup work is frequent but still small
      relative to the main work of the application (e.g., memcached server),
      the RCU cleanup thread would always have low runtime, which meant we suffered
      a context switch on almost every wakeup of this thread by rcu_defer().
      In this patch, we only wake up the cleanup thread when the buffer becomes
      full, so we have far fewer context switches. This means that currently
      rcu_defer() may delay the cleanup an unbounded amount of time. This is
      normally not a problem, and when it is, namely in rcu_synchronize(),
      we wake up the thread immediately.
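
      The mechanism described above can be illustrated with a minimal,
      self-contained C++ sketch (not OSv's actual code; the class and method
      names are invented): a double-buffered queue of deferred functions whose
      cleanup thread is woken only when the active buffer fills up, or when
      flush() is called, mirroring rcu_synchronize(). In OSv the producer side
      is per-CPU and needs no lock; the mutex below merely keeps this portable
      sketch correct.

        #include <condition_variable>
        #include <cstddef>
        #include <functional>
        #include <mutex>
        #include <thread>
        #include <vector>

        // Hypothetical stand-in for one CPU's deferred-work queue.
        class deferred_queue {
            static constexpr std::size_t max_size = 2000;  // same order as above
            std::vector<std::function<void()>> bufs[2];    // double buffer
            int active = 0;                                // buffer defer() fills
            bool stop = false;
            std::mutex mtx;              // stands in for "per-CPU, so no lock"
            std::condition_variable work_ready, space_free;
            std::thread cleaner{[this] { run(); }};        // per-CPU "RCU thread"
        public:
            void defer(std::function<void()> fn) {
                std::unique_lock<std::mutex> lk(mtx);
                // If we outpace the cleanup, wait until the buffer has room.
                space_free.wait(lk, [this] { return bufs[active].size() < max_size; });
                bufs[active].push_back(std::move(fn));
                if (bufs[active].size() == max_size) {
                    work_ready.notify_one();   // wake the cleaner only when full
                }
            }
            void flush() {                     // rcu_synchronize()-style wakeup
                std::lock_guard<std::mutex> lk(mtx);
                work_ready.notify_one();
            }
            ~deferred_queue() {
                { std::lock_guard<std::mutex> lk(mtx); stop = true; }
                work_ready.notify_one();
                cleaner.join();
            }
        private:
            void run() {
                std::unique_lock<std::mutex> lk(mtx);
                for (;;) {
                    work_ready.wait(lk, [this] {
                        return stop || !bufs[active].empty();
                    });
                    if (stop && bufs[active].empty()) {
                        return;
                    }
                    int full = active;
                    active = 1 - active;       // defer() keeps filling the other one
                    space_free.notify_all();
                    lk.unlock();
                    for (auto& fn : bufs[full]) {
                        fn();                  // run the deferred work (e.g. frees)
                    }
                    bufs[full].clear();
                    lk.lock();
                }
            }
        };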
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • Merge branch 'async' of https://github.com/tgrabiec/osv · 2271a65b
      Avi Kivity authored
      
      "After net channel merge in commit 2828ef50
      the performance of tomcat benchmark dropped significantly. Investigation
      revealed that the biggest bottleneck was the callout subsystem, which was
      using global mutex to protect its operations. This series improves
      the performance by replacing use of callouts inside the TCP stack with
      a new framework which is supposed to scale better."
      
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
  2. Apr 01, 2014
    • Revert "rcu: Per-CPU rcu_defer()" · 6d68d1ab
      Avi Kivity authored
      
      This reverts commit d24cda2c.  It requires migration_lock to be merged
      first.
      
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • rcu: Per-CPU rcu_defer() · d24cda2c
      Nadav Har'El authored
      
      The existing rcu_defer() used a global list of deferred work, protected by
      a global mutex. It also woke up the cleanup thread on every call. These
      decisions made rcu_dispose() noticeably slower than a regular delete, to the
      point that when commit 70502950 introduced
      an rcu_dispose() to every poll() call, we saw performance of UDP memcached,
      which calls poll() on every request, drop by as much as 40%.
      
      The slowness of rcu_defer() was even more apparent in an artificial benchmark
      which repeatedly calls new and rcu_dispose from one or several concurrent
      threads. While on my machine a new/delete pair takes 24 ns, a new/rcu_dispose
      from a single thread (on a 4-CPU VM) takes a whopping 330 ns, and worse:
      when we have 4 threads on 4 CPUs in a tight new/rcu_dispose loop, the mutex
      contention, the fact that we free the memory on the "wrong" CPU, and the
      excessive context switches all bring the measurement to as much as 12,000 ns.
      
      With this patch the new/rcu_dispose numbers are down to 60 ns on a single
      thread (on 4 CPUs) and 111 ns on 4 concurrent threads (on 4 CPUs). This is
      a 5.5x to 120x speedup :-)
      
      This patch replaces the single list of functions with a per-cpu list.
      rcu_defer() can add more callbacks to this per-cpu list without a mutex,
      and instead of a single "garbage collection" thread running these callbacks,
      the per-cpu RCU thread, which we already had, is the one that runs the work
      deferred on this cpu's list. This per-cpu work is particularly effective
      for free() work (i.e., rcu_dispose()) because it is faster to free memory
      on the same CPU where it was allocated. This patch also eliminates the
      single "garbage collection" thread which the previous code needed.
      
      The per-CPU work queue has a fixed size, currently set to 2000 functions.
      It is actually a double-buffer, so we can continue to accumulate more work
      while cleaning up; if rcu_defer() is used so quickly that it outpaces the
      cleanup, rcu_defer() will wait until the buffer is no longer full.
      The choice of buffer size is a tradeoff between speed and memory: a larger
      buffer means fewer context switches (between the thread doing rcu_defer()
      and the RCU thread doing the cleanup), but also more memory temporarily
      being used by unfreed objects.
      
      Unlike the previous code, we do not wake up the cleanup thread after
      every rcu_defer(). When the RCU cleanup work is frequent but still small
      relative to the main work of the application (e.g., memcached server),
      the RCU cleanup thread would always have low runtime, which meant we suffered
      a context switch on almost every wakeup of this thread by rcu_defer().
      In this patch, we only wake up the cleanup thread when the buffer becomes
      full, so we have far fewer context switches. This means that currently
      rcu_defer() may delay the cleanup an unbounded amount of time. This is
      normally not a problem, and when it is, namely in rcu_synchronize(),
      we wake up the thread immediately.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • net: replace callouts with the new async framework · 782de281
      Tomasz Grabiec authored
      The callout subsystem uses a shared global lock for most of its
      operations. This became a bottleneck after merging net channels.
      The explanation for this phenomenon is that before net channels
      were merged, packet processing on the receive side was done from one
      (virtio) thread and there was no contention on that lock. After the
      merge the packets started to be processed from many CPUs, which made
      taking the lock expensive.
      
      The new framework uses only per-timer locks; the worker is lock-free.
      
      Below are measurements of the improvement. The measurements (both
      before and after) were taken with Nadav's per-CPU rcu_defer()
      improvement applied, because it was also a bottleneck.
      
      The value measured was the HTTP request-response throughput of a tomcat
      server, as reported by the wrk tool. Server and client ran on different
      machines, with 4 vCPUs and 3 GB of guest memory.
      
      === 16 connections ===
      
      Before:
      
        avg = 39272.61
        stdev = 3611.84
      
      After:
      
        avg = 52701.82
        stdev = 3953.76
      
      Improvement: 34%
      
      === 256 connections ===
      
      Before:
      
       avg = 35225.19
       stdev = 2504.27
      
      After:
      
       avg = 50576.67
       stdev = 3533.39
      
      Improvement: 43%
      
      One challenge in integrating the new framework with the TCP stack was a
      proper teardown of timers. The current code assumed that after calling
      callout_cancel() it is safe to free the timer's memory. This was not
      correct because the timer may have already fired and would then try to
      access memory which had been freed. The TCP stack had a workaround for
      this race: each timer checked the inp field of the tcpcb block for
      NULL, which was supposed to indicate that the block had been freed. It
      still was not perfect, though, because the timer may have performed the
      check before the field was nulled out in tcp_discardcb() and then
      blocked on a mutex which would be promptly freed. The solution I went
      for is to delegate the release of memory to an async deferred task,
      which will be executed as soon as possible but in a safe context, in
      which we can wait until all timers are done and then free the
      memory.
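
      The teardown pattern can be sketched in a self-contained way (the names
      below are invented for illustration; this is not the OSv/BSD code): the
      deferred release task marks the timers cancelled, waits for any callback
      that is already running, and only then frees the memory.

        #include <condition_variable>
        #include <mutex>

        // Hypothetical per-connection timer bookkeeping.
        struct cb_timers {
            std::mutex mtx;
            std::condition_variable idle;
            int running = 0;        // callbacks currently executing
            bool cancelled = false;

            // The timer thread wraps the real callback with this.
            template <typename F>
            void fire(F&& callback) {
                {
                    std::lock_guard<std::mutex> lk(mtx);
                    if (cancelled) {
                        return;     // revoked before we started: do nothing
                    }
                    ++running;
                }
                callback();
                {
                    std::lock_guard<std::mutex> lk(mtx);
                    --running;
                }
                idle.notify_all();
            }

            // Runs in the deferred release task, which may sleep, so it can wait.
            void cancel_and_drain() {
                std::unique_lock<std::mutex> lk(mtx);
                cancelled = true;
                idle.wait(lk, [this] { return running == 0; });
            }
        };

        // The deferred task then does, in order:
        //   timers->cancel_and_drain();  // no callback can still be running
        //   delete cb;                   // only now is it safe to free the block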
    • net: convert in_pcb lock from struct mtx to mutex · e70b0276
      Tomasz Grabiec authored
      The new async API accepts a lock of type 'mutex', so I need to convert
      the in_pcb lock type, which will be used to synchronize callbacks.
    • net: remove dead code · a87583d1
      Tomasz Grabiec authored
    • net: add tracepoints for inpcb life cycle · d7fa401b
      Tomasz Grabiec authored
    • core: introduce serial_timer_task · bd179712
      Tomasz Grabiec authored
      This is a wrapper around timer_task which should be used if atomicity of
      callback tasks and timer operations is required. The class accepts an
      external lock to serialize all operations. It provides sufficient
      abstraction to replace callouts in the network stack.
      
      Unfortunately, it requires some cooperation from the callback code
      (see try_fire()). That's because I couldn't extract the in_pcb lock
      acquisition out of the callback code in the TCP stack: there are
      other locks taken before it, and doing so _could_ result in lock-order
      inversion problems and hence deadlocks. If we can prove these to be
      safe then the API could be simplified.
      
      It may also be worthwhile to propagate the lock passed to
      serial_timer_task down to timer_task to save an extra CAS.
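
      As an illustration of that cooperation (only serial_timer_task and
      try_fire() come from this series; the callback shape below is assumed),
      a TCP timer callback takes the same external lock that was handed to the
      wrapper and bails out if try_fire() reports that this invocation is no
      longer current:

        // Hypothetical callback: 'lock' is the external lock given to the
        // serial_timer_task wrapper, here the in_pcb lock mentioned above.
        static void tcp_timer_fired(serial_timer_task& timer, mutex& lock)
        {
            WITH_LOCK(lock) {
                if (!timer.try_fire()) {
                    // Cancelled or rescheduled concurrently; this invocation
                    // must not do the timer's work.
                    return;
                }
                // ... the actual timer work, serialized by 'lock' ...
            }
        }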
    • core: introduce deferred work framework · 34620ff0
      Tomasz Grabiec authored
      The design behind timer_task
      
      timer_task was designed to make cancel() and reschedule() scale well
      with the number of threads and CPUs in the system. These methods may
      be called frequently and from different CPUs. A task scheduled on one
      CPU may be rescheduled later from another CPU. To avoid expensive
      coordination between CPUs, a lock-free per-CPU worker was implemented.
      
      Every CPU has a worker (async_worker) which has a task registry and a
      thread to execute the tasks. Most of the worker's state may only be
      changed from the CPU on which it runs.
      
      When a timer_task is rescheduled it registers its percpu part in the
      current CPU's worker. When it is then rescheduled from another CPU, the
      previous registration is marked as no longer valid and a new percpu part
      is registered. When a percpu task fires it checks whether it is the last
      registration; only then can it fire.
      
      Because timer_task's state is scattered across CPUs, some extra
      housekeeping needs to be done before it can be destroyed.  We need to
      make sure that no percpu task will try to access the timer_task object
      after it is destroyed. To ensure that, we walk the list of
      registrations of the given timer_task and atomically flip their state
      from ACTIVE to RELEASED. If that succeeds, it means the task is now
      revoked and the worker will not try to execute it. If that fails, it
      means the task is in the middle of firing and we need to wait for it to
      finish. When a per-CPU task is moved to the RELEASED state it is
      appended to the worker's queue of released percpu tasks using a
      lock-free mpsc queue. These objects may later be reused for
      registrations.
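
      The revocation step can be sketched with standard atomics (the state
      names come from the description above; the struct and method names are
      invented for illustration): a single compare-and-swap decides whether
      the revoker or the firing worker owns a registration.

        #include <atomic>

        enum class reg_state { ACTIVE, FIRING, RELEASED };

        // Hypothetical per-CPU registration of a timer_task.
        struct percpu_registration {
            std::atomic<reg_state> state{reg_state::ACTIVE};

            // Called while destroying the timer_task. Returns true if the
            // registration was revoked before it fired; false means the worker
            // is firing it right now and the caller must wait for it to finish.
            bool try_revoke() {
                auto expected = reg_state::ACTIVE;
                return state.compare_exchange_strong(expected, reg_state::RELEASED);
            }

            // Called by the worker just before it executes the task.
            bool start_firing() {
                auto expected = reg_state::ACTIVE;
                return state.compare_exchange_strong(expected, reg_state::FIRING);
            }
        };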
    • sched: introduce thread migration lock · 6c8a861d
      Tomasz Grabiec authored
      This can be useful when there's a need to perform operations on
      per-CPU structure(s) which all need to be executed on the same CPU, but
      there is code in between which may sleep (e.g. malloc).
      
      For example, this can be used to ensure that a dynamically allocated
      object is always freed on the same CPU on which it was allocated:
      
        WITH_LOCK(migration_lock) {
          auto _owner = *percpu_owner;
          auto x = new X();
          _owner->enqueue(x);
        }
    • sched: add atomic reset() operation to timer_base · b122a924
      Tomasz Grabiec authored
      It is needed by the new async framework.
    • build-osv-release: OpenJDK/OSv base image · 44d8450a
      Pekka Enberg authored
      
      Add an OpenJDK/OSv base image for developers who want to use Capstan to
      package and run their Java applications.
      
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • build-osv-release: OSv memcached server · 16d13511
      Pekka Enberg authored
      
      This adds our own memcached server to an OSv release that is pushed to
      the Capstan S3 repository.
      
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • x64: fix early halt() · 8192b61a
      Nadav Har'El authored
      
      When halt() is called very early, before smp_launch(), it crashes when
      calling crash_other_processors() because the other processors' IDT was
      not yet set up. For example, in loader.cc's prepare_commands() we call
      abort() when we fail to parse the command line, and this caused a
      crash reported in issue #252.
      
      With this patch, crash_other_processors() does nothing when the other
      processors have not yet been set up. This is normally the case before
      smp_launch(), but note that on a single-vCPU VM it will remain the case
      throughout the run.
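
      A hedged sketch of the guard (the flag name is assumed; the exact
      condition the patch tests is not shown in this log):

        void crash_other_processors()
        {
            // Before smp_launch() the APs have no IDT set up, so sending them
            // an IPI would fault; in that case there is nothing to crash.
            if (!smp_started) {      // hypothetical "APs are up" flag
                return;
            }
            // ... send the crash IPI to the other CPUs, as before ...
        }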
      
      Fixes #252.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>