Skip to content
Snippets Groups Projects
  1. Jun 11, 2013
    • Nadav Har'El's avatar
      Add missing change to lfmutex.cc · d96020e3
      Nadav Har'El authored
      Sorry, forgot one hunk in "git add -p" :(
      d96020e3
    • Nadav Har'El's avatar
      lock-free mutex: change and clarify the role of depth and owner · 99b477dc
      Nadav Har'El authored
      The way "owner" and "depth" were used in lockfree::mutex was messy.
      Ideally, neither should be needed if we implemented a non-recursive
      mutex, but following the design of ::mutex, we (re)used "owner" also as
      a marker that a thread was waken to have the lock (and it's not a
      spurious wake).
      
      After this patch, owner and depth are used in lockfree::mutex *only*
      for implementing a recursive mutex, and building a non-recursive
      mutex should be as simple as dropping these two variables.
      
      In more detail:
      
      1. "owner" is no longer used to tell a woken up thread that the wake
         wasn't spurious. Instead, zero the thread in the wait-record. This
         is a familiar idiom, which we already used a few times before.
      
      2. "depth" isn't an atomic variable, so it should only be read by the
         same thread which set it, and this wasn't the case previously. Now,
         depth is only ever written (set to 1, incremented or decremented)
         or read by the lock-holding thread - and not the lock releasing thread.
      
      3. "owner" needs to be an atomic variable - a non-lock-holding thread
         needs to read it and recognize it isn't holding the lock - but it
         doesn't need any special memory ordering with other variables, so
         should always be accessed with "relaxed" memory ordering.
      99b477dc
    • Nadav Har'El's avatar
      sched::thread - fix very rare join() hang · cf4c46c4
      Nadav Har'El authored
      Fixed a very rare hang in sched::thread::join():
      
      thread::complete() included the following code:
      
          _status.store(status::terminated);
          if (_joiner) {
              _joiner->wake();
          }
      
      If we are preempted right after setting status to "terminated", but
      before calling wake(), this thread will never be scheduled again (it will
      remain in the terminated status forever), and will never call wake() -
      so the join()ing thread may just wait forever.
      
      I saw this happening in a test case that started and joined millions of
      threads, and eventually the join() hangs.
      
      The solution is to enclose the above lines with preempt_disable()/
      preempt_enable().
      cf4c46c4
    • Nadav Har'El's avatar
      wake(): Don't miss a preemption opportunity · aee17ba4
      Nadav Har'El authored
      wake() normally calls schedule(), but doesn't do so if preemption is
      disabled. So we should mark need_reschedule = true, to suggest that
      schedule() can be called when preemption is later enabled.
      aee17ba4
    • Avi Kivity's avatar
      x64: prevent nested exceptions from corrupting the stack · 1dbddc44
      Avi Kivity authored
      Due to the need to handle the x64 red zone, we use a separate stack for
      exceptions via the IST mechanism.  This means that a nested exception will
      reuse the parent exception's stack, corrupting it.  It is usually very hard
      to figure out the root cause when this happens.
      
      Prevent this by setting up a separate stack for nested exceptions, and
      aborting immediately if a nested exception happens.
      1dbddc44
  2. Jun 10, 2013
  3. Jun 09, 2013
    • Guy Zana's avatar
      41360613
    • Guy Zana's avatar
      logger: remove not very useful logger tags · f806a014
      Guy Zana authored
      controlling the logger output in tests is simplistic, it is enough to
      change the severity level to logger_error and all logging messages
      will appear.
      f806a014
    • Guy Zana's avatar
      bf7fe5e7
    • Avi Kivity's avatar
      sched: add tracepoint for preemption events · 76dae193
      Avi Kivity authored
      76dae193
    • Nadav Har'El's avatar
      lock-free mutex: add C API for lockfree::mutex methods · d3c156d8
      Nadav Har'El authored
      Add some extern "C" versions of the lockfree::mutex methods. They will be
      necessary for providing the lockfree::mutex type to C code - as you'll see
      in later patches, C code will see an opaque type, a byte array, and will
      call these functions to operate on it.
      d3c156d8
    • Nadav Har'El's avatar
      lock-free mutex: Avoid including <sched.hh> in <lockfree/mutex.hh> · 874b10ee
      Nadav Har'El authored
      Do not include <sched.hh> in <lockfree/mutex.hh>.
      
      Including <sched.hh> creates annoying dependency loops when we (in a
      later patch) replace <osv/mutex.h> by <lockfree/mutex.hh>, and some header
      files included by <sched.hh> themselves use mutexes, so they include
      <osv/mutex.h>. This last include does nothing (because of the include guard)
      but the compiler never finished reading osv/mutex.h (it was only in its
      beginning, when it included sched.hh) so the inner-included code lacks the
      definitions it assumes after including mutex.h.
      874b10ee
    • Nadav Har'El's avatar
      lockfree mutex: add owned() and getdepth() methods · 1ed9c982
      Nadav Har'El authored
      Add to lockfree::mutex the simple owned() and getdepth() methods which
      existed in ::mutex and were used in a few places - so we need these
      methods to switch from ::mutex to lockfree::mutex.
      1ed9c982
    • Nadav Har'El's avatar
      lockfree mutex: fix wait/wake bug · 9bcf790a
      Nadav Har'El authored
      When I developed lockfree mutex, the scheduler, preemption, and related code
      still had a bunch of bugs, so I resorted to some workarounds that in hindsite
      look unnecessary, and even wrong.
      
      When it seemed that I can only wake() a thread in wait state, I made an
      effort to enter the waiting state (with "wait_guard") before adding the
      thread to the to-awake queue, and then slept with schedule(). The biggest
      problem with this approach was that a spurious wake(), for any reason, of
      this thread, would cause us to end the lock() - and fail on an assert that
      we're the owners of the lock - instead of repeating the wait. When switching
      to lockfree mutex, the sunflow benchmark (for example) would die on this
      assertion failure.
      
      So now I replaced this ugliness with our familiar idiom, wait_until().
      The thread is in running state for some time after entering queue, so
      it might be woken when not yet sleeping and the wake() will be ignored -
      but this is fine because part of our protocol is that the wake() before
      waking also sets "owner" to the to-be-woken thread, and before sleeping
      we check if owner isn't already us.
      
      Also changed the comment on "owner" to say that it is not *just* for
      implementing a recursive mutex, but also nessary for the wakeup protocol.
      9bcf790a
  4. Jun 06, 2013
    • Guy Zana's avatar
      msix: provide high priority handler when registering interrupt · 66066b07
      Guy Zana authored
      we have to disable virio interrupts before msix EOI so disabling
      must be done in the ISR handler context. This patch adds an std::function
      isr to the bindings.
      
      references to the rx and tx queues are saved as well (_rx_queue and _tx_queue),
      so they can be used in the ISR context.
      
      this patch reduces virtio net rx interrupts by a factor of 450.
      66066b07
    • Avi Kivity's avatar
      trace: relax unique ID requirement · aec5d9df
      Avi Kivity authored
      Place the tracepointv type in an anonymous namespace.  This makes every
      translation unit have its own unique tracepoint types, so we only need
      to ensure uniqueness within a source file.
      
      Use the type's type_info to select the correct patch sites.
      
      Idea from Nadav.
      aec5d9df
  5. Jun 05, 2013
    • Avi Kivity's avatar
      trace: improve fast path · b03979d9
      Avi Kivity authored
      When a tracepoint is disabled, we want it to have no impact on running code.
      
      This patch changes the fast path to be a single 5-byte nop instruction.  When
      a tracepoint is enabled, the nop is patched to a jump instruction to the
      out-of-line slow path.
      b03979d9
    • Avi Kivity's avatar
      trace: add unique ID for tracepoints · 0102df29
      Avi Kivity authored
      In order to optimize the fast path of tracepoints, we need to patch
      the call sites to skip calling the slow path code completely.  In turn,
      that requires that each call site be unique -- a separate function.
      
      In the current implementations, tracepoints with the same signature map
      to the same type.  It would have been great to use the name as a discriminant
      (tracepoint<"sched_queue", thread*> trace_sched_queue(...);), but C++ does
      not support string literals as template arguments.
      
      We could do
      
        const char* trace_sched_queue_name = "sched_queue";
        tracepoint<trace_sched_queue_name, thread*> trace_sched_queue(...);
      
      but that doubles the code for declaring a tracepoint.  Add a unique ID instead
      (and code to verify it is unique).
      0102df29
    • Nadav Har'El's avatar
      lockfree::mutex functions should not be inline · cb41801a
      Nadav Har'El authored
      Until now, lockfree::mutex functions were entirely inline, which won't
      fly if we want to make it our default mutex. Change them to be out-of-line,
      implemented in a new source file core/lfmutex.cc.
      
      This has a slight performance impact - uncontended lock/unlock pair used
      to be 17ns, and is now 22ns.
      
      In the future we can get this performance back by making these functions
      partially inline (the uncontended case inline, the waiting case in a
      separate function), although we'll need to consider the speed/code-size
      tradeoff.
      cb41801a
    • Nadav Har'El's avatar
      Loader: show demangled symbol name on lookup failure · b39289cf
      Nadav Har'El authored
      When abort()ing on failed symbol lookup, if this is a C++ symbol, also
      show its demangled form, which in many case can be more helpful.
      Here is an example lookup failure now:
      
      failed looking up symbol _ZN8lockfree5mutex4lockEv (lockfree::mutex::lock())
      Aborted
      b39289cf
  6. Jun 04, 2013
  7. Jun 02, 2013
    • Nadav Har'El's avatar
      Fix condvar_wait() bug · 1b53ec56
      Nadav Har'El authored
      condvar_wait() wrongly dropped the condvar's internal lock too early,
      and accessed wr->t outside the lock, meaning that a concurrent wake()
      could race with it. This bug was exposed in one of the pipe() tests.
      
      This patch fixes this bug, by holding the internal lock throughout
      the execution of condvar_wait(), dropping it temporarily only while
      waiting.
      1b53ec56
  8. May 31, 2013
  9. May 29, 2013
    • Nadav Har'El's avatar
      Say we also implement librt.so.1. · 737f83e9
      Nadav Har'El authored
      Say we also implement librt.so.1. This is required, for example, by the
      Boost libraries (e.g., libboost_system-mt.so.1.50.0). The librt library
      isn't actually a separate library on modern Linux - rather all its
      traditional functions are now in glibc, and "librt" is merely a filter
      on glibc. So no reason not to say we support librt too.
      
      Not to mention that we already implement a bunch of functions that
      traditionally resided in librt (nanosleep, sched_yield, sem_*, etc.
      737f83e9
  10. May 27, 2013
    • Guy Zana's avatar
      debug: introduce debug_ll() and use it in abort() · 6ebb582e
      Guy Zana authored
      the debug() console function is taking a lock before it access the console driver,
      it does that by acquiring a mutex which may sleep.
      
      since we want to be able to debug (and abort) in contexts where it's not possible sleep,
      such as in page_fault, a lockless debug print method is introduced.
      
      previousely to this patch, any abort on page_fault would cause an "endless" recursive
      abort() loop which hanged the system in a peculiar state.
      6ebb582e
  11. May 26, 2013
    • Nadav Har'El's avatar
      Fix comment · 0ad3e2e0
      Nadav Har'El authored
      The comment about unlocking the irq_lock was put on the wrong line.
      Move it (and rephrase it a bit - the word "release" immediately after
      calling an unrelated release() function - is confusing).
      0ad3e2e0
    • Nadav Har'El's avatar
      Fix two bugs in yield() · 19e52ce6
      Nadav Har'El authored
      yield() had two bugs - thanks to Avi for pinpointing them:
      
      1. It used runqueue.push_back() to put the thread on the run queue, but
         push_back() is a low-level function which can only be used if we're
         sure that the item we're pushing has the vruntime suitable for being
         *last* on the queue - and in the code we did nothing to ensure this
         is the case (we should...). So use insert_equal(), not push_back().
      
      2. It was wrongly divided into two separate sections with interrupts
         disabled. The scheduler code is run both at interrupt time (e.g.,
         preempt()) and at thread time (e.g., wait(), yield(), etc.) so to
         guarantee it does not get called in the middle of itself, it needs
         to disable interrupts while working on the (per-cpu) runqueue.
         In the broken yield() code, we disabled interupts while adding the
         current thread to the run queue, and then again to reschedule.
         Between those two critical sections, an interrupt could arrive and
         do something with this thread (e.g., migrate it to another CPU, or
         preempt it), yet when the interrupt returns yield continues to run
         reschedule_from_interrupt which assumes that this thread is still
         running, and definitely not on the run queue.
      
      Bug 2 is what caused the crashes in the tst-yield.so test. The fix is
      to hold the interrupts disabled throughout the entire yield().
      This is easiest done with with lock_guard, which simplifies the flow
      of this function.
      19e52ce6
    • Nadav Har'El's avatar
      sched: avoid unnecessary FPU saving · 947b49ee
      Nadav Har'El authored
      Because of Linux's calling convention, it should not be necessary to
      save the FPU state when a reschedule is caused by a function call.
      
      Because we had a bug and forgot to save the FPU state when calling
      a signal handler, and because this signal handler can cause a reschedule,
      we had to save the FPU on any reschedule. But after fixing that bug, we
      no longer need these unnecessary FPU saves.
      
      The "sunflow" benchmark still runs well after this patch.
      947b49ee
    • Avi Kivity's avatar
      elf: add support for IRELATIVE relocations · e8c62c5e
      Avi Kivity authored
      This are used to support ifunc functions, which are resolved at load-time
      based on cpu features, rather than at link time.
      e8c62c5e
    • Avi Kivity's avatar
      sched: fix preempt_enable() when interrupts are disabled · 84046f23
      Avi Kivity authored
      If interrupts are disabled, we must not call schedule() even if
      the preemption counter says we need to, as the context is not preemption
      safe.
      
      This manifested itself in a wake() within a timer causing a schedule(),
      which re-enabled interrupts, which caused further manipulation of the timer
      list to occur concurrently with the next interrupt, resulting in corruption.
      
      Fixes timer stress test failure.
      84046f23
  12. May 24, 2013
  13. May 22, 2013
    • Nadav Har'El's avatar
      This patch implements two functions: · 121d6a7e
      Nadav Har'El authored
       1. osv::poweroff(), which can turn off a physical machine or in our case
          tell QEMU to quit.
          The implementation uses ACPI, through the ACPICA library.
      
       2. osv::hang(), which ceases all computation on all cores, but does not
          turn off the machine. This can be useful if we want QEMU to remain
          alive for debugging, for example.
      
      The two functions are defined in the new <osv/power.hh> header files, and
      follow the new API guidelines we discussed today: They are C++-only, and
      are in the "osv" namespace.
      121d6a7e
    • Nadav Har'El's avatar
      Added timeout parameter to semaphore::wait() · 7890e39a
      Nadav Har'El authored
      Added a timeout parameter to semaphore::wait(), which defaults to no
      timeout.
      
      semaphore:wait() is now a boolean, just like trywait(), and likewise can
      return false when the semaphore has not actually been decremented but
      rather we had a timeout.
      
      Because we need the mutex again after the wait, I replaced the "with_lock"
      mechanism by the better-looking lock_guard and mutex parameter to
      wait_until.
      7890e39a
    • Nadav Har'El's avatar
      Drasticly lower overhead of leak detection on running program · 92ff9880
      Nadav Har'El authored
      Leak detection (e.g., by running with "--leak") used to have a devastating
      effect on the performance of the checked program, which although was
      tolerable (for leak detection, long runs are often unnecessary), it was
      still annoying.
      
      While before this patch leak-detection runs were roughly 5 times slower
      than regular runs, after this patch they are only about 40% slower than
      a regular run! Read on for the details.
      
      The main reason for this slowness was a simplistic vector which was used to
      keep the records for currently living allocations. This vector was linearly
      searched both for free spots (to remember new allocations) and for specific
      addresses (to forget freed allocations). Because this list often grew to a
      hundred thousand of items, it became incredibly slow and slowed down the
      whole program. For example, getting a prompt from cli.jar happened in 2
      seconds without leak detection, but in 9 seconds with leak detection.
      
      A possible solution would have been to use an O(1) data structure, such as
      a hash table. This would be complicated by our desire to avoid frequent
      memory allocation inside the leak detector, or our general desire to avoid
      complicated stuff in the leak detector because they always end leading to
      complicated deadlocks :-)
      
      This patch uses a different approach, inspired by an idea by Guy.
      
      It still uses an ordinary vector for holding the records, but additionally
      keeps for each record one "next" pointer which is used for maintaining two
      separate lists of records:
      
      1. A list of free records. This allows a finding a record for a new
         allocation in O(1) time.
      
      2. A list of filled records, starting with the most-recently-filled record.
         When we free(), we walk this list and very often finish very quickly,
         because malloc() closely followed by free() are very common.
         Without this list, we had to walk the whole vector filled with ancient
         allocations and even free records, just to find the most recent
         allocation.
      
      Two examples of the performance with and without this patch:
      
      1. Getting a prompt from cli.jar takes 2 seconds without leak detection,
         9 seconds with leak detection before this patch, and 3 seconds with this
         patch.
      
      2. The "sunflow" benchmark runs 53 ops/second without leak detection,
         which went down to 10 ops/second with leak detection before this patch,
         and after this patch - 33 ops/second.
      
      I verified (by commenting out the search algorithm and always using the
      first item in the vector) that the allocation record search is no longer
      having any effect on performance, so it is no longer interesting to replace
      this code with an even more efficient hash table. The remaining slowdown is
      probably due to the backtrace() operation and perhaps also the tracker lock.
      92ff9880
    • Avi Kivity's avatar
      semaphore: convert to an intrusive list · 5c2b2f94
      Avi Kivity authored
      Intrusive lists are faster since they require no allocations.
      5c2b2f94
    • Avi Kivity's avatar
      semaphore: remove indirection accessing internal mutex · b9953a90
      Avi Kivity authored
      Previously, the mutex was stored using a pointer to avoid overflowing
      glibc's sem_t.  Now we no longer have this restriction, drop the indirection.
      b9953a90
    • Avi Kivity's avatar
      semaphore: add support for multiple units in wait() and post() · c240daa3
      Avi Kivity authored
      Use Nadav's idea of iterating over the list and selecting wait records
      that fit the available units.
      c240daa3
    • Avi Kivity's avatar
      747ff478
  14. May 21, 2013
    • Nadav Har'El's avatar
      Add forgotten break · 8ffe45bf
      Nadav Har'El authored
      In the allocation tracker, not only did I use a dog-slow linear search,
      I forgot to stop on the first empty spot, and actually used the last
      empty spot... Add the missing break, which made leak detection 10% faster.
      
      A better implementation would be preferable, but this is a low hanging
      fruit
      8ffe45bf
Loading