  1. Feb 11, 2014
    • epoll: Support epoll()'s EPOLLET · d41d748f
      Nadav Har'El authored
      
      This patch adds support for epoll()'s edge-triggered mode, EPOLLET.
      Fixes #188.
      
      As explained in issue #188, Boost's asio uses EPOLLET heavily, and we use
      that library in our management http server and in our image creation
      tool (cpiod.so). By ignoring EPOLLET, as we did until now, the code worked,
      but it wasted CPU unnecessarily: epoll_wait() always returned immediately
      instead of waiting for a new event.
      
      This patch works within the confines of our existing poll mechanisms,
      where epoll() calls poll(). We do not change this in this patch; it
      should be changed in the future (see issue #17).
      
      In this patch we add to each struct file a field "poll_wake_count", which,
      as its name suggests, counts the number of poll_wake()s done on this
      file. Additionally, epoll remembers the last value it saw of this counter,
      so that in poll_scan(), if we see that an fp (polled with EPOLLET) has
      an unchanged counter from last time, we do not return readiness on this fp,
      regardless of whether or not it has readable data.
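
      A rough sketch of that check, for illustration only (apart from
      poll_wake_count and poll_scan(), the names below are hypothetical and
      not the actual OSv code):

        #include <atomic>

        struct file {
            // ... existing members ...
            std::atomic<unsigned> poll_wake_count{0};  // bumped by poll_wake()
        };

        // Simplified, hypothetical helper for the poll_scan() loop, for an fp
        // registered with EPOLLET; last_seen is the counter value that epoll
        // remembered from the previous scan of this fp.
        bool epollet_should_report(file* fp, bool has_events, unsigned& last_seen)
        {
            unsigned now = fp->poll_wake_count.load(std::memory_order_relaxed);
            if (now == last_seen) {
                // No poll_wake() since we last reported this fp: suppress the
                // event even if readable data is still pending.
                return false;
            }
            last_seen = now;
            return has_events;
        }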
      
      We have a complication with EPOLLET on sockets. These have an "SB_SEL"
      optimization, which avoids calling poll_wake() when it thinks the new
      data is not interesting because the old data was not yet consumed, and
      also avoids calling poll_wake() if fp->poll() was not previously done.
      This optimization is counter-productive for EPOLLET (and causes missed
      wakeups) so we need to work around it in the EPOLLET case.
      
      This patch also adds a test for the EPOLLET case in tst-epoll.cc. The test
      runs on both OSv and Linux, and confirms that in the tested scenarios
      Linux and OSv behave the same, including even one identical false positive:
      when epoll_wait() tells us there is data in a pipe, and we don't read it,
      but then more data arrives on the pipe, epoll_wait() will again return a new
      event, despite this not really being an edge event (the pipe didn't
      change from empty to not-empty, as it was previously not-empty as well).
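
      For illustration only, a minimal standalone program in the spirit of that
      scenario (this is not the actual tst-epoll.cc code) would be:

        #include <sys/epoll.h>
        #include <unistd.h>
        #include <cassert>

        int main()
        {
            int p[2];
            assert(pipe(p) == 0);
            int ep = epoll_create1(0);
            epoll_event ev{};
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = p[0];
            assert(epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev) == 0);

            assert(write(p[1], "x", 1) == 1);
            epoll_event out{};
            assert(epoll_wait(ep, &out, 1, 1000) == 1);  // first edge, as expected

            // Do not read the pipe; just write more data. Both Linux and OSv
            // report a second event even though the pipe never became empty.
            assert(write(p[1], "y", 1) == 1);
            assert(epoll_wait(ep, &out, 1, 1000) == 1);
            return 0;
        }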
      
      Concluding remarks:
      
      The primary goal of this implementation is to stop an EPOLLET epoll_wait()
      from returning immediately when nothing has happened on the file.
      That was what caused the 100% CPU use before this patch. That being said,
      the goal of this patch is NOT to avoid all false positives or unnecessary
      wakeups; when events do occur on the file, we may do a few more
      wakeups than strictly necessary. I think this is acceptable (our epoll()
      has worse problems), but for posterity, I want to explain:
      
      I already mentioned above one false-positive that also happens on Linux.
      Another false-positive wakeup that remains is in one of EPOLLET's classic
      use cases: Consider several threads sleeping on epoll() on the same socket
      (e.g., TCP listening socket, or UDP socket). When one packet arrives, normal
      level-triggered epoll() will wake all the threads, but only one will read
      the packet and the rest will find they have nothing to read. With edge-
      triggered epoll, only one thread should be woken and the rest should keep
      sleeping.
      But in our implementation, poll_wake() wakes up *all* the pollers on this
      file, so we cannot currently support this optimization.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • msix: thread affinity · b4e8d47d
      Vlad Zolotarov authored
      
      Instead of binding all msix interrupts to cpu 0, have them chase the
      interrupt service routine thread and pin themselves to the same cpu.
      
      This patch is based on a patch from Avi Kivity <avi@cloudius-systems.com>
      and uses some ideas from Nadav Har'El <nyh@cloudius-systems.com>.
      
      It improves the performance of the single thread Rx netperf test by 16%:
      before - 25694 Mbps
      after  - 29875 Mbps
      
      New in V2:
       - Dropped the functor class - use lambda instead.
       - Fixed the race in a waking flow.
       - Added some comments.
       - Added the performance numbers to the patch description.
      
      Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  2. Feb 07, 2014
    • boot: take early timings · d38883fa
      Glauber Costa authored
      
      In the past, we have struggled with long delays while reading data from disk in
      real mode, leading to long boot times (not that they are totally gone). For that
      reason, it is useful to know how much time is being spent in that process. As
      unstable and broken as the TSC is, it is pretty much our only ally for that.
      
      What I am proposing in this patch is that we take timings at key stages of
      the bootloader and pass them to the main loader. We do that by adding some
      space at the end of the multiboot_info structure, so that we can pass some
      extra fields in it. Right now we are using 16 bytes, so we can pass 2 64-bit
      tsc reads.
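
      To illustrate the layout (the actual field names in the patch may differ),
      those 16 bytes amount to something like:

        #include <cstdint>

        // Hypothetical sketch: two 64-bit TSC reads appended after the standard
        // multiboot_info fields.
        struct boot_timings {
            uint64_t tsc[2];  // raw TSC reads taken at key bootloader stages
        };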
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • general infrastructure for boot time calculation · 3ab3a6bb
      Glauber Costa authored
      
      I am proposing a mechanism here that will give us a better idea of how
      much time we spend booting, and how much each of the pieces contributes
      to it. For that, we need to be able to get time stamps really early, in
      places where tracepoints may not be available, and a clock most definitely
      won't be.
      
      With my proposal, one should be able to register events. After the system
      boots, we will calculate the total time since the first event, as well as the
      delta since the previous event. If the first event is early enough, that should
      produce a very good picture of our boot time.
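
      A minimal sketch of what such event registration could look like (names
      such as boot_time_record are hypothetical; OSv's actual interface may
      differ):

        #include <cstdint>
        #include <cinttypes>
        #include <cstdio>

        struct boot_event {
            const char* name;
            uint64_t    tsc;   // raw TSC; converted to nanoseconds only after boot
        };

        static boot_event boot_events[32];
        static int        boot_event_count;

        // Safe to call very early: no clock, no tracepoints, no allocation.
        inline void boot_time_record(const char* name)
        {
            uint32_t lo, hi;
            asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
            boot_events[boot_event_count++] = {name, (uint64_t(hi) << 32) | lo};
        }

        // After boot: total since the first event and delta since the previous
        // one, still in TSC units at this point.
        void boot_time_report()
        {
            for (int i = 0; i < boot_event_count; i++) {
                printf("%-16s total=%" PRIu64 " delta=%" PRIu64 "\n",
                       boot_events[i].name,
                       boot_events[i].tsc - boot_events[0].tsc,
                       i ? boot_events[i].tsc - boot_events[i - 1].tsc : 0);
            }
        }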
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • pvclock: reuse pvclock's functionality to convert tsc to nano · 2df3c029
      Glauber Costa authored
      
      This patch provides a way to take a tsc measurement and obtain a nanosecond
      figure from it. It works only for the xen and kvm pvclocks, and I intend to
      use it for acquiring early boot figures.
      
      It is possible to measure the tsc frequency and with that figure out how to
      convert a tsc read to nanoseconds, but I don't think we should pay that price.
      Most of the pvclock drivers already provide that functionality, and we are not
      planning that many users of that interface anyway.
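
      For reference, the scaling that the pvclock drivers already implement
      converts a raw TSC delta roughly as follows (a sketch of the standard
      KVM/Xen pvclock algorithm, not OSv's exact code):

        #include <cstdint>

        // tsc_shift and tsc_to_system_mul come from the pvclock time-info
        // structure that the hypervisor shares with the guest.
        uint64_t pvclock_tsc_to_ns(uint64_t tsc_delta, int8_t tsc_shift,
                                   uint32_t tsc_to_system_mul)
        {
            if (tsc_shift >= 0) {
                tsc_delta <<= tsc_shift;
            } else {
                tsc_delta >>= -tsc_shift;
            }
            // 32.32 fixed-point multiply: (delta * mul) >> 32 yields nanoseconds.
            return (static_cast<unsigned __int128>(tsc_delta) *
                    tsc_to_system_mul) >> 32;
        }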
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  3. Feb 06, 2014
    • api/aarch64: add alltypes.h.sh script and first headers · 81aa5e81
      Claudio Fontana authored
      
      add alltypes.h.sh, and first headers in bits/
      
      Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • jvm_balloon: handle explicit unmapping case · fc469b4d
      Glauber Costa authored
      
      The JVM may unmap certain areas of the heap completely, which was confirmed by
      code inspection by Gleb. In that case, the current balloon code will break.
      
      This is because we were deleting the vma from finish_move(), and recreating the
      old mapping implicitly in the process. With this new patch, the tear down of
      the jvm balloon mapping is done by a separate function. Unmapping or evacuating
      the region won't trigger it.
      
      It still needs to communicate to the balloon code that this address is out of
      the balloons list. We do that by calling the page fault handler with an empty
      frame. jvm_balloon_fault is patched to interpret an empty frame correctly.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • runtime: Support format specifiers in abort() · a814c5a7
      Pekka Enberg authored
      
      Add format specifier support to abort() to make it easier to produce
      useful error messages.
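
      As a sketch of the idea (the function name here is hypothetical, and
      OSv's actual abort() implementation may differ in details), the format
      support can be layered on vsnprintf like this:

        #include <cstdarg>
        #include <cstdio>
        #include <cstdlib>

        // Illustrative only: printf-style abort that formats its message first.
        [[noreturn]] void abort_with_msg(const char* fmt, ...)
        {
            char msg[1024];
            va_list ap;
            va_start(ap, fmt);
            vsnprintf(msg, sizeof(msg), fmt, ap);
            va_end(ap);
            fputs(msg, stderr);
            std::abort();
        }

        // Usage: abort_with_msg("bad page %p (order %d)\n", addr, order);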
      
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • Elf: Fix also _module_index_list · b0b5462f
      Nadav Har'El authored
      
      Also fix concurrent use of _module_index_list (for the per-module TLS),
      by protecting it with a new mutex. We could probably have done something
      with RCU instead, but just adding a new mutex is a lot easier.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • Elf: Fix shared-object unload concurrent with dynamic linker use · 8213bf13
      Nadav Har'El authored
      
      After the above patches, one race remains in the dynamic linker: If an
      object is *unloaded* while some symbol resolution or object iteration
      (dl_iterate_phdr) is in progress, the function in progress may reach
      this object after it is already unmapped from memory, and crash.
      
      Therefore, we need to delay unmapping of objects while any object
      iteration is going on. We need to allow the object to be deleted from
      the _modules and _files list (so that new calls will not find it) but
      temporarily delay the actual freeing of the object's memory.
      
      The cleanest way to achieve this would have been to increment each
      module's reference count in the RCU section of modules_get(), so they won't
      get deleted while still in use. However, this would significantly slow down
      users like backtrace() with dozens of atomic operations. So we chose
      a different solution: keep a counter _modules_delete_disable which,
      when non-zero, causes all module deletion to be delayed until the counter
      drops back to zero. with_modules() now only needs to increment this
      single counter, not every separate module.
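
      A conceptual sketch of that counter (names and types here are illustrative,
      not OSv's exact code):

        #include <mutex>
        #include <vector>

        struct elf_object { /* loaded shared object (details omitted) */ };

        static std::mutex _mutex;            // protects the fields below
        static int _modules_delete_disable;  // > 0 while any iteration runs
        static std::vector<elf_object*> _deferred_deletes;

        void modules_delete_disable()
        {
            std::lock_guard<std::mutex> guard(_mutex);
            ++_modules_delete_disable;
        }

        void modules_delete_enable()
        {
            std::lock_guard<std::mutex> guard(_mutex);
            if (--_modules_delete_disable == 0) {
                // No iteration in progress any more: actually free the objects
                // whose unload was requested while the counter was non-zero.
                for (auto obj : _deferred_deletes) {
                    delete obj;
                }
                _deferred_deletes.clear();
            }
        }

        void unload_object(elf_object* obj)
        {
            std::lock_guard<std::mutex> guard(_mutex);
            // (removal from the _modules list happens here, so new lookups no
            //  longer find the object)
            if (_modules_delete_disable > 0) {
                _deferred_deletes.push_back(obj);  // free later
            } else {
                delete obj;                        // free immediately
            }
        }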
      
      Fixes #176.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • Elf: Fix shared-object load concurrent with dynamic linker use · 68afb68e
      Nadav Har'El authored
      
      This patch addresses the bugs of *use* of the dynamic linker - looking
      up symbols or iterating the list of loaded objects - in parallel with new
      libraries being loaded with get_library().
      
      The underlying problem is that we have an unprotected "_modules" vector
      of loaded objects, which we need to iterate to look up symbols, but this
      list of modules can change when a new shared object is loaded.
      
      We decided *not* to solve this problem by using the same mutex that protects
      object load/unload (_mutex). That would make boot slower, as threads using
      new symbols would be blocked just because another thread is concurrently
      loading some unrelated shared object (not a big problem with demand-paged
      file mmaps). Using a mutex can also cause deadlocks in the leak detector,
      because of lock order reversal between malloc's and elf's mutexes: malloc()
      takes its lock first and then backtrace() takes elf's lock, while on the
      other hand elf can take its lock and then call malloc(), taking malloc's lock.
      
      Instead, this patch uses RCU to allow lock-free reading of the modules
      list. As in RCU, writing (adding or removing an object from the list)
      manufactures a new list, deferring the freeing of the old one, allowing
      readers to continue using the old object list.
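
      Conceptually, the read and update paths look like the following
      copy-on-write sketch (OSv's real code uses its own RCU primitives rather
      than std::shared_ptr; the names here are illustrative):

        #include <memory>
        #include <mutex>
        #include <vector>

        struct elf_object;   // loaded shared object (opaque here)

        using module_list = std::vector<elf_object*>;

        static std::shared_ptr<const module_list> _modules =
                std::make_shared<const module_list>();
        static std::mutex _modules_mutex;   // taken by writers only

        // Reader: grab a snapshot without locking; it stays valid even if a
        // writer publishes a new list while we iterate.
        std::shared_ptr<const module_list> modules_get()
        {
            return std::atomic_load(&_modules);
        }

        // Writer: build a new list and publish it atomically; readers holding
        // the old snapshot keep using it until they drop their reference.
        void modules_add(elf_object* obj)
        {
            std::lock_guard<std::mutex> guard(_modules_mutex);
            auto next = std::make_shared<module_list>(*std::atomic_load(&_modules));
            next->push_back(obj);
            std::atomic_store(&_modules,
                              std::shared_ptr<const module_list>(std::move(next)));
        }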
      
      Note that after this patch, concurrent lookups and get_library() will
      work correctly, but concurrent lookups and object *unload* will still
      not be correct, because we need to defer an object's unloading from
      memory while lookups are in progress. This will be solved in a following
      patch.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • Elf: Serialize shared-object load and unload · 566c77f6
      Nadav Har'El authored
      
      Our current dynamic-linker code (elf.cc) is not thread safe, and all sorts
      of disasters can happen if shared objects are loaded, unloaded and/or used
      concurrently. This and the following patches solve this problem in stages:
      
      The first stage, in this patch, is to protect concurrent shared-library
      loads and unloads. (If the dynamic linker is also in use concurrently,
      this will still cause problems; that will be solved in the next patches.)
      
      Library load and unload use a bunch of shared data without protection,
      so concurrency can cause disaster. For example, two concurrent loads can
      pick the same address to map the objects in. We solve this by using a mutex
      to ensure only one shared object is loaded or unloaded at a time.
      
      Instead of this coarse-grained locking, we could have used finer-grained
      locks to allow several library loads to proceed in parallel, protecting
      just the actual shared data. However, the benefits would be very small,
      because with demand-paged file mmaps, "loading" a library just sets up
      the memory map, very quickly, and the object will only actually be read
      from disk later, when its pages get used.
      
      Fixes #175.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • Add SCOPE_LOCK(mutex) macro · 30ea16ce
      Nadav Har'El authored
      
      Add a macro SCOPE_LOCK(mutex) which locks the given mutex and unlocks
      it when the scope ends (this uses RAII, so the mutex will correctly get
      unlocked even when the scope is exited via return or exception).
      
      This does the same as C++11's std::lock_guard, but is far less verbose:
      to use std::lock_guard with a mutex m, one needs to do something like
      std::lock_guard<mutex> guard(m);
      where the mutex's type needs to be repeated, and a name needs to be
      invented for the guard, which will likely not be used again. This
      macro makes these things unnecessary, and one just writes
      SCOPE_LOCK(m);
      
      Note that WITH_LOCK(m) { ... } should usually be preferred over SCOPE_LOCK.
      However, SCOPE_LOCK can come in handy in some cases, for example adding a
      lock to a function without reindenting it.
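
      A minimal sketch of how such a macro can be built on top of
      std::lock_guard (the actual OSv definition may differ, e.g. in how the
      guard's name is generated):

        #include <mutex>

        #define SCOPE_LOCK_CONCAT2(a, b) a##b
        #define SCOPE_LOCK_CONCAT(a, b)  SCOPE_LOCK_CONCAT2(a, b)
        #define SCOPE_LOCK(m) \
            std::lock_guard<decltype(m)> \
                SCOPE_LOCK_CONCAT(_scope_lock_, __LINE__)(m)

        std::mutex m;
        int counter;

        void bump()
        {
            SCOPE_LOCK(m);   // unlocked automatically when the scope ends
            ++counter;
        }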
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>