  1. Feb 11, 2014
    • epoll: Support epoll()'s EPOLLET · d41d748f
      Nadav Har'El authored
      
      This patch adds support for epoll()'s edge-triggered mode, EPOLLET.
      Fixes #188.
      
      As explained in issue #188, Boost's asio uses EPOLLET heavily, and we use
      that library in our management http server, and also in our image creation
      tool (cpiod.so). By ignoring EPOLLET, like we did until now, the code worked,
      but unnecessarily wasted CPU when epoll_wait() always returned immediately
      instead of waiting until a new event.
      
      This patch works within the confines of our existing poll mechanisms,
      where epoll() calls poll(). We do not change this in this patch, but it
      should be changed in the future (see issue #17).
      
      In this patch we add to each struct file a field "poll_wake_count", which
      as its name suggests counts the number of poll_wake()s done on this
      file. Additionally, epoll remembers the last value it saw of this counter,
      so that in poll_scan(), if we see that an fp (polled with EPOLLET) has
      an unchanged counter from last time, we do not return readiness on this fp
      regardless of whether or not it has readable data.
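
The counter comparison described above can be sketched as follows. This is a minimal illustration, not the patch's actual code: the types `file_sketch` and `epoll_entry_sketch` and the functions `watch()`, `poll_wake()` and `scan_ready()` are hypothetical stand-ins, and the choice to start the remembered counter one behind the file's counter (so that the first scan can report readiness) is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the per-file state; not OSv's real struct file.
struct file_sketch {
    std::uint64_t poll_wake_count = 0;  // number of poll_wake() calls so far
    bool readable = false;              // level state: is there data to read?
};

// Hypothetical per-registration state that epoll keeps for one watched file.
struct epoll_entry_sketch {
    file_sketch* fp;
    bool edge_triggered;            // true when registered with EPOLLET
    std::uint64_t last_seen_count;  // counter value at the last report
};

// Register a file. Starting one behind the current counter (unsigned
// wrap-around is well defined) lets the first scan report readiness.
epoll_entry_sketch watch(file_sketch& f, bool edge_triggered) {
    return epoll_entry_sketch{&f, edge_triggered, f.poll_wake_count - 1};
}

// Something happened on the file: mark it readable and bump the counter.
void poll_wake(file_sketch& f) {
    f.readable = true;
    ++f.poll_wake_count;
}

// Decide whether a scan should report this entry as ready.
bool scan_ready(epoll_entry_sketch& e) {
    assert(e.fp != nullptr);
    if (!e.fp->readable) {
        return false;
    }
    if (e.edge_triggered) {
        // Unchanged counter: no wakeup since the last report, so stay quiet
        // even though data is still readable.
        if (e.last_seen_count == e.fp->poll_wake_count) {
            return false;
        }
        e.last_seen_count = e.fp->poll_wake_count;
    }
    return true;
}
```

In this sketch a level-triggered entry keeps reporting readiness for as long as the file is readable, while an edge-triggered entry reports once per change of the counter.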
      
      We have a complication with EPOLLET on sockets. These have an "SB_SEL"
      optimization, which avoids calling poll_wake() when it thinks the new
      data is not interesting because the old data was not yet consumed, and
      also avoids calling poll_wake() if fp->poll() was not previously done.
      This optimization is counter-productive for EPOLLET (and causes missed
      wakeups) so we need to work around it in the EPOLLET case.
      
      This patch also adds a test for the EPOLLET case in tst-epoll.cc. The test
      runs on both OSv and Linux, and confirms that in the tested scenarios
      Linux and OSv behave the same, including one shared false positive:
      when epoll_wait() tells us there is data in a pipe and we don't read it,
      and then more data arrives on the pipe, epoll_wait() again returns a new
      event, even though this is not really an edge event (the pipe didn't
      change from empty to not-empty, as it was previously not-empty as well).
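
The pipe scenario above can be probed with a short program like the one below. It is a sketch written for this note, not the code in tst-epoll.cc; the names `edge_results` and `probe_pipe_edges()` are hypothetical. It uses only the standard Linux calls pipe(), epoll_create1(), epoll_ctl() and epoll_wait(), with a zero timeout so that no call blocks.

```cpp
#include <cassert>
#include <sys/epoll.h>
#include <unistd.h>

// Return values of three successive epoll_wait() calls on a pipe's read end.
struct edge_results {
    int after_first_write;   // after the first byte is written
    int with_no_new_data;    // nothing written or read since the last call
    int after_second_write;  // after a second byte is written, first unread
};

edge_results probe_pipe_edges() {
    int fds[2];
    int rc = pipe(fds);
    assert(rc == 0);
    int ep = epoll_create1(0);
    assert(ep >= 0);

    // Watch the read end of the pipe in edge-triggered mode.
    epoll_event ev{};
    ev.events = EPOLLIN | EPOLLET;
    ev.data.fd = fds[0];
    rc = epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev);
    assert(rc == 0);

    epoll_event out{};
    edge_results r{};

    ssize_t n = write(fds[1], "x", 1);
    assert(n == 1);
    r.after_first_write = epoll_wait(ep, &out, 1, 0);  // timeout 0: no blocking

    // No new data and nothing consumed: edge-triggered mode should stay quiet.
    r.with_no_new_data = epoll_wait(ep, &out, 1, 0);

    // More data arrives while the first byte is still unread.
    n = write(fds[1], "y", 1);
    assert(n == 1);
    r.after_second_write = epoll_wait(ep, &out, 1, 0);

    close(ep);
    close(fds[0]);
    close(fds[1]);
    return r;
}
```

According to the description above, the third call is the false positive: it reports an event even though the pipe never became empty. That result may depend on the kernel version, so the sketch records it rather than asserting it.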
      
      Concluding remarks:
      
      The primary goal of this implementation is to stop EPOLLET epoll_wait()
      from returning immediately despite nothing having happened on the file.
      That was what caused the 100% CPU use before this patch. That being said,
      the goal of this patch is NOT to avoid all false positives or unnecessary
      wakeups; when events do occur on the file, we may be doing a few more
      wakeups than strictly necessary. I think this is acceptable (our epoll()
      has worse problems) but for posterity, I want to explain:
      
      I already mentioned above one false-positive that also happens on Linux.
      Another false-positive wakeup that remains is in one of EPOLLET's classic
      use cases: Consider several threads sleeping on epoll() on the same socket
      (e.g., TCP listening socket, or UDP socket). When one packet arrives, normal
      level-triggered epoll() will wake all the threads, but only one will read
      the packet and the rest will find they have nothing to read. With edge-
      triggered epoll, only one thread should be woken and the rest would not.
      But in our implementation, poll_wake() wakes up *all* the pollers on this
      file, so we cannot currently support this optimization.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
  2. Feb 10, 2014
  3. Feb 09, 2014
  4. Feb 07, 2014
  5. Feb 06, 2014
    • tests: Add tool to analyze ARC and shrink functionality · 4a68aee3
      Raphael S. Carvalho authored
      
      The main purpose of this tool is to understand/analyze the ARC behavior/
      performance on specific workloads.
      
      $ scripts/run.py -e 'tests/misc-zfs-arc.so --help'
      OSv v0.05-155-g1f04e49
      Allowed options:
        --help                produce help message
        --set-max-target      set ARC max target to 80% of the system memory.
        --check-arc-shrink    check ARC shrink functionality
        --test arg            analyze ARC performance on a given testcase, e.g.
                              --test tst-001.so
      
      * --set-max-target: Used to check performance when the ARC max target is
      higher than usual. Because more data will be loaded into the ARC, ZFS
      operations that need I/O should perform better. 80% was chosen because the
      low watermark is 20%, which avoids much memory pressure and so gives more
      stability.
      
      * --check-arc-shrink: Check that the ARC's arc_shrink function works.
      
      * --test arg: Check ARC performance on a specified testcase, e.g.:
      $ scripts/run.py -e 'tests/misc-zfs-arc.so --test tst-fs-link.so'
      
      * Default run, i.e., -e 'tests/misc-zfs-arc.so', provides four distinct
      workloads:
      1) Non-linear one where prefetch shouldn't be as effective.
      2) Load all data into the cache, then read it back to check performance
      in that case, which should be almost the speed of main memory.
      3) Linear workload where the amount of data is 1.5% the size of the system
      memory, thus page replacement will be strongly used, and as the operation
      is sequential, prefetch (readahead) must be effective. It leads to a high
      cache hit ratio as blocks were read ahead of time.
      4) Keep allocating memory through populated anonymous mmap mappings to see
      whether shrink takes place to release memory back to the operating system.
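
As an illustration of workload 4, a populated anonymous mapping can be requested on Linux as sketched below. This is not taken from misc-zfs-arc.so: the helper names `map_populated()` and `unmap_region()` are hypothetical, and the use of the Linux-specific MAP_POPULATE flag to obtain a "populated" mapping is an assumption about what the tool does.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Map `bytes` of anonymous memory and ask the kernel to fault the pages in
// immediately (MAP_POPULATE), so the memory is really consumed right away
// rather than on first touch. Returns nullptr on failure.
void* map_populated(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

// Release a mapping obtained from map_populated().
void unmap_region(void* p, std::size_t bytes) {
    int rc = munmap(p, bytes);
    assert(rc == 0);
    (void)rc;  // keep builds with NDEBUG free of unused-variable warnings
}
```

A test in this style would call map_populated() in a loop without unmapping, and watch the ARC statistics to see whether the cache gives memory back as pressure grows.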
      
      Eventual reports and ARC stats are provided to ease the task of understanding
      ARC performance on specific workloads.
      
      Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • tests: testcase to reproduce a variety of IO workloads · 0ee2e417
      Raphael S. Carvalho authored
      
      Mainly created to be used as a tool that reproduces specific workloads,
      allowing us to understand how underlying components are performing,
      e.g. the Adaptive Replacement Cache (ARC) from ZFS.
      
      Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  6. Feb 03, 2014
  7. Jan 28, 2014
  8. Jan 27, 2014
  9. Jan 26, 2014
  10. Jan 23, 2014
  11. Jan 22, 2014
  12. Jan 21, 2014
  13. Jan 20, 2014
  14. Jan 17, 2014
  15. Jan 15, 2014
  16. Jan 14, 2014
    • tst-vfs: don't rely on some random Java file · 305c749f
      Nadav Har'El authored
      
      tst-vfs.cc currently stat()s the file
      	/usr/lib/jvm/jre/lib/amd64/headless/libmawt.so
      and dies if it doesn't exist.
      
      Since Java is now optional in our images, it's not a good idea to check
      for such a file, which might not exist (e.g., "make image=tests check"
      will fail). This patch changes it to check a filename that is certain to
      exist, namely the test itself - /tests/tst-vfs.so.
      
      If we wanted to have a pathname with more components, the test should
      be rewritten to create this pathname, say /a/a/a/a/a/a/a/a/a/a, and then
      test stat on that newly created file. It cannot rely on such a file to
      pre-exist.
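
The rewrite suggested above, where the test creates a deep pathname itself and then calls stat() on it, could look roughly like this. It is a sketch under assumptions, not code from tst-vfs.cc: the helper names are hypothetical, the scratch location under /tmp is chosen only for illustration, and the deep path is built from nested directories rather than ending in a regular file.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <string>
#include <sys/stat.h>
#include <sys/types.h>

// Create a fresh scratch directory; returns an empty string on failure.
std::string make_scratch_dir() {
    char tmpl[] = "/tmp/tst-vfs-sketch-XXXXXX";
    char* dir = mkdtemp(tmpl);
    return dir ? std::string(dir) : std::string();
}

// Create `depth` nested directories named "a" under `base` and return the
// full path, e.g. base + "/a/a/a" for depth 3.
std::string make_deep_path(const std::string& base, int depth) {
    std::string path = base;
    for (int i = 0; i < depth; ++i) {
        path += "/a";
        int rc = mkdir(path.c_str(), 0755);
        assert(rc == 0 || errno == EEXIST);
        (void)rc;  // keep builds with NDEBUG free of unused-variable warnings
    }
    return path;
}

// True if stat() succeeds on `path`.
bool stat_succeeds(const std::string& path) {
    struct stat st;
    return stat(path.c_str(), &st) == 0;
}
```

Because the test builds the path it checks, it no longer depends on what happens to be installed in the image.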
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  17. Jan 13, 2014
  18. Jan 10, 2014
  19. Jan 07, 2014
    • Exile spinlock to a separate file · 8fcad509
      Nadav Har'El authored
      
      In very early OSv history, the spinlock was used in the mutex's
      implementation so it made sense to put it in mutex.cc and mutex.h.
      
      But now that the spinlock is all that's left in mutex.cc (the real mutex
      is in lfmutex.cc), rename this file spinlock.cc. Also, move the spinlock
      definitions from <osv/mutex.h> to a new <osv/spinlock.h>, so if someone
      wants to make the grave mistake of using a spinlock - they will at least
      need to explicitly include this header file.
      
      Currently, the only remaining user of the spinlock is the console.
      Using a spinlock (and not a mutex) in the console allows printing debug
      messages while preemption is disabled. Arguably, this use-case is no
      longer important (we have tracepoints), so in the future we can consider
      dropping the spinlock completely.
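
For readers unfamiliar with the primitive being moved, a spinlock can be sketched in a few lines of standard C++. This is a generic illustration built on std::atomic_flag, not the contents of OSv's spinlock.cc or <osv/spinlock.h>; the class name `spinlock_sketch` is hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Generic test-and-set spinlock; a waiter burns CPU until the lock is free
// instead of sleeping, which is why it is usable where sleeping is not.
class spinlock_sketch {
public:
    void lock() {
        // test_and_set() returns the previous value: true means another
        // owner still holds the lock, so keep spinning.
        while (flag_.test_and_set(std::memory_order_acquire)) {
        }
    }

    // Take the lock only if it is free right now; never spins.
    bool try_lock() {
        return !flag_.test_and_set(std::memory_order_acquire);
    }

    void unlock() {
        flag_.clear(std::memory_order_release);
    }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};
```

The lock()/unlock()/try_lock() interface makes the class usable with std::lock_guard. The busy-wait loop also shows why the message above calls using a spinlock a grave mistake in most code: every waiter consumes a CPU for as long as it waits.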
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>