  6. Apr 24, 2014
    • zfs: fix read() of directory to return EISDIR · 0ad78a14
      Nadav Har'El authored
      
      Posix allows read() on directories in some filesystems. However, Linux
      always returns EISDIR in this case, and since we're emulating Linux,
      so should we, for every filesystem. All our filesystems except ZFS
      (e.g., ramfs) already return EISDIR when reading a directory, but ZFS
      doesn't, so this patch adds the missing check to ZFS.
      
      This patch is related to issue #94: the first step to fixing #94 is to
      return the right error when reading a directory.
      
      This patch also adds a test case, which can be compiled both on OSv and
      Linux, to verify they both have the same behavior. Before the patch, the
      test succeeded on Linux but failed on OSv when the directory is on ZFS.
      
      Instead of fixing zfs_read() as I do in this patch, I could have also fixed
      sys_read() in vfs_syscalls.cc, which is the top layer of all read()
      operations, and added there
         if (fp->f_dentry && fp->f_dentry->d_vnode->v_type == VDIR) {
            return EISDIR;
         }
      to cover all the filesystems. I decided not to do that, because all
      filesystems except ZFS already have this check, and because the lower
      layers like zfs_read() already have more natural access to d_vnode.
      
      Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  8. Apr 14, 2014
    • tests: add test for closing TCP connection with pending packets · fc5fc867
      Tomasz Grabiec authored
      
      The test is supposed to trigger the problem from issue #259. I was not
      able to trigger the problem using guest-local communication, hence the
      client is external to the guest.
      
      Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • tests: add rcu list test · 1bbcd274
      Avi Kivity authored
      
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • memory: support for larger-than-page alignment · 8c417e23
      Nadav Har'El authored
      
      Our existing implementation of posix_memalign() and the C11 aligned_alloc()
      used our regular malloc(), so it only worked up to an alignment of 4096
      bytes - and crashed when it failed to achieve a higher desired alignment.
      
      Some applications do ask for higher alignment - for example, MongoDB
      allocates a large buffer at 8192-byte alignment, and accordingly crashes
      half of the time, when the desired alignment is not achieved (the other
      half of the time, it is achieved by chance).
      
      This patch makes our support for alignment better organized, and fixes
      the alignment > 4096 case:
      
      The alignment is no longer available only to the outer function like
      posix_memalign(). Rather, it is passed down to lower-level allocation
      functions like malloc_large() which allocates whole pages - and this
      function now knows how to pick pages which start at a properly aligned
      boundary.
      
      This patch does not improve the wastefulness of our malloc_large(), so
      an overhaul of it would still be welcome. Case in point, malloc_large()
      always adds a full page to any allocation larger than half a page.
      Multiple allocations with posix_memalign(8192, 8192), rather than being
      tightly packed, each take 3 pages and are separated by a free page.
      This page is not wasted, but causes fragmentation of the heap.
      
      Note that after this patch, we still have one other bug in
      posix_memalign(size, align) - for small sizes and large alignments.
      For small sizes, we use a pool allocator with "size" alignment, and
      may not achieve the desired alignment (so causing an assertion failure).
      This bug can also be fixed, but is unrelated to this patch.
      
      This patch also adds a test for posix_memalign(), checking all alignments
      including large alignments which are the topic of this patch.
      The tests for small *sizes*, which as explained above are still buggy,
      are commented out, because they fail.
      
      Fixes #266.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@avi.cloudius>
  9. Apr 09, 2014
    • tests: enhance mmap test · ed2a70f1
      Glauber Costa authored
      
      We have recently discovered a bug through which we fail to unmap a valid region.
      This is fixed now, and this patch adds the failing condition to the test suite.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • vfs: fix partial non-blocking write · ef169330
      Nadav Har'El authored
      
      Our read() and write(), and their variants (pread, pwrite, readv, writev,
      preadv, pwritev) all shared the same bug when it comes to a partial read or
      write: they returned EWOULDBLOCK (EAGAIN) instead of returning successfully
      with the number of bytes actually written or read, as they should have.
      
      In the internals of the BSD read and write operations (e.g., sosend_generic)
      each operation returns *both* an error number and a number of bytes left.
      But at the end, the system call is expected to return just one of them -
      either an error *or* a number of bytes. The existing read()/write() code,
      when it saw the internals returning an error code, always returned it and
      ignored the number of bytes. This was wrong: When the error is EWOULDBLOCK
      and the number of bytes is non-zero, we should return this number of bytes
      (i.e., a successful partial write), *not* the EWOULDBLOCK error.
      
      This bug went unnoticed almost since the dawn of OSv, because partial reads
      and writes are not common. For example, a write() to a blocking socket will
      always return after the entire write is successful, and will not partially
      succeed. Only when we write to an O_NONBLOCK socket will it be possible to
      see a partial write - but even then, we would need a pretty large write()
      to see it only partially succeeding.
      
      But this bug is very noticeable when running the Jetty Web server (see
      issue #257): at some point it is as if the response was restarted
      (complete with a second copy of the headers). In Jetty's demo this was
      seen as half-shown images, as well as corrupt output when fetching large
      text files like /test/da.txt.
      
      Turns out that Jetty sends static responses in a surprisingly efficient
      (for Java code...) way, using a single system call for the entire response:
      It mmap()s the file it wishes to send, and then uses one writev() call to
      send two arrays: The HTTP headers (built in malloc()ed memory), and the
      file itself (from mmapped memory). So Jetty tries to write even a 1MB file
      in one huge writev() call. But there's an added twist: It does so with the
      socket configured to O_NONBLOCK. So for large writes, the write will only
      partially succeed (empirically, only about 50KB will succeed), and Jetty
      will notice the partial write and continue writing the rest - until the
      whole file is sent. With the bug we had, part of the response would have
      been written, but Jetty thought the write hadn't written anything, so it
      would start writing again from the beginning - causing the weird response
      corruption we had been seeing.
      
      This patch also includes a test case which confirms this bug, and its fix.
      In this test (tst-tcp-nbwrite), two threads communicate over a TCP socket
      (on the loopback interface), one thread write()s a very large buffer and
      the other receives what it can. We try this two times - once on a blocking
      socket and once on a non-blocking socket. In each case we expect the number
      of bytes written by one thread (return from write()) and the number read
      by the second thread (return from read()) to be the same. With the bug we
      had, in the non-blocking case we saw write() returning -1 (with
      errno=EWOULDBLOCK) but read returned over 50,000 bytes, causing the test
      to fail.
      
      Fixes #257.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
  11. Apr 01, 2014
    • core: introduce serial_timer_task · bd179712
      Tomasz Grabiec authored
      
      This is a wrapper around timer_task which should be used when atomicity
      of callback tasks and timer operations is required. The class accepts an
      external lock to serialize all operations. It provides sufficient
      abstraction to replace callouts in the network stack.
      
      Unfortunately, it requires some cooperation from the callback code
      (see try_fire()). That's because I couldn't extract the in_pcb lock
      acquisition out of the callback code in the TCP stack: there are other
      locks taken before it, and doing so _could_ result in lock order
      inversion problems and hence deadlocks. If we can prove these to be
      safe, then the API could be simplified.
      
      It may also be worthwhile to propagate the lock passed to
      serial_timer_task down to timer_task to save an extra CAS.
    • core: introduce deferred work framework · 34620ff0
      Tomasz Grabiec authored
      The design behind timer_task
      
      timer_task was designed to make cancel() and reschedule() scale well
      with the number of threads and CPUs in the system. These methods may
      be called frequently and from different CPUs. A task scheduled on one
      CPU may be rescheduled later from another CPU. To avoid expensive
      coordination between CPUs, a lockfree per-CPU worker was implemented.
      
      Every CPU has a worker (async_worker) which has task registry and a
      thread to execute them. Most of the worker's state may only be changed
      from the CPU on which it runs.
      
      When timer_task is rescheduled, it registers its percpu part in the
      current CPU's worker. When it is then rescheduled from another CPU, the
      previous registration is marked as invalid and a new percpu part is
      registered. When a percpu task fires, it checks whether it is the last
      registration - only then may it fire.
      
      Because timer_task's state is scattered across CPUs, some extra
      housekeeping needs to be done before it can be destroyed. We need to
      make sure that no percpu task will try to access the timer_task object
      after it is destroyed. To ensure that, we walk the list of
      registrations of the given timer_task and atomically flip their state
      from ACTIVE to RELEASED. If that succeeds, the task is now revoked and
      the worker will not try to execute it. If it fails, the task is in the
      middle of firing and we need to wait for it to finish. When a per-CPU
      task is moved to the RELEASED state it is appended to the worker's
      queue of released percpu tasks using a lockfree mpsc queue. These
      objects may later be reused for registrations.
    • tests: add java reclaim test · a71f8c53
      Glauber Costa authored
      
      This is a test in which two threads compete for resources. One of them
      will (hopefully) trigger memory allocations that are served by the heap,
      while the other will stress the filesystem through reads and/or writes
      (no mappings).
      
      This is designed to test how well the balloon code works together with the ARC
      reclaimer.
      
      There are three main goals I expect OSv to achieve when running this test:
      
      1) When there is no filesystem activity, the balloon should never trigger, and
      the ARC cache should be reduced to its minimum
      2) When there is no java activity, we should balloon as much as we can, leaving
      the memory available to the filesystem (this one is trickier because the IO code
      is itself a java application - on purpose - so we eventually have to stop)
      3) When both are happening in tandem, the system should stabilize in reasonable
      values and not spend useless cycles switching memory back and forth.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
  14. Mar 25, 2014
    • tst-kill: fix crash · 229020d2
      Nadav Har'El authored
      
      tst-kill runs various signal handlers, which we run in separate threads.
      When the test completes, we may be unlucky enough for the last signal
      handler to still be running, at which point, when the module's memory
      is unmapped (e.g., with test.py -s each test is unmapped when it ends),
      we can get a page fault and a crash.
      
      This patch sleeps for a second at the end of tst-kill, to make sure that
      the signal handler has completed. This sleep is a bit ugly, but I can't
      think of a cleaner way - Posix provides no way to check whether a handler
      is still running, and I wouldn't like to add a new API just for this test.
      
      Fixes #249.
      
      Reviewed-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  15. Mar 21, 2014
    • tests: test for larger-than-memory mmaps. · b0cf5d76
      Glauber Costa authored
      
      The test creates a file on disk twice as large as memory, and then maps
      it entirely into memory. The file is then read using 3 different
      sequential patterns, and later 2 threaded patterns.
      
      This test does not handle writes.
      
      It goes in misc because it takes a very long time to run (especially
      with the random pattern).
      
      Example output:
      
      Total Ram 586 Mb
      Write done
      Double Pass OK (13.6323 usec / page)
      Recency OK (3.35954 usec / page)
      Random Access OK (640.926 usec / page)
      Threaded pass 1 address ended OK
      Threaded pass many addresses ended OK
      PASSED
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • signal handling: support SA_RESETHAND · 15dfac35
      Nadav Har'El authored
      
      Add support for SA_RESETHAND signal handler flag, which means that the signal
      handler is reset to the default one after handling the signal once.
      
      I admit it's not a very useful feature (our default handler is powering off
      the system...) but no reason not to support it.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  19. Feb 25, 2014
    • tests: more load-balancer tests · 5b805e63
      Nadav Har'El authored
      
      This patch adds two more load-balancing tests to tests/misc-loadbalance.cc:
      
      1. Three threads on two cpus. If load-balancing is working correctly, this
         should slow down all threads x1.5 equally, and not get two x2 threads
         and one x1.
      
         Our performance on this test is fairly close to the expected slowdown.
      
      2. Three threads on two cpus, but one thread has priority 0.5, meaning it
         should get twice the CPU time of the two other threads, so fair load
         balancing is to keep the priority-0.5 thread on its own CPU, and the
         two normal-priority threads together on the second CPU - so at the end
         the priority-0.5 thread will get twice the CPU time of the other threads.
      
         Unfortunately, this test now gets bad results (x0.93, x0.94, x1.14
         instead of x1, x1, x1), because our load balancer currently doesn't
         take thread priorities into account: it thinks the CPU running the
         priority-0.5 thread has load 1, while it should be considered to
         have load 2.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  21. Feb 12, 2014
    • tests: Use a rare value on the symbol content comparisons · 288a3c84
      Raphael S. Carvalho authored
      
      Zero is a regularly used value, so let's instead use a rare one to
      compare the content of the symbols against. This also addresses some
      stylistic issues.
      
      The new output follows:
      $ sudo scripts/run.py -e 'tests/tst-resolve.so'
      OSv v0.05-348-g8b39f8c
      Target value:   0x05050505
      Success:        nonexistant = 0x05050505
      Success:        debug = 0x05050505
      Success:        condvar_wait = 0x05050505
      The time:       1392076964
      success.
      
      Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • Fix concurrent static initialization · 8ef7f248
      Nadav Har'El authored
      
      When multiple threads concurrently use a function which has a static
      variable with a constructor, gcc guarantees that only one thread will run
      the constructor, while the others wait. It uses __cxa_guard_acquire()/
      release()/abort() to implement this guarantee. Unfortunately these
      functions, implemented in the C++ standard library, use the Linux futex
      syscall, which we don't implement. The futex system call is only used in
      the rare case of contention - in the normal case, the __cxa_guard_*
      functions use just atomic memory operations.
      
      This patch implements the bare minimum we need from futex to make the
      __cxa_guard_* functions work: All we need is FUTEX_WAIT and FUTEX_WAKE,
      with 0 timeout in FUTEX_WAIT and wake all in FUTEX_WAKE.
      
      It's nice how the new waitqueues fit this implementation like a glove:
      we could use condvars too, but we anyway need to do the wake() with the
      mutex locked (to get the futex's atomic test-and-wait), and in that case
      we can use waitqueues without their additional internal mutex.
      
      This patch also adds a test for this bug. Catching this bug in a real
      application is very rare, but the test exposes it by defining a function-
      static object with a very slow constructor (it sleeps for a second), and
      calling the function from several threads concurrently.
      Before this patch the test fails with:
         constructing
         syscall(): unimplemented system call 202. Aborting.
      
      Fixes #199.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  22. Feb 11, 2014
    • epoll: Support epoll()'s EPOLLET · d41d748f
      Nadav Har'El authored
      
      This patch adds support for epoll()'s edge-triggered mode, EPOLLET.
      Fixes #188.
      
      As explained in issue #188, Boost's asio uses EPOLLET heavily, and we use
      that library in our management http server, and also in our image creation
      tool (cpiod.so). By ignoring EPOLLET, as we did until now, the code
      worked, but needlessly wasted CPU because epoll_wait() always returned
      immediately instead of waiting for a new event.
      
      This patch works within the confines of our existing poll mechanisms -
      where epoll() calls poll(). We do not change this in this patch; it
      should be changed in the future (see issue #17).
      
      In this patch we add to each struct file a field "poll_wake_count", which
      as its name suggests counts the number of poll_wake()s done on this
      file. Additionally, epoll remembers the last value it saw of this counter,
      so that in poll_scan(), if we see that an fp (polled with EPOLLET) has
      an unchanged counter from last time, we do not return readiness on this fp
      regardless of whether or not it has readable data.
      
      We have a complication with EPOLLET on sockets. These have an "SB_SEL"
      optimization, which avoids calling poll_wake() when it thinks the new
      data is not interesting because the old data was not yet consumed, and
      also avoids calling poll_wake() if fp->poll() was not previously done.
      This optimization is counter-productive for EPOLLET (and causes missed
      wakeups) so we need to work around it in the EPOLLET case.
      
      This patch also adds a test for the EPOLLET case in tst-epoll.cc. The test
      runs on both OSv and Linux, and can confirm that in the tested scenarios,
      Linux and OSv behave the same, including even one shared false-positive:
      when epoll_wait() tells us there is data in a pipe, and we don't read it,
      but then more data comes on the pipe, epoll_wait() will again return a
      new event, even though this is not really an edge event (the pipe didn't
      change from empty to not-empty, as it was previously not-empty as well).
      
      Concluding remarks:
      
      The primary goal of this implementation is to stop an EPOLLET epoll_wait()
      from returning immediately when nothing has happened on the file.
      That was what caused the 100% CPU use before this patch. That being said,
      the goal of this patch is NOT to avoid all false-positives or unnecessary
      wakeups; When events do occur on the file, we may be doing a bit more
      wakeups than strictly necessary. I think this is acceptable (our epoll()
      has worse problems) but for posterity, I want to explain:
      
      I already mentioned above one false-positive that also happens on Linux.
      Another false-positive wakeup that remains is in one of EPOLLET's classic
      use cases: Consider several threads sleeping on epoll() on the same socket
      (e.g., TCP listening socket, or UDP socket). When one packet arrives, normal
      level-triggered epoll() will wake all the threads, but only one will read
      the packet and the rest will find they have nothing to read. With edge-
      triggered epoll, only one thread should be woken and the rest would not.
      But in our implementation, poll_wake() wakes up *all* the pollers on this
      file, so we cannot currently support this optimization.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>