  1. May 16, 2014
    • sched: high-resolution thread::current()->thread_clock() · c4ebb11a
      Nadav Har'El authored
      
      thread::current()->thread_clock() returns the CPU time consumed by this
      thread. A thread that wishes to measure the amount of CPU time consumed
      by some short section of code will want this clock to have high resolution,
      but in the existing code it was only updated on context switches, so shorter
      durations could not be measured with it.
      
      This patch fixes thread_clock() to also add the time that has passed
      since the current time slice started.
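
      In sketch form, the idea looks roughly like this (std::chrono and the
      field names here are illustrative, not OSv's actual ones):

         #include <chrono>

         using clk = std::chrono::steady_clock;

         // Hypothetical per-thread accounting state:
         struct thread_stats {
             std::chrono::nanoseconds cpu_time{0}; // accumulated on context switches
             clk::time_point slice_start;          // when the current slice began
             bool running = false;                 // is the thread on a CPU now?
         };

         // For the current thread, also count the time elapsed since its
         // time slice started, so short code sections can be measured.
         std::chrono::nanoseconds thread_clock(const thread_stats& t)
         {
             if (t.running) {
                 return t.cpu_time + std::chrono::duration_cast<
                     std::chrono::nanoseconds>(clk::now() - t.slice_start);
             }
             return t.cpu_time; // snapshot from the last context switch
         }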
      
      When running thread_clock() on *another* thread (not thread::current()),
      we still return a cpu time snapshot from the last context switch - even
      if the thread happens to be running now (on another CPU). Fixing that case
      is quite difficult (and will probably require additional memory-ordering
      guarantees), and anyway not very important: Usually we don't need a
      high-resolution estimate of a different thread's cpu time.
      
      Fixes #302.
      
      Reviewed-by: Gleb Natapov <gleb@cloudius-systems.com>
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  2. Apr 24, 2014
    • zfs: fix read() of directory to return EISDIR · 0ad78a14
      Nadav Har'El authored
      
      Posix allows read() on directories in some filesystems. However, Linux
      always returns EISDIR in this case, and since we emulate Linux, so
      should we, on every filesystem. All our filesystems other than ZFS
      (e.g., ramfs) already return EISDIR when reading a directory; this
      patch adds the missing check to ZFS.
      
      This patch is related to issue #94: the first step to fixing #94 is to
      return the right error when reading a directory.
      
      This patch also adds a test case, which can be compiled both on OSv and
      Linux, to verify that they both behave the same. Before the patch, the
      test succeeded on Linux but failed on OSv when the directory was on ZFS.
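
      A minimal sketch of such a test (illustrative, not the actual test code
      added by the patch):

         #include <assert.h>
         #include <errno.h>
         #include <fcntl.h>
         #include <unistd.h>

         int main()
         {
             int fd = open("/tmp", O_RDONLY);   // open a directory
             assert(fd >= 0);
             char buf[16];
             // On Linux - and now on OSv, on every filesystem - this fails:
             assert(read(fd, buf, sizeof(buf)) == -1 && errno == EISDIR);
             close(fd);
             return 0;
         }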
      
      Instead of fixing zfs_read() as this patch does, I could have fixed
      sys_read() in vfs_syscalls.cc, which is the top layer of all read()
      operations, and added there
         if (fp->f_dentry && fp->f_dentry->d_vnode->v_type == VDIR) {
            return EISDIR;
         }
      to cover all the filesystems. I decided not to do that, because all
      filesystems except ZFS already have this check, and because the lower
      layers like zfs_read() already have more natural access to d_vnode.
      
      Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  3. Apr 14, 2014
    • tests: add test for closing TCP connection with pending packets · fc5fc867
      Tomasz Grabiec authored
      
      The test is supposed to trigger the problem from issue #259. I was not
      able to trigger the problem using guest-local communication, hence the
      client is external to the guest.
      
      Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • tests: add rcu list test · 1bbcd274
      Avi Kivity authored
      
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • memory: support for larger-than-page alignment · 8c417e23
      Nadav Har'El authored
      
      Our existing implementation of posix_memalign() and the C11
      aligned_alloc() used our regular malloc(), so it only worked up to an
      alignment of 4096 bytes - and crashed when it failed to achieve a
      higher desired alignment.
      
      Some applications do ask for higher alignment - for example, MongoDB
      allocates a large buffer with 8192-byte alignment, and accordingly
      crashes about half the time, when the desired alignment is not achieved
      (the other half of the time, it is achieved by chance).
      
      This patch makes our support for alignment better organized, and fixes
      the alignment > 4096 case:
      
      The alignment is no longer known only to the outer functions like
      posix_memalign(). Rather, it is passed down to lower-level allocation
      functions like malloc_large(), which allocates whole pages - and that
      function now knows how to pick pages which start at a properly aligned
      boundary.
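
      In sketch form, the page-picking idea looks like this (align_up here is
      illustrative, not the exact OSv code):

         #include <stdint.h>

         // Round addr up to the next multiple of alignment (a power of
         // two), so a whole-page allocation can start on the requested
         // boundary.
         static inline uintptr_t align_up(uintptr_t addr, uintptr_t alignment)
         {
             return (addr + alignment - 1) & ~(alignment - 1);
         }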
      
      This patch does not improve the wastefulness of our malloc_large(), so
      an overhaul of it would still be welcome. Case in point: malloc_large()
      always adds a full page to any allocation larger than half a page, so
      multiple allocations with posix_memalign(8192, 8192), rather than being
      tightly packed, each take 3 pages and are separated by a free page.
      That page is not wasted, but it causes fragmentation of the heap.
      
      Note that after this patch, we still have one other bug in
      posix_memalign(size, align) - for small sizes and large alignments.
      For small sizes, we use a pool allocator with "size" alignment, and
      may not achieve the desired alignment (so causing an assertion failure).
      This bug can also be fixed, but is unrelated to this patch.
      
      This patch also adds a test for posix_memalign(), checking all alignments
      including large alignments which are the topic of this patch.
      The tests for small *sizes*, which as explained above are still buggy,
      are commented out, because they fail.
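
      A sketch of the kind of check such a test performs (illustrative, not
      the actual test from the patch):

         #include <assert.h>
         #include <stdint.h>
         #include <stdlib.h>

         int main()
         {
             void *p = NULL;
             // Alignment larger than a page (4096) - the case this patch fixes:
             assert(posix_memalign(&p, 8192, 8192) == 0);
             assert(((uintptr_t)p % 8192) == 0);
             free(p);
             return 0;
         }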
      
      Fixes #266.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@avi.cloudius>
  4. Apr 09, 2014
    • tests: enhance mmap test · ed2a70f1
      Glauber Costa authored
      
      We recently discovered a bug that made us fail to unmap a valid region.
      That bug is fixed now, and this patch adds the failing condition to the
      test suite.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • vfs: fix partial non-blocking write · ef169330
      Nadav Har'El authored
      
      Our read() and write(), and their variants (pread, pwrite, readv,
      writev, preadv, pwritev), all shared the same bug in handling a partial
      read or write: they returned EWOULDBLOCK (EAGAIN) instead of returning
      successfully with the number of bytes actually written or read, as they
      should have.
      
      In the internals of the BSD read and write operations (e.g., sosend_generic)
      each operation returns *both* an error number and a number of bytes left.
      But at the end, the system call is expected to return just one of them -
      either an error *or* a number of bytes. The existing read()/write() code,
      when it saw the internals returning an error code, always returned it and
      ignored the number of bytes. This was wrong: When the error is EWOULDBLOCK
      and the number of bytes is non-zero, we should return this number of bytes
      (i.e., a successful partial write), *not* the EWOULDBLOCK error.
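
      A minimal sketch of the corrected return-value logic (illustrative
      names, not the exact OSv code):

         #include <errno.h>
         #include <sys/types.h>

         // The internals report both an error number and a count of bytes
         // moved; the syscall must collapse them into one return value.
         ssize_t finish_io(int error, size_t bytes_done)
         {
             if (error == EWOULDBLOCK && bytes_done > 0) {
                 return bytes_done;  // partial success, not an error
             }
             if (error) {
                 errno = error;
                 return -1;
             }
             return bytes_done;
         }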
      
      This bug went unnoticed almost since the dawn of OSv, because partial reads
      and writes are not common. For example, a write() to a blocking socket will
      always return after the entire write is successful, and will not partially
      succeed. Only when we write to an O_NONBLOCK socket is a partial write
      possible - but even then, we would need a pretty large write() to see
      it only partially succeed.
      
      But this bug is very noticeable when running the Jetty Web server
      (see issue #257): at some point it's as if the response was restarted
      (complete with a second copy of the headers). In Jetty's demo this was
      seen as half-shown images, as well as corrupt output when fetching
      large text files like /test/da.txt.
      
      Turns out that Jetty sends static responses in a surprisingly efficient
      (for Java code...) way, using a single system call for the entire response:
      It mmap()s the file it wishes to send, and then uses one writev() call to
      send two arrays: The HTTP headers (built in malloc()ed memory), and the
      file itself (from mmapped memory). So Jetty tries to write even a 1MB file
      in one huge writev() call. But there's an added twist: It does so with the
      socket configured to O_NONBLOCK. So for large writes, the write will only
      partially succeed (empirically, only about 50KB will succeed), and Jetty
      will notice the partial write and continue writing the rest - until the
      whole file is sent. With the bug we had, part of the response would
      have been written, but Jetty thought the write wrote nothing, so it
      would start writing again from the beginning - causing the weird sort
      of response corruption we've been seeing.
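
      In sketch form, the pattern described above looks roughly like this
      (illustrative; the parameter names are assumptions):

         #include <sys/types.h>
         #include <sys/uio.h>

         // sock is an O_NONBLOCK TCP socket, headers a malloc()ed buffer
         // with the HTTP headers, file_map the mmap()ed file contents.
         ssize_t send_response(int sock, char *headers, size_t headers_len,
                               char *file_map, size_t file_len)
         {
             struct iovec iov[2] = {
                 { headers,  headers_len },  // HTTP headers
                 { file_map, file_len }      // file contents
             };
             // On a non-blocking socket this may move only part of the
             // data; the return value tells the caller where to resume.
             return writev(sock, iov, 2);
         }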
      
      This patch also includes a test case which confirms this bug, and its fix.
      In this test (tst-tcp-nbwrite), two threads communicate over a TCP socket
      (on the loopback interface), one thread write()s a very large buffer and
      the other receives what it can. We try this twice - once on a blocking
      socket and once on a non-blocking socket. In each case we expect the number
      of bytes written by one thread (return from write()) and the number read
      by the second thread (return from read()) to be the same. With the bug we
      had, in the non-blocking case we saw write() returning -1 (with
      errno=EWOULDBLOCK) but read returned over 50,000 bytes, causing the test
      to fail.
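
      A simplified single-process analogue of the test's core check, using a
      Unix socketpair instead of TCP loopback and two threads (illustrative,
      not the actual tst-tcp-nbwrite code):

         #include <assert.h>
         #include <fcntl.h>
         #include <sys/socket.h>
         #include <unistd.h>

         int main()
         {
             int sv[2];
             assert(socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == 0);
             fcntl(sv[0], F_SETFL, O_NONBLOCK);

             static char big[1 << 20];  // far larger than the socket buffer
             ssize_t written = write(sv[0], big, sizeof(big));
             assert(written > 0);       // partial write, not -1/EWOULDBLOCK

             char buf[4096];
             ssize_t total = 0;
             while (total < written) {  // drain exactly what was reported
                 ssize_t n = read(sv[1], buf, sizeof(buf));
                 assert(n > 0);
                 total += n;
             }
             assert(total == written);  // both sides agree on the count
             return 0;
         }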
      
      Fixes #257.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
  5. Apr 01, 2014
    • core: introduce serial_timer_task · bd179712
      Tomasz Grabiec authored

      This is a wrapper around timer_task which should be used when atomicity
      of callback tasks and timer operations is required. The class accepts an
      external lock and uses it to serialize all operations. It provides
      sufficient abstraction to replace callouts in the network stack.
      
      Unfortunately, it requires some cooperation from the callback code
      (see try_fire()). That's because I couldn't extract the in_pcb lock
      acquisition out of the callback code in the TCP stack: there are other
      locks taken before it, and doing so _could_ result in lock order
      inversion problems and hence deadlocks. If we can prove these to be
      safe, the API could be simplified.
      
      It may also be worthwhile to propagate the lock passed to
      serial_timer_task down to timer_task, to save an extra CAS.
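
      A rough sketch of the shape described above (an assumed interface, not
      the actual OSv class):

         #include <functional>
         #include <mutex>

         class serial_timer_task {
         public:
             serial_timer_task(std::mutex& lock, std::function<void()> cb)
                 : _lock(lock), _callback(std::move(cb)) {}

             // All operations are serialized by the caller-supplied lock.
             void reschedule() { _armed = true; }   // caller holds _lock
             void cancel()     { _armed = false; }  // caller holds _lock

             // Called by the callback code, which already holds _lock:
             // fire only if the task was not cancelled or rescheduled
             // since it was armed.
             bool try_fire()
             {
                 if (!_armed) {
                     return false;
                 }
                 _armed = false;
                 _callback();
                 return true;
             }

         private:
             std::mutex& _lock;
             std::function<void()> _callback;
             bool _armed = false;
         };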
    • core: introduce deferred work framework · 34620ff0
      Tomasz Grabiec authored

      The design behind timer_task:

      timer_task was designed to make cancel() and reschedule() scale well
      with the number of threads and CPUs in the system. These methods may
      be called frequently and from different CPUs. A task scheduled on one
      CPU may be rescheduled later from another CPU. To avoid expensive
      coordination between CPUs, a lockfree per-CPU worker was implemented.
      
      Every CPU has a worker (async_worker) which has task registry and a
      thread to execute them. Most of the worker's state may only be changed
      from the CPU on which it runs.
      
      When a timer_task is rescheduled it registers its percpu part in the
      current CPU's worker. When it is then rescheduled from another CPU, the
      previous registration is marked as invalid and a new percpu part is
      registered. When a percpu task fires it checks whether it is the last
      registration - only then may it fire.
      
      Because timer_task's state is scattered across CPUs, some extra
      housekeeping needs to be done before it can be destroyed. We need to
      make sure that no percpu task will try to access the timer_task object
      after it is destroyed. To ensure that, we walk the list of registrations
      of the given timer_task and atomically flip their state from ACTIVE to
      RELEASED. If that succeeds, the task is now revoked and the worker will
      not try to execute it. If it fails, the task is in the middle of firing
      and we need to wait for it to finish. When a per-CPU task is moved to
      the RELEASED state it is appended to the worker's queue of released
      percpu tasks using a lockfree MPSC queue. These objects may later be
      reused for registrations.
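
      The release step, in sketch form (the names here are illustrative):

         #include <atomic>

         enum class reg_state { ACTIVE, FIRING, RELEASED };

         struct percpu_registration {
             std::atomic<reg_state> state{reg_state::ACTIVE};
         };

         // Returns true if the registration was revoked before it fired;
         // false means it is mid-fire and destruction must wait for it.
         bool try_release(percpu_registration& reg)
         {
             reg_state expected = reg_state::ACTIVE;
             return reg.state.compare_exchange_strong(expected,
                                                      reg_state::RELEASED);
         }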
    • tests: add java reclaim test · a71f8c53
      Glauber Costa authored
      
      This is a test in which two threads compete for resources. One of them
      will (hopefully) trigger memory allocations that are served by the heap,
      while the other stresses the filesystem through reads and/or writes (no
      mappings).
      
      This is designed to test how well the balloon code works together with the ARC
      reclaimer.
      
      There are three main goals I expect OSv to achieve when running this test:
      
      1) When there is no filesystem activity, the balloon should never
         trigger, and the ARC cache should be reduced to its minimum.
      2) When there is no Java activity, we should balloon as much as we can,
         leaving the memory available to the filesystem (this one is trickier
         because the IO code is itself a Java application - on purpose - so we
         eventually have to stop).
      3) When both happen in tandem, the system should stabilize at reasonable
         values and not spend useless cycles switching memory back and forth.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
  6. Mar 25, 2014
    • tst-kill: fix crash · 229020d2
      Nadav Har'El authored
      
      tst-kill runs various signal handlers, which we run in separate threads.
      When the test completes, we may be unlucky enough for the last signal
      handler to still be running; when the module's memory is then unmapped
      (e.g., with test.py -s each test is unmapped when it ends), we can get
      a page fault and a crash.
      
      This patch sleeps for a second at the end of tst-kill, to make sure that
      the signal handler has completed. This sleep is a bit ugly, but I can't
      think of a cleaner way - Posix provides no way to check whether a handler
      is still running, and I wouldn't like to add a new API just for this test.
      
      Fixes #249.
      
      Reviewed-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  7. Mar 21, 2014
    • tests: test for larger-than-memory mmaps. · b0cf5d76
      Glauber Costa authored
      
      It creates a file on disk twice as large as memory, and then maps it
      entirely into memory. The file is then read using 3 different sequential
      patterns, and later 2 threaded patterns.
      
      This test does not handle writes.
      
      It goes in misc because it takes a very long time to run (especially
      with a random pattern).
      
      Example output:
      
      Total Ram 586 Mb
      Write done
      Double Pass OK (13.6323 usec / page)
      Recency OK (3.35954 usec / page)
      Random Access OK (640.926 usec / page)
      Threaded pass 1 address ended OK
      Threaded pass many addresses ended OK
      PASSED
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • signal handling: support SA_RESETHAND · 15dfac35
      Nadav Har'El authored
      
      Add support for the SA_RESETHAND signal handler flag, which means that
      the signal handler is reset to the default one after handling the signal
      once.

      I admit it's not a very useful feature (our default handler powers off
      the system...) but there's no reason not to support it.
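
      A minimal usage sketch of the flag (standard POSIX API):

         #include <signal.h>
         #include <string.h>

         static void handle_once(int sig) { (void)sig; /* runs only once */ }

         int main()
         {
             struct sigaction sa;
             memset(&sa, 0, sizeof(sa));
             sa.sa_handler = handle_once;
             sa.sa_flags = SA_RESETHAND;  // revert to SIG_DFL after one delivery
             sigaction(SIGUSR1, &sa, NULL);
             raise(SIGUSR1);              // handled by handle_once()
             // A second SIGUSR1 would now get the default disposition.
             return 0;
         }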
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  8. Feb 25, 2014
    • tests: more load-balancer tests · 5b805e63
      Nadav Har'El authored
      
      This patch adds two more load-balancing tests to tests/misc-loadbalance.cc:
      
      1. Three threads on two CPUs. If load-balancing is working correctly,
         this should slow down all threads equally, to x1.5, and not end up
         with two threads at x2 and one at x1.

         Our performance on this test is fairly close to what's expected.
      
      2. Three threads on two CPUs, but one thread has priority 0.5, meaning
         it should get twice the CPU time of the other two threads. Fair load
         balancing would therefore keep the priority-0.5 thread on its own CPU,
         and the two normal-priority threads together on the second CPU - so at
         the end the priority-0.5 thread gets twice the CPU time of the other
         threads.
      
         Unfortunately, this test now gets bad results (x0.93, x0.94, x1.14
         instead of x1, x1, x1), because our load balancer currently doesn't
         take thread priorities into account: it thinks the CPU running the
         priority-0.5 thread has load 1, while it should be considered to
         have load 2.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>