  1. Oct 10, 2013
    • build: define _KERNEL everywhere · 95ce17e3
      Avi Kivity authored
      We have _KERNEL defines scattered throughout the code, which makes
      understanding it difficult.
      
      Define it just once, and adjust the source to build.
      
      We define it in an overridable variable, so that non-kernel imported code
      can undo it.
      95ce17e3
  2. Oct 07, 2013
  3. Oct 03, 2013
  4. Sep 29, 2013
  5. Sep 28, 2013
  6. Sep 25, 2013
    • Dynamic linker: run finalizers when unloading shared object · bf0688f4
      Nadav Har'El authored
      
      ELF allows specifying initializers - functions to be run after loading a
      shared object, in DT_INIT_ARRAY - and also finalizers - functions to be
      run before unloading a shared object, in DT_FINI_ARRAY. The existing code
      ran the initializers but forgot to run the finalizers; this patch fixes
      that oversight.
      
      This fix is necessary for destructors of static objects defined in the
      shared object. But this fix is not sufficient for C++ destructors - see
      also the next patch.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      bf0688f4
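      A minimal sketch of the idea described above, assuming an object class that
      records the DT_FINI_ARRAY base and entry count (the member names here are
      illustrative, not the ones in OSv's elf.cc):

        void object::run_fini_funcs()
        {
            using fini_func = void (*)();
            auto funcs = reinterpret_cast<fini_func*>(_fini_array);  // DT_FINI_ARRAY base
            // Finalizers run in reverse order of the corresponding initializers.
            for (long i = _fini_array_count - 1; i >= 0; --i) {
                funcs[i]();
            }
        }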
  7. Sep 24, 2013
    • Fix missing poll() wakeup on POLLHUP · 554e80f6
      Nadav Har'El authored
      
      Our poll_wake() code ignored calls with the POLLHUP event, because
      the user did not explicitly ask for this event. As a result, a poll()
      waiting to read from a pipe does not wake up when the pipe's write side
      is closed.
      
      This patch adds a test for this case in tst-pipe.cc, and fixes the
      bug by also adding ~POLL_REQUESTABLE to the poll structure's _events,
      i.e., any bits that do not have to be explicitly requested by the
      user (POLL_REQUESTABLE is a new macro defined in this patch).
      
      After this patch, poll() wakes as needed in the test (instead of just
      hanging), but returns the wrong event because of another bug, which will
      be fixed in a separate patch.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      554e80f6
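      A sketch of the fix, under the assumption of a POLL_REQUESTABLE macro like
      the one described above (the exact bit list and field names are
      illustrative):

        // Events the user must request explicitly; everything else
        // (POLLHUP, POLLERR, POLLNVAL, ...) is always reported.
        #define POLL_REQUESTABLE (POLLIN | POLLOUT | POLLPRI | POLLRDNORM | \
                                  POLLRDBAND | POLLWRNORM | POLLWRBAND)

        // When registering the request with each file, listen for the requested
        // events plus every non-requestable bit, so poll_wake(file, POLLHUP)
        // is no longer filtered out:
        entry._events = requested_events | ~POLL_REQUESTABLE;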
    • elf: Fix assert() in object::relocate_pltgot · e4c4696f
      Pekka Enberg authored
      
      The assertion in object::relocate_pltgot uses assignment instead of
      comparison.  Fix that up.
      
      Spotted by Coverity.
      
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
      e4c4696f
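      The general shape of the bug, for illustration (this is not the exact
      expression in elf.cc):

        assert(type = R_X86_64_JUMP_SLOT);    // '=' assigns a non-zero constant,
                                              // so the assert can never fire
        assert(type == R_X86_64_JUMP_SLOT);   // '==' is the intended comparison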
  8. Sep 23, 2013
    • Our select() function is emulated using poll(), which is a sensible thing · b53d39ac
      Nadav Har'El authored
      to do. However, it did several things wrong that this patch fixes. Thanks
      to Paolo Bonzini for finding these problems (see issue #35).
      
      1. When poll() returned a bare POLLHUP, without POLLIN, our select() didn't
      return a read event. But nothing in the manpages guarantees that POLLHUP
      is accompanied by POLLIN, and some special file implementations might
      forget it. As an example, in Linux POLLHUP without POLLIN is common.
      But POLLHUP on its own already means that there's nothing more to read,
      so a read() will return immediately without blocking - and therefore
      select() needs to turn on the readable bit for this fd.
      
      2. Similarly, a bare POLLRDHUP should turn on the writable bit: the
      reader on this file hung up, so a write will fail immediately.
      
      3. Our poll() and select() confused what POLLERR means. POLLERR does not
      mean poll() found a bad file descriptor - there is POLLNVAL for that.
      So this patch fixes poll() to set POLLNVAL, not POLLERR, and select()
      to return with errno=EBADF when it sees POLLNVAL, not POLLERR.
      
      4. Rather, POLLERR means the file descriptor is in an error state, so every
      read() or write() will return immediately (with an error). So when we see
      it, we need to turn on both the read and write bits.
      
      5. The meaning of "exceptfds" isn't clear in any manual page, and it
      seems there are a lot of opinions on what it might mean. In this patch I
      did what Paolo suggested, which is to set the except bit when POLLPRI is
      set. (I don't set exceptfds on POLLERR or in any other case.)
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      b53d39ac
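      A sketch of the revents-to-fd_set mapping that points 1-5 describe
      (illustrative variable names, not the exact code of our select() wrapper):

        if (revents & POLLNVAL) {                        // point 3: bad descriptor
            errno = EBADF;
            return -1;
        }
        if (revents & (POLLIN | POLLHUP | POLLERR)) {    // points 1 and 4
            FD_SET(fd, readfds);                         // read() would not block
        }
        if (revents & (POLLOUT | POLLRDHUP | POLLERR)) { // points 2 and 4
            FD_SET(fd, writefds);                        // write() would fail immediately
        }
        if (revents & POLLPRI) {                         // point 5
            FD_SET(fd, exceptfds);
        }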
  9. Sep 21, 2013
  10. Sep 20, 2013
  11. Sep 15, 2013
    • Add copyright statement to core/* · 4c0b39f3
      Nadav Har'El authored
      Added Cloudius copyright statement to core/*.
      
      poll.cc already had a BSD copyright statement. I believe this is a mistake
      (I think Guy wrote this code from scratch), but, not wanting to rush to a
      conclusion, I'm leaving both copyright statements; we should address this
      issue later.
      4c0b39f3
    • poll: Improve tracepoints · f8c106ae
      Pekka Enberg authored
      Pass function arguments to the tracepoint and add tracepoints for the
      poll() return value and errno.
      f8c106ae
  12. Sep 12, 2013
  13. Sep 11, 2013
    • Add reboot function · 542c319b
      Nadav Har'El authored
      Added a new function, osv::reboot() (declared in <osv/power.hh>)
      for rebooting the VM.
      
      Also added a Java interface - com.cloudius.util.Power.reboot().
      
      NOTE: Power.java and/or jni/power.cc also need to be copied into
      the mgmt submodule.
      542c319b
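      A usage sketch based on the declaration mentioned above:

        #include <osv/power.hh>

        void restart_guest()
        {
            osv::reboot();   // reboots the VM, per the new API
        }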
  14. Sep 10, 2013
    • mmu: Fix file-backed vma splitting · d72b550c
      Pekka Enberg authored
      Commit 3510a5ea ("mmu: File-backed VMAs") forgot to fix vma::split() to
      take file-backed mappings into account. Fix the problem by making
      vma::split() a virtual function and implementing it separately for
      file_vma.
      
      Spotted by Avi Kivity.
      d72b550c
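      A sketch of the approach (class layout is illustrative): split() becomes a
      virtual function so that a file-backed mapping can override it.

        class vma {
        public:
            virtual void split(uintptr_t edge);          // anonymous-memory behaviour
        };

        class file_vma : public vma {
        public:
            virtual void split(uintptr_t edge) override; // also splits the file offset
                                                         // for the newly created vma
        };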
    • DHCP: Fix crash · 68f4d147
      Nadav Har'El authored
      Rarely (about once every 20 runs) we had OSv crash during boot, in the
      DHCP code. It turns out that the code first sends out the DHCP requests,
      and then creates a thread to handle the replies. When a reply arrives,
      the code wake()s the thread, but on rare occasions the thread hasn't yet
      been set up (it is still a null pointer), so we crash.
      
      Fix this by reversing the order - first create the reply handling thread,
      and only then send the request.
      68f4d147
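      A sketch of the reordering, assuming a sched::thread-style API roughly like
      OSv's (the names are illustrative):

        _worker = new sched::thread([this] { handle_replies(); });
        _worker->start();        // the reply handler exists before any packet is
        send_dhcp_requests();    // sent, so wake() can never hit a null pointer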
  15. Sep 08, 2013
    • Scheduler: Fix load-balancer bug · e9f0cf29
      Nadav Har'El authored
      The load_balance() code checks whether another CPU has fewer threads in its
      run queue than this CPU, and if so, migrates one of this CPU's threads
      to the other CPU.

      However, when we count this core's runnable threads, we overcount by 1,
      because as soon as load_balance() goes back to sleep, one of the
      runnable threads will start running. So if this core has just one more
      runnable thread than some remote core, the two are actually even, and
      in that case we should *not* migrate a thread.
      
      Overcounting the number of threads on the core running load_balance
      caused bad performance in 2-core and 2-thread SpecJVM: Normally, the
      size of the run queue on each core is 1 (each core is running one of
      the two threads, and on the run queue we have the idle thread). But
      when load_balance runs it sees 2 runnable threads (the idle thread and
      the preempted benchmark thread), and the second core has just 1, so
      it decides to migrate one of its threads to the second CPU. When this
      is over, the second CPU has both benchmark threads, and the first CPU
      has nothing, and this will only be fixed some time later when the
      second CPU's load_balance thread runs, and later the balance will be
      ruined again. All the time the two threads spend on the same CPU
      significantly hurts performance, and on the host's "top" we see qemu
      taking just 120%-150% instead of 200% as it should (and as it does
      after this patch).
      e9f0cf29
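      A sketch of the corrected comparison (the helper names are hypothetical):

        int local_runnable  = local_runqueue_size();       // includes the thread that
        int remote_runnable = smallest_remote_runqueue();  // resumes when we sleep
        if (local_runnable - 1 <= remote_runnable) {
            return;                 // effectively balanced - do not migrate
        }
        migrate_one_thread_to_least_loaded_cpu();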
    • Scheduler: Avoid vruntime jump when clock jumps · 253e4536
      Nadav Har'El authored
      Currently, clock::get()->time() jumps (by system_time(), i.e., the host's
      uptime) at some point during the initialization. This can be a huge jump
      (e.g., a week if the host's uptime is a week). Fixing this jump is hard,
      so we'd rather just tolerate it.
      
      reschedule_from_interrupt() handles this clock jump badly: the current_run
      it calculates - the amount of time the current thread has run - ends up
      including the jump that happened while the thread was running. In the
      above example, a run time of
      a whole week is wrongly attributed to some thread, and added to its vruntime,
      causing it not to be scheduled again until all other threads yield the
      CPU.
      
      The fix in this patch is to limit the vruntime increase after a long
      run to max_slice (10ms). Even if a thread runs for longer (or just thinks
      it ran for longer), it won't be "penalized" in its dynamic priority more
      than a thread that ran for 10ms. Note that this cap makes sense, as
      cpu::enqueue already enforces a similar limit on the vruntime "bonus"
      of a woken thread, and this patch works toward a similar goal (avoid
      giving one thread a huge bonus because another thread was given a huge
      penalty).
      
      This bug is very visible in the CPU-bound SPECjvm2008 benchmarks, when
      running two benchmark threads on two virtual cpus. As it happens, the
      load_balancer() is the one that gets the huge vruntime increase, so
      it doesn't get to run until no other thread wants to run. Because we start
      with both CPU-bound threads on the same CPU, and these hardly yield the
      CPU (and even more rarely are the two threads sleeping at the same time),
      the load balancer thread on this CPU doesn't get to run, and the two threads
      remain on the same CPU, giving us halved performance (2-cpu performance
      identical to 1-cpu performance) and on the host we see qemu using 100% cpu,
      instead of 200% as expected with two vcpus.
      253e4536
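      A sketch of the cap (variable names are illustrative):

        auto current_run = now - _running_since;
        if (current_run > max_slice) {    // max_slice is 10ms, per the message above
            // A clock jump (or an unusually long run) must not become a huge
            // vruntime penalty.
            current_run = max_slice;
        }
        _vruntime += current_run;         // possibly scaled by priority in the real code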
  16. Sep 03, 2013
    • irq_lock: avoid 'irq_lock defined but not used' warning · 90390cca
      Avi Kivity authored
      In an attempt to be clever, we define irq_lock as an object in an anonymous
      namespace, so that each translation unit gets its own copy, which is then
      optimized away, since the object is never touched.  But the compiler complains
      that the object is defined but not used if we include the file but don't
      use irq_lock.
      
      Simplify by only declaring the object there, and defining it somewhere else.
      90390cca
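      A sketch of the before/after (the type name is illustrative):

        // Before: every translation unit that included the header got its own,
        // never-used copy, which triggers "defined but not used" warnings:
        //     namespace { irq_lock_type irq_lock; }

        // After: declare in the header ...
        extern irq_lock_type irq_lock;

        // ... and define it once, in a single .cc file:
        irq_lock_type irq_lock;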
  17. Sep 02, 2013
    • mmu: msync for file-backed memory maps · 1691c89d
      Pekka Enberg authored
      This adds a simple msync() implementation for file-backed memory maps. It
      uses the newly added 'file_vma' data structure to write out and fsync
      the msync'd region, as suggested by Avi Kivity.
      1691c89d
    • mmu: File-backed VMAs · 3510a5ea
      Pekka Enberg authored
      Add a new 'file_vma' class that extends 'vma'. This is needed to keep
      track of fileref and offset for file-backed VMAs for msync().
      3510a5ea
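      A minimal sketch of what the two entries above describe (all names are
      illustrative): a vma subclass that remembers its backing file and offset,
      so msync() can write the range back and fsync the file.

        class file_vma : public vma {
        public:
            file_vma(addr_range range, unsigned perm, fileref file, f_offset offset)
                : vma(range, perm), _file(file), _offset(offset) {}
            void sync(uintptr_t start, uintptr_t end) {
                write_back(_file, start, end - start, _offset);  // hypothetical helpers:
                fsync_file(_file);                               // write out, then fsync
            }
        private:
            fileref _file;      // backing file
            f_offset _offset;   // offset of the mapping within the file
        };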
  18. Aug 29, 2013
  19. Aug 27, 2013
    • Fix mincore() on non-mmap()ed memory · 6924f7db
      Nadav Har'El authored
      Commit 65afd075 fixed mincore() to recognize
      unmapped addresses. However, it used mmu::ismapped() which just checks for
      mmap()'ed addresses, and doesn't know about malloc()ed memory. This causes
      trouble for libunwind (which we use for backtrace()), which tests mincore()
      on an on-stack variable; for non-pthread threads, this stack might be
      malloc'ed, not mmap'ed.
      
      So this patch adds mmu::isreadable(), which checks that a given memory range
      is all readable (this memory can be mmapped, malloced, stack, whatever).
      mincore() now uses that.
      
      mmu::isreadable() is implemented, following Avi's idea, by trying to read,
      with safe_load(), one byte from every page in the range. This approach is
      faster than walking the page tables, especially for one-byte checks (which
      is all libunwind uses anyway), and also very simple.
      6924f7db
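      A sketch of the approach, assuming a safe_load() that returns false instead
      of faulting when the address is unreadable (names are illustrative):

        bool isreadable(const void* addr, size_t size)
        {
            constexpr uintptr_t page_size = 4096;       // x86-64 small page
            auto p = reinterpret_cast<uintptr_t>(addr);
            auto end = p + size;
            while (p < end) {
                char byte;
                if (!safe_load(reinterpret_cast<const char*>(p), byte)) {
                    return false;                       // this page would fault
                }
                p = (p & ~(page_size - 1)) + page_size; // next page boundary
            }
            return true;
        }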
    • mempool.c: trace large allocations · 0a798e4d
      Glauber Costa authored
      Most of the performance problems I have found on Xen were due to the fact that
      we were hitting malloc_large consistently, for allocations that we should be
      able to service in some other way. Because malloc_large in our implementation
      is such a bottleneck, it was very useful for me to have separate tracepoints
      for it. I am therefore proposing them for inclusion.
      0a798e4d
    • Fix deadlock in leak detector · 227eb39b
      Nadav Har'El authored
      Commit 65afd075 that fixed mincore()
      exposed a deadlock in the leak detector, caused by two threads taking
      two locks in opposite order:
      
      Thread 1:  malloc() does alloc_tracker::remember(). This takes the tracker
         lock and calls backtrace() calling mincore() which takes the
         vma_list_mutex.
      
      Thread 2: mmap() does mmu::allocate() which takes the vma_list_mutex and
         then through mmu::populate::small_page calls memory::alloc_page() which
         calls alloc_tracker::remember() and takes the tracker lock.
      
      This patch fixes this deadlock: alloc_tracker::remember() will now drop its
      lock while running backtrace(), as the lock is only needed to protect the
      allocations[] array. We need to retake the lock after backtrace() completes,
      to copy the backtrace back to the allocations[] array.
      
      Previously, the lock's depth was also (ab)used for avoiding nested
      allocation tracking (e.g., tracking of memory allocation done inside
      backtrace() itself), but now that backtrace() is run without the lock,
      we need a different mechanism - a per-thread "in_tracker" flag, which
      is turned on inside the alloc_tracker::remember()/forget() methods.
      227eb39b
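      A sketch of the remember() flow after the fix (illustrative names; the real
      code lives in core/alloctracker.cc):

        void alloc_tracker::remember(void* addr, size_t size)
        {
            if (in_tracker) {           // thread-local flag: ignore allocations
                return;                 // made by backtrace() itself
            }
            in_tracker = true;
            void* bt[20];
            int depth = backtrace(bt, 20);  // no tracker lock held here, so
                                            // mincore() may take vma_list_mutex
            {
                std::lock_guard<mutex> guard(lock);  // retake the lock only to copy
                record(addr, size, bt, depth);       // the result into allocations[]
            }
            in_tracker = false;
        }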
  20. Aug 26, 2013
    • Avoid including elf.hh from sched.hh · 714d313a
      Nadav Har'El authored
      sched.hh included elf.hh just so it could refer to the elf::tls_data
      type. But now that we have rcu.hh which includes sched.hh and therefore
      elf.hh, if we wish to use rcu in elf.hh (we'll do this in a later patch),
      we have an include loop mess.
      
      So better not include elf.hh from sched.hh, and just declare the one
      struct we need.
      
      After sched.hh no longer includes elf.hh and the dozen includes that
      it further included, we need to add missing includes to some of the
      code that included sched.hh and relied on its implicit includes.
      714d313a
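      A sketch of the change: instead of including elf.hh, sched.hh forward-
      declares the single type it refers to (assuming tls_data is a plain struct):

        namespace elf {
            struct tls_data;    // all sched.hh needs; no need to pull in elf.hh
        }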
    • mmu: don't pass really bad faults to the application · 6f464e76
      Avi Kivity authored
      Trying to execute the null pointer, or faulting within kernel code, is a
      really bad sign, and it's better to abort early in those cases.
      6f464e76
    • alloctracker: Fix forget() if remember() hasn't been called · 0affe14a
      Pekka Enberg authored
      If the leak detector is enabled after OSv startup, the first call it sees
      can be to free(), not malloc(). Fix alloctracker::forget() to deal with that.
      
      Fixes the SIGSEGV when "osv leak on" is used to enable detection from
      gdb after OSv has started up:
      
        #
        # A fatal error has been detected by the Java Runtime Environment:
        #
        #  SIGSEGV (0xb) at pc=0x00000000003b8ee6, pid=0, tid=18446673706168635392
        #
        # JRE version: 7.0_25
        # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops)
        # Problematic frame:
        # C  0x00000000003b8ee6
        #
        # Core dump written. Default location: //core or core.0
        #
        # An error report file with more information is saved as:
        # /tmp/jvm-0/hs_error.log
        #
        # If you would like to submit a bug report, please include
        # instructions on how to reproduce the bug and visit:
        #   http://icedtea.classpath.org/bugzilla
        #
        Aborted
      
        [penberg@localhost osv]$ addr2line -e build/debug/loader.elf
        0x00000000003b8ee6
        /home/penberg/osv/build/debug/../../core/alloctracker.cc:90
      0affe14a
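      A sketch of the guard (illustrative): forget() has to tolerate being called
      before any allocation has ever been remember()ed.

        void alloc_tracker::forget(void* addr)
        {
            if (!addr || !allocations) {   // free(nullptr), or tracking enabled
                return;                    // after startup with nothing recorded
            }
            // ... look up addr in allocations[] and clear its entry ...
        }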
  21. Aug 25, 2013
    • rcu: fix hang due to race while awaiting a quiescent state · ac7a8447
      Avi Kivity authored
      Waiting for a quiescent state happens in two stages: first, we request all
      cpus to schedule at least once.  Then, we wait until they do so.
      
      If, between the two stages, a cpu is brought online, then we will request
      N cpus to schedule but wait for N+1 to respond.  This of course never happens,
      and the system hangs.
      
      Fix by copying the vector that holds the cpus we signal and wait for,
      forcing the two stages to be consistent. This is safe since newly-added
      cpus cannot be accessing any rcu-protected variables before we start
      signalling.
      
      Fixes random hangs with rcu, mostly seen with 'perf callstack'
      ac7a8447
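      A sketch of the fix (the helper names are hypothetical): take one snapshot
      of the cpu list and use it for both stages, so a cpu brought online in
      between is neither signalled nor waited for.

        std::vector<sched::cpu*> snapshot(sched::cpus.begin(), sched::cpus.end());
        for (auto c : snapshot) {
            request_reschedule(c);       // stage 1: ask each cpu to schedule once
        }
        for (auto c : snapshot) {
            wait_until_rescheduled(c);   // stage 2: wait for exactly those cpus
        }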
  22. Aug 19, 2013
  23. Aug 18, 2013
  24. Aug 16, 2013
    • sched: Avoid IPIs in thread::wake() · 71fec998
      Pekka Enberg authored
      Avoid sending an IPI to a CPU that's already being woken up by another
      IPI.  This reduces IPIs by 17% for a cassandra-stress run. Execution
      time is obviously unaffected because execution is bound by lock
      contention.
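
      A sketch of the idea (the flag and helper below are hypothetical, not the
      actual OSv code): only the first waker sends the IPI; later wakers see the
      flag already set and skip it.

        if (!target_cpu->wakeup_ipi_sent.exchange(true)) {
            target_cpu->send_wakeup_ipi();
        }
        // the target cpu clears wakeup_ipi_sent once it handles the wakeup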
      
      Before:
      
      [penberg@localhost ~]$ sudo perf kvm stat -e kvm:* -p `pidof qemu-system-x86_64`
      ^C
       Performance counter stats for process id '610':
      
               6,909,333 kvm:kvm_entry
                       0 kvm:kvm_hypercall
                       0 kvm:kvm_hv_hypercall
               1,035,125 kvm:kvm_pio
                       0 kvm:kvm_cpuid
               5,149,393 kvm:kvm_apic
               6,909,369 kvm:kvm_exit
               2,108,440 kvm:kvm_inj_virq
                       0 kvm:kvm_inj_exception
                     982 kvm:kvm_page_fault
               2,783,005 kvm:kvm_msr
                       0 kvm:kvm_cr
                   7,354 kvm:kvm_pic_set_irq
               2,366,388 kvm:kvm_apic_ipi
               2,468,569 kvm:kvm_apic_accept_irq
               2,067,044 kvm:kvm_eoi
               1,982,000 kvm:kvm_pv_eoi
                       0 kvm:kvm_nested_vmrun
                       0 kvm:kvm_nested_intercepts
                       0 kvm:kvm_nested_vmexit
                       0 kvm:kvm_nested_vmexit_inject
                       0 kvm:kvm_nested_intr_vmexit
                       0 kvm:kvm_invlpga
                       0 kvm:kvm_skinit
                   3,677 kvm:kvm_emulate_insn
                       0 kvm:vcpu_match_mmio
                       0 kvm:kvm_update_master_clock
                       0 kvm:kvm_track_tsc
                   7,354 kvm:kvm_userspace_exit
                   7,354 kvm:kvm_set_irq
                   7,354 kvm:kvm_ioapic_set_irq
                     674 kvm:kvm_msi_set_irq
                       0 kvm:kvm_ack_irq
                       0 kvm:kvm_mmio
                 609,915 kvm:kvm_fpu
                       0 kvm:kvm_age_page
                       0 kvm:kvm_try_async_get_page
                       0 kvm:kvm_async_pf_doublefault
                       0 kvm:kvm_async_pf_not_present
                       0 kvm:kvm_async_pf_ready
                       0 kvm:kvm_async_pf_completed
      
            81.180469772 seconds time elapsed
      
      After:
      
      [penberg@localhost ~]$ sudo perf kvm stat -e kvm:* -p `pidof qemu-system-x86_64`
      ^C
       Performance counter stats for process id '30824':
      
               6,411,175 kvm:kvm_entry                                                [100.00%]
                       0 kvm:kvm_hypercall                                            [100.00%]
                       0 kvm:kvm_hv_hypercall                                         [100.00%]
                 992,454 kvm:kvm_pio                                                  [100.00%]
                       0 kvm:kvm_cpuid                                                [100.00%]
               4,300,001 kvm:kvm_apic                                                 [100.00%]
               6,411,133 kvm:kvm_exit                                                 [100.00%]
               2,055,189 kvm:kvm_inj_virq                                             [100.00%]
                       0 kvm:kvm_inj_exception                                        [100.00%]
                   9,760 kvm:kvm_page_fault                                           [100.00%]
               2,356,260 kvm:kvm_msr                                                  [100.00%]
                       0 kvm:kvm_cr                                                   [100.00%]
                   3,354 kvm:kvm_pic_set_irq                                          [100.00%]
               1,943,731 kvm:kvm_apic_ipi                                             [100.00%]
               2,047,024 kvm:kvm_apic_accept_irq                                      [100.00%]
               2,019,044 kvm:kvm_eoi                                                  [100.00%]
               1,949,821 kvm:kvm_pv_eoi                                               [100.00%]
                       0 kvm:kvm_nested_vmrun                                         [100.00%]
                       0 kvm:kvm_nested_intercepts                                    [100.00%]
                       0 kvm:kvm_nested_vmexit                                        [100.00%]
                       0 kvm:kvm_nested_vmexit_inject                                 [100.00%]
                       0 kvm:kvm_nested_intr_vmexit                                   [100.00%]
                       0 kvm:kvm_invlpga                                              [100.00%]
                       0 kvm:kvm_skinit                                               [100.00%]
                   1,677 kvm:kvm_emulate_insn                                         [100.00%]
                       0 kvm:vcpu_match_mmio                                          [100.00%]
                       0 kvm:kvm_update_master_clock                                  [100.00%]
                       0 kvm:kvm_track_tsc                                            [100.00%]
                   3,354 kvm:kvm_userspace_exit                                       [100.00%]
                   3,354 kvm:kvm_set_irq                                              [100.00%]
                   3,354 kvm:kvm_ioapic_set_irq                                       [100.00%]
                     927 kvm:kvm_msi_set_irq                                          [100.00%]
                       0 kvm:kvm_ack_irq                                              [100.00%]
                       0 kvm:kvm_mmio                                                 [100.00%]
                 620,278 kvm:kvm_fpu                                                  [100.00%]
                       0 kvm:kvm_age_page                                             [100.00%]
                       0 kvm:kvm_try_async_get_page                                   [100.00%]
                       0 kvm:kvm_async_pf_doublefault                                 [100.00%]
                       0 kvm:kvm_async_pf_not_present                                 [100.00%]
                       0 kvm:kvm_async_pf_ready                                       [100.00%]
                       0 kvm:kvm_async_pf_completed
      
            79.947992238 seconds time elapsed
      71fec998