  1. Apr 10, 2014
  2. Apr 09, 2014
  3. Apr 08, 2014
    • sched: fix waitqueue race causing failure to wake up · 4ef65eb6
      Avi Kivity authored
      
      When waitqueue::wake_all() wakes up waiting threads, it calls
      sched::thread::wake_lock() to enqueue those waiting threads on the mutex
      protecting the waitqueue, thus avoiding needless contention on the mutex.
      However, if a thread is already waking, we let it wake naturally and acquire
      the mutex itself.
      
      The problem is that the waitqueue code (wait_object<waitqueue>::poll())
      examines the wait_record it sleeps on to see whether it has been woken, and
      if not, goes back to sleep.  Since nothing in the thread-already-awake path
      clears the wait_record, that is exactly what happens, and the thread stalls
      until a timeout occurs.
      
      Fix by clearing the wait record.  As it is protected by the mutex, no
      extra synchronization is needed.
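
      The race and its fix can be illustrated with a toy, single-threaded
      model. This is only a sketch: waiter, record_pending, record_cleared()
      and the clear_when_already_waking flag are hypothetical names made up
      for illustration, not the OSv waitqueue API.

```cpp
#include <cassert>
#include <vector>

// Toy, single-threaded model; every name here is hypothetical.
enum class thread_state { sleeping, waking };

struct waiter {
    thread_state state;
    bool record_pending;   // stands in for a wait record that is not yet cleared
};

// What the waiter's poll step checks: it may proceed only once its record
// has been cleared; otherwise it goes back to sleep.
bool record_cleared(const waiter& w)
{
    return !w.record_pending;
}

void wake_all(std::vector<waiter>& waiters, bool clear_when_already_waking)
{
    for (auto& w : waiters) {
        if (w.state == thread_state::sleeping) {
            // Normal path: the waiter is handed over to the mutex and its
            // record is cleared.
            w.state = thread_state::waking;
            w.record_pending = false;
        } else if (clear_when_already_waking) {
            // Fixed path: the thread wakes by itself, but its record is
            // cleared too, so the poll step will not send it back to sleep.
            w.record_pending = false;
        }
        // Without the fix, an already-waking waiter keeps its record pending.
    }
}
```

      In this model, calling wake_all() without the flag leaves an
      already-waking waiter with a pending record, which is the stall
      described above; with the flag set, every waiter passes the check.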
      
      Observed with iperf -P 64 against the guest.  Likely triggered by net channels
      waking up the thread, and then before it has a chance to wake up, a FIN
      packet arrives that is processed in the driver thread; so when the packets
      are consumed the thread is in the waking state.
      
      Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • jvmballoon: fix _soft_max_balloons check · 04ac707f
      Gleb Natapov authored
      
      The number of balloons to be released is calculated as the difference
      between the current number of balloons and the soft max. If they are
      equal, no balloons are released and the loop repeats.
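
      The arithmetic can be sketched as follows. The commit message does not
      show the original condition, so this assumes the fix amounts to a
      strict comparison; balloons_to_release() and needs_release() are
      hypothetical helpers, not the jvmballoon code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration only.

// Number of balloons above the soft maximum; zero when at or below it.
std::size_t balloons_to_release(std::size_t current, std::size_t soft_max)
{
    return current > soft_max ? current - soft_max : 0;
}

// Assumed shape of the corrected check: strictly greater than the soft max.
// With a non-strict check, current == soft_max would enter the release path
// with nothing to release and make no progress.
bool needs_release(std::size_t current, std::size_t soft_max)
{
    return current > soft_max;
}
```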
      
      Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • zfs: spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock contention · c74afb15
      Raphael S. Carvalho authored
      Reviewed by: Matthew Ahrens <mahrens@delphix.com>
      Reviewed by: George Wilson <george.wilson@delphix.com>
      Reviewed by: Christopher Siden <christopher.siden@delphix.com>
      Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
      Approved by: Richard Lowe <richlowe@richlowe.net>
      
      Reference: https://illumos.org/issues/3581
      
      
      
      Patch taken from Illumos; slight changes were needed to port it to OSv.
      
      This patch aims to reduce taskq lock contention by dispatching work
      over independent task queues. The ZFS on Linux developers mention that
      it's not clear whether this issue affects their port, but profile results
      showed that time spent in taskq_thread() was reduced by about 11%.
      Apart from the performance benefit, the number of threads in OSv dropped
      from ~344 to ~224, possibly saving a good amount of memory.
      It is also a step towards staying in sync with upstream ZFS.
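
      The general idea of spreading work over independent queues, each with
      its own lock, can be sketched as below. This is a minimal illustration
      under stated assumptions: sharded_taskq is a hypothetical class, and
      the round-robin queue selection is chosen here for simplicity; it is
      not the ZFS taskq API and not necessarily how the Illumos patch picks
      a queue.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical illustration: work is spread over several queues, each
// protected by its own mutex, so dispatchers rarely contend on one lock.
class sharded_taskq {
public:
    explicit sharded_taskq(std::size_t nqueues) : _queues(nqueues)
    {
        assert(nqueues > 0);
    }

    // Round-robin selection (an assumption made for this sketch).
    void dispatch(std::function<void()> task)
    {
        std::size_t index = _next.fetch_add(1, std::memory_order_relaxed) % _queues.size();
        queue& q = _queues[index];
        std::lock_guard<std::mutex> guard(q.lock);
        q.tasks.push_back(std::move(task));
    }

    // Number of tasks currently waiting in one queue.
    std::size_t pending(std::size_t index)
    {
        queue& q = _queues[index];
        std::lock_guard<std::mutex> guard(q.lock);
        return q.tasks.size();
    }

    // Drain every queue on the calling thread; returns the number of tasks run.
    std::size_t run_all()
    {
        std::size_t ran = 0;
        for (auto& q : _queues) {
            std::deque<std::function<void()>> batch;
            {
                // Hold the lock only long enough to take the pending tasks.
                std::lock_guard<std::mutex> guard(q.lock);
                batch.swap(q.tasks);
            }
            for (auto& task : batch) {
                task();
                ++ran;
            }
        }
        return ran;
    }

private:
    struct queue {
        std::mutex lock;
        std::deque<std::function<void()>> tasks;
    };
    std::vector<queue> _queues;
    std::atomic<std::size_t> _next{0};
};
```

      With four queues and eight dispatched tasks, each queue ends up holding
      two tasks, so a dispatcher only ever takes the lock of the one queue it
      targets.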
      
      Addresses issue #247.
      
      Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • dhcp: fix parsing of DHCP options · 40d4ae32
      Tomasz Grabiec authored
      
      There are several issues with the current code. Firstly, the LENGTH_OK
      macro was used in the while() condition. This macro checks that
      options + op_len does not exceed the packet limit. It works fine
      when used from PARSE_OP() inside the switch, but because 'options' is
      bumped up by op_len at the end of the loop body, using the macro in
      the while condition may cause the loop to exit prematurely. As a
      result, OSv sometimes failed to parse the network mask and gateway,
      leaving them at 0.0.0.0, when started on Google Compute Engine, and
      then did not respond over the network. See issue #254.
      
      Another issue was that the stop condition which checks for op ==
      DHCP_OPTION_END was using 'op' from the outer context, which was never
      overwritten. The actual variable which was changed based on the packet
      content was redeclared inside the loop.
      
      A third problem, spotted by Vlad, is that the code was not handling
      DHCP_OPTION_PAD properly. This option has only an opcode byte and no
      following length byte. The current code would attempt to read the length
      byte and skip by that amount, which would yield an incorrect parsing
      result.
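
      A parsing loop that avoids the three problems above might look like
      the sketch below. It is not the OSv code: parse_dhcp_options() and the
      dhcp_options alias are hypothetical, and the option codes for PAD (0)
      and END (255) are taken from RFC 2132. The opcode is re-read on every
      iteration, PAD is skipped as a single byte, and bounds are checked
      before the length byte and the payload are read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Option codes from RFC 2132.
constexpr std::uint8_t DHCP_OPTION_PAD = 0;
constexpr std::uint8_t DHCP_OPTION_END = 255;

using dhcp_options = std::map<std::uint8_t, std::vector<std::uint8_t>>;

// Hypothetical helper: parses the options area of a DHCP packet.
dhcp_options parse_dhcp_options(const std::vector<std::uint8_t>& buf)
{
    dhcp_options out;
    std::size_t pos = 0;
    while (pos < buf.size()) {
        // Read the opcode afresh on every iteration.
        std::uint8_t op = buf[pos++];
        if (op == DHCP_OPTION_END) {
            break;
        }
        if (op == DHCP_OPTION_PAD) {
            continue;               // single byte, no length byte follows
        }
        if (pos >= buf.size()) {
            break;                  // truncated: no length byte
        }
        std::size_t len = buf[pos++];
        if (len > buf.size() - pos) {
            break;                  // payload would run past the packet
        }
        out[op].assign(buf.begin() + pos, buf.begin() + pos + len);
        pos += len;
    }
    return out;
}
```

      For example, a buffer holding a PAD byte, a subnet mask option (code
      1), another PAD byte, a router option (code 3) and END yields exactly
      two entries, even though the mask payload itself contains bytes equal
      to 255.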
      
      Reviewed-by: Vlad Zolotarov <vladz@cloudius-systems.com>
      Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • dhcp: remove lookup_opcode() · abbfc557
      Tomasz Grabiec authored
      
      The lookup_opcode() function is incorrect. It was mishandling
      DHCP_OPTION_PAD, which does not have a following length byte.
      
      Also, the while condition is reading 'op' value which never
      changes. This may result in reads beyond packet size.
      
      Since this function is unused the best fix is to remove it.
      
      Reviewed-by: Vlad Zolotarov <vladz@cloudius-systems.com>
      Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  4. Apr 07, 2014
  5. Apr 06, 2014
  6. Apr 04, 2014
  7. Apr 03, 2014
    • sched: fix rare crashes caused by reschedule running on the wrong CPU · ee92f736
      Nadav Har'El authored
      
      For a long time we've had the bug summarized in issue #178, where very
      rarely but consistently, in various runs such as Cassandra, Netperf and
      tst-queue-mpsc.so, we saw OSv crashing because of some corruption in the
      timer list, such as arming an already armed timer, or canceling an already
      canceled timer.
      
      It turns out the problem was the schedule() function, which basically did
      cpu::current()->schedule(). The problem is that if we're unlucky enough,
      the thread can be migrated right after calling cpu::current(), but before
      the irq disable in schedule(), which causes us to do a rescheduling for
      one CPU on a different CPU, which is a big faux pas. This can cause us,
      for example, to mess with one CPU's preemption_timer from a different CPU,
      causing the timer-related races and crashes we've seen in issue #178.
      
      Clearly, we shouldn't at all have a *method* cpu->schedule() which can
      operate on any cpu. Rather, we should have only a *function* (class-static)
      cpu::schedule() which operates on the current cpu - and makes sure we find
      that current CPU within the IRQ lock to ensure (among other things) the
      thread cannot get migrated.
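
      The ordering problem can be shown with a toy, single-threaded model.
      Everything in it is hypothetical and only illustrates the ordering:
      cpus, current_cpu(), preemption_point(), irq_guard, racy_schedule()
      and safe_schedule() are made-up names, not OSv's scheduler.

```cpp
#include <cassert>

// Toy single-threaded model; all names are hypothetical, not OSv's.
struct cpu {
    int id;
    int reschedules;
};

static cpu cpus[2] = {{0, 0}, {1, 0}};
static int current_id = 0;
static bool irqs_enabled = true;

cpu* current_cpu() { return &cpus[current_id]; }

// Stand-in for a preemption point: in this model the thread is migrated
// to the other CPU whenever interrupts are enabled.
void preemption_point()
{
    if (irqs_enabled) {
        current_id ^= 1;
    }
}

// RAII guard that disables "interrupts" for its scope.
struct irq_guard {
    bool saved;
    irq_guard() : saved(irqs_enabled) { irqs_enabled = false; }
    ~irq_guard() { irqs_enabled = saved; }
};

// Racy shape: the CPU is looked up before interrupts are disabled.
cpu* racy_schedule()
{
    cpu* c = current_cpu();
    preemption_point();          // migration may happen here
    irq_guard guard;
    ++c->reschedules;            // may touch a CPU we no longer run on
    return c;
}

// Fixed shape: disable interrupts first, then look up the current CPU.
cpu* safe_schedule()
{
    irq_guard guard;
    cpu* c = current_cpu();
    preemption_point();          // no-op: interrupts are disabled
    ++c->reschedules;
    return c;
}
```

      In this model racy_schedule() ends up operating on a CPU that is no
      longer the current one, while safe_schedule() always operates on the
      CPU it is running on.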
      
      Another benefit of this patch is that it actually simplifies the code,
      with one less function called "schedule".
      
      Fixes #178.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>