  1. Apr 08, 2014
    • sched: fix waitqueue race causing failure to wake up · 4ef65eb6
      Avi Kivity authored
      
      When waitqueue::wake_all() wakes up waiting threads, it calls
      sched::thread::wake_lock() to enqueue those waiting threads on the mutex
      protecting the waitqueue, thus avoiding needless contention on the mutex.
      However, if a thread is already waking, we let it wake naturally and acquire
      the mutex itself.
      
      The problem is that the waitqueue code (wait_object<waitqueue>::poll())
      examines the wait_record it sleeps on to see if it has woken, and if not,
      goes back to sleep.  Since nothing in that thread-already-waking path clears
      the wait_record, that is exactly what happens, and the thread stalls until a
      timeout occurs.
      
      Fix by clearing the wait record.  As it is protected by the mutex, no
      extra synchronization is needed.
      
      Observed with iperf -P 64 against the guest.  Likely triggered by net channels
      waking up the thread and then, before it has a chance to wake up, a FIN
      packet arriving and being processed in the driver thread; so when the packets
      are consumed the thread is in the waking state.
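      
      A minimal self-contained model of the race and the fix, using standard C++
      primitives instead of OSv's actual sched/waitqueue types (illustrative only,
      not the patched code):
      
        #include <mutex>
        #include <condition_variable>
      
        struct waitqueue_model {
            std::mutex mtx;
            std::condition_variable cv;
            bool waiting = true;            // stands in for the wait_record
      
            void wait() {
                std::unique_lock<std::mutex> lk(mtx);
                // Like wait_object<waitqueue>::poll(): if the record was not
                // cleared, the thread treats the wakeup as spurious and goes
                // back to sleep; this is the stall the patch fixes.
                cv.wait(lk, [&] { return !waiting; });
            }
      
            void wake_all() {
                std::lock_guard<std::mutex> lk(mtx);
                waiting = false;            // the fix: always clear the record,
                                            // even on the already-waking path
                cv.notify_all();
            }
        };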
      
      Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • dhcp: remove lookup_opcode() · abbfc557
      Tomasz Grabiec authored
      
      The lookup_opcode() function is incorrect. It mishandled DHCP_OPTION_PAD,
      which is not followed by a length byte.
      
      Also, the while condition reads an 'op' value which never changes. This may
      result in reads beyond the packet size.
      
      Since this function is unused, the best fix is to remove it.
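      
      For reference, a correct walk over DHCP options must treat PAD (and END) as
      single-byte options and bounds-check every read. A hedged sketch (constants
      per RFC 2132; illustrative, not OSv code):
      
        #include <cstdint>
        #include <cstddef>
      
        constexpr uint8_t DHCP_OPTION_PAD = 0;    // no length byte follows
        constexpr uint8_t DHCP_OPTION_END = 255;  // no length byte follows
      
        // Returns a pointer to the value of 'wanted', or nullptr. Unlike the
        // removed lookup_opcode(), it re-reads the opcode every iteration and
        // checks each read against 'len', so it cannot run past the packet.
        const uint8_t* find_dhcp_option(const uint8_t* options, size_t len,
                                        uint8_t wanted, uint8_t* out_len) {
            size_t i = 0;
            while (i < len) {
                uint8_t op = options[i];
                if (op == DHCP_OPTION_END) {
                    return nullptr;
                }
                if (op == DHCP_OPTION_PAD) {  // a single padding byte
                    ++i;
                    continue;
                }
                if (i + 1 >= len) {           // truncated option header
                    return nullptr;
                }
                uint8_t opt_len = options[i + 1];
                if (i + 2 + opt_len > len) {  // truncated option value
                    return nullptr;
                }
                if (op == wanted) {
                    *out_len = opt_len;
                    return &options[i + 2];
                }
                i += 2 + opt_len;             // skip header and value
            }
            return nullptr;
        }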
      
      Reviewed-by: Vlad Zolotarov <vladz@cloudius-systems.com>
      Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  2. Apr 03, 2014
    • sched: fix rare crashes caused by reschedule running on the wrong CPU · ee92f736
      Nadav Har'El authored
      
      For a long time we've had the bug summarized in issue #178, where very
      rarely but consistently, in various runs such as Cassandra, Netperf and
      tst-queue-mpsc.so, we saw OSv crashing because of some corruption in the
      timer list, such as arming an already armed timer, or canceling an already
      canceled timer.
      
      It turns out the problem was the schedule() function, which basically did
      cpu::current()->schedule(). The problem is that, if we're unlucky enough,
      the thread can be migrated right after calling cpu::current() but before
      the IRQ disable in schedule(), which causes us to run a reschedule for
      one CPU on a different CPU, which is a big faux pas. This can cause us,
      for example, to mess with one CPU's preemption_timer from a different CPU,
      causing the timer-related races and crashes we've seen in issue #178.
      
      Clearly, we shouldn't have a *method* cpu->schedule() at all, which can
      operate on any cpu. Rather, we should have only a *function* (class-static)
      cpu::schedule() which operates on the current cpu, and which determines
      that current CPU within the IRQ lock, to ensure (among other things) the
      thread cannot get migrated.
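      
      A hedged sketch of the shape of the change, with stubbed declarations (the
      real code lives in OSv's scheduler; names here are simplified):
      
        struct irq_lock_t { void lock(); void unlock(); };
        extern irq_lock_t irq_lock;
      
        class cpu {
        public:
            static cpu* current();             // reads the per-CPU pointer
            void reschedule_from_interrupt();  // operates on this CPU's state
      
            // Before: an instance method, invoked as cpu::current()->schedule().
            // The thread could migrate between cpu::current() and the IRQ
            // disable, leaving 'this' pointing at the wrong CPU.
            //
            // After: a class-static function; the current CPU is resolved only
            // once interrupts are off, so migration can no longer intervene.
            static void schedule() {
                irq_lock.lock();
                cpu::current()->reschedule_from_interrupt();
                irq_lock.unlock();
            }
        };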
      
      Another benefit of this patch is that it actually simplifies the code,
      with one less function called "schedule".
      
      Fixes #178.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
  3. Apr 01, 2014
    • core: introduce serial_timer_task · bd179712
      Tomasz Grabiec authored
      This is a wrapper of timer_task which should be used when atomicity of
      callback tasks and timer operations is required. The class accepts an
      external lock to serialize all operations. It provides sufficient
      abstraction to replace callouts in the network stack.
      
      Unfortunately, it requires some cooperation from the callback code
      (see try_fire()). That's because I couldn't extract the in_pcb lock
      acquisition out of the callback code in the TCP stack: there are
      other locks taken before it, and doing so _could_ result in lock order
      inversion problems and hence deadlocks. If we can prove these to be
      safe then the API could be simplified.
      
      It may also be worthwhile to propagate the lock passed to
      serial_timer_task down to timer_task, to save an extra CAS.
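      
      A hypothetical usage sketch, with names modeled on the description above
      rather than the exact API: every timer operation happens under the shared
      external lock, and the callback asks try_fire() whether this invocation is
      still current:
      
        mutex tcp_lock;    // e.g. the in_pcb lock in the TCP stack
      
        serial_timer_task retransmit(tcp_lock, [&](serial_timer_task& t) {
            WITH_LOCK(tcp_lock) {
                if (!t.try_fire()) {
                    return;     // a racing reschedule()/cancel() revoked us
                }
                // ... the actual callout work runs here, under tcp_lock ...
            }
        });
      
        WITH_LOCK(tcp_lock) {
            retransmit.reschedule(std::chrono::milliseconds(200));
        }
      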
    • core: introduce deferred work framework · 34620ff0
      Tomasz Grabiec authored
      The design behind timer_task
      
      timer_task was designed to make cancel() and reschedule() scale well
      with the number of threads and CPUs in the system. These methods may
      be called frequently and from different CPUs. A task scheduled on one
      CPU may later be rescheduled from another CPU. To avoid expensive
      coordination between CPUs, a lockfree per-CPU worker was implemented.
      
      Every CPU has a worker (async_worker) which has a task registry and a
      thread to execute the tasks. Most of the worker's state may only be
      changed from the CPU on which it runs.
      
      When a timer_task is rescheduled, it registers its percpu part in the
      current CPU's worker. When it is then rescheduled from another CPU, the
      previous registration is marked as invalid and a new percpu part is
      registered. When a percpu task fires, it checks whether it is the last
      registration; only then may it fire.
      
      Because timer_task's state is scattered across CPUs, some extra
      housekeeping needs to be done before it can be destroyed.  We need to
      make sure that no percpu task will try to access the timer_task object
      after it is destroyed. To ensure that, we walk the list of
      registrations of the given timer_task and atomically flip their state
      from ACTIVE to RELEASED. If that succeeds, the task is now revoked and
      the worker will not try to execute it. If that fails, the task is in
      the middle of firing and we need to wait for it to finish. When a
      per-CPU task is moved to the RELEASED state it is appended to the
      worker's queue of released percpu tasks using a lockfree MPSC queue.
      These objects may later be reused for registrations.
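      
      An illustrative sketch of the revocation step (the state names mirror the
      text; everything else is simplified and hypothetical):
      
        #include <atomic>
      
        enum class reg_state { ACTIVE, FIRING, RELEASED };
      
        struct percpu_registration {
            std::atomic<reg_state> state{reg_state::ACTIVE};
        };
      
        // Returns true if the registration was revoked before firing; false
        // means the worker is mid-fire and the destructor must wait for it.
        bool try_revoke(percpu_registration& reg) {
            reg_state expected = reg_state::ACTIVE;
            // Atomically flip ACTIVE -> RELEASED: on success the worker will
            // skip this registration; on failure it is currently firing.
            return reg.state.compare_exchange_strong(expected,
                                                     reg_state::RELEASED);
        }
      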
    • sched: introduce thread migration lock · 6c8a861d
      Tomasz Grabiec authored
      This can be useful when there's a need to perform operations on
      per-CPU structure(s) which all need to be executed on the same CPU, but
      there is code in between which may sleep (e.g. malloc).
      
      For example, this can be used to ensure that a dynamically allocated
      object is always freed on the same CPU on which it was allocated:
      
        WITH_LOCK(migration_lock) {
          auto _owner = *percpu_owner;
          auto x = new X();
          _owner->enqueue(x);
        }
    • sched: add atomic reset() operation to timer_base · b122a924
      Tomasz Grabiec authored
      It is needed by the new async framework.
    • balloons: define soft maximum amount · 9bbc86fc
      Glauber Costa authored
      
      One of the main problems we face with the balloon is the fact that by the
      time the JVM needs more memory, it may be too late. We have many mechanisms
      in place to try to mitigate this, by trying to release memory back to the
      JVM before it even notices it needs it.
      
      This is yet another of them: a soft maximum is defined. Unlike a hard
      maximum, OSv can create as many balloons as it sees fit. However, if we
      ever go beyond this maximum, we start putting pressure through the
      reclaimer, in order to be able to react more quickly if pressure builds up
      from the JVM side.
      
      Also, when we are over this limit and we need to release a balloon, we keep
      releasing until we are back in a safe zone, regardless of how many we were
      asked for.
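      
      An illustrative sketch of that policy (names are hypothetical, not OSv's
      real interfaces): exceeding the soft maximum is allowed, but starts
      pressuring the reclaimer, and a release request while over the limit keeps
      going until we are back in the safe zone:
      
        #include <cstddef>
      
        struct balloon_policy {
            size_t count = 0;     // balloons currently inflated
            size_t soft_max;      // may be exceeded, unlike a hard maximum
      
            void on_inflate() {
                if (++count > soft_max) {
                    start_reclaimer_pressure();  // react quicker to JVM demand
                }
            }
      
            void on_release_request(size_t asked) {
                // Release what was asked for, then keep releasing while still
                // over the soft maximum, regardless of the original request.
                while ((asked > 0 || count > soft_max) && count > 0) {
                    release_one();
                    --count;
                    if (asked > 0) --asked;
                }
            }
      
            void start_reclaimer_pressure();     // stubs, for the sketch only
            void release_one();
        };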
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • reclaimer: do not go hard if we're throttling · dad4357f
      Glauber Costa authored
      
      In case we are throttling, try to serve memory from the non-aggressive
      shrinker before going all-in. In particular, the current waiter may be
      allocating memory on behalf of the balloon (for instance, for its stack),
      so it is a bad idea to block. We may have more memory soon enough.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • zfs: throttle ZFS allocations during balloon · af767977
      Glauber Costa authored
      
      When we create a new balloon, memory will soon be available. However, until
      we actually create it, high-rate ARC users will continue to fill the ARC at
      will, which can easily lead to memory exhaustion.
      
      Blocking allocations until the balloon is created is not an option: in order to
      balloon, we need to allocate memory, and the source of this memory is quite
      unpredictable: backing heap, stack, and even file, if an object destruction
      needs to flush something out.
      
      Luckily, the ARC has a mechanism to avoid growing the cache: if the function
      arc_evict_needed() returns true, it will free buffers to make room for new
      ones instead of growing the cache.
      
      The most natural way of making that function return true would be to
      implement the function vm_paging_needed(), which is already there in generic
      code and would require zero ARC changes. However, doing this would imply
      that arc_reclaim_needed() would now return true as well. This would not only
      evict some buffers, but actually put the ARC in aggressive mode (especially
      because the allocations will keep coming).
      
      The solution I've chosen is to implement a new function in the ARC, wrapped
      in an ifdef so we don't lose track of it, that allows us to communicate that
      we should throttle new buffers due to a VM request (in our case, because we
      are waiting for the balloon).
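      
      A sketch of the shape of that change (the guard macro, function names and
      placement are illustrative, not the literal patch):
      
        #ifdef __OSV__
        static int arc_vm_throttle;         // set while a balloon is pending
      
        void arc_set_vm_throttle(int on)    // hypothetical OSv-side hook
        {
            arc_vm_throttle = on;
        }
        #endif
      
        static int arc_evict_needed(void)
        {
        #ifdef __OSV__
            // While we wait for the balloon, report that eviction is needed,
            // so the ARC recycles buffers instead of growing. Crucially, this
            // does not flip arc_reclaim_needed() into aggressive mode the way
            // implementing vm_paging_needed() would.
            if (arc_vm_throttle)
                return (1);
        #endif
            /* ... upstream eviction logic ... */
            return (0);
        }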
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • shrinkers: allow soft and hard mode · 5c2aba30
      Glauber Costa authored
      
      There are some shrinkers, like the ARC, that may become very aggressive
      when called more than once in a row. When we have no waiters, we can be
      more relaxed and just go for best-effort shrinking.
      
      This patch introduces a soft shrinking mode that does just that. In the
      future, we may introduce another threshold, smaller than the minimum, at
      which we go into hard mode even before we have waiters. For now, this will
      do.
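      
      An illustrative sketch of the two modes (the enum and method names are
      hypothetical, not OSv's exact shrinker interface):
      
        #include <cstddef>
      
        enum class shrink_mode {
            soft,   // best effort: no waiters yet, so be gentle (e.g. the ARC)
            hard    // someone is blocked waiting for memory: free aggressively
        };
      
        struct shrinker {
            virtual ~shrinker() = default;
            // Returns the number of bytes released. A shrinker like the ARC
            // can use 'mode' to avoid turning aggressive when called several
            // times in a row with no waiters present.
            virtual size_t shrink(size_t target, shrink_mode mode) = 0;
        };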
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • mempool: transform reclaimer thread into a normal thread · 18a82a55
      Glauber Costa authored
      
      The balloon is no longer synchronous, and now lives in its own
      thread. We used to have the requirement that the reclaimer needed
      a full pthread API, but that is no longer the case. Make it a
      normal thread now.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • balloon: close the balloons earlier if the regions intersect · b8970534
      Glauber Costa authored
      
      Instead of just looking at which region our start address falls in, we can
      iterate through all the vmas affected by installing the new vma, and close
      the region earlier.
      
      This design has one major advantage: split() will never happen, and if that
      is the case, the balloon becomes a lot more predictable, which at this
      point is a welcome dose of sanity in the whole thing.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • balloon: use a separate thread for the balloon · 045c602a
      Glauber Costa authored
      Synchronously ballooning unfortunately won't work: as usual, the garbage
      collector can run in order to release memory for the balloon, and then we
      will block.
      
      Since we can't make the balloon worker synchronous, we set its priority to
      a really high one, which guarantees that Java threads won't run before it.
      This will help in really fast allocation situations. During those, between
      the notification and the time this thread gets to run, the JVM may very
      well have exhausted all the remaining memory.
      
      Notice that this priority value is guaranteed to be lower (in OSv, lower
      values mean stronger priorities) than any JVM-created thread will ever get,
      since for JVM threads priority is only set through the setpriority
      interface, which yields 0.2 at its maximum priority.
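      
      A hedged sketch of the idea (API names approximate, and the 0.001 figure is
      illustrative, not the value the patch uses):
      
        // In OSv's scheduler a smaller priority value means a stronger
        // priority, and setpriority() maps the strongest nice level to 0.2,
        // so anything below that outranks every JVM-created thread.
        auto worker = new sched::thread([] {
            // wait for pressure notifications and inflate balloons
        });
        worker->set_priority(0.001f);  // stronger than setpriority()'s 0.2 floor
        worker->start();
      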
    • balloon: no longer make it part of the shrinker infrastructure. · 3c716bda
      Glauber Costa authored
      
      The current ballooning mechanism has a very serious design flaw, which
      became obvious once the tests with it advanced: if we wait until memory is
      short to release memory, the JVM itself may block. In particular, the
      blocking thread can be running the garbage collector. More likely, it can
      start running the GC to open up space for an object as big as the balloon.
      
      I don't believe this is something we should fix with a quick hack, so I am
      pushing for a redesign: the solution is to use a reservation system, which
      balloons beforehand, when the amount of reserved memory goes too low.
      
      This system works by noting that when we balloon, we are effectively
      limiting the maximum amount of memory Java will ever use. We already keep
      track of both the amount of free memory and the amount of memory allocated
      to the heap. With those figures, we can easily find out when the amount of
      used memory plus a full heap would exceed the amount of available memory.
      When that happens, we balloon.
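      
      The reservation test, in sketch form (accessor and parameter names are
      hypothetical). "Used memory plus a full heap exceeds available memory" is
      equivalent to asking whether the heap's remaining headroom no longer fits
      in free memory:
      
        #include <cstddef>
      
        bool should_balloon(size_t free_memory,     // currently free
                            size_t heap_committed,  // memory given to the heap
                            size_t heap_used)       // heap memory in use
        {
            // If the JVM filled its heap completely, it would need another
            // (heap_committed - heap_used) bytes; balloon before that exceeds
            // what is free.
            return heap_committed - heap_used > free_memory;
        }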
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • jvm_balloon: handle partial writes · 12f70bf0
      Glauber Costa authored
      
      There are situations in which some garbage collectors will copy a large
      object in parallel, using various threads, each being responsible for a
      part of the object. In particular, this easily happens under high memory
      pressure with the Parallel and ParallelOld collectors.
      
      If such a copy happens, the simple balloon move algorithm will break.
      However, because offset 'x' in the source will always be copied to offset
      'x' in the destination, we can still calculate the final destination
      object. The problem is that we cannot open the new balloon yet: since the
      JVM believes it is copying only a part of the object, the destination may
      (and usually will) contain valid objects, which need to be moved somewhere
      else themselves before we can install our object there.
      
      Also, we can't close the object fully when someone writes to it: because a
      part of the object is now already freed, the JVM may, and will, go ahead
      and copy another object to this location.
      
      This is quite a tricky problem, but the patch below solves it by keeping
      the source balloon open and counting the number of partial bytes copied. It
      makes the assumption that the JVM always needs to copy the whole object to
      a different location, meaning that once it has, the balloon can be closed.
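      
      An illustrative sketch of that bookkeeping (field and function names are
      hypothetical). Because offset x in the source always lands at offset x in
      the destination, the first partial write already reveals the final base
      address; the balloon is only moved once every byte has been copied:
      
        #include <cstddef>
      
        struct balloon_move_tracker {
            size_t object_size;
            size_t bytes_copied = 0;
            char*  destination = nullptr;
      
            // Called from the fault handler for each partial copy. Returns
            // true when the whole object has been copied and the source
            // balloon can be closed and reopened at 'destination'.
            bool on_partial_copy(char* src_base, char* src_off, char* dst_off,
                                 size_t len) {
                if (destination == nullptr) {
                    // Same offset in source and destination: recover the base.
                    destination = dst_off - (src_off - src_base);
                }
                bytes_copied += len;
                return bytes_copied == object_size;
            }
        };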
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • jvm_balloon: better define boundaries for the balloon. · e93bd6ba
      Glauber Costa authored
      
      Right now, we are starting the JVM with the maximum amount of memory we
      can. This is, however, terribly bad: if we go into reclaim during a heap
      allocation, we can easily crash the system. That is all but guaranteed if
      the allocation turns out to be done on behalf of the GC. In that case, no
      other Java threads will make progress, meaning the balloon cannot be
      activated.
      
      Even if the balloon is no longer part of the shrinker infrastructure, this
      may still be a problem, because we may need to use it to deflate memory.
      The best solution is to just limit heap usage to 90% of the available
      memory, which is the reclaim criterion.
      
      Also, it is pointless to activate the balloon under very low memory
      conditions, so if our total memory is less than 1GB, just leave it alone.
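      
      The sizing rule, sketched (constants from the text above; the helper name
      is hypothetical):
      
        #include <cstddef>
      
        constexpr size_t GiB = 1ull << 30;
      
        size_t choose_max_heap(size_t total_memory) {
            if (total_memory < 1 * GiB) {
                return total_memory;      // small guest: no balloon, leave
                                          // the JVM alone
            }
            return total_memory * 9 / 10; // 90%, the reclaim criterion
        }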
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • reclaimer: implement a stronger data structure for waking sleepers · 80646cd6
      Glauber Costa authored
      
      After Avi's last comments, here is a new take on the reclaimer's wake
      queue. It does not abuse semaphores in any way, so there is no need to
      extend them. (After this patch is applied, the semaphore extension patch
      can be reverted.)
      
      In this new approach, we will not only check free memory as a scalar, but
      will instead traverse the remaining page ranges (notice that in case of
      fragmentation this can be a *large* set) and try to find a range that would
      allow our allocation to happen.
      
      Also, because we are implementing both the wait and the "post" methods, the
      waking of the reclaimer happens atomically, and the problem that Tomek
      described (wake before waiters is set) no longer exists.
      
      Finally, another advantage of this approach is that we no longer need to
      keep track of the amount of memory we have freed when making OOM decisions:
      if we can wake someone up, we can continue; otherwise, we OOM. This will be
      especially helpful for the FS shrinkers, since they may not always be able
      to synchronously return an accurate, meaningful number (due to, for
      instance, writeback).
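      
      An illustrative sketch of the waking test (types simplified): instead of
      comparing a free-memory scalar, the "post" side walks the free page ranges
      looking for one that can satisfy the waiter:
      
        #include <cstddef>
        #include <vector>
      
        struct page_range { size_t size; };
      
        bool can_wake(const std::vector<page_range>& free_ranges,
                      size_t bytes_needed) {
            // Under fragmentation this set can be *large*, as noted above.
            for (const auto& r : free_ranges) {
                if (r.size >= bytes_needed) {
                    return true;   // wake the waiter and continue
                }
            }
            return false;          // no one can be satisfied: the OOM decision
        }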
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • balloon: use a shared ptr for the balloon object · c035ba98
      Glauber Costa authored
      
      The old explicit pattern would break if the vma that holds the balloon was
      already destroyed. Use a shared pointer to avoid the complications arising
      from that.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • balloon: fix move-to-itself bug · b079217c
      Glauber Costa authored
      
      While working on the reclaim policy, I came across an unhandled balloon
      case that seems to happen under certain low memory conditions: the balloon
      is moved to the same address it occupied before. While this seems weird at
      first, it is perfectly plausible if we consider that there can be a gap of
      almost a page in size before the balloon. If the JVM is only shifting the
      elements slightly, they will end up in the same ballooned region.
      
      The problem with that is that evacuate() is called first; the vma is still
      in place, so it will be found, and it will call delete. When the JVM fault
      handler executes, it will also call delete, and we will have a double free.
      
      This patch avoids the delete, but the problem is actually more intricate
      than that: if we just follow the normal path, we will remove the newly
      created address from the candidates list right after we insert it (because
      the addresses are the same). Also, if the address is *not* the same but
      falls in the same region (this is theoretical), evacuate will be called and
      split the region in two. Our code handling finish_move would have to be
      significantly complicated to deal with this. This patch also handles that
      theoretical case by making sure the whole region is always disposed of at
      once (by finishing the old balloon manually before evacuate is called).
      
      Why I am using a new mmu flag: when we search for the old vma, we need to
      identify which type of vma we are dealing with, whether an anonymous or a
      ballooned area. I use an extra flag to avoid a type field or anything like
      that.
      
      Couldn't we just take the vma out of the list before we delete it? In that
      case the region would be empty, and we could assume it was a ballooned
      region. That would also be a solution, but because we copy the flags from
      the previous vma, we would have to fall back to a default flag set. Marking
      the JVM vma and leaving it in place allows the flag information to be
      carried over.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
  4. Mar 27, 2014
    • zfsbuffers: reference count the arc buffer · 063a8a56
      Glauber Costa authored
      
      Gleb has noticed that the ARC buffers can become unshared too early. This
      happens because we call the UNMAP operation on every put(). That is
      certainly not what we want, since the buffer only has to be unshared when
      the last reference is gone.
      
      Design decisions:
      1) We obviously can't use the ARC's natural reference count for that, since
      bumping it would make the buffer unevictable.
      2) We could modify the arc_buf structure itself to add another refcnt
      (minimum 4 bytes). However, I am trying to keep core-ZFS modifications to a
      minimum, and only in places where they are totally unavoidable.
      
      Therefore, the solution is to add another hash, which hashes the whole
      buffer instead of the physaddr like the one we have currently. In terms of
      memory usage, it adds only 8 bytes per buffer (each buffer being roughly
      128k), which makes for a memory usage of 64k per mapped GB, compared to the
      arc refcount solution. This is a good trade-off.
      
      I am also avoiding adding a new vop_map/unmap style operation just to query
      the buffer address from its file attributes (needed for the put side).
      Instead, I am adopting the convention that an empty iovec means query, and
      a filled iovec means unshare.
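      
      An illustrative sketch of the per-buffer reference count (a real version
      would need the hash protected by a lock; names are hypothetical):
      
        #include <unordered_map>
      
        // One refcount per ARC buffer, keyed by the buffer address rather
        // than the physical address, so unsharing happens only on the last
        // put(). This avoids touching arc_buf itself and keeps buffers
        // evictable.
        std::unordered_map<void*, unsigned> buf_refs;
      
        void buffer_get(void* buf) {
            ++buf_refs[buf];             // first get inserts with count 1
        }
      
        bool buffer_put(void* buf) {     // true means: unshare/unmap now
            auto it = buf_refs.find(buf);
            if (--it->second == 0) {
                buf_refs.erase(it);
                return true;             // last reference just went away
            }
            return false;                // other mappings still use the buffer
        }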
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
  5. Mar 21, 2014
    • mmu: only handle page sized maps for buffer mappings · a6e826b4
      Glauber Costa authored
      
      Gleb has noticed a problem with the buffer mapping logic: unrelated
      mappings adjacent to the file mapping could be invalidated as well. This is
      because we were invalidating an area as large as the buffer. If the mapping
      ends, for instance, in the middle of the buffer, we would invalidate
      adjacent regions too.
      
      The solution is to change the code to handle per-page, not per-buffer,
      mappings. We will likely need something like that anyway for COW maps, so
      let's just start handling it now.
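      
      A sketch of the per-page approach (the helper name is hypothetical): only
      the pages actually covered by the mapping are invalidated, so unrelated
      mappings adjacent to it inside the same buffer are left untouched:
      
        #include <cstddef>
      
        constexpr size_t page_size = 4096;
      
        void invalidate_one_page(char* va);    // stub for the sketch
      
        void invalidate_mapping(char* start, size_t map_len) {
            for (char* p = start; p < start + map_len; p += page_size) {
                invalidate_one_page(p);        // per page, not per buffer
            }
        }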
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • mmu: handle multiple mappings to the same file · a17309e2
      Glauber Costa authored
      
      This case is not expected to be particularly common, but we have to handle
      it anyway. This implementation is slightly naive, but it should be okay if
      the number of mappings is small. We should revisit it if this number starts
      growing too much.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • map_file: expand parameters to get_page/put_page · 3c77122f
      Glauber Costa authored
      
      To better deal with shared buffers, we should expand the get_page/put_page
      interface. Three changes are needed. First, the start of the virtual
      address range is passed to both. Together with the offset, it can be used
      to uniquely identify the virtual address being mapped; that will be used to
      manage the mapping list.
      
      Second, regarding the offset itself, we also need to pass the file offset
      downwards, the one specified by the mapping. Reading it from the file
      pointer is not enough: not only is that not what the user asked for, but it
      can also change during the map's lifetime. It also needs to be a separate
      parameter from the in-mapping offset, not summed together with it; doing
      otherwise would make it impossible for the mapping function to
      differentiate between those two concepts.
      
      Third, put_page is called from the free operation, but does not receive the
      actual address from it. Knowing the physical address will be important to
      shut down our mapping list.
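      
      Hypothetical before/after signatures reflecting the three changes above
      (parameter names are illustrative, not the exact patch):
      
        #include <cstdint>
      
        // before (roughly):
        //   void* get_page(uintptr_t offset);
        //   void  put_page(uintptr_t offset);
      
        struct page_mapper_sketch {
            // 'start' is the beginning of the mapped virtual range; together
            // with 'offset' (the in-mapping offset) it uniquely identifies the
            // mapped address. 'file_offset' is the offset the mapping asked
            // for, kept separate so the two concepts stay distinguishable.
            virtual void* get_page(void* start, uintptr_t offset,
                                   uintptr_t file_offset) = 0;
            // put_page now also receives the address being released, so the
            // mapping list can be torn down from the free path.
            virtual void put_page(void* addr, void* start, uintptr_t offset,
                                  uintptr_t file_offset) = 0;
            virtual ~page_mapper_sketch() = default;
        };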
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>