  1. Sep 02, 2013
  2. Sep 01, 2013
  3. Aug 29, 2013
  4. Aug 28, 2013
    • mbufs: use an entire page for jumbop zone allocations · 0d466fab
      Glauber Costa authored
      Xen has hard requirements on page transfers and on how the grant tables are fed.
      The address needs to be page-aligned, since pfns rather than addresses are used,
      and we need to provide at least a full page per buffer, since the hypervisor is
      free to write data anywhere within the page.
      
      To achieve that, the netfront driver will use m_cljget to attach an extended
      buffer from the jumbop zone to the mbuf, since those buffers are page-sized.
      However, two problems arise from this:
      
      1) Allocating a page goes through malloc_large. Our implementation of malloc_large
      is currently terribly inefficient, and that creates a very heavy contention site.
      
      With this patch, I switch our uma implementation to use alloc_page / free_page
      instead of malloc when the caller of zcreate requests it (and then, of course,
      request it for the jumbop cache).
      
      2) The refcount that is attached at the end of the buffer would either extend the
      buffer to 4100 bytes, defeating our purpose, or force the buffer to be
      PAGE_SIZE - 4 bytes to accommodate the refcount. But since the hypervisor will
      write to the whole page, it would eventually overwrite the refcount.
      
      To address that, I am allocating an external reference counter. BSD already
      has some infrastructure to do that, and I am taking advantage of it.
      However, I have found no way of implementing this such that the reference
      count can be easily deduced from the address of the extended buffer without
      having the supporting mbuf to start from. Any external data structure, such
      as a hash table, would probably make freeing way too slow. Thankfully,
      uma_find_refcnt and UMA_ZONE_REFCNT seem to be used mostly in the
      setup/destruction phase (the mbuf refcount is used directly, open coded). So my
      proposal here is to remove UMA_ZONE_REFCNT for that zone.
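
The idea of keeping the reference counter outside a page-sized, page-aligned buffer can be sketched as follows. This is only an illustration under stated assumptions: `ext_buf`, `ext_buf_alloc`, `ext_buf_ref`, `ext_buf_unref` and the 4096-byte `page_size` are hypothetical names and values, not the UMA or mbuf API, and `std::aligned_alloc` stands in for alloc_page.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

constexpr std::size_t page_size = 4096;  // assumed page size

// Hypothetical stand-in for an mbuf with an attached external buffer.
struct ext_buf {
    void* page = nullptr;                     // full, page-aligned page
    std::atomic<unsigned>* refcnt = nullptr;  // counter stored outside the page
};

// Allocate one whole page plus a separate counter initialized to 1.
inline bool ext_buf_alloc(ext_buf& b)
{
    b.page = std::aligned_alloc(page_size, page_size);
    if (!b.page) {
        return false;
    }
    b.refcnt = new std::atomic<unsigned>(1);
    return true;
}

inline void ext_buf_ref(ext_buf& b)
{
    b.refcnt->fetch_add(1, std::memory_order_relaxed);
}

// Drop one reference; returns true if this was the last one and the
// page and counter were released.
inline bool ext_buf_unref(ext_buf& b)
{
    if (b.refcnt->fetch_sub(1, std::memory_order_acq_rel) != 1) {
        return false;
    }
    std::free(b.page);
    delete b.refcnt;
    b.page = nullptr;
    b.refcnt = nullptr;
    return true;
}
```

Because the counter lives outside the page, whoever fills the buffer can write all 4096 bytes without touching it. The cost, as noted above, is that the counter cannot be found from the buffer address alone.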
    • work around xen x2apic bug · cc3d517a
      Glauber Costa authored
      The x2APIC specification says that reading from the X2APIC_ID MSR should return
      the physical APIC id of the current processor. However, the Xen implementation
      (as of 4.2.2) is broken, and reads actually return the old-style xAPIC id. Even
      if they fix it, we still have HVs deployed that will return the wrong ID.
      We can work around this by testing whether the returned APIC id has the form
      (id << 24), since in that case the low 24 bits will all be zero. Then at least
      we can get this working everywhere. This may pose a problem if we ever want to
      support more than 1 << 24 vCPUs (or if any other HV has some random x2apic
      ids), but that is highly unlikely anyway.
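
The check described above can be sketched as a small pure function. The name `normalize_apic_id` is hypothetical and this is not the code from the patch; it only illustrates the "low 24 bits are zero" test.

```cpp
#include <cstdint>

// If the value read from the X2APIC_ID MSR looks like an old-style
// xAPIC id (id << 24, so the low 24 bits are all zero), shift it back
// down. Otherwise assume it already is a proper x2APIC id.
constexpr std::uint32_t normalize_apic_id(std::uint32_t raw)
{
    if ((raw & 0x00ffffffu) == 0) {
        return raw >> 24;  // also maps 0 to 0
    }
    return raw;
}
```

An id of 0 is the same in both encodings, so it needs no special case.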
    • apic: bringup cpus individually instead of all at the same time · 5cb16020
      Glauber Costa authored
      As I have described in a previous patch, the Xen hypervisor has a very nasty
      bug that causes all of the x2apic msr writes to trigger a GPF. Although the
      request proceeds fine despite the GPF, it creates a problem for the
      all-but-self style init sequences we are using: after "failing" (succeeding
      but returning failure) to deliver the interrupt to the first cpu in the
      group, xen will break the loop, and therefore not deliver the SIPIs to the
      other cpus in the system at all. We can work around that by delivering
      interrupts to each cpu individually, instead of all-but-self.
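
A rough model of the per-cpu delivery loop is shown below. `deliver_individually` and `send_fn` are hypothetical names; the callback stands in for whatever actually writes the IPI, and its reported status is deliberately ignored, on the assumption stated above that affected hypervisors report failure even though the request goes through.

```cpp
#include <functional>
#include <vector>

// Hypothetical callback that sends one IPI to one cpu and reports the
// (possibly bogus) status returned for that write.
using send_fn = std::function<bool(unsigned cpu)>;

// Send to every cpu except `self`, one at a time. Returns the number of
// cpus the IPI was sent to.
inline unsigned deliver_individually(const std::vector<unsigned>& cpus,
                                     unsigned self, const send_fn& send)
{
    unsigned attempted = 0;
    for (unsigned cpu : cpus) {
        if (cpu == self) {
            continue;
        }
        send(cpu);  // status ignored: a reported failure must not stop the loop
        ++attempted;
    }
    return attempted;
}
```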
    • implement wrmsr_safe · a7ea5784
      Glauber Costa authored
      Unfortunately, the Xen hypervisor has a very nasty bug that causes all x2apic
      msr writes to init-related registers (INIT, SIPI, etc.) to trigger a GPF. It
      seems to be fixed by a 2013 patch, which means that although it is fixed, a lot
      of deployed hypervisors will still have it. The way to work around this is to
      implement a form of "wrmsr_safe".
    • trivial: remove device debug messages · c6bc3478
      Glauber Costa authored
      I forgot to remove some kprintfs from device.c that were inserted
      during Xen's blkfront development.
    • gdb: Add mmap info to 'osv mem' · 34efd764
      Pekka Enberg authored
      Now that we can walk through the vma list, add mmap numbers to 'osv
      mem':
      
        (gdb) osv mem
        Total Memory: 4294564864 Bytes
        Mmap Memory:  3278278656 Bytes (76.34%)
        Free Memory:  474492928 Bytes (11.05%)
    • gdb: 'osv mmap' for inspecting vmas · 448ef255
      Pekka Enberg authored
  5. Aug 27, 2013
    • Fix mincore() on non-mmap()ed memory · 6924f7db
      Nadav Har'El authored
      Commit 65afd075 fixed mincore() to recognize
      unmapped addresses. However, it used mmu::ismapped() which just checks for
      mmap()'ed addresses, and doesn't know about malloc()ed memory. This causes
      trouble for libunwind (which we use for backtrace()) which tests mincore()
      on an on-stack variable, and for non-pthread threads, this stack might be
      malloc'ed, not mmap'ed.
      
      So this patch adds mmu::isreadable(), which checks that a given memory range
      is all readable (this memory can be mmapped, malloced, stack, whatever).
      mincore() now uses that.
      
      mmu::isreadable() is implemented, following Avi's idea, by trying to read,
      with safe_load(), one byte from every page in the range. This approach is
      faster than page-table walking, especially for one-byte checks (which are all
      that libunwind uses anyway), and is also very simple.
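
The "one byte per page" approach can be sketched like this. It is a minimal model, not the mmu::isreadable() implementation: `is_readable` and `probe_fn` are hypothetical names, the probe callback stands in for safe_load(), and a 4096-byte page is assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

constexpr std::uintptr_t page_size = 4096;  // assumed page size

// Hypothetical stand-in for safe_load(): returns false if reading one
// byte at addr would fault.
using probe_fn = std::function<bool(std::uintptr_t addr)>;

// Check [addr, addr + size) by probing one byte in every page the range
// touches: first at addr itself, then at each following page boundary.
inline bool is_readable(std::uintptr_t addr, std::size_t size,
                        const probe_fn& probe)
{
    if (size == 0) {
        return true;
    }
    const std::uintptr_t end = addr + size;  // one past the last byte
    for (std::uintptr_t p = addr; p < end;
         p = (p & ~(page_size - 1)) + page_size) {
        if (!probe(p)) {
            return false;
        }
    }
    return true;
}
```

A one-byte check costs exactly one probe, which is the case the commit message singles out as the common one.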
    • Test mincore() on stack and malloc()ed memory · 73cc470d
      Nadav Har'El authored
      Unlike msync(), mincore() should also work on non-mmapped memory,
      such as stack and malloc()ed memory. Currently it doesn't - it
      fails on malloc()ed memory and only sometimes works on stacks (works
      on pthread stacks which are mmapped, but not on sched::thread stacks
      which are malloced by default).
      
      This patch adds a test to tst-mmap.cc to demonstrate this problem.
      The test currently fails, and will be fixed in a follow-up patch.
    • mempool.c: trace large allocations · 0a798e4d
      Glauber Costa authored
      Most of the performance problems I have found on Xen were due to the fact that
      we were hitting malloc_large consistently, for allocations that we should be
      able to service in some other way. Because malloc_large in our implementation
      is such a bottleneck, it was very useful for me to have separate tracepoints
      for them. I am therefore proposing them for inclusion.
    • Fix deadlock in leak detector · 227eb39b
      Nadav Har'El authored
      Commit 65afd075 that fixed mincore()
      exposed a deadlock in the leak detector, caused by two threads taking
      two locks in opposite order:
      
      Thread 1:  malloc() does alloc_tracker::remember(). This takes the tracker
         lock and calls backtrace() calling mincore() which takes the
         vma_list_mutex.
      
      Thread 2: mmap() does mmu::allocate() which takes the vma_list_mutex and
         then through mmu::populate::small_page calls memory::alloc_page() which
         calls alloc_tracker::remember() and takes the tracker lock.
      
      This patch fixes this deadlock: alloc_tracker::remember() will now drop its
      lock while running backtrace(), as the lock is only needed to protect the
      allocations[] array. We need to retake the lock after backtrace() completes,
      to copy the backtrace back to the allocations[] array.
      
      Previously, the lock's depth was also (ab)used for avoiding nested
      allocation tracking (e.g., tracking of memory allocation done inside
      backtrace() itself), but now that backtrace() is run without the lock,
      we need a different mechanism - a per-thread "in_tracker" flag, which
      is turned on inside the alloc_tracker::remember()/forget() methods.
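
The locking pattern described above (drop the lock around backtrace(), retake it to store the result, and use a per-thread flag against nested tracking) can be sketched as follows. `tracker_sketch`, `alloc_record` and the `capture` callback are hypothetical; the callback stands in for backtrace(), and the records only grow here so that a saved slot index stays valid, which the real tracker does not guarantee.

```cpp
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

struct alloc_record {
    void* addr;
    std::vector<void*> trace;
};

class tracker_sketch {
public:
    // capture: stand-in for backtrace(); it may take other locks or allocate.
    template <typename Capture>
    void remember(void* addr, Capture capture)
    {
        if (in_tracker) {
            return;  // nested call from inside the tracker: do not track
        }
        in_tracker = true;
        std::size_t slot;
        {
            std::lock_guard<std::mutex> guard(lock);
            records.push_back({addr, {}});
            slot = records.size() - 1;
        }
        // The lock is not held here, so capture() can take other locks
        // without creating a lock-order inversion with this one.
        std::vector<void*> trace = capture();
        {
            std::lock_guard<std::mutex> guard(lock);
            records[slot].trace = std::move(trace);
        }
        in_tracker = false;
    }

    std::size_t count()
    {
        std::lock_guard<std::mutex> guard(lock);
        return records.size();
    }

private:
    std::mutex lock;                     // protects records only
    std::vector<alloc_record> records;
    static thread_local bool in_tracker; // per-thread re-entrancy flag
};

thread_local bool tracker_sketch::in_tracker = false;
```

The sketch does not reset the flag if capture() throws; a real implementation would want a scope guard for that.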
    • docs: fix netperf instructions · 6f56f6a5
      Glauber Costa authored
      This allows lazy people like me to just copy the instructions.
    • cpu: initialize the FPU and CSR register · 04ddff7a
      Glauber Costa authored
      We can't trust the state of the FPU and the CSR registers to always be sane.
      Apparently, they aren't on at least one version of Xen (which happens to be
      the one I am using). Initialize them manually for all CPUs on bringup.
    • xen: correctly ack interrupts · bcf77dc9
      Glauber Costa authored
      In the xen interrupt code, I have made the mistake of exchanging the previous
      value of _irq_pending with true, which means that we were constantly polling
      for data in the interrupt threads.
      
      This was responsible for the latency spikes I was seeing. The simple "ping"
      test still shows bad results in absolute terms, but at least now the spikes are
      gone.
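
The difference between the buggy and the fixed acknowledgement comes down to the value passed to the atomic exchange. The sketch below uses a hypothetical `irq_state_sketch` type rather than the actual Xen interrupt code; only the exchange-with-false idea is taken from the description above.

```cpp
#include <atomic>

struct irq_state_sketch {
    std::atomic<bool> pending{false};

    // Called when an interrupt arrives.
    void raise() { pending.store(true, std::memory_order_release); }

    // Acknowledge: clear the flag and report whether it had been set.
    // Exchanging with true instead would leave the flag set forever,
    // so the interrupt thread would keep polling.
    bool ack() { return pending.exchange(false, std::memory_order_acq_rel); }
};
```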
  6. Aug 26, 2013
    • Avoid including elf.hh from sched.hh · 714d313a
      Nadav Har'El authored
      sched.hh included elf.hh, just so it can refer to the elf::tls_data
      type. But now that we have rcu.hh which includes sched.hh and therefore
      elf.hh, if we wish to use rcu in elf.hh (we'll do this in a later patch),
      we have an include loop mess.
      
      So better not include elf.hh from sched.hh, and just declare the one
      struct we need.
      
      After sched.hh no longer includes elf.hh and the dozen includes that
      it further included, we need to add missing includes to some of the
      code that included sched.hh and relied on its implicit includes.
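
Declaring the struct instead of including its header looks roughly like this. Only `elf::tls_data` comes from the message above; `thread_sketch` is a hypothetical type, and the sketch assumes the member is held by pointer, since a forward-declared (incomplete) type cannot be stored by value.

```cpp
// In place of #include "elf.hh": declare only the type that is needed.
namespace elf {
struct tls_data;  // forward declaration; the definition stays in elf.hh
}

namespace sched {

// Hypothetical stand-in for a class that refers to elf::tls_data.
struct thread_sketch {
    elf::tls_data* tls = nullptr;  // pointers to incomplete types are fine
};

}
```

Code that actually dereferences the pointer still has to include the full definition itself, which is why other files needed their missing includes added.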
    • signal: avoid nested signals · 4af36771
      Avi Kivity authored
      A signal within a signal handler is really bad news; abort when it happens
      to let the developers debug it.
    • mmu: don't pass really bad faults to the application · 6f464e76
      Avi Kivity authored
      Trying to execute the null pointer, or faulting within kernel code, is a
      really bad sign, and it's better to abort early in those cases.