  1. Aug 29, 2013
  2. Aug 28, 2013
    • Glauber Costa's avatar
      mbufs: use an entire page for jumbop zone allocations · 0d466fab
      Glauber Costa authored
      Xen has hard requirements on page transfers and on how to feed the grant tables.
      The address needs to be page aligned, since pfns rather than addresses are used,
      and we need to provide at least a full page per buffer, since the hypervisor is
      free to fill any data within the page.
      
      To achieve that, the netfront driver will use m_cljget to attach an extended
      buffer to the mbuf, from the jumbop zone, since they are page-sized. However,
      two problems arise from this:
      
      1) Allocating a page goes through malloc_large. Our implementation of malloc_large
      is currently terribly inefficient, and that creates a very heavy contention site.
      
      What I am doing with this patch is to switch our uma implementation to
      alloc_page / free_page instead of malloc if the caller of zcreate so specified
      (and then of course, specify it for the jumbop cache)
      
      2) The refcount that is attached at the end of the buffer would either extend the
      buffer to 4100 bytes, defeating our purpose, or the buffer would have to be
      PAGE_SIZE - 4 to accommodate the refcount. But since the hypervisor will write
      to the whole page, it would eventually overwrite the refcount.
      
      To address that, I am allocating an external reference counter. BSD already
      has some infrastructure to do that, and I am taking advantage of it.
      However, I have found no way of implementing this such that the
      reference count can be easily deduced from the address of the extended
      buffer, without having the supporting mbuf to start from. Any external data
      structure such as a hash table would probably make freeing way too slow. Thankfully,
      uma_find_refcnt and UMA_ZONE_REFCNT seem to be used mostly in the
      setup/destruction phase (the mbuf refcount is used directly, open coded). So my
      proposal here is to remove UMA_ZONE_REFCNT for that zone.
      0d466fab
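      The two constraints above (a whole, page-aligned page per buffer, and a
      reference count that lives outside that page) can be sketched as follows.
      This is a minimal illustration, not the patch itself: page_cluster,
      alloc_cluster and free_cluster are hypothetical names, a 4 KiB page is
      assumed, and posix_memalign stands in for alloc_page / free_page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>   // posix_memalign, free

constexpr std::size_t page_size = 4096;   // assumption: 4 KiB pages

// The hypervisor works with page frame numbers, so a buffer is only usable
// if its address has no offset within the page.
inline bool page_aligned(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & (page_size - 1)) == 0;
}

// Hypothetical stand-in for a jumbop cluster: the buffer owns a full page,
// and the reference count is stored somewhere else entirely.
struct page_cluster {
    void* page;              // exactly one page, page-aligned
    std::uint32_t* refcnt;   // external counter, never inside the page
};

inline page_cluster alloc_cluster()
{
    page_cluster c{nullptr, nullptr};
    if (posix_memalign(&c.page, page_size, page_size) != 0) {
        c.page = nullptr;    // allocation failed
        return c;
    }
    c.refcnt = new std::uint32_t(1);
    return c;
}

inline void free_cluster(page_cluster& c)
{
    free(c.page);
    delete c.refcnt;
    c.page = nullptr;
    c.refcnt = nullptr;
}
```

      Keeping the 4-byte counter inside the allocation would give either a
      4100-byte buffer or a usable area of PAGE_SIZE - 4, which is exactly the
      trade-off the message rules out.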
    • Glauber Costa's avatar
      work around xen x2apic bug · cc3d517a
      Glauber Costa authored
      The x2APIC specification says that reading from the X2APIC_ID MSR should return
      the physical apic id of the current processor. However, the Xen implementation
      (as of 4.2.2) is broken, and reads actually return the old-style xAPIC id. Even if
      they fix it, we still have HVs deployed around that will return the wrong ID.
      We can work around this by testing if the returned APIC id is in the form (id
      << 24), since in that case the low 24 bits will all be zero. Then at least
      we can get this working everywhere. This may pose a problem if we ever want to
      support more than 1 << 24 vCPUs (or if any other HV has some random x2apic
      ids), but that is highly unlikely anyway.
      cc3d517a
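      The check described above can be sketched as a small pure function.
      normalize_x2apic_id is a hypothetical name, and the value passed in is
      assumed to be whatever the X2APIC_ID MSR read returned; the actual OSv
      code may be organised differently.

```cpp
#include <cassert>
#include <cstdint>

// If the low 24 bits are all zero, assume the hypervisor handed back an
// xAPIC-style value (id << 24) and undo the shift; otherwise trust the value.
constexpr std::uint32_t normalize_x2apic_id(std::uint32_t raw)
{
    return (raw & 0x00ffffffu) == 0 ? (raw >> 24) : raw;
}
```

      An id of 0 has the same encoding in both forms, so it passes through
      unchanged. A genuine x2APIC id of 1 << 24 or above with zero low bits
      would be misread, which is the limitation the message mentions.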
    • Glauber Costa's avatar
      apic: bringup cpus individually instead of all at the same time · 5cb16020
      Glauber Costa authored
      As I have described in a previous patch, the Xen hypervisor has a very nasty
      bug that causes all of the x2apic msr writes to trigger a GPF. Although the
      request proceeds fine despite the GPF, it does cause a problem for the
      all-but-self style init sequences we are using: after "failing" (succeeding but
      returning failure) to deliver the interrupt to the first cpu in the group, xen
      will break the loop, therefore not delivering the SIPIs to the other cpus in the
      system at all. We can work around that by delivering interrupts to each cpu
      individually, instead of all-but-self.
      5cb16020
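      The difference between the two delivery styles can be modelled in a few
      lines. This is only a toy model of the behaviour described in the message:
      deliver_fn, send_all_but_self and send_individually are hypothetical names,
      and the "hypervisor" is a callback that performs the delivery but reports
      failure.

```cpp
#include <cassert>
#include <functional>
#include <set>

// Delivers an interrupt to one cpu and reports whether it "succeeded".
using deliver_fn = std::function<bool(unsigned)>;

// Models the all-but-self loop as described: it stops at the first reported
// failure, so later cpus never receive the interrupt.
inline void send_all_but_self(unsigned ncpus, unsigned self, const deliver_fn& deliver)
{
    for (unsigned cpu = 0; cpu < ncpus; ++cpu) {
        if (cpu == self) {
            continue;
        }
        if (!deliver(cpu)) {
            break;
        }
    }
}

// Workaround: one request per target cpu, so a spurious failure on one
// request cannot prevent delivery to the remaining cpus.
inline void send_individually(unsigned ncpus, unsigned self, const deliver_fn& deliver)
{
    for (unsigned cpu = 0; cpu < ncpus; ++cpu) {
        if (cpu != self) {
            deliver(cpu);   // result deliberately ignored
        }
    }
}
```

      With a callback that always delivers but always returns false, the first
      function reaches one cpu while the second reaches every cpu except the
      sender.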
    • Glauber Costa's avatar
      implement wrmsr_safe · a7ea5784
      Glauber Costa authored
      Unfortunately, the Xen hypervisor has a very nasty bug (it seems to be fixed by a
      2013 patch, which means that although it is fixed, a lot of deployed hypervisors
      will still have it) that causes all of the x2apic msr writes to init-related
      registers (INIT, SIPI, etc.) to trigger a GPF. The way to work around this is to
      implement a form of "wrmsr_safe".
      a7ea5784
    • Glauber Costa's avatar
      trivial: remove device debug messages · c6bc3478
      Glauber Costa authored
      I ended up forgetting to remove some kprintfs from device.c that were inserted
      during Xen's blkfront development.
      c6bc3478
    • Pekka Enberg's avatar
      gdb: Add mmap info to 'osv mem' · 34efd764
      Pekka Enberg authored
      Now that we can walk through the vma list, add mmap numbers to 'osv
      mem':
      
        (gdb) osv mem
        Total Memory: 4294564864 Bytes
        Mmap Memory:  3278278656 Bytes (76.34%)
        Free Memory:  474492928 Bytes (11.05%)
      34efd764
    • Pekka Enberg's avatar
      gdb: 'osv mmap' for inspecting vmas · 448ef255
      Pekka Enberg authored
      448ef255
  3. Aug 27, 2013
    • Nadav Har'El's avatar
      Fix mincore() on non-mmap()ed memory · 6924f7db
      Nadav Har'El authored
      Commit 65afd075 fixed mincore() to recognize
      unmapped addresses. However, it used mmu::ismapped() which just checks for
      mmap()'ed addresses, and doesn't know about malloc()ed memory. This causes
      trouble for libunwind (which we use for backtrace()) which tests mincore()
      on an on-stack variable, and for non-pthread threads, this stack might be
      malloc'ed, not mmap'ed.
      
      So this patch adds mmu::isreadable(), which checks that a given memory range
      is all readable (this memory can be mmapped, malloced, stack, whatever).
      mincore() now uses that.
      
      mmu::isreadable() is implemented, following Avi's idea, by trying to read,
      with safe_load(), one byte from every page in the range. This approach is
      faster than page-table walking, especially for one-byte checks (which is all
      libunwind uses anyway), and is also very simple.
      6924f7db
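      The "one byte per page" idea can be sketched as below. This is not the
      OSv implementation: is_readable is a hypothetical name, a 4 KiB page is
      assumed, and the probe callback is a stand-in for safe_load() that
      returns true when a byte at the given address could be read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

constexpr std::uintptr_t page_size = 4096;   // assumption: 4 KiB pages

// Probe the first byte of the range, then the first byte of every further
// page the range touches. One failed probe makes the whole range unreadable.
inline bool is_readable(std::uintptr_t addr, std::size_t size,
                        const std::function<bool(std::uintptr_t)>& probe)
{
    if (size == 0) {
        return true;
    }
    const std::uintptr_t end = addr + size;   // one past the last byte
    if (!probe(addr)) {
        return false;
    }
    for (std::uintptr_t p = (addr & ~(page_size - 1)) + page_size; p < end; p += page_size) {
        if (!probe(p)) {
            return false;
        }
    }
    return true;
}
```

      A one-byte check costs exactly one probe, and a range that straddles a
      page boundary costs one probe per page touched.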
    • Nadav Har'El's avatar
      Test mincore() on stack and malloc()ed memory · 73cc470d
      Nadav Har'El authored
      Unlike msync(), mincore() should also work on non-mmapped memory,
      such as stack and malloc()ed memory. Currently it doesn't - it
      fails on malloc()ed memory and only sometimes works on stacks (works
      on pthread stacks which are mmapped, but not on sched::thread stacks
      which are malloced by default).
      
      This patch adds a test to tst-mmap.cc to demonstrate this problem.
      The test currently fails; it will be fixed in a follow-up patch.
      73cc470d
    • Glauber Costa's avatar
      mempool.c: trace large allocations · 0a798e4d
      Glauber Costa authored
      Most of the performance problems I have found on Xen were due to the fact that
      we were hitting malloc_large consistently, for allocations that we should be
      able to service in some other way. Because malloc_large in our implementation
      is such a bottleneck, it was very useful for me to have separate tracepoints
      for these allocations. I am therefore proposing them for inclusion.
      0a798e4d
    • Nadav Har'El's avatar
      Fix deadlock in leak detector · 227eb39b
      Nadav Har'El authored
      Commit 65afd075 that fixed mincore()
      exposed a deadlock in the leak detector, caused by two threads taking
      two locks in opposite order:
      
      Thread 1:  malloc() does alloc_tracker::remember(). This takes the tracker
         lock and calls backtrace() calling mincore() which takes the
         vma_list_mutex.
      
      Thread 2: mmap() does mmu::allocate() which takes the vma_list_mutex and
         then through mmu::populate::small_page calls memory::alloc_page() which
         calls alloc_tracker::remember() and takes the tracker lock.
      
      This patch fixes this deadlock: alloc_tracker::remember() will now drop its
      lock while running backtrace(), as the lock is only needed to protect the
      allocations[] array. We need to retake the lock after backtrace() completes,
      to copy the backtrace back to the allocations[] array.
      
      Previously, the lock's depth was also (ab)used for avoiding nested
      allocation tracking (e.g., tracking of memory allocation done inside
      backtrace() itself), but now that backtrace() is run without the lock,
      we need a different mechanism - a per-thread "in_tracker" flag, which
      is turned on inside the alloc_tracker::remember()/forget() methods.
      227eb39b
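      The locking pattern described above can be sketched as follows. It is a
      simplified illustration rather than the real alloc_tracker: tracker_sketch,
      alloc_record and fake_backtrace are hypothetical, and the sketch collects
      the backtrace before taking the lock at all instead of dropping and
      retaking it. The point it shows is the same: the tracker lock is never
      held while backtrace code runs, and a per-thread flag replaces the lock
      depth as the guard against nested tracking.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Per-thread flag: set while this thread is inside the tracker.
thread_local bool in_tracker = false;

struct alloc_record {
    void* addr;
    std::size_t size;
    std::vector<void*> bt;
};

class tracker_sketch {
public:
    void remember(void* addr, std::size_t size)
    {
        if (in_tracker) {
            return;   // nested call made from inside the tracker itself
        }
        in_tracker = true;
        // Collected WITHOUT holding _lock, so any lock the backtrace code
        // takes is never acquired while the tracker lock is held.
        std::vector<void*> bt = fake_backtrace();
        {
            std::lock_guard<std::mutex> guard(_lock);
            _allocations.push_back(alloc_record{addr, size, std::move(bt)});
        }
        in_tracker = false;
    }

    std::size_t count()
    {
        std::lock_guard<std::mutex> guard(_lock);
        return _allocations.size();
    }

private:
    // Stand-in for backtrace(): it "allocates", which re-enters remember().
    std::vector<void*> fake_backtrace()
    {
        int scratch = 0;
        remember(&scratch, sizeof(scratch));   // ignored: in_tracker is set
        std::vector<void*> frames;
        frames.push_back(nullptr);             // placeholder frame
        return frames;
    }

    std::mutex _lock;
    std::vector<alloc_record> _allocations;
};
```

      Because the nested remember() call returns before touching _lock, the
      non-recursive mutex is never locked twice by the same thread.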
    • Glauber Costa's avatar
      docs: fix netperf instructions · 6f56f6a5
      Glauber Costa authored
      This allows lazy people like me to just copy the instructions
      6f56f6a5
    • Glauber Costa's avatar
      cpu: initialize the FPU and CSR register · 04ddff7a
      Glauber Costa authored
      We can't trust the state of the FPU and the CSR registers to be always sane.
      Apparently, they aren't on at least one version of Xen (which happens to be
      the one I am using). Initialize them manually for all CPUs on bringup.
      04ddff7a
    • Glauber Costa's avatar
      xen: correctly ack interrupts · bcf77dc9
      Glauber Costa authored
      In the xen interrupt code, I have made the mistake of exchanging the previous
      value of _irq_pending with true, which means that we were constantly polling
      for data in the interrupt threads.
      
      This was responsible for the latency spikes I was seeing. The simple "ping"
      test still shows bad results in absolute terms, but at least now the spikes are
      gone.
      bcf77dc9
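      The mistake and its fix come down to which value is swapped into the
      pending flag. The sketch below assumes the flag is a plain atomic boolean;
      irq_state_sketch and its member names are hypothetical and do not mirror
      the actual xen interrupt code.

```cpp
#include <atomic>
#include <cassert>

struct irq_state_sketch {
    std::atomic<bool> pending{false};

    // Called when the event arrives.
    void raise() { pending.store(true); }

    // Correct ack: report whether an interrupt was pending and clear the flag.
    bool consume() { return pending.exchange(false); }

    // The mistake: swapping in true leaves the flag set forever, so the
    // interrupt thread keeps seeing "pending" and polls constantly.
    bool consume_buggy() { return pending.exchange(true); }
};
```

      After one raise(), consume() returns true once and false afterwards,
      whereas consume_buggy() keeps returning true.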
  4. Aug 26, 2013
    • Nadav Har'El's avatar
      Avoid including elf.hh from sched.hh · 714d313a
      Nadav Har'El authored
      sched.hh included elf.hh, just so it can refer to the elf::tls_data
      type. But now that we have rcu.hh which includes sched.hh and therefore
      elf.hh, if we wish to use rcu in elf.hh (we'll do this in a later patch),
      we have an include loop mess.
      
      So better not include elf.hh from sched.hh, and just declare the one
      struct we need.
      
      Now that sched.hh no longer includes elf.hh and the dozen includes that
      it further included, we need to add missing includes to some of the
      code that included sched.hh and relied on its implicit includes.
      714d313a
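      Declaring the one struct instead of including its header looks roughly
      like this. The sketch assumes the type is only referred to through a
      pointer, which is all a forward declaration permits; sched_sketch and
      thread_sketch are hypothetical names, not the contents of sched.hh.

```cpp
#include <cassert>

// Forward declaration: enough to form pointers and references to the type,
// without pulling in the header that defines it (and everything it includes).
namespace elf {
struct tls_data;
}

namespace sched_sketch {

struct thread_sketch {
    elf::tls_data* tls = nullptr;   // incomplete type is fine behind a pointer
};

}
```

      Code that needs the full definition (to access members or take sizeof)
      must then include the defining header itself, which is why some users of
      sched.hh needed extra includes.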
    • Avi Kivity's avatar
      signal: avoid nested signals · 4af36771
      Avi Kivity authored
      A signal within a signal handler is really bad news; abort when it happens
      to let the developers debug it.
      4af36771
    • Avi Kivity's avatar
      mmu: don't pass really bad faults to the application · 6f464e76
      Avi Kivity authored
      Trying to execute the null pointer, or faulting within kernel code, is
      a really bad sign, and it's better to abort early in those cases.
      6f464e76
    • Pekka Enberg's avatar
      alloctracker: Fix forget() if remember() hasn't been called · 0affe14a
      Pekka Enberg authored
      If the leak detector is enabled after OSv startup, the first call can be to
      free(), not malloc(). Fix alloctracker::forget() to deal with that.
      
      Fixes the SIGSEGV when "osv leak on" is used to enable detection from
      gdb after OSv has started up:
      
        #
        # A fatal error has been detected by the Java Runtime Environment:
        #
        #  SIGSEGV (0xb) at pc=0x00000000003b8ee6, pid=0, tid=18446673706168635392
        #
        # JRE version: 7.0_25
        # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops)
        # Problematic frame:
        # C  0x00000000003b8ee6
        #
        # Core dump written. Default location: //core or core.0
        #
        # An error report file with more information is saved as:
        # /tmp/jvm-0/hs_error.log
        #
        # If you would like to submit a bug report, please include
        # instructions on how to reproduce the bug and visit:
        #   http://icedtea.classpath.org/bugzilla
        #
        Aborted
      
        [penberg@localhost osv]$ addr2line -e build/debug/loader.elf
        0x00000000003b8ee6
        /home/penberg/osv/build/debug/../../core/alloctracker.cc:90
      0affe14a
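      One way such a guard can look is sketched below. It rests on an
      assumption the message does not spell out: that the crash came from
      forget() touching state which only remember() sets up. lazy_tracker_sketch
      is a hypothetical class, not the code in core/alloctracker.cc.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

class lazy_tracker_sketch {
public:
    void remember(void* addr)
    {
        if (!_allocations) {
            // Storage is created lazily on the first tracked allocation.
            _allocations.reset(new std::vector<void*>());
        }
        _allocations->push_back(addr);
    }

    // Returns true if addr was being tracked.
    bool forget(void* addr)
    {
        if (!_allocations) {
            return false;   // free() seen before any malloc() was tracked
        }
        auto it = std::find(_allocations->begin(), _allocations->end(), addr);
        if (it == _allocations->end()) {
            return false;   // allocated before tracking was switched on
        }
        _allocations->erase(it);
        return true;
    }

private:
    std::unique_ptr<std::vector<void*>> _allocations;
};
```

      Both early returns matter when tracking is enabled late: the storage may
      not exist yet, and even when it does, the freed address may predate it.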
    • Pekka Enberg's avatar
      runtime: Fix mincore() on an unmapped address · 65afd075
      Pekka Enberg authored
      Fix mincore() to deal with unmapped addresses like msync() does.
      
      This fixes a SIGSEGV in libunwind's access_mem() when leak detector is
      enabled:
      
         (gdb) bt
        #0  page_fault (ef=0xffffc0003ffe7008) at ../../core/mmu.cc:871
        #1  <signal handler called>
        #2  ContiguousSpace::block_start_const (this=<optimized out>, p=0x77d2f3968)
            at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/oops/oop.inline.hpp:411
        #3  0x00001000008ae16c in GenerationBlockStartClosure::do_space (this=0x2000001f9100, s=<optimized out>)
            at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/memory/generation.cpp:242
        #4  0x00001000007f097c in DefNewGeneration::space_iterate (this=0xffffc0003fb68c00, blk=0x2000001f9100, usedOnly=<optimized out>)
            at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/memory/defNewGeneration.cpp:480
        #5  0x00001000008aca0e in Generation::block_start (this=<optimized out>, p=<optimized out>)
            at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/memory/generation.cpp:251
        #6  0x0000100000b06d2f in os::print_location (st=st@entry=0x2000001f9560, x=32165017960, verbose=verbose@entry=false)
            at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/runtime/os.cpp:868
        #7  0x0000100000b11b5b in os::print_register_info (st=0x2000001f9560, context=0x2000001f9740)
            at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp:839
        #8  0x0000100000c6cde8 in VMError::report (this=0x2000001f9610, st=st@entry=0x2000001f9560)
            at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/utilities/vmError.cpp:551
        #9  0x0000100000c6da3b in VMError::report_and_die (this=this@entry=0x2000001f9610)
            at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/share/vm/utilities/vmError.cpp:984
        #10 0x0000100000b1109f in JVM_handle_linux_signal (sig=11, info=0x2000001f9bb8, ucVoid=0x2000001f9740,
            abort_if_unrecognized=<optimized out>)
            at /usr/src/debug/java-1.7.0-openjdk-1.7.0.25-2.3.12.3.fc19.x86_64/openjdk/hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp:528
        #11 0x000000000039f242 in call_signal_handler (frame=0x2000001f9b10) at ../../arch/x64/signal.cc:69
        #12 <signal handler called>
        #13 0x000000000057d721 in access_mem ()
        #14 0x000000000057cb1d in dwarf_get ()
        #15 0x000000000057ce51 in _ULx86_64_step ()
        #16 0x00000000004315fd in backtrace (buffer=0x1ff9d80 <memory::alloc_tracker::remember(void*, int)::bt>, size=20)
            at ../../libc/misc/backtrace.cc:16
        #17 0x00000000003b8d99 in memory::alloc_tracker::remember (this=0x1777ae0 <memory::tracker>, addr=0xffffc0004508de00, size=54)
            at ../../core/alloctracker.cc:59
        #18 0x00000000003b0504 in memory::tracker_remember (addr=0xffffc0004508de00, size=54) at ../../core/mempool.cc:43
        #19 0x00000000003b2152 in std_malloc (size=54) at ../../core/mempool.cc:723
        #20 0x00000000003b259c in malloc (size=54) at ../../core/mempool.cc:856
        #21 0x0000100001615e4c in JNU_GetStringPlatformChars (env=env@entry=0xffffc0003a4dc1d8, jstr=jstr@entry=0xffffc0004591b800,
            isCopy=isCopy@entry=0x0) at ../../../src/share/native/common/jni_util.c:801
        #22 0x000010000161ada6 in Java_java_io_UnixFileSystem_getBooleanAttributes0 (env=0xffffc0003a4dc1d8, this=<optimized out>,
            file=<optimized out>) at ../../../src/solaris/native/java/io/UnixFileSystem_md.c:111
        #23 0x000020000021ed8e in ?? ()
        #24 0x00002000001faa58 in ?? ()
        #25 0x00002000001faac0 in ?? ()
        #26 0x00002000001faa50 in ?? ()
        #27 0x0000000000000000 in ?? ()
      
      Spotted by Avi Kivity.
      65afd075
    • Nadav Har'El's avatar
      __xstat64: Don't check version argument · 31fe1784
      Nadav Har'El authored
      Do to __xstat* what commit 018c672e
      did to __fxstat* - they had the same problem.
      31fe1784
    • Nadav Har'El's avatar
      __fxstat64: Don't check version argument · 018c672e
      Nadav Har'El authored
      In Linux, _STAT_VER is 1 on 64-bit (and 3 on 32-bit), but glibc never
      verifies the argument to __fxstat64. JNR - a library used by JRuby -
      wrongly (I believe) passes ver==0 to __fxstat64
      (see jnr-posix/..../LinuxPosix.java). On Linux this wrong argument is
      ignored, but in our implementation it fails the check.
      
      So this patch removes this check from our code as well, to let JNR, and
      therefore JRuby which uses it, use stat without failing.
      018c672e
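      The resulting behaviour amounts to a wrapper that accepts the version
      argument and ignores it. fxstat_sketch is a hypothetical name chosen to
      avoid clashing with the real libc symbol, and it simply forwards to the
      POSIX fstat() call; the OSv function may do more than this.

```cpp
#include <cassert>
#include <sys/stat.h>

// ver is accepted for ABI compatibility but deliberately not validated,
// matching what the message says glibc does.
inline int fxstat_sketch(int ver, int fd, struct stat* st)
{
    (void)ver;
    return fstat(fd, st);
}
```

      A caller passing ver == 0 therefore gets the same result as one passing
      the nominally correct value.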
    • Pekka Enberg's avatar
      zfs: Fix GPF in zfs_rmnode() · 3d3c65b3
      Pekka Enberg authored
      If a crashed OSv guest is restarted, ZFS mount causes a GPF in early
      startup:
      
        VFS: mounting zfs at /usr
        zfs: mounting osv/usr from device /dev/vblk1
        Aborted
      
      GDB backtrace points finger at zfs_rmnode():
      
        #0  processor::halt_no_interrupts () at ../../arch/x64/processor.hh:212
        #1  0x00000000003e7f2a in osv::halt () at ../../core/power.cc:20
        #2  0x000000000021cdd4 in abort (msg=0x636df0 "Aborted\n") at ../../runtime.cc:95
        #3  0x000000000021cda2 in abort () at ../../runtime.cc:86
        #4  0x000000000044c149 in osv::generate_signal (siginfo=..., ef=0xffffc0003ffe7008) at ../../libc/signal.cc:44
        #5  0x000000000044c220 in osv::handle_segmentation_fault (addr=72, ef=0xffffc0003ffe7008) at ../../libc/signal.cc:55
        #6  0x0000000000366df3 in page_fault (ef=0xffffc0003ffe7008) at ../../core/mmu.cc:876
        #7  <signal handler called>
        #8  0x0000000000345eaa in zfs_rmnode (zp=0xffffc0003d1de400)
            at ../../bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_dir.c:611
        #9  0x000000000035650c in zfs_zinactive (zp=0xffffc0003d1de400)
            at ../../bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c:1355
        #10 0x0000000000345be1 in zfs_unlinked_drain (zfsvfs=0xffffc0003ddfe000)
            at ../../bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_dir.c:523
        #11 0x000000000034f45c in zfsvfs_setup (zfsvfs=0xffffc0003ddfe000, mounting=true)
            at ../../bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:881
        #12 0x000000000034f7a4 in zfs_domount (vfsp=0xffffc0003de02000, osname=0x6b14cb "osv/usr")
            at ../../bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:1016
        #13 0x000000000034f98c in zfs_mount (mp=0xffffc0003de02000, dev=0x6b14d7 "/dev/vblk1", flags=0, data=0x6b14cb)
            at ../../bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:1415
        #14 0x0000000000406852 in sys_mount (dev=0x6b14d7 "/dev/vblk1", dir=0x6b14a3 "/usr", fsname=0x6b14d3 "zfs", flags=0, data=0x6b14cb)
            at ../../fs/vfs/vfs_mount.c:171
        #15 0x00000000003eff97 in mount_usr () at ../../fs/vfs/main.cc:1415
        #16 0x0000000000203a89 in do_main_thread (_args=0xffffc0003fe9ced0) at ../../loader.cc:215
        #17 0x0000000000448575 in pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, pthread_private::thread_attr const*)::{lambda()#1}::operator()() const () at ../../libc/pthread.cc:59
        #18 0x00000000004499d3 in std::_Function_handler<void(), pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::__lambda0>::_M_invoke(const std::_Any_data &) (__functor=...)
            at ../../external/gcc.bin/usr/include/c++/4.8.1/functional:2071
        #19 0x000000000037e602 in std::function<void ()>::operator()() const (this=0xffffc0003e170038)
            at ../../external/gcc.bin/usr/include/c++/4.8.1/functional:2468
        #20 0x00000000003bae3e in sched::thread::main (this=0xffffc0003e170010) at ../../core/sched.cc:581
        #21 0x00000000003b8c92 in sched::thread_main_c (t=0xffffc0003e170010) at ../../arch/x64/arch-switch.hh:133
        #22 0x0000000000399c8e in thread_main () at ../../arch/x64/entry.S:101
      
      The problem is that ZFS tries to check if the znode is an attribute
      directory and trips over zp->z_vnode being NULL.  However, as explained
      in commit b7ee91ef ("zfs: port vop_lookup"), we don't even support
      extended attributes so drop the check completely for OSv.
      3d3c65b3
    • Pekka Enberg's avatar
      1397a3ec
    • Pekka Enberg's avatar
      tst-zfs-disk: Drop broken ASSERT() · f43cdb68
      Pekka Enberg authored
      The ASSERT() doesn't compile if ZFS debugging is enabled:
      
        CC tests/tst-zfs-disk.o
      In file included from ../../bsd/sys/cddl/compat/opensolaris/sys/debug.h:35:0,
                       from ../../bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_context.h:42,
                       from ../../tests/tst-zfs-disk.c:28:
      ../../tests/tst-zfs-disk.c: In function ‘make_vdev_root’:
      ../../tests/tst-zfs-disk.c:119:9: error: ‘t’ undeclared (first use in this function)
        ASSERT(t > 0);
               ^
      ../../bsd/sys/cddl/contrib/opensolaris/uts/common/sys/debug.h:56:29: note: in definition of macro ‘ASSERT’
       #define ASSERT(EX) ((void)((EX) || assfail(#EX, __FILE__, __LINE__)))
                                   ^
      ../../tests/tst-zfs-disk.c:119:9: note: each undeclared identifier is reported only once for each function it appears in
        ASSERT(t > 0);
               ^
      ../../bsd/sys/cddl/contrib/opensolaris/uts/common/sys/debug.h:56:29: note: in definition of macro ‘ASSERT’
       #define ASSERT(EX) ((void)((EX) || assfail(#EX, __FILE__, __LINE__)))
                                   ^
      f43cdb68
  5. Aug 25, 2013
    • Avi Kivity's avatar
      rcu: fix hang due to race while awaiting a quiescent state · ac7a8447
      Avi Kivity authored
      Waiting for a quiescent state happens in two stages: first, we request all
      cpus to schedule at least once.  Then, we wait until they do so.
      
      If, between the two stages, a cpu is brought online, then we will request
      N cpus to schedule but wait for N+1 to respond.  This of course never happens,
      and the system hangs.
      
      Fix by copying the vector which holds the cpus which we signal and wait for,
      forcing them to be consistent.  This is safe since newly-added cpus cannot
      be accessing any rcu-protected variables before we started signalling.
      
      Fixes random hangs with rcu, mostly seen with 'perf callstack'.
      ac7a8447
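      The race and the fix can be modelled with plain cpu ids. This is a toy
      model, not the rcu code: unsignalled_waits is a hypothetical function that
      counts how many cpus stage two would wait for without stage one having
      signalled them, and the between_stages callback simulates a cpu coming
      online at the worst moment.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <vector>

// A non-zero result corresponds to a hang: some cpu is awaited but was
// never asked to schedule.
inline std::size_t unsignalled_waits(std::vector<int>& online, bool use_snapshot,
                                     const std::function<void()>& between_stages)
{
    // Stage one: signal every cpu that is online right now.
    const std::vector<int> snapshot = online;
    const std::set<int> requested(snapshot.begin(), snapshot.end());

    between_stages();   // a cpu may be brought online here

    // Stage two: wait either on the same copy (the fix) or on the live
    // list (the bug).
    const std::vector<int>& wait_for = use_snapshot ? snapshot : online;
    std::size_t missing = 0;
    for (int cpu : wait_for) {
        if (requested.count(cpu) == 0) {
            ++missing;
        }
    }
    return missing;
}
```

      Using one copy for both stages guarantees the signalled set and the
      awaited set are identical, whatever happens to the live list in between.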
  6. Aug 22, 2013
  7. Aug 21, 2013