  1. Jan 09, 2014
    • zfs: Fix on-disk data inconsistency on shutdown · 2d93af3b
      Raphael S. Carvalho authored
      
      This problem was found when running 'tests/tst-zfs-mount.so' multiple times.
      On the first run, all tests succeed; however, a subsequent run would
      fail at the test 'mkdir /foo/bar', with an error message reporting
      that the target file already exists.
      
      The test basically creates a directory /foo/bar, renames it to /foo/bar2,
      then removes /foo/bar2. How could /foo/bar still be there?
      
      Quite simple. Our shutdown function calls unmount_rootfs(), which
      attempts to unmount zfs with the flag MNT_FORCE. However, the flag is not
      being passed to zfs_unmount(), nor does unmount_rootfs() itself check the
      return status (which previously was always a failure).
      So OSv is really being shut down while there is remaining data waiting to
      be synced with the backing store. The result is inconsistency.
      
      This problem was fixed by passing the flag to VFS_UNMOUNT which will now
      unmount the fs properly on sudden shutdowns.
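      
      A toy model of the failure mode, not OSv's actual code: it assumes a
      filesystem that refuses a non-forced unmount while it still holds unsynced
      data. The names toy_fs, toy_unmount, shutdown_before, shutdown_after and
      the flag value SKETCH_MNT_FORCE are all hypothetical.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical flag value for this sketch; the real MNT_FORCE is defined in
// the BSD mount headers.
constexpr int SKETCH_MNT_FORCE = 0x1;

// Toy filesystem: holds unsynced data and refuses a non-forced unmount.
struct toy_fs {
    int dirty_blocks = 3;
    bool mounted = true;
};

// Stand-in for the filesystem's unmount operation.
int toy_unmount(toy_fs& fs, int flags)
{
    assert(fs.mounted);
    if (fs.dirty_blocks > 0 && !(flags & SKETCH_MNT_FORCE)) {
        return EBUSY;          // busy, and the caller did not force
    }
    fs.dirty_blocks = 0;       // pretend everything was written back
    fs.mounted = false;
    return 0;
}

// Before the fix: the caller's flags are dropped and the status is ignored,
// so the failed unmount goes unnoticed.
void shutdown_before(toy_fs& fs, int flags)
{
    (void)flags;
    (void)toy_unmount(fs, 0);
}

// After the fix: the flags reach the unmount operation and the status is
// returned to the caller.
int shutdown_after(toy_fs& fs, int flags)
{
    return toy_unmount(fs, flags);
}
```

      In this model, shutdown_before() leaves the filesystem mounted with its
      dirty blocks intact, while shutdown_after() with the force flag syncs
      and unmounts it.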
      
      Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  2. Jan 03, 2014
    • vfs: change the approach of releasing dentries during unmount · af466dbc
      Raphael S. Carvalho authored
      
      Currently, vflush is used in the unmount process to release remaining
      dentries. vflush in turn calls vevict that is releasing dentries that
      it doesn't own.
      This behavior is neither correct nor good for the future of VFS.
      
      So Avi suggested switching to a different approach. We could release
      only those dentries owned by the mount point when unmounting it, as
      there wouldn't be anything else in the dcache (given its functionality).
      
      The problem was fixed by doing the following steps:
       - Drop vflush calls in sys_umount2, make vevict an empty function,
      and remove vevict.
      
       - Create the function release_mp_dentries to release the dentries of a mount
      point, which will be called by VFS_UNMOUNT. It cannot be called before
      VFS_UNMOUNT, as failures must be considered, nor after, as the mount point
      would be considered busy.
      If this "rule" is not respected, the previously seen ZFS replay transaction
      error would happen.
      
      NOTE: vflush is currently duplicated in zfs unmount cases to address the problem
      above. This patch fixes this duplication as well.
      
      Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • netinet: fix uninitialized use of 'nims' in inm_merge() · 47234057
      Tomasz Grabiec authored
      
      gcc -O3 complains about uninitialized use of 'nims'. In fact,
      inm_get_source() can return an error without setting nims which will
      be later read from RB_FOREACH_REVERSE_FROM macro. It looks like 'nims'
      is intended to hold the last value for which inm_get_source() returned
      success. The uninitialized access would happen if this function never
      succeeded. I am not sure if this is possible in practice, but let's
      initialize nims to NULL, which will cause no iteration in
      RB_FOREACH_REVERSE_FROM macro.
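      
      A minimal sketch of why a NULL start pointer is safe. It replaces the
      red-black tree with a singly linked "previous" chain, so node,
      count_reverse_from, get_source and merge_sketch are hypothetical
      stand-ins rather than the netinet code.

```cpp
#include <cassert>

// Stand-in for a tree element; prev plays the role of the in-order
// predecessor that a reverse tree walk would follow.
struct node {
    int value;
    node* prev;
};

// Visits `from` and all of its predecessors. A null `from` gives zero
// iterations, which is the property the fix relies on.
int count_reverse_from(node* from)
{
    int visited = 0;
    for (node* x = from; x != nullptr; x = x->prev) {
        ++visited;
    }
    return visited;
}

// Hypothetical lookup that, like the one described above, can fail without
// writing to *out.
int get_source(bool succeed, node* candidate, node** out)
{
    if (!succeed) {
        return -1;
    }
    *out = candidate;
    return 0;
}

int merge_sketch(bool lookups_succeed, node* last)
{
    node* nims = nullptr;   // the fix: never read an indeterminate pointer
    (void)get_source(lookups_succeed, last, &nims);
    return count_reverse_from(nims);
}
```

      With the initialization in place, a run in which every lookup fails
      walks nothing instead of starting from an indeterminate pointer.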
      
      Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • bsd/xdr: Silence uninitialized use warnings with -O3 · 9381a22c
      Tomasz Grabiec authored
      
      GCC complains about an attempt to read 'size' by dereferencing a pointer
      in xdr_u_int() when xdrs->x_op == XDR_ENCODE.  However, in this
      case 'size' will be set by the switch case inside xdr_string()
      before xdr_u_int() is invoked. I think the warning is spurious because the
      code clearly assumes that xdrs->x_op cannot change between these two
      execution points.
      
      Let's initialize 'size' to 0 to make gcc happy.
      
      Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  3. Jan 01, 2014
    • fs: clean up old "fo_*" C functions · a844d248
      Nadav Har'El authored
      
      Instead of the old C-style file-operation function types and fo_*()
      functions, we have recently gained methods of the "file" class. All our
      filesystem code is now C++, and can use these methods directly.
      
      So this patch drops the old types and functions, and uses the class methods
      instead.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • net: call socket constructor and destructor · b42fd865
      Avi Kivity authored
      
      Drop the socket uma zone and replace it with calls to new and delete.  This
      allows the constructor and destructor to be called, so we can add C++
      objects to the socket structure.
      
      Take care to use the default-initializing form of the constructor since
      the socket code requires zero initialization.
      
      We lose the ability to limit the socket count, so we'll have to re-add it in
      the future if we want it.
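      
      A short sketch of the allocation form this describes. In standard C++
      terms, "new T()" with empty parentheses is value-initialization: for a
      type without a user-provided constructor it zeroes the object first and
      then runs the constructors of any C++ members. The struct sock_like and
      its fields are hypothetical, not the real socket layout.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for a C-style structure that has gained a C++ member.
// It deliberately has no user-provided constructor.
struct sock_like {
    int state;
    void* pcb;
    short count;
    std::vector<int> extras;   // a C++ object added to the structure
};

// "new sock_like()" value-initializes: the plain fields are zeroed and the
// vector's constructor runs. "new sock_like" (no parentheses) would leave
// the plain fields indeterminate.
std::unique_ptr<sock_like> alloc_sock_like()
{
    return std::unique_ptr<sock_like>(new sock_like());
}
```

      Releasing the unique_ptr runs delete, which in turn runs the destructors
      of the C++ members; a raw zone allocator would not do either step.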
      
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • file: reduce boiler-plate code in special files · 9478a14d
      Nadav Har'El authored
      
      Each implementation of "struct file" needs to implement 8 different file
      operations. Most special file implementations, such as pipe, socketpair,
      epoll and timerfd, don't support many of these operations. unsupported.h
      provided functions that could be reused for the unsupported operations,
      but this resulted in a lot of ugly boiler-plate code.
      
      Instead, this patch switches to a cleaner, more C++-like, method:
      It defines a new "file" subclass, called "special_file", which implements
      all file operations except close(), with a default implementation identical
      to the old unsupported.h implementations.
      
      The files of pipe(), socketpair(), timerfd() and epoll_create() now inherit
      from special_file, and only override the file operations they really want
      to implement.
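      
      A reduced sketch of the pattern, showing four operations instead of
      eight. The method names, signatures and the error codes returned by the
      defaults are assumptions made for illustration, and eventcount_file is a
      hypothetical subclass, not one of the files named above.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Abstract base: every file type must provide every operation.
class file {
public:
    virtual ~file() = default;
    virtual int read(void* buf, std::size_t len) = 0;
    virtual int write(const void* buf, std::size_t len) = 0;
    virtual int truncate(long len) = 0;
    virtual int close() = 0;
};

// Default "unsupported" implementations for everything except close().
// The specific errno values here are assumed, not taken from unsupported.h.
class special_file : public file {
public:
    int read(void*, std::size_t) override { return EBADF; }
    int write(const void*, std::size_t) override { return EBADF; }
    int truncate(long) override { return EINVAL; }
    // close() stays pure virtual: each special file must define it.
};

// Hypothetical special file that only supports read() and close().
class eventcount_file final : public special_file {
public:
    void signal() { ++_count; }

    // Copies the counter to the caller's buffer and resets it.
    int read(void* buf, std::size_t len) override
    {
        if (len < sizeof(_count)) {
            return EINVAL;
        }
        std::memcpy(buf, &_count, sizeof(_count));
        _count = 0;
        return 0;
    }

    int close() override { return 0; }

private:
    std::uint64_t _count = 0;
};
```

      The subclass states only what it supports; write() and truncate() fall
      through to the inherited defaults without any per-class boilerplate.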
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
  4. Dec 30, 2013
    • bsd: Initialize physmem variable · 9b72ad47
      Tomasz Grabiec authored
      
      This was the cause of poor ZFS performance in misc-fs-stress test.
      
      Before:
      
       Wrote 168.129 MB in 10.12 s = 16.610 MB/s
       Wrote 194.688 MB in 10.00 s = 19.469 MB/s
       Wrote 183.004 MB in 10.06 s = 18.186 MB/s
       Wrote 167.754 MB in 10.28 s = 16.315 MB/s
      
      After:
      
       Wrote 636.227 MB in 10.00 s = 63.623 MB/s
       Wrote 666.979 MB in 10.00 s = 66.696 MB/s
       Wrote 613.512 MB in 10.00 s = 61.350 MB/s
       Wrote 573.502 MB in 10.00 s = 57.346 MB/s
       Wrote 668.607 MB in 10.00 s = 66.857 MB/s
       Wrote 630.920 MB in 10.00 s = 63.087 MB/s
      
      It turned out that the limiting factor was the ARC cache. A check
      inside arc_tempreserve_space() was forcing the txg to be synced too often
      (once every 400ms). The arc_c variable was only 16M (arc_c_min), which
      allowed writing only 8M per transaction. It turns out that arc_c
      depends on kmem_size(), which is based on physmem, which was never
      initialized.
      
      I would hold off on committing this for now, for several reasons
      that I want to put under your consideration.
      
      While this improves write throughput, it makes the boot time after make
      much longer; on my disk the boot time increased from 1.5s to 10s.
      This is because zfs verifies the last 3 txgs upon mount. This patch
      increases txg size, which results in more data to check on the next
      boot. I'm working on solving this right now.
      
      Something worth noting is that while larger transactions sync less
      often, increasing throughput, they also take longer to sync, increasing
      worst-case latency. In my test the pauses get as high as 3 seconds with 1G of
      guest memory.
      
      Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • xen: remove designated initializer in C++ code · 14e6485d
      Avi Kivity authored
      
      While (unfortunately) C++ doesn't support designated initializers, and the
      compiler rejects them, one instance has survived in xenbus.  Strangely,
      gcc 4.8.2 generates correct code, while gcc 4.8.0 fails with an internal
      compiler error, instead of both of them rejecting the code.
      
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  5. Dec 26, 2013
  6. Dec 24, 2013
    • bsd: convert the Xen stuff to C++ · 828ec291
      Avi Kivity authored
      
      Helps with making changes to bsd headers that xen includes.
      
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • sched: Overhaul sched::thread::attr construction · eb48b150
      Nadav Har'El authored
      
      We use sched::thread::attr to pass parameters to sched::thread creation,
      i.e., create a thread with non-default stack parameters, pinned to a
      particular CPU, or a detached thread.
      
      Previously we had constructors taking many combinations of stack size
      (integer), pinned cpu (cpu*) and detached (boolean), and doing "the
      right thing". However, this makes the code hard to read (what does
      attr(4096) specify?) and the constructors hard to expand with new
      parameters.
      
      Replace the attr() constructors with the so-called "named parameter"
      idiom: attr now only has a null constructor attr(), and one modifies
      it with calls to pin(cpu*), detach(), or stack(size).
      
      For example,
          attr()                                  // default attributes
          attr().pin(sched::cpus[0])              // pin to cpu 0
          attr().stack(4096).pin(sched::cpus[0])  // pin and non-default stack
          and so on.
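      
      A minimal sketch of how such a class can be written. The setter names
      pin(), detach() and stack() come from the message above; the cpu
      stand-in, the accessor names and the default values are assumptions for
      illustration only.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the scheduler's CPU type.
struct cpu {
    unsigned id;
};

class attr {
public:
    attr() = default;

    // Each setter returns *this so that calls can be chained.
    attr& pin(cpu* c) { _pinned = c; return *this; }
    attr& detach() { _detached = true; return *this; }
    attr& stack(std::size_t size) { _stack_size = size; return *this; }

    // Hypothetical accessors for the consumer of the attributes.
    cpu* pinned_cpu() const { return _pinned; }
    bool detached() const { return _detached; }
    std::size_t stack_size() const { return _stack_size; }

private:
    cpu* _pinned = nullptr;        // not pinned by default
    bool _detached = false;
    std::size_t _stack_size = 0;   // 0 stands for "use the default" here
};
```

      Adding a new parameter then means adding one setter and one member,
      without touching existing call sites or multiplying constructors.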
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • netinet: Fix broken checksum verification in LRO mechanism · 71086617
      Dmitry Fleytman authored
      This patch applies a bugfix published on the FreeBSD list in Feb 2013:
      http://lists.freebsd.org/pipermail/svn-src-stable-9/2013-February/003928.html
      
      The LRO mechanism is broken on systems without IP checksum verification offload.
      Due to improper checksum verification, RX packets bypass the LRO path and go
      directly to the TCP stack, which is not good for performance.
      
      EC2 Xen is one example of such a system.
      This bug is one of the reasons we see bad performance on Amazon.
      
      Some test results w/ and w/o the fix:
      
      Buffer size    Before         After          Improvement %
      TCP TX
      32             557.52         1386.28        149
      64             552.38         1385.99        151
      128            546.43         1401.46        156
      256            565.25         1382.28        145
      512            557.32         1375.23        147
      1024           549.71         1356.69        147
      2048           551.11         1371.92        149
      4096           556.13         1383.67        149
      8192           559.49         1364.05        144
      16384          567.25         1366.48        141
      32768          546.18         1366.63        150
      65536          553.4          1353.87        145
      
      TCP RX
      32             107.37         105.48         -2
      64             187.56         179.9          -4
      128            297.16         301.71         2
      256            300.47         503.92         68
      512            294.76         826.13         180
      1024           299.95         1916.69        539
      2048           287.04         1924.44        570
      4096           300.78         1929.37        541
      8192           304.52         1934.02        535
      16384          305.04         1957.54        542
      32768          309            1921.84        522
      65536          296.48         1935.41        553
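      
      The "Improvement %" column is consistent with the relative change
      (after - before) / before, expressed in percent and rounded to the
      nearest integer. That reading is inferred from the numbers, not stated
      in the message; improvement_percent is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>

// Relative change from `before` to `after`, in percent, rounded to the
// nearest integer.
long improvement_percent(double before, double after)
{
    assert(before != 0.0);
    return std::lround((after - before) / before * 100.0);
}
```

      For example, the first TCP TX row (557.52 to 1386.28) works out to 149,
      and the first TCP RX row (107.37 to 105.48) to -2.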
      
      We are still pretty far from Linux; there are other problems to be fixed.
      
      Signed-off-by: Dmitry Fleytman <dmitry@daynix.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
  7. Dec 20, 2013
  8. Dec 19, 2013
  9. Dec 16, 2013
  10. Dec 12, 2013
  11. Dec 10, 2013
    • vfs: Fix duplicate in-memory vnodes · 9ecda822
      Raphael S. Carvalho authored
      
      Currently, namei() does vget() unconditionally if no dentry is found.
      This is wrong because the path can be a hard link that points to a vnode
      that's already in memory.
      
      To fix the problem:
      
        - Use inode number as part of the hash in vget()
      
        - Use vn_lookup() in vget() to make sure we have one vnode in memory
          per inode number.
      
        - Push the vget() calls down to individual filesystems and make
          VOP_LOOKUP return a vnode
      
      Changes since v2:
        - v1 dropped lock in vn_lookup, thus assert that vnode_lock is held.
      
      Changes since v3:
        - Fix lock ordering issue in dentry_lookup. The lock of the parent
      node must be acquired before dentry_lookup and released after the process is
      done. Otherwise, a second thread looking up the same dentry may take the
      'NULL' path incorrectly.
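      
      A simplified sketch of the "one vnode per inode number" rule from the
      fix list above. It uses an ordered map keyed by (mount id, inode number)
      in place of the real hash table and omits all locking; vnode_table and
      the struct layouts are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>

struct mount {
    int id;
};

struct vnode {
    mount* mp;
    std::uint64_t ino;
    int refcnt;
};

class vnode_table {
public:
    // Looks up (mount, inode number) first and only allocates on a miss.
    // Returns true when a new vnode was created, false when an existing
    // one was reused, so two hard links resolve to the same vnode.
    bool vget(mount* mp, std::uint64_t ino, vnode** out)
    {
        const auto key = std::make_pair(mp->id, ino);
        auto it = _table.find(key);
        if (it != _table.end()) {
            ++it->second->refcnt;
            *out = it->second.get();
            return false;
        }
        std::unique_ptr<vnode> vp(new vnode{mp, ino, 1});
        *out = vp.get();
        _table.emplace(key, std::move(vp));
        return true;
    }

private:
    std::map<std::pair<int, std::uint64_t>, std::unique_ptr<vnode>> _table;
};
```

      In real code the lookup and insert would have to happen under the lock
      discussed in the v3 note, otherwise two threads could both miss and
      create duplicates.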
      
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
      Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • Fix wrong error codes in unlink(), rmdir() and readdir() · 86b5374f
      Nadav Har'El authored
      
      This patch fixes the error codes in four error cases:
      
      1. unlink() of a directory used to return EPERM (as in POSIX), and now
         returns EISDIR (as in Linux).
      
      2. rmdir() of a non-empty directory used to return EEXIST (as in POSIX)
         and now returns ENOTEMPTY (as in Linux).
      
      3. rmdir() of a regular file (non-directory) used to return EBADF
         and now returns ENOTDIR (as in Linux).
      
      4. readdir() of a regular file (non-directory) used to return EBADF
         and now returns ENOTDIR (as in Linux).
      
      This patch also adds a test, tst-remove.cc, for the various unlink() and
      rmdir() success and failure modes.
      
      Fixes #123.
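      
      A small probe, in the spirit of the test mentioned above but not taken
      from tst-remove.cc, that records which errno the host returns for three
      of these cases. It assumes a POSIX host with a writable /tmp;
      probe_remove_errors and remove_errors are hypothetical names.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

struct remove_errors {
    int unlink_dir;      // errno from unlink() on a directory
    int rmdir_nonempty;  // errno from rmdir() on a non-empty directory
    int rmdir_file;      // errno from rmdir() on a regular file
};

// Turns a 0 / -1 return code into 0 or the errno that was just set.
static int errno_or_zero(int rc)
{
    return rc == 0 ? 0 : errno;
}

// Creates a scratch directory containing one file, runs the three failing
// calls, then cleans up. Returns false if the scratch setup itself failed.
bool probe_remove_errors(remove_errors& out)
{
    char tmpl[] = "/tmp/remove-probe-XXXXXX";
    if (::mkdtemp(tmpl) == nullptr) {
        return false;
    }
    const std::string dir = tmpl;
    const std::string file = dir + "/f";
    const int fd = ::open(file.c_str(), O_CREAT | O_WRONLY, 0600);
    if (fd < 0) {
        ::rmdir(dir.c_str());
        return false;
    }
    ::close(fd);

    out.unlink_dir = errno_or_zero(::unlink(dir.c_str()));
    out.rmdir_nonempty = errno_or_zero(::rmdir(dir.c_str()));
    out.rmdir_file = errno_or_zero(::rmdir(file.c_str()));

    // Remove the scratch file and directory.
    ::unlink(file.c_str());
    ::rmdir(dir.c_str());
    return true;
}
```

      On Linux the three fields are expected to be EISDIR, ENOTEMPTY and
      ENOTDIR; other POSIX systems may report EPERM and EEXIST for the first
      two, which is the difference this commit is about.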
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  12. Dec 09, 2013
  13. Dec 08, 2013
  14. Dec 05, 2013