Commits · 2d93af3be98513b249fb037094fdaee0a5dfbcf8 · Verlässliche Systemsoftware / projects / osv

Jan 09, 2014

zfs: Fix on-disk data inconsistency on shutdown · 2d93af3b

Raphael S. Carvalho authored 11 years ago

This problem was found when running 'tests/tst-zfs-mount.so' multiple times.
At the first time, all tests succeed, however, a subsequent run would
fail at the test: 'mkdir /foo/bar', the error message reported
that the target file already exists.

The test basically creates a directory /foo/bar, rename it to /foo/bar2,
then remove /foo/bar2. How could /foo/bar still be there?

Quite simple. Our shutdown function calls unmount_rootfs() which will
attempt to unmount zfs with the flag MNT_FOURCE, however, it's not being
passed to zfs_unmount(), neither unmount_rootfs() tests itself the
return status (which was always getting failures previously).
So OSv is really being shutdown while there is remaining data waiting to
be synced with the backing store. As a result, inconsitency.

This problem was fixed by passing the flag to VFS_UNMOUNT which will now
unmount the fs properly on sudden shutdowns.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

2d93af3b

Jan 03, 2014

vfs: change the approach of releasing dentries during unmount · af466dbc

Raphael S. Carvalho authored 11 years ago


Currently, vflush is used in the unmount process to release remaining
dentries. vflush in turn calls vevict that is releasing dentries that
it doesn't own.
This behavior is not correct neither good to the future of VFS.

So Avi suggested switching to a different approach. We could only
release those dentries owned by the mountpoint when unmounting it as
there wouldn't be anything else in the dcache (given its functionality).

The problem was fixed by doing the following steps:
 - Drop vflush calls in sys_umount2, make vevict an empty function,
and remove vevict.

 - Created the function release_mp_dentries to release dentries of a mount
point which will be called by VFS_UNMOUNT. It cannot be called before
VFS_UNMOUNT as failures must be considered, neither after as the mount point
would be considered busy.
Don't respect this "rule", and that previously seen ZFS replay transaction
error would happen.

NOTE: vflush is currently duplicated in zfs unmount cases to address the problem
above. This patch fixes this duplication as well.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

af466dbc

netinet: fix uninitialized use of 'nims' in inm_merge() · 47234057

Tomasz Grabiec authored 11 years ago

gcc -O3 complains about uninitialized use of 'nims'. In fact,
inm_get_source() can return an error without setting nims which will
be later read from RB_FOREACH_REVERSE_FROM macro. It looks like 'nims'
is intended to hold the last value for which inm_get_source() returned
success. The uninitialized access would happen if this function never
succeded. I am not sure if this is possible in practice, but let's
initialize nims to NULL, which will cause no iteration in
RB_FOREACH_REVERSE_FROM macro.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

47234057

bsd/xdr: Silence uninitialized use warnings with -O3 · 9381a22c

Tomasz Grabiec authored 11 years ago


Gcc comaplins about attempt to read 'size' via dereferencing a pointer
in xdr_u_int() in case xdrs->x_op == XDR_ENCODE.  However, in this
case the size will be set from the switch case inside xdr_string()
before xdr_u_int() is invoked. I think it's spurious because the
code clearly assumes that xdrs->x_op cannot change between these two
execution points.

Let's initialize 'size' to 0 to make gcc happy.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

9381a22c

Jan 01, 2014

fs: clean up old "fo_*" C functions · a844d248

Nadav Har'El authored 11 years ago


Instead of the old C-style file-operation function types and fo_*()
functions, since recently we have methods of the "file" class. All our
filesystem code is now C++, and can use these methods directly.

So this patch drops the old types and functions, and uses the class methods
instead.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

a844d248

net: call socket constructor and destructor · b42fd865

Avi Kivity authored 11 years ago


Drop the socket uma zone and replace it with calls to new and delete.  This
allows the constructor and destructor to be called, so we can add C++
objects to the socket structure.

Take care to use the default-initializing form of the constructor since
the socket code requires zero initialization.

We lose the ability to limit the socket count, so we'll have to re-add it in
the future if we want it.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

b42fd865

file: reduce boiler-plate code in special files · 9478a14d

Nadav Har'El authored 11 years ago

Each implementation of "struct file" needs to implement 8 different file
operations. Most special file implementations, such as pipe, socketpair,
epoll and timerfd, don't support many of these operations. We had in
unsupported.h functions that can be reused for the unsupported operation,
but this resulted in a lot of ugly boiler-plate code.

Instead, this patch switches to a cleaner, more C++-like, method:
It defines a new "file" subclass, called "special_file", which implements
all file operations except close(), with a default implementation identical
to the old unsupported.h implementations.

The files of pipe(), socketpair(), timerfd() and epoll_create() now inherit
from special_file, and only override the file operations they really want
to implement.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

9478a14d

Dec 30, 2013

bsd: Initialize physmem variable · 9b72ad47

Tomasz Grabiec authored 11 years ago


This was the cause of poor ZFS performance in misc-fs-stress test.

Before:

 Wrote 168.129 MB in 10.12 s = 16.610 Mb/s
 Wrote 194.688 MB in 10.00 s = 19.469 Mb/s
 Wrote 183.004 MB in 10.06 s = 18.186 Mb/s
 Wrote 167.754 MB in 10.28 s = 16.315 Mb/s

After:

 Wrote 636.227 MB in 10.00 s = 63.623 Mb/s
 Wrote 666.979 MB in 10.00 s = 66.696 Mb/s
 Wrote 613.512 MB in 10.00 s = 61.350 Mb/s
 Wrote 573.502 MB in 10.00 s = 57.346 Mb/s
 Wrote 668.607 MB in 10.00 s = 66.857 Mb/s
 Wrote 630.920 MB in 10.00 s = 63.087 Mb/s

It turned out that the limiting factor was the ARC cache. A check
inside arc_tempreserve_space() was forcing txg to be synced too often
(once every 400ms). The arc_c variable was only 16M (arc_c_min) which
allowed to write only 8M per transaction. It turns out that arc_c
depends on kmem_size() which is based on physmem which was never
initialized.

I would hold with commiting this yet because of several reasons,
which I want to put under your consideration.

While this improves write throughput it makes the boot time after make
much longer, on my disk the boot time is increased from 1.5s to 10s.
This is because zfs verifies the last 3 txgs upon mount. This patch
increases txg size, which results in more data to check in the next
boot. I'm working on solving this right now.

Something worth noting is that while larger transactions sync less
often incresing throughput they also sync longer increasing worst case
latency. In my test the pauses get as high as 3 seconds with 1G of
guest memory.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

9b72ad47

xen: remove designated initializer in C++ code · 14e6485d

Avi Kivity authored 11 years ago

While (unfortunately) C++ doesn't support designated initializers, and the
compiler rejects them, one instance has survived in xenbus. Strangely,
gcc 4.8.2 generates correct code, while gcc 4.8.0 fails with an internal
compiler error, instead of both of them rejecting the code.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

14e6485d

Dec 26, 2013

net: fix excessive window updates · e072be26

Avi Kivity authored 11 years ago


Backported from FreeBSD r242252.

Improves netperf by about 10%.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

e072be26

bsd: change operator-(tcp_seq, tcp_seq) to be signed · f015cd5b

Avi Kivity authored 11 years ago


This is more useful if there is no ordering between the two numbers (either
one can be ahead).

Change BYTES_THIS_ACK to return unsigned, to prevent an unsigned division from
turning into a signed division.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

f015cd5b

bsd: add comparison operators to tcp_seq · 15d9e5e4

Avi Kivity authored 11 years ago


Add comparison operators that use modulo arithmetic to order sequence
numbers, and use them to replace SEQ_LT() and friends, increasing code
readability.

As a consequence std::min() and std::max() can be used instead of SEQ_MIN()
and SEQ_MAX().

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

15d9e5e4

bsd: convert tcp_seq to a class · 07cad895

Avi Kivity authored 11 years ago

tcp sequence numbers are similar to integers, but have different comparison
operations. Separate them into a class so we don't mix the two.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

07cad895

bsd: refine __ntohl() and friends as inline functions · a89ebcb7

Avi Kivity authored 11 years ago


inline functions can be overloaded and are less nasty than macros in other
ways (like evaluating their arguments only once).

Note we can't touch ntohl() itself, since it is defined to be an out-of-line
function by libc.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

a89ebcb7

Dec 24, 2013

bsd: convert the Xen stuff to C++ · 828ec291

Avi Kivity authored 11 years ago


Helps making bsd header changes that xen includes.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

828ec291

sched: Overhaul sched::thread::attr construction · eb48b150

Nadav Har'El authored 11 years ago


We use sched::thread::attr to pass parameters to sched::thread creation,
i.e., create a thread with non-default stack parameters, pinned to a
particular CPU, or a detached thread.

Previously we had constructors taking many combinations of stack size
(integer), pinned cpu (cpu*) and detached (boolean), and doing "the
right thing". However, this makes the code hard to read (what does
attr(4096) specify?) and the constructors hard to expand with new
parameters.

Replace the attr() constructors with the so-called "named parameter"
idiom: attr now only has a null constructor attr(), and one modifies
it with calls to pin(cpu*), detach(), or stack(size).

For example,
    attr()                                  // default attributes
    attr().pin(sched::cpus[0])              // pin to cpu 0
    attr().stack(4096).pin(sched::cpus[0])  // pin and non-default stack
    and so on.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

eb48b150

netinet: Fix broken checksum verification in LRO mechanism · 71086617

Dmitry Fleytman authored 11 years ago

This patch applies bugfix published on FreeBSD list at Feb 2013:
http://lists.freebsd.org/pipermail/svn-src-stable-9/2013-February/003928.html



LRO mechanism is broken on systems without IP checksum verification offload.
Due to improper checksum verification RX packets omit LRO path and go
directly to TCP stack which is not good for performance.

EC2 Xen is one example of such a system.
This bug is one of the reasons we see bad performance on Amazon.

Some test results w/ and w/o the fix:

Buffer size    Before         After          Improvement %
TCP TX
32             557.52         1386.28        149
64             552.38         1385.99        151
128            546.43         1401.46        156
256            565.25         1382.28        145
512            557.32         1375.23        147
1024           549.71         1356.69        147
2048           551.11         1371.92        149
4096           556.13         1383.67        149
8192           559.49         1364.05        144
16384          567.25         1366.48        141
32768          546.18         1366.63        150
65536          553.4          1353.87        145

TCP RX
32             107.37         105.48         -2
64             187.56         179.9          -4
128            297.16         301.71         2
256            300.47         503.92         68
512            294.76         826.13         180
1024           299.95         1916.69        539
2048           287.04         1924.44        570
4096           300.78         1929.37        541
8192           304.52         1934.02        535
16384          305.04         1957.54        542
32768          309            1921.84        522
65536          296.48         1935.41        553

Still we are pretty far from Linux, there are other problems to be fixed.

Signed-off-by: Dmitry Fleytman <dmitry@daynix.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

71086617

Dec 20, 2013

net: Rename if_get_if_info to if_getinfo · 70533618

Asias He authored 11 years ago


if_getinfo is shorter and more consistent all the other if_ functions
name, e.g., if_init, if_ioctl.

Signed-off-by: Asias He <asias@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

70533618

Fix compilation error in uips_socket.cc · 7a7bfb82

Tomasz Grabiec authored 11 years ago


Error message:

bsd/sys/kern/uipc_socket.cc: In function 'socket* soalloc(vnet*)':
bsd/sys/kern/uipc_socket.cc:245:58: error: expected ')' before 'PRIx64'
  uipc_d("soalloc() so=%" PRIx64, (uint64_t)so);
                                                          ^

This was introduced in one of these commits:
c506516b14d32e416f4a9247869a1a89f7e9c413
406a93ed

As suggested by Pekka including <cinttypes> is the proper fix for C++11.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

7a7bfb82

Dec 19, 2013

net: convert most of the networking stack to C++ · 67736859

Avi Kivity authored 11 years ago


It's just to painful to do this piecemeal.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

67736859

netport: make CTASSERT work in both C and C++ · 406a93ed

Avi Kivity authored 11 years ago


Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

406a93ed

net: Added ifnet->if_get_if_info() · 3057a18c

Vlad Zolotarov authored 11 years ago


Added ifnet->if_get_if_info() method to collect the
internally gathered statistics into the standard if_data struct.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Reviewed-by: Asias He <asias@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

3057a18c

net: Added IFCAP_HWSTATS · 095d5b77

Vlad Zolotarov authored 11 years ago


Added IFCAP_HWSTATS iface capability indicating that
device driver and HW are capable to handle the statistics internally.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Reviewed-by: Asias He <asias@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

095d5b77

bsd: Fix unimplemented copystr() and copyinstr() · cc8bf695

Raphael S. Carvalho authored 11 years ago

Over the process of creating a zpool, history_str_get() will be eventually
called by zfs_ioc_pool_create(). history_str_get() will not work as expected as
it fully relies on the function copyinstr() which is currently unimplemented.

After that, spa_create() will be called with history_str storing trash, thus
making the spa_history_log filled with completely wrong entries.

Let's fix this problem simply by implementing copyinstr. By the way, copystr
has basically the same semantics as copyinstr, so let's implement it as well.
It's not being used anywhere, but who knows? It might be needed in the future
by some bsd compliant application.

As far as I know, this problem didn't manifest *visibly* anywhere as the spa
history log wasn't intentionally used by us yet. I found the problem while
auditing code.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

cc8bf695

Dec 16, 2013

bsd: fix horrible m_* macro clash · f81d5121

Avi Kivity authored 11 years ago


bsd defines some m_ macros, for example m_flags, to save some typing.  However
if you have a variable of the same name in another header, for example
m_flags, have fun trying to compile your code.

Expand the code in place and eliminate the macros.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

f81d5121

net: reconcile bsd and api IP_MAX_MEMBERSHIPS · 0a02e276

Avi Kivity authored 11 years ago


Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

0a02e276

net: reconcile api and bsd INADDR_* macros · 6dcaca1d

Avi Kivity authored 11 years ago


Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

6dcaca1d

net: reconcile bsd and linux IP_ socket options · 84b96e0f

Avi Kivity authored 11 years ago


The bsd specific ones are nenumbered to avoid clashes.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

84b96e0f

net: reconcile in.h IN_CLASS* macros and friends · 8fa97c1b

Avi Kivity authored 11 years ago


Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

8fa97c1b

net: reconcile bsd and api SOL_* macros · e9c888e1

Avi Kivity authored 11 years ago


Since most SOL_* macros are equivalent to the IPPROTO_* defines, only one
define needs to be changed.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

e9c888e1

net: reconcile bsd and api IPPROTO_* macros · 4c5a10d5

Avi Kivity authored 11 years ago


Move them to a common include file.

Since they're defined externally, there is no real conflict.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

4c5a10d5

net: remove unneeded linux compatibility macros · 55b05046

Avi Kivity authored 11 years ago


Unused.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

55b05046

Dec 12, 2013

linux_ioctl.cc: fix unicode in comment · 1eba19df

Avi Kivity authored 11 years ago


Fix a unicode character in the author's name.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

1eba19df

compilation: Fix compilation errors · 6f3a966a

Vlad authored 11 years ago


Fix some compilation errors that would arrise when NDEBUG is defined:
 - tests/misc-tcp.cc: missing #include<iostream> that would come from
   include/boost/assert.hpp: line 81 when NDEBUG is not defined.
 - bsd/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_dataset.c:
   when NDEBUG is not defined assert() is defined to __assert_fail,
   which has "noreturn" attribute, which satisfies the compiler.
   When NDEBUG is defined and assert() completly compiles out, the
   compiler start complaining about the missing "return" statment.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

6f3a966a

Dec 10, 2013

vfs: Fix duplicate in-memory vnodes · 9ecda822

Raphael S. Carvalho authored 11 years ago


Currently, namei() does vget() unconditionally if no dentry is found.
This is wrong because the path can be a hard link that points to a vnode
that's already in memory.

To fix the problem:

  - Use inode number as part of the hash in vget()

  - Use vn_lookup() in vget() to make sure we have one vnode in memory
    per inode number.

  - Push the vget() calls down to individual filesystems and make
    VOP_LOOKUP return an vnode

Changes since v2:
  - v1 dropped lock in vn_lookup, thus assert that vnode_lock is held.

Changes since v3:
  - Fix lock ordering issue in dentry_lookup. The lock respective to the parent
node must be acquired before dentry_lookup and released after the process is
done. Otherwise, a second thread looking up for the same dentry may take the
'NULL' path incorrectly.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

9ecda822

Fix wrong error codes in unlink(), rmdir() and readdir() · 86b5374f

Nadav Har'El authored 11 years ago


This patch fixes the error codes in four error cases:

1. unlink() of a directory used to return EPERM (as in Posix), and now
   returns EISDIR (as in Linux).

2. rmdir() of a non-empty directory used to return EEXIST (as in Posix)
   and now returns ENOTEMPTY (as in Linux).

3. rmdir() of a regular file (non-directory) used to return EBADF
   and now returns ENOTDIR (as in Linux).

4. readdir() of a regular file (non-directory) used to return EBADF
   and now returns ENOTDIR (as in Linux).

This patch also adds a test, tst-remove.cc, for the various unlink() and
rmdir() success and failure modes.

Fixes #123.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

86b5374f

Dec 09, 2013

Revert "vfs: Fix duplicate in-memory vnodes" · 0984e12e

Pekka Enberg authored 11 years ago


This reverts commit e4aad1ba.

It causes tst-vfs.so to just hang.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

0984e12e

Dec 08, 2013

vfs: Fix duplicate in-memory vnodes · e4aad1ba

Raphael S. Carvalho authored 11 years ago


Currently, namei() does vget() unconditionally if no dentry is found.
This is wrong because the path can be a hard link that points to a vnode
that's already in memory.

To fix the problem:

  - Use inode number as part of the hash in vget()

  - Use vn_lookup() in vget() to make sure we have one vnode in memory
    per inode number.

  - Push the vget() calls down to individual filesystems and make
    VOP_LOOKUP return an vnode

  - Drop lock in vn_lookup() and assert that vnode_lock is held.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

e4aad1ba

Dec 05, 2013

zfs: Fix build breakage in zfs_setattr() · 8ab1a76e

Pekka Enberg authored 11 years ago


OSv does not support xattrs in ZFS. Fix up a build breakage reported by Glauber:

  /home/ubuntu/osv/bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c: In function ‘zfs_setattr’:
  /home/ubuntu/osv/bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:2152:12: error: ‘xoap’ may be used uninitialized in this function [-Werror=uninitialized]

Reported-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

8ab1a76e

linux_ioctl.cc: fix indentation · 3b21d389

Avi Kivity authored 11 years ago


Probably the worst indented file in the whole tree.  Set Eclipse Ctrl-I
on it.

Also fix a unicode character in the author's name.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

3b21d389