Commits · ff3534e27e59bd06cc0c1c434b609cc923a309d0 · Verlässliche Systemsoftware / projects / osv

Mar 27, 2014

drivers: Add zfs device to allow use of zfs commands · ff3534e2

Raphael S. Carvalho authored 10 years ago


Previously, the zfs device was provided only to allow the use of the
commands needed to create the zpool and, with it, the file system.
At the time that was enough. However, making the zfs device, i.e.
/dev/zfs, part of every OSv instance allows us to use commands that
help with analysing, debugging and tuning the zpool and the file
systems it contains.

The basic explanation is that those commands use libzfs which in
turn relies on /dev/zfs to communicate with the zfs code.

Example commands:
zpool, zfs, zdb. The last one has not been ported to OSv yet.
This patch will also be helpful for the ongoing ztest porting.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

ff3534e2

manifest: Add /etc/mnttab from the upload manifest process · 8078bc20

Raphael S. Carvalho authored 10 years ago


/etc/mnttab is required by libzfs to run properly, so let's
create it as an empty file.

ryao from zfsonlinux and openzfs told me that an empty /etc/mnttab is used
on Linux. Also, reading the libzfs code shows that /etc/mnttab is mostly used
for management of the file itself, nothing that would prevent some libzfs
functionality from working.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

8078bc20

scripts/test.py: Blacklist tst-dns-resolver.so · 7f6529e7

Pekka Enberg authored 10 years ago


The tst-dns-resolver.so test fails spuriously. Blacklist it until the
problem is fixed to keep Jenkins builds running.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

7f6529e7

Merge branch 'glommer/zfsbuffers' of github.com:glommer/osv · 12d1df73

Pekka Enberg authored 10 years ago

12d1df73

cpiod.so: Unmount file systems mounted over the mkfs phase · 8e31549c

Raphael S. Carvalho authored 10 years ago


The root dataset and ZFS are mounted at the mkfs phase, but they aren't
unmounted afterwards.

Running mkfs with VERBOSE flag enabled shows the following:
Running mkfs...
VFS: mounting zfs at /zfs
zfs: mounting osv from device osv
VFS: mounting zfs at /zfs/zfs
zfs: mounting osv/zfs from device osv/zfs

The first mount happens when issuing:
{"zpool", "create", "-f", "-R", "/zfs", "osv", "/dev/vblk0.1"}
It creates a pool called osv and mounts the root dataset at /zfs

The latter mount happens when issuing:
{"zfs", "create", "osv/zfs"}
It creates a file system called zfs in the pool osv and automatically
mounts it under the root dataset's mountpoint.

No data inconsistency problem was seen up to now because both mkfs.so and
cpiod.so do an explicit sync() at the end, thus ensuring everything was
correctly flushed out to the stable storage.
There is an expression in Dutch that says: prevention is better than cure.
Thus, this patch changes cpiod.so to unmount both mount points when the
/zfs/zfs prefix is passed. It cannot be done in mkfs.so itself because
cpiod.so is called afterwards in the same OSv instance.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

8e31549c

scripts: make 'scripts/test.py' support Python3 · 1563c916

Zifei Tong authored 10 years ago


Python3 no longer allows implicit conversion from bytes to string, so
add an explicit decode() to convert the input bytes.

Tested with both Python2 and Python3.
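
A minimal sketch of the kind of change described above, not the actual
test.py code. The helper name read_lines and the choice of UTF-8 are
assumptions made for illustration:

```python
import subprocess
import sys


def read_lines(cmd):
    """Run cmd and return its stdout as a list of text lines."""
    raw = subprocess.check_output(cmd)   # bytes on Python 3
    text = raw.decode("utf-8")           # explicit bytes -> str conversion
    return text.splitlines()


if __name__ == "__main__":
    # Use the current interpreter as a portable child process.
    for line in read_lines([sys.executable, "-c", "print('OK')"]):
        print(line)
```

The same decode() call also works on Python 2, where it turns a byte
string into a unicode string.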

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Zifei Tong <zifeitong@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

1563c916

zfsbuffers: reference count the arc buffer · 063a8a56

Glauber Costa authored 10 years ago


Gleb has noticed that the ARC buffers can go unshared too early. This will
happen because we call the UNMAP operation on every put(). That is certainly
not what we want, since the buffer only has to be unshared when the last
reference is gone.

Design decisions:
1) We obviously can't use the arc natural reference count for that, since
bumping it would make the buffer unevictable.
2) We could modify the arc_buf structure itself to add another refcnt (minimum
4 bytes).  However, I am trying to keep core-ZFS modifications to a minimum,
and only to places where it is totally unavoidable.

Therefore, the solution is to add another hash, which will hash the whole
buffer instead of the physaddr like the one we have currently. In terms of
memory usage, it will add only 8 bytes per buffer (each buffer being roughly
128k), which makes for a memory usage of 64k per mapped GB compared to the arc
refcount solution. This is a good trade-off.
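
The 64k figure follows from the numbers quoted above. A small worked
check, assuming every buffer is exactly 128k and using binary units (the
constant and function names are made up for this example):

```python
KIB = 1024
GIB = 1024 ** 3

BUFFER_SIZE = 128 * KIB      # buffer size quoted in the commit message
BYTES_PER_ENTRY = 8          # per-buffer cost of the extra hash entry


def hash_overhead(mapped_bytes):
    """Extra bytes needed to track every buffer in mapped_bytes of data."""
    buffers = mapped_bytes // BUFFER_SIZE
    return buffers * BYTES_PER_ENTRY


print(hash_overhead(GIB) // KIB, "KiB per mapped GiB")
```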

I am also avoiding adding a new vop_map/unmap style operation just to query the
buffer address from its file attributes (needed for the put side). Instead, I
am adopting the convention that an empty iovec means query, and a filled iovec
means unshare.
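
The core idea, unsharing only when the last reference is dropped, can be
sketched as follows. This is an illustration under assumed names
(SharedBufferTable, get, put), not the data structure used in the patch:

```python
class SharedBufferTable:
    """Hypothetical table of share counts, keyed by buffer identity."""

    def __init__(self):
        self._refs = {}

    def get(self, buf):
        """Take a reference on buf and return the new count."""
        self._refs[buf] = self._refs.get(buf, 0) + 1
        return self._refs[buf]

    def put(self, buf):
        """Drop a reference; return True only when buf must be unshared."""
        count = self._refs[buf] - 1
        if count == 0:
            del self._refs[buf]
            return True
        self._refs[buf] = count
        return False
```

With two get() calls on the same buffer, the first put() returns False and
only the second one returns True, which is when the unmap would happen.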

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>

063a8a56

core: fix misbehaving debugf() · 7c8d7415

Zifei Tong authored 10 years ago


debugf() used to write the log message using the length of the format
string rather than that of the formatted message. This caused messages
to be wrongly truncated.
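
The class of bug can be illustrated with a short sketch. It only mimics
the behaviour described above and is not the debugf() code; the function
names are invented for the example:

```python
def buggy_log(fmt, *args):
    """Mimic the bug: the output length is taken from the format string."""
    msg = fmt % args
    return msg[:len(fmt)]


def fixed_log(fmt, *args):
    """Use the length of the formatted message instead."""
    msg = fmt % args
    return msg[:len(msg)]


if __name__ == "__main__":
    print(buggy_log("pid=%d name=%s", 12345, "zpool"))
    print(fixed_log("pid=%d name=%s", 12345, "zpool"))
```

Whenever the expanded arguments are longer than their format specifiers,
the buggy variant cuts off the tail of the message.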

Also change confusing variable names: exchange 'fmt' and 'msg'.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Zifei Tong <zifeitong@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

7c8d7415

zfsbuffers: kill bogus variable · fd233dd7

Glauber Costa authored 10 years ago


Spotted by code review. Gleb had spotted one improper use of "i", but
there was another. In this case we iterate over nothing, and i is always 0.
It is uninitialized to begin with, and the code works just because it
happens to be set to 0 by luck.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>

fd233dd7

tests: improve zfs shared buffers tests · 28186c29

Glauber Costa authored 10 years ago

We have seen bugs with mmap shared/file handling for small files. This patch tests
some of the corner scenarios to find those problems.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>

28186c29

zfsbuffers: Do not truncate files · d3ed5bef

Glauber Costa authored 10 years ago

There is a problem with the way ZFS currently handles its buffers, which is
actually a limitation of our allocator: buffers smaller than a page won't be
page aligned even if we ask for it. Therefore, if the buffer we are mapping
falls into this category, we will map the wrong location.

The way I solved this problem was so stupid, that in retrospect I can't even
believe I did it: when the file would run out of size, we would truncate the
file. This is obviously wrong because reading a file is not expected to change
its size under any circumstance, and if anybody relied on the actual size, we
would be crashing something. This is the bug that plagued Cassandra.

Not truncating, however, brings back the original problem. One solution I have
considered is to always allocate at least a page for data allocations (leaving
metadata alone), but that would deviate from ZFS and harm many-small-files
workloads.

During testing, however, I noticed that ZFS will allocate small buffers
only when the file itself is small. This means that we can just avoid
using the special shared mapping for small files - which makes sense anyway.

For instance, if we have a file that is 128k + 1 byte (remember 128k is ZFS's
maximum buffer size), both buffers will be large enough to be aligned. And if
that ever fails to hold, we will now see an assertion hit instead of a random
bug. In time, we should fix our allocator to provide alignment guarantees.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>

d3ed5bef

mmu: fix how mappings into ARC buffer are tracked · 054997f5

Gleb Natapov authored 10 years ago


Currently all mappings are keyed on the ARC buffer start when a mapping is
added, but on removal a pointer into the ARC buffer is used, so removal may
leave no-longer-valid mappings in the database. This patch fixes it
by using a pointer into the ARC buffer as the key, the same pointer that is
used during removal.
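
A toy model of the mismatch, assuming a simple key-to-mappings table. The
class MappingDb and the addresses are made up for illustration and do not
reflect the mmu code:

```python
from collections import defaultdict


class MappingDb:
    """Hypothetical stand-in for the database of mappings into ARC buffers."""

    def __init__(self):
        self._mappings = defaultdict(set)

    def add(self, key, vaddr):
        self._mappings[key].add(vaddr)

    def remove(self, key, vaddr):
        self._mappings[key].discard(vaddr)
        if not self._mappings[key]:
            del self._mappings[key]

    def live(self):
        """Total number of mappings still recorded."""
        return sum(len(v) for v in self._mappings.values())


buf_start = 0x10000          # start of an ARC buffer (made-up address)
ptr = buf_start + 0x2000     # pointer into that buffer that is being mapped

stale = MappingDb()
stale.add(buf_start, 0x7000)     # old behaviour: keyed on the buffer start
stale.remove(ptr, 0x7000)        # removal looks under a different key

fixed = MappingDb()
fixed.add(ptr, 0x7000)           # new behaviour: same key on both paths
fixed.remove(ptr, 0x7000)

print(stale.live(), fixed.live())
```

The first table still holds an entry after removal, while the second one
is empty.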

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>

054997f5

Mar 26, 2014

trace: pass -X to less · 2d8eefd7

Tomasz Grabiec authored 10 years ago


Option -F should always be used with -X. Without -X, if the output
is smaller than the screen, no output will be shown.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

2d8eefd7

mgmt: update to latest · a5129a99

Pekka Enberg authored 10 years ago

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

a5129a99

build: be forgiving to missing newlines in manifest · e49ef897

Nadav Har'El authored 10 years ago


Running "make image=mgmt,iperf" failed because the last line in
apps/iperf/usr.manifest did not have a newline, and was copied
without a newline, which caused it to be stuck together to the first
line of mgmt's manifest.

Fix this by explicitly adding a newline character to each line we add to
the generated manifest file - whether or not the original manifest had one.
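
A minimal sketch of the approach, assuming manifests are handled as plain
text. The function names are hypothetical and this is not the build code
itself:

```python
def append_manifest(lines, manifest_text):
    """Append each line of manifest_text to lines, always newline-terminated."""
    for line in manifest_text.splitlines():
        lines.append(line + "\n")


def merge_manifests(manifests):
    """Concatenate several manifests into one generated manifest."""
    merged = []
    for text in manifests:
        append_manifest(merged, text)
    return "".join(merged)


if __name__ == "__main__":
    first = "/a: x"        # last line lacks a trailing newline
    second = "/b: y\n"
    print(merge_manifests([first, second]), end="")
```

Because splitlines() drops any existing line ending, adding "\n" back
gives exactly one newline per entry whether or not the input had one.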

Reviewed-by: Tomasz Grabiec <tgrabiec@gmail.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

e49ef897

scripts/gen-vmx.sh: Reduce virtualHW version · b5eb9f12

Asias He authored 10 years ago


ESXi 5.5 cannot edit a vmx file with virtualHW.version = 10. Reducing
it to 8 works in ESXi.

Signed-off-by: Asias He <asias@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

b5eb9f12

Mar 25, 2014

sched: change timer state inside atomic context · 33e518ee

Tomasz Grabiec authored 11 years ago


timer_base is a thread-agnostic interface for per-CPU timers.  The
current code is prone to a race condition involving set() and
cancel(). The latter may attempt to remove the timer before it has been
inserted into the timer tree.

Found during code inspection.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

33e518ee

__sysv_signal() · 901e67bc

Nadav Har'El authored 11 years ago


Implement __sysv_signal(), which is used by code using signal() when compiled
with _XOPEN_SOURCE, -std=..., or something similar (see signal(2) manual page
for a full discussion of the two variants of signal()).

Fixes #238.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

901e67bc

signal() using sigaction() · 88dfa52e

Nadav Har'El authored 11 years ago


Instead of duplicating sigaction()'s code, let's just use it.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

88dfa52e

sigaction() error checking · 86927e2d

Nadav Har'El authored 11 years ago


Add signal number verification to sigaction().
Also add a FIXME comment that we don't support most sa_flags.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

86927e2d

scripts: Print number of threads in 'osv info threads' · e81acc44

Raphael S. Carvalho authored 10 years ago


The id of the last thread might mislead people about the number of
threads in OSv, so let's simply print the thread count after all
threads have been printed.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

e81acc44

tst-kill: fix crash · 229020d2

Nadav Har'El authored 11 years ago


tst-kill runs various signal handlers, which we run in separate threads.
When the test completes, we may be unlucky enough for the last signal
handler to still be running, at which point when the module's memory
is unmapped (e.g., in test.py -s each test is unmapped when it ends)
we can get a page fault and a crash.

This patch sleeps for a second at the end of tst-kill, to make sure that
the signal handler has completed. This sleep is a bit ugly, but I can't
think of a cleaner way - Posix provides no way to check if there's a
running handler, and I wouldn't like to add a new API just for this test.

Fixes #249.

Reviewed-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

229020d2

bsd: set a smaller default stack for BSD threads. · 10455c5b

Glauber Costa authored 11 years ago

All threads created through the bsd/porting/kthread interface are threads that
used to be kernel threads in BSD, which means they are expected to use less
stack. Although I have no idea what the default stack size for BSD is, in Linux
things need as little as 4k. More importantly, they are threads whose memory
usage is under our control, and we could fix heavy offenders without a
problem.

If we don't say anything, they will start with 64k which is way, way, too much.
I am proposing here we go lower and get to 16k - which is still quite
conservative, but so am I.

Measuring memory before and after the mount - because ZFS is currently our
heaviest user - I can save around 7 MB with this patch.

Passes make check (except for tst-kill, which is broken AFAICT) and misc-fs-stress.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

10455c5b

Support 'make osv.vmdk' without 'make all' · a69b2d2d

Takuya ASADA authored 10 years ago

Add 'all' as dependent target of osv.vmdk/.vdi, to prevent build error.

Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

a69b2d2d

loader: Print OSv version info correctly · c9c94f51

Asias He authored 11 years ago


On VBOX and VMW, the version info is not printed correctly.
Fix it by printing only after our console is initialized.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Asias He <asias@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

c9c94f51

mgmt: update to latest · 97414db0

Pekka Enberg authored 10 years ago

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

97414db0

Merge github.com:amnonh/osv · 07bb6565

Pekka Enberg authored 10 years ago

07bb6565

Mar 24, 2014

balloon: Fix JNI detach in request_memory() · 2e74fbbc

wangbicheng authored 11 years ago


The request_memory() function has an early exit condition where we
forgot to call _detach().  Fix that up by attaching later in the
function which ensures we eventually will detach.

Signed-off-by: wangbicheng <wangbicheng@huawei.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

2e74fbbc

networking: remove double socket unlock · 678e297f

Nadav Har'El authored 11 years ago


sogeneric_send(), in one error case, unlocks the socket before doing
'goto release' which unlocks the socket again, resulting in an assertion
failure (unlocking an unlocked mutex) and crash.

This patch removes the extra unlock.

Fixes #248.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

678e297f

scripts/gen-vbox-ova.sh: Add --vga option to vdi image · 0ec68543

Asias He authored 11 years ago


Generate the osv.vdi image from the 'make osv.vdi' Makefile target, which
will convert the image to vdi as well as add the --vga option.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

0ec68543

Support VMDK and VDI format · 6b72a23d

Takuya ASADA authored 11 years ago


Add osv.vmdk and osv.vdi targets on Makefile.

Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Asias He <asias@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

6b72a23d

Add vmxnet3 argument on run.py · 1fa1a602

Takuya ASADA authored 11 years ago


Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

1fa1a602

Add vmxnet3 driver · 7a78ad84

Takuya ASADA authored 11 years ago


Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

7a78ad84

mempool: Contiguous physical memory holder class · c230c939

Takuya ASADA authored 11 years ago


Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
[ penberg: fix formatting ]
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

c230c939

sched: Add assert to capture rescheduling attempt in nested exception · fc1d36c5

Gleb Natapov authored 11 years ago


Rescheduling in a nested exception is not supported since the nested
exception stack is per cpu, but without the assert such a reschedule will
cause stack corruption which will be hard to debug.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

fc1d36c5

ahci: Use all the cmd slots · 66dccdd7

Asias He authored 11 years ago


AHCI has 32 cmd slots for issuing cmds. Only the first slot is used
currently, which makes the queue-depth 1.

This patch uses all the cmd slots and makes the cmd completion async.
Now, the queue-depth is 32.

Test with "/tests/misc-bdev-write.so" on VBOX shows improvements:

   Before: ~10MB/s
   After: ~20MB/s

1000 rounds of "/tests/misc-bdev-rw.so" tests passed on VBOX and QEMU.

Signed-off-by: Asias He <asias@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

66dccdd7

ahci: Assert data buffer address · 4dbb2044

Asias He authored 11 years ago


Assert to make sure the data buffer address satisfies the AHCI spec's data
alignment requirement.

Signed-off-by: Asias He <asias@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

4dbb2044

ahci: Align cmd_table tables to 128-byte address · 403bbebe

Asias He authored 11 years ago


AHCI spec '4.2.2 Command List Structure' says 'Command Table Descriptor
Base Address' must be aligned to 128-byte cache line, indicated by bits
06:00 being reserved.

All 32 cmd_table entries are allocated in one linear space. The size of
cmd_table is 144 bytes for now, which is larger than 128 bytes. So we pad
cmd_table to 256 bytes.
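
The padding arithmetic can be checked with a short sketch. It assumes the
base of the linear allocation is itself 128-byte aligned; the names are
made up for this example:

```python
def round_up(size, alignment):
    """Round size up to the next multiple of alignment (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)


CMD_TABLE_SIZE = 144   # current size of one cmd_table, per the message above
ALIGNMENT = 128        # required alignment of each table's base address
SLOTS = 32

stride = round_up(CMD_TABLE_SIZE, ALIGNMENT)

# Offsets of each table from the (aligned) start of the allocation.
unpadded = [i * CMD_TABLE_SIZE for i in range(SLOTS)]
padded = [i * stride for i in range(SLOTS)]

print(stride)
print(all(off % ALIGNMENT == 0 for off in unpadded))
print(all(off % ALIGNMENT == 0 for off in padded))
```

Without padding, the second table would start at offset 144, which is not
a multiple of 128. With a 256-byte stride every table starts on an aligned
boundary.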

Signed-off-by: Asias He <asias@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

403bbebe

ahci: Limit max size of a bio request · e11552ce

Asias He authored 11 years ago

One PRDT entry can describe a buffer of at most 4MB, and we currently use
only one PRDT entry per AHCI cmd. Limit the size of a bio request to respect
that. Larger bios will be split into smaller bios by multiplex_strategy.
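
The splitting idea can be sketched as below. This is an illustration of
chunking a request under a size limit, not the multiplex_strategy
implementation; the function name and the (offset, length) representation
are assumptions:

```python
MAX_BIO_SIZE = 4 * 1024 * 1024   # 4MB limit of a single PRDT entry


def split_request(offset, length, max_size=MAX_BIO_SIZE):
    """Split a request into (offset, length) chunks of at most max_size."""
    chunks = []
    while length > 0:
        n = min(length, max_size)
        chunks.append((offset, n))
        offset += n
        length -= n
    return chunks


if __name__ == "__main__":
    for chunk in split_request(0, 10 * 1024 * 1024):
        print(chunk)
```

A 10MB request becomes two 4MB chunks followed by one 2MB chunk, each of
which fits in a single PRDT entry.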

Signed-off-by: Asias He <asias@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

e11552ce

libc: add scandir() from musl · 0e12a918

Nadav Har'El authored 11 years ago


Add missing scandir() function from musl 1.0.0. Fixes #237.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

0e12a918