Commits · fbd7d0846f3daa980b7182ca08a7f83248d24853 · Verlässliche Systemsoftware / projects / osv

Feb 11, 2014

epoll: Support epoll()'s EPOLLET · d41d748f

Nadav Har'El authored 11 years ago

This patch adds support for epoll()'s edge-triggered mode, EPOLLET.
Fixes #188.

As explained in issue #188, Boost's asio uses EPOLLET heavily, and we use
that library in our management http server, and also in our image creation
tool (cpiod.so). By ignoring EPOLLET, like we did until now, the code worked,
but unnecessarily wasted CPU when epoll_wait() always returned immediately
instead of waiting until a new event.

This patch works within the confines of our existing poll mechanisms -
where epoll() calls poll(). We do not change this in this patch, and it
should be changed in the future (see issue #17).

In this patch we add to each struct file a field "poll_wake_count", which
as its name suggests counts the number of poll_wake()s done on this
file. Additionally, epoll remembers the last value it saw of this counter,
so that in poll_scan(), if we see that an fp (polled with EPOLLET) has
an unchanged counter from last time, we do not return readiness on this fp
regardless of whether or not it has readable data.
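
The counter comparison described above can be sketched roughly as follows. This is a minimal illustration, not the OSv code: `file_state`, `epoll_entry`, `should_report()` and the simplified `poll_wake()` are hypothetical stand-ins for struct file, epoll's per-file bookkeeping and the check done in poll_scan().

```cpp
#include <cstdint>

// Hypothetical stand-in for the per-file state described in the commit message.
struct file_state {
    std::uint64_t poll_wake_count = 0; // incremented on every poll_wake()
    bool readable = false;             // level state: is there data to read?
};

// Hypothetical per-registration state that epoll keeps for one watched file.
struct epoll_entry {
    bool edge_triggered = false;       // registered with EPOLLET?
    std::uint64_t last_seen_count = 0; // counter value when we last reported
};

// Simplified wakeup: new data arrived on the file.
inline void poll_wake(file_state& f) {
    f.readable = true;
    ++f.poll_wake_count;
}

// Decides whether a scan should report the file as ready.
inline bool should_report(const file_state& f, epoll_entry& e) {
    if (!f.readable) {
        return false;
    }
    if (e.edge_triggered) {
        if (f.poll_wake_count == e.last_seen_count) {
            return false; // no wakeup since the last report: suppress it
        }
        e.last_seen_count = f.poll_wake_count; // remember what we reported
    }
    return true; // level-triggered entries always report readable files
}
```

In this sketch a second `poll_wake()` on a file that was never drained is reported again, which matches the false-positive the message goes on to describe.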

We have a complication with EPOLLET on sockets. These have an "SB_SEL"
optimization, which avoids calling poll_wake() when it thinks the new
data is not interesting because the old data was not yet consumed, and
also avoids calling poll_wake() if fp->poll() was not previously done.
This optimization is counter-productive for EPOLLET (and causes missed
wakeups) so we need to work around it in the EPOLLET case.

This patch also adds a test for the EPOLLET case in tst-epoll.cc. The test
runs on both OSv and Linux, and can confirm that in the tested scenarios,
Linux and OSv behave the same, including even one same false-positive:
When epoll_wait() tells us there is data in a pipe, and we don't read it,
but then more data comes on a pipe, epoll_wait() will again return a new
event, despite this not really being an edge event (the pipe didn't
change from empty to not-empty, as it was previously not-empty as well).
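
For readers who want to try the basic edge-triggered behavior on Linux, here is a small sketch using only the standard epoll and pipe calls. It is not taken from tst-epoll.cc; the helper names `ready_events()` and `edge_triggered_pipe_demo()` are made up for illustration. It only checks the uncontroversial part: after one write, the first epoll_wait() reports an event and an immediate second one does not.

```cpp
#include <sys/epoll.h>
#include <unistd.h>

// Returns how many events epoll_wait() reports without blocking (timeout 0).
inline int ready_events(int epfd) {
    epoll_event ev{};
    return epoll_wait(epfd, &ev, 1, 0);
}

// Writes one byte into a pipe watched with EPOLLIN | EPOLLET and polls twice
// without reading. The two epoll_wait() results are stored in first and
// second. Returns false if any setup step failed.
inline bool edge_triggered_pipe_demo(int& first, int& second) {
    int fds[2];
    if (pipe(fds) != 0) {
        return false;
    }
    int epfd = epoll_create1(0);
    if (epfd < 0) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }
    epoll_event ev{};
    ev.events = EPOLLIN | EPOLLET;
    ev.data.fd = fds[0];
    bool ok = epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0
           && write(fds[1], "x", 1) == 1;
    if (ok) {
        first = ready_events(epfd);  // the write is a new edge
        second = ready_events(epfd); // nothing new happened since
    }
    close(epfd);
    close(fds[0]);
    close(fds[1]);
    return ok;
}
```

With level-triggered registration (no EPOLLET) the second call would report the pipe as readable again, because the byte was never consumed.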

Concluding remarks:

The primary goal of this implementation is to stop EPOLLET epoll_wait()
from returning immediately despite nothing having happened on the file.
That was what caused the 100% CPU use before this patch. That being said,
the goal of this patch is NOT to avoid all false-positives or unnecessary
wakeups; When events do occur on the file, we may be doing a bit more
wakeups than strictly necessary. I think this is acceptable (our epoll()
has worse problems) but for posterity, I want to explain:

I already mentioned above one false-positive that also happens on Linux.
Another false-positive wakeup that remains is in one of EPOLLET's classic
use cases: Consider several threads sleeping on epoll() on the same socket
(e.g., TCP listening socket, or UDP socket). When one packet arrives, normal
level-triggered epoll() will wake all the threads, but only one will read
the packet and the rest will find they have nothing to read. With edge-
triggered epoll, only one thread should be woken and the rest would not.
But in our implementation, poll_wake() wakes up *all* the pollers on this
file, so we cannot currently support this optimization.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

d41d748f

Feb 10, 2014

tests/misc-bdev-write: Introduce offset limitation · 89c5c1a2

Dmitry Fleytman authored 11 years ago


Useful for testing on RAM disks when writes are fast enough
to fill the whole image in less than test execution time.

Signed-off-by: Dmitry Fleytman <dmitry@daynix.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

89c5c1a2

tests: fix test_malloc to use condvar for synchronisation. · 146690d1

Gleb Natapov authored 11 years ago

The test currently uses the low-level thread::wait_until(), which calls the
caller-supplied predicate with preemption disabled. If the caller-supplied
code accesses not-yet-mapped memory, it triggers an assertion on the page
fault path.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

146690d1

Convert tst-epoll.cc to standard C++ · b406ae7a

Nadav Har'El authored 11 years ago


Replace OSv-specific constructs in tst-epoll.cc by their standard C++
counterparts (i.e., std::thread, std::chrono, std::cout).
This test now also runs (and of course, succeeds) on Linux.

In general, it is important for all our Linux-ABI tests (where we test our
implementation of the Linux/glibc functionality) to be able to run on Linux
as well. Otherwise, it is possible our tests don't actually test the right
thing (we may test for some expected behavior, but the actual behavior on
Linux is different).

I'm doing this in preparation for fixing issue #188 (fix edge-triggered
epoll).

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

b406ae7a

Feb 09, 2014

tests: add shutdown() test · 6739a08d

Avi Kivity authored 11 years ago


net channels caused a crash in shutdown(), so add a test to exercise it.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

6739a08d

tests: convert tst-bsd-tcp1 to the boost unit test framework · bdb8ee13

Avi Kivity authored 11 years ago

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

bdb8ee13

Feb 07, 2014

tests/tst-zfs-mount: Add check to VOP_INACTIVE functionality · 3ea5ea34

Raphael S. Carvalho authored 11 years ago


Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

3ea5ea34

Feb 06, 2014

tests: Add tool to analyze ARC and shrink functionality · 4a68aee3

Raphael S. Carvalho authored 11 years ago


The main purpose of this tool is to understand/analyze the ARC behavior/
performance on specific workloads.

$ scripts/run.py -e 'tests/misc-zfs-arc.so --help'
OSv v0.05-155-g1f04e49
Allowed options:
  --help                produce help message
  --set-max-target      set ARC max target to 80% of the system memory.
  --check-arc-shrink    check ARC shrink functionality
  --test arg            analyze ARC performance on a given testcase, e.g.
                        --test tst-001.so

* --set-max-target: Used to check performance when ARC max target is
higher than usual. Given that more data will be loaded into ARC, ZFS operations
that need I/O should perform better. 80% was chosen because the low watermark
is 20%, which avoids a lot of memory pressure and so gives more stability.

* --check-arc-shrink: Check the functionality of the function arc_shrink
from ARC.

* --test arg: Check ARC performance on a specified testcase, e.g.:
$ scripts/run.py -e 'tests/misc-zfs-arc.so --test tst-fs-link.so'

* Default run, i.e., -e 'tests/misc-zfs-arc.so', provides four distinct
workloads:
1) Non-linear one where prefetch shouldn't be as effective.
2) Load all data into cache, then read it afterwards to check performance
on such cases, almost speed of main memory.
3) Linear workload where the amount of data is 1.5% the size of the system
memory, thus page replacement will be strongly used, and as the operation
is sequential, prefetch (readahead) must be effective. It leads to a high
cache hit ratio as blocks were read ahead of time.
4) Keep allocating memory through a populated anonymous mmapping to see
if shrink would take place to release memory back to the operating system.

Eventual reports and ARC stats are provided to ease the task of understanding
ARC performance on specific workloads.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

4a68aee3

tests: testcase to reproduce a variety of IO workloads · 0ee2e417

Raphael S. Carvalho authored 11 years ago

Mainly created to be used as a tool that reproduces specific workloads,
allowing us to understand how underlying components are performing,
e.g. the Adaptive Replacement Cache (ARC) from ZFS.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

0ee2e417

Feb 03, 2014

tests/misc-fs-stress: Provide it with a default file path · 7d005f58

Raphael S. Carvalho authored 11 years ago

Currently, a file path always needs to be specified as an argument,
so let's make things easier by providing a default file path.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

7d005f58

tests: do not malloc on bdev write · e6a27888

Glauber Costa authored 11 years ago


For page-sized allocations, it is better to use alloc_page than it is to use
malloc. The reason is that malloc for that size will arrive at malloc_large,
which is a locked operation, while alloc_page will proceed locklessly if there
is room in the per-cpu buffers.

Although it is just a test, since the goal is to saturate the disk, doing so
will allow us to get a closer picture, since the completion handler won't
block.

In KVM with my weird disk I now see ~65 Mbps where I previously saw ~45 Mbps.
Interestingly enough, it doesn't seem to make a whole lot of difference for
Xen. There is a difference, but it is nowhere near as large.

Reviewed-by: Tomasz Grabiec <tgrabiec@gmail.com>
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

e6a27888

tests: add tests for disk latency. · a9e63f3a

Glauber Costa authored 11 years ago


This test is similar in spirit to misc-bdev-write, but instead of pushing
as many bios as we can, we'll push one bio at a time.

Example output for Xen:

OSv v0.05-118-g0f2973c
Min      50%      90%      99%      99.99%   99.999%  Max     [msec]
---      ---      ---      ---      ------   -------  ---
0.2344   0.3240   0.2847   0.8185   2.4095   6.5230   13.6572

Example output for KVM:

OSv v0.05-118-g0f2973c
Min      50%      90%      99%      99.99%   99.999%  Max     [msec]
---      ---      ---      ---      ------   -------  ---
0.2626   0.3976   0.3273   0.4791   0.5993   7.6401   15.9672

(Hint: the current xen blkfront slowness is not related to RT latency...)
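
How the test derives its percentile columns is not shown here. As a rough sketch only, one common way to get such figures from a set of per-bio latency samples is the nearest-rank method; the `percentile()` helper below is hypothetical and not taken from the test.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
// Assumes samples is non-empty and 0 < p <= 100.
inline double percentile(std::vector<double> samples, double p) {
    std::sort(samples.begin(), samples.end());
    std::size_t rank =
        static_cast<std::size_t>(std::ceil(p / 100.0 * samples.size()));
    if (rank < 1) {
        rank = 1;
    }
    if (rank > samples.size()) {
        rank = samples.size();
    }
    return samples[rank - 1]; // rank is 1-based
}
```

With this definition a higher percentile can never be smaller than a lower one, which is a useful sanity check when reading such tables.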

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

a9e63f3a

Jan 28, 2014

tests: add a C++ Hello World test · 1e6d0ad0

Glauber Costa authored 11 years ago


This is useful to measure OSv boot speed, IOW, how fast are we without
CLI, Java, etc.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

1e6d0ad0

Jan 27, 2014

clock: Remove unnecessary #include <drivers/clock.hh> · 8bf2fedd

Nadav Har'El authored 11 years ago


Remove unused #include of <drivers/clock.hh>.
Except the clock drivers and <osv/clock.hh>, no source file now
include this header. Rather, <osv/clock.hh> should be used. Code including
<sched.hh> will also get <osv/clock.hh> automatically.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

8bf2fedd

clock: Support different clocks in pthread_cond_wait · 4513537e

Nadav Har'El authored 11 years ago


Fix pthread_cond_timedwait to set the absolute timer using a timepoint,
instead of the old s64.

Moreover, now that we have both a wall-clock and monotonic clock, we
can support pthread_condattr_setclock, so this patch also adds this
support. OpenJDK 8, for example, cannot run without this support (it
assumes that if the OS supports CLOCK_MONOTONIC, it can also configure
condition variables to use it).

Unfortunately supporting pthread_condattr_setclock - the only condition-
variable attribute that really exists - grows the pthread condition
variable structure :(
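
For reference, this is roughly how an application uses the attribute through the standard POSIX calls. The sketch is not from the patch; the function name `monotonic_timed_wait()` is made up. It binds a condition variable to CLOCK_MONOTONIC and waits on it with a deadline computed from the same clock.

```cpp
#include <pthread.h>
#include <time.h>
#include <cerrno>

// Waits on a condition variable bound to CLOCK_MONOTONIC for timeout_ms
// milliseconds. Nobody ever signals it, so the expected result is ETIMEDOUT.
inline int monotonic_timed_wait(long timeout_ms) {
    pthread_condattr_t attr;
    pthread_condattr_init(&attr);
    int rc = pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
    if (rc != 0) {
        pthread_condattr_destroy(&attr);
        return rc;
    }

    pthread_cond_t cond;
    pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_init(&cond, &attr);
    pthread_condattr_destroy(&attr);

    // The absolute deadline must come from the clock set on the attribute.
    timespec deadline;
    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&mtx);
    int result = 0;
    // Loop to absorb spurious wakeups; only the timeout ends the wait here.
    while (result == 0) {
        result = pthread_cond_timedwait(&cond, &mtx, &deadline);
    }
    pthread_mutex_unlock(&mtx);
    pthread_cond_destroy(&cond);
    pthread_mutex_destroy(&mtx);
    return result;
}
```

Using the monotonic clock means the wait is not affected if the wall-clock time is changed while the thread sleeps.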

Fixes #168.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

4513537e

clock: Drop nanotime() function · 770e21f1

Nadav Har'El authored 11 years ago


Drop the nanotime() function.

Change the few remaining callers to use the appropriate osv::clock or
std::chrono replacements.

We already got rid in previous patches of most references to nanotime()
by switching from absolute times to relative times.

The direct equivalent of the old nanotime() function, where we actually
need the number of nanoseconds since the UNIX epoch, is the rather
verbose expression osv::clock::wall::now().time_since_epoch().count(),
or the shorter clock::get()->time().
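
Outside of OSv, a comparable value can be obtained with plain std::chrono. The sketch below is an analogy, not the OSv API: `nanos_since_epoch()` is a hypothetical helper, and it assumes std::chrono::system_clock counts from the UNIX epoch (guaranteed since C++20, and the usual behavior of mainstream implementations before that).

```cpp
#include <chrono>
#include <cstdint>

// Nanoseconds since the UNIX epoch, assuming system_clock uses that epoch.
inline std::int64_t nanos_since_epoch() {
    using namespace std::chrono;
    return duration_cast<nanoseconds>(
               system_clock::now().time_since_epoch())
        .count();
}
```

Code that only needs to measure an interval is usually better served by subtracting two time points than by handling a raw nanosecond count.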

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

770e21f1

clock: Remove old type-less time literals · 74a57376

Nadav Har'El authored 11 years ago


Drop the s64 literals _ms, _ns, etc., from <drivers/clock.hh>.
Fix a few places which still use the old literals.

The std::chrono::duration version from <osv/clock.hh> remains -
but remember you need to "using namespace osv::clock::literals"
to use them.
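
To illustrate what typed time literals look like, here is a minimal sketch. The namespace `demo_literals` and its contents are hypothetical; they only mimic the idea and are not the definitions from <osv/clock.hh>.

```cpp
#include <chrono>

// Hypothetical namespace standing in for the real literals namespace.
namespace demo_literals {

// 5_ms yields a std::chrono::milliseconds value rather than a bare integer.
constexpr std::chrono::milliseconds operator""_ms(unsigned long long v) {
    return std::chrono::milliseconds(v);
}

// 5_ns yields a std::chrono::nanoseconds value.
constexpr std::chrono::nanoseconds operator""_ns(unsigned long long v) {
    return std::chrono::nanoseconds(v);
}

} // namespace demo_literals
```

Because the results are durations, mixing units such as `2_ms + 500_ns` converts correctly, which the old type-less s64 literals could not guarantee.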

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

74a57376

clock: Use new clock APIs in tst-wait-for.cc · de89218b

Nadav Har'El authored 11 years ago


Fix tst-wait-for.cc to use the new <osv/clock.hh> APIs.

This test uses std::abs on a time duration, and unfortunately C++11 fails
to implement std::abs on an std::chrono::duration. This patch also adds
support for this (in the form of a trivial template function) to
<osv/clock.hh>.
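
Such a helper could look like the sketch below. This is an assumed shape, not the code added to <osv/clock.hh>, and the namespace `demo` is hypothetical. (C++17 later added std::chrono::abs for signed durations.)

```cpp
#include <chrono>

// Hypothetical namespace, used only for this illustration.
namespace demo {

// Absolute value of a duration: negate it if it is below zero.
template <class Rep, class Period>
std::chrono::duration<Rep, Period>
abs(std::chrono::duration<Rep, Period> d) {
    return d >= std::chrono::duration<Rep, Period>::zero() ? d : -d;
}

} // namespace demo
```

A test can then check that a measured wait is close to the requested one, for example `demo::abs(measured - expected) < tolerance`.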

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

de89218b

clock: Drop sched::thread::sleep_until() · b3bfe5fa

Nadav Har'El authored 11 years ago


Delete the sched::thread::sleep_until() function. All users of this
function actually wanted a relative time, not absolute time, and can
use the simpler new sched::thread::sleep() instead.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

b3bfe5fa

clock: Test timerfd's support for monotonic clock · 2c47a392

Nadav Har'El authored 11 years ago


In tst-timerfd, test timerfd with monotonic clock, in addition to the
existing test with the realtime clock.

This patch also changes this test to only use Linux APIs, not anything
OSv-specific, so it can also be run on Linux.
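
A minimal use of timerfd with the monotonic clock, using only the Linux API, might look like this. It is a sketch rather than an excerpt from tst-timerfd, and `wait_monotonic_timer()` is a made-up name.

```cpp
#include <sys/timerfd.h>
#include <unistd.h>
#include <cstdint>

// Arms a one-shot CLOCK_MONOTONIC timer and blocks until it fires.
// Returns the expiration count read from the fd, or 0 on error.
// delay_ms must be greater than 0: an all-zero it_value disarms the timer.
inline std::uint64_t wait_monotonic_timer(long delay_ms) {
    int fd = timerfd_create(CLOCK_MONOTONIC, 0);
    if (fd < 0) {
        return 0;
    }
    itimerspec spec{}; // it_interval stays zero, so the timer is one-shot
    spec.it_value.tv_sec = delay_ms / 1000;
    spec.it_value.tv_nsec = (delay_ms % 1000) * 1000000L;

    std::uint64_t expirations = 0;
    if (timerfd_settime(fd, 0, &spec, nullptr) == 0) {
        // read() blocks until expiry and returns an 8-byte counter.
        ssize_t n = read(fd, &expirations, sizeof expirations);
        if (n != static_cast<ssize_t>(sizeof expirations)) {
            expirations = 0;
        }
    }
    close(fd);
    return expirations;
}
```

Passing CLOCK_REALTIME instead of CLOCK_MONOTONIC to timerfd_create() gives the realtime variant that the existing test already covered.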

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

2c47a392

Jan 26, 2014

Change tests/misc-bsd-callout to C++ · 7b8bdf8e

Nadav Har'El authored 11 years ago


Rename tests/misc-bsd-callout.c to tests/misc-bsd-callout.cc - we'll
need it in C++ in the upcoming patch set.

No changes were needed for this code to continue compiling.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

7b8bdf8e

Jan 23, 2014

tst-zfs-mount: Add ZFS refcnt consistency check · 47c0afdc

Raphael S. Carvalho authored 11 years ago


Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

47c0afdc

Jan 22, 2014

Fix tst-commands.cc · 7a4444c3

Nadav Har'El authored 11 years ago


A previous patch changed loader.cc's command line parsing to also support
"&" as a separator of commands, causing the previous command to be executed
in a new thread. To achieve this, the parser ends each command with another
string, containing "&", ";" or "", depending on what appeared on the end
of this command.

This change caused tst-commands.cc, which checks the results of the
command line parsing, to fail. So this patch fixes the test to match
the code.
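
The output shape described above can be sketched as follows. This is a simplified illustration, not loader.cc's parser: `parse_commands()` is hypothetical and assumes that words and separators are delimited by whitespace.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits a command line into commands. Each command is its list of words
// followed by one extra element naming the separator that ended it:
// "&", ";" or "" (end of input).
inline std::vector<std::vector<std::string>>
parse_commands(const std::string& line) {
    std::vector<std::vector<std::string>> commands;
    std::vector<std::string> current;
    std::istringstream in(line);
    std::string word;
    while (in >> word) {
        if (word == "&" || word == ";") {
            if (!current.empty()) {
                current.push_back(word); // record how this command ended
                commands.push_back(current);
                current.clear();
            }
        } else {
            current.push_back(word);
        }
    }
    if (!current.empty()) {
        current.push_back(""); // last command ended at end of input
        commands.push_back(current);
    }
    return commands;
}
```

For the input `a.so x & b.so ; c.so` this yields three commands whose final elements are "&", ";" and "", so a test that compares parse results has to expect that extra trailing element.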

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

7a4444c3

tst-queue-mpsc: fix one bug in the test. · 2abe7c3f

Nadav Har'El authored 11 years ago


The second stack trace mentioned in issue #178 happens because of a
bug in tst-queue-mpsc (this is what happens when tests become too
complex, and have bugs of their own...):

The "popper" thread reads an "item" from a lockfree:queue_mpsc, and wakes
the "pusher" thread in that item. But we have a bug when the pusher thread
is done and returns: While the condvar remains valid, the "item" containing
it does not! We cannot continue to use the index item->value.waiter after
we woke this thread, because it can return and item points to invalid
memory... We need to save the index "item->value.waiter" before waking
the thread.
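
The pattern of the fix, reduced to a single-threaded sketch: copy what you need out of the item before the wakeup, because the wakeup may end the item's lifetime. All names here (`item`, `wake_and_account()`) are hypothetical, and the `wake` callback merely simulates the pusher thread returning and destroying its item.

```cpp
#include <functional>
#include <memory>

// Hypothetical queue element carrying the index of the thread to wake.
struct item {
    int waiter;
};

// Wakes the waiter recorded in *slot and returns its index.
// The index is copied BEFORE calling wake(), since wake() may let the
// owner of the item proceed and free it.
inline int wake_and_account(std::unique_ptr<item>& slot,
                            const std::function<void(int)>& wake) {
    int waiter = slot->waiter; // save first, while the item is still valid
    wake(waiter);              // after this call, *slot may be gone
    return waiter;             // safe: uses the saved copy, not the item
}
```

Reading `slot->waiter` after `wake()` instead would be the bug described above: it might work most of the time and fail only when the woken side wins the race.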

Unfortunately, this does *not* completely solve issue #178 - the timer
bug (similar to the two stack traces on issue #178) is still seen
(rarely) after this patch.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

2abe7c3f

include: Move debug.hh to include/osv · 7809519b

Pekka Enberg authored 11 years ago

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

7809519b

include: Move mempool.hh to include/osv · 9c95f49d

Pekka Enberg authored 11 years ago

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

9c95f49d

include: Move commands.hh to include/osv · 86110819

Pekka Enberg authored 11 years ago

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

86110819

include: Move barrier.hh to include/osv · c80be886

Pekka Enberg authored 11 years ago

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

c80be886

include: Move mmu.hh to include/osv · 9cb900b7

Pekka Enberg authored 11 years ago

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

9cb900b7

include: Move sched.hh to include/osv · fae5693e

Pekka Enberg authored 11 years ago

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

fae5693e

Jan 21, 2014

chdir(): Fix error path, and add test · 4ae8779e

Nadav Har'El authored 11 years ago


This patch fixes chdir() on a normal file, which used to succeed (!?),
but now will fail as it should, with ENOTDIR.

The patch also adds an exhaustive test for chdir's success and error cases.
Before the latest chdir() patches, most of these tests would fail, and now
all of them succeed.

This test is standard C++ & Posix code, so it can be run also on Linux.
This is important for verifying that whatever we expect from OSv, Linux
really does the same.
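
One of the error cases can be checked with a few lines of POSIX code. The sketch below is not the test from the patch; `chdir_on_regular_file()` is a hypothetical helper, and it assumes /tmp is writable.

```cpp
#include <cerrno>
#include <cstdlib>
#include <unistd.h>

// Creates a temporary regular file and calls chdir() on it.
// Returns the errno from chdir(), 0 if chdir() unexpectedly succeeded,
// or -1 if the temporary file could not be created.
inline int chdir_on_regular_file() {
    char path[] = "/tmp/chdir-demo-XXXXXX";
    int fd = mkstemp(path); // replaces XXXXXX and creates the file
    if (fd < 0) {
        return -1;
    }
    close(fd);
    errno = 0;
    int rc = chdir(path);
    int err = (rc == 0) ? 0 : errno;
    unlink(path); // clean up the temporary file
    return err;
}
```

On a POSIX-conforming system the expected result is ENOTDIR, which is the behavior the patch restores on OSv.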

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

4ae8779e

test case for dladdr() · dba67315

Takuya ASADA authored 11 years ago


Test case for dladdr(), to make sure it returns the correct symbol in these cases:
 - addr is less than 'vfprintf'. Should return another symbol.
 - addr is equal to 'vfprintf'. Should return 'vfprintf' as the result.
 - addr is bigger than 'vfprintf', and also inside of it. Should return 'vfprintf' as the result.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

dba67315

Jan 20, 2014

tests: add test for wait_for(predicate) · 078e5690

Avi Kivity authored 11 years ago

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

078e5690

tests: add tests for wait_for() with timer objects · 3136db04

Avi Kivity authored 11 years ago

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

3136db04

Jan 17, 2014

tests: miscs-procfs.so · fc0715e0

Pekka Enberg authored 11 years ago


Add a simple manual test case for checking "/proc/self/maps" output.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

fc0715e0

Jan 15, 2014

tst-kill: fix timeout test · 337f4eb0

Avi Kivity authored 11 years ago


The timeout test sets a 2 second timeout and a 1 second alarm, and expects
the timeout to happen first.

Change the timeout to 0.5 seconds.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

337f4eb0

Jan 14, 2014

tst-vfs: don't rely on some random Java file · 305c749f

Nadav Har'El authored 11 years ago


tst-vfs.cc currently stat()s the file
	/usr/lib/jvm/jre/lib/amd64/headless/libmawt.so
And dies if it doesn't exist.

Since Java is now optional in our images, it's not a good idea to check
for such a file, which might not exist (e.g., "make image=tests check"
will fail). This patch changes it to check a filename that is certain to
exist, namely the test itself - /tests/tst-vfs.so.

If we wanted to have a pathname with more components, the test should
be rewritten to create this pathname, say /a/a/a/a/a/a/a/a/a/a, and then
test stat on that newly created file. It cannot rely on such a file to
pre-exist.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

305c749f

Jan 13, 2014

tst-kill: add tests for interrupted syscalls (SIG_ALRM) · 2dc3caf3

Dmitry Fleytman authored 11 years ago


Signed-off-by: Dmitry Fleytman <dmitry@daynix.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

2dc3caf3

Jan 10, 2014

tests: add test for thread clock · 89c50e8a

Glauber Costa authored 11 years ago


Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

89c50e8a

Jan 07, 2014

Exile spinlock to a separate file · 8fcad509

Nadav Har'El authored 11 years ago

In very early OSv history, the spinlock was used in the mutex's
implementation so it made sense to put it in mutex.cc and mutex.h.

But now that the spinlock is all that's left in mutex.cc (the real mutex
is in lfmutex.cc), rename this file spinlock.cc. Also, move the spinlock
definitions from <osv/mutex.h> to a new <osv/spinlock.h>, so if someone
wants to make the grave mistake of using a spinlock - they will at least
need to explicitly include this header file.

Currently, the only remaining user of the spinlock is the console.
Using a spinlock (and not a mutex) in the console allows printing debug
messages while preemption is disabled. Arguably, this use-case is no
longer important (we have tracepoints), so in the future we can consider
dropping the spinlock completely.
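
As background on why a spinlock can be taken while preemption is disabled: it never sleeps, it only busy-waits. A generic sketch with std::atomic_flag is shown below; `demo_spinlock` is a hypothetical class and not the contents of <osv/spinlock.h>.

```cpp
#include <atomic>

// Minimal test-and-set spinlock, for illustration only.
class demo_spinlock {
public:
    // Spins until the flag was previously clear; never blocks or sleeps.
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // busy-wait
        }
    }

    // Single attempt: returns true if the lock was acquired.
    bool try_lock() {
        return !flag_.test_and_set(std::memory_order_acquire);
    }

    void unlock() {
        flag_.clear(std::memory_order_release);
    }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};
```

The busy-wait is also the reason the message calls general use a grave mistake: a waiter burns CPU for as long as the holder keeps the lock.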

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

8fcad509