  6. Apr 24, 2014
    • zfs: fix read() of directory to return EISDIR · 0ad78a14
      Nadav Har'El authored
      
      Posix allows read() on directories in some filesystems. However, Linux
      always returns EISDIR in this case, and since we're emulating Linux,
      so should we, for every filesystem. All our filesystems except ZFS
      (e.g., ramfs) already return EISDIR when reading a directory, but ZFS
      doesn't, so this patch adds the missing check to ZFS.
      
      This patch is related to issue #94: the first step to fixing #94 is to
      return the right error when reading a directory.
      
      This patch also adds a test case, which can be compiled both on OSv and
      Linux, to verify they both have the same behavior. Before the patch, the
      test succeeded on Linux but failed on OSv when the directory is on ZFS.
      
      Instead of fixing zfs_read() as I do in this patch, I could have also fixed
      sys_read() in vfs_syscalls.cc, which is the top layer of all read()
      operations, and added there
         if (fp->f_dentry && fp->f_dentry->d_vnode->v_type == VDIR) {
            return EISDIR;
         }
      to cover all the filesystems. I decided not to do that, because all
      filesystems except ZFS already have this check, and because the lower
      layers like zfs_read() already have more natural access to d_vnode.
      
      Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  8. Apr 14, 2014
    • tests: add test for closing TCP connection with pending packets · fc5fc867
      Tomasz Grabiec authored
      
      The test is supposed to trigger the problem from issue #259. I was not
      able to trigger the problem using guest-local communication, hence the
      client is external to the guest.
      
      Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • tests: add rcu list test · 1bbcd274
      Avi Kivity authored
      
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • memory: support for larger-than-page alignment · 8c417e23
      Nadav Har'El authored
      
      Our existing implementation of posix_memalign() and the C11 aligned_alloc()
      used our regular malloc(), so it only worked up to an alignment of 4096
      bytes - and crashed when it failed to achieve a higher desired alignment.
      
      Some applications do ask for higher alignment - for example, MongoDB
      allocates a large buffer at 8192-byte alignment, and accordingly crashes
      half of the time, when the desired alignment is not achieved (the other
      half of the time, it is achieved by chance).
      
      This patch makes our support for alignment better organized, and fixes
      the alignment > 4096 case:
      
      The alignment is no longer available only to the outer function like
      posix_memalign(). Rather, it is passed down to lower-level allocation
      functions like malloc_large() which allocates whole pages - and this
      function now knows how to pick pages which start at a properly aligned
      boundary.
      
      This patch does not improve the wastefulness of our malloc_large(), so
      an overhaul of it would still be welcome. Case in point, malloc_large()
      always adds a full page to any allocation larger than half a page.
      Multiple allocations with posix_memalign(8192, 8192), rather than being
      tightly packed, each take 3 pages and are separated by a free page.
      This page is not wasted, but causes fragmentation of the heap.
      
      Note that after this patch, we still have one other bug in
      posix_memalign(size, align) - for small sizes and large alignments.
      For small sizes, we use a pool allocator with "size" alignment, and
      may not achieve the desired alignment (so causing an assertion failure).
      This bug can also be fixed, but is unrelated to this patch.
      
      This patch also adds a test for posix_memalign(), checking all alignments
      including large alignments which are the topic of this patch.
      The tests for small *sizes*, which as explained above are still buggy,
      are commented out, because they fail.
      
      Fixes #266.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@avi.cloudius>
  9. Apr 09, 2014
    • tests: enhance mmap test · ed2a70f1
      Glauber Costa authored
      
      We have recently discovered a bug through which we fail to unmap a valid region.
      This is fixed now, and this patch adds the failing condition to the test suite.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • vfs: fix partial non-blocking write · ef169330
      Nadav Har'El authored
      
      Our read() and write(), and their variants (pread, pwrite, readv, writev,
      preadv, pwritev) all shared the same bug when it comes to a partial read or
      write: they returned EWOULDBLOCK (EAGAIN) instead of returning successfully
      with the number of bytes actually written or read, as they should have.
      
      In the internals of the BSD read and write operations (e.g., sosend_generic)
      each operation returns *both* an error number and a number of bytes left.
      But at the end, the system call is expected to return just one of them -
      either an error *or* a number of bytes. The existing read()/write() code,
      when it saw the internals returning an error code, always returned it and
      ignored the number of bytes. This was wrong: When the error is EWOULDBLOCK
      and the number of bytes is non-zero, we should return this number of bytes
      (i.e., a successful partial write), *not* the EWOULDBLOCK error.
      
      This bug went unnoticed almost since the dawn of OSv, because partial reads
      and writes are not common. For example, a write() to a blocking socket will
      always return after the entire write is successful, and will not partially
      succeed. Only when we write to an O_NONBLOCK socket will it be possible to
      see a partial write - but even then, we would need a pretty large write()
      to see it only partially succeeding.
      
      But this bug is very noticeable when running the Jetty Web server (see
      issue #257): at some point it is as if the response was restarted
      (complete with a second copy of the headers). In Jetty's demo this was
      seen as half-shown images, as well as corrupt output when fetching large
      text files like /test/da.txt.
      
      Turns out that Jetty sends static responses in a surprisingly efficient
      (for Java code...) way, using a single system call for the entire response:
      It mmap()s the file it wishes to send, and then uses one writev() call to
      send two arrays: The HTTP headers (built in malloc()ed memory), and the
      file itself (from mmapped memory). So Jetty tries to write even a 1MB file
      in one huge writev() call. But there's an added twist: It does so with the
      socket configured to O_NONBLOCK. So for large writes, the write will only
      partially succeed (empirically, only about 50KB will succeed), and Jetty
      will notice the partial write and continue writing the rest - until the
      whole file is sent. With the bug we had, part of the response would have
      been written, but Jetty thought the write hadn't written anything, so it
      would start writing again from the beginning - causing the weird response
      corruption we had been seeing.
      
      This patch also includes a test case which confirms this bug, and its fix.
      In this test (tst-tcp-nbwrite), two threads communicate over a TCP socket
      (on the loopback interface), one thread write()s a very large buffer and
      the other receives what it can. We try this two times - once on a blocking
      socket and once on a non-blocking socket. In each case we expect the number
      of bytes written by one thread (return from write()) and the number read
      by the second thread (return from read()) to be the same. With the bug we
      had, in the non-blocking case we saw write() returning -1 (with
      errno=EWOULDBLOCK) but read returned over 50,000 bytes, causing the test
      to fail.
      
      Fixes #257.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
  11. Apr 01, 2014
    • core: introduce serial_timer_task · bd179712
      Tomasz Grabiec authored
      
      This is a wrapper around timer_task which should be used when atomicity
      of callback tasks and timer operations is required. The class accepts an
      external lock to serialize all operations. It provides sufficient
      abstraction to replace callouts in the network stack.
      
      Unfortunately, it requires some cooperation from the callback code
      (see try_fire()). That's because I couldn't extract the in_pcb lock
      acquisition out of the callback code in the TCP stack: there are other
      locks taken before it, and doing so _could_ result in lock order
      inversion problems and hence deadlocks. If we can prove these to be
      safe, then the API could be simplified.
      
      It may also be worthwhile to propagate the lock passed to
      serial_timer_task down to timer_task to save an extra CAS.
    • core: introduce deferred work framework · 34620ff0
      Tomasz Grabiec authored
      The design behind timer_task
      
      timer_task was designed to make cancel() and reschedule() scale well
      with the number of threads and CPUs in the system. These methods may
      be called frequently and from different CPUs. A task scheduled on one
      CPU may be rescheduled later from another CPU. To avoid expensive
      coordination between CPUs, a lockfree per-CPU worker was implemented.
      
      Every CPU has a worker (async_worker) which has task registry and a
      thread to execute them. Most of the worker's state may only be changed
      from the CPU on which it runs.
      
      When timer_task is rescheduled, it registers its percpu part in the
      current CPU's worker. When it is then rescheduled from another CPU, the
      previous registration is marked as invalid and a new percpu part is
      registered. When a percpu task fires, it checks whether it is the last
      registration - only then may it fire.
      
      Because timer_task's state is scattered across CPUs, some extra
      housekeeping needs to be done before it can be destroyed. We need to
      make sure that no percpu task will try to access the timer_task object
      after it is destroyed. To ensure that, we walk the list of
      registrations of the given timer_task and atomically flip their state
      from ACTIVE to RELEASED. If that succeeds, the task is now revoked and
      the worker will not try to execute it. If it fails, the task is in the
      middle of firing and we need to wait for it to finish. When a per-CPU
      task is moved to the RELEASED state it is appended to the worker's
      queue of released percpu tasks using a lockfree mpsc queue. These
      objects may later be reused for registrations.
    • tests: add java reclaim test · a71f8c53
      Glauber Costa authored
      
      This is a test in which two threads compete for resources. One of them
      will (hopefully) trigger memory allocations that are served by the heap,
      while the other will stress the filesystem through reads and/or writes
      (no mappings).
      
      This is designed to test how well the balloon code works together with the ARC
      reclaimer.
      
      There are three main goals I expect OSv to achieve when running this test:
      
      1) When there is no filesystem activity, the balloon should never trigger, and
      the ARC cache should be reduced to its minimum
      2) When there is no java activity, we should balloon as much as we can, leaving
      the memory available to the filesystem (this one is trickier because the IO code
      is itself a java application - on purpose - so we eventually have to stop)
      3) When both are happening in tandem, the system should stabilize in reasonable
      values and not spend useless cycles switching memory back and forth.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
  14. Mar 25, 2014
    • tst-kill: fix crash · 229020d2
      Nadav Har'El authored
      
      tst-kill runs various signal handlers, which we run in separate threads.
      When the test completes, we may be unlucky enough for the last signal
      handler to still be running, at which point, when the module's memory
      is unmapped (e.g., with test.py -s each test is unmapped when it ends),
      we can get a page fault and a crash.
      
      This patch sleeps for a second at the end of tst-kill, to make sure that
      the signal handler has completed. This sleep is a bit ugly, but I can't
      think of a cleaner way - Posix provides no way to check whether a handler
      is still running, and I wouldn't like to add a new API just for this test.
      
      Fixes #249.
      
      Reviewed-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  15. Mar 21, 2014
    • tests: test for larger-than-memory mmaps. · b0cf5d76
      Glauber Costa authored
      
      The test creates a file on disk twice as large as memory, and then maps
      it entirely into memory. The file is then read using 3 different
      sequential patterns, and later 2 threaded patterns.
      
      This test does not handle writes.
      
      It goes in misc because it takes a very long time to run (especially
      with the random pattern).
      
      Example output:
      
      Total Ram 586 Mb
      Write done
      Double Pass OK (13.6323 usec / page)
      Recency OK (3.35954 usec / page)
      Random Access OK (640.926 usec / page)
      Threaded pass 1 address ended OK
      Threaded pass many addresses ended OK
      PASSED
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • signal handling: support SA_RESETHAND · 15dfac35
      Nadav Har'El authored
      
      Add support for SA_RESETHAND signal handler flag, which means that the signal
      handler is reset to the default one after handling the signal once.
      
      I admit it's not a very useful feature (our default handler is powering off
      the system...) but no reason not to support it.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  19. Feb 25, 2014
    • tests: more load-balancer tests · 5b805e63
      Nadav Har'El authored
      
      This patch adds two more load-balancing tests to tests/misc-loadbalance.cc:
      
      1. Three threads on two cpus. If load-balancing is working correctly, this
         should slow down all threads x1.5 equally, and not get two x2 threads
         and one x1.
      
         Our performance on this test is fairly close to the expected slowdown.
      
      2. Three threads on two cpus, but one thread has priority 0.5, meaning it
         should get twice the CPU time of the two other threads, so fair load
         balancing is to keep the priority-0.5 thread on its own CPU, and the
         two normal-priority threads together on the second CPU - so at the end
         the priority-0.5 thread will get twice the CPU time of the other threads.
      
         Unfortunately, this test now gets bad results (x0.93, x0.94, x1.14
         instead of x1, x1, x1), because our load balancer currently doesn't
         take thread priorities into account: it thinks the CPU running the
         priority-0.5 thread has load 1, while it should be considered to
         have load 2.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  21. Feb 12, 2014
    • tests: Use a rare value on the symbol content comparisons · 288a3c84
      Raphael S. Carvalho authored
      
      Zero is a regularly used value, so let's instead use a rare one to
      compare the content of the symbols against. This also addresses some
      stylistic issues.
      
      The new output follows:
      $ sudo scripts/run.py -e 'tests/tst-resolve.so'
      OSv v0.05-348-g8b39f8c
      Target value:   0x05050505
      Success:        nonexistant = 0x05050505
      Success:        debug = 0x05050505
      Success:        condvar_wait = 0x05050505
      The time:       1392076964
      success.
      
      Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • Fix concurrent static initialization · 8ef7f248
      Nadav Har'El authored
      
      When multiple threads concurrently use a function which has a static
      variable with a constructor, gcc guarantees that only one thread will run
      the constructor, while the others wait. It uses __cxa_guard_acquire()/
      release()/abort() to implement this guarantee. Unfortunately these
      functions, implemented in the C++ standard library, use the Linux futex
      syscall, which we don't implement. The futex system call is only used in
      the rare case of contention - in the normal case, the __cxa_guard_*
      functions use just atomic memory operations.
      
      This patch implements the bare minimum we need from futex to make the
      __cxa_guard_* functions work: All we need is FUTEX_WAIT and FUTEX_WAKE,
      with 0 timeout in FUTEX_WAIT and wake all in FUTEX_WAKE.
      
      It's nice how the new waitqueues fit this implementation like a glove:
      we could use condvars too, but we anyway need to do the wake() with the
      mutex locked (to get the futex's atomic test-and-wait), and in that case
      we can use waitqueues without their additional internal mutex.
      
      This patch also adds a test for this bug. Catching this bug in a real
      application is very rare, but the test exposes it by defining a function-
      static object with a very slow constructor (it sleeps for a second), and
      calling the function from several threads concurrently.
      Before this patch the test fails with:
         constructing
         syscall(): unimplemented system call 202. Aborting.
      
      Fixes #199.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  22. Feb 11, 2014
    • epoll: Support epoll()'s EPOLLET · d41d748f
      Nadav Har'El authored
      
      This patch adds support for epoll()'s edge-triggered mode, EPOLLET.
      Fixes #188.
      
      As explained in issue #188, Boost's asio uses EPOLLET heavily, and we use
      that library in our management http server, and also in our image creation
      tool (cpiod.so). By ignoring EPOLLET, as we did until now, the code
      worked, but needlessly wasted CPU because epoll_wait() always returned
      immediately instead of waiting for a new event.
      
      This patch works within the confines of our existing poll mechanisms -
      where epoll() calls poll(). We do not change this in this patch; it
      should be changed in the future (see issue #17).
      
      In this patch we add to each struct file a field "poll_wake_count", which
      as its name suggests counts the number of poll_wake()s done on this
      file. Additionally, epoll remembers the last value it saw of this counter,
      so that in poll_scan(), if we see that an fp (polled with EPOLLET) has
      an unchanged counter from last time, we do not return readiness on this fp
      regardless of whether or not it has readable data.
      
      We have a complication with EPOLLET on sockets. These have an "SB_SEL"
      optimization, which avoids calling poll_wake() when it thinks the new
      data is not interesting because the old data was not yet consumed, and
      also avoids calling poll_wake() if fp->poll() was not previously done.
      This optimization is counter-productive for EPOLLET (and causes missed
      wakeups) so we need to work around it in the EPOLLET case.
      
      This patch also adds a test for the EPOLLET case in tst-epoll.cc. The test
      runs on both OSv and Linux, and can confirm that in the tested scenarios,
      Linux and OSv behave the same, including even one shared false-positive:
      when epoll_wait() tells us there is data in a pipe, and we don't read it,
      but then more data comes on the pipe, epoll_wait() will again return a
      new event, even though this is not really an edge event (the pipe didn't
      change from empty to not-empty, as it was previously not-empty as well).
      
      Concluding remarks:
      
      The primary goal of this implementation is to stop an EPOLLET epoll_wait()
      from returning immediately when nothing has happened on the file.
      That was what caused the 100% CPU use before this patch. That being said,
      the goal of this patch is NOT to avoid all false-positives or unnecessary
      wakeups; When events do occur on the file, we may be doing a bit more
      wakeups than strictly necessary. I think this is acceptable (our epoll()
      has worse problems) but for posterity, I want to explain:
      
      I already mentioned above one false-positive that also happens on Linux.
      Another false-positive wakeup that remains is in one of EPOLLET's classic
      use cases: Consider several threads sleeping on epoll() on the same socket
      (e.g., TCP listening socket, or UDP socket). When one packet arrives, normal
      level-triggered epoll() will wake all the threads, but only one will read
      the packet and the rest will find they have nothing to read. With edge-
      triggered epoll, only one thread should be woken and the rest would not.
      But in our implementation, poll_wake() wakes up *all* the pollers on this
      file, so we cannot currently support this optimization.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>