  1. Feb 11, 2014
    • epoll: Support epoll()'s EPOLLET · d41d748f
      Nadav Har'El authored
      
      This patch adds support for epoll()'s edge-triggered mode, EPOLLET.
      Fixes #188.
      
      As explained in issue #188, Boost's asio uses EPOLLET heavily, and we use
      that library in our management http server and in our image creation
      tool (cpiod.so). By ignoring EPOLLET, as we did until now, the code worked,
      but it wasted CPU unnecessarily: epoll_wait() always returned immediately
      instead of waiting for a new event.
      
      This patch works within the confines of our existing poll mechanisms,
      where epoll() calls poll(). We do not change this in this patch; it
      should be changed in the future (see issue #17).
      
      In this patch we add to each struct file a field "poll_wake_count", which,
      as its name suggests, counts the number of poll_wake()s done on this
      file. Additionally, epoll remembers the last value it saw of this counter,
      so that in poll_scan(), if we see that an fp (polled with EPOLLET) has
      an unchanged counter from last time, we do not return readiness on this fp,
      regardless of whether or not it has readable data.
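
      A rough sketch of that check, for illustration only (apart from
      poll_wake_count and poll_scan(), the names below are hypothetical and
      not the actual OSv code):

        #include <atomic>

        struct file {
            // ... existing members ...
            std::atomic<unsigned> poll_wake_count{0};  // bumped by poll_wake()
        };

        // Simplified, hypothetical helper for the poll_scan() loop, for an fp
        // registered with EPOLLET; last_seen is the counter value that epoll
        // remembered from the previous scan of this fp.
        bool epollet_should_report(file* fp, bool has_events, unsigned& last_seen)
        {
            unsigned now = fp->poll_wake_count.load(std::memory_order_relaxed);
            if (now == last_seen) {
                // No poll_wake() since we last reported this fp: suppress the
                // event even if readable data is still pending.
                return false;
            }
            last_seen = now;
            return has_events;
        }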
      
      We have a complication with EPOLLET on sockets. These have an "SB_SEL"
      optimization, which avoids calling poll_wake() when it thinks the new
      data is not interesting because the old data was not yet consumed, and
      also avoids calling poll_wake() if fp->poll() was not previously done.
      This optimization is counter-productive for EPOLLET (and causes missed
      wakeups) so we need to work around it in the EPOLLET case.
      
      This patch also adds a test for the EPOLLET case in tst-epoll.cc. The test
      runs on both OSv and Linux, and confirms that in the tested scenarios
      Linux and OSv behave the same, including even one identical false positive:
      when epoll_wait() tells us there is data in a pipe, and we don't read it,
      but then more data arrives on the pipe, epoll_wait() will again return a new
      event, despite this not really being an edge event (the pipe didn't
      change from empty to not-empty, as it was previously not-empty as well).
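
      For illustration only, a minimal standalone program in the spirit of that
      scenario (this is not the actual tst-epoll.cc code) would be:

        #include <sys/epoll.h>
        #include <unistd.h>
        #include <cassert>

        int main()
        {
            int p[2];
            assert(pipe(p) == 0);
            int ep = epoll_create1(0);
            epoll_event ev{};
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = p[0];
            assert(epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev) == 0);

            assert(write(p[1], "x", 1) == 1);
            epoll_event out{};
            assert(epoll_wait(ep, &out, 1, 1000) == 1);  // first edge, as expected

            // Do not read the pipe; just write more data. Both Linux and OSv
            // report a second event even though the pipe never became empty.
            assert(write(p[1], "y", 1) == 1);
            assert(epoll_wait(ep, &out, 1, 1000) == 1);
            return 0;
        }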
      
      Concluding remarks:
      
      The primary goal of this implementation is to stop an EPOLLET epoll_wait()
      from returning immediately when nothing has happened on the file.
      That was what caused the 100% CPU use before this patch. That being said,
      the goal of this patch is NOT to avoid all false positives or unnecessary
      wakeups; when events do occur on the file, we may do a few more
      wakeups than strictly necessary. I think this is acceptable (our epoll()
      has worse problems), but for posterity, I want to explain:
      
      I already mentioned above one false-positive that also happens on Linux.
      Another false-positive wakeup that remains is in one of EPOLLET's classic
      use cases: Consider several threads sleeping on epoll() on the same socket
      (e.g., TCP listening socket, or UDP socket). When one packet arrives, normal
      level-triggered epoll() will wake all the threads, but only one will read
      the packet and the rest will find they have nothing to read. With edge-
      triggered epoll, only one thread should be woken and the rest should keep
      sleeping.
      But in our implementation, poll_wake() wakes up *all* the pollers on this
      file, so we cannot currently support this optimization.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • msix: thread affinity · b4e8d47d
      Vlad Zolotarov authored
      
      Instead of binding all msix interrupts to cpu 0, have them chase the
      interrupt service routine thread and pin themselves to the same cpu.
      
      This patch is based on a patch from Avi Kivity <avi@cloudius-systems.com>
      and uses some ideas from Nadav Har'El <nyh@cloudius-systems.com>.
      
      It improves the performance of the single thread Rx netperf test by 16%:
      before - 25694 Mbps
      after  - 29875 Mbps
      
      New in V2:
       - Dropped the functor class - use lambda instead.
       - Fixed the race in a waking flow.
       - Added some comments.
       - Added the performance numbers to the patch description.
      
      Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  2. Feb 07, 2014
    • boot: take early timings · d38883fa
      Glauber Costa authored
      
      In the past, we have struggled with long delays while reading data from disk in
      real mode, leading to long boot times (not that they are totally gone). For that
      reason, it is useful to know how much time is being spent in that process. As
      unstable and broken as the TSC is, it is pretty much our only ally for that.
      
      What I am proposing in this patch is that we take timings at key stages of
      the bootloader and pass them to the main loader. We do that by adding some
      space at the end of the multiboot_info structure, so that we can pass some
      extra fields in it. Right now we are using 16 bytes, so we can pass 2 64-bit
      tsc reads.
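
      To illustrate the layout (the actual field names in the patch may differ),
      those 16 bytes amount to something like:

        #include <cstdint>

        // Hypothetical sketch: two 64-bit TSC reads appended after the standard
        // multiboot_info fields.
        struct boot_timings {
            uint64_t tsc[2];  // raw TSC reads taken at key bootloader stages
        };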
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • general infrastructure for boot time calculation · 3ab3a6bb
      Glauber Costa authored
      
      I am proposing a mechanism here that will give us a better idea of how
      much time we spend booting, and how much each of the pieces contributes
      to it. For that, we need to be able to get time stamps really early, in
      places where tracepoints may not be available, and a clock most definitely
      won't be.
      
      With my proposal, one should be able to register events. After the system
      boots, we will calculate the total time since the first event, as well as the
      delta since the previous event. If the first event is early enough, that should
      produce a very good picture of our boot time.
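
      A minimal sketch of what such event registration could look like (names
      such as boot_time_record are hypothetical; OSv's actual interface may
      differ):

        #include <cstdint>
        #include <cinttypes>
        #include <cstdio>

        struct boot_event {
            const char* name;
            uint64_t    tsc;   // raw TSC; converted to nanoseconds only after boot
        };

        static boot_event boot_events[32];
        static int        boot_event_count;

        // Safe to call very early: no clock, no tracepoints, no allocation.
        inline void boot_time_record(const char* name)
        {
            uint32_t lo, hi;
            asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
            boot_events[boot_event_count++] = {name, (uint64_t(hi) << 32) | lo};
        }

        // After boot: total since the first event and delta since the previous
        // one, still in TSC units at this point.
        void boot_time_report()
        {
            for (int i = 0; i < boot_event_count; i++) {
                printf("%-16s total=%" PRIu64 " delta=%" PRIu64 "\n",
                       boot_events[i].name,
                       boot_events[i].tsc - boot_events[0].tsc,
                       i ? boot_events[i].tsc - boot_events[i - 1].tsc : 0);
            }
        }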
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • pvclock: reuse pvclock's functionality to convert tsc to nano · 2df3c029
      Glauber Costa authored
      
      This patch provides a way to take a tsc measurement and obtain a nanosecond
      figure from it. It works only for the xen and kvm pvclocks, and I intend to
      use it for acquiring early boot figures.
      
      It is possible to measure the tsc frequency and with that figure out how to
      convert a tsc read to nanoseconds, but I don't think we should pay that price.
      Most of the pvclock drivers already provide that functionality, and we are not
      planning that many users of that interface anyway.
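
      For reference, the scaling that the pvclock drivers already implement
      converts a raw TSC delta roughly as follows (a sketch of the standard
      KVM/Xen pvclock algorithm, not OSv's exact code):

        #include <cstdint>

        // tsc_shift and tsc_to_system_mul come from the pvclock time-info
        // structure that the hypervisor shares with the guest.
        uint64_t pvclock_tsc_to_ns(uint64_t tsc_delta, int8_t tsc_shift,
                                   uint32_t tsc_to_system_mul)
        {
            if (tsc_shift >= 0) {
                tsc_delta <<= tsc_shift;
            } else {
                tsc_delta >>= -tsc_shift;
            }
            // 32.32 fixed-point multiply: (delta * mul) >> 32 yields nanoseconds.
            return (static_cast<unsigned __int128>(tsc_delta) *
                    tsc_to_system_mul) >> 32;
        }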
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  3. Feb 06, 2014
    • api/aarch64: add alltypes.h.sh script and first headers · 81aa5e81
      Claudio Fontana authored
      
      add alltypes.h.sh, and first headers in bits/
      
      Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • jvm_balloon: handle explicit unmapping case · fc469b4d
      Glauber Costa authored
      
      The JVM may unmap certain areas of the heap completely, which was confirmed by
      code inspection by Gleb. In that case, the current balloon code will break.
      
      This is because we were deleting the vma from finish_move(), and recreating the
      old mapping implicitly in the process. With this new patch, the tear down of
      the jvm balloon mapping is done by a separate function. Unmapping or evacuating
      the region won't trigger it.
      
      It still needs to communicate to the balloon code that this address is out of
      the balloons list. We do that by calling the page fault handler with an empty
      frame. jvm_balloon_fault is patched to interpret an empty frame correctly.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • runtime: Support format specifiers in abort() · a814c5a7
      Pekka Enberg authored
      
      Add format specifier support to abort() to make it easier to produce
      useful error messages.
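
      As a sketch of the idea (the function name here is hypothetical, and
      OSv's actual abort() implementation may differ in details), the format
      support can be layered on vsnprintf like this:

        #include <cstdarg>
        #include <cstdio>
        #include <cstdlib>

        // Illustrative only: printf-style abort that formats its message first.
        [[noreturn]] void abort_with_msg(const char* fmt, ...)
        {
            char msg[1024];
            va_list ap;
            va_start(ap, fmt);
            vsnprintf(msg, sizeof(msg), fmt, ap);
            va_end(ap);
            fputs(msg, stderr);
            std::abort();
        }

        // Usage: abort_with_msg("bad page %p (order %d)\n", addr, order);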
      
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • Elf: Fix also _module_index_list · b0b5462f
      Nadav Har'El authored
      
      Also fix concurrent use of _module_index_list (for the per-module TLS),
      by protecting it with a new mutex. We could probably have done something
      with RCU instead, but just adding a new mutex is a lot easier.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • Elf: Fix shared-object unload concurrent with dynamic linker use · 8213bf13
      Nadav Har'El authored
      
      After the above patches, one race remains in the dynamic linker: If an
      object is *unloaded* while some symbol resolution or object iteration
      (dl_iterate_phdr) is in progress, the function in progress may reach
      this object after it is already unmapped from memory, and crash.
      
      Therefore, we need to delay unmapping of objects while any object
      iteration is going on. We need to allow the object to be deleted from
      the _modules and _files list (so that new calls will not find it) but
      temporarily delay the actual freeing of the object's memory.
      
      The cleanest way to achieve this would have been to increment each
      module's reference count in the RCU section of modules_get(), so they won't
      get deleted while still in use. However, this would significantly slow down
      users like backtrace() with dozens of atomic operations. So we chose
      a different solution: keep a counter _modules_delete_disable which,
      when non-zero, causes all module deletion to be delayed until the counter
      drops back to zero. with_modules() now only needs to increment this
      single counter, not every separate module.
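
      A conceptual sketch of that counter (names and types here are illustrative,
      not OSv's exact code):

        #include <mutex>
        #include <vector>

        struct elf_object { /* loaded shared object (details omitted) */ };

        static std::mutex _mutex;            // protects the fields below
        static int _modules_delete_disable;  // > 0 while any iteration runs
        static std::vector<elf_object*> _deferred_deletes;

        void modules_delete_disable()
        {
            std::lock_guard<std::mutex> guard(_mutex);
            ++_modules_delete_disable;
        }

        void modules_delete_enable()
        {
            std::lock_guard<std::mutex> guard(_mutex);
            if (--_modules_delete_disable == 0) {
                // No iteration in progress any more: actually free the objects
                // whose unload was requested while the counter was non-zero.
                for (auto obj : _deferred_deletes) {
                    delete obj;
                }
                _deferred_deletes.clear();
            }
        }

        void unload_object(elf_object* obj)
        {
            std::lock_guard<std::mutex> guard(_mutex);
            // (removal from the _modules list happens here, so new lookups no
            //  longer find the object)
            if (_modules_delete_disable > 0) {
                _deferred_deletes.push_back(obj);  // free later
            } else {
                delete obj;                        // free immediately
            }
        }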
      
      Fixes #176.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • Elf: Fix shared-object load concurrent with dynamic linker use · 68afb68e
      Nadav Har'El authored
      
      This patch addresses the bugs of *use* of the dynamic linker - looking
      up symbols or iterating the list of loaded objects - in parallel with new
      libraries being loaded with get_library().
      
      The underlying problem is that we have an unprotected "_modules" vector
      of loaded objects, which we need to iterate to look up symbols, but this
      list of modules can change when a new shared object is loaded.
      
      We decided *not* to solve this problem by using the same mutex that protects
      object load/unload (_mutex). That would make boot slower, as threads using
      new symbols would be blocked just because another thread is concurrently
      loading some unrelated shared object (not a big problem with demand-paged
      file mmaps). Using a mutex can also cause deadlocks in the leak detector,
      because of lock order reversal between malloc's and elf's mutexes: malloc()
      takes its lock first and then backtrace() takes elf's lock, while on the
      other hand elf can take its lock and then call malloc(), taking malloc's lock.
      
      Instead, this patch uses RCU to allow lock-free reading of the modules
      list. As in RCU, writing (adding or removing an object from the list)
      manufactures a new list, deferring the freeing of the old one, allowing
      readers to continue using the old object list.
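
      Conceptually, the read and update paths look like the following
      copy-on-write sketch (OSv's real code uses its own RCU primitives rather
      than std::shared_ptr; the names here are illustrative):

        #include <memory>
        #include <mutex>
        #include <vector>

        struct elf_object;   // loaded shared object (opaque here)

        using module_list = std::vector<elf_object*>;

        static std::shared_ptr<const module_list> _modules =
                std::make_shared<const module_list>();
        static std::mutex _modules_mutex;   // taken by writers only

        // Reader: grab a snapshot without locking; it stays valid even if a
        // writer publishes a new list while we iterate.
        std::shared_ptr<const module_list> modules_get()
        {
            return std::atomic_load(&_modules);
        }

        // Writer: build a new list and publish it atomically; readers holding
        // the old snapshot keep using it until they drop their reference.
        void modules_add(elf_object* obj)
        {
            std::lock_guard<std::mutex> guard(_modules_mutex);
            auto next = std::make_shared<module_list>(*std::atomic_load(&_modules));
            next->push_back(obj);
            std::atomic_store(&_modules,
                              std::shared_ptr<const module_list>(std::move(next)));
        }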
      
      Note that after this patch, concurrent lookups and get_library() will
      work correctly, but concurrent lookups and object *unload* will still
      not be correct, because we need to defer an object's unloading from
      memory while lookups are in progress. This will be solved in a following
      patch.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • Elf: Serialize shared-object load and unload · 566c77f6
      Nadav Har'El authored
      
      Our current dynamic-linker code (elf.cc) is not thread safe, and all sorts
      of disasters can happen if shared objects are loaded, unloaded and/or used
      concurrently. This and the following patches solve this problem in stages:
      
      The first stage, in this patch, is to protect concurrent shared-library
      loads and unloads. (If the dynamic linker is also in use concurrently,
      this will still cause problems; that will be solved in the next patches.)
      
      Library load and unload use a bunch of shared data without protection,
      so concurrency can cause disaster. For example, two concurrent loads can
      pick the same address to map the objects in. We solve this by using a mutex
      to ensure only one shared object is loaded or unloaded at a time.
      
      Instead of this coarse-grained locking, we could have used finer-grained
      locks to allow several library loads to proceed in parallel, protecting
      just the actual shared data. However, the benefits would be very small,
      because with demand-paged file mmaps, "loading" a library just sets up
      the memory map, very quickly, and the object will only actually be read
      from disk later, when its pages get used.
      
      Fixes #175.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • Add SCOPE_LOCK(mutex) macro · 30ea16ce
      Nadav Har'El authored
      
      Add a macro SCOPE_LOCK(mutex) which locks the given mutex and unlocks
      it when the scope ends (this uses RAII, so the mutex will correctly get
      unlocked even when the scope is exited via return or exception).
      
      This does the same as C++11's std::lock_guard, but is far less verbose:
      to use std::lock_guard with a mutex m, one needs to do something like
      std::lock_guard<mutex> guard(m);
      where the mutex's type needs to be repeated, and a name needs to be
      invented for the guard, which will likely not be used again. This
      macro makes these things unnecessary, and one just writes
      SCOPE_LOCK(m);
      
      Note that WITH_LOCK(m) { ... } should usually be preferred over SCOPE_LOCK.
      However, SCOPE_LOCK can come in handy in some cases, for example adding a
      lock to a function without reindenting it.
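
      A minimal sketch of how such a macro can be built on top of
      std::lock_guard (the actual OSv definition may differ, e.g. in how the
      guard's name is generated):

        #include <mutex>

        #define SCOPE_LOCK_CONCAT2(a, b) a##b
        #define SCOPE_LOCK_CONCAT(a, b)  SCOPE_LOCK_CONCAT2(a, b)
        #define SCOPE_LOCK(m) \
            std::lock_guard<decltype(m)> \
                SCOPE_LOCK_CONCAT(_scope_lock_, __LINE__)(m)

        std::mutex m;
        int counter;

        void bump()
        {
            SCOPE_LOCK(m);   // unlocked automatically when the scope ends
            ++counter;
        }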
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>