- May 21, 2014
Gleb Natapov authored
Java sometimes uses accesses to a PROT_NONE region to stop threads, so it is worthwhile to be able to catch this as fast as possible without taking vma_list_mutex. The patch does it by setting the reserved bit on all PTEs in a PROT_NONE VMA, which causes the RSVD bit to be set in the page fault error code. Checking that bit is enough to know that the access is to a valid VMA but lacks permission.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
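A minimal sketch of the fast path this enables, assuming x86 page-fault error-code bits; the constant and function names are illustrative, not OSv's actual identifiers:

    // Architectural x86 page-fault error-code bits (names are ours):
    constexpr unsigned PF_PRESENT  = 1u << 0; // fault on a present page
    constexpr unsigned PF_WRITE    = 1u << 1; // fault caused by a write
    constexpr unsigned PF_RESERVED = 1u << 3; // a reserved PTE bit was set

    // If RSVD is set, the faulting address is known to lie inside a
    // PROT_NONE VMA, so we can report a protection fault right away
    // without taking vma_list_mutex to look the VMA up.
    bool is_prot_none_fast_path(unsigned error_code)
    {
        return (error_code & PF_RESERVED) != 0;
    }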
-
- May 19, 2014
Tomasz Grabiec authored
memory_order_acquire does not prevent previous stores from moving past the barrier, so if the _migration_lock_counter increment is split into two accesses, the following reordering is possible:

    tmp = _migration_lock_counter;
    atomic_signal_fence(std::memory_order_acquire); // load-load, load-store
    <critical instructions here>
    _migration_lock_counter = tmp + 1; // was moved past the barrier

To prevent this, we need to order previous stores with future loads and stores, which is given only by std::memory_order_seq_cst.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
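A hedged sketch of the resulting pattern; the counter name follows the commit message, but the surrounding code is illustrative rather than OSv's actual migration-lock implementation:

    #include <atomic>

    static thread_local unsigned _migration_lock_counter;

    inline void migrate_disable()
    {
        _migration_lock_counter++;
        // seq_cst is the only ordering that also keeps *prior stores* from
        // moving below the fence, in addition to ordering later loads/stores.
        std::atomic_signal_fence(std::memory_order_seq_cst);
    }

    inline void migrate_enable()
    {
        std::atomic_signal_fence(std::memory_order_seq_cst);
        _migration_lock_counter--;
    }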
-
- May 18, 2014
Avi Kivity authored
Instead of forcing a reload (and a flush) of all variables in memory, use the minimum required barrier via std::atomic_signal_fence().
Reviewed-by: Tomasz Grabiec <tgrabiec@gmail.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Vlad Zolotarov authored
Proper memory ordering should be applied to the loads and stores of the _begin field. Otherwise they may be reordered with the corresponding stores and loads to/from the _ring array, and in a corner case when the ring is full this may lead to ring data corruption.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Reported-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
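A minimal single-producer/single-consumer ring sketch showing the ordering in question; this is an illustration, not OSv's actual ring buffer code:

    #include <atomic>
    #include <cstddef>

    template <typename T, std::size_t N>
    class spsc_ring_sketch {
        static_assert(N && (N & (N - 1)) == 0, "N must be a power of two");
        T _ring[N];
        std::atomic<std::size_t> _begin{0}; // consumer index
        std::atomic<std::size_t> _end{0};   // producer index
    public:
        bool push(const T& v) {
            std::size_t end = _end.load(std::memory_order_relaxed);
            // acquire: the write to _ring[end] below must not be reordered
            // before we have observed that the consumer freed the slot
            if (end - _begin.load(std::memory_order_acquire) == N)
                return false;               // full
            _ring[end % N] = v;
            _end.store(end + 1, std::memory_order_release);
            return true;
        }
        bool pop(T& v) {
            std::size_t begin = _begin.load(std::memory_order_relaxed);
            if (_end.load(std::memory_order_acquire) == begin)
                return false;               // empty
            v = _ring[begin % N];
            // release: the read of _ring[begin] must complete before we
            // publish the slot as free by advancing _begin
            _begin.store(begin + 1, std::memory_order_release);
            return true;
        }
    };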
-
Tomasz Grabiec authored
These functions are used to demarcate a critical section and should follow a contract which says that no operation inside the critical section may be moved before migrate_disable() or after migrate_enable(). These functions are declared inline and the compiler could theoretically move instructions across them. Spotted during code contemplation.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- May 16, 2014
Nadav Har'El authored
thread::current()->thread_clock() returns the CPU time consumed by this thread. A thread that wishes to measure the amount of CPU time consumed by some short section of code will want this clock to have high resolution, but in the existing code it was only updated on context switches, so shorter durations could not be measured with it. This patch fixes thread_clock() to also add the time that passed since the current time slice started. When running thread_clock() on *another* thread (not thread::current()), we still return a CPU time snapshot from the last context switch - even if the thread happens to be running now (on another CPU). Fixing that case is quite difficult (and will probably require additional memory-ordering guarantees), and anyway not very important: usually we don't need a high-resolution estimate of a different thread's CPU time. Fixes #302.
Reviewed-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
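A hedged sketch of the behaviour described; field and type names are illustrative, not how OSv's scheduler actually keeps this state:

    #include <chrono>

    using sched_clock = std::chrono::steady_clock; // stand-in for OSv's clock

    struct thread_clock_sketch {
        std::chrono::nanoseconds _total_cpu;      // accumulated at context switch
        sched_clock::time_point  _running_since;  // start of the current slice
        bool                     _is_current;     // is this the calling thread?

        std::chrono::nanoseconds thread_clock() const {
            auto t = _total_cpu;
            if (_is_current) {
                // high resolution for the calling thread: include the part of
                // the current time slice that has already elapsed
                t += std::chrono::duration_cast<std::chrono::nanoseconds>(
                         sched_clock::now() - _running_since);
            }
            // for other threads we return the snapshot taken at the last
            // context switch, as explained above
            return t;
        }
    };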
-
Glauber Costa authored
Again, we are currently calling a function every time we disable/enable preemption (actually a pair of functions), where simple mov instructions would do.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
We are heavily using this function to grab the address of the current thread. That means a function call is issued every time that is done, where a simple mov instruction would do. For objects outside the main ELF, we don't want this to be inlined, since that would mean the resolution would have to go through an expensive __tls_get_addr. So what we do is not present the symbol as inline to them, and make sure the symbol is always generated.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
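A hedged sketch of the idea; the macro and symbol names are made up and OSv's real header is organized differently:

    struct thread;
    extern __thread thread* s_current_thread;  // hypothetical TLS slot

    #ifdef BUILDING_CORE_ELF  // hypothetical: set when compiling the kernel ELF
    // Inside the main ELF the initial-exec TLS model makes this a single mov,
    // so inlining is a clear win.
    static inline thread* current_thread() { return s_current_thread; }
    #else
    // Objects outside the main ELF would have to resolve the TLS address via
    // the expensive __tls_get_addr, so they get an out-of-line call instead;
    // the out-of-line copy of the symbol is always emitted in the core.
    thread* current_thread();
    #endif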
-
- May 15, 2014
Vlad Zolotarov authored
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Vlad Zolotarov authored
This class is the heart of the per-CPU Tx framework. Except for a constructor it has two public methods:
- xmit(buff): push the packet descriptor downstream, either to the HW or into the per-CPU queue if there is contention.
- poll_until(cond): the main function of a worker thread that will consume packet descriptors from the per-CPU queue(s) and send them to the output iterator (which is responsible for ensuring they are successfully sent to the HW channel).
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
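A sketch of the described interface, with made-up type names and a simplified xmit() policy; it only illustrates the shape of the API, not OSv's implementation:

    #include <cerrno>

    template <typename HwQueue, typename PerCpuQueue, typename Pkt>
    class tx_xmitter_sketch {
        HwQueue&     _hw;
        PerCpuQueue& _cpuq;   // this CPU's software queue
    public:
        tx_xmitter_sketch(HwQueue& hw, PerCpuQueue& q) : _hw(hw), _cpuq(q) {}

        // xmit(buff): hand the descriptor to the HW if its queue is free,
        // otherwise park it in the per-CPU queue for the worker thread.
        int xmit(Pkt* buff) {
            if (_hw.try_lock()) {
                int r = _hw.xmit(buff);
                _hw.unlock();
                return r;
            }
            return _cpuq.push(buff) ? 0 : ENOBUFS;
        }

        // poll_until(cond, out): worker loop draining the per-CPU queue(s)
        // into an output iterator that pushes packets to the HW channel.
        template <typename Cond, typename OutIt>
        void poll_until(Cond cond, OutIt out) {
            while (!cond()) {
                Pkt* p;
                while (_cpuq.pop(p)) {
                    *out++ = p;
                }
            }
        }
    };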
-
Vlad Zolotarov authored
This class will represent a single per-CPU Tx queue. These queues will be subject to merging by the nway_merger class in order to address the reordering issue. Therefore this class will implement the following methods/classes:
- push(val)
- empty()
- front(), which will return an iterator that implements:
  - operator *() to access the underlying value
- erase(it), which will pop the front element.
If the producer fails to push a new element into the queue (the queue is full) then it may start "waiting for the queue": request to be woken when the queue is not full anymore (when the consumer frees some entries from the queue):
- push_new_waiter() method.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Vlad Zolotarov authored
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Vlad Zolotarov authored
This class allows efficiently merging n sorted containers. It allows both single-call merging with a merge() method and iterator-like semantics with a pop() method. In both cases the merged stream/next element is streamed to the output iterator.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
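A minimal heap-based n-way merge sketch; this is not OSv's nway_merger, just an illustration of the single-call merge idea (C++17):

    #include <queue>
    #include <vector>
    #include <list>
    #include <iterator>
    #include <iostream>

    template <typename Container, typename OutIt>
    void nway_merge_sketch(std::vector<Container*>& inputs, OutIt out)
    {
        using It = typename Container::iterator;
        auto cmp = [](std::pair<It, It> a, std::pair<It, It> b) {
            return *a.first > *b.first;   // min-heap on each container's front
        };
        std::priority_queue<std::pair<It, It>,
                            std::vector<std::pair<It, It>>, decltype(cmp)> heap(cmp);
        for (auto* c : inputs)
            if (!c->empty()) heap.push({c->begin(), c->end()});
        while (!heap.empty()) {
            auto [it, end] = heap.top();
            heap.pop();
            *out++ = *it;                 // stream the next smallest element
            if (++it != end) heap.push({it, end});
        }
    }

    // Usage: merge three sorted lists to stdout.
    // std::list<int> a{1,4,7}, b{2,5,8}, c{3,6,9};
    // std::vector<std::list<int>*> in{&a, &b, &c};
    // nway_merge_sketch(in, std::ostream_iterator<int>(std::cout, " "));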
-
Vlad Zolotarov authored
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Vlad Zolotarov authored
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Pawel Dziepak authored
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pawel Dziepak <pdziepak@quarnos.org>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Pawel Dziepak authored
This patch implements lockfree_queue (which is used as incoming_wakeup_queue) so that it doesn't need exchange or compare_exchange operations. The idea is to use a linked list but interleave the actual objects stored in the queue with helper objects (lockless_queue_helper) which are just pointers to the next element. Each object in the queue owns the helper that precedes it (and they are dequeued together) while the last helper, which does not precede any object, is owned by the queue itself. When a new object is enqueued it gains ownership of the last helper in the queue in exchange for the helper it owned before, which now becomes the new tail of the list. Unlike the original implementation, this version of lockfree_queue really requires that there is no more than one concurrent producer and no more than one concurrent consumer. The results of tests/misc-ctxs on my test machine are as follows (the values are medians of five runs):
before:
  colocated: 332 ns
  apart: 590 ns
after:
  colocated: 313 ns
  apart: 558 ns
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pawel Dziepak <pdziepak@quarnos.org>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
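For comparison, a classic exchange-free single-producer/single-consumer linked-list queue with a dummy node looks like the sketch below. OSv's version differs in that it is intrusive and recycles preallocated helper objects instead of allocating nodes, but the single-producer/single-consumer reasoning is the same:

    #include <atomic>
    #include <utility>

    template <typename T>
    class spsc_list_queue_sketch {
        struct node {
            std::atomic<node*> next{nullptr};
            T value{};
        };
        node* _head;   // touched only by the consumer
        node* _tail;   // touched only by the producer
    public:
        spsc_list_queue_sketch() { _head = _tail = new node; } // dummy node
        ~spsc_list_queue_sketch() {
            T tmp;
            while (pop(tmp)) {}
            delete _head;                // the remaining dummy
        }
        void push(T v) {
            node* n = new node;
            n->value = std::move(v);
            _tail->next.store(n, std::memory_order_release); // publish
            _tail = n;
        }
        bool pop(T& v) {
            node* next = _head->next.load(std::memory_order_acquire);
            if (!next) return false;     // empty
            v = std::move(next->value);
            delete _head;                // old dummy; 'next' becomes the dummy
            _head = next;
            return true;
        }
    };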
-
- May 14, 2014
Tomasz Grabiec authored
This introduces a simple timer-based sampling profiler which reuses our tracing infrastructure to collect samples. To enable the sampler from run.py, run it like this:

  $ scripts/run.py ... --sampler [frequency]

Where 'frequency' is an optional parameter overriding the sampling frequency. The default is 1000 (ticks per second). The higher the frequency, the bigger the sampling overhead; values that are too low will hurt profile accuracy. Ad-hoc sampler enabling is planned; the code already takes that into account. To see the profile you need to extract the trace:

  $ trace extract

And then show it like this:

  $ trace prof

All 'prof' options can be applied; for example you can group by CPU:

  $ trace prof -g cpu

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
The sampler will need to set and later restore the value of this option.
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Takuya ASADA authored
lookup_name_demangled() looks up a symbol name, demangles it, then snprintf()s it into a preallocated buffer.
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
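A hedged sketch of what such a helper can look like, built on dladdr() and abi::__cxa_demangle(); OSv resolves the symbol with its own ELF machinery rather than dladdr, so this is only an approximation:

    #include <cxxabi.h>
    #include <dlfcn.h>
    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>

    void lookup_name_demangled_sketch(void* addr, char* buf, std::size_t len)
    {
        Dl_info info{};
        const char* name = (dladdr(addr, &info) && info.dli_sname)
                               ? info.dli_sname : "??";
        int status = 0;
        char* demangled = abi::__cxa_demangle(name, nullptr, nullptr, &status);
        std::snprintf(buf, len, "%s", status == 0 ? demangled : name);
        std::free(demangled);  // __cxa_demangle returns malloc()ed memory (or nullptr)
    }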
-
- May 13, 2014
Glauber Costa authored
While running one of the redis benchmarks, I saw around 23k calls to malloc_large. Among those, ~10-11k were 2-page sized. I managed to track it down to the creation of net channels. The problem here is that the net channel structure is slightly larger than half a page - the maximum size for small object pools. That throws all such allocations into malloc_large. Besides being slow, it also wastes a page for every net channel created, since malloc_large will include an extra page at the beginning of each allocation. This patch fixes this by overloading the operators new and delete for the net channel structure so that we use the more efficient and less wasteful alloc_page.
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
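A sketch of the overload described above; alloc_page()/free_page() stand in for OSv's page allocator and the class body is elided:

    #include <cstddef>
    #include <new>

    void* alloc_page();          // assumed: returns one 4K page
    void  free_page(void* page); // assumed: returns the page to the allocator

    struct net_channel_sketch {
        // ... members adding up to slightly more than half a page ...

        static void* operator new(std::size_t size) {
            // the object is known to fit in a single page, so take one page
            // directly instead of going through malloc_large
            return alloc_page();
        }
        static void operator delete(void* ptr) {
            free_page(ptr);
        }
    };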
-
- May 12, 2014
Glauber Costa authored
Export the shrinker interface to C users.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 08, 2014
Nadav Har'El authored
OSv is currently limited to 64 vCPUs, because we use a 64-bit bitmask for wakeups (see max_cpus in sched.cc). Having exactly 64 CPUs *should* work, but unfortunately didn't, because of a bug: cpu_set::operator++ first incremented the index, and then called advance() to find the following one-bit. We had a bug when the index was 63: we expected operator++ to return 64 (end(), signaling the end of the iteration), but what actually happened was that after it incremented the index to 64, advance() wrongly handled the case idx=64 (1<<64 returns 1, unexpectedly) and moved it back to idx=63. The patch fixes operator++ to not call advance() when idx=64 is reached, so it now works correctly also for idx=63, and booting with 64 CPUs now works. Fixes #234.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
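A simplified sketch of the fixed iterator logic (not OSv's exact cpu_set, which keeps the mask in an atomic):

    #include <cstdint>

    constexpr unsigned max_cpus = 64;

    struct cpu_set_iterator_sketch {
        const std::uint64_t& mask;
        unsigned idx;

        void advance() {
            // find the next one-bit at or after idx; note that evaluating
            // (1ULL << 64) would be undefined behaviour, which is the trap
            // the original advance() fell into once idx reached 64
            while (idx < max_cpus && !(mask & (std::uint64_t(1) << idx))) {
                ++idx;
            }
        }
        cpu_set_iterator_sketch& operator++() {
            ++idx;
            if (idx < max_cpus) {   // the fix: never call advance() with idx == 64
                advance();
            }
            return *this;
        }
    };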
-
Jaspal Singh Dhillon authored
This patch changes the definition of __assert_fail() in api/assert.h, which allows it and other header files which include it (such as debug.hh) to be used in mgmt submodules. Fixes a conflict with the declaration of __assert_fail() in external/x64/glibc.bin/usr/include/assert.h.
Signed-off-by: Jaspal Singh Dhillon <jaspal.iiith@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 07, 2014
Jani Kokkonen authored
The construction of the page_table_root object must happen before priority "mempool", or all the work done in arch-setup will be destroyed by the class constructor. Problem noticed while working on the page fault handler for AArch64.
Signed-off-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
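For illustration only, global constructor ordering of this kind is typically expressed with GCC's init_priority attribute; the type and value below are made up, and OSv's actual priority mechanism may differ:

    struct page_table_root_sketch {
        page_table_root_sketch() { /* install the root page table pointer */ }
    };

    // lower value == constructed earlier; this object must be constructed
    // before anything in the "mempool" priority group, otherwise its
    // constructor would clobber the mappings arch-setup already installed
    page_table_root_sketch page_table_root __attribute__((init_priority(101)));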
-
- May 05, 2014
Tomasz Grabiec authored
The synchronizer allows any thread to block on it until it is unlocked. It is unlocked once count_down() has been called a given number of times.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
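In effect this is a countdown latch. A minimal standard-library sketch of the same contract (OSv's synchronizer is built on its own wait primitives, not std::condition_variable):

    #include <mutex>
    #include <condition_variable>

    class synchronizer_sketch {
        std::mutex _mtx;
        std::condition_variable _cv;
        unsigned _count;
    public:
        explicit synchronizer_sketch(unsigned count) : _count(count) {}

        void count_down() {
            std::lock_guard<std::mutex> lk(_mtx);
            if (_count && --_count == 0) {
                _cv.notify_all();   // unlock every thread blocked in wait()
            }
        }
        void wait() {
            std::unique_lock<std::mutex> lk(_mtx);
            _cv.wait(lk, [this] { return _count == 0; });
        }
    };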
-
Takuya ASADA authored
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Takuya ASADA authored
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
The current tracepoint coverage does not handle all situations well. In particular:
* it does not cover link layer devices other than virtio-net. This change fixes that by tracing in more abstract layers.
* it records incoming packets at enqueue time, whereas sometimes it's better to trace at handling time. This can be very useful when correlating TCP state changes with incoming packets. A new tracepoint was introduced for that: net_packet_handling.
* it does not record the protocol of the buffer. For non-ethernet protocols we should set the appropriate protocol type when reconstructing the ethernet frame for dumping to PCAP.
We now have the following tracepoints:
* net_packet_in - for incoming packets, enqueued or handled directly.
* net_packet_out - for outgoing packets hitting the link layer (not loopback).
* net_packet_handling - for packets which have been queued and are now being handled.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 04, 2014
Tomasz Grabiec authored
Currently a tracepoint's signature string is encoded into a u64, which gives an 8-character limit for the signature. When the signature does not fit into that limit, only the first 8 characters are preserved. This patch fixes the problem by storing the signature as a C string of arbitrary length. Fixes #288.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- May 03, 2014
Gleb Natapov authored
An attempt to get a read ARC buffer for a hole in a file results in a temporary ARC buffer which is destroyed immediately after use. This means that mapping such a buffer is impossible: it is unmapped before the page fault handler returns to the application. The patch solves this by detecting that a hole in a file is accessed and mapping a special zero page instead. It is mapped as COW, so on a write attempt a new page is allocated.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- Apr 29, 2014
Claudio Fontana authored
Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Nadav Har'El authored
This patch implements the sigsetjmp()/siglongjmp() functions. Fixes #241. sigsetjmp() and siglongjmp() are similar to setjmp() and longjmp(), except that they also save and restore the signal mask. Signals are hardly useful in OSv, so we don't necessarily need this signal mask feature, but we still want to implement these functions, if only so that applications which use them by default can run (see issue #241). Most of the code in this patch is from Musl 1.0.0, with a few small modifications - namely, calling our sigprocmask() function instead of a Linux syscall. Note that I copied the x64 version of sigsetjmp.s only; Musl also has this file for ARM and other architectures. Interestingly, we already had block.c in our source tree but didn't use it, and this patch starts to use it.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
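A small usage example of the added API (plain POSIX usage, nothing OSv-specific):

    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>

    static sigjmp_buf env;

    static void handler(int) {
        siglongjmp(env, 1);        // unwind and restore the saved signal mask
    }

    int main() {
        signal(SIGUSR1, handler);
        // second argument != 0: save the current signal mask in 'env'
        if (sigsetjmp(env, 1) == 0) {
            raise(SIGUSR1);        // the handler never returns normally
        } else {
            puts("came back via siglongjmp, signal mask restored");
        }
        return 0;
    }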
-
Glauber Costa authored
Functions to be run when a thread finishes.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
While working with blocked signals and notifications, it would be good to be able to test the current state of another thread's pending signal mask. That machinery exists in sched.cc but isn't exposed. This patch exposes it, together with a more convenient helper for when we are interested in the pointer itself, without dereferencing it.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Calle Wilund authored
Also, move the platform-dependent fast dispatch to the platform arch code tree(s). The patching code is a bit more complex than would seem immediately (or even factually) necessary. However, depending on the CPU, there might be issues with trying to code-patch across cache lines (unaligned). To be safe, we do it with the old 16-bit jmp + write + finish dance. [avi: fix up build.mk]
Signed-off-by: Calle Wilund <calle@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- Apr 28, 2014
Avi Kivity authored
phys_ptr<T>: unique_ptr<> for physical memory
make_phys_ptr(): allocate and initialize a phys_ptr<>
make_phys_array(): allocate a phys_ptr<> referencing an array
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
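A hedged sketch of the idea behind phys_ptr<>: a unique_ptr whose deleter destroys the object and returns its physically contiguous memory. alloc_phys()/free_phys() below are assumed placeholders, not OSv's actual allocator:

    #include <cstddef>
    #include <memory>
    #include <new>
    #include <utility>

    void* alloc_phys(std::size_t size);  // assumed contiguous-physical allocator
    void  free_phys(void* p);

    template <typename T>
    struct phys_delete_sketch {
        void operator()(T* p) const {
            p->~T();        // run the destructor...
            free_phys(p);   // ...then hand the memory back
        }
    };

    template <typename T>
    using phys_ptr_sketch = std::unique_ptr<T, phys_delete_sketch<T>>;

    // make_phys_ptr-style helper: allocate and construct a T in place
    template <typename T, typename... Args>
    phys_ptr_sketch<T> make_phys_ptr_sketch(Args&&... args)
    {
        void* mem = alloc_phys(sizeof(T));
        return phys_ptr_sketch<T>(new (mem) T(std::forward<Args>(args)...));
    }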
-
Takuya ASADA authored
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- Apr 25, 2014
Tomasz Grabiec authored
There was no way to sniff packets going through OSv's loopback interface, and I faced a need to debug in-guest TCP traffic. Packets are logged using the tracing infrastructure. Packet data is serialized as sample data up to a limit, which is currently hardcoded to 128 bytes. To enable capturing of packets, just enable the tracepoints named:
- net_packet_loopback
- net_packet_eth
Raw data can be seen in `trace list` output. Better presentation methods will be added in the following patches. This may also become useful when debugging network problems in the cloud, as we have no ability to run tcpdump on the host there.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
A tracepoint argument which extends 'blob_tag' will be interpreted as a range of byte-sized values. The storage required to serialize such an object is proportional to its size. I need it to implement storage-friendly packet capturing using the tracing layer. It could also be used to capture variable-length strings. The current limit (50 chars) is too short for some paths passed to vfs calls. With variable-length encoding, we could set a more generous limit.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-