- May 19, 2014
Glauber Costa authored
As Nadav pointed out during review, this macro could use a bit more work, to take a single parameter instead of two. That is what is done in this patch. Unfortunately, just pasting __COUNTER__ doesn't work because of preprocessor rules, and we need some indirection to get it working. Also, visibility "hidden" can go because that is already implied by "static". The problem then becomes the fact that gcc does not really like unreferenced static variables, which is solved by the "used" attribute. From the gcc docs about "used": "This attribute, attached to a variable with the static storage, means that the variable must be emitted even if it appears that the variable is not referenced."
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
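Below is a minimal sketch of the pattern described above; the macro names are hypothetical, not the ones used in OSv. Pasting __COUNTER__ directly would paste the literal token, so a second macro level is needed to expand it before the ## concatenation, and the "used" attribute keeps gcc from warning about (or discarding) the otherwise unreferenced static.

    // Hypothetical names; only the indirection and the "used" attribute are the
    // points the commit describes.
    #define UNIQUE_CONCAT2(a, b) a##b
    #define UNIQUE_CONCAT(a, b) UNIQUE_CONCAT2(a, b)   // expands b (e.g. __COUNTER__) before pasting

    // "static" already gives internal linkage, so visibility("hidden") is redundant;
    // __attribute__((used)) makes gcc emit the variable even though nothing references it.
    #define UNIQUE_STATIC_MARKER(init) \
        static int UNIQUE_CONCAT(unique_marker_, __COUNTER__) \
            __attribute__((used)) = (init)

    UNIQUE_STATIC_MARKER(1);   // expands to e.g. static int unique_marker_0 ... = (1);
    UNIQUE_STATIC_MARKER(2);   // expands to e.g. static int unique_marker_1 ... = (2);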
-
- May 18, 2014
Avi Kivity authored
Take the migration lock for pinned threads instead of a separate check whether they are pinned or not.
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- May 16, 2014
Jani Kokkonen authored
Implement fixup fault and the backtrace functionality, which is its first simple user.
Signed-off-by: Jani Kokkonen <jani.kokkonen@huawei.com>
[claudio: added elf changes to allow lookup and demangling to work]
Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
-
Nadav Har'El authored
thread::current()->thread_clock() returns the CPU time consumed by this thread. A thread that wishes to measure the amount of CPU time consumed by some short section of code will want this clock to have high resolution, but in the existing code it was only updated on context switches, so shorter durations could not be measured with it. This patch fixes thread_clock() to also add the time that has passed since the current time slice started. When running thread_clock() on *another* thread (not thread::current()), we still return a CPU time snapshot from the last context switch - even if the thread happens to be running now (on another CPU). Fixing that case is quite difficult (and will probably require additional memory-ordering guarantees), and anyway not very important: usually we don't need a high-resolution estimate of a different thread's CPU time. Fixes #302.
Reviewed-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
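A self-contained sketch of the idea, using illustrative names rather than OSv's actual scheduler fields: accumulated CPU time is only updated at context switches, so for the running thread the partial, still-running time slice is added on top.

    #include <chrono>

    struct thread_cpu_time_sketch {
        std::chrono::nanoseconds total_cpu_time{0};              // updated at each context switch
        std::chrono::steady_clock::time_point running_since{};   // start of the current time slice
        bool running = false;                                    // true only for thread::current()

        std::chrono::nanoseconds thread_clock() const
        {
            auto t = total_cpu_time;
            if (running) {
                // current thread: include the time elapsed in the current slice
                t += std::chrono::duration_cast<std::chrono::nanoseconds>(
                         std::chrono::steady_clock::now() - running_since);
            }
            return t;   // other threads: snapshot from their last context switch
        }
    };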
-
Glauber Costa authored
Again, we are currently calling a function every time we disable/enable preemption (actually a pair of functions), where simple mov instructions would do.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
We are heavily using this function to grab the address of the current thread. That means a function call will be issued every time that is done, where a simple mov instruction would do. For objects outside the main ELF, we don't want it to be inlined, since that would mean the resolution would have to go through an expensive __tls_get_addr call. So what we do is not present the symbol as inline to them, and make sure the symbol is always generated.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 15, 2014
Pawel Dziepak authored
This patch implements lockfree_queue (which is used as incoming_wakeup_queue) so that it doesn't need exchange or compare_exchange operations. The idea is to use a linked list but interleave the actual objects stored in the queue with helper objects (lockless_queue_helper), each of which is just a pointer to the next element. Each object in the queue owns the helper that precedes it (and they are dequeued together), while the last helper, which does not precede any object, is owned by the queue itself. When a new object is enqueued it gains ownership of the last helper in the queue in exchange for the helper it owned before, which now becomes the new tail of the list. Unlike the original implementation, this version of lockfree_queue really requires that there is no more than one concurrent producer and no more than one concurrent consumer. The results of tests/misc-ctxs on my test machine are as follows (the values are medians of five runs):
before:  colocated: 332 ns, apart: 590 ns
after:   colocated: 313 ns, apart: 558 ns
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pawel Dziepak <pdziepak@quarnos.org>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
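The following is a simplified, self-contained sketch of the scheme, not OSv's actual lockfree_queue; the type names are illustrative. Helpers are plain atomic pointer cells, each queued item owns the helper that precedes it, the queue owns the single trailing helper, and push() merely swaps that ownership.

    #include <atomic>

    struct node;
    using helper = std::atomic<node*>;     // cell pointing at the item that follows it

    struct node {
        int value = 0;
        helper* owned;                     // helper this node currently owns
        helper* next_cell = nullptr;       // helper that follows this node in the list
        node() : owned(new helper(nullptr)) {}
        ~node() { delete owned; }
    };

    class spsc_queue_sketch {
        helper* _head;                     // consumer side: cell holding the next item to pop
        helper* _tail;                     // producer side: trailing, item-less cell (queue-owned)
    public:
        spsc_queue_sketch() : _head(new helper(nullptr)), _tail(_head) {}
        ~spsc_queue_sketch() { delete _tail; }   // the queue only ever owns the trailing cell

        void push(node* n) {               // single producer only
            helper* spare = n->owned;      // n donates its helper to the queue...
            spare->store(nullptr, std::memory_order_relaxed);
            n->next_cell = spare;          // ...it will follow n in the list
            n->owned = _tail;              // ...and n takes ownership of the old tail
            _tail->store(n, std::memory_order_release);   // publish n to the consumer
            _tail = spare;
        }

        node* pop() {                      // single consumer only
            node* n = _head->load(std::memory_order_acquire);
            if (!n) {
                return nullptr;            // queue is empty
            }
            _head = n->next_cell;          // n leaves together with the helper it now owns
            return n;
        }
    };

With one thread calling push() and another calling pop(), the release store in push() pairs with the acquire load in pop(), so the consumer always sees a fully initialized node and no compare_exchange or atomic exchange is needed.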
-
- May 14, 2014
Tomasz Grabiec authored
This introduces a simple timer-based sampling profiler which reuses our tracing infrastructure to collect samples. To enable the sampler from run.py, run it like this:

    $ scripts/run.py ... --sampler [frequency]

where 'frequency' is an optional parameter overriding the sampling frequency. The default is 1000 (ticks per second). The higher the frequency, the bigger the sampling overhead; values that are too low will hurt profile accuracy. Ad-hoc sampler enabling is planned, and the code already takes that into account. To see the profile you need to extract the trace:

    $ trace extract

and then show it like this:

    $ trace prof

All 'prof' options can be applied; for example you can group by CPU:

    $ trace prof -g cpu

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
The sampler will need to set and later restore the value of this option.
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
Tracepoints can be enabled not only via enable_tracepoint(std::string) but also via tracepoint_base::enable(). This change also makes the initialization thread-safe, as it may be called from an arbitrary thread.
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
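A minimal sketch of the thread-safety pattern, with hypothetical names rather than OSv's actual tracepoint_base internals: the one-time setup is guarded by std::once_flag, so enable() may be called safely from any thread.

    #include <atomic>
    #include <mutex>
    #include <string>

    class tracepoint_sketch {
        std::once_flag _init_once;
        std::atomic<bool> _enabled{false};
        std::string _name;

        void lazy_init()
        {
            // allocate buffers, register in a global tracepoint list, etc.
        }
    public:
        explicit tracepoint_sketch(std::string name) : _name(std::move(name)) {}

        void enable()                       // safe to call from any thread
        {
            std::call_once(_init_once, [this] { lazy_init(); });
            _enabled.store(true, std::memory_order_release);
        }

        bool enabled() const
        {
            return _enabled.load(std::memory_order_acquire);
        }
    };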
-
Takuya ASADA authored
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Takuya ASADA authored
lookup_name_demangled() looks up a symbol name, demangles it, then snprintfs it onto a preallocated buffer.
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
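A rough sketch of what such a helper can look like; OSv resolves symbols through its own ELF machinery, so the dladdr()-based lookup here is only a stand-in to keep the example self-contained, and the real signature may differ. The result is demangled when possible and snprintf'd into the caller-supplied buffer together with the offset from the symbol's start.

    #include <cxxabi.h>
    #include <dlfcn.h>
    #include <cstdio>
    #include <cstdlib>
    #include <cstddef>

    void lookup_name_demangled_sketch(void* addr, char* buf, size_t len)
    {
        Dl_info info{};
        if (!dladdr(addr, &info) || !info.dli_sname || !info.dli_saddr) {
            snprintf(buf, len, "%p", addr);            // no symbol found
            return;
        }
        int status = 0;
        char* demangled = abi::__cxa_demangle(info.dli_sname, nullptr, nullptr, &status);
        const char* name = (status == 0 && demangled) ? demangled : info.dli_sname;
        snprintf(buf, len, "%s+%#lx", name,
                 (unsigned long)((char*)addr - (char*)info.dli_saddr));
        free(demangled);                               // __cxa_demangle result is malloc'd
    }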
-
Claudio Fontana authored
An effect of commit 9bbbe9dc is that no output is possible before the prio 'console' initializers have been run. This change makes at least one API available really early (from boot code and premain), and documents the requirements for the early console class regarding the write() method.
Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 12, 2014
Glauber Costa authored
Export the shrinker interface to C users.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 07, 2014
Gleb Natapov authored
A dentry object represents a directory, while a vnode represents a file, so it is better to use the vnode in the page cache.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- May 06, 2014
Pawel Dziepak authored
There is an already defined (but unused before this patch) function that extracts the binding type from Elf_Sym::st_info.
Signed-off-by: Pawel Dziepak <pdziepak@quarnos.org>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Boqun Feng authored
-Werror=unused-function complains that symbol_binding is unused; add the "unused" attribute to mark this function as intentionally unused.
Signed-off-by: Boqun Feng <boqun.feng@linux.vnet.ibm.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
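An illustrative sketch of both points above (the real symbol_binding() in OSv may take different parameters): the ELF binding type lives in the high nibble of st_info, and the "unused" attribute silences -Werror=unused-function in translation units that never call the helper.

    // Sketch only; parameter type and name are assumptions, not the exact OSv code.
    __attribute__((unused))
    static unsigned symbol_binding(unsigned char st_info)
    {
        return st_info >> 4;   // same as ELF64_ST_BIND(): binding is the high nibble
    }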
-
Boqun Feng authored
-Werror=sign-compare complains when comparing (unsigned)level with page_mapper.nr_page_sizes(). Since nr_page_sizes() is meaningful only when it's non-negative, and mmu::nr_page_sizes is unsigned, changing the return types of all nr_page_sizes functions to unsigned is reasonable.
Signed-off-by: Boqun Feng <boqun.feng@linux.vnet.ibm.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 05, 2014
Tomasz Grabiec authored
The current tracepoint coverage does not handle all situations well. In particular:
* it does not cover link layer devices other than virtio-net. This change fixes that by tracing in more abstract layers.
* it records incoming packets at enqueue time, whereas sometimes it's better to trace at handling time. This can be very useful when correlating TCP state changes with incoming packets. A new tracepoint was introduced for that: net_packet_handling.
* it does not record the protocol of the buffer. For non-ethernet protocols we should set the appropriate protocol type when reconstructing the ethernet frame while dumping to PCAP.
We now have the following tracepoints:
* net_packet_in - for incoming packets, enqueued or handled directly.
* net_packet_out - for outgoing packets hitting the link layer (not loopback).
* net_packet_handling - for packets which have been queued and are now being handled.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 03, 2014
Gleb Natapov authored
An attempt to get a read ARC buffer for a hole in a file results in a temporary ARC buffer which is destroyed immediately after use. It means that mapping such a buffer is impossible: it is unmapped before the page fault handler returns to the application. The patch solves this by detecting that a hole in a file is being accessed and mapping a special zero page instead. It is mapped as COW, so on a write attempt a new page is allocated.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 01, 2014
Takuya ASADA authored
The current OSv implementation suppresses most output when not in verbose mode. That may improve boot speed, but suppressing the IP address printout is inconvenient for most users. So I added infof(), which acts like debugf() but prints the string even when not in verbose mode, and used it in dhcp.cc to print the IP address.
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
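A minimal sketch of the distinction; the verbose flag and the output sink are stand-ins for OSv's actual console plumbing, and the real signatures may differ slightly. debugf() is gated on verbose mode, infof() always reaches the console.

    #include <cstdarg>
    #include <cstdio>

    static bool verbose_mode = false;   // stand-in: set from the command line in OSv

    static void vlogf(const char* fmt, va_list ap)
    {
        vfprintf(stderr, fmt, ap);      // stand-in for writing to the console
    }

    void debugf(const char* fmt, ...)   // only prints in verbose mode
    {
        if (!verbose_mode) {
            return;
        }
        va_list ap;
        va_start(ap, fmt);
        vlogf(fmt, ap);
        va_end(ap);
    }

    void infof(const char* fmt, ...)    // always prints, verbose or not
    {
        va_list ap;
        va_start(ap, fmt);
        vlogf(fmt, ap);
        va_end(ap);
    }

    // e.g. in dhcp.cc: infof("DHCP: configured IP address %s\n", ip_string);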
-
- Apr 29, 2014
Glauber Costa authored
Functions to be run when a thread finishes.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
While working with blocked signals and notifications, it would be good to be able to query the current state of another thread's pending signal mask. That machinery exists in sched.cc but isn't exposed. This patch exposes it, together with a more convenient helper for when we are interested in the pointer itself, without dereferencing it.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Calle Wilund authored
Also, move the platform-dependent fast dispatch to the platform arch code tree(s). The patching code is a bit more complex than would seem immediately (or even factually) necessary. However, depending on the CPU, there might be issues with code patching across cache lines (unaligned). To be safe, we do it with the old 16-bit jmp + write + finish dance.
[avi: fix up build.mk]
Signed-off-by: Calle Wilund <calle@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- Apr 28, 2014
Nadav Har'El authored
When a small allocation is requested with large alignment, we ignored the alignment, and as a consequence posix_memalign() or alloc_phys_contiguous_aligned() could crash when it failed to achieve the desired alignment. This is not a common case (usually, size >= alignment, and the new C11 aligned_alloc() even supports only this case), but still it might happen, and we saw it in cloudius-systems/capstan#75. When size < alignment, this patch changes the size so we can achieve the desired alignment. For small alignments, this means setting size=alignment, so for example to get an alignment of 1024 bytes we need at least a 1024-byte allocation. This is a waste of memory, but as these allocations are rare, we expect this to be acceptable. For large alignments, e.g., alignment=8192, we don't need size=alignment, but we do need size to be large enough so we'll use malloc_large() (malloc_large() already supports arbitrarily large alignments). This patch also adds test cases to tst-align.so to test alignments larger than the desired size. Fixes #271 and cloudius-systems/capstan#75.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
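A simplified sketch of the size adjustment described above; the threshold name and value are illustrative assumptions, not OSv's actual allocator constants.

    #include <cstddef>

    // Assumption for the sketch: requests of at least this size go to malloc_large(),
    // which can honor arbitrary alignments by itself.
    static constexpr size_t malloc_large_threshold = 4096;

    size_t adjust_size_for_alignment(size_t size, size_t alignment)
    {
        if (size >= alignment) {
            return size;                        // common case, nothing to do
        }
        if (alignment <= malloc_large_threshold) {
            // small alignment: e.g. a 16-byte request with 1024-byte alignment
            // becomes a 1024-byte allocation
            return alignment;
        }
        // huge alignment (e.g. 8192): size only needs to be big enough to be
        // routed to malloc_large(), not as big as the alignment itself
        return malloc_large_threshold;
    }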
-
- Apr 25, 2014
Tomasz Grabiec authored
There was no way to sniff packets going through OSv's loopback interface, and I faced a need to debug in-guest TCP traffic. Packets are logged using the tracing infrastructure. Packet data is serialized as sample data up to a limit, which is currently hardcoded to 128 bytes. To enable capturing of packets just enable the tracepoints named:
- net_packet_loopback
- net_packet_eth
Raw data can be seen in `trace list` output. Better presentation methods will be added in the following patches. This may also become useful when debugging network problems in the cloud, as we have no ability to run tcpdump on the host there.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Avi Kivity authored
If the rcu threads need memory, let them have it, since they will use it to free even more memory.
Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Avi Kivity authored
malloc() must wait for memory, and since page table operations can allocate memory, it must be able to dip into the reserve pool. free() should indicate it is a reclaimer.
Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Avi Kivity authored
We already have a facility to indicate that a thread is a reclaimer and should be allowed to allocate reserve memory (since that memory will be used to free memory). Extend it to allow indicating that a particular code section is used to free memory, rather than the entire thread.
Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
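A minimal sketch of a scoped "this section frees memory" marker; the class name and the per-thread counter are hypothetical, not OSv's actual API. While the guard is alive, allocations made by this thread may dip into the reserve pool, because the work being done will ultimately release memory.

    // Hypothetical names; a sketch assuming a per-thread nesting counter.
    struct reclaimer_scope {
        reclaimer_scope()  { ++depth(); }
        ~reclaimer_scope() { --depth(); }
        static bool active() { return depth() > 0; }   // queried by the allocator
    private:
        static unsigned& depth()
        {
            static thread_local unsigned d = 0;        // nesting count for this thread
            return d;
        }
    };

    // usage sketch:
    // void evict_pages()
    // {
    //     reclaimer_scope guard;   // this section may allocate from the reserve pool
    //     ...
    // }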
-
Nadav Har'El authored
After the previous patches, when we try to run an executable we cannot read (e.g., a directory - see issue #94), a "struct error" exception will be thrown out of osv::run, and nobody will catch it, so the user will see a somewhat puzzling "uncaught exception" error. With this patch, we catch the read error exception inside osv::run(), and when it happens, just return a normal load failure (nullptr). E.g., trying to run a directory now results in a normal failure:

    $ scripts/run.py -e /
    OSv v0.07-39-g03feb99
    run_main(): cannot execute /. Powering off.

Fixes #94. The osv::run() API currently (before this patch, and also after it) doesn't have any way to say *why* the loading failed - it could have been that the executable was a directory, that it was not an ELF shared object, or that it was a shared object but didn't have a main - in all cases the return value is nullptr. In the future this should probably change.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
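A schematic sketch of the control flow; all names here are illustrative stand-ins, not OSv's loader API. The read failure that used to escape osv::run() as an uncaught exception is caught and reported as an ordinary load failure (nullptr).

    #include <memory>
    #include <stdexcept>
    #include <string>

    struct program_stub {};                     // stand-in for the loaded-program handle

    std::shared_ptr<program_stub> load_object_stub(const std::string& path)
    {
        // stand-in loader: pretend every path is unreadable, as when it names a directory
        throw std::runtime_error("read failed: " + path);
    }

    std::shared_ptr<program_stub> run_sketch(const std::string& path)
    {
        try {
            return load_object_stub(path);
        } catch (const std::exception&) {
            return nullptr;                     // caller reports "cannot execute <path>"
        }
    }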
-
Gleb Natapov authored
No longer used.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Currently vma_list_mutex is used to protect against races between ARC buffer mapping by the MMU and eviction by ZFS. The problem is that the MMU code calls into ZFS with vma_list_mutex held, so on that path all ZFS-related locks are taken after vma_list_mutex. An attempt to acquire vma_list_mutex during ARC buffer eviction, while many of the same ZFS locks are already held, causes a deadlock. It was solved by using trylock() and skipping an eviction if vma_list_mutex cannot be acquired, but it appears that some mmapped buffers are destroyed not during eviction, but after writeback, and this destruction cannot be delayed. It calls for a locking scheme redesign. This patch introduces arc_lock, which has to be held during access to read_cache. It prevents simultaneous eviction and mapping. arc_lock should be the innermost lock held on any code path, and the code is changed to adhere to this rule. For that, the patch replaces the ARC_SHARED_BUF flag with a new b_mmaped field. The reason is that access to the b_flags field is guarded by hash_lock, and it is impossible to guarantee the same order between hash_lock and arc_lock on all code paths. Dropping the need for hash_lock is a nice solution.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Currently page_allocator returns a page to a page mapper and the latter populates a pte with it. Sometimes page allocation and pte population need to appear atomic, though. For instance, in the pagecache case we want to prevent page eviction before the pte is populated, since page eviction clears the pte; but if allocation and mapping are not atomic, the pte can be populated with stale data after eviction. With the current approach a very widely scoped lock is needed to guarantee atomicity. Moving pte population into page_allocator allows for much simpler locking.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
The current code assumes that for the same file and same offset ZFS will always return the same ARC buffer, but this appears not to be the case: ZFS may create a new ARC buffer while an old one is undergoing writeback. It means that we need to track the mapping between file/offset and the mmapped ARC buffer ourselves, which is exactly what this patch is about. It adds a new kind of cached page that holds pointers to an ARC buffer, and stores these pages in a new read_cache map.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
All pagecache functions run under vma_list_lock, so no additional locking is needed.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Unmap a page as soon as possible instead of waiting for max_pages to accumulate. This will allow freeing pages outside of vma_list_mutex in the future.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Useful for debugging cache-related problems.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-