- Apr 24, 2014
-
Glauber Costa authored
The jemalloc memory allocator makes intense use of MADV_DONTNEED to flush pages it is no longer using. Respect that advice. Let's keep returning -1 for the remaining cases so we don't fool anybody.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
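For illustration, a minimal sketch of that policy; release_range() is a hypothetical stand-in for the page-freeing path, not OSv's actual API:

    #include <sys/mman.h>
    #include <cerrno>
    #include <cstddef>

    void release_range(void* addr, size_t length); // hypothetical page-freeing helper

    // Honor MADV_DONTNEED; keep failing everything else with EINVAL.
    extern "C" int madvise(void* addr, size_t length, int advice)
    {
        if (advice == MADV_DONTNEED) {
            release_range(addr, length); // drop backing pages; the range stays mapped
            return 0;
        }
        errno = EINVAL; // unimplemented advice: don't pretend we honored it
        return -1;
    }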
-
Glauber Costa authored
MongoDB wants it. In general, I am providing the information that is easy to get, and ignoring the information which is not - with the exception of the process count, which seemed easy enough to implement. This is the kind of thing Mongo does with it:

    2014-04-15T09:54:12.322+0000 [clientcursormon] mem (MB) res:670160 virt:25212
    2014-04-15T09:54:12.323+0000 [clientcursormon] mapped (incl journal view):160
    2014-04-15T09:54:12.324+0000 [clientcursormon] connections:0

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
This is one of the statistics that shows up in /proc/self/stat under Linux, and it is generally interesting for applications. Since we don't have separate kernel and user modes, it is very hard to differentiate between "time spent in userspace" and "kernel time spent on behalf of the process". Therefore, we will always present system time as 0. If we wanted, we could at least clearly account OSv-specific threads as system time, but there is no need to go through the trouble now.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
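A hedged sketch of what such reporting could look like; write_stat_times() is a hypothetical helper, and the tick unit of 100 per second is an assumption matching the usual Linux stat(5) convention:

    #include <cstdio>
    #include <ctime>

    // Report all CPU time as user time and hard-code system time to 0.
    static void write_stat_times(FILE* out)
    {
        timespec ts;
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
        unsigned long long utime = ts.tv_sec * 100ULL + ts.tv_nsec / 10000000ULL;
        fprintf(out, "%llu %llu", utime, 0ULL); // utime stime
    }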
-
Glauber Costa authored
It will be used for procfs compatibility. Applications may want to know how much memory is potentially available through mmap mappings (not necessarily allocated).

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
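A minimal sketch of the kind of bookkeeping this implies; all names here are illustrative, not OSv's actual ones:

    #include <atomic>
    #include <cstddef>

    // Global counter adjusted as mappings come and go, later exported via procfs.
    static std::atomic<size_t> total_mapped{0};

    void account_map(size_t bytes)   { total_mapped.fetch_add(bytes, std::memory_order_relaxed); }
    void account_unmap(size_t bytes) { total_mapped.fetch_sub(bytes, std::memory_order_relaxed); }
    size_t mapped_bytes()            { return total_mapped.load(std::memory_order_relaxed); }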
-
- Apr 22, 2014
-
Glauber Costa authored
While we take pride in having no spinlocks in the system, if an application wants to use them, who are we to deny it this god-given right? Some applications implement spinlocks through the pthread interface, which is what I implement here. We did not have any standard trylock mechanism, so one is provided. Other than that, the interface is pretty trivial, except that it seems to provide some protection against deadlocks. We will just ignore that for the moment and assume a well-behaved application.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
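A minimal sketch of such a spinlock, including the trylock mechanism, built on a portable atomic flag; the spin_* names are illustrative, and the pshared argument and deadlock protection are deliberately omitted:

    #include <atomic>
    #include <cerrno>

    struct spinlock_t { std::atomic_flag locked = ATOMIC_FLAG_INIT; };

    int spin_lock(spinlock_t* s)
    {
        while (s->locked.test_and_set(std::memory_order_acquire)) {
            // busy-wait; a real version would add a CPU pause hint here
        }
        return 0;
    }

    int spin_trylock(spinlock_t* s)
    {
        // return EBUSY, like pthread_spin_trylock, if the lock is already held
        return s->locked.test_and_set(std::memory_order_acquire) ? EBUSY : 0;
    }

    int spin_unlock(spinlock_t* s)
    {
        s->locked.clear(std::memory_order_release);
        return 0;
    }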
-
Glauber Costa authored
MongoDB expects that call and would like it to guarantee allocation of blocks in the file. It does have a fallback, so for the time being I am just providing the symbol. I have opened Issue #265 to track this.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
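A symbol-only stub could look like the sketch below; this illustrates the approach, not necessarily the committed code:

    #include <cerrno>
    #include <sys/types.h>

    // Symbol-only stub: MongoDB has a fallback path when the call fails.
    extern "C" int fallocate(int fd, int mode, off_t offset, off_t len)
    {
        (void)fd; (void)mode; (void)offset; (void)len;
        errno = ENOSYS;
        return -1;
    }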
-
Glauber Costa authored
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Avi Kivity authored
The debug allocator can allocate non-contiguous memory for large requests, but since b7de9871 it uses only one sg entry for the entire buffer. One possible fix is to allocate contiguous memory even under the debug allocator, but in the future we may wish to allow discontiguous allocation when not enough contiguous space is available. So instead we implement a virt_to_phys() variant that takes a range and outputs the physical segments that make it up, and use that to construct a minimal sg list depending on the input.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
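A sketch of how such a range variant might merge pages into segments, assuming an existing single-address virt_to_phys() and 4K pages; names and signatures are illustrative, not the committed interface:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <functional>

    using phys = uint64_t;
    phys virt_to_phys(void* va);        // assumed existing single-address translation
    constexpr size_t page_size = 4096;

    // Walk the buffer page by page, merging physically adjacent pages, and
    // emit one (address, length) segment per contiguous run.
    void virt_to_phys_range(void* va, size_t len,
                            std::function<void(phys, size_t)> emit)
    {
        char* p = static_cast<char*>(va);
        while (len) {
            phys start = virt_to_phys(p);
            size_t off = reinterpret_cast<uintptr_t>(p) & (page_size - 1);
            size_t seg = std::min(len, page_size - off);
            // extend the segment while the next page is physically contiguous
            while (seg < len && virt_to_phys(p + seg) == start + seg) {
                seg += std::min(len - seg, page_size);
            }
            emit(start, seg);
            p += seg;
            len -= seg;
        }
    }

A fully contiguous buffer thus yields a single sg entry, while a fragmented one yields one entry per physical run.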
-
Nadav Har'El authored
Normally, symbol binding in shared objects is lazy, using the PLTGOT mechanism. This means that a symbol is resolved only when first used. This is great because it speeds up object load, and also allows us never to implement symbols which aren't actually used in any real code path. However, as issue #256 shows, symbols which are used in DSOs from a preemption-disabled context cannot be resolved on first use, because symbol resolution may sleep. Two important examples of this are sched::thread::wait() and sched::thread::stop_wait(), both used by wait_until() while it is in preempt_disable. This patch adds the missing support for the standard DT_BIND_NOW tag. This tag can be added to an object with the "-z now" ld option. When an object has this tag, all its symbols are resolved at load time, instead of lazily (when first used). Bug #256 can be fixed by linking tst-mmap.so with "-z now" (this will be a separate patch).

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
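For reference, a sketch of the detection step only; depending on the linker, "-z now" may be expressed as DT_BIND_NOW, DF_BIND_NOW or DF_1_NOW, so a robust check looks at all three. The eager pass itself would then walk the PLT relocations and resolve each one at load time:

    #include <elf.h>

    // Scan the dynamic table for any of the "bind now" markers.
    static bool wants_bind_now(const Elf64_Dyn* dyn)
    {
        bool now = false;
        for (; dyn->d_tag != DT_NULL; ++dyn) {
            if (dyn->d_tag == DT_BIND_NOW ||
                (dyn->d_tag == DT_FLAGS && (dyn->d_un.d_val & DF_BIND_NOW)) ||
                (dyn->d_tag == DT_FLAGS_1 && (dyn->d_un.d_val & DF_1_NOW))) {
                now = true;
            }
        }
        return now;
    }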
-
- Apr 20, 2014
-
Avi Kivity authored
The debug allocator can allocate non-contiguous memory for large requests, but since b7de9871 it uses only one sg entry for the entire buffer. One possible fix is to allocate contiguous memory even under the debug allocator, but in the future we may wish to allow discontiguous allocation when not enough contiguous space is available. So instead we implement a virt_to_phys() variant that takes a range and outputs the physical segments that make it up, and use that to construct a minimal sg list depending on the input.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- Apr 17, 2014
-
Calle Wilund authored
Per-CPU trace buffers. The actual buffer space is kept at roughly the "same" as previously for up to 4 vCPUs; above that, the space used will be higher. Does not handle vCPUs appearing or disappearing at runtime. Trace events are allocated with a "not done" terminator marker, which is finalized when the event is written; this should prevent any partial data from messing up extraction. Fixes #146.

Signed-off-by: Calle Wilund <calle@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
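A sketch of the terminator idea, with an illustrative record layout (not the actual trace buffer format):

    #include <atomic>
    #include <cstdint>
    #include <cstring>

    struct trace_record {
        std::atomic<uint32_t> done; // 0 = "not done" terminator, 1 = finalized
        uint32_t len;
        char payload[56];
    };

    // Extraction skips any record whose 'done' flag is still 0, so a
    // half-written event can never be read back as valid data.
    void write_event(trace_record* r, const void* data, uint32_t len)
    {
        r->done.store(0, std::memory_order_relaxed); // claim slot, mark unfinished
        r->len = len;
        std::memcpy(r->payload, data, len);
        r->done.store(1, std::memory_order_release); // finalize for readers
    }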
-
- Apr 16, 2014
-
Claudio Fontana authored
Also enable core/pagecache.cc in the AArch64 build.

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
-
Jani Kokkonen authored
Signed-off-by: Jani Kokkonen <jani.kokkonen@huawei.com>
[claudio: some fixes]
Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
-
Jani Kokkonen authored
Add the APIs to flush a single processor's TLB or all TLBs in the cluster.

Signed-off-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
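The two flavors typically map to the local and inner-shareable forms of the AArch64 tlbi instruction; a hedged sketch with illustrative function names:

    // "vmalle1" invalidates the local core's TLB entries; the "is"
    // (inner-shareable) variant broadcasts the invalidation to every
    // core in the cluster.
    static inline void tlb_flush_local()
    {
        asm volatile("tlbi vmalle1\n\t"
                     "dsb nsh\n\t"
                     "isb" ::: "memory");
    }

    static inline void tlb_flush_all()
    {
        asm volatile("tlbi vmalle1is\n\t"
                     "dsb ish\n\t"
                     "isb" ::: "memory");
    }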
-
- Apr 15, 2014
-
Pawel Dziepak authored
The GNU_RELRO segment is used to inform the dynamic linker which sections need to be writable only while relocating the ELF file, and can be made read-only afterwards. Usually GNU_RELRO overlaps with a standard LOAD segment that contains readable and writable data, but the ELF file is generated in such a way that it is possible to properly set per-page permissions.

Signed-off-by: Pawel Dziepak <pdziepak@quarnos.org>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
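A sketch of the post-relocation step this enables, assuming a 4K page size; protect_relro() is an illustrative name, not the committed code:

    #include <elf.h>
    #include <sys/mman.h>
    #include <cstdint>

    // After relocations are applied, clamp the RELRO region to page
    // granularity and drop its write permission.
    void protect_relro(const Elf64_Phdr& ph, uintptr_t load_base)
    {
        constexpr uintptr_t mask = ~uintptr_t(4095);
        uintptr_t start = (load_base + ph.p_vaddr) & mask;
        uintptr_t end = (load_base + ph.p_vaddr + ph.p_memsz) & mask;
        if (end > start) {
            mprotect(reinterpret_cast<void*>(start), end - start, PROT_READ);
        }
    }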
-
Pawel Dziepak authored
Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pawel Dziepak <pdziepak@quarnos.org>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Pawel Dziepak authored
.tracepoint_patch_sites contains pointers to locations in the .text section which need to be relocated.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pawel Dziepak <pdziepak@quarnos.org>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- Apr 14, 2014
-
Glauber Costa authored
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- Apr 13, 2014
-
Avi Kivity authored
This is a singly-linked list, suitable for building an RCU hash table. Only a minimal interface is implemented so far. The list exposes two list-like interfaces: one for a mutating owner, the other for read-only RCU access.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
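A sketch of the general shape of such a list; the node layout and for_each() are illustrative, not the committed interface:

    #include <atomic>

    template <typename T>
    struct rcu_slist_node {
        T data;
        std::atomic<rcu_slist_node*> next{nullptr};
    };

    // Read-side traversal; must run inside an RCU read-side critical
    // section so nodes cannot be freed mid-walk. The owner mutates the
    // list under its own lock and publishes nodes with release stores.
    template <typename T, typename F>
    void for_each(rcu_slist_node<T>* head, F visit)
    {
        for (auto* n = head; n; n = n->next.load(std::memory_order_consume)) {
            visit(n->data);
        }
    }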
-
Avi Kivity authored
Add a callback to force all currently queued callbacks to execute. This is useful for global state changes, such as removing a loaded module, and for unit tests.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Avi Kivity authored
This allows shared objects that disable preemption to work correctly, since faulting in executable pages with preemption disabled is not supported.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Avi Kivity authored
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- Apr 08, 2014
-
Avi Kivity authored
When waitqueue::wake_all() wakes up waiting threads, it calls sched::thread::wake_lock() to enqueue those waiting threads on the mutex protecting the waitqueue, thus avoiding needless contention on the mutex. However, if a thread is already waking, we let it wake naturally and acquire the mutex itself. The problem is that the waitqueue code (wait_object<waitqueue>::poll()) examines the wait_record it sleeps on to see if it has woken, and if not, goes back to sleep. Since nothing in that thread-already-awake path clears the wait_record, that is what happens, and the thread stalls until a timeout occurs. Fix by clearing the wait record. As it is protected by the mutex, no extra synchronization is needed. Observed with iperf -P 64 against the guest. Likely triggered by net channels waking up the thread, and then, before it has a chance to wake up, a FIN packet arrives and is processed in the driver thread; so when the packets are consumed the thread is in the waking state.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
The lookup_opcode() function was incorrect: it mishandled DHCP_OPTION_PAD, which does not have a following length byte. Also, the while condition reads the 'op' value, which never changes; this may result in reads beyond the packet size. Since this function is unused, the best fix is to remove it.

Reviewed-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
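For contrast, a sketch of what a correct option walk looks like; find_option() is illustrative, not code from the tree:

    #include <cstdint>

    enum { DHCP_OPTION_PAD = 0, DHCP_OPTION_END = 255 };

    // PAD is a lone byte with no length field; every length read is
    // bounds-checked against the end of the packet.
    const uint8_t* find_option(const uint8_t* p, const uint8_t* end, uint8_t wanted)
    {
        while (p < end) {
            uint8_t op = *p;
            if (op == DHCP_OPTION_PAD) { ++p; continue; }
            if (op == DHCP_OPTION_END) { break; }
            if (p + 2 > end || p + 2 + p[1] > end) { break; } // truncated option
            if (op == wanted) { return p; }
            p += 2 + p[1]; // skip tag byte, length byte, and payload
        }
        return nullptr;
    }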
-
- Apr 07, 2014
-
Claudio Fontana authored
Lots of MMU-related changes require fixups for AArch64.

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- Apr 03, 2014
-
Nadav Har'El authored
For a long time we've had the bug summarized in issue #178, where very rarely but consistently, in various runs such as Cassandra, Netperf and tst-queue-mpsc.so, we saw OSv crashing because of some corruption in the timer list, such as arming an already armed timer, or canceling an already canceled timer. It turns out the problem was the schedule() function, which basically did cpu::current()->schedule(). The problem is that if we're unlucky enough, the thread can be migrated right after calling cpu::current(), but before the irq disable in schedule(), which causes us to do a rescheduling for one CPU on a different CPU - a big faux pas. This can cause us, for example, to mess with one CPU's preemption_timer from a different CPU, causing the timer-related races and crashes we've seen in issue #178. Clearly, we shouldn't have a *method* cpu->schedule() which can operate on any cpu at all. Rather, we should have only a *function* (class-static) cpu::schedule() which operates on the current cpu - and finds that current CPU within the IRQ lock, to ensure (among other things) that the thread cannot get migrated. Another benefit of this patch is that it actually simplifies the code, with one less function called "schedule". Fixes #178.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
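A sketch of the shape of the fix; irq_disable()/irq_enable() and reschedule_from_interrupt() are stand-ins for OSv's actual IRQ-lock and rescheduling machinery:

    void irq_disable();   // hypothetical helpers standing in for the
    void irq_enable();    // real irq_save_lock / WITH_LOCK machinery

    struct cpu {
        static cpu* current();
        void reschedule_from_interrupt();

        static void schedule()  // class-static: always acts on the current CPU
        {
            irq_disable();
            // the CPU is looked up only after interrupts are off, so no
            // migration can intervene between the lookup and the reschedule
            cpu::current()->reschedule_from_interrupt();
            irq_enable();
        }
    };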
-
- Apr 02, 2014
-
Claudio Fontana authored
include/api/x64/atomic.h is not used anywhere.

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
-
Claudio Fontana authored
The implementation includes only the few relocation types that are actually encountered during elf::get_init().

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
-
Claudio Fontana authored
Protect x64 machine code with #ifdef __x86_64__.

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
-
Claudio Fontana authored
The functions debug_early, debug_early_u64 and debug_early_entry can be used very early, before premain.

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
-
Claudio Fontana authored
Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
-
Gleb Natapov authored
This reverts commit 64889277. The structure is no longer used.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
This patch adds the write page cache implementation. On a read fault, pages are initially mapped directly from the ARC, but marked read-only in the page table. On a write fault, pages are copied into a small write page cache for shared mappings, or into anonymous pages for private mappings. Pages are removed from the write cache and written back to the file in FIFO order.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
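A sketch of the fault-handling policy described above; every helper here (arc_lookup, write_cache_copy, anon_copy, map_page) is a hypothetical stand-in for the real ARC and pagecache machinery:

    #include <cstdint>

    enum perm { perm_read = 1, perm_write = 2 };
    struct file_vma { bool shared() const; };

    void* arc_lookup(file_vma& vma, uintptr_t addr); // assumed ARC page lookup
    void* write_cache_copy(void* arc_page);          // FIFO write-back cache copy
    void* anon_copy(void* arc_page);                 // private COW copy
    void map_page(uintptr_t addr, void* page, int perm);

    void handle_file_fault(file_vma& vma, uintptr_t addr, bool write)
    {
        void* arc_page = arc_lookup(vma, addr);
        if (!write) {
            // read fault: map the ARC page read-only; a write faults again later
            map_page(addr, arc_page, perm_read);
        } else if (vma.shared()) {
            map_page(addr, write_cache_copy(arc_page), perm_read | perm_write);
        } else {
            map_page(addr, anon_copy(arc_page), perm_read | perm_write);
        }
    }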
-
Gleb Natapov authored
The file page allocator needs to know the fault type (read/write) and mapping type (shared/private) to handle page allocation correctly. It also needs a way to communicate to a caller that some pages need to be mapped read-only for COW. This patch adds the required functionality without using it yet.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
The page allocator may return a shared page that needs to be COWed on write access. There is currently no way to notify the page mapper that such a page should be mapped read-only. This patch allows the page allocator to control that.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Currently flags are saved for anon_vma, but not for file_vma. Examining those flags in gdb proved to be helpful. Remove _shared, since it is no longer needed: the information is in the flags.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Currently we track the mapping between ARC pages and virtual addresses, which requires us to walk the page table to get to a pte pointer when invalidation is required. Change that to track pointers to the pte directly.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
arch-mmu.hh currently depends on mmu.hh, which makes it impossible to include it in a header that mmu.hh itself includes. mmu.hh is a pretty heavy header containing many definitions and code, so move the bare definitions into a separate header file.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
- Apr 01, 2014
-
Tomasz Grabiec authored
This is a wrapper around timer_task which should be used if atomicity of callback tasks and timer operations is required. The class accepts an external lock to serialize all operations. It provides sufficient abstraction to replace callouts in the network stack. Unfortunately, it requires some cooperation from the callback code (see try_fire()). That's because I couldn't extract in_pcb lock acquisition out of the callback code in the TCP stack: there are other locks taken before it, and doing so _could_ result in lock order inversion problems and hence deadlocks. If we can prove these to be safe, then the API could be simplified. It may also be worthwhile to propagate the lock passed to serial_timer_task down to timer_task to save an extra CAS.
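A hedged usage sketch; the constructor and method signatures here are assumptions based on this description, not verified against the tree:

    #include <chrono>

    // assumed shape of the API: the callback must confirm with try_fire()
    // under the same external lock before doing its work
    mutex lock;
    serial_timer_task retransmit(lock, [&] (serial_timer_task& t) {
        WITH_LOCK(lock) {            // the same external serializing lock
            if (!t.try_fire()) {
                return;              // lost a race with cancel()/reschedule()
            }
            // ... the actual timer work, e.g. a TCP retransmit ...
        }
    });

    WITH_LOCK(lock) {
        retransmit.reschedule(std::chrono::milliseconds(200));
    }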
-
Tomasz Grabiec authored
The design behind timer_task

timer_task was designed to make cancel() and reschedule() scale well with the number of threads and CPUs in the system. These methods may be called frequently and from different CPUs. A task scheduled on one CPU may be rescheduled later from another CPU. To avoid expensive coordination between CPUs, a lockfree per-CPU worker was implemented.

Every CPU has a worker (async_worker) which has a task registry and a thread to execute them. Most of the worker's state may only be changed from the CPU on which it runs. When timer_task is rescheduled, it registers its percpu part in the current CPU's worker. When it is then rescheduled from another CPU, the previous registration is marked as invalid and a new percpu part is registered. When a percpu task fires, it checks whether it is the last registration - only then may it fire.

Because timer_task's state is scattered across CPUs, some extra housekeeping needs to be done before it can be destroyed. We need to make sure that no percpu task will try to access the timer_task object after it is destroyed. To ensure that, we walk the list of registrations of a given timer_task and atomically flip their state from ACTIVE to RELEASED. If that succeeds, the task is now revoked and the worker will not try to execute it. If it fails, the task is in the middle of firing and we need to wait for it to finish. When a per-CPU task is moved to the RELEASED state, it is appended to the worker's queue of released percpu tasks using a lockfree MPSC queue. These objects may later be reused for registrations.
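A sketch of that revocation step, simplified to a single registration; reg_state and try_revoke() are illustrative names, not the committed ones:

    #include <atomic>

    enum class reg_state { ACTIVE, FIRING, RELEASED };

    // Flip ACTIVE -> RELEASED; a failed CAS means the percpu task is firing
    // right now, and the destructor must wait for it to finish instead.
    bool try_revoke(std::atomic<reg_state>& s)
    {
        auto expected = reg_state::ACTIVE;
        return s.compare_exchange_strong(expected, reg_state::RELEASED);
    }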
-