- Jun 01, 2014
-
-
Gleb Natapov authored
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- May 30, 2014
-
-
Glauber Costa authored
We had an implementation of getmntent, but its signature was wrong to begin with. I doubt it was ever working, since it is not such a common function. Even then, we lacked other functions in this family, like setmntent. This patch implements them all. The main matching code comes from musl, but the end result is significantly different. In particular, I didn't really want to mess with creating new virtual proc files, symlinks and the like. Instead, trying to open a dynamic file (like /proc/mounts) or any of its well-known aliases will return a special value. Functions like getmntent will be able to parse that value and act accordingly. The code to show dynamic mounts comes from our old getmntent() implementation, but it is modified here to not use statics for the strings, so we can implement getmntent_r() correctly. Fixes #326

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
CC: Lyor Goldstein <lgoldstein@vmware.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
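As a usage illustration only (a sketch, not code from the patch): the family described above is the standard <mntent.h> interface, and the reentrant getmntent_r() variant can be driven as follows; the file name and buffer size are arbitrary choices for the example.

    #include <mntent.h>
    #include <stdio.h>

    int main()
    {
        // /proc/mounts is one of the dynamic files mentioned above.
        FILE *f = setmntent("/proc/mounts", "r");
        if (!f) {
            return 1;
        }
        struct mntent entry;
        char buf[4096];  // caller-provided scratch space for getmntent_r()
        // getmntent_r() fills caller-owned storage, so no statics are involved.
        while (getmntent_r(f, &entry, buf, sizeof(buf))) {
            printf("%s on %s type %s\n",
                   entry.mnt_fsname, entry.mnt_dir, entry.mnt_type);
        }
        endmntent(f);
        return 0;
    }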
-
Raphael S. Carvalho authored
This patch adds fallocate as a vnode operation, and implements the fallocate() function and the system call.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
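As an illustration of the user-visible API (a sketch, not the patch itself; the file name and size are arbitrary):

    #include <fcntl.h>
    #include <unistd.h>

    // Preallocate 1 MiB for "data.bin" so later writes within that range
    // cannot fail with ENOSPC.
    int preallocate_example()
    {
        int fd = open("data.bin", O_CREAT | O_RDWR, 0644);
        if (fd < 0) {
            return -1;
        }
        // mode 0 is the default behavior: allocate and extend the file size
        // if needed.
        int ret = fallocate(fd, 0, 0, 1024 * 1024);
        close(fd);
        return ret;
    }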
-
- May 29, 2014
-
-
Tomasz Grabiec authored
OpenJDK's SocketInputStream.read() is calling poll() to support timeout. The call site looks like this:

    pfd.fd = s;
    pfd.events = POLLIN | POLLERR;
    poll(&pfd, 1, timeout);

Our current implementation of poll() is quite complex because it needs to handle polling on many files. It also allocates memory in several places:
- in poll() due to std::vector
- in poll_install()
- in net_channel::add_poller
- in net_channel::del_poller

poll() on a single socket can be greatly simplified and we can avoid memory allocation completely. This change adds special casing for that. It reduces the allocation rate by half in the tomcat benchmark with 256 connections.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
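For reference, a self-contained form of the single-descriptor pattern the fast path targets (a sketch; the wrapper name is made up, not an OSv or OpenJDK symbol):

    #include <poll.h>

    // Wait up to timeout_ms for data or an error on one socket.
    // Returns >0 if ready, 0 on timeout, <0 on error.
    int wait_readable(int s, int timeout_ms)
    {
        struct pollfd pfd;
        pfd.fd = s;
        pfd.events = POLLIN | POLLERR;
        pfd.revents = 0;
        // Exactly one pollfd: the case special-cased above to avoid the
        // std::vector allocation and the poller add/remove costs.
        return poll(&pfd, 1, timeout_ms);
    }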
-
Nadav Har'El authored
Before commit 202b2ccc, the scheduler was responsible for saving the FPU state, so we needed to know whether the scheduler itself uses it or not. Now that the FPU state is always saved at interrupt time, we no longer care whether or not the scheduler uses the FPU, so we can drop this flag. Also drop the optional "pseudo-float" (integer-based floating point operations) support from the scheduler. This never had any real advantage over actual floating point, and now that we save the FPU state unconditionally, it makes even less sense to avoid floating point.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Nadav Har'El authored
Before commit 202b2ccc, cpu::reschedule_from_interrupt() needed to know whether we were called from an interrupt (preempt=true) or as an ordinary function (preempt=false), to know whether or not to save the FPU state. As we now save the FPU state in the interrupt code, reschedule_from_interrupt() no longer needs to deal with this, and so this patch removes the unneeded "preempt" parameter from that function. One thing we are losing in this patch is the "sched_preempt" tracepoint, which we previously had when an interrupt caused an actual context switch (not just a reschedule call, but actually switching to a different thread). We still have the "sched_switch" tracepoint which traces all the context switches, which is probably more interesting anyway.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- May 28, 2014
-
-
Avi Kivity authored
Waitqueues fulfill the same role as condition variables, but are much lighter since they rely on an external mutex, which rwlock already provides. Replacing condvars with waitqueues significantly reduces the size of an rwlock, and in addition significantly reduces the number of atomic operations in contended cases.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Avi Kivity authored
Implement facades for rwlock so it can be used with WITH_LOCK, specifying whether we want a read lock or a write lock:

    rwlock my_rwlock;

    WITH_LOCK(my_rwlock.for_read()) {
        read stuff
    }

    WITH_LOCK(my_rwlock.for_write()) {
        write stuff
    }

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
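A minimal sketch of the facade idea, assuming WITH_LOCK-style helpers only need an object exposing lock()/unlock(); the class below wraps std::shared_mutex purely for illustration and is not the OSv rwlock:

    #include <mutex>
    #include <shared_mutex>

    class demo_rwlock {
    public:
        class read_facade {
        public:
            explicit read_facade(std::shared_mutex& m) : _m(m) {}
            void lock()   { _m.lock_shared(); }
            void unlock() { _m.unlock_shared(); }
        private:
            std::shared_mutex& _m;
        };
        class write_facade {
        public:
            explicit write_facade(std::shared_mutex& m) : _m(m) {}
            void lock()   { _m.lock(); }
            void unlock() { _m.unlock(); }
        private:
            std::shared_mutex& _m;
        };
        read_facade for_read()   { return read_facade(_m); }
        write_facade for_write() { return write_facade(_m); }
    private:
        std::shared_mutex _m;
    };

    // Usage with a scoped guard standing in for WITH_LOCK:
    //   demo_rwlock l;
    //   auto r = l.for_read();
    //   std::lock_guard<decltype(r)> guard(r);
    //   ... read stuff ...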
-
Avi Kivity authored
Since we cannot guarantee that the fpu will not be used in interrupts and exceptions, we must save it earlier rather than later. This was discovered with an fpu-based memcpy, but can be triggered in other ways.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- May 27, 2014
-
-
Pekka Enberg authored
The code in <api/x86/reloc.h> is not used. Avi says it's dead code that originates from musl.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Gleb Natapov authored
Not all pages in the write page cache are dirty, since msync() may have already written some of them back. Check for that and do not write back clean pages needlessly.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
The current msync implementation scans all pages in the msync area via the page tables to find dirty pages, but the pagecache already knows which pages are potentially dirty for a given file/offset range, so it can check whether they are dirty via the rmap.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
As Avi pointed out, the ptep_flush and ptep_accessed classes can be replaced by a general map-reduce mechanism with customizable map and reduce functions. This patch implements that.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
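A generic sketch of the map-reduce idea over a range of ptes (illustrative only; the pte type, its methods, and the traversal are stand-ins, not the OSv page-table walker):

    #include <vector>

    template <typename Pte, typename Map, typename Reduce, typename Result>
    Result map_reduce_ptes(std::vector<Pte>& ptes, Map map, Reduce reduce, Result acc)
    {
        for (auto& pte : ptes) {
            // "map" inspects or modifies a single pte (e.g. clear its accessed
            // bit); "reduce" folds the per-pte result into the accumulator.
            acc = reduce(acc, map(pte));
        }
        return acc;
    }

    // Hypothetical use: did any pte have its accessed bit set?
    //   bool any = map_reduce_ptes(ptes,
    //       [](pte_t& p) { return p.test_and_clear_accessed(); },  // stand-in method
    //       [](bool a, bool b) { return a || b; },
    //       false);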
-
- May 26, 2014
-
-
Glauber Costa authored
This is mainly a wrapper around fcntl, so it should work to the extent that fcntl works and fail gracefully where it doesn't. Code is imported from musl with some modifications to allow it to compile as C++ code.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
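A sketch of the general wrapping idea, expressing lockf() operations as fcntl() record locks (simplified relative to the imported musl code, which handles the corner cases more carefully):

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    int lockf_sketch(int fd, int op, off_t size)
    {
        struct flock l = {};
        l.l_type = F_WRLCK;
        l.l_whence = SEEK_CUR;  // the locked region starts at the current offset
        l.l_len = size;
        switch (op) {
        case F_ULOCK:
            l.l_type = F_UNLCK;
            return fcntl(fd, F_SETLK, &l);
        case F_LOCK:
            return fcntl(fd, F_SETLKW, &l);  // blocking acquire
        case F_TLOCK:
            return fcntl(fd, F_SETLK, &l);   // non-blocking acquire
        case F_TEST:
            if (fcntl(fd, F_GETLK, &l) < 0) {
                return -1;
            }
            if (l.l_type == F_UNLCK || l.l_pid == getpid()) {
                return 0;                    // region not locked by someone else
            }
            errno = EACCES;
            return -1;
        default:
            errno = EINVAL;
            return -1;
        }
    }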
-
Tomasz Grabiec authored
Fixes #308. When the per-cpu-pair ring fills up, the freeing thread is blocked and enters a synchronous object hand-off. That synchronous hand-off is the cause of contention. Instead of having a bounded ring we can use an unordered_queue_mpsc which links the freed objects in a chain. In this implementation push() always succeeds and we don't need to block. In a test which allocates 1K blocks on one CPU and has two threads freeing them on two other CPUs, there is a ~40% improvement in free() throughput. I tested various implementations, based on different queues. Statistics of free/sec reported by misc-free-perf (one sample = one run):

    current:                     avg = 8133055.09  stdev = 118322.06 samples = 5
    ring_spsc<1M> (no blocking): avg = 10442665.98 stdev = 476334.93 samples = 5
    unordered_queue_spsc:        avg = 10258212.69 stdev = 418194.22 samples = 5
    unordered_queue_mpsc:        avg = 11701334.99 stdev = 725299.97 samples = 5

Testing showed that unordered_queue_mpsc performs best in this case. Dead objects are collected by a per-CPU worker thread (same as before). The thread is woken up once every 256 frees. That threshold was chosen so that the behavior would more or less correspond to what was there before.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
-
Tomasz Grabiec authored
There is one allocating thread and two freeing threads. Each thread runs on a different core. The test measures the throughput of objects freed by both threads.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
-
Tomasz Grabiec authored
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
-
Tomasz Grabiec authored
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
-
Tomasz Grabiec authored
It is meant to provide both the speed of a ring buffer and the non-blocking properties of linked queues by combining the two. Unlike ring_spsc, push() is always guaranteed to succeed.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
-
Tomasz Grabiec authored
It's like queue_mpsc with two improvements:

* Consumer and producer links are cache-line aligned to avoid false sharing. I was tempted to apply this to queue_mpsc too, but then discovered that that queue is embedded in a mutex, and doing so would greatly bloat the mutex size, so I gave up on this idea.
* The contract of pop() is relaxed to return items in no particular order, so that we can avoid the cost of reversing the chain.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
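An illustrative sketch of the MPSC "unordered" scheme: producers push by swapping the head of an intrusive chain, and the single consumer detaches the whole chain and walks it without reversing, so items come out in no particular order. Names are stand-ins, not the OSv lockfree classes:

    #include <atomic>

    struct node {
        node* next = nullptr;
    };

    class unordered_mpsc_sketch {
    public:
        // Always succeeds: link the new node in front of the current head.
        void push(node* n) {
            node* old = _head.load(std::memory_order_relaxed);
            do {
                n->next = old;
            } while (!_head.compare_exchange_weak(old, n,
                         std::memory_order_release,
                         std::memory_order_relaxed));
        }

        // Single consumer: detach the whole chain at once and walk it.
        template <typename Fn>
        void drain(Fn fn) {
            node* n = _head.exchange(nullptr, std::memory_order_acquire);
            while (n) {
                node* next = n->next;
                fn(n);   // e.g. return the object to the allocator
                n = next;
            }
        }

    private:
        // A real implementation would also cache-line align producer- and
        // consumer-side state to avoid false sharing, as described above.
        std::atomic<node*> _head{nullptr};
    };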
-
- May 23, 2014
-
-
Gleb Natapov authored
Run a thread in the background to scan the pagecache for accessed pages and propagate them to the ARC. The thread may take anywhere from 0.1% to 20% of CPU time. There is no hard science behind how the current CPU usage is determined; it uses the page access rate to calculate how hard the pagecache should currently be scanned. It can be improved by taking the eviction rate into account too.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 22, 2014
-
-
Glauber Costa authored
preadv, pwritev.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
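As a usage illustration of the added calls (not code from the patch; names and sizes are arbitrary):

    #include <fcntl.h>
    #include <sys/uio.h>
    #include <unistd.h>

    // preadv() reads into several buffers at an explicit file offset without
    // moving the file position; pwritev() is the symmetric write call.
    ssize_t read_header_and_body(int fd, char* hdr, size_t hlen,
                                 char* body, size_t blen, off_t off)
    {
        struct iovec iov[2];
        iov[0].iov_base = hdr;
        iov[0].iov_len = hlen;
        iov[1].iov_base = body;
        iov[1].iov_len = blen;
        return preadv(fd, iov, 2, off);
    }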
-
- May 21, 2014
-
-
Claudio Fontana authored
The thread_control_block structure needs to be different between x64 and AArch64. For AArch64's local-exec implementation, try to match the layout in glibc and the generated code. Do not align the .tdata and .tbss sections with ".tdata : ALIGN(64)", or it will affect the TLS loads.

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
Cc: Glauber Costa <glommer@cloudius-systems.com>
Cc: Will Newton <will.newton@linaro.org>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Gleb Natapov authored
Java uses a PROT_READ page to synchronize threads, so it is worthwhile to catch this as fast as possible without taking vma_list_mutex. The patch does this by checking that the pte is not marked as cow during a write fault on a present pte, since cow or PROT_READ are the only reasons for a pte to be write protected. The problem is that to get the pte we need to walk the page table, but access to the page table is currently protected by vma_list_mutex. The patch uses RCU to free intermediate page table levels, which makes it possible to get to the pte without taking the vma_list_mutex lock.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Gleb Natapov authored
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Gleb Natapov authored
Java sometimes uses accesses to a PROT_NONE region to stop threads, so it is worthwhile to catch this as fast as possible without taking vma_list_mutex. The patch does this by setting the reserved bit on all ptes in a PROT_NONE VMA, which causes the RSVD bit to be set in the page fault error code. It is enough to check that bit to know that the access is to a valid VMA but permission is lacking.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- May 19, 2014
-
-
Tomasz Grabiec authored
memory_order_acquire does not prevent previous stores from moving past the barrier, so if the _migration_lock_counter increment is split into two accesses, this is eligible:

    tmp = _migration_lock_counter;
    atomic_signal_fence(std::memory_order_acquire); // load-load, load-store
    <critical instructions here>
    _migration_lock_counter = tmp + 1; // was moved past the barrier

To prevent this, we need to order previous stores with future loads and stores, which is given only by std::memory_order_seq_cst.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
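A minimal sketch of the resulting pattern, assuming a per-thread counter and compiler-only fences (the names are stand-ins, not the OSv definitions):

    #include <atomic>

    static thread_local unsigned _migration_lock_counter;

    inline void migrate_disable_sketch()
    {
        _migration_lock_counter++;
        // seq_cst signal fence: the compiler may move neither the increment
        // nor any earlier store past this point, closing the hole above.
        std::atomic_signal_fence(std::memory_order_seq_cst);
    }

    inline void migrate_enable_sketch()
    {
        std::atomic_signal_fence(std::memory_order_seq_cst);
        _migration_lock_counter--;
    }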
-
- May 18, 2014
-
-
Avi Kivity authored
Instead of forcing a reload (and a flush) of all variables in memory, use the minimum required barrier via std::atomic_signal_fence().

Reviewed-by: Tomasz Grabiec <tgrabiec@gmail.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Vlad Zolotarov authored
Proper memory ordering should be applied to loads and stores of the _begin field. Otherwise they may be reordered with the corresponding stores and loads to/from the _ring array, and in a corner case when the ring is full this may lead to ring data corruption.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Reported-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
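An illustrative single-producer/single-consumer ring sketch showing the ordering the fix is about; _begin/_end/_ring mirror the names in the message, but the details are not the actual ring code:

    #include <atomic>
    #include <cstddef>

    template <typename T, size_t N>
    class spsc_ring_sketch {
    public:
        bool push(const T& v) {   // producer side only
            size_t end = _end.load(std::memory_order_relaxed);
            // Acquire on _begin: the consumer's release guarantees its read of
            // the slot finished before we may overwrite that slot.
            size_t begin = _begin.load(std::memory_order_acquire);
            if (end - begin == N) {
                return false;     // full
            }
            _ring[end % N] = v;
            _end.store(end + 1, std::memory_order_release);
            return true;
        }

        bool pop(T& v) {          // consumer side only
            size_t begin = _begin.load(std::memory_order_relaxed);
            size_t end = _end.load(std::memory_order_acquire);
            if (begin == end) {
                return false;     // empty
            }
            v = _ring[begin % N];
            // Release on _begin: the read of the slot is complete before the
            // slot is handed back to the producer.
            _begin.store(begin + 1, std::memory_order_release);
            return true;
        }

    private:
        T _ring[N];
        std::atomic<size_t> _begin{0};
        std::atomic<size_t> _end{0};
    };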
-
Tomasz Grabiec authored
These functions are used to demarcate a critical section and should follow a contract which says that no operation inside the critical section may be moved before migrate_disable() or after migrate_enable(). These functions are declared inline, and the compiler could theoretically move instructions across them. Spotted during code contemplation.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- May 16, 2014
-
-
Nadav Har'El authored
thread::current()->thread_clock() returns the CPU time consumed by this thread. A thread that wishes to measure the amount of CPU time consumed by some short section of code will want this clock to have high resolution, but in the existing code it was only updated on context switches, so shorter durations could not be measured with it. This patch fixes thread_clock() to also add the time that has passed since the current time slice started. When running thread_clock() on *another* thread (not thread::current()), we still return a cpu time snapshot from the last context switch - even if the thread happens to be running now (on another CPU). Fixing that case is quite difficult (and will probably require additional memory-ordering guarantees), and anyway not very important: usually we don't need a high-resolution estimate of a different thread's cpu time. Fixes #302.

Reviewed-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
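A sketch of the clock computation described above (the field names are stand-ins, not the OSv thread internals):

    #include <chrono>

    using sched_clock = std::chrono::steady_clock;

    struct thread_clock_sketch {
        sched_clock::duration total_at_last_switch{}; // accumulated at context switch
        sched_clock::time_point running_since{};      // start of the current time slice
        bool is_current = true;

        sched_clock::duration thread_clock() const {
            if (is_current) {
                // High resolution: add the part of the current slice already used.
                return total_at_last_switch + (sched_clock::now() - running_since);
            }
            // For another thread, still return the snapshot taken at the last
            // context switch, as explained above.
            return total_at_last_switch;
        }
    };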
-
Glauber Costa authored
Again, we are currently calling a function every time we disable/enable preemption (actually a pair of functions), where simple mov instructions would do.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
We are heavily using this function to grab the address of the current thread. That means a function call will be issued every time that is done, where a simple mov instruction would do. For objects outside the main ELF, we don't want this to be inlined, since that would mean the resolution would have to go through an expensive __tls_get_addr. So what we do is not present the symbol as inline for them, and make sure the symbol is always generated.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 15, 2014
-
-
Vlad Zolotarov authored
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Vlad Zolotarov authored
This class is the heart of the per-CPU Tx framework. Except for a constructor it has two public methods:

- xmit(buff): push the packet descriptor downstream, either to the HW or into the per-CPU queue if there is contention.
- poll_until(cond): this is the main function of a worker thread that will consume packet descriptors from the per-CPU queue(s) and send them to the output iterator (which is responsible for ensuring their successful delivery to the HW channel).

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
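An interface sketch of the idea described above (type names, members and signatures are stand-ins, not the OSv classes):

    #include <functional>

    template <typename PacketDescriptor, typename PerCpuQueue, typename OutputIt>
    class cpu_xmit_sketch {
    public:
        cpu_xmit_sketch(PerCpuQueue& q, OutputIt out) : _queue(q), _out(out) {}

        // Fast path: try to hand the descriptor straight to the HW; if the
        // device is contended, park it in this CPU's queue instead.
        void xmit(PacketDescriptor buff) {
            if (!try_send_to_hw(buff)) {
                _queue.push(buff);
            }
        }

        // Worker-thread loop: drain the per-CPU queue into the output iterator
        // until the supplied condition tells it to stop.
        void poll_until(std::function<bool()> stop) {
            while (!stop()) {
                PacketDescriptor d;
                while (_queue.pop(d)) {
                    *_out++ = d;  // the iterator is responsible for delivery to the HW channel
                }
            }
        }

    private:
        bool try_send_to_hw(PacketDescriptor&) { return false; }  // placeholder
        PerCpuQueue& _queue;
        OutputIt _out;
    };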
-
Vlad Zolotarov authored
This class will represent a single per-CPU Tx queue. These queues will be subject to merging by the nway_merger class in order to address the reordering issue. Therefore this class will implement the following methods/classes:

- push(val)
- empty()
- front(), which will return an iterator that implements:
  - operator *() to access the underlying value
- erase(it), which will pop the front element.

If the producer fails to push a new element into the queue (the queue is full), then it may start "waiting for the queue": request to be woken when the queue is not full anymore (when the consumer frees some entries from the queue):

- push_new_waiter() method.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Vlad Zolotarov authored
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Vlad Zolotarov authored
This class allows efficiently merging n sorted containers. It supports both single-call merging with a merge() method and iterator-like semantics with a pop() method. In both cases the merged stream (or the next element) is streamed to the output iterator.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
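A sketch of an n-way merge over sorted containers using a min-heap of cursors (illustrative only, not the OSv nway_merger):

    #include <queue>
    #include <vector>

    template <typename Container, typename OutputIt>
    void nway_merge_sketch(std::vector<Container*>& inputs, OutputIt out)
    {
        using It = typename Container::iterator;
        struct cursor { It pos, end; };
        auto greater = [](const cursor& a, const cursor& b) {
            return *a.pos > *b.pos;      // min-heap on the current head element
        };
        std::priority_queue<cursor, std::vector<cursor>, decltype(greater)>
            heap(greater);
        for (auto* c : inputs) {
            if (!c->empty()) {
                heap.push({c->begin(), c->end()});
            }
        }
        while (!heap.empty()) {
            cursor cur = heap.top();
            heap.pop();
            *out++ = *cur.pos;           // emit the smallest head element
            if (++cur.pos != cur.end) {
                heap.push(cur);          // keep merging from this container
            }
        }
    }

    // Hypothetical use (requires <list>, <iterator>, <iostream>):
    //   std::list<int> a{1, 4, 7}, b{2, 5}, c{3, 6};
    //   std::vector<std::list<int>*> inputs{&a, &b, &c};
    //   nway_merge_sketch(inputs, std::ostream_iterator<int>(std::cout, " "));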
-
Vlad Zolotarov authored
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Vlad Zolotarov authored
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-