Commits · d96020e3170465d6500d6ab6f6ba36f8c61f4047 · Verlässliche Systemsoftware / projects / osv

Jun 11, 2013

Add missing change to lfmutex.cc · d96020e3
Nadav Har'El authored 11 years ago
```
Sorry, forgot one hunk in "git add -p" :(
```
d96020e3

lock-free mutex: change and clarify the role of depth and owner · 99b477dc

Nadav Har'El authored 11 years ago

The way "owner" and "depth" were used in lockfree::mutex was messy.
Ideally, neither should be needed if we implemented a non-recursive
mutex, but following the design of ::mutex, we (re)used "owner" also as
a marker that a thread was waken to have the lock (and it's not a
spurious wake).

After this patch, owner and depth are used in lockfree::mutex *only*
for implementing a recursive mutex, and building a non-recursive
mutex should be as simple as dropping these two variables.

In more detail:

1. "owner" is no longer used to tell a woken up thread that the wake
wasn't spurious. Instead, zero the thread in the wait-record. This
is a familiar idiom, which we already used a few times before.

2. "depth" isn't an atomic variable, so it should only be read by the
same thread which set it, and this wasn't the case previously. Now,
depth is only ever written (set to 1, incremented or decremented)
or read by the lock-holding thread - and not the lock releasing thread.

3. "owner" needs to be an atomic variable - a non-lock-holding thread
needs to read it and recognize it isn't holding the lock - but it
doesn't need any special memory ordering with other variables, so
should always be accessed with "relaxed" memory ordering.

99b477dc

sched::thread - fix very rare join() hang · cf4c46c4

Nadav Har'El authored 11 years ago

Fixed a very rare hang in sched::thread::join():

thread::complete() included the following code:

    _status.store(status::terminated);
    if (_joiner) {
        _joiner->wake();
    }

If we are preempted right after setting status to "terminated", but
before calling wake(), this thread will never be scheduled again (it will
remain in the terminated status forever), and will never call wake() -
so the join()ing thread may just wait forever.

I saw this happening in a test case that started and joined millions of
threads, and eventually the join() hangs.

The solution is to enclose the above lines with preempt_disable()/
preempt_enable().

cf4c46c4

wake(): Don't miss a preemption opportunity · aee17ba4

Nadav Har'El authored 11 years ago

wake() normally calls schedule(), but doesn't do so if preemption is
disabled. So we should mark need_reschedule = true, to suggest that
schedule() can be called when preemption is later enabled.

aee17ba4

x64: prevent nested exceptions from corrupting the stack · 1dbddc44

Avi Kivity authored 11 years ago

Due to the need to handle the x64 red zone, we use a separate stack for
exceptions via the IST mechanism. This means that a nested exception will
reuse the parent exception's stack, corrupting it. It is usually very hard
to figure out the root cause when this happens.

Prevent this by setting up a separate stack for nested exceptions, and
aborting immediately if a nested exception happens.

1dbddc44

Jun 10, 2013

elf: add support for STT_IFUNC · 5948f0d7

Avi Kivity authored 11 years ago

Used for implementing support for indirect functions referenced from shared
libraries.

5948f0d7

Jun 09, 2013

logger: changed debugging calls to use tprintf_X variants · 41360613
Guy Zana authored 11 years ago

41360613

logger: remove not very useful logger tags · f806a014

Guy Zana authored 11 years ago

controlling the logger output in tests is simplistic, it is enough to
change the severity level to logger_error and all logging messages
will appear.

f806a014

trace: add tracepoint to mutex_lock, mutex_unlock · bf7fe5e7
Guy Zana authored 11 years ago

bf7fe5e7
sched: add tracepoint for preemption events · 76dae193
Avi Kivity authored 11 years ago

76dae193

lock-free mutex: add C API for lockfree::mutex methods · d3c156d8

Nadav Har'El authored 11 years ago

Add some extern "C" versions of the lockfree::mutex methods. They will be
necessary for providing the lockfree::mutex type to C code - as you'll see
in later patches, C code will see an opaque type, a byte array, and will
call these functions to operate on it.

d3c156d8

lock-free mutex: Avoid including <sched.hh> in <lockfree/mutex.hh> · 874b10ee

Nadav Har'El authored 11 years ago

Do not include <sched.hh> in <lockfree/mutex.hh>.

Including <sched.hh> creates annoying dependency loops when we (in a
later patch) replace <osv/mutex.h> by <lockfree/mutex.hh>, and some header
files included by <sched.hh> themselves use mutexes, so they include
<osv/mutex.h>. This last include does nothing (because of the include guard)
but the compiler never finished reading osv/mutex.h (it was only in its
beginning, when it included sched.hh) so the inner-included code lacks the
definitions it assumes after including mutex.h.

874b10ee

lockfree mutex: add owned() and getdepth() methods · 1ed9c982

Nadav Har'El authored 11 years ago

Add to lockfree::mutex the simple owned() and getdepth() methods which
existed in ::mutex and were used in a few places - so we need these
methods to switch from ::mutex to lockfree::mutex.

1ed9c982

lockfree mutex: fix wait/wake bug · 9bcf790a

Nadav Har'El authored 11 years ago

When I developed lockfree mutex, the scheduler, preemption, and related code
still had a bunch of bugs, so I resorted to some workarounds that in hindsite
look unnecessary, and even wrong.

When it seemed that I can only wake() a thread in wait state, I made an
effort to enter the waiting state (with "wait_guard") before adding the
thread to the to-awake queue, and then slept with schedule(). The biggest
problem with this approach was that a spurious wake(), for any reason, of
this thread, would cause us to end the lock() - and fail on an assert that
we're the owners of the lock - instead of repeating the wait. When switching
to lockfree mutex, the sunflow benchmark (for example) would die on this
assertion failure.

So now I replaced this ugliness with our familiar idiom, wait_until().
The thread is in running state for some time after entering queue, so
it might be woken when not yet sleeping and the wake() will be ignored -
but this is fine because part of our protocol is that the wake() before
waking also sets "owner" to the to-be-woken thread, and before sleeping
we check if owner isn't already us.

Also changed the comment on "owner" to say that it is not *just* for
implementing a recursive mutex, but also nessary for the wakeup protocol.

9bcf790a

Jun 06, 2013

msix: provide high priority handler when registering interrupt · 66066b07

Guy Zana authored 11 years ago

we have to disable virio interrupts before msix EOI so disabling
must be done in the ISR handler context. This patch adds an std::function
isr to the bindings.

references to the rx and tx queues are saved as well (_rx_queue and _tx_queue),
so they can be used in the ISR context.

this patch reduces virtio net rx interrupts by a factor of 450.

66066b07

trace: relax unique ID requirement · aec5d9df

Avi Kivity authored 11 years ago

Place the tracepointv type in an anonymous namespace.  This makes every
translation unit have its own unique tracepoint types, so we only need
to ensure uniqueness within a source file.

Use the type's type_info to select the correct patch sites.

Idea from Nadav.

aec5d9df

Jun 05, 2013

trace: improve fast path · b03979d9

Avi Kivity authored 11 years ago

When a tracepoint is disabled, we want it to have no impact on running code.

This patch changes the fast path to be a single 5-byte nop instruction. When
a tracepoint is enabled, the nop is patched to a jump instruction to the
out-of-line slow path.

b03979d9

trace: add unique ID for tracepoints · 0102df29

Avi Kivity authored 11 years ago

In order to optimize the fast path of tracepoints, we need to patch
the call sites to skip calling the slow path code completely.  In turn,
that requires that each call site be unique -- a separate function.

In the current implementations, tracepoints with the same signature map
to the same type.  It would have been great to use the name as a discriminant
(tracepoint<"sched_queue", thread*> trace_sched_queue(...);), but C++ does
not support string literals as template arguments.

We could do

  const char* trace_sched_queue_name = "sched_queue";
  tracepoint<trace_sched_queue_name, thread*> trace_sched_queue(...);

but that doubles the code for declaring a tracepoint.  Add a unique ID instead
(and code to verify it is unique).

0102df29

lockfree::mutex functions should not be inline · cb41801a

Nadav Har'El authored 11 years ago

Until now, lockfree::mutex functions were entirely inline, which won't
fly if we want to make it our default mutex. Change them to be out-of-line,
implemented in a new source file core/lfmutex.cc.

This has a slight performance impact - uncontended lock/unlock pair used
to be 17ns, and is now 22ns.

In the future we can get this performance back by making these functions
partially inline (the uncontended case inline, the waiting case in a
separate function), although we'll need to consider the speed/code-size
tradeoff.

cb41801a

Loader: show demangled symbol name on lookup failure · b39289cf

Nadav Har'El authored 11 years ago

When abort()ing on failed symbol lookup, if this is a C++ symbol, also
show its demangled form, which in many case can be more helpful.
Here is an example lookup failure now:

failed looking up symbol _ZN8lockfree5mutex4lockEv (lockfree::mutex::lock())
Aborted

b39289cf

Jun 04, 2013
- fix logging in select · b87d2d26
  Guy Zana authored 11 years ago
  
  b87d2d26
Jun 02, 2013

Fix condvar_wait() bug · 1b53ec56

Nadav Har'El authored 11 years ago

condvar_wait() wrongly dropped the condvar's internal lock too early,
and accessed wr->t outside the lock, meaning that a concurrent wake()
could race with it. This bug was exposed in one of the pipe() tests.

This patch fixes this bug, by holding the internal lock throughout
the execution of condvar_wait(), dropping it temporarily only while
waiting.

1b53ec56

May 31, 2013

select: implement select() · c5f87a53

Guy Zana authored 11 years ago

add simple select() implementation on top of poll(), been tested by
running netperf/iperf3 which is using select() for waiting for reads
and setting timeouts.

haven't been thourougly tested.

c5f87a53

poll: small bug fix, poll should return errors via errno · 2549bcad
Guy Zana authored 11 years ago

2549bcad

May 29, 2013

Say we also implement librt.so.1. · 737f83e9

Nadav Har'El authored 11 years ago

Say we also implement librt.so.1. This is required, for example, by the
Boost libraries (e.g., libboost_system-mt.so.1.50.0). The librt library
isn't actually a separate library on modern Linux - rather all its
traditional functions are now in glibc, and "librt" is merely a filter
on glibc. So no reason not to say we support librt too.

Not to mention that we already implement a bunch of functions that
traditionally resided in librt (nanosleep, sched_yield, sem_*, etc.

737f83e9

May 27, 2013

debug: introduce debug_ll() and use it in abort() · 6ebb582e

Guy Zana authored 11 years ago

the debug() console function is taking a lock before it access the console driver,
it does that by acquiring a mutex which may sleep.

since we want to be able to debug (and abort) in contexts where it's not possible sleep,
such as in page_fault, a lockless debug print method is introduced.

previousely to this patch, any abort on page_fault would cause an "endless" recursive
abort() loop which hanged the system in a peculiar state.

6ebb582e

May 26, 2013

Fix comment · 0ad3e2e0

Nadav Har'El authored 11 years ago

The comment about unlocking the irq_lock was put on the wrong line.
Move it (and rephrase it a bit - the word "release" immediately after
calling an unrelated release() function - is confusing).

0ad3e2e0

Fix two bugs in yield() · 19e52ce6

Nadav Har'El authored 11 years ago

yield() had two bugs - thanks to Avi for pinpointing them:

1. It used runqueue.push_back() to put the thread on the run queue, but
push_back() is a low-level function which can only be used if we're
sure that the item we're pushing has the vruntime suitable for being
*last* on the queue - and in the code we did nothing to ensure this
is the case (we should...). So use insert_equal(), not push_back().

2. It was wrongly divided into two separate sections with interrupts
disabled. The scheduler code is run both at interrupt time (e.g.,
preempt()) and at thread time (e.g., wait(), yield(), etc.) so to
guarantee it does not get called in the middle of itself, it needs
to disable interrupts while working on the (per-cpu) runqueue.
In the broken yield() code, we disabled interupts while adding the
current thread to the run queue, and then again to reschedule.
Between those two critical sections, an interrupt could arrive and
do something with this thread (e.g., migrate it to another CPU, or
preempt it), yet when the interrupt returns yield continues to run
reschedule_from_interrupt which assumes that this thread is still
running, and definitely not on the run queue.

Bug 2 is what caused the crashes in the tst-yield.so test. The fix is
to hold the interrupts disabled throughout the entire yield().
This is easiest done with with lock_guard, which simplifies the flow
of this function.

19e52ce6

sched: avoid unnecessary FPU saving · 947b49ee

Nadav Har'El authored 11 years ago

Because of Linux's calling convention, it should not be necessary to
save the FPU state when a reschedule is caused by a function call.

Because we had a bug and forgot to save the FPU state when calling
a signal handler, and because this signal handler can cause a reschedule,
we had to save the FPU on any reschedule. But after fixing that bug, we
no longer need these unnecessary FPU saves.

The "sunflow" benchmark still runs well after this patch.

947b49ee

elf: add support for IRELATIVE relocations · e8c62c5e

Avi Kivity authored 11 years ago

This are used to support ifunc functions, which are resolved at load-time
based on cpu features, rather than at link time.

e8c62c5e

sched: fix preempt_enable() when interrupts are disabled · 84046f23

Avi Kivity authored 11 years ago

If interrupts are disabled, we must not call schedule() even if
the preemption counter says we need to, as the context is not preemption
safe.

This manifested itself in a wake() within a timer causing a schedule(),
which re-enabled interrupts, which caused further manipulation of the timer
list to occur concurrently with the next interrupt, resulting in corruption.

Fixes timer stress test failure.

84046f23

May 24, 2013
- convert the poll list to <sys/queue.h> · 277598d7
  Christoph Hellwig authored 11 years ago
  
  277598d7
May 22, 2013

This patch implements two functions: · 121d6a7e

Nadav Har'El authored 11 years ago

 1. osv::poweroff(), which can turn off a physical machine or in our case
    tell QEMU to quit.
    The implementation uses ACPI, through the ACPICA library.

 2. osv::hang(), which ceases all computation on all cores, but does not
    turn off the machine. This can be useful if we want QEMU to remain
    alive for debugging, for example.

The two functions are defined in the new <osv/power.hh> header files, and
follow the new API guidelines we discussed today: They are C++-only, and
are in the "osv" namespace.

121d6a7e

Added timeout parameter to semaphore::wait() · 7890e39a

Nadav Har'El authored 11 years ago

Added a timeout parameter to semaphore::wait(), which defaults to no
timeout.

semaphore:wait() is now a boolean, just like trywait(), and likewise can
return false when the semaphore has not actually been decremented but
rather we had a timeout.

Because we need the mutex again after the wait, I replaced the "with_lock"
mechanism by the better-looking lock_guard and mutex parameter to
wait_until.

7890e39a

Drasticly lower overhead of leak detection on running program · 92ff9880

Nadav Har'El authored 11 years ago

Leak detection (e.g., by running with "--leak") used to have a devastating
effect on the performance of the checked program, which although was
tolerable (for leak detection, long runs are often unnecessary), it was
still annoying.

While before this patch leak-detection runs were roughly 5 times slower
than regular runs, after this patch they are only about 40% slower than
a regular run! Read on for the details.

The main reason for this slowness was a simplistic vector which was used to
keep the records for currently living allocations. This vector was linearly
searched both for free spots (to remember new allocations) and for specific
addresses (to forget freed allocations). Because this list often grew to a
hundred thousand of items, it became incredibly slow and slowed down the
whole program. For example, getting a prompt from cli.jar happened in 2
seconds without leak detection, but in 9 seconds with leak detection.

A possible solution would have been to use an O(1) data structure, such as
a hash table. This would be complicated by our desire to avoid frequent
memory allocation inside the leak detector, or our general desire to avoid
complicated stuff in the leak detector because they always end leading to
complicated deadlocks :-)

This patch uses a different approach, inspired by an idea by Guy.

It still uses an ordinary vector for holding the records, but additionally
keeps for each record one "next" pointer which is used for maintaining two
separate lists of records:

1. A list of free records. This allows a finding a record for a new
allocation in O(1) time.

2. A list of filled records, starting with the most-recently-filled record.
When we free(), we walk this list and very often finish very quickly,
because malloc() closely followed by free() are very common.
Without this list, we had to walk the whole vector filled with ancient
allocations and even free records, just to find the most recent
allocation.

Two examples of the performance with and without this patch:

1. Getting a prompt from cli.jar takes 2 seconds without leak detection,
9 seconds with leak detection before this patch, and 3 seconds with this
patch.

2. The "sunflow" benchmark runs 53 ops/second without leak detection,
which went down to 10 ops/second with leak detection before this patch,
and after this patch - 33 ops/second.

I verified (by commenting out the search algorithm and always using the
first item in the vector) that the allocation record search is no longer
having any effect on performance, so it is no longer interesting to replace
this code with an even more efficient hash table. The remaining slowdown is
probably due to the backtrace() operation and perhaps also the tracker lock.

92ff9880

semaphore: convert to an intrusive list · 5c2b2f94
Avi Kivity authored 11 years ago
```
Intrusive lists are faster since they require no allocations.
```
5c2b2f94

semaphore: remove indirection accessing internal mutex · b9953a90

Avi Kivity authored 11 years ago

Previously, the mutex was stored using a pointer to avoid overflowing
glibc's sem_t.  Now we no longer have this restriction, drop the indirection.

b9953a90

semaphore: add support for multiple units in wait() and post() · c240daa3
Avi Kivity authored 11 years ago
```
Use Nadav's idea of iterating over the list and selecting wait records
that fit the available units.
```
c240daa3
semaphore: extract generic semaphore from pthread semaphore implementation · 747ff478
Avi Kivity authored 11 years ago
```
No code changes.
```
747ff478

May 21, 2013

Add forgotten break · 8ffe45bf

Nadav Har'El authored 11 years ago

In the allocation tracker, not only did I use a dog-slow linear search,
I forgot to stop on the first empty spot, and actually used the last
empty spot... Add the missing break, which made leak detection 10% faster.

A better implementation would be preferable, but this is a low hanging
fruit

8ffe45bf