  1. Apr 08, 2014
    • sched: fix waitqueue race causing failure to wake up · 4ef65eb6
      Avi Kivity authored
      
      When waitqueue::wake_all() wakes up waiting threads, it calls
      sched::thread::wake_lock() to enqueue those waiting threads on the mutex
      protecting the waitqueue, thus avoiding needless contention on the mutex.
      However, if a thread is already waking, we let it wake naturally and acquire
      the mutex itself.
      
      The problem is that the waitqueue code (wait_object<waitqueue>::poll())
      examines the wait_record it sleeps on to see if it has woken, and if not,
      goes back to sleep.  Since nothing in that thread-already-waking path clears
      the wait_record, that is exactly what happens, and the thread stalls until a
      timeout occurs.
      
      Fix by clearing the wait record.  As it is protected by the mutex, no
      extra synchronization is needed.
      
      Observed with iperf -P 64 against the guest.  Likely triggered by net channels
      waking up the thread and then, before it has a chance to wake up, a FIN
      packet arriving and being processed in the driver thread; so when the packets
      are consumed the thread is in the waking state.
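      
      A minimal self-contained model of the race and the fix, using standard C++
      primitives instead of OSv's actual sched/waitqueue types (illustrative only,
      not the patched code):
      
        #include <mutex>
        #include <condition_variable>
      
        struct waitqueue_model {
            std::mutex mtx;
            std::condition_variable cv;
            bool waiting = true;            // stands in for the wait_record
      
            void wait() {
                std::unique_lock<std::mutex> lk(mtx);
                // Like wait_object<waitqueue>::poll(): if the record was not
                // cleared, the thread treats the wakeup as spurious and goes
                // back to sleep; this is the stall the patch fixes.
                cv.wait(lk, [&] { return !waiting; });
            }
      
            void wake_all() {
                std::lock_guard<std::mutex> lk(mtx);
                waiting = false;            // the fix: always clear the record,
                                            // even on the already-waking path
                cv.notify_all();
            }
        };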
      
      Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • dhcp: remove lookup_opcode() · abbfc557
      Tomasz Grabiec authored
      
      The lookup_opcode() function is incorrect. It mishandled DHCP_OPTION_PAD,
      which is not followed by a length byte.
      
      Also, the while condition reads an 'op' value which never changes. This may
      result in reads beyond the packet size.
      
      Since this function is unused, the best fix is to remove it.
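      
      For reference, a correct walk over DHCP options must treat PAD (and END) as
      single-byte options and bounds-check every read. A hedged sketch (constants
      per RFC 2132; illustrative, not OSv code):
      
        #include <cstdint>
        #include <cstddef>
      
        constexpr uint8_t DHCP_OPTION_PAD = 0;    // no length byte follows
        constexpr uint8_t DHCP_OPTION_END = 255;  // no length byte follows
      
        // Returns a pointer to the value of 'wanted', or nullptr. Unlike the
        // removed lookup_opcode(), it re-reads the opcode every iteration and
        // checks each read against 'len', so it cannot run past the packet.
        const uint8_t* find_dhcp_option(const uint8_t* options, size_t len,
                                        uint8_t wanted, uint8_t* out_len) {
            size_t i = 0;
            while (i < len) {
                uint8_t op = options[i];
                if (op == DHCP_OPTION_END) {
                    return nullptr;
                }
                if (op == DHCP_OPTION_PAD) {  // a single padding byte
                    ++i;
                    continue;
                }
                if (i + 1 >= len) {           // truncated option header
                    return nullptr;
                }
                uint8_t opt_len = options[i + 1];
                if (i + 2 + opt_len > len) {  // truncated option value
                    return nullptr;
                }
                if (op == wanted) {
                    *out_len = opt_len;
                    return &options[i + 2];
                }
                i += 2 + opt_len;             // skip header and value
            }
            return nullptr;
        }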
      
      Reviewed-by: Vlad Zolotarov <vladz@cloudius-systems.com>
      Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  2. Apr 03, 2014
    • sched: fix rare crashes caused by reschedule running on the wrong CPU · ee92f736
      Nadav Har'El authored
      
      For a long time we've had the bug summarized in issue #178, where very
      rarely but consistently, in various runs such as Cassandra, Netperf and
      tst-queue-mpsc.so, we saw OSv crashing because of some corruption in the
      timer list, such as arming an already armed timer, or canceling an already
      canceled timer.
      
      It turns out the problem was the schedule() function, which basically did
      cpu::current()->schedule(). The problem is that, if we're unlucky enough,
      the thread can be migrated right after calling cpu::current() but before
      the IRQ disable in schedule(), which causes us to run a reschedule for
      one CPU on a different CPU, which is a big faux pas. This can cause us,
      for example, to mess with one CPU's preemption_timer from a different CPU,
      causing the timer-related races and crashes we've seen in issue #178.
      
      Clearly, we shouldn't have a *method* cpu->schedule() at all, which can
      operate on any cpu. Rather, we should have only a *function* (class-static)
      cpu::schedule() which operates on the current cpu, and which determines
      that current CPU within the IRQ lock, to ensure (among other things) the
      thread cannot get migrated.
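      
      A hedged sketch of the shape of the change, with stubbed declarations (the
      real code lives in OSv's scheduler; names here are simplified):
      
        struct irq_lock_t { void lock(); void unlock(); };
        extern irq_lock_t irq_lock;
      
        class cpu {
        public:
            static cpu* current();             // reads the per-CPU pointer
            void reschedule_from_interrupt();  // operates on this CPU's state
      
            // Before: an instance method, invoked as cpu::current()->schedule().
            // The thread could migrate between cpu::current() and the IRQ
            // disable, leaving 'this' pointing at the wrong CPU.
            //
            // After: a class-static function; the current CPU is resolved only
            // once interrupts are off, so migration can no longer intervene.
            static void schedule() {
                irq_lock.lock();
                cpu::current()->reschedule_from_interrupt();
                irq_lock.unlock();
            }
        };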
      
      Another benefit of this patch is that it actually simplifies the code,
      with one less function called "schedule".
      
      Fixes #178.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
  3. Apr 01, 2014
    • core: introduce serial_timer_task · bd179712
      Tomasz Grabiec authored
      This is a wrapper of timer_task which should be used when atomicity of
      callback tasks and timer operations is required. The class accepts an
      external lock to serialize all operations. It provides sufficient
      abstraction to replace callouts in the network stack.
      
      Unfortunately, it requires some cooperation from the callback code
      (see try_fire()). That's because I couldn't extract the in_pcb lock
      acquisition out of the callback code in the TCP stack: there are
      other locks taken before it, and doing so _could_ result in lock order
      inversion problems and hence deadlocks. If we can prove these to be
      safe then the API could be simplified.
      
      It may also be worthwhile to propagate the lock passed to
      serial_timer_task down to timer_task, to save an extra CAS.
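      
      A hypothetical usage sketch, with names modeled on the description above
      rather than the exact API: every timer operation happens under the shared
      external lock, and the callback asks try_fire() whether this invocation is
      still current:
      
        mutex tcp_lock;    // e.g. the in_pcb lock in the TCP stack
      
        serial_timer_task retransmit(tcp_lock, [&](serial_timer_task& t) {
            WITH_LOCK(tcp_lock) {
                if (!t.try_fire()) {
                    return;     // a racing reschedule()/cancel() revoked us
                }
                // ... the actual callout work runs here, under tcp_lock ...
            }
        });
      
        WITH_LOCK(tcp_lock) {
            retransmit.reschedule(std::chrono::milliseconds(200));
        }
      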
    • core: introduce deferred work framework · 34620ff0
      Tomasz Grabiec authored
      The design behind timer_task
      
      timer_task was designed to make cancel() and reschedule() scale well
      with the number of threads and CPUs in the system. These methods may
      be called frequently and from different CPUs. A task scheduled on one
      CPU may later be rescheduled from another CPU. To avoid expensive
      coordination between CPUs, a lockfree per-CPU worker was implemented.
      
      Every CPU has a worker (async_worker) which has a task registry and a
      thread to execute the tasks. Most of the worker's state may only be
      changed from the CPU on which it runs.
      
      When a timer_task is rescheduled, it registers its percpu part in the
      current CPU's worker. When it is then rescheduled from another CPU, the
      previous registration is marked as invalid and a new percpu part is
      registered. When a percpu task fires, it checks whether it is the last
      registration; only then may it fire.
      
      Because timer_task's state is scattered across CPUs, some extra
      housekeeping needs to be done before it can be destroyed.  We need to
      make sure that no percpu task will try to access the timer_task object
      after it is destroyed. To ensure that, we walk the list of
      registrations of the given timer_task and atomically flip their state
      from ACTIVE to RELEASED. If that succeeds, the task is now revoked and
      the worker will not try to execute it. If that fails, the task is in
      the middle of firing and we need to wait for it to finish. When a
      per-CPU task is moved to the RELEASED state it is appended to the
      worker's queue of released percpu tasks using a lockfree MPSC queue.
      These objects may later be reused for registrations.
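      
      An illustrative sketch of the revocation step (the state names mirror the
      text; everything else is simplified and hypothetical):
      
        #include <atomic>
      
        enum class reg_state { ACTIVE, FIRING, RELEASED };
      
        struct percpu_registration {
            std::atomic<reg_state> state{reg_state::ACTIVE};
        };
      
        // Returns true if the registration was revoked before firing; false
        // means the worker is mid-fire and the destructor must wait for it.
        bool try_revoke(percpu_registration& reg) {
            reg_state expected = reg_state::ACTIVE;
            // Atomically flip ACTIVE -> RELEASED: on success the worker will
            // skip this registration; on failure it is currently firing.
            return reg.state.compare_exchange_strong(expected,
                                                     reg_state::RELEASED);
        }
      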
    • sched: introduce thread migration lock · 6c8a861d
      Tomasz Grabiec authored
      This can be useful when there's a need to perform operations on
      per-CPU structure(s) which all need to be executed on the same CPU, but
      there is code in between which may sleep (e.g. malloc).
      
      For example, this can be used to ensure that a dynamically allocated
      object is always freed on the same CPU on which it was allocated:
      
        WITH_LOCK(migration_lock) {
          auto _owner = *percpu_owner;
          auto x = new X();
          _owner->enqueue(x);
        }
    • sched: add atomic reset() operation to timer_base · b122a924
      Tomasz Grabiec authored
      It is needed by the new async framework.
    • balloons: define soft maximum amount · 9bbc86fc
      Glauber Costa authored
      
      One of the main problems we face with the balloon is the fact that by the
      time the JVM needs more memory, it may be too late. We have many mechanisms
      in place to try to mitigate this, by trying to release memory back to the
      JVM before it even notices it needs it.
      
      This is yet another of them: a soft maximum is defined. Unlike a hard
      maximum, OSv can create as many balloons as it sees fit. However, if we
      ever go beyond this maximum, we start putting pressure through the
      reclaimer, in order to be able to react more quickly if pressure builds up
      from the JVM side.
      
      Also, when we are over this limit and we need to release a balloon, we keep
      releasing until we are back in a safe zone, regardless of how many we were
      asked for.
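      
      An illustrative sketch of that policy (names are hypothetical, not OSv's
      real interfaces): exceeding the soft maximum is allowed, but starts
      pressuring the reclaimer, and a release request while over the limit keeps
      going until we are back in the safe zone:
      
        #include <cstddef>
      
        struct balloon_policy {
            size_t count = 0;     // balloons currently inflated
            size_t soft_max;      // may be exceeded, unlike a hard maximum
      
            void on_inflate() {
                if (++count > soft_max) {
                    start_reclaimer_pressure();  // react quicker to JVM demand
                }
            }
      
            void on_release_request(size_t asked) {
                // Release what was asked for, then keep releasing while still
                // over the soft maximum, regardless of the original request.
                while ((asked > 0 || count > soft_max) && count > 0) {
                    release_one();
                    --count;
                    if (asked > 0) --asked;
                }
            }
      
            void start_reclaimer_pressure();     // stubs, for the sketch only
            void release_one();
        };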
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • reclaimer: do not go hard if we're throttling · dad4357f
      Glauber Costa authored
      
      In case we are throttling, try to serve memory from the non-aggressive
      shrinker before going all-in. In particular, the current waiter may be
      allocating memory on behalf of the balloon (for instance, for its stack),
      so it is a bad idea to block. We may have more memory soon enough.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • zfs: throttle ZFS allocations during balloon · af767977
      Glauber Costa authored
      
      When we create a new balloon, memory will soon be available. However, until
      we actually create it, high-rate ARC users will continue to fill the ARC at
      will, which can easily lead to memory exhaustion.
      
      Blocking allocations until the balloon is created is not an option: in order to
      balloon, we need to allocate memory, and the source of this memory is quite
      unpredictable: backing heap, stack, and even file, if an object destruction
      needs to flush something out.
      
      Luckily, the ARC has a mechanism to avoid growing the cache: if the function
      arc_evict_needed() returns true, it will free buffers to make room for new
      ones instead of growing the cache.
      
      The most natural way of making that function return true would be to
      implement the function vm_paging_needed(), which is already there in generic
      code and would require zero ARC changes. However, doing this would imply
      that arc_reclaim_needed() would now return true as well. This would not only
      evict some buffers, but actually put the ARC in aggressive mode (especially
      because the allocations will keep coming).
      
      The solution I've chosen is to implement a new function in the ARC, wrapped
      in an ifdef so we don't lose track of it, that allows us to communicate that
      we should throttle new buffers due to a VM request (in our case, because we
      are waiting for the balloon).
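      
      A sketch of the shape of that change (the guard macro, function names and
      placement are illustrative, not the literal patch):
      
        #ifdef __OSV__
        static int arc_vm_throttle;         // set while a balloon is pending
      
        void arc_set_vm_throttle(int on)    // hypothetical OSv-side hook
        {
            arc_vm_throttle = on;
        }
        #endif
      
        static int arc_evict_needed(void)
        {
        #ifdef __OSV__
            // While we wait for the balloon, report that eviction is needed,
            // so the ARC recycles buffers instead of growing. Crucially, this
            // does not flip arc_reclaim_needed() into aggressive mode the way
            // implementing vm_paging_needed() would.
            if (arc_vm_throttle)
                return (1);
        #endif
            /* ... upstream eviction logic ... */
            return (0);
        }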
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • shrinkers: allow soft and hard mode · 5c2aba30
      Glauber Costa authored
      
      There are some shrinkers, like the ARC, that may become very aggressive
      when called more than once in a row. When we have no waiters, we can be
      more relaxed and just go for best-effort shrinking.
      
      This patch introduces a soft shrinking mode that does just that. In the
      future, we may introduce another threshold, smaller than the minimum, at
      which we go into hard mode even before we have waiters. For now, this will
      do.
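      
      An illustrative sketch of the two modes (the enum and method names are
      hypothetical, not OSv's exact shrinker interface):
      
        #include <cstddef>
      
        enum class shrink_mode {
            soft,   // best effort: no waiters yet, so be gentle (e.g. the ARC)
            hard    // someone is blocked waiting for memory: free aggressively
        };
      
        struct shrinker {
            virtual ~shrinker() = default;
            // Returns the number of bytes released. A shrinker like the ARC
            // can use 'mode' to avoid turning aggressive when called several
            // times in a row with no waiters present.
            virtual size_t shrink(size_t target, shrink_mode mode) = 0;
        };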
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • mempool: transform reclaimer thread into a normal thread · 18a82a55
      Glauber Costa authored
      
      The balloon is no longer synchronous, and now lives in its own
      thread. We used to have the requirement that the reclaimer needed
      a full pthread API, but that is no longer the case. Make it a
      normal thread now.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • balloon: close the balloons earlier if the regions intersect · b8970534
      Glauber Costa authored
      
      Instead of just looking at which region our start address falls in, we can
      iterate through all the vmas affected by installing the new vma, and close
      the region earlier.
      
      This design has one major advantage: split() will never happen, and if that
      is the case, the balloon becomes a lot more predictable, which at this
      point is a welcome dose of sanity in the whole thing.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • balloon: use a separate thread for the balloon · 045c602a
      Glauber Costa authored
      Synchronously ballooning unfortunately won't work: as usual, the garbage
      collector can run in order to release memory for the balloon, and then we
      will block.
      
      Since we can't make the balloon worker synchronous, we set its priority to
      a really high one, which guarantees that Java threads won't run before it.
      This will help in really fast allocation situations. During those, between
      the notification and the time this thread gets to run, the JVM may very
      well have exhausted all the remaining memory.
      
      Notice that this priority value is guaranteed to be lower (in OSv, lower
      values mean stronger priorities) than any JVM-created thread will ever get,
      since for JVM threads priority is only set through the setpriority
      interface, which yields 0.2 at its maximum priority.
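      
      A hedged sketch of the idea (API names approximate, and the 0.001 figure is
      illustrative, not the value the patch uses):
      
        // In OSv's scheduler a smaller priority value means a stronger
        // priority, and setpriority() maps the strongest nice level to 0.2,
        // so anything below that outranks every JVM-created thread.
        auto worker = new sched::thread([] {
            // wait for pressure notifications and inflate balloons
        });
        worker->set_priority(0.001f);  // stronger than setpriority()'s 0.2 floor
        worker->start();
      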
    • balloon: no longer make it part of the shrinker infrastructure. · 3c716bda
      Glauber Costa authored
      
      The current ballooning mechanism has a very serious design flaw, which
      became obvious once the tests with it advanced: if we wait until memory is
      short to release memory, the JVM itself may block. In particular, the
      blocking thread can be running the garbage collector. More likely, it can
      start running the GC to open up space for an object as big as the balloon.
      
      I don't believe this is something we should fix with a quick hack, so I am
      pushing for a redesign: the solution is to use a reservation system, which
      balloons beforehand, when the amount of reserved memory goes too low.
      
      This system works by noting that when we balloon, we are effectively
      limiting the maximum amount of memory Java will ever use. We already keep
      track of both the amount of free memory and the amount of memory allocated
      to the heap. With those figures, we can easily find out when the amount of
      used memory plus a full heap would exceed the amount of available memory.
      When that happens, we balloon.
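      
      The reservation test, in sketch form (accessor and parameter names are
      hypothetical). "Used memory plus a full heap exceeds available memory" is
      equivalent to asking whether the heap's remaining headroom no longer fits
      in free memory:
      
        #include <cstddef>
      
        bool should_balloon(size_t free_memory,     // currently free
                            size_t heap_committed,  // memory given to the heap
                            size_t heap_used)       // heap memory in use
        {
            // If the JVM filled its heap completely, it would need another
            // (heap_committed - heap_used) bytes; balloon before that exceeds
            // what is free.
            return heap_committed - heap_used > free_memory;
        }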
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • jvm_balloon: handle partial writes · 12f70bf0
      Glauber Costa authored
      
      There are situations in which some garbage collectors will copy a large
      object in parallel, using various threads, each being responsible for a
      part of the object. In particular, this easily happens under high memory
      pressure with the Parallel and ParallelOld collectors.
      
      If such a copy happens, the simple balloon move algorithm will break.
      However, because offset 'x' in the source will always be copied to offset
      'x' in the destination, we can still calculate the final destination
      object. The problem is that we cannot open the new balloon yet: since the
      JVM believes it is copying only a part of the object, the destination may
      (and usually will) contain valid objects, which need to be moved somewhere
      else themselves before we can install our object there.
      
      Also, we can't close the object fully when someone writes to it: because a
      part of the object is now already freed, the JVM may, and will, go ahead
      and copy another object to this location.
      
      This is quite a tricky problem, but the patch below solves it by keeping
      the source balloon open and counting the number of partial bytes copied. It
      makes the assumption that the JVM always needs to copy the whole object to
      a different location, meaning that once it has, the balloon can be closed.
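      
      An illustrative sketch of that bookkeeping (field and function names are
      hypothetical). Because offset x in the source always lands at offset x in
      the destination, the first partial write already reveals the final base
      address; the balloon is only moved once every byte has been copied:
      
        #include <cstddef>
      
        struct balloon_move_tracker {
            size_t object_size;
            size_t bytes_copied = 0;
            char*  destination = nullptr;
      
            // Called from the fault handler for each partial copy. Returns
            // true when the whole object has been copied and the source
            // balloon can be closed and reopened at 'destination'.
            bool on_partial_copy(char* src_base, char* src_off, char* dst_off,
                                 size_t len) {
                if (destination == nullptr) {
                    // Same offset in source and destination: recover the base.
                    destination = dst_off - (src_off - src_base);
                }
                bytes_copied += len;
                return bytes_copied == object_size;
            }
        };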
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • jvm_balloon: better define boundaries for the balloon. · e93bd6ba
      Glauber Costa authored
      
      Right now, we are starting the JVM with the maximum amount of memory we
      can. This is, however, terribly bad: if we go into reclaim during a heap
      allocation, we can easily crash the system. That is all but guaranteed if
      the allocation turns out to be done on behalf of the GC. In that case, no
      other Java threads will make progress, meaning the balloon cannot be
      activated.
      
      Even if the balloon is no longer part of the shrinker infrastructure, this
      may still be a problem, because we may need to use it to deflate memory.
      The best solution is to just limit heap usage to 90% of the available
      memory, which is the reclaim criterion.
      
      Also, it is pointless to activate the balloon under very low memory
      conditions, so if our total memory is less than 1GB, just leave it alone.
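      
      The sizing rule, sketched (constants from the text above; the helper name
      is hypothetical):
      
        #include <cstddef>
      
        constexpr size_t GiB = 1ull << 30;
      
        size_t choose_max_heap(size_t total_memory) {
            if (total_memory < 1 * GiB) {
                return total_memory;      // small guest: no balloon, leave
                                          // the JVM alone
            }
            return total_memory * 9 / 10; // 90%, the reclaim criterion
        }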
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • reclaimer: implement a stronger data structure for waking sleepers · 80646cd6
      Glauber Costa authored
      
      After Avi's last comments, here is a new take on the reclaimer's wake
      queue. It does not abuse semaphores in any way, so there is no need to
      extend them. (After this patch is applied, the semaphore extension patch
      can be reverted.)
      
      In this new approach, we will not only check free memory as a scalar, but
      will instead traverse the remaining page ranges (notice that in case of
      fragmentation this can be a *large* set) and try to find a range that would
      allow our allocation to happen.
      
      Also, because we are implementing both the wait and the "post" methods, the
      waking of the reclaimer happens atomically, and the problem that Tomek
      described (wake before waiters is set) no longer exists.
      
      Finally, another advantage of this approach is that we no longer need to
      keep track of the amount of memory we have freed when making OOM decisions:
      if we can wake someone up, we can continue; otherwise, we OOM. This will be
      especially helpful for the FS shrinkers, since they may not always be able
      to synchronously return an accurate, meaningful number (due to, for
      instance, writeback).
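      
      An illustrative sketch of the waking test (types simplified): instead of
      comparing a free-memory scalar, the "post" side walks the free page ranges
      looking for one that can satisfy the waiter:
      
        #include <cstddef>
        #include <vector>
      
        struct page_range { size_t size; };
      
        bool can_wake(const std::vector<page_range>& free_ranges,
                      size_t bytes_needed) {
            // Under fragmentation this set can be *large*, as noted above.
            for (const auto& r : free_ranges) {
                if (r.size >= bytes_needed) {
                    return true;   // wake the waiter and continue
                }
            }
            return false;          // no one can be satisfied: the OOM decision
        }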
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • balloon: use a shared ptr for the balloon object · c035ba98
      Glauber Costa authored
      
      The old explicit pattern would break if the vma that holds the balloon was
      already destroyed. Use a shared pointer to avoid the complications arising
      from that.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • balloon: fix move-to-itself bug · b079217c
      Glauber Costa authored
      
      While working on the reclaim policy, I came across an unhandled balloon
      case that seems to happen under certain low memory conditions: the balloon
      is moved to the same address it occupied before. While this seems weird at
      first, it is perfectly plausible if we consider that there can be a gap of
      almost a page in size before the balloon. If the JVM is only shifting the
      elements slightly, they will end up in the same ballooned region.
      
      The problem with that is that evacuate() is called first; the vma is still
      in place, so it will be found, and it will call delete. When the JVM fault
      handler executes, it will also call delete, and we will have a double free.
      
      This patch avoids the delete, but the problem is actually more intricate
      than that: if we just follow the normal path, we will remove the newly
      created address from the candidates list right after we insert it (because
      the addresses are the same). Also, if the address is *not* the same but
      falls in the same region (this is theoretical), evacuate will be called and
      split the region in two. Our code handling finish_move would have to be
      significantly complicated to deal with this. This patch also handles that
      theoretical case by making sure the whole region is always disposed of at
      once (by finishing the old balloon manually before evacuate is called).
      
      Why I am using a new mmu flag: when we search for the old vma, we need to
      identify which type of vma we are dealing with, whether an anonymous or a
      ballooned area. I use an extra flag to avoid a type field or anything like
      that.
      
      Couldn't we just take the vma out of the list before we delete it? In that
      case the region would be empty, and we could assume it was a ballooned
      region. That would also be a solution, but because we copy the flags from
      the previous vma, we would have to fall back to a default flag set. Marking
      the JVM vma and leaving it in place allows the flag information to be
      carried over.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
  4. Mar 27, 2014
    • zfsbuffers: reference count the arc buffer · 063a8a56
      Glauber Costa authored
      
      Gleb has noticed that the ARC buffers can become unshared too early. This
      happens because we call the UNMAP operation on every put(). That is
      certainly not what we want, since the buffer only has to be unshared when
      the last reference is gone.
      
      Design decisions:
      1) We obviously can't use the ARC's natural reference count for that, since
      bumping it would make the buffer unevictable.
      2) We could modify the arc_buf structure itself to add another refcnt
      (minimum 4 bytes). However, I am trying to keep core-ZFS modifications to a
      minimum, and only in places where they are totally unavoidable.
      
      Therefore, the solution is to add another hash, which hashes the whole
      buffer instead of the physaddr like the one we have currently. In terms of
      memory usage, it adds only 8 bytes per buffer (each buffer being roughly
      128k), which makes for a memory usage of 64k per mapped GB, compared to the
      arc refcount solution. This is a good trade-off.
      
      I am also avoiding adding a new vop_map/unmap style operation just to query
      the buffer address from its file attributes (needed for the put side).
      Instead, I am adopting the convention that an empty iovec means query, and
      a filled iovec means unshare.
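      
      An illustrative sketch of the per-buffer reference count (a real version
      would need the hash protected by a lock; names are hypothetical):
      
        #include <unordered_map>
      
        // One refcount per ARC buffer, keyed by the buffer address rather
        // than the physical address, so unsharing happens only on the last
        // put(). This avoids touching arc_buf itself and keeps buffers
        // evictable.
        std::unordered_map<void*, unsigned> buf_refs;
      
        void buffer_get(void* buf) {
            ++buf_refs[buf];             // first get inserts with count 1
        }
      
        bool buffer_put(void* buf) {     // true means: unshare/unmap now
            auto it = buf_refs.find(buf);
            if (--it->second == 0) {
                buf_refs.erase(it);
                return true;             // last reference just went away
            }
            return false;                // other mappings still use the buffer
        }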
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
  5. Mar 21, 2014
    • mmu: only handle page sized maps for buffer mappings · a6e826b4
      Glauber Costa authored
      
      Gleb has noticed a problem with the buffer mapping logic: unrelated
      mappings adjacent to the file mapping could be invalidated as well. This is
      because we were invalidating an area as large as the buffer. If the mapping
      ends, for instance, in the middle of the buffer, we would invalidate
      adjacent regions too.
      
      The solution is to change the code to handle per-page, not per-buffer,
      mappings. We will likely need something like that anyway for COW maps, so
      let's just start handling it now.
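      
      A sketch of the per-page approach (the helper name is hypothetical): only
      the pages actually covered by the mapping are invalidated, so unrelated
      mappings adjacent to it inside the same buffer are left untouched:
      
        #include <cstddef>
      
        constexpr size_t page_size = 4096;
      
        void invalidate_one_page(char* va);    // stub for the sketch
      
        void invalidate_mapping(char* start, size_t map_len) {
            for (char* p = start; p < start + map_len; p += page_size) {
                invalidate_one_page(p);        // per page, not per buffer
            }
        }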
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • mmu: handle multiple mappings to the same file · a17309e2
      Glauber Costa authored
      
      This case is not expected to be particularly common, but we have to handle
      it anyway. This implementation is slightly naive, but it should be okay if
      the number of mappings is small. We should revisit it if this number starts
      growing too much.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
    • map_file: expand parameters to get_page/put_page · 3c77122f
      Glauber Costa authored
      
      To better deal with shared buffers, we should expand the get_page/put_page
      interface. Three changes are needed. First, the start of the virtual
      address range is passed to both. Together with the offset, it can be used
      to uniquely identify the virtual address being mapped; that will be used to
      manage the mapping list.
      
      Second, regarding the offset itself, we also need to pass the file offset
      downwards, the one specified by the mapping. Reading it from the file
      pointer is not enough: not only is that not what the user asked for, but it
      can also change during the map's lifetime. It also needs to be a separate
      parameter from the in-mapping offset, not summed together with it; doing
      otherwise would make it impossible for the mapping function to
      differentiate between those two concepts.
      
      Third, put_page is called from the free operation, but does not receive the
      actual address from it. Knowing the physical address will be important to
      shut down our mapping list.
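      
      Hypothetical before/after signatures reflecting the three changes above
      (parameter names are illustrative, not the exact patch):
      
        #include <cstdint>
      
        // before (roughly):
        //   void* get_page(uintptr_t offset);
        //   void  put_page(uintptr_t offset);
      
        struct page_mapper_sketch {
            // 'start' is the beginning of the mapped virtual range; together
            // with 'offset' (the in-mapping offset) it uniquely identifies the
            // mapped address. 'file_offset' is the offset the mapping asked
            // for, kept separate so the two concepts stay distinguishable.
            virtual void* get_page(void* start, uintptr_t offset,
                                   uintptr_t file_offset) = 0;
            // put_page now also receives the address being released, so the
            // mapping list can be torn down from the free path.
            virtual void put_page(void* addr, void* start, uintptr_t offset,
                                  uintptr_t file_offset) = 0;
            virtual ~page_mapper_sketch() = default;
        };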
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>