- Jan 21, 2014
Dmitry Fleytman authored
It is bad practice to have DHCP discovery without timeouts and retries. If a discovery packet gets lost, the boot gets stuck. Besides this, there is an interesting phenomenon on some systems: the first few DHCP discovery packets sent on boot sometimes get lost. This started to happen from time to time on my KVM system, and almost every time on my Xen system, after installing recent Fedora Core updates. The packet leaves the VM's interface but never arrives at the bridge interface. The packet itself is built properly and reaches the DHCP server just fine after a few retransmissions. Most probably this phenomenon is a bug (or limitation) in the current Linux bridge version, so this patch is actually a work-around, but since DHCP timeouts/retries are a good idea in the general case, it is worth having anyway. Signed-off-by:
Dmitry Fleytman <dmitry@daynix.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
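For illustration, a minimal sketch of the kind of retry loop this describes. The send/wait_reply callables and the retry budget are assumptions for the example, not names from OSv's DHCP code:

    #include <chrono>
    #include <functional>

    // Hypothetical helper: retransmit a DHCP DISCOVER until an OFFER arrives
    // or the retry budget is exhausted. send() transmits the DISCOVER;
    // wait_reply(timeout) returns true if an OFFER arrived in time.
    bool discover_with_retries(const std::function<void()>& send,
                               const std::function<bool(std::chrono::milliseconds)>& wait_reply,
                               int max_attempts = 5,
                               std::chrono::milliseconds timeout = std::chrono::seconds(3))
    {
        for (int attempt = 0; attempt < max_attempts; ++attempt) {
            send();                    // (re)transmit the DISCOVER
            if (wait_reply(timeout)) {
                return true;           // got an OFFER, continue the exchange
            }
            timeout *= 2;              // lost packet: back off a little and retry
        }
        return false;                  // give up; the caller decides what to do next
    }

With such a loop, a DISCOVER dropped by the bridge delays boot by one timeout instead of hanging it forever.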
-
- Jan 20, 2014
Avi Kivity authored
A waitqueue is an object on which multiple threads can wait; other threads can wake up either one or all waiting threads. A waitqueue is associated with an external mutex which the user must supply for both wait and wake operations. Waitqueues differ from condition variables in three respects:
- waitqueues do not contain an internal mutex. This makes them smaller and reduces lock acquisitions. On the other hand, the waker must hold the associated mutex, whereas this is not required with condition variables.
- waitqueues support sched::thread::wait_for().
- waitqueues support wait morphing and do not cause excess lock contention, even with wake_all(). Reviewed-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
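A behavioural model of that contract: the caller supplies the same external mutex to wait() and to the wake operations. This is only a sketch built on std::condition_variable; the real OSv waitqueue has no internal mutex and integrates with the scheduler and wait morphing:

    #include <condition_variable>
    #include <mutex>

    // Toy model of the waitqueue API shape; the internals are a stand-in only.
    class toy_waitqueue {
    public:
        void wait(std::mutex& mtx) {                      // caller already holds mtx
            std::unique_lock<std::mutex> lk(mtx, std::adopt_lock);
            _cv.wait(lk);                                 // atomically releases and re-acquires mtx
            lk.release();                                 // hand ownership of mtx back to the caller
        }
        void wake_one(std::mutex&) { _cv.notify_one(); }  // caller holds the associated mutex
        void wake_all(std::mutex&) { _cv.notify_all(); }
    private:
        std::condition_variable _cv;
    };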
-
Avi Kivity authored
This adds a facility to wake a thread, but with the intention that it will acquire a certain lock after waking, and while the waker holds the lock. This is implemented using the regular wait morphing code (send_lock() and receive_lock()), but with additional mutual exclusion to allow regular wake()s in parallel. Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Vlad Zolotarov authored
Remove the extra check on size just like the remark above implies. Reviewed-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Vlad Zolotarov <vladz@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Jan 19, 2014
Takuya ASADA authored
Fix object::lookup_addr to look up the correct symbol. It should return the nearest symbol with s_addr < addr, but the comparison was done the opposite way. Signed-off-by:
Takuya ASADA <syuu@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
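A standalone sketch of what "nearest symbol with s_addr < addr" means; the symbol struct and the linear scan are illustrative, not the actual object::lookup_addr code:

    #include <cstdint>
    #include <vector>

    struct symbol {
        uintptr_t s_addr;      // symbol start address
        const char* name;
    };

    // Return the symbol with the largest s_addr that is still below addr,
    // i.e. the nearest symbol preceding the address, or nullptr if none.
    const symbol* lookup_nearest(const std::vector<symbol>& syms, uintptr_t addr)
    {
        const symbol* best = nullptr;
        for (const auto& s : syms) {
            if (s.s_addr < addr && (!best || s.s_addr > best->s_addr)) {
                best = &s;
            }
        }
        return best;
    }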
-
- Jan 17, 2014
Dmitry Fleytman authored
This patch introduces support for the MTU option as described in RFC 2132, chapter 5.1 (Interface MTU Option). Amazon EC2 networking uses this option in some cases, and it gives a throughput improvement of about 250% on big instances with 10G networking. Netperf results for hi1.4xlarge instances, TCP_MAERTS test, OSv runs netserver:

    Send buffer size  Throughput w/ patch (Mbps)  Throughput w/o patch (Mbps)  Improvement (%)
               32              4912.29                     1386.28                  254
               64              4832.01                     1385.99                  249
              128              4835.09                     1401.46                  245
              256              4746.41                     1382.28                  243
              512              4849.04                     1375.23                  253
             1024              4631.8                      1356.69                  241
             2048              4859.59                     1371.92                  254
             4096              4864.99                     1383.67                  252
             8192              4627.07                     1364.05                  239
            16384              4868.73                     1366.48                  256
            32768              4822.69                     1366.63                  253
            65536              4837.67                     1353.87                  257

Signed-off-by:
Dmitry Fleytman <dmitry@daynix.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
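For reference, RFC 2132 section 5.1 encodes the Interface MTU as option code 26 with a 2-byte big-endian value (minimum legal value 68). A sketch of extracting it from a raw options buffer; this is illustrative, not OSv's DHCP parser:

    #include <cstddef>
    #include <cstdint>

    // Scan a DHCP options buffer for option 26 (Interface MTU, RFC 2132 5.1).
    // Returns the MTU, or 0 if the option is absent or malformed.
    uint16_t parse_interface_mtu(const uint8_t* opts, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            uint8_t code = opts[i];
            if (code == 0)   { ++i; continue; }   // PAD
            if (code == 255) break;               // END
            if (i + 1 >= len) break;              // truncated option header
            uint8_t olen = opts[i + 1];
            if (i + 2 + olen > len) break;        // truncated option body
            if (code == 26 && olen == 2) {
                uint16_t mtu = static_cast<uint16_t>((opts[i + 2] << 8) | opts[i + 3]);
                return mtu >= 68 ? mtu : 0;       // 68 is the minimum legal value
            }
            i += 2 + olen;
        }
        return 0;
    }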
-
Pekka Enberg authored
Add a procfs_maps() function to core/mmu.cc that returns all the VMAs formatted for a Linux-compatible "/proc/<pid>/maps" file. This will be called by the procfs filesystem. Limitations:
* Shared mappings are not identified as such.
* File-backed mmap offset, device, inode, and pathname are not reported.
* Special region names such as [heap] and [stack] are not reported. Reviewed-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
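A sketch of formatting one maps line under those limitations (offset, device, inode and pathname reported as zeros/empty). The field layout follows Linux's /proc/<pid>/maps; the helper itself is illustrative, not the OSv code:

    #include <cstdint>
    #include <cstdio>
    #include <string>

    // Format a single maps line for an anonymous private mapping:
    // "start-end perms offset dev inode" with the unreported fields zeroed.
    std::string format_maps_line(uintptr_t start, uintptr_t end,
                                 bool r, bool w, bool x)
    {
        char buf[96];
        std::snprintf(buf, sizeof(buf), "%lx-%lx %c%c%cp %08x %02x:%02x %u\n",
                      static_cast<unsigned long>(start),
                      static_cast<unsigned long>(end),
                      r ? 'r' : '-', w ? 'w' : '-', x ? 'x' : '-',
                      0u, 0u, 0u, 0u);
        return buf;
    }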
-
- Jan 16, 2014
Glauber Costa authored
This code should not be here. I am 100% positive that I removed it in my testing, but I must have forgotten to git add the removal before I sent out the patch, and it ended up in the tree. This is simply a test leftover; it has the effect of making threads loop forever and never wait, because the initial value won't be 0. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Jan 15, 2014
Eduardo Piva authored
Create a circular buffer that stores all debug messages. If the debug buffer is full, reuse it from the beginning. A method called flush_debug_buffer is added to enable printing all messages to the console if verbose mode is configured. The global variable debug_buffer_full is used to track whether, when flushing the debug buffer to the console, we need to flush both sides of the buffer. If the verbose boolean variable is set, all messages are printed to the console after being stored in the buffer. The size of the buffer is 50KB, defined in debug.hh. A function debugf that receives a variable list of arguments is defined so we can change some printf calls in the boot sequence into debugf calls. A different name is used because C does not support overloading. Signed-off-by:
Eduardo Piva <efpiva@gmail.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
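A minimal sketch of the ring-buffer idea; the names and the flush policy are illustrative (the real buffer and its 50KB size live in debug.hh/debug.cc):

    #include <cstddef>
    #include <string>

    // Toy circular log buffer: writes wrap around, and a "full" flag records
    // whether the region after the write cursor also holds valid (older) data.
    class debug_ring {
    public:
        void write(const char* msg, size_t len) {
            for (size_t i = 0; i < len; ++i) {
                _buf[_pos++] = msg[i];
                if (_pos == sizeof(_buf)) {   // wrapped: a flush must cover both sides
                    _pos = 0;
                    _full = true;
                }
            }
        }
        std::string flush() const {           // oldest data first when wrapped
            std::string out;
            if (_full) out.append(_buf + _pos, sizeof(_buf) - _pos);
            out.append(_buf, _pos);
            return out;
        }
    private:
        char _buf[50 * 1024] = {};
        size_t _pos = 0;
        bool _full = false;
    };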
-
- Jan 13, 2014
Glauber Costa authored
ZFS will perform some checks to determine if the current calling "process" is the reclaimer. Export the address of the reclaimer thread so that this test can work. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Dmitry Fleytman authored
Signed-off-by:
Dmitry Fleytman <dmitry@daynix.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Amnon Heiman authored
The uname() function returns a fake Linux version number for application compatibility. Add a new osv::version() API that returns the OSv version, which can be used by the management code. Signed-off-by:
Amnon Heiman <amnon@cloudius-systems.com> [ penberg: cleanups ] Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
Useful for calculating the time during which a thread was scheduled out because of wait(). Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Jan 10, 2014
Glauber Costa authored
To make informed reclaim decisions, we need to have as much relevant information as possible about our reclaim targets. Specifically, it is useful to know how much memory is currently used by the JVM heap. The reasoning behind this is that if pressure is coming from the heap, ballooning will harm us instead of helping us. Note: This is really just a first approximation. Ideally, total memory shouldn't matter, but rather the memory delta since a last common event. But counting memory is the first step for both. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
To find out which vmas hold the Java heap, we will use a technique that is very close to ballooning (in the implementation, it is effectively the same). We will insert a very small element (2 pages) and mark the vma where the object is present as containing the JVM heap. Due to the way the JVM allocates objects, it will end up in the young generation. As time passes, the object will move the same way the balloon moves, and every new vma that is seen will be marked as holding the JVM heap. That mechanism should work for every generational GC, which should encompass most of the JDK7 GCs (if not all). It won't work with the G1GC, but that debuts in JDK8, and for that we can do something a lot simpler, namely have the JVM tell us in advance which map areas contain the heap. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
There are restrictions on when and how a shrinker can run. For instance, if we have no balloons inflated, there is nothing to deflate (the relaxer should then be deactivated). Likewise, when the JVM fails to allocate memory for an extra balloon, it is pointless to keep trying (which would only lead to unnecessary spins) until *at least* the next garbage collection phase. I believe this activation/deactivation behavior ought to be shrinker-specific; the reclaiming framework will only provide the infrastructure for it. In this patch, the JVM balloon uses that to inform the reclaimer when it makes sense for the shrinker or relaxer to be called. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
This patch implements the JVM balloon driver, which is responsible for borrowing memory from the JVM when OSv is short on memory, and giving it back when memory is plentiful again. It works by allocating a Java byte array and then unmapping a large page-aligned region inside it (as big as our size allows). This array is good to go until the GC decides to move it. When that happens, we need to carefully emulate the memcpy fault and put things back in place. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
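A heavily simplified sketch of the borrow step, using standard JNI calls; madvise() stands in for OSv's unmapping of the aligned interior, and the helper name is an assumption, not the actual jvm_balloon code:

    #include <jni.h>
    #include <sys/mman.h>
    #include <cstdint>

    // Allocate a Java byte array and give its page-aligned interior back to
    // the system. The array stays referenced, so only a GC move (handled by
    // the real balloon's fault/decode path) can disturb the region.
    void* borrow_from_jvm(JNIEnv* env, jsize array_size, size_t align)
    {
        jbyteArray arr = env->NewByteArray(array_size);
        if (arr == nullptr) {
            return nullptr;                         // allocation failed; try again after a GC
        }
        env->NewGlobalRef(arr);                     // keep the array alive
        void* base = env->GetPrimitiveArrayCritical(arr, nullptr);
        auto start = (reinterpret_cast<uintptr_t>(base) + align - 1) & ~(align - 1);
        auto end   = (reinterpret_cast<uintptr_t>(base) + array_size) & ~(align - 1);
        if (end > start) {
            // Release the aligned interior; madvise() is a stand-in for the unmap.
            madvise(reinterpret_cast<void*>(start), end - start, MADV_DONTNEED);
        }
        env->ReleasePrimitiveArrayCritical(arr, base, 0);
        return end > start ? reinterpret_cast<void*>(start) : nullptr;
    }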
-
Glauber Costa authored
After carrying out some testing, I quickly realized that the old fixup-only solution I was attempting for the ballooning was not really flying. The reason is that we would take a fault, figure out the fixup address, and return. If that wasn't a JVM fault, we were forced to take another fault (since we were already out of fault context). Once demand paging is a reality, the vast majority of faults are for non-balloon addresses, so we were effectively doubling our number of page faults for no reason. I have decided to go with the VMA (+ fixups for instruction decoding) route after all. This is way more efficient and it seems to be working fine. The JVM vma is really close to the normal anonymous VMA, except that it can never hold pages, and its fault handler calls into the JVM balloon facilities for decoding. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
This patch introduces the memory reclaimer thread, which I hope to use to dispose of unused memory when pressure kicks in. "Pressure" right now is defined as having only 20% of total memory available, but that can be revisited. The way it will work is that each memory user that is able to dispose of its memory will register a shrinker, and the reclaimer will loop through them. However, the current "loop through all" only "works" because we have only one shrinker being registered. When others appear, we need better policies to drive how much to take, and from whom. Memory allocation will now wait if memory is not available, instead of aborting. The decision to abort should belong to the reclaimer and no one else. We should never expect to have an unbounded and, more importantly, all-opaque number of shrinkers like Linux does. We have control over who they are and how they behave, so I expect that we will be able to make much better decisions in the long run. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
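A schematic of the register-and-loop structure described above; the class and method names are illustrative, not the OSv reclaimer API:

    #include <cstddef>
    #include <memory>
    #include <vector>

    // Illustrative shrinker interface: each registered memory user can be
    // asked to give back up to 'target' bytes and reports what it released.
    struct shrinker {
        virtual ~shrinker() = default;
        virtual size_t release_memory(size_t target) = 0;
    };

    class reclaimer {
    public:
        void register_shrinker(std::shared_ptr<shrinker> s) {
            _shrinkers.push_back(std::move(s));
        }
        // Called when free memory drops below the pressure threshold (20% here).
        void reclaim(size_t free_bytes, size_t total_bytes) {
            if (free_bytes * 5 >= total_bytes) {
                return;                                  // no pressure
            }
            size_t want = total_bytes / 5 - free_bytes;  // bring us back to ~20% free
            for (auto& s : _shrinkers) {                 // naive "loop through all" policy
                if (want == 0) break;
                size_t got = s->release_memory(want);
                want -= got < want ? got : want;
            }
        }
    private:
        std::vector<std::shared_ptr<shrinker>> _shrinkers;
    };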
-
Glauber Costa authored
Following an early suggestion from Nadav, I am trying to use semaphores for the balloon instead of keeping our own queue. For that to work, I need a bit more functionality that may not belong in the main balloon class. Namely: 1) I need to query for the presence of waiters (and maybe, in the future, for the number of waiters); 2) I need a special post that allows me to make sure that we post at most as much as we're waiting for, and nothing more. This patch transforms the post method into an unlocked version (and exposes a trivial version that just locks around it) and makes the other changes necessary to allow subclassing. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
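A sketch of that split on a simple counting semaphore; names like post_unlocked() mirror the description, but this is not the actual OSv semaphore:

    #include <condition_variable>
    #include <mutex>

    class semaphore {
    public:
        explicit semaphore(unsigned val = 0) : _val(val) {}

        void wait() {
            std::unique_lock<std::mutex> lk(_mtx);
            ++_waiters;
            _cv.wait(lk, [this] { return _val > 0; });
            --_waiters;
            --_val;
        }
        // Trivial locked wrapper around the unlocked version.
        void post(unsigned units = 1) {
            std::lock_guard<std::mutex> lk(_mtx);
            post_unlocked(units);
        }

    protected:
        // A subclass (e.g. a balloon-specific semaphore) can combine these
        // under one lock, e.g. post at most as many units as there are waiters.
        void post_unlocked(unsigned units) { _val += units; _cv.notify_all(); }
        unsigned waiters_unlocked() const { return _waiters; }
        std::mutex _mtx;

    private:
        std::condition_variable _cv;
        unsigned _val;
        unsigned _waiters = 0;
    };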
-
Glauber Costa authored
This will be useful when we shrink, so we know how much memory we newly released for system consumption. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
So far, operate() works on a page range and at the very most sets a success flag somewhere. I am extending the API here to allow it to return how much data it manipulated. As an example, if we fault in 2MB in an empty range, it will return 2 << 20. But if we fault in the same 2MB in a range that already contained some sparse 4KB pages, we will return (2 << 20) minus the size of those previous pages. That will be useful for counting memory usage in certain VMAs. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Jan 08, 2014
Glauber Costa authored
There was a small bug in the free memory tracking code that I've only hit recently. I was wrong in assuming that in the first branch for huge page allocation, where we erase the entire range, we should account for N bytes. This assumption came from my - wrong - understanding that we would do that when the range is exactly N bytes. Looking at the code with fresh eyes, that is definitely not what happens. In my previous stress test we were hitting the second branch all the time, so this bug lived on. Turns out that we will delete the entire page range, which may be bigger than N, the allocation size. Therefore, the whole range should be discounted from our calculation. The remainder (bigger than N part) will be accounted for later when we reinsert it in the page range, in the same way it is for the second branch of this code. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Jan 07, 2014
Nadav Har'El authored
In very early OSv history, the spinlock was used in the mutex's implementation so it made sense to put it in mutex.cc and mutex.h. But now that the spinlock is all that's left in mutex.cc (the real mutex is in lfmutex.cc), rename this file spinlock.cc. Also, move the spinlock definitions from <osv/mutex.h> to a new <osv/spinlock.h>, so if someone wants to make the grave mistake of using a spinlock - they will at least need to explicitly include this header file. Currently, the only remaining user of the spinlock is the console. Using a spinlock (and not a mutex) in the console allows printing debug messages while preemption is disabled. Arguably, this use-case is no longer important (we have tracepoints), so in the future we can consider dropping the spinlock completely. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
This patch reserves some thread ids that are kept unused. This is so we can construct values that reuse the thread's public id, combine it with other information, and still fit in 32 bits. Reviewed-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
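An illustrative packing scheme showing why reserving part of the id space helps; the field widths below are assumptions for the example, not the actual OSv layout:

    #include <cstdint>

    // Example only: keep public thread ids below 2^24 so that an id plus an
    // 8-bit tag still fits in a single 32-bit value.
    constexpr uint32_t id_bits = 24;
    constexpr uint32_t max_public_id = (1u << id_bits) - 1;  // ids above this stay reserved

    constexpr uint32_t pack(uint32_t thread_id, uint8_t tag)
    {
        return (static_cast<uint32_t>(tag) << id_bits) | (thread_id & max_public_id);
    }
    constexpr uint32_t unpack_id(uint32_t packed)  { return packed & max_public_id; }
    constexpr uint8_t  unpack_tag(uint32_t packed) { return static_cast<uint8_t>(packed >> id_bits); }

    static_assert(unpack_id(pack(12345, 7)) == 12345, "id round-trips");
    static_assert(unpack_tag(pack(12345, 7)) == 7, "tag round-trips");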
-
Glauber Costa authored
This will be used later to determine for how long a thread has been running. It can easily be updated right before we call ran_for(), reusing its interval parameter. Fixes #135 Reviewed-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Jan 03, 2014
Tomasz Grabiec authored
Useful if you want to know who created that large pile of threads. Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Jan 02, 2014
Gleb Natapov authored
Currently map_file() does three passes over vma memory in the worst case. First it maps memory with write permission while zeroing it, then it reads the file into memory, and, if the vma is read-only, it does one more pass to fix the memory permissions. Fix this by providing a new specialization of the fill_page class which builds an iovec of all allocated memory and reads from the file using that iovec at the end of the populate stage. Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Nadav Har'El authored
In issue #145 I reported a crash during boot in start_early_threads(). I wasn't actually able to replicate this bug on master, but it happens quite frequently (e.g., on virtually every "make check" run) with some patches of mine that seem unrelated to this bug. The problem is that start_early_threads() (added in 63216e85) iterates over the threads in the thread list, and uses t->remote_thread_local_var() for each thread. This can only work if the thread has its TLS initialized, but unfortunately in the thread's constructor we first added the new thread to the list, and only later called setup_tcb() (which allocates and initializes the TLS). If we're unlucky, start_early_threads() can find a thread on the list which still doesn't have its TLS allocated, so remote_thread_local_var() will crash. The simple fix is to switch the order of construction: first set up the new thread's TLS, and only then add it to the list of threads. Fixes #145. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
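The shape of the fix, reduced to a toy constructor; the member names are placeholders, the real code is in OSv's scheduler:

    #include <list>
    #include <mutex>

    // Toy thread object illustrating the ordering fix: the TLS block must
    // exist before the thread becomes visible on the global list.
    // (Cleanup and list removal are omitted for brevity.)
    class toy_thread {
    public:
        toy_thread() {
            setup_tcb();                    // allocate and initialize TLS first ...
            std::lock_guard<std::mutex> lk(_list_mutex);
            _all_threads.push_back(this);   // ... and only then publish the thread
        }
    private:
        void setup_tcb() { _tls = new char[256](); }   // stand-in for real TLS setup
        char* _tls = nullptr;
        static std::list<toy_thread*> _all_threads;
        static std::mutex _list_mutex;
    };

    std::list<toy_thread*> toy_thread::_all_threads;
    std::mutex toy_thread::_list_mutex;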
-
Tomasz Grabiec authored
In order to reuse the logic it needs to be extracted. Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Jan 01, 2014
Nadav Har'El authored
Since recently we have methods of the "file" class instead of the old C-style file-operation function types and fo_*() functions. All our filesystem code is now C++ and can use these methods directly, so this patch drops the old types and functions and uses the class methods instead. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Avi Kivity authored
The code has bitrotted, and it doesn't support wait morphing. Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Avi Kivity authored
Instead of free-standing functions, use member functions, which are easier to work with. Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Avi Kivity authored
The _ prefix helps to distinguish between members and non-members; this helps with the next patch. Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Avi Kivity authored
Make the code more maintainable by removing the #ifdefs; it doesn't make sense to disable wait morphing. Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Nadav Har'El authored
Each implementation of "struct file" needs to implement 8 different file operations. Most special file implementations, such as pipe, socketpair, epoll and timerfd, don't support many of these operations. We had functions in unsupported.h that could be reused for the unsupported operations, but this resulted in a lot of ugly boilerplate code. Instead, this patch switches to a cleaner, more C++-like method: it defines a new "file" subclass, called "special_file", which implements all file operations except close(), with a default implementation identical to the old unsupported.h implementations. The files of pipe(), socketpair(), timerfd() and epoll_create() now inherit from special_file, and only override the file operations they really want to implement. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
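The pattern in outline; the operation list and return values here are simplified assumptions, not OSv's actual file-operations interface:

    #include <cerrno>
    #include <cstddef>

    // Simplified stand-in for the file-operations interface.
    class file {
    public:
        virtual ~file() = default;
        virtual int read(void* buf, size_t len) = 0;
        virtual int write(const void* buf, size_t len) = 0;
        virtual int truncate(long len) = 0;
        virtual int chmod(int mode) = 0;
        virtual int close() = 0;
    };

    // Default "unsupported" behaviour lives in one place; close() stays pure.
    class special_file : public file {
    public:
        int read(void*, size_t) override { return -EINVAL; }
        int write(const void*, size_t) override { return -EINVAL; }
        int truncate(long) override { return -EINVAL; }
        int chmod(int) override { return -EINVAL; }
    };

    // A pipe-like file overrides only what it actually supports.
    class pipe_file : public special_file {
    public:
        int read(void*, size_t) override { return 0; }    // real pipe read goes here
        int write(const void*, size_t) override { return 0; }
        int close() override { return 0; }
    };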
-
- Dec 31, 2013
Gleb Natapov authored
Right now most of the mmap-related functions have the same bug related to vma locking: they validate the mapping under the vma lock, then release the lock and do the actual vma operation; but since the mapping can go away between validation and operation, this is incorrect. This patch fixes it by doing validation and operation under the same lock. Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
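The locking change, schematically; the mutex and helpers are placeholders for the real vma list lock and operations:

    #include <cstddef>
    #include <mutex>

    static std::mutex vma_list_mutex;                       // placeholder for the real vma lock
    static bool validate(void*, size_t) { return true; }    // stub for the sketch
    static void do_vma_operation(void*, size_t) {}          // stub for the sketch

    // Buggy shape: the mapping can disappear between the two critical sections.
    int mmap_op_racy(void* addr, size_t size)
    {
        {
            std::lock_guard<std::mutex> lk(vma_list_mutex);
            if (!validate(addr, size)) return -1;
        }
        // <-- another thread may unmap here
        std::lock_guard<std::mutex> lk(vma_list_mutex);
        do_vma_operation(addr, size);
        return 0;
    }

    // Fixed shape: validation and operation under one and the same lock.
    int mmap_op(void* addr, size_t size)
    {
        std::lock_guard<std::mutex> lk(vma_list_mutex);
        if (!validate(addr, size)) return -1;
        do_vma_operation(addr, size);
        return 0;
    }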
-
Gleb Natapov authored
Currently the offset calculation is incorrect. Fix it by tracking the base address of a region and calculating the offset by subtracting the base address from the current mapping address. Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Gleb Natapov authored
map_range() is the entry point into the page mapper, so make it impossible to instantiate the map_level class directly by making all of its functions and its constructor private and declaring map_range() as a friend. Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Gleb Natapov authored
One for the initial call, another for the recursion. This gets rid of the default parameters and the need for std::integral_constant, and makes the code much more readable. Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-