  1. Oct 10, 2013
    • build: define _KERNEL everywhere · 95ce17e3
      Avi Kivity authored
      We have _KERNEL defines scattered throughout the code, which makes
      understanding it difficult.
      
      Define it just once, and adjust the source to build.
      
      We define it in an overridable variable, so that non-kernel imported code
      can undo it.
      95ce17e3
  2. Oct 07, 2013
  3. Oct 03, 2013
  4. Sep 29, 2013
  5. Sep 28, 2013
  6. Sep 25, 2013
    • Dynamic linker: run finalizers when unloading shared object · bf0688f4
      Nadav Har'El authored
      
      ELF allows specifying initializers - functions to be run after loading a
      shared object, in DT_INIT_ARRAY - and also finalizers - functions to be
      run before unloading a shared object, in DT_FINI_ARRAY. The existing code
      ran the initializers but forgot to run the finalizers; this patch fixes
      that oversight.
      
      This fix is necessary for destructors of static objects defined in the
      shared object. But this fix is not sufficient for C++ destructors - see
      also the next patch.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      bf0688f4
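      A minimal sketch of the idea described above, assuming an object class that
      records the DT_FINI_ARRAY base and entry count (the member names here are
      illustrative, not the ones in OSv's elf.cc):

        void object::run_fini_funcs()
        {
            using fini_func = void (*)();
            auto funcs = reinterpret_cast<fini_func*>(_fini_array);  // DT_FINI_ARRAY base
            // Finalizers run in reverse order of the corresponding initializers.
            for (long i = _fini_array_count - 1; i >= 0; --i) {
                funcs[i]();
            }
        }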
  7. Sep 24, 2013
    • Fix missing poll() wakeup on POLLHUP · 554e80f6
      Nadav Har'El authored
      
      Our poll_wake() code ignored calls with the POLLHUP event, because
      the user did not explicitly ask for this event. As a result, a poll()
      waiting to read from a pipe does not wake up when the pipe's write side
      is closed.
      
      This patch adds a test for this case in tst-pipe.cc, and fixes the
      bug by also adding ~POLL_REQUESTABLE to the poll structure's _events,
      i.e., any bits that do not have to be explicitly requested by the
      user (POLL_REQUESTABLE is a new macro defined in this patch).
      
      After this patch, poll() wakes as needed in the test (instead of just
      hanging), but returns the wrong event because of another bug, which will
      be fixed in a separate patch.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      554e80f6
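      A sketch of the fix, under the assumption of a POLL_REQUESTABLE macro like
      the one described above (the exact bit list and field names are
      illustrative):

        // Events the user must request explicitly; everything else
        // (POLLHUP, POLLERR, POLLNVAL, ...) is always reported.
        #define POLL_REQUESTABLE (POLLIN | POLLOUT | POLLPRI | POLLRDNORM | \
                                  POLLRDBAND | POLLWRNORM | POLLWRBAND)

        // When registering the request with each file, listen for the requested
        // events plus every non-requestable bit, so poll_wake(file, POLLHUP)
        // is no longer filtered out:
        entry._events = requested_events | ~POLL_REQUESTABLE;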
    • elf: Fix assert() in object::relocate_pltgot · e4c4696f
      Pekka Enberg authored
      
      The assertion in object::relocate_pltgot uses assignment instead of
      comparison.  Fix that up.
      
      Spotted by Coverity.
      
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
      e4c4696f
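      The general shape of the bug, for illustration (this is not the exact
      expression in elf.cc):

        assert(type = R_X86_64_JUMP_SLOT);    // '=' assigns a non-zero constant,
                                              // so the assert can never fire
        assert(type == R_X86_64_JUMP_SLOT);   // '==' is the intended comparison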
  8. Sep 23, 2013
    • Our select() function is emulated using poll(), which is a sensible thing · b53d39ac
      Nadav Har'El authored
      to do. However, it did several things wrong that this patch fixes. Thanks
      to Paolo Bonzini for finding these problems (see issue #35).
      
      1. When poll() returned a bare POLLHUP, without POLLIN, our select() didn't
      return a read event. But nothing in the manpages guarantees that POLLHUP
      is accompanied by POLLIN, and some special file implementations might
      forget it. As an example, in Linux POLLHUP without POLLIN is common.
      But POLLHUP on its own already means that there's nothing more to read,
      so a read() will return immediately without blocking - and therefore
      select() needs to turn on the readable bit for this fd.
      
      2. Similarly, a bare POLLRDHUP should turn on the writable bit: the
      reader on this file hung up, so a write will fail immediately.
      
      3. Our poll() and select() confused what POLLERR means. POLLERR does not
      mean poll() found a bad file descriptor - there is POLLNVAL for that.
      So this patch fixes poll() to set POLLNVAL, not POLLERR, and select()
      to return with errno=EBADF when it sees POLLNVAL, not POLLERR.
      
      4. Rather, POLLERR means the file descriptor is in an error state, so every
      read() or write() will return immediately (with an error). So when we see
      it, we need to turn on both the read and write bits.
      
      5. The meaning of "exceptfds" isn't clear in any manual page, and it
      seems there are a lot of opinions on what it might mean. In this patch I
      did what Paolo suggested, which is to set the except bit when POLLPRI is
      set. (I don't set exceptfds on POLLERR or in any other case.)
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      b53d39ac
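      A sketch of the revents-to-fd_set mapping that points 1-5 describe
      (illustrative variable names, not the exact code of our select() wrapper):

        if (revents & POLLNVAL) {                        // point 3: bad descriptor
            errno = EBADF;
            return -1;
        }
        if (revents & (POLLIN | POLLHUP | POLLERR)) {    // points 1 and 4
            FD_SET(fd, readfds);                         // read() would not block
        }
        if (revents & (POLLOUT | POLLRDHUP | POLLERR)) { // points 2 and 4
            FD_SET(fd, writefds);                        // write() would fail immediately
        }
        if (revents & POLLPRI) {                         // point 5
            FD_SET(fd, exceptfds);
        }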
  9. Sep 21, 2013
  10. Sep 20, 2013
  11. Sep 15, 2013
    • Add copyright statement to core/* · 4c0b39f3
      Nadav Har'El authored
      Added Cloudius copyright statement to core/*.
      
      poll.cc already had a BSD copyright statement. I believe this is a mistake
      (I think Guy wrote this code from scratch), but, not wanting to rush to a
      conclusion, I'm leaving both copyright statements; we should address this
      issue later.
      4c0b39f3
    • poll: Improve tracepoints · f8c106ae
      Pekka Enberg authored
      Pass function arguments to the tracepoint and add tracepoints for the
      poll() return value and errno.
      f8c106ae
  12. Sep 12, 2013
  13. Sep 11, 2013
    • Add reboot function · 542c319b
      Nadav Har'El authored
      Added a new function, osv::reboot() (declared in <osv/power.hh>)
      for rebooting the VM.
      
      Also added a Java interface - com.cloudius.util.Power.reboot().
      
      NOTE: Power.java and/or jni/power.cc also need to be copied into
      the mgmt submodule.
      542c319b
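      A usage sketch based on the declaration mentioned above:

        #include <osv/power.hh>

        void restart_guest()
        {
            osv::reboot();   // reboots the VM, per the new API
        }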
  14. Sep 10, 2013
    • mmu: Fix file-backed vma splitting · d72b550c
      Pekka Enberg authored
      Commit 3510a5ea ("mmu: File-backed VMAs") forgot to fix vma::split() to
      take file-backed mappings into account. Fix the problem by making
      vma::split() a virtual function and implementing it separately for
      file_vma.
      
      Spotted by Avi Kivity.
      d72b550c
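      A sketch of the approach (class layout is illustrative): split() becomes a
      virtual function so that a file-backed mapping can override it.

        class vma {
        public:
            virtual void split(uintptr_t edge);          // anonymous-memory behaviour
        };

        class file_vma : public vma {
        public:
            virtual void split(uintptr_t edge) override; // also splits the file offset
                                                         // for the newly created vma
        };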
    • DHCP: Fix crash · 68f4d147
      Nadav Har'El authored
      Rarely (about once every 20 runs) we had OSv crash during boot, in the
      DHCP code. It turns out that the code first sends out the DHCP requests,
      and then creates a thread to handle the replies. When a reply arrives,
      the code wake()s the thread, but on rare occasions the thread hasn't yet
      been set up (it is still a null pointer), so we crash.
      
      Fix this by reversing the order - first create the reply handling thread,
      and only then send the request.
      68f4d147
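      A sketch of the reordering, assuming a sched::thread-style API roughly like
      OSv's (the names are illustrative):

        _worker = new sched::thread([this] { handle_replies(); });
        _worker->start();        // the reply handler exists before any packet is
        send_dhcp_requests();    // sent, so wake() can never hit a null pointer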
  15. Sep 08, 2013
    • Scheduler: Fix load-balancer bug · e9f0cf29
      Nadav Har'El authored
      The load_balance() code checks whether another CPU has fewer threads in its
      run queue than this CPU, and if so, migrates one of this CPU's threads
      to the other CPU.

      However, when we count this core's runnable threads, we overcount by 1,
      because as soon as load_balance() goes back to sleep, one of the
      runnable threads will start running. So if this core has just one more
      runnable thread than some remote core, the two are actually even, and
      in that case we should *not* migrate a thread.
      
      Overcounting the number of threads on the core running load_balance
      caused bad performance in 2-core and 2-thread SpecJVM: Normally, the
      size of the run queue on each core is 1 (each core is running one of
      the two threads, and on the run queue we have the idle thread). But
      when load_balance runs it sees 2 runnable threads (the idle thread and
      the preempted benchmark thread), and the second core has just 1, so
      it decides to migrate one of its threads to the second CPU. When this
      is over, the second CPU has both benchmark threads, and the first CPU
      has nothing, and this will only be fixed some time later when the
      second CPU's load_balance thread runs, and later the balance will be
      ruined again. All the time the two threads spend on the same CPU
      significantly hurts performance, and on the host's "top" we see qemu
      taking just 120%-150% instead of 200% as it should (and as it does
      after this patch).
      e9f0cf29
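      A sketch of the corrected comparison (the helper names are hypothetical):

        int local_runnable  = local_runqueue_size();       // includes the thread that
        int remote_runnable = smallest_remote_runqueue();  // resumes when we sleep
        if (local_runnable - 1 <= remote_runnable) {
            return;                 // effectively balanced - do not migrate
        }
        migrate_one_thread_to_least_loaded_cpu();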
    • Scheduler: Avoid vruntime jump when clock jumps · 253e4536
      Nadav Har'El authored
      Currently, clock::get()->time() jumps (by system_time(), i.e., the host's
      uptime) at some point during the initialization. This can be a huge jump
      (e.g., a week if the host's uptime is a week). Fixing this jump is hard,
      so we'd rather just tolerate it.
      
      reschedule_from_interrupt() handles this clock jump badly: the current_run
      it calculates - the amount of time the current thread has run - ends up
      including the jump that happened while the thread was running. In the
      above example, a run time of
      a whole week is wrongly attributed to some thread, and added to its vruntime,
      causing it not to be scheduled again until all other threads yield the
      CPU.
      
      The fix in this patch is to limit the vruntime increase after a long
      run to max_slice (10ms). Even if a thread runs for longer (or just thinks
      it ran for longer), it won't be "penalized" in its dynamic priority more
      than a thread that ran for 10ms. Note that this cap makes sense, as
      cpu::enqueue already enforces a similar limit on the vruntime "bonus"
      of a woken thread, and this patch works toward a similar goal (avoid
      giving one thread a huge bonus because another thread was given a huge
      penalty).
      
      This bug is very visible in the CPU-bound SPECjvm2008 benchmarks, when
      running two benchmark threads on two virtual cpus. As it happens, the
      load_balancer() is the one that gets the huge vruntime increase, so
      it doesn't get to run until no other thread wants to run. Because we start
      with both CPU-bound threads on the same CPU, and these hardly yield the
      CPU (and even more rarely are the two threads sleeping at the same time),
      the load balancer thread on this CPU doesn't get to run, and the two threads
      remain on the same CPU, giving us halved performance (2-cpu performance
      identical to 1-cpu performance) and on the host we see qemu using 100% cpu,
      instead of 200% as expected with two vcpus.
      253e4536
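      A sketch of the cap (variable names are illustrative):

        auto current_run = now - _running_since;
        if (current_run > max_slice) {    // max_slice is 10ms, per the message above
            // A clock jump (or an unusually long run) must not become a huge
            // vruntime penalty.
            current_run = max_slice;
        }
        _vruntime += current_run;         // possibly scaled by priority in the real code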
  16. Sep 03, 2013
    • irq_lock: avoid 'irq_lock defined but not used' warning · 90390cca
      Avi Kivity authored
      In an attempt to be clever, we define irq_lock as an object in an anonymous
      namespace, so that each translation unit gets its own copy, which is then
      optimized away, since the object is never touched.  But the compiler complains
      that the object is defined but not used if we include the file but don't
      use irq_lock.
      
      Simplify by only declaring the object there, and defining it somewhere else.
      90390cca
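      A sketch of the before/after (the type name is illustrative):

        // Before: every translation unit that included the header got its own,
        // never-used copy, which triggers "defined but not used" warnings:
        //     namespace { irq_lock_type irq_lock; }

        // After: declare in the header ...
        extern irq_lock_type irq_lock;

        // ... and define it once, in a single .cc file:
        irq_lock_type irq_lock;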
  17. Sep 02, 2013
    • mmu: msync for file-backed memory maps · 1691c89d
      Pekka Enberg authored
      This adds a simple msync() implementation for file-backed memory maps. It
      uses the newly added 'file_vma' data structure to write out and fsync
      the msync'd region, as suggested by Avi Kivity.
      1691c89d
    • mmu: File-backed VMAs · 3510a5ea
      Pekka Enberg authored
      Add a new 'file_vma' class that extends 'vma'. This is needed to keep
      track of fileref and offset for file-backed VMAs for msync().
      3510a5ea
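      A minimal sketch of what the two entries above describe (all names are
      illustrative): a vma subclass that remembers its backing file and offset,
      so msync() can write the range back and fsync the file.

        class file_vma : public vma {
        public:
            file_vma(addr_range range, unsigned perm, fileref file, f_offset offset)
                : vma(range, perm), _file(file), _offset(offset) {}
            void sync(uintptr_t start, uintptr_t end) {
                write_back(_file, start, end - start, _offset);  // hypothetical helpers:
                fsync_file(_file);                               // write out, then fsync
            }
        private:
            fileref _file;      // backing file
            f_offset _offset;   // offset of the mapping within the file
        };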
  18. Aug 29, 2013
  19. Aug 27, 2013
    • Fix mincore() on non-mmap()ed memory · 6924f7db
      Nadav Har'El authored
      Commit 65afd075 fixed mincore() to recognize
      unmapped addresses. However, it used mmu::ismapped() which just checks for
      mmap()'ed addresses, and doesn't know about malloc()ed memory. This causes
      trouble for libunwind (which we use for backtrace()), which tests mincore()
      on an on-stack variable; for non-pthread threads, this stack might be
      malloc'ed, not mmap'ed.
      
      So this patch adds mmu::isreadable(), which checks that a given memory range
      is all readable (this memory can be mmapped, malloced, stack, whatever).
      mincore() now uses that.
      
      mmu::isreadable() is implemented, following Avi's idea, by trying to read,
      with safe_load(), one byte from every page in the range. This approach is
      faster than walking the page tables, especially for one-byte checks (which
      is all libunwind uses anyway), and also very simple.
      6924f7db
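      A sketch of the approach, assuming a safe_load() that returns false instead
      of faulting when the address is unreadable (names are illustrative):

        bool isreadable(const void* addr, size_t size)
        {
            constexpr uintptr_t page_size = 4096;       // x86-64 small page
            auto p = reinterpret_cast<uintptr_t>(addr);
            auto end = p + size;
            while (p < end) {
                char byte;
                if (!safe_load(reinterpret_cast<const char*>(p), byte)) {
                    return false;                       // this page would fault
                }
                p = (p & ~(page_size - 1)) + page_size; // next page boundary
            }
            return true;
        }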
    • mempool.c: trace large allocations · 0a798e4d
      Glauber Costa authored
      Most of the performance problems I have found on Xen were due to the fact that
      we were hitting malloc_large consistently, for allocations that we should be
      able to service in some other way. Because malloc_large in our implementation
      is such a bottleneck, it was very useful for me to have separate tracepoints
      for it. I am therefore proposing them for inclusion.
      0a798e4d
    • Fix deadlock in leak detector · 227eb39b
      Nadav Har'El authored
      Commit 65afd075 that fixed mincore()
      exposed a deadlock in the leak detector, caused by two threads taking
      two locks in opposite order:
      
      Thread 1:  malloc() does alloc_tracker::remember(). This takes the tracker
         lock and calls backtrace() calling mincore() which takes the
         vma_list_mutex.
      
      Thread 2: mmap() does mmu::allocate() which takes the vma_list_mutex and
         then through mmu::populate::small_page calls memory::alloc_page() which
         calls alloc_tracker::remember() and takes the tracker lock.
      
      This patch fixes this deadlock: alloc_tracker::remember() will now drop its
      lock while running backtrace(), as the lock is only needed to protect the
      allocations[] array. We need to retake the lock after backtrace() completes,
      to copy the backtrace back to the allocations[] array.
      
      Previously, the lock's depth was also (ab)used for avoiding nested
      allocation tracking (e.g., tracking of memory allocation done inside
      backtrace() itself), but now that backtrace() is run without the lock,
      we need a different mechanism - a per-thread "in_tracker" flag, which
      is turned on inside the alloc_tracker::remember()/forget() methods.
      227eb39b
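      A sketch of the remember() flow after the fix (illustrative names; the real
      code lives in core/alloctracker.cc):

        void alloc_tracker::remember(void* addr, size_t size)
        {
            if (in_tracker) {           // thread-local flag: ignore allocations
                return;                 // made by backtrace() itself
            }
            in_tracker = true;
            void* bt[20];
            int depth = backtrace(bt, 20);  // no tracker lock held here, so
                                            // mincore() may take vma_list_mutex
            {
                std::lock_guard<mutex> guard(lock);  // retake the lock only to copy
                record(addr, size, bt, depth);       // the result into allocations[]
            }
            in_tracker = false;
        }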
  20. Aug 26, 2013
    • Avoid including elf.hh from sched.hh · 714d313a
      Nadav Har'El authored
      sched.hh included elf.hh just so it could refer to the elf::tls_data
      type. But now that we have rcu.hh which includes sched.hh and therefore
      elf.hh, if we wish to use rcu in elf.hh (we'll do this in a later patch),
      we have an include loop mess.
      
      So better not include elf.hh from sched.hh, and just declare the one
      struct we need.
      
      After sched.hh no longer includes elf.hh and the dozen includes that
      it further included, we need to add missing includes to some of the
      code that included sched.hh and relied on its implicit includes.
      714d313a
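      A sketch of the change: instead of including elf.hh, sched.hh forward-
      declares the single type it refers to (assuming tls_data is a plain struct):

        namespace elf {
            struct tls_data;    // all sched.hh needs; no need to pull in elf.hh
        }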
    • mmu: don't pass really bad faults to the application · 6f464e76
      Avi Kivity authored
      Trying to execute the null pointer, or faulting within kernel code, is a
      really bad sign, and it's better to abort early in those cases.
      6f464e76
    • alloctracker: Fix forget() if remember() hasn't been called · 0affe14a
      Pekka Enberg authored
      If the leak detector is enabled after OSv startup, the first call it sees
      can be to free(), not malloc(). Fix alloctracker::forget() to deal with that.
      
      Fixes the SIGSEGV when "osv leak on" is used to enable detection from
      gdb after OSv has started up:
      
        #
        # A fatal error has been detected by the Java Runtime Environment:
        #
        #  SIGSEGV (0xb) at pc=0x00000000003b8ee6, pid=0, tid=18446673706168635392
        #
        # JRE version: 7.0_25
        # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops)
        # Problematic frame:
        # C  0x00000000003b8ee6
        #
        # Core dump written. Default location: //core or core.0
        #
        # An error report file with more information is saved as:
        # /tmp/jvm-0/hs_error.log
        #
        # If you would like to submit a bug report, please include
        # instructions on how to reproduce the bug and visit:
        #   http://icedtea.classpath.org/bugzilla
        #
        Aborted
      
        [penberg@localhost osv]$ addr2line -e build/debug/loader.elf
        0x00000000003b8ee6
        /home/penberg/osv/build/debug/../../core/alloctracker.cc:90
      0affe14a
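      A sketch of the guard (illustrative): forget() has to tolerate being called
      before any allocation has ever been remember()ed.

        void alloc_tracker::forget(void* addr)
        {
            if (!addr || !allocations) {   // free(nullptr), or tracking enabled
                return;                    // after startup with nothing recorded
            }
            // ... look up addr in allocations[] and clear its entry ...
        }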
  21. Aug 25, 2013
    • rcu: fix hang due to race while awaiting a quiescent state · ac7a8447
      Avi Kivity authored
      Waiting for a quiescent state happens in two stages: first, we request all
      cpus to schedule at least once.  Then, we wait until they do so.
      
      If, between the two stages, a cpu is brought online, then we will request
      N cpus to schedule but wait for N+1 to respond.  This of course never happens,
      and the system hangs.
      
      Fix by copying the vector that holds the cpus we signal and wait for,
      forcing the two stages to be consistent. This is safe since newly-added
      cpus cannot be accessing any rcu-protected variables before we start
      signalling.
      
      Fixes random hangs with rcu, mostly seen with 'perf callstack'
      ac7a8447
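      A sketch of the fix (the helper names are hypothetical): take one snapshot
      of the cpu list and use it for both stages, so a cpu brought online in
      between is neither signalled nor waited for.

        std::vector<sched::cpu*> snapshot(sched::cpus.begin(), sched::cpus.end());
        for (auto c : snapshot) {
            request_reschedule(c);       // stage 1: ask each cpu to schedule once
        }
        for (auto c : snapshot) {
            wait_until_rescheduled(c);   // stage 2: wait for exactly those cpus
        }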
  22. Aug 19, 2013
  23. Aug 18, 2013
  24. Aug 16, 2013
    • sched: Avoid IPIs in thread::wake() · 71fec998
      Pekka Enberg authored
      Avoid sending an IPI to a CPU that's already being woken up by another
      IPI.  This reduces IPIs by 17% for a cassandra-stress run. Execution
      time is obviously unaffected because execution is bound by lock
      contention.
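
      A sketch of the idea (the flag and helper below are hypothetical, not the
      actual OSv code): only the first waker sends the IPI; later wakers see the
      flag already set and skip it.

        if (!target_cpu->wakeup_ipi_sent.exchange(true)) {
            target_cpu->send_wakeup_ipi();
        }
        // the target cpu clears wakeup_ipi_sent once it handles the wakeup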
      
      Before:
      
      [penberg@localhost ~]$ sudo perf kvm stat -e kvm:* -p `pidof qemu-system-x86_64`
      ^C
       Performance counter stats for process id '610':
      
               6,909,333 kvm:kvm_entry
                       0 kvm:kvm_hypercall
                       0 kvm:kvm_hv_hypercall
               1,035,125 kvm:kvm_pio
                       0 kvm:kvm_cpuid
               5,149,393 kvm:kvm_apic
               6,909,369 kvm:kvm_exit
               2,108,440 kvm:kvm_inj_virq
                       0 kvm:kvm_inj_exception
                     982 kvm:kvm_page_fault
               2,783,005 kvm:kvm_msr
                       0 kvm:kvm_cr
                   7,354 kvm:kvm_pic_set_irq
               2,366,388 kvm:kvm_apic_ipi
               2,468,569 kvm:kvm_apic_accept_irq
               2,067,044 kvm:kvm_eoi
               1,982,000 kvm:kvm_pv_eoi
                       0 kvm:kvm_nested_vmrun
                       0 kvm:kvm_nested_intercepts
                       0 kvm:kvm_nested_vmexit
                       0 kvm:kvm_nested_vmexit_inject
                       0 kvm:kvm_nested_intr_vmexit
                       0 kvm:kvm_invlpga
                       0 kvm:kvm_skinit
                   3,677 kvm:kvm_emulate_insn
                       0 kvm:vcpu_match_mmio
                       0 kvm:kvm_update_master_clock
                       0 kvm:kvm_track_tsc
                   7,354 kvm:kvm_userspace_exit
                   7,354 kvm:kvm_set_irq
                   7,354 kvm:kvm_ioapic_set_irq
                     674 kvm:kvm_msi_set_irq
                       0 kvm:kvm_ack_irq
                       0 kvm:kvm_mmio
                 609,915 kvm:kvm_fpu
                       0 kvm:kvm_age_page
                       0 kvm:kvm_try_async_get_page
                       0 kvm:kvm_async_pf_doublefault
                       0 kvm:kvm_async_pf_not_present
                       0 kvm:kvm_async_pf_ready
                       0 kvm:kvm_async_pf_completed
      
            81.180469772 seconds time elapsed
      
      After:
      
      [penberg@localhost ~]$ sudo perf kvm stat -e kvm:* -p `pidof qemu-system-x86_64`
      ^C
       Performance counter stats for process id '30824':
      
               6,411,175 kvm:kvm_entry                                                [100.00%]
                       0 kvm:kvm_hypercall                                            [100.00%]
                       0 kvm:kvm_hv_hypercall                                         [100.00%]
                 992,454 kvm:kvm_pio                                                  [100.00%]
                       0 kvm:kvm_cpuid                                                [100.00%]
               4,300,001 kvm:kvm_apic                                                 [100.00%]
               6,411,133 kvm:kvm_exit                                                 [100.00%]
               2,055,189 kvm:kvm_inj_virq                                             [100.00%]
                       0 kvm:kvm_inj_exception                                        [100.00%]
                   9,760 kvm:kvm_page_fault                                           [100.00%]
               2,356,260 kvm:kvm_msr                                                  [100.00%]
                       0 kvm:kvm_cr                                                   [100.00%]
                   3,354 kvm:kvm_pic_set_irq                                          [100.00%]
               1,943,731 kvm:kvm_apic_ipi                                             [100.00%]
               2,047,024 kvm:kvm_apic_accept_irq                                      [100.00%]
               2,019,044 kvm:kvm_eoi                                                  [100.00%]
               1,949,821 kvm:kvm_pv_eoi                                               [100.00%]
                       0 kvm:kvm_nested_vmrun                                         [100.00%]
                       0 kvm:kvm_nested_intercepts                                    [100.00%]
                       0 kvm:kvm_nested_vmexit                                        [100.00%]
                       0 kvm:kvm_nested_vmexit_inject                                 [100.00%]
                       0 kvm:kvm_nested_intr_vmexit                                   [100.00%]
                       0 kvm:kvm_invlpga                                              [100.00%]
                       0 kvm:kvm_skinit                                               [100.00%]
                   1,677 kvm:kvm_emulate_insn                                         [100.00%]
                       0 kvm:vcpu_match_mmio                                          [100.00%]
                       0 kvm:kvm_update_master_clock                                  [100.00%]
                       0 kvm:kvm_track_tsc                                            [100.00%]
                   3,354 kvm:kvm_userspace_exit                                       [100.00%]
                   3,354 kvm:kvm_set_irq                                              [100.00%]
                   3,354 kvm:kvm_ioapic_set_irq                                       [100.00%]
                     927 kvm:kvm_msi_set_irq                                          [100.00%]
                       0 kvm:kvm_ack_irq                                              [100.00%]
                       0 kvm:kvm_mmio                                                 [100.00%]
                 620,278 kvm:kvm_fpu                                                  [100.00%]
                       0 kvm:kvm_age_page                                             [100.00%]
                       0 kvm:kvm_try_async_get_page                                   [100.00%]
                       0 kvm:kvm_async_pf_doublefault                                 [100.00%]
                       0 kvm:kvm_async_pf_not_present                                 [100.00%]
                       0 kvm:kvm_async_pf_ready                                       [100.00%]
                       0 kvm:kvm_async_pf_completed
      
            79.947992238 seconds time elapsed
      71fec998