Commits · fbd7d0846f3daa980b7182ca08a7f83248d24853 · Verlässliche Systemsoftware / projects / osv

Feb 12, 2014

README: maven added to Fedora prerequisites · fbd7d084

Dmitry Fleytman authored 11 years ago


Build failed on my Fedora 20 due to lack of maven package.

Reviewed-by: Zhi Yong Wu <zwu.kernel@gmail.com>
Signed-off-by: Dmitry Fleytman <dmitry@daynix.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

fbd7d084

vfs: remove the dead code in namei() · d87b5c2b

Zhi Yong Wu authored 11 years ago


When control flow reaches the bottom inner loop in namei(), the pointer p
will point to either a '\0' or a '/' character because of the upper inner
loop's break condition:


        for (i = 0; i < PATH_MAX; i++) {
            if (*p == '\0' || *p == '/') {
                break;
            }
            name[i] = *p++;
        }

So the "while" loop will never be executed and we can eliminate it as dead
code.
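
The argument can be checked with a small standalone sketch. The helper
copy_component() below is hypothetical (it is not the OSv function) and
leaves one byte for the terminator, which the quoted loop does not. It copies
one path component the same way and returns the position where copying
stopped. For components shorter than the buffer, that position always holds
'\0' or '/'.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper mirroring the loop quoted above: copy one path
// component into name and return the pointer where copying stopped.
const char* copy_component(const char* p, char* name, std::size_t max)
{
    assert(max > 0);
    std::size_t i;
    for (i = 0; i < max - 1; i++) {
        if (*p == '\0' || *p == '/') {
            break;
        }
        name[i] = *p++;
    }
    name[i] = '\0';
    return p;
}

// True when the loop stopped on a terminator or a separator, i.e. when a
// following "skip ahead to the next '/'" loop would have nothing to do.
bool stopped_on_boundary(const char* path)
{
    char name[64];
    const char* p = copy_component(path, name, sizeof(name));
    return *p == '\0' || *p == '/';
}
```

For example, stopped_on_boundary("usr/lib") stops on the '/' and
stopped_on_boundary("file") stops on the terminating '\0'.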

Reviewed-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Zhi Yong Wu <zwu.kernel@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

d87b5c2b

run.py: Enable --sata option · adf23e05

Asias He authored 11 years ago


Add a --sata option to use the AHCI driver instead of virtio-blk for QEMU.
It normally makes no sense to use a SATA device instead of a virtio-blk
device, so this option is mainly for testing purposes.

Signed-off-by: Asias He <asias@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

adf23e05

ahci: Initial support · 1ee8e2d1

Asias He authored 11 years ago


AHCI is supported on various VMMs, e.g. VirtualBox and VMware Workstation.
Adding AHCI support enables OSv to run on them if the para-virtualized
block device is not present or not supported yet.

Tested on VirtualBox, VMware Workstation and QEMU.

Signed-off-by: Asias He <asias@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

1ee8e2d1

pci: Support MSI interrupt · 708d2e61

Asias He authored 11 years ago


Currently, only MSI-X is supported in our PCI layer. Devices like the AHCI
controller support MSI interrupts only. This paves the way for the AHCI
driver.

Signed-off-by: Asias He <asias@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

708d2e61

interrupt: Support MSI interrupt in interrupt_manager · 3f2039b9

Asias He authored 11 years ago

MSI is now supported in the PCI layer. Enable MSI support in interrupt_manager.

Signed-off-by: Asias He <asias@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

3f2039b9

pci: Fix probe of BAR · ed3030f3

Asias He authored 11 years ago


The absence of the first BAR does not mean that all BARs are absent. Some
implementations of the AHCI controller have only BAR6 present, with BAR1
to BAR5 empty.

Keep probing if a BAR is not present.
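
A minimal sketch of the difference, assuming a hypothetical model in which
each of the six BAR slots is reduced to a "present" flag (bar_slots and both
function names are illustrative, not OSv code):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: present[i] says whether the i-th BAR slot is populated.
using bar_slots = std::array<bool, 6>;

// Old behaviour: give up at the first empty slot.
std::vector<std::size_t> probe_stop_at_gap(const bar_slots& present)
{
    std::vector<std::size_t> found;
    for (std::size_t i = 0; i < present.size(); i++) {
        if (!present[i]) {
            break;
        }
        found.push_back(i);
    }
    return found;
}

// Fixed behaviour: skip empty slots and keep probing the remaining ones.
std::vector<std::size_t> probe_all(const bar_slots& present)
{
    std::vector<std::size_t> found;
    for (std::size_t i = 0; i < present.size(); i++) {
        if (!present[i]) {
            continue;
        }
        found.push_back(i);
    }
    return found;
}
```

With only the last slot populated, probe_stop_at_gap() finds nothing while
probe_all() reports that slot.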

Signed-off-by: Asias He <asias@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

ed3030f3

git: ignore java build targets · 6c5a00b3

Or Cohen authored 11 years ago


Reviewed-by: Tomasz Grabiec <tgrabiec@gmail.com>
Signed-off-by: Or Cohen <orc@fewbytes.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

6c5a00b3

Feb 11, 2014

epoll: Support epoll()'s EPOLLET · d41d748f

Nadav Har'El authored 11 years ago

This patch adds support for epoll()'s edge-triggered mode, EPOLLET.
Fixes #188.

As explained in issue #188, Boost's asio uses EPOLLET heavily, and we use
that library in our management http server, and also in our image creation
tool (cpiod.so). By ignoring EPOLLET, like we did until now, the code worked,
but unnecessarily wasted CPU when epoll_wait() always returned immediately
instead of waiting until a new event.

This patch works within the confines of our existing poll mechanisms,
where epoll() calls poll(). We do not change this in this patch, and it
should be changed in the future (see issue #17).

In this patch we add to each struct file a field "poll_wake_count", which
as its name suggests counts the number of poll_wake()s done on this
file. Additionally, epoll remembers the last value it saw of this counter,
so that in poll_scan(), if we see that an fp (polled with EPOLLET) has
an unchanged counter from last time, we do not return readiness on this fp
regardless of whether or not it has readable data.
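
The counter comparison can be sketched as follows. This is a simplified
model, not the OSv code: file_state, epoll_entry and scan() are hypothetical
stand-ins for struct file, epoll's per-file bookkeeping and poll_scan().

```cpp
#include <cassert>

// Hypothetical stand-in for the per-file state described above.
struct file_state {
    unsigned poll_wake_count = 0;  // bumped on every poll_wake()
    bool readable = false;
};

// Hypothetical stand-in for what epoll remembers per registered file.
struct epoll_entry {
    bool edge_triggered = false;
    unsigned last_seen_count = 0;
};

void poll_wake(file_state& f)
{
    f.readable = true;
    f.poll_wake_count++;
}

// Returns true if the file should be reported as ready.
bool scan(const file_state& f, epoll_entry& e)
{
    if (!f.readable) {
        return false;
    }
    if (e.edge_triggered) {
        if (f.poll_wake_count == e.last_seen_count) {
            return false;  // no poll_wake() since the last report
        }
        e.last_seen_count = f.poll_wake_count;
    }
    return true;
}
```

In this model an edge-triggered entry is reported once per poll_wake(), while
a level-triggered entry is reported for as long as the file stays readable.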

We have a complication with EPOLLET on sockets. These have an "SB_SEL"
optimization, which avoids calling poll_wake() when it thinks the new
data is not interesting because the old data was not yet consumed, and
also avoids calling poll_wake() if fp->poll() was not previously done.
This optimization is counter-productive for EPOLLET (and causes missed
wakeups) so we need to work around it in the EPOLLET case.

This patch also adds a test for the EPOLLET case in tst-epoll.cc. The test
runs on both OSv and Linux, and can confirm that in the tested scenarios,
Linux and OSv behave the same, including even one same false-positive:
When epoll_wait() tells us there is data in a pipe, and we don't read it,
but then more data comes on the pipe, epoll_wait() will again return a new
event, even though this is not really an edge event (the pipe didn't
change from empty to not-empty, as it was previously not-empty as well).

Concluding remarks:

The primary goal of this implementation is to stop EPOLLET epoll_wait()
from returning immediately despite nothing having happened on the file.
That was what caused the 100% CPU use before this patch. That being said,
the goal of this patch is NOT to avoid all false-positives or unnecessary
wakeups; when events do occur on the file, we may be doing a bit more
wakeups than strictly necessary. I think this is acceptable (our epoll()
has worse problems) but for posterity, I want to explain:

I already mentioned above one false-positive that also happens on Linux.
Another false-positive wakeup that remains is in one of EPOLLET's classic
use cases: Consider several threads sleeping on epoll() on the same socket
(e.g., TCP listening socket, or UDP socket). When one packet arrives, normal
level-triggered epoll() will wake all the threads, but only one will read
the packet and the rest will find they have nothing to read. With edge-
triggered epoll, only one thread should be woken and the rest would not.
But in our implementation, poll_wake() wakes up *all* the pollers on this
file, so we cannot currently support this optimization.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

d41d748f

msix: thread affinity · b4e8d47d

Vlad Zolotarov authored 11 years ago


Instead of binding all msix interrupts to cpu 0, have them chase the
interrupt service routine thread and pin themselves to the same cpu.

This patch is based on the patch from Avi Kivity <avi@cloudius-systems.com>
and used some ideas of Nadav Har'El <nyh@cloudius-systems.com>.

It improves the performance of the single thread Rx netperf test by 16%:
before - 25694 Mbps
after  - 29875 Mbps

New in V2:
 - Dropped the functor class - use lambda instead.
 - Fixed the race in a waking flow.
 - Added some comments.
 - Added the performance numbers to the patch description.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

b4e8d47d

gdb: fix "osv mmap" and friends · ce177a50

Nadav Har'El authored 11 years ago


It appears that in GDB, (mmu::vma*)0 does not work, and one needs to enclose
the type's name in single quotes: ('mmu::vma'*)0. This broke the vma_list
function in scripts/loader.py, and caused an exception in "osv mmap" and
other commands using the vma_list function.

This patch adds the missing single-quotes.

I don't understand how this code ever worked for anybody...
I'm using gdb-7.6.1 from Fedora 19, if it matters.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

ce177a50

loader: move x64-specific stuff from premain · a06c22d7

Claudio Fontana authored 11 years ago


Move the arch-specific stuff in premain to
arch/x64/arch-setup.cc.

Introduce arch_init_premain() and arch_setup_tls().

arch_init_premain() is supposed to perform arch-specific
initialization before the common premain code is run.

arch_setup_tls() is run _after_ the common setup_tls code.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

a06c22d7

net: fix deadlock in net channel poll support · 16086d77

Avi Kivity authored 11 years ago


Path 1:

  poll()
   take file lock
   file::poll_install
     take socket lock

Path 2:

  sowakep() (holding socket lock)
    so_wake_poll()
      take file lock

Fix by running poll_install() outside the file lock (which isn't really
needed).
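
The lock ordering after the fix can be sketched with standard C++ mutexes.
This is an illustration only: channel, poll_path() and wake_path() are
hypothetical names, and std::mutex stands in for OSv's own locks.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-ins for the two locks named above.
struct channel {
    std::mutex file_lock;
    std::mutex socket_lock;
    int installs = 0;
    int wakeups = 0;
};

// Path 1 after the fix: the socket lock is taken only after the file lock
// has been released, so the file -> socket nesting no longer exists.
void poll_path(channel& c)
{
    {
        std::lock_guard<std::mutex> file(c.file_lock);
        // work that really needs the file lock would go here
    }
    std::lock_guard<std::mutex> sock(c.socket_lock);  // poll_install() step
    c.installs++;
}

// Path 2, unchanged: socket lock first, then file lock.
void wake_path(channel& c)
{
    std::lock_guard<std::mutex> sock(c.socket_lock);
    std::lock_guard<std::mutex> file(c.file_lock);
    c.wakeups++;
}
```

With only the socket -> file order left, the two paths can no longer each
hold one lock while waiting for the other.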

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

16086d77

elf: Fix program::lookup · c9b32230

Raphael S. Carvalho authored 11 years ago


Found the problem while running tst-resolve.so; the output follows:
Success: nonexistant = 0
Failed: debug = -443987883
Failed: condvar_wait = -443987883
The time: 1392070630
2 failures.

Bisect pointed to the commit 1dc81fe5.
After understanding the actual purpose of the changes introduced by this
commit, I figured out that program::lookup simply lacks a return when the
target symbol is found from the underlying module.
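
A simplified sketch of the pattern, assuming a hypothetical model in which
each module is just a map from symbol name to address (this is not the real
program::lookup signature):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified model: a module maps symbol names to addresses.
using module = std::map<std::string, long>;

// Returns the address of the first match, or 0 if no module defines name.
long lookup(const std::vector<module>& modules, const std::string& name)
{
    for (const auto& m : modules) {
        auto it = m.find(name);
        if (it != m.end()) {
            return it->second;  // the kind of early return the fix adds
        }
    }
    return 0;
}
```

Without the early return, a successful match inside the loop would be lost
and the caller would receive whatever the function returns at its end.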

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

c9b32230

Feb 10, 2014

run.py: map images using loop devices for xen guests · 8b39f8ca

Dmitry Fleytman authored 11 years ago


There are two Xen block device backend implementations.
The first is a host kernel driver (xen_blkback in Linux) and the other
is an anti-driver implemented in qemu-dm, used by Xen HVM guests.

The xl toolset used by run.py selects which implementation to use
based on the following rules (simplified to avoid irrelevant details):

  1. block device specified as storage - use host kernel driver
  2. file specified as storage - use QEMU anti-driver

While Linux xen_blkback is highly optimized and supports all newly
introduced features, the QEMU implementation is rather simple and outdated.

This patch forces xl to use xen_blkback driver by mapping OSv image file
as a loop (block) device.

Write speed and latency measurement results (image on RAM disk):

+++ image file approach (before this patch) +++

misc-bdev-write.so:

109.434 Mb/s
107.822 Mb/s
106.684 Mb/s
102.080 Mb/s
111.211 Mb/s
117.465 Mb/s
107.311 Mb/s
115.834 Mb/s
Wrote 1099.867 MB in 10.03 s = 109.689 Mb/s

misc-bdev-wlatency.so:

Min      50%      90%      99%      99.99%   99.999%  Max     [msec]
---      ---      ---      ---      ------   -------  ---
0.1000   0.1121   0.1079   0.1245   0.1309   0.2199   0.4412

+++ loop device approach (with this patch) +++

misc-bdev-write.so:

OSv v0.05-193-g021dad4
444.600 Mb/s
579.262 Mb/s
547.984 Mb/s
615.998 Mb/s
519.428 Mb/s
535.732 Mb/s
471.388 Mb/s
Wrote 5535.938 MB in 10.40 s = 532.126 Mb/s

misc-bdev-wlatency.so:

Min      50%      90%      99%      99.99%   99.999%  Max     [msec]
---      ---      ---      ---      ------   -------  ---
0.0304   0.0362   0.0341   0.0394   0.0487   0.0781   0.1331

Signed-off-by: Dmitry Fleytman <dmitry@daynix.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

8b39f8ca

tests/misc-bdev-write: Introduce offset limitation · 89c5c1a2

Dmitry Fleytman authored 11 years ago


Useful for testing on RAM disks when writes are fast enough
to fill the whole image in less than test execution time.

Signed-off-by: Dmitry Fleytman <dmitry@daynix.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

89c5c1a2

x64: nested exceptions · 89464d6a

Avi Kivity authored 11 years ago


The scenario

  run elf file
    demand fault
      allocation
        leak detector tracking
          backtrace_safe()
            page fault

leads to a nested exception.  Add support for it by allocating an extra
stack.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

89464d6a

memory: switch leak detector to backtrace_safe() · 0568e8fe

Avi Kivity authored 11 years ago


The scenario

  run elf image
    demand page fault
      allocation
        leak detector tracking
           backtrace()
             access dwarf tables

leads to a nested demand page fault, which we don't (and probably can't)
support.  Switch to backtrace_safe(), which is of lower quality, but is safer.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

0568e8fe

java: fix NPE when context is called from a thread attached via JNI · 41f04a1c

Tomasz Grabiec authored 11 years ago

In such a case the current context was not initialized. The fix is to default to the master context.

Reported-by: Amnon Heiman <amnon@cloudius-systems.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

41f04a1c

Don't completely ignore submodules · 89b9b3e8

Avi Kivity authored 11 years ago


Ignore a dirty work tree, but not a wrong HEAD; ignoring submodules
completely makes it impossible to update them.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

89b9b3e8

Update openjdk.bin submodule · 5767c948

Avi Kivity authored 11 years ago


Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

5767c948

Merge branch 'net_channel' · 2828ef50

Avi Kivity authored 11 years ago


Net channel implementation.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

2828ef50

README.md: Add maven as dependency to compile on Debian · 4293ad37

Rodrigo Campos authored 11 years ago


Without maven, "/bin/sh: 1: mvn: not found" error is shown when trying to
compile.

Reviewed-by: Tomasz Grabiec <tgrabiec@gmail.com>
Signed-off-by: Rodrigo Campos <rodrigo@sdfg.com.ar>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

4293ad37

mmu: map file mappings with clear ptes. · f6717dbc

Gleb Natapov authored 11 years ago

When one does not care about the dirty bit in a page table, it is beneficial
to map everything as dirty, since this relieves the hardware from setting the
bit on first access. But file mappings now use the dirty bit to reduce
write-back during sync, so let's map them clean initially. Do this only if
the file mapping is shared, since sync is disabled for private file mappings.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

f6717dbc

mmu: enable on demand file paging · 1a9de4fb

Gleb Natapov authored 11 years ago

Now everything is ready to enable on demand file paging, so disable
unconditional populate on mmap for files and provide fault() for file
backed vmas. In fact file vma fault() is not different from anon vma
fault() so provide one general vma::fault() handler for both.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

1a9de4fb

mmu: write back only dirty pages during file sync · b6a1aa2e

Gleb Natapov authored 11 years ago


Walk page table and write out only dirty pages during file sync instead
of writing back entire mapping.
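
A minimal sketch of the idea, assuming a hypothetical flattened view of the
mapping with one entry per page. Clearing the dirty bit after write-back is
an assumption of this sketch; the commit message does not say whether the
patch does so.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, flattened view of a mapping: one entry per page.
struct pte {
    bool present;
    bool dirty;
};

// Writes back only the dirty pages, clears their dirty bit (an assumption
// of this sketch) and returns the indices of the pages that were written.
std::vector<std::size_t> sync_dirty(std::vector<pte>& ptes)
{
    std::vector<std::size_t> written;
    for (std::size_t i = 0; i < ptes.size(); i++) {
        if (ptes[i].present && ptes[i].dirty) {
            written.push_back(i);  // stand-in for the real write-back
            ptes[i].dirty = false;
        }
    }
    return written;
}
```

A second sync with no writes in between then has nothing to write back.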

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

b6a1aa2e

mmu: make map_file_page object reusable. · bce41369

Gleb Natapov authored 11 years ago


Currently a map_file_page object can be used only once, since its internal
state changes during the mapping operation. For demand paging we do not want
to allocate/delete an object each time a mapping is needed, so, to make the
map_file_page object reusable, this patch resets its state to the starting
one during finalize().

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

bce41369

mmu: separate page allocation from page mapping process · 55693e5c

Gleb Natapov authored 11 years ago

Currently pages are allocated just before they are mapped, but sometimes
we want to map preallocated pages (for shared mapping or mapping page
cache page for instance). This patch moves page acquisition process into
separate class. This will allow us to add subclass that provides pages
from shared pages pool or from a page cache.
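
The separation can be sketched as an interface plus interchangeable
implementations. All class and function names below are hypothetical and are
not taken from the OSv sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

// Hypothetical interface: where a page comes from is decided here,
// independently of how the page gets mapped.
class page_provider {
public:
    virtual ~page_provider() = default;
    virtual void* get_page() = 0;
};

// Allocates a fresh zeroed page on demand; the caller owns the result.
class fresh_page_provider : public page_provider {
public:
    static constexpr std::size_t page_size = 4096;
    void* get_page() override { return std::calloc(1, page_size); }
};

// Hands out pages that were allocated earlier, e.g. from a shared pool.
class pool_page_provider : public page_provider {
public:
    explicit pool_page_provider(std::vector<void*> pages)
        : _pages(std::move(pages)) {}
    void* get_page() override
    {
        if (_pages.empty()) {
            return nullptr;
        }
        void* p = _pages.back();
        _pages.pop_back();
        return p;
    }
private:
    std::vector<void*> _pages;
};

// The mapping step only sees the interface. Returns how many pages it got.
std::size_t map_pages(page_provider& provider, std::size_t n,
                      std::vector<void*>& mapped)
{
    std::size_t count = 0;
    for (; count < n; count++) {
        void* p = provider.get_page();
        if (!p) {
            break;
        }
        mapped.push_back(p);  // stand-in for installing the page table entry
    }
    return count;
}
```

map_pages() works unchanged with either provider, which is the point of
splitting acquisition from mapping.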

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

55693e5c

mmu: rely on short read to figure out what should be zeroed during file_vma populate · 8336e392

Gleb Natapov authored 11 years ago


Makes code easier to understand.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

8336e392

mmu: fix fill_file_page starting read offset · 0b098b93

Gleb Natapov authored 11 years ago

Currently fill_file_page is used to populate an entire vma, so the starting
offset and the mapped file offset are always the same. If fill_file_page is
used to populate only part of a vma, this will no longer necessarily be true.
Keep track of the lowest file offset populated and use it for the read.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

0b098b93

mmu: fix iovecs initialization in fill_file_page · cd8dc4a5

Gleb Natapov authored 11 years ago


The value passed to the std::vector constructor increases not only the
vector's capacity() but also its size(). We want only the former, not the
latter, which is achieved by the reserve() function.

Also remove prev_off, which is a leftover from debug code.
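
The difference between the two calls is standard C++ behaviour and can be
shown in isolation. The sketch uses a vector of int rather than iovec to stay
portable; the helper names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct sizes {
    std::size_t size;
    std::size_t capacity;
};

// The size constructor creates n value-initialized elements.
sizes with_constructor(std::size_t n)
{
    std::vector<int> v(n);
    return {v.size(), v.capacity()};
}

// reserve() only makes room for n elements; the vector stays empty.
sizes with_reserve(std::size_t n)
{
    std::vector<int> v;
    v.reserve(n);
    return {v.size(), v.capacity()};
}
```

A later push_back() on the constructor variant appends after the n existing
elements instead of filling the first slot, which is the kind of surprise the
switch to reserve() avoids.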

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

cd8dc4a5

mmu: fix offset calculation in page mapper · d8b67a3f

Gleb Natapov authored 11 years ago

The generic page mapper code provides an offset to the mapper's
small/huge_page callbacks. To be meaningful, this should be an offset
from the beginning of the vma being mapped, but currently it is an offset
from the starting address of the current mapping operation, which means it
will always be zero if mapping is done page by page. Fix this by passing the
starting vma address to the page mapper for correct offset calculation.
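
A small worked example of the two calculations, with hypothetical function
names and a 4096-byte page size assumed for illustration:

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uintptr_t page_size = 4096;

// Offset as computed before the fix: relative to where this particular
// mapping operation started.
std::uintptr_t offset_from_operation(std::uintptr_t op_start,
                                     std::uintptr_t page_addr)
{
    return page_addr - op_start;
}

// Offset after the fix: relative to the start of the vma being mapped.
std::uintptr_t offset_from_vma(std::uintptr_t vma_start,
                               std::uintptr_t page_addr)
{
    return page_addr - vma_start;
}
```

When the third page of a vma is mapped on its own, the operation starts at
that page, so the old calculation yields 0 while the vma-relative one yields
2 * page_size.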

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

d8b67a3f

mmu: fix vma execution permission checking · f9e68a26

Gleb Natapov authored 11 years ago


Trying to execute unmapped page is not a punishable offense as long as
vma permissions are correct, so check them.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

f9e68a26

tests: fix test_malloc to use condvar for synchronisation. · 146690d1

Gleb Natapov authored 11 years ago

It currently uses the low-level thread::wait_until(), which calls the
caller-supplied predicate with preemption disabled. If the caller-supplied
code accesses memory that is not yet mapped, it will trigger an assertion on
the page fault path.
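
A sketch of condvar-based synchronisation, using std::condition_variable as a
stand-in for OSv's own condvar (the ready_flag class is hypothetical). The
predicate runs in ordinary thread context with only the mutex held, so it may
touch memory that still has to be faulted in.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Minimal flag that one thread sets and another thread waits for.
class ready_flag {
public:
    void set()
    {
        {
            std::lock_guard<std::mutex> guard(_mtx);
            _ready = true;
        }
        _cv.notify_all();
    }

    // Blocks until set() has been called.
    void wait()
    {
        std::unique_lock<std::mutex> lock(_mtx);
        _cv.wait(lock, [this] { return _ready; });
    }

    bool is_set()
    {
        std::lock_guard<std::mutex> guard(_mtx);
        return _ready;
    }

private:
    std::mutex _mtx;
    std::condition_variable _cv;
    bool _ready = false;
};
```

In a test, the worker thread would call set() when its allocation work is
done and the main thread would call wait() before checking the results.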

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

146690d1

Merge branch 'gce' of https://github.com/asias/osv · 75b7a789

Avi Kivity authored 11 years ago


Google Cloud Engine enabling.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

75b7a789

reclaimer: name reclaimer thread · d9e6966f

Glauber Costa authored 11 years ago


Now that we are showing names in gdb, it helps to name all of them.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

d9e6966f

Convert tst-epoll.cc to standard C++ · b406ae7a

Nadav Har'El authored 11 years ago


Replace OSv-specific constructs in tst-epoll.cc by their standard C++
counterparts (i.e., std::thread, std::chrono, std::cout).
This test now also runs (and of course, succeeds) on Linux.

In general, it is important for all our Linux-ABI tests (where we test our
implementation of the Linux/glibc functionality) to be able to run on Linux
as well. Otherwise, it is possible our tests don't actually test the right
thing (we may test for some expected behavior, but the actual behavior on
Linux is different).

I'm doing this in preparation for fixing issue #188 (fix edge-triggered
epoll).

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

b406ae7a

syscall(): Make abort message visible · b1d5e081

Nadav Har'El authored 11 years ago

When syscall() was asked to perform an unknown syscall, we printed a
message with debug() and then called abort(). However, since recently, the
message from debug() is no longer visible on the console.

So pass the message directly to abort(), to do the right thing with it.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

b1d5e081

Makefile: Fix make clean · bd198d61

Raphael S. Carvalho authored 11 years ago


The line for mgmt was building it instead of cleaning
it up.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

bd198d61

ide: Fix ide probe · d73d5842

Asias He authored 11 years ago


In ide_drive::probe(), all PCI_CLASS_STORAGE devices are probed,
including the virtio-scsi device. This way the IDE driver hijacks
virtio-scsi handling.

Fix it by matching only PCI_CLASS_STORAGE && PCI_SUB_CLASS_STORAGE_IDE
devices.
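
The tightened match can be sketched as below. The numeric values are the
standard PCI class codes (mass storage 0x01, SCSI subclass 0x00, IDE subclass
0x01). The pci_ids struct, the function names and the
PCI_SUB_CLASS_STORAGE_SCSI constant are hypothetical, and the sketch assumes
the virtio-scsi device reports the SCSI subclass.

```cpp
#include <cassert>
#include <cstdint>

// Standard PCI class codes; the first and third names follow the commit text.
constexpr std::uint8_t PCI_CLASS_STORAGE = 0x01;
constexpr std::uint8_t PCI_SUB_CLASS_STORAGE_SCSI = 0x00;  // hypothetical name
constexpr std::uint8_t PCI_SUB_CLASS_STORAGE_IDE = 0x01;

// Hypothetical reduced view of a PCI function's class information.
struct pci_ids {
    std::uint8_t base_class;
    std::uint8_t sub_class;
};

// Old match: any mass storage device is claimed.
bool ide_matches_old(const pci_ids& d)
{
    return d.base_class == PCI_CLASS_STORAGE;
}

// New match: only IDE controllers are claimed.
bool ide_matches(const pci_ids& d)
{
    return d.base_class == PCI_CLASS_STORAGE &&
           d.sub_class == PCI_SUB_CLASS_STORAGE_IDE;
}
```

Under that assumption the old predicate claims a SCSI-class device while the
new one leaves it to its own driver.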

Signed-off-by: Asias He <asias@cloudius-systems.com>

d73d5842