Commits · 4afd087baa7b4eb77a65d6214a756e426f1ae10c · Verlässliche Systemsoftware / projects / osv

Jan 10, 2014

mempool: shrink memory when no longer used. · 4afd087b

Glauber Costa authored 11 years ago


This patch introduces the memory reclaimer thread, which I hope to use to
dispose of unused memory when pressure kicks in. "Pressure" right now is
defined to be when we have only 20% of total memory available. But that can be
revisited.

The way it will work is that each memory user that is able to dispose of its
memory will register a shrinker, and the reclaimer will loop through them.
However, the current "loop through all" only "works" because we have only one
shrinker being registered. When others appear, we need better policies to drive
how much to take, and from whom.

Memory allocation will now wait if memory is not available, instead of
aborting. The decision to abort should belong to the reclaimer and no one
else.

We should never expect to have an unbounded and, more importantly, fully
opaque number of shrinkers like Linux does. We have control of who they are and
how they behave, so I expect that we will be able to make much better decisions
in the long run.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
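
The register-and-loop scheme described in this commit can be sketched roughly as
follows. This is only an illustration, not the code from the patch: the names
shrinker, reclaimer, register_shrinker, under_pressure and reclaim are
hypothetical, and the only detail taken from the message is the 20% threshold.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical shrinker: a named callback that releases up to `target`
// bytes and reports how many bytes it actually released.
struct shrinker {
    std::string name;
    std::function<std::size_t(std::size_t target)> release;
};

// Hypothetical reclaimer that walks all registered shrinkers.
class reclaimer {
public:
    explicit reclaimer(std::size_t total_bytes) : _total(total_bytes) {
        assert(total_bytes > 0);
    }

    void register_shrinker(shrinker s) { _shrinkers.push_back(std::move(s)); }

    // "Pressure" as defined in the commit message: 20% or less of the
    // total memory is still available.
    bool under_pressure(std::size_t free_bytes) const {
        return free_bytes <= _total / 5;
    }

    // Naive "loop through all" policy: ask each shrinker in turn for
    // whatever is still missing to get back above the threshold.
    // Returns the number of bytes released.
    std::size_t reclaim(std::size_t free_bytes) const {
        const std::size_t threshold = _total / 5;
        std::size_t released = 0;
        for (const auto& s : _shrinkers) {
            if (free_bytes + released > threshold) {
                break;
            }
            std::size_t missing = threshold - (free_bytes + released) + 1;
            released += s.release(missing);
        }
        return released;
    }

private:
    std::size_t _total;
    std::vector<shrinker> _shrinkers;
};
```

With a single shrinker the loop is trivially fair; with several, the order of
registration decides who pays first, which is the policy gap the message
points out.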

semaphore: allow extending the interface · 21d9c318

Glauber Costa authored 11 years ago

Following an early suggestion from Nadav, I am trying to use semaphores for the
balloon instead of keeping our own queue. For that to work, I need to have a bit
more functionality that may not belong in the main balloon class. Namely:

1) I need to query for the presence of waiters (and maybe in the future for the
number of waiters)

2) I need a special post that would allow me to make sure that we are posting
at most as much as we're waiting for, and nothing more.

This patch turns the post method into an unlocked version (and exposes a
trivial version that just locks around it) and makes the other changes necessary
to allow subclassing.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
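
A rough sketch of the split between a locked post and an unlocked core, and of
a subclass built on top of it. This is an assumption-laden illustration using
the standard library rather than OSv's own primitives: the class names, the
waiter record, has_waiters and post_capped are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <list>
#include <mutex>

// Minimal counting semaphore whose post() is a thin locking wrapper
// around an unlocked core that subclasses can reuse.
class semaphore {
public:
    explicit semaphore(std::size_t units) : _units(units) {}
    virtual ~semaphore() = default;

    void post(std::size_t units = 1) {
        std::lock_guard<std::mutex> guard(_mtx);
        post_unlocked(units);
    }

    bool try_wait(std::size_t units = 1) {
        std::lock_guard<std::mutex> guard(_mtx);
        if (!_waiters.empty() || _units < units) {
            return false;
        }
        _units -= units;
        return true;
    }

    void wait(std::size_t units = 1) {
        std::unique_lock<std::mutex> lock(_mtx);
        if (_waiters.empty() && _units >= units) {
            _units -= units;
            return;
        }
        waiter w(units);
        _waiters.push_back(&w);
        w.cv.wait(lock, [&w] { return w.granted; });
    }

protected:
    struct waiter {
        explicit waiter(std::size_t u) : units(u) {}
        std::size_t units;
        bool granted = false;
        std::condition_variable cv;
    };

    // Caller must hold _mtx. Adds units and wakes waiters in FIFO order.
    void post_unlocked(std::size_t units) {
        _units += units;
        while (!_waiters.empty() && _waiters.front()->units <= _units) {
            waiter* w = _waiters.front();
            _waiters.pop_front();
            _units -= w->units;
            w->granted = true;
            w->cv.notify_one();
        }
    }

    std::mutex _mtx;
    std::size_t _units;
    std::list<waiter*> _waiters;
};

// Hypothetical subclass adding the two extensions the message asks for.
class capped_semaphore : public semaphore {
public:
    using semaphore::semaphore;

    bool has_waiters() {
        std::lock_guard<std::mutex> guard(_mtx);
        return !_waiters.empty();
    }

    // Post at most as many units as the current waiters still need.
    // Returns how many units were actually posted.
    std::size_t post_capped(std::size_t units) {
        std::lock_guard<std::mutex> guard(_mtx);
        std::size_t wanted = 0;
        for (const waiter* w : _waiters) {
            wanted += w->units;
        }
        std::size_t missing = wanted > _units ? wanted - _units : 0;
        std::size_t posted = std::min(units, missing);
        post_unlocked(posted);
        return posted;
    }
};
```

Keeping the core unlocked lets post_capped inspect the waiter list and post
under one lock acquisition, so the cap cannot be invalidated in between.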

mmu: account evacuated size · ab459e83

Glauber Costa authored 11 years ago


This will be useful when we shrink, so we know how much memory we newly
released for system consumption.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

mmu: make operate quantifiable. · f1cd4f8d

Glauber Costa authored 11 years ago


So far, operate works on a page range and at the very most sets a success
flag somewhere. I am here extending the API to allow it to return how much
data it manipulated.

So as an example, if we fault in 2 MB in an empty range, it will return 2 << 20.
But if we fault in the same 2 MB in a range that already contained some sparse
4k pages, we will return (2 << 20) - previous_pages (the size of the pages that
were already present).

That will be useful to count memory usage in certain VMAs.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
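
The arithmetic in the example above can be checked with a toy model. This is a
sketch under stated assumptions, not the mmu code: the address space is reduced
to a set of mapped 4k page addresses, and populate is a hypothetical stand-in
for an operation that reports how many bytes it newly mapped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

constexpr std::size_t page_size = 4096;

// Toy address space: the set of page-aligned addresses that are mapped.
using mapped_pages = std::set<std::uintptr_t>;

// Hypothetical populate operation: maps every page in [start, start + size)
// that is not mapped yet and returns the number of bytes it newly mapped.
std::size_t populate(mapped_pages& pages, std::uintptr_t start, std::size_t size) {
    assert(start % page_size == 0 && size % page_size == 0);
    std::size_t newly_mapped = 0;
    for (std::uintptr_t addr = start; addr < start + size; addr += page_size) {
        if (pages.insert(addr).second) {
            newly_mapped += page_size;
        }
    }
    return newly_mapped;
}
```

Faulting 2 MB into an empty set yields 2 << 20, while the same call on a set
that already holds three pages in that range yields (2 << 20) - 3 * 4096.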

string: add fixups for memcpy operations · 9cce0f87

Glauber Costa authored 11 years ago

When we start using the JVM balloon, our memcpy could fail for valid reasons
when the JVM is moving memory that is now in an unmapped region. To handle that,
register a fixup that will trigger a JVM call when the fault happens. If all goes
well, we will be able to continue normally.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

pci: Fix offsets in *_pci_config_* · f0aa8143

Takuya ASADA authored 11 years ago

On VMware, pci_readw(PCI_CFG_DEVICE_ID) returns the *vendor ID*.
pci_readw(PCI_CFG_VENDOR_ID) returns the vendor ID as well.

Compared to OSv, the FreeBSD implementation of PCI config space reads and
writes masks the lower bits of the offset when writing to PCI_CONFIG_ADDRESS,
and adds the lower bits of the offset to PCI_CONFIG_DATA.

http://fxr.watson.org/fxr/source/amd64/pci/pci_cfgreg.c#L206

This patch changes the access method in OSv to the FreeBSD way. Tested
on QEMU/KVM and VMware.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
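
The address and data port computation described above can be sketched as pure
functions. This is an illustration of the masking scheme only, with no port
I/O; the function names are hypothetical, while 0xcf8 and 0xcfc are the
conventional x86 ports for PCI configuration mechanism #1.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint16_t pci_config_address = 0xcf8;
constexpr std::uint16_t pci_config_data = 0xcfc;

// Value to write to the address port: enable bit, bus/slot/function, and
// the register offset with its two lowest bits masked off.
std::uint32_t config_address(std::uint8_t bus, std::uint8_t slot,
                             std::uint8_t func, std::uint8_t reg) {
    assert(slot < 32 && func < 8);
    return (1u << 31)
         | (static_cast<std::uint32_t>(bus) << 16)
         | (static_cast<std::uint32_t>(slot) << 11)
         | (static_cast<std::uint32_t>(func) << 8)
         | (reg & ~0x03u);
}

// Port to read or write afterwards: the data port plus the two lowest
// bits of the register offset.
std::uint16_t config_data_port(std::uint8_t reg) {
    return static_cast<std::uint16_t>(pci_config_data + (reg & 0x03));
}
```

The vendor ID (offset 0x00) and device ID (offset 0x02) share one dword, so
both produce the same address value and differ only in the data port, 0xcfc
versus 0xcfe.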

clock: add monotonic uptime clock · 8dffa912

Nadav Har'El authored 11 years ago


This patch starts to solve both issue #142 ("Support MONOTONIC_CLOCK")
and issue #81 (use <chrono> for time).

First, it adds an uptime() function to the "clock" interface, and
implements it for kvm/xen/hpet by returning the system time from which
we subtract the system time at boot (but not adding any correction
for wallclock).

Second, it adds a new std::chrono-based interface to this clock, in
a new header file <osv/clock.hh>. Instead of the old-style
clock::get()->uptime(), one should prefer osv::clock::uptime::now().
This returns a std::chrono::time_point which is type-safe, in the
sense that: 1. It knows what its epoch is (i.e., that it belongs to
osv::clock::uptime), and 2. It knows what its units are (nanoseconds).
This allows the compiler to prevent a user from confusing measurements
from this clock with those from other clocks, or making mistakes in
its units.

Third, this patch implements clock_gettime(CLOCK_MONOTONIC), using
the new osv::clock::uptime::now().

Note that though the new osv::clock::uptime is almost identical to
std::chrono::steady_clock, they should not be confused. The former is
actually OSv's implementation of the latter: steady_clock is implemented
by the C++11 standard library using the POSIX clock_gettime, and that
is implemented (in this patch) using osv::clock::uptime.

With this patch, we're *not* done with either issue #142 or #81.
For issue #142, i.e., for supporting CLOCK_MONOTONIC in timerfd, we
need OSv's timers to work on uptime(), not on clock::get()->time().
For issue #81, we should add an osv::clock::wall type too (similar to
what clock::get()->time() does today, but more correctly), and use either
osv::clock::wall or osv::clock::uptime everywhere that
clock::get()->time() is currently used in the code.
clock::get()->time() should be removed.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
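
To illustrate the type safety argument, here is a minimal sketch of two
std::chrono-style clocks. It is not the contents of <osv/clock.hh>: the names
uptime_clock, wall_clock and elapsed_ms are hypothetical, and the host's
steady_clock and system_clock are borrowed as time sources purely so the
example runs anywhere.

```cpp
#include <cassert>
#include <chrono>
#include <type_traits>

// Hypothetical stand-in for an uptime clock with nanosecond resolution.
struct uptime_clock {
    using duration = std::chrono::nanoseconds;
    using rep = duration::rep;
    using period = duration::period;
    using time_point = std::chrono::time_point<uptime_clock>;
    static constexpr bool is_steady = true;

    static time_point now() noexcept {
        // Assumption: use the host's steady clock as the time source.
        auto t = std::chrono::steady_clock::now().time_since_epoch();
        return time_point(std::chrono::duration_cast<duration>(t));
    }
};

// Hypothetical stand-in for a wall clock, with a different epoch.
struct wall_clock {
    using duration = std::chrono::nanoseconds;
    using rep = duration::rep;
    using period = duration::period;
    using time_point = std::chrono::time_point<wall_clock>;
    static constexpr bool is_steady = false;

    static time_point now() noexcept {
        auto t = std::chrono::system_clock::now().time_since_epoch();
        return time_point(std::chrono::duration_cast<duration>(t));
    }
};

// The two time_point types are unrelated, so mixing them does not compile.
static_assert(!std::is_same<uptime_clock::time_point,
                            wall_clock::time_point>::value,
              "each clock has its own time_point type");

// Unit conversions are explicit, so a caller cannot mistake ns for ms.
std::chrono::milliseconds elapsed_ms(uptime_clock::time_point start,
                                     uptime_clock::time_point end) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
}
```

Passing a wall_clock::time_point to elapsed_ms, or subtracting it from an
uptime_clock::time_point, is rejected at compile time.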

build: incremental make without image= argument should use the default · 5c68e049

Tomasz Grabiec authored 11 years ago


Previously, the parameter was read from the generated Makefile, which was
not regenerated on incremental builds. The fix is to move the default
to build.mk, so that the default is always picked unless overridden
by a command-line argument.

Fixes #153

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

Jan 09, 2014

Add netperf image configuration · 31874575

Tomasz Grabiec authored 11 years ago


To start netserver inside OSv just do:

  make image=netperf
  sudo scripts/run.py -nv

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

zfs: Fix on-disk data inconsistency on shutdown · 2d93af3b

Raphael S. Carvalho authored 11 years ago

This problem was found when running 'tests/tst-zfs-mount.so' multiple times.
The first time, all tests succeed; however, a subsequent run would
fail at the test 'mkdir /foo/bar', with an error message reporting
that the target file already exists.

The test basically creates a directory /foo/bar, renames it to /foo/bar2,
then removes /foo/bar2. How could /foo/bar still be there?

Quite simple. Our shutdown function calls unmount_rootfs(), which
attempts to unmount zfs with the flag MNT_FORCE. However, the flag is not
passed on to zfs_unmount(), nor does unmount_rootfs() itself check the
return status (which was always a failure previously).
So OSv is really being shut down while there is remaining data waiting to
be synced with the backing store. The result is inconsistency.

This problem was fixed by passing the flag to VFS_UNMOUNT, which will now
unmount the fs properly on sudden shutdowns.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

module.py: move test.manifest.gen processing out of jvm class · 69c49ea0

Tomasz Grabiec authored 11 years ago


Processing of this manifest was inside JVM-specific code, which meant
the manifest was not processed if there was no Java application
in the image.

For example:

  make image=empty check
  ...
  run_main(): cannot execute tests/tst-af-local.so. Powering off.
  Test tst-af-local.so FAILED
  make: *** [check] Error 1

Let's move it to the main manifest processing function.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

Jan 08, 2014

conf: make release build use -O3 · 4af010f1

Tomasz Grabiec authored 11 years ago


In some workloads it noticeably improves performance. I measured
a 6% increase in netperf throughput on my laptop.

Object file size is only slightly bloated:

 loader.elf (O2): 47246227
 loader.elf (O3): 51272625 (+8.5%)

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

Update mgmt.git version · 4366e2e1

Pekka Enberg authored 11 years ago


Changes:

  - web: Added /upload view class
  - Shell: Rewrite 'ls' and add formatting/sort flags
  - Update the jvm API to be more verbose
  - adding REST API specification: api, os, jvm

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

string: implement memmove using memcpy · 856bb361

Glauber Costa authored 11 years ago

The current implementation of memmove is a PITA (I mean the bread, of course)
to decode if a fault happens. We have very little control of where exactly in
the code the fault happens, therefore it is difficult to reason about it. This
patch implements memmove in terms of memcpy + memcpy_backwards.

For those, we can have specific fixups in the possible fault sites, that will
allow us to decode the faults with ease.

Note that originally, the only reason why the first branch was not a memcpy is
that we would like to handle alignment. Since our implementation of memcpy is
fast enough, we can just ignore that and we will end up being even faster.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
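
The dispatch described in this commit can be sketched as follows. This is not
the x64 implementation: copy_forwards and copy_backwards are plain byte loops
standing in for memcpy and memcpy_backwards (calling the standard memcpy on
overlapping buffers would be undefined), and move_bytes is a hypothetical name
for the memmove built on top of them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for memcpy: copies from the lowest address upwards.
void* copy_forwards(void* dst, const void* src, std::size_t n) {
    auto* d = static_cast<unsigned char*>(dst);
    auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t i = 0; i < n; ++i) {
        d[i] = s[i];
    }
    return dst;
}

// Stand-in for memcpy_backwards: copies from the highest address downwards.
void* copy_backwards(void* dst, const void* src, std::size_t n) {
    auto* d = static_cast<unsigned char*>(dst);
    auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t i = n; i > 0; --i) {
        d[i - 1] = s[i - 1];
    }
    return dst;
}

// memmove in terms of the two copies: a forward copy is safe unless the
// destination starts inside the source range, in which case copying
// backwards avoids overwriting bytes that have not been read yet.
void* move_bytes(void* dst, const void* src, std::size_t n) {
    assert(n == 0 || (dst != nullptr && src != nullptr));
    auto d = reinterpret_cast<std::uintptr_t>(dst);
    auto s = reinterpret_cast<std::uintptr_t>(src);
    if (d <= s || d >= s + n) {
        return copy_forwards(dst, src, n);
    }
    return copy_backwards(dst, src, n);
}
```

With only two well-defined copy loops, every possible fault site belongs to
one of them, which is what makes per-site fixups practical.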

x64: Provide a backwards version of memcpy · d25859ce

Glauber Costa authored 11 years ago

This patch provides a backwards version of memcpy. It works all the same, but
will start the copy at the end of the buffers (dst + n <- src + n) instead of
at the beginning (dst <- src). That is needed for memmove when the source and
destination operands overlap.

Being a nonstandard interface, I've just named it "memcpy_backwards"

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

mem: fix allocation accounting · 8d7812fa

Glauber Costa authored 11 years ago

There was a small bug in the free memory tracking code that I've only hit
recently. I was wrong in assuming that in the first branch for huge page
allocation, where we erase the entire range, we should account for N bytes.
This assumption came from my - wrong - understanding that we would do that when
the range is exactly N bytes.

Looking at the code with fresh eyes, that is definitely not what happens. In my
previous stress test we were hitting the second branch all the time, so this
bug lived on.

Turns out that we will delete the entire page range, which may be bigger than
N, the allocation size. Therefore, the whole range should be discounted from
our calculation. The remainder (bigger than N part) will be accounted for later
when we reinsert it in the page range, in the same way it is for the second
branch of this code.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
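
The corrected accounting can be illustrated with a toy free-range pool. This
is a sketch under assumptions, not the mempool code: page_range_allocator,
add_range, alloc and free_bytes are hypothetical names, and ranges are
reduced to their sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Toy pool of free ranges, tracked by size only.
class page_range_allocator {
public:
    // Inserting a range is the single place where free memory grows.
    void add_range(std::size_t size) {
        _ranges.insert(size);
        _free += size;
    }

    // Allocates n bytes from the smallest range that fits.
    bool alloc(std::size_t n) {
        assert(n > 0);
        auto it = _ranges.lower_bound(n);
        if (it == _ranges.end()) {
            return false;
        }
        std::size_t range_size = *it;
        _ranges.erase(it);
        // The whole range leaves the pool, so discount all of it,
        // not just the n bytes that were requested.
        _free -= range_size;
        if (range_size > n) {
            // The remainder is accounted for again when it is reinserted.
            add_range(range_size - n);
        }
        return true;
    }

    std::size_t free_bytes() const { return _free; }

private:
    std::multiset<std::size_t> _ranges;
    std::size_t _free = 0;
};
```

Discounting only n on erase and then adding the remainder back on reinsertion
would count the remainder twice, which is the kind of drift the fix removes.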

java: add all JRE files to the image by default · b35affb3

Tomasz Grabiec authored 11 years ago


Fixes issue with JVM failing when started with a debugger
with the following message:

  NPT ERROR: Cannot find nptInitialize

Missing openjdk files in usr.manifest were a fertile source
of issues. This patch aims to make them less likely by
adding all files except blacklisted ones to the image.

This patch skips two files from JRE which are broken links
and inclusion of which would cause manifest upload failure:
 - jre/lib/audio/default.sf2
 - jre/lib/security/cacerts

These should be fixed incrementally.

Reported-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

modules: allow for file mapping declaration in module.py · e1b95d8e

Tomasz Grabiec authored 11 years ago


If a module has 'usr_files' or 'bootfs_files' declared, then
their values will be interpreted as FileMaps and appended
to the appropriate manifests.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

modules: introduce higher level API to aid manifest generation · 71d88f94

Tomasz Grabiec authored 11 years ago


When plain manifests are not enough, this is a concise alternative
with improved expressiveness. It allows declaring exclude and include
patterns. It's Python-based.

Example:

  m = FileMap()
  m.add('${OSV_BUILD_PATH}/tests').to('/tests') \
     .include('**/*.so') \
     .exclude('host/**')

Declared mappings can be saved in manifest form or be subject
to further processing. To save in manifest format:

  save_as_manifest(m, 'my.manifest')

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

modules: make java a separate OSv module · 31c55622

Tomasz Grabiec authored 11 years ago


With this patch, Java files are copied to the guest image only when the
'java' module is included.

Modules can pull it explicitly by stating:

  require('java')

or implicitly, by creating api.run_java() run configurations.
In future we could consider moving api.run_java() into a java
meta-module.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

modules: cleanup resolve.py · a5357cf5

Tomasz Grabiec authored 11 years ago


No functional changes, just renames to more adequate names.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

modules: make presence of either manifest optional · 772c320c

Tomasz Grabiec authored 11 years ago


No need to create an empty bootfs.manifest anymore.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

modules: support nested module dependencies · 76267d45

Tomasz Grabiec authored 11 years ago


Currently, importing a module from a module definition fails because
we cannot import a module with the same name (module.py)
recursively: __import__ will complain that we removed 'module' from
sys.modules.

There is a simple solution to this problem: we can use runpy.run_path(),
which works like a charm.

In addition to this we cache loaded modules so that we don't
have to load the file twice.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

config.json: convert tabs to spaces · 4fa757b3

Tomasz Grabiec authored 11 years ago


Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

Refactor timerfd.cc · cb19cf98

Nadav Har'El authored 11 years ago


In his review of timerfd.cc, Avi asked that I simplify the implementation
by having a single "timerfd" object (instead of two I had - timerfd_file
and timerfd_object), and by using a single mutex instead of the complex
combination of mutexes and atomic variable.

This new version indeed does this. It should be easier to understand this
code, and it is 30 lines shorter.

The performance of this code is slightly inferior to the previous one -
in particular poll() now locks and unlocks a mutex - but this should be
negligible in practice.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

gdb: improve 'osv zfs' command · 9b53859d

Raphael S. Carvalho authored 11 years ago


This patch improves the command by adding useful info for debugging ZFS
in general, and also addresses some stylistic issues.

The new output is as follows:
(gdb) osv zfs
:: ZFS TUNABLES ::
	zil_replay_disable:       0
	zfs_nocacheflush:         0
	zfs_prefetch_disable:     0
	zfs_no_write_throttle:    0
	zfs_txg_timeout:          5
	zfs_write_limit_override: 0
	vdev_min_pending:         4
	vdev_max_pending:         10
:: ARC SIZES ::
	Actual ARC Size:        122905056
	Target size of ARC:     1341923840
	Min Target size of ARC: 167740480
	Max Target size of ARC: 1341923840
		Most Recently Used (MRU) size:   670961920 (50.00%)
		Most Frequently Used (MFU) size: 670961920 (50.00%)
:: ARC EFFICIENCY ::
Total ARC accesses: 42662
	ARC hits: 41615 (97.55%)
		ARC MRU hits: 12550 (30.16%)
			Ghost Hits: 0
		ARC MFU hits: 29045 (69.79%)
			Ghost Hits: 0
	ARC misses: 1047 (2.45%)
Prefetch workload ratio: 0.0097%
Prefetch total:          412
	Prefetch hits:   20
	Prefetch misses: 392
Total Hash elements: 1053
	Max Hash elements: 1053
	Hash collisions:   13
	Hash chains:       11

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

Jan 07, 2014

build.mk: Fix dependency · ce68645c

Nadav Har'El authored 11 years ago

A previous patch renamed mutex.cc to spinlock.cc. This fixes the build.mk
dependency to make the code compile again... Sorry about that.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

OSv Coding Style · 00f9a97e

Amnon Heiman authored 11 years ago


This document describes the OSv coding style.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

Exile spinlock to a separate file · 8fcad509

Nadav Har'El authored 11 years ago

In very early OSv history, the spinlock was used in the mutex's
implementation so it made sense to put it in mutex.cc and mutex.h.

But now that the spinlock is all that's left in mutex.cc (the real mutex
is in lfmutex.cc), rename this file spinlock.cc. Also, move the spinlock
definitions from <osv/mutex.h> to a new <osv/spinlock.h>, so if someone
wants to make the grave mistake of using a spinlock - they will at least
need to explicitly include this header file.

Currently, the only remaining user of the spinlock is the console.
Using a spinlock (and not a mutex) in the console allows printing debug
messages while preemption is disabled. Arguably, this use-case is no
longer important (we have tracepoints), so in the future we can consider
dropping the spinlock completely.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

tests: add test for memmove string library functionality · 86391f96

Glauber Costa authored 11 years ago


Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

pthread: add support for pthread_getcpuclockid · b7c59ac5

Glauber Costa authored 11 years ago


This patch implements a simplified version of pthread_getcpuclockid
that should be enough for our needs.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

sched: reserve some thread ids · a9169887

Glauber Costa authored 11 years ago


This patch reserves some thread ids, which are kept unused. This is so we can
construct values that combine the thread's public id with other
information and still fit in 32 bits.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

sched: keep track of thread's runtime · 880c7291

Glauber Costa authored 11 years ago


This will be used later to determine how long a thread has been running.
It can easily be updated right before we call ran_for(), reusing its interval
parameter.

Fixes #135

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

Jan 06, 2014

Fix tests/testrunner.cc coding style · 079e1b9e

Raphael S. Carvalho authored 11 years ago


Start using spaces instead of tabs and surround all single-line
control statements with curly braces.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

scripts/test: Add option to run all test cases in a single OSv instance · 03f0bd2a

Raphael S. Carvalho authored 11 years ago


Previously, scripts/test.py had no option to do that. It launched an OSv
instance for each test case.
Terribly slow PCs like mine took a long time to run all test cases
through 'make check'.

So let's take advantage of testrunner.so, which uses a single OSv instance
to run all test cases, consequently boosting the speed considerably.
Let's also change testrunner.so to suit our needs, e.g. a blacklist.

To run this fast check, do: scripts/test.py --single

Results show that this option is about 2.5x faster than the current one.

For now, let's not make this approach the default, given that its
output still needs to be better formatted.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

README: Add a short intro to OSv · dd2a0af1

Nadav Har'El authored 11 years ago

Add at the top of README.md a short introduction to what OSv is.
If someone gets to our GitHub page, https://github.com/cloudius-systems/osv,
and scrolls down, it's strange that we only explain how to build OSv,
without first mentioning what it is.

Fixes #148

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

Jan 03, 2014

tst-vfs: Add test case for dentry hierarchy support · 41614da6

Raphael S. Carvalho authored 11 years ago


Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

Fix fs/vfs/vfs_lookup.c coding style · 67d82557

Raphael S. Carvalho authored 11 years ago


Start using spaces instead of tabs and surround all single-line
control statements with curly braces.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

vfs: Add hierarchy support to directory entries · 3bd235e9

Raphael S. Carvalho authored 11 years ago

It will be useful for making better and safer VFS decisions in the future,
for example, avoiding code that uses the absolute path to determine something.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

vfs: Fix dentry leak in sys_pivot_root · e30bed5c

Raphael S. Carvalho authored 11 years ago


newmp->m_covered must be released if not NULL.
Found this problem while dumping dcache content.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
