  May 23, 2014
    • zfs: Enable compression on zfs dataset when creating the image · 6a29063c
      Raphael S. Carvalho authored
      
      This patch enables LZ4 compression on the ZFS dataset right after it is
      added to the pool. The image creation process then goes through all of its
      steps with compression enabled, and compression is disabled once it is done.
      From that point on, new writes are no longer compressed, while files that
      were compressed earlier remain fully readable.
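
      A minimal sketch of the property toggling described above, using the
      libzfs API (the dataset name "osv/zfs", the helper name and the missing
      error handling are illustrative assumptions, not the patch's actual code):

          #include <libzfs.h>

          static void set_image_compression(const char *value)
          {
              libzfs_handle_t *g_zfs = libzfs_init();
              /* open the dataset the image is being built on */
              zfs_handle_t *zhp = zfs_open(g_zfs, "osv/zfs", ZFS_TYPE_FILESYSTEM);
              /* equivalent to: zfs set compression=<value> osv/zfs */
              zfs_prop_set(zhp, "compression", value);
              zfs_close(zhp);
              libzfs_fini(g_zfs);
          }

          /* set_image_compression("lz4") before populating the image,
           * set_image_compression("off") once image creation is done. */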
      
      Why disable compression after image creation?
      There seem to be corner cases where leaving compression enabled by default
      would hurt application performance.
      For example, applications that compress their data themselves (e.g. Cassandra)
      might end up slower, as ZFS would be compressing data that is already
      compressed, wasting CPU cycles in the process.
      It's worth mentioning that LZ4 is around 3x faster than LZJB when compressing
      incompressible data, so it might be acceptable even for Cassandra.
      
      Additional information: the first version of this patch used the LZJB
      algorithm; however, it slowed down read operations on compressed files.
      LZ4, on the other hand, improves reads of compressed files, improves boot
      time, and still provides a good compression ratio.
      
      RESULTS
      =====
      
      - UNCOMPRESSED:
      * Image size
      -rw-r--r--. 1 root root 154533888 May 19 23:02 build/release/usr.img
      
      * Read benchmark
      REPORT
      -----
      Files:    552
      Read:    127399kb
      Time:    1069.90ms
      MBps:    115.90
      
      * Boot time
      1)
          ZFS mounted: 426.57ms, (+157.75ms)
      2)
          ZFS mounted: 439.13ms, (+156.24ms)
      
      - COMPRESSED (LZ4):
      * Image size
      -rw-r--r--. 1 root root 81002496 May 19 23:33 build/release/usr.img
      
      * Read benchmark
      REPORT
      -----
      Files:    552
      Read:    127399kb
      Time:    957.96ms
      MBps:    129.44
      
      * Boot time
      1)
          ZFS mounted: 414.55ms, (+145.47ms)
      2)
          ZFS mounted: 403.72ms, (+142.82ms)
      
      Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • mkfs: Code refactoring and allow instances of the same shared object · 9ca6522a
      Raphael S. Carvalho authored
      
      Besides refactoring the code, this patch makes mkfs support more than
      one instance of the same shared object within the same mkfs run, by
      releasing the resources in the function prologue.
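
      A tiny illustrative sketch of the release-at-prologue idea under the
      assumption that the resources being released are state left over from a
      previous run (the names are hypothetical, not the patch's actual code):

          #include <stdlib.h>

          struct ctx { int dummy; };      /* hypothetical per-run state */
          static struct ctx *g_ctx;       /* survives across runs of the object */

          void mkfs_run(void)
          {
              /* prologue: release whatever a previous instance left behind,
               * so the same shared object can be run again within one mkfs */
              free(g_ctx);
              g_ctx = calloc(1, sizeof(*g_ctx));
              /* ... build the filesystem using g_ctx ... */
          }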
      
      Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • tests: Add read-only fsop benchmark · cb5db36c
      Raphael S. Carvalho authored
      
      Useful for getting a notion of response time and throughput for
      sequential read operations.
      A random-read option should be added later on.
      I'm currently using it to measure read performance on compressed vs.
      uncompressed data.
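
      A minimal sketch of this kind of sequential-read benchmark, reading the
      files given on the command line (illustrative C, not the test's actual
      code):

          #include <stdio.h>
          #include <time.h>

          static double now_ms(void)
          {
              struct timespec ts;
              clock_gettime(CLOCK_MONOTONIC, &ts);
              return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
          }

          int main(int argc, char **argv)
          {
              char buf[65536];
              double start = now_ms(), prev = start;
              size_t total = 0;
              int nfiles = 0;

              for (int i = 1; i < argc; i++) {
                  FILE *f = fopen(argv[i], "r");
                  size_t n, bytes = 0;
                  if (!f)
                      continue;
                  while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
                      bytes += n;        /* sequential read, whole file */
                  fclose(f);
                  total += bytes;
                  nfiles++;
                  double t = now_ms();
                  printf("%s: %zukb: %.2fms, (+%.2fms)\n",
                         argv[i], bytes / 1024, t - start, t - prev);
                  prev = t;
              }

              double elapsed = now_ms() - start;
              printf("\nREPORT\n-----\nFiles:\t%d\nRead:\t%zukb\nTime:\t%.2fms\nMBps:\t%.2f\n",
                     nfiles, total / 1024, elapsed,
                     total / (1024.0 * 1024.0) / (elapsed / 1000.0));
              return 0;
          }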
      
      Example output:
      OSv v0.08-160-gddb9322
      eth0: 192.168.122.15
      /zpool.so: 96kb: 1.77ms, (+1.77ms)
      /libzfs.so: 211kb: 6.57ms, (+4.80ms)
      /zfs.so: 96kb: 8.25ms, (+1.68ms)
      /tools/mkfs.so: 10kb: 9.32ms, (+1.07ms)
      /tools/cpiod.so: 244kb: 14.08ms, (+4.76ms)
      ...
      /usr/lib/jvm/jre/lib/content-types.properties: 5kb: 1066.17ms, (+2.87ms)
      /usr/lib/jvm/jre/lib/cmm/GRAY.pf: 556b: 1066.74ms, (+0.57ms)
      /usr/lib/jvm/jre/lib/cmm/CIEXYZ.pf: 784b: 1067.34ms, (+0.60ms)
      /usr/lib/jvm/jre/lib/cmm/sRGB.pf: 6kb: 1067.96ms, (+0.62ms)
      /usr/lib/jvm/jre/lib/cmm/LINEAR_RGB.pf: 488b: 1068.61ms, (+0.64ms)
      /usr/lib/jvm/jre/lib/cmm/PYCC.pf: 228kb: 1073.96ms, (+5.36ms)
      /usr/lib/jvm/jre/lib/sound.properties: 1kb: 1074.65ms, (+0.69ms)
      
      REPORT
      -----
      Files:	552
      Read:	127395kb
      Time:	1074.65ms
      MBps:	115.39
      
      Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • zfs: Port lz4 compression algorithm from FreeBSD · ac3f540a
      Raphael S. Carvalho authored
      
      OSv port details:
      - Discarded manpage changes.
      - Added the lz4 license to the licenses directory.
      - Addressed some conflicts in zfs/zfs_ioctl.c.
      - Added the unused attribute to a few functions in zfs/lz4.c that are
        actually unused.
      
       * Illumos zfs issue #3035 [1] LZ4 compression support in ZFS.
      
      LZ4 is a new high-speed BSD-licensed compression algorithm created
      by Yann Collet that delivers very high compression and decompression
      performance compared to lzjb (>50% faster on compression, >80% faster
      on decompression, and around 3x faster on compression of incompressible
      data), while giving a better compression ratio [1].
      
      FreeBSD commit hash: c6d9dc1
      
      Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • memset: make memset faster for small sizes · 28ff5b27
      Glauber Costa authored
      
      Just like memcpy, memset can also benefit from special cases for small sizes.
      However, as expected, the tradeoffs are different and the benefit is not as
      large: in the best case, we improve on the original only up to 64 bytes.
      There should still be a gain, because workloads in which memcpy deals with
      small sizes will likely have memset do so as well.
      
      Again, I have compared the simple loop, Duff's device, and "glommer's device",
      with the last being the winner. Here are the results, up to the point where
      each one starts losing (an illustrative sketch of the small-size idea
      follows the numbers):
      
      Original:
      =========
      
      memset,4,9.007000,9.161000,9.024967,0.042445
      memset,8,9.007000,9.137000,9.028934,0.043388
      memset,16,9.006000,9.267000,9.028168,0.056487
      memset,32,9.007000,11.719000,9.287668,0.716163
      memset,64,9.007000,9.143000,9.023834,0.034745
      memset,128,9.007000,9.174000,9.030134,0.044414
      
      Loop:
      =====
      
      memset,4,3.122000,3.293000,3.158033,0.026586
      memset,8,4.151000,5.077000,4.570933,0.207710
      memset,16,7.021000,8.288000,7.873499,0.276310
      memset,32,19.414000,19.792999,19.551334,0.086234
      
      Duff:
      =====
      
      memset,4,3.602000,4.829000,3.936233,0.425657
      memset,8,4.117000,4.526000,4.282266,0.100237
      memset,16,4.889000,5.227000,5.105134,0.084525
      memset,32,8.748000,8.884000,8.763433,0.038910
      memset,64,16.983999,17.163000,17.018702,0.051896
      
      Glommer:
      ========
      
      memset,4,3.524000,3.664000,3.601167,0.028642
      memset,8,3.088000,3.144000,3.092500,0.009790
      memset,16,4.117000,4.170000,4.126300,0.014074
      memset,32,4.888000,5.400000,5.172900,0.123619
      memset,64,6.963000,7.023000,6.968966,0.013802
      memset,128,11.065000,11.174000,11.076533,0.027541
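
      For illustration only, here is one common way to special-case small
      memsets: dispatch the bulk to fixed-width stores the compiler can keep
      unrolled, and fall back to the generic routine for larger sizes. The
      name memset_small is hypothetical, and this is a sketch of the general
      idea, not the "glommer's device" variant measured above:

          #include <stddef.h>
          #include <stdint.h>
          #include <string.h>

          void *memset_small(void *dest, int c, size_t n)
          {
              if (n > 64)
                  return memset(dest, c, n);   /* generic path past 64 bytes */

              unsigned char *p = dest;
              uint64_t v = 0x0101010101010101ull * (unsigned char)c;

              while (n >= 8) {                 /* 8-byte stores for the bulk */
                  memcpy(p, &v, sizeof(v));    /* compiles to one 64-bit store */
                  p += 8;
                  n -= 8;
              }
              while (n--)                      /* byte-by-byte tail */
                  *p++ = (unsigned char)c;
              return dest;
          }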
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • tests: increment memcpy tests to test memset too · 94f00eec
      Glauber Costa authored
      
      It is really the same kind of test, so let's just reuse the memcpy example.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • memory_analyzer: major rework · 5878840b
      Pawel Dziepak authored
      
      This patch makes memory_analyzer understand the newly introduced tracepoint
      arguments: allocator type, allocated memory, and requested alignment.
      Allocations are grouped and shown as a tree, together with frequency
      information, the number of blocks that haven't been freed yet, and the
      amount of memory wasted by internal fragmentation.
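
      A rough sketch of the accounting this implies, in illustrative C (the
      names are hypothetical and this is not the tool's actual code): for
      each allocation we know the requested size and the size actually
      allocated, so the internal-fragmentation waste is their difference,
      accumulated per group.

          struct alloc_group {
              const char         *allocator;  /* e.g. "malloc" or "mempool" */
              unsigned long       count;      /* allocations in this group */
              unsigned long       unfreed;    /* blocks not freed yet */
              unsigned long long  waste;      /* sum of allocated - requested */
          };

          static void on_alloc(struct alloc_group *g,
                               unsigned long requested, unsigned long allocated)
          {
              g->count++;
              g->unfreed++;                   /* decremented again when the
                                               * matching free is seen */
              g->waste += allocated - requested;
          }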
      
      Signed-off-by: Pawel Dziepak <pdziepak@quarnos.org>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>