- May 28, 2014
-
-
Avi Kivity authored
Since we cannot guarantee that the FPU will not be used in interrupts and exceptions, we must save it earlier rather than later. This was discovered with an FPU-based memcpy, but can be triggered in other ways. Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Tomasz Grabiec authored
The summary will only include samples within the specified range. Timed samples will be trimmed to the given time range. Requested-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- May 27, 2014
-
-
Raphael S. Carvalho authored
When using 'zpool.so import', the fdopendir symbol is missing. Code taken from musl libc. Additional details: the fcntl(fd, F_SETFD, FD_CLOEXEC) call was removed, as the close-on-exec flag is left unchanged on Linux (Glauber Costa). Also removed the struct __DIR_s definition from fs/vfs/main.c, referencing instead the new one added to libc/dirent/dirent.h. Reviewed-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Raphael S. Carvalho <raphaelsc@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
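The musl-style fdopendir the commit describes can be sketched roughly as follows. The my_dir layout is a hypothetical stand-in for struct __DIR_s, not OSv's exact definition; the notable detail from the commit is the dropped F_SETFD call.

```cpp
// Rough sketch of a musl-style fdopendir(): validate that the fd refers
// to a directory, then wrap it in a DIR structure. my_dir is an
// illustrative stand-in for struct __DIR_s.
#include <cerrno>
#include <cstdlib>
#include <fcntl.h>
#include <sys/stat.h>

struct my_dir {
    int fd;
    long tell;
    int buf_pos, buf_end;
    char buf[2048];
};

my_dir* my_fdopendir(int fd)
{
    struct stat st;
    if (fstat(fd, &st) < 0) {
        return nullptr;
    }
    if (!S_ISDIR(st.st_mode)) {
        errno = ENOTDIR;
        return nullptr;
    }
    // musl also does fcntl(fd, F_SETFD, FD_CLOEXEC) here; the commit
    // drops it because close-on-exec is left unchanged on Linux.
    auto dir = static_cast<my_dir*>(calloc(1, sizeof(my_dir)));
    if (!dir) {
        return nullptr;
    }
    dir->fd = fd;
    return dir;
}
```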
-
Pekka Enberg authored
The code in <api/x86/reloc.h> is not used. Avi says it's dead code that originates from musl. Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Avi Kivity authored (merge from https://github.com/gleb-cloudius/osv)
"Nothing spectacular here, just making msync() pagecache aware. Reduces code size a bit if nothing else" Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Pawel Dziepak authored
R_X86_64_DTPMOD64 may not be associated with any symbol (which is common when the linker uses the local dynamic TLS model), in which case it should be resolved to the index of the current module. Signed-off-by:
Pawel Dziepak <pdziepak@quarnos.org> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Gleb Natapov authored
Not all pages in the write page cache are dirty, since msync() may write some of them back. Check for that and do not write back clean pages needlessly. Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
This will allow pagecache code to atomically clear pte and check for a dirty bit. Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com>
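The primitive this enables can be sketched as follows. The only assumption is the x86 pte encoding (dirty bit at bit 6); the function name is made up for illustration. Clearing and testing in one atomic exchange means a concurrent hardware dirty-bit update cannot slip in between a separate "check" and "clear".

```cpp
// Sketch: clear a pte atomically and test the dirty bit of the value
// that was actually in place. Bit 6 is the x86 D (dirty) bit.
#include <atomic>
#include <cstdint>

constexpr uint64_t pte_dirty = 1ull << 6;

// Returns true if the cleared pte had been dirtied.
bool clear_pte_and_test_dirty(std::atomic<uint64_t>& pte)
{
    uint64_t old = pte.exchange(0, std::memory_order_acq_rel);
    return (old & pte_dirty) != 0;
}
```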
-
Gleb Natapov authored
The current msync implementation scans all pages in the msync area via the page tables to find dirty pages, but the pagecache already knows which pages are potentially dirty for a given file/offset, so it can check whether they are dirty via the rmap. Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
As Avi pointed out, the ptep_flush and ptep_accessed classes can be replaced by a general map-reduce mechanism with customizable map and reduce functions. This patch implements that. Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com>
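A minimal sketch of the map-reduce shape the commit describes, with a flat vector standing in for a page-table walk and hypothetical names throughout. Each special-purpose walker becomes a pair of (map, reduce) functions fed to one generic traversal.

```cpp
// Generic walk: apply "map" to every pte, fold the results with
// "reduce". ptep_flush / ptep_accessed style walkers become
// instantiations of this one template. Names are illustrative.
#include <cstdint>
#include <vector>

template <typename Map, typename Reduce, typename Acc>
Acc walk_ptes(const std::vector<uint64_t>& ptes, Map map, Reduce reduce, Acc acc)
{
    for (uint64_t pte : ptes) {
        acc = reduce(acc, map(pte));
    }
    return acc;
}
```

For example, harvesting the accessed bit reduces with logical-or, while counting accessed pages reduces with addition; the traversal code is shared.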
-
Raphael S. Carvalho authored
The problem was that the flag D_TTY, which indicates that the device is a TTY, was not being passed to device_create. Signed-off-by:
Raphael S. Carvalho <raphaelsc@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Avi Kivity authored
This reverts commit 2da050a4 - apparently we're calling memcpy() in a non-safe path. Revert for now. Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
- May 26, 2014
-
-
Nadav Har'El authored
Fix a missing include to allow tst-mmap.cc to be compiled on Linux, and add in a comment the (far from obvious) command line needed to compile it. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
The man page for raise says: in a single-threaded program it is equivalent to kill(getpid(), sig); in a multithreaded program it is equivalent to pthread_kill(pthread_self(), sig). In our case we should mimic the second. It is a good question whether this function should live in a more generic place or in pthread.cc itself, but I will argue for the latter, since it makes it easier for people to notice that this is what our implementation does. Of course, at this moment pthread_kill is stubbed and so is raise. But if we ever implement the former, we gain the latter for free. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
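A sketch of the raise() the commit argues for, delegating to pthread_kill as the man page's multithreaded equivalence prescribes (my_raise is an illustrative name):

```cpp
// raise() for a multithreaded environment: signal the calling thread,
// not the whole process, per the raise(3) equivalence quoted above.
#include <pthread.h>
#include <signal.h>

int my_raise(int sig)
{
    return pthread_kill(pthread_self(), sig);
}
```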
-
Glauber Costa authored
Code from musl. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
We already have the ffs function; ffsl and ffsll are easy from here. Theoretically, a_ctz_l should do the job as well since we're 64-bit all over, but I found a_ctz_64 safer. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
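The derivation is mechanical once a count-trailing-zeros primitive exists. Here the GCC builtin stands in for musl's a_ctz_64, and the function names are illustrative:

```cpp
// ffsl/ffsll on top of a 64-bit count-trailing-zeros helper: ffs
// semantics are 1-based and return 0 for a zero argument.
#include <cstdint>

static int ctz64(uint64_t x)  // stand-in for musl's a_ctz_64
{
    return __builtin_ctzll(x);
}

int my_ffsl(long x)
{
    return x ? ctz64(static_cast<uint64_t>(x)) + 1 : 0;
}

int my_ffsll(long long x)
{
    return x ? ctz64(static_cast<uint64_t>(x)) + 1 : 0;
}
```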
-
Glauber Costa authored
Similar to pipe, but taking flags. We will ignore the exec-related flags. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
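One plausible shape for such a pipe2() under the stated policy: reject unknown flags, honor O_NONBLOCK, and deliberately ignore O_CLOEXEC. Error handling is simplified and the name is illustrative:

```cpp
// pipe2() sketch: pipe() plus flag handling. O_CLOEXEC is accepted but
// ignored, since OSv has no exec; unknown flags are rejected.
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

int my_pipe2(int fds[2], int flags)
{
    if (flags & ~(O_CLOEXEC | O_NONBLOCK)) {
        errno = EINVAL;
        return -1;
    }
    if (pipe(fds) < 0) {
        return -1;
    }
    if (flags & O_NONBLOCK) {
        fcntl(fds[0], F_SETFL, O_NONBLOCK);
        fcntl(fds[1], F_SETFL, O_NONBLOCK);
    }
    // O_CLOEXEC intentionally ignored, as in the commit.
    return 0;
}
```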
-
Glauber Costa authored
This is mainly a wrapper around fcntl, so it should work to the extent that fcntl works and fail gracefully where it doesn't. Code is imported from musl with some modifications to allow it to compile as C++ code. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
Note that we don't allocate memory in sem_init: we use placement new to construct the object over already existing memory. Therefore, all we need to do is release our unique_ptr. Thanks to Pawel for noticing that we need to release the memory of the internal semaphore. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
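The pattern the commit describes, sketched with hypothetical types (OSv's internal semaphore type differs): placement new constructs into caller-provided storage, so teardown runs the destructor explicitly rather than freeing.

```cpp
// Placement-new construction over existing storage: sem_init-style
// code owns no heap memory, so destruction is an explicit destructor
// call, never a free/delete of the storage itself.
#include <new>

struct counting_semaphore {           // hypothetical internal semaphore
    explicit counting_semaphore(unsigned v) : count(v) {}
    unsigned count;
};

struct sem_storage {                  // caller-provided storage (sem_t-like)
    alignas(counting_semaphore) unsigned char buf[sizeof(counting_semaphore)];
};

counting_semaphore* my_sem_init(sem_storage* s, unsigned value)
{
    // No allocation: construct the object over existing memory.
    return new (s->buf) counting_semaphore(value);
}

void my_sem_destroy(counting_semaphore* sem)
{
    sem->~counting_semaphore();       // release internals, not the storage
}
```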
-
Avi Kivity authored
Copy < 256 bytes without any loops; for 16 bytes and above, use SSE to reduce the instruction count. Signed-off-by:
Avi Kivity <avi@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Avi Kivity authored
Simple power of two is too easy. Signed-off-by:
Avi Kivity <avi@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Avi Kivity authored (merge from https://github.com/tgrabiec/osv)
"The main improvement is in the last patch which removes contention inside free_different_cpu() on sync._mtx. It improves my micro-benchmark by ~30%." Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Tomasz Grabiec authored
Fixes #308. When the per-cpu-pair ring fills up, the freeing thread is blocked and enters a synchronous object hand-off. That synchronous hand-off is the cause of contention. Instead of having a bounded ring we can use an unordered_queue_mpsc which links the freed objects in a chain. In this implementation push() always succeeds and we don't need to block. In a test which allocates 1K blocks on one CPU and has two threads freeing them on two other CPUs, there is a ~40% improvement in free() throughput. I tested various implementations, based on different queues. Statistics of free/sec reported by misc-free-perf (one sample = one run):

current:                     avg = 8133055.09  stdev = 118322.06 samples = 5
ring_spsc<1M> (no blocking): avg = 10442665.98 stdev = 476334.93 samples = 5
unordered_queue_spsc:        avg = 10258212.69 stdev = 418194.22 samples = 5
unordered_queue_mpsc:        avg = 11701334.99 stdev = 725299.97 samples = 5

Testing showed that unordered_queue_mpsc performs best in this case. Dead objects are collected by a per-CPU worker thread (same as before). The thread is woken up once every 256 frees. That threshold was chosen so that the behavior would more or less correspond to what was there before. Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com>
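A simplified sketch of the push contract the commit relies on: freed objects are linked into an atomic chain, push() always succeeds without blocking, and the consumer detaches the whole chain at once, in no particular order. OSv's unordered_queue_mpsc is more refined; the names here are illustrative.

```cpp
// Unordered MPSC hand-off: producers CAS freed objects onto a shared
// chain head; the single consumer takes the entire chain with one
// exchange. push() never fails and never blocks.
#include <atomic>

struct free_object {
    free_object* next = nullptr;
};

class unordered_mpsc {
    std::atomic<free_object*> head{nullptr};
public:
    void push(free_object* obj) {
        free_object* old = head.load(std::memory_order_relaxed);
        do {
            obj->next = old;
        } while (!head.compare_exchange_weak(old, obj,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
    }
    free_object* pop_all() {   // consumer detaches the whole chain
        return head.exchange(nullptr, std::memory_order_acquire);
    }
};
```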
-
Tomasz Grabiec authored
There is one allocating thread and two freeing threads. Each thread runs on a different core. The test measures the throughput of objects freed by both threads. Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com>
-
Tomasz Grabiec authored
Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com>
-
Tomasz Grabiec authored
Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com>
-
Tomasz Grabiec authored
It is meant to provide both the speed of a ring buffer and the non-blocking properties of linked queues by combining the two. Unlike ring_spsc, push() is always guaranteed to succeed. Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com>
-
Tomasz Grabiec authored
It's like queue_mpsc with two improvements:

* Consumer and producer links are cache-line aligned to avoid false sharing. I was tempted to apply this to queue_mpsc too, but then discovered that that queue is embedded in a mutex, and doing so would greatly bloat the mutex size, so I gave up on the idea.
* The contract of pop() is relaxed to return items in no particular order, so that we can avoid the cost of reversing the chain.

Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com>
-
Tomasz Grabiec authored
free_page_ranges is an intrusive set. Erasing via a reference requires iterating over the reference's equal_range under the hood, which means traversing the tree to the leaves, whereas erasing via an iterator requires no such lookup, so it should be faster. Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com>
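The same distinction in standard-library terms (boost::intrusive sets behave analogously): erase-by-key must search the tree first, while erase-by-iterator unlinks the node directly.

```cpp
// Erase-by-key vs erase-by-iterator on an ordered multiset: the key
// form performs a lookup (and an equal_range scan for duplicates); the
// iterator form removes exactly the node it points at, with no lookup.
#include <set>

std::multiset<int>::size_type erase_by_key(std::multiset<int>& s, int key)
{
    return s.erase(key);      // tree descent + equal_range, removes all matches
}

void erase_by_iterator(std::multiset<int>& s, std::multiset<int>::iterator it)
{
    s.erase(it);              // no lookup: unlink this one node
}
```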
-
Tomasz Grabiec authored
Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com>
-
Tomasz Grabiec authored
In some runs, a callq to mempool_cpuid shows up in the 'perf kvm top' profile. Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com>
-
Glauber Costa authored
Same as fork, vfork, etc., so it goes in the same place. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Nadav Har'El authored
Fixed an error in ::clock's Doxygen comment: it referred to osv::clock::monotonic, while in fact the correct name is osv::clock::uptime. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- May 25, 2014
-
-
Avi Kivity authored
Fixes debug build. Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Raphael S. Carvalho authored
The lz4 code checks a predetermined list of definitions to decide whether the CPU word size is 64 bits; otherwise, it assumes 32. Therefore, the __aarch64__ definition must be added to the aforementioned list. Signed-off-by:
Raphael S. Carvalho <raphaelsc@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
- May 23, 2014
-
-
Raphael S. Carvalho authored
This patch enables LZ4 compression on the ZFS dataset right after its insertion in the pool. The image creation process then goes through all its steps with compression enabled, and when it's done, compression is disabled. From that moment on, compression stops taking effect, but files previously compressed are still supported. Why disable compression after image creation? There seem to be corner cases where enabling compression by default would affect application performance. For example, applications that compress data themselves (e.g. Cassandra) might end up slower, as ZFS would duplicate the compression work already done and consequently waste CPU cycles. It's worth mentioning that LZ4 is ~300% faster than LZJB when compressing incompressible data, so it might be good even for Cassandra. Additional information: the first version of this patch used the LZJB algorithm; however, it slowed down read operations on compressed files. LZ4, on the other hand, improves reads on compressed files, improves boot time, and still provides a good compression ratio.

RESULTS
=======

- UNCOMPRESSED:
  * Image size:
    -rw-r--r--. 1 root root 154533888 May 19 23:02 build/release/usr.img
  * Read benchmark:
    Files: 552, Read: 127399kb, Time: 1069.90ms, MBps: 115.90
  * Boot time:
    1) ZFS mounted: 426.57ms, (+157.75ms)
    2) ZFS mounted: 439.13ms, (+156.24ms)

- COMPRESSED (LZ4):
  * Image size:
    -rw-r--r--. 1 root root 81002496 May 19 23:33 build/release/usr.img
  * Read benchmark:
    Files: 552, Read: 127399kb, Time: 957.96ms, MBps: 129.44
  * Boot time:
    1) ZFS mounted: 414.55ms, (+145.47ms)
    2) ZFS mounted: 403.72ms, (+142.82ms)

Signed-off-by:
Raphael S. Carvalho <raphaelsc@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
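In zfs(8) terms, the enable-then-disable sequence described above looks roughly like this; the pool/dataset names and device path are illustrative, and OSv's image-build tooling drives the equivalent steps programmatically:

```sh
zpool create osv /dev/vblk0.1        # illustrative pool/device names
zfs create osv/zfs
zfs set compression=lz4 osv/zfs      # compress while the image is populated
# ... image creation copies files into the dataset ...
zfs set compression=off osv/zfs      # new writes are uncompressed; blocks
                                     # already compressed remain readable
```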
-
Raphael S. Carvalho authored
Besides refactoring the code, this patch makes mkfs support more than one instance of the same shared object within the same mkfs instance, i.e. by releasing the resources at the function prologue. Signed-off-by:
Raphael S. Carvalho <raphaelsc@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Raphael S. Carvalho authored
Useful for getting a notion of response time and throughput on sequential read operations. Random read option should be added later on. Currently being used by me to measure read performance on compressed vs uncompressed data. Example output:

OSv v0.08-160-gddb9322
eth0: 192.168.122.15
/zpool.so: 96kb: 1.77ms, (+1.77ms)
/libzfs.so: 211kb: 6.57ms, (+4.80ms)
/zfs.so: 96kb: 8.25ms, (+1.68ms)
/tools/mkfs.so: 10kb: 9.32ms, (+1.07ms)
/tools/cpiod.so: 244kb: 14.08ms, (+4.76ms)
...
/usr/lib/jvm/jre/lib/content-types.properties: 5kb: 1066.17ms, (+2.87ms)
/usr/lib/jvm/jre/lib/cmm/GRAY.pf: 556b: 1066.74ms, (+0.57ms)
/usr/lib/jvm/jre/lib/cmm/CIEXYZ.pf: 784b: 1067.34ms, (+0.60ms)
/usr/lib/jvm/jre/lib/cmm/sRGB.pf: 6kb: 1067.96ms, (+0.62ms)
/usr/lib/jvm/jre/lib/cmm/LINEAR_RGB.pf: 488b: 1068.61ms, (+0.64ms)
/usr/lib/jvm/jre/lib/cmm/PYCC.pf: 228kb: 1073.96ms, (+5.36ms)
/usr/lib/jvm/jre/lib/sound.properties: 1kb: 1074.65ms, (+0.69ms)

REPORT
------
Files: 552
Read: 127395kb
Time: 1074.65ms
MBps: 115.39

Signed-off-by:
Raphael S. Carvalho <raphaelsc@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Raphael S. Carvalho authored
OSv port details:
- Discarded manpage changes.
- The lz4 license was added to the licenses directory.
- Addressed some conflicts in zfs/zfs_ioctl.c.
- Added the unused attribute to a few functions in zfs/lz4.c which are actually unused.

Illumos zfs issue #3035 [1]: LZ4 compression support in ZFS. LZ4 is a new high-speed BSD-licensed compression algorithm created by Yann Collet that delivers very high compression and decompression performance compared to lzjb (>50% faster on compression, >80% faster on decompression and around 3x faster on compression of incompressible data), while giving a better compression ratio [1]. FreeBSD commit hash: c6d9dc1 Signed-off-by:
Raphael S. Carvalho <raphaelsc@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
Just like memcpy, memset can also benefit from special cases for small sizes. However, as expected, the tradeoffs are different and the benefit is not as large. In the best case, we are able to do better up to 64 bytes. There should still be a gain, because in workloads where memcpy deals with small sizes, memset will likely do so as well. Again, I have compared the simple loop, Duff's device, and "glommer's device", with the latter being the winner. Here are the results, up to the point where each one starts losing:

Original:
=========
memset,4,9.007000,9.161000,9.024967,0.042445
memset,8,9.007000,9.137000,9.028934,0.043388
memset,16,9.006000,9.267000,9.028168,0.056487
memset,32,9.007000,11.719000,9.287668,0.716163
memset,64,9.007000,9.143000,9.023834,0.034745
memset,128,9.007000,9.174000,9.030134,0.044414

Loop:
=====
memset,4,3.122000,3.293000,3.158033,0.026586
memset,8,4.151000,5.077000,4.570933,0.207710
memset,16,7.021000,8.288000,7.873499,0.276310
memset,32,19.414000,19.792999,19.551334,0.086234

Duff:
=====
memset,4,3.602000,4.829000,3.936233,0.425657
memset,8,4.117000,4.526000,4.282266,0.100237
memset,16,4.889000,5.227000,5.105134,0.084525
memset,32,8.748000,8.884000,8.763433,0.038910
memset,64,16.983999,17.163000,17.018702,0.051896

Glommer:
========
memset,4,3.524000,3.664000,3.601167,0.028642
memset,8,3.088000,3.144000,3.092500,0.009790
memset,16,4.117000,4.170000,4.126300,0.014074
memset,32,4.888000,5.400000,5.172900,0.123619
memset,64,6.963000,7.023000,6.968966,0.013802
memset,128,11.065000,11.174000,11.076533,0.027541

Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
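Not OSv's exact "glommer's device", but a sketch of the small-size idea under discussion: dispatch on length and cover the range with a pair of wide, possibly overlapping stores instead of a byte loop. Names are illustrative.

```cpp
// Small-size memset sketch: for lengths in [8,16] two overlapping
// 8-byte stores cover the whole range; for [4,8) two overlapping
// 4-byte stores do; anything else falls back to a plain byte loop.
#include <cstddef>
#include <cstdint>
#include <cstring>

void* small_memset(void* dst, int c, size_t n)
{
    auto* p = static_cast<unsigned char*>(dst);
    uint64_t pat = 0x0101010101010101ull * static_cast<unsigned char>(c);

    if (n >= 8 && n <= 16) {
        std::memcpy(p, &pat, 8);          // head store
        std::memcpy(p + n - 8, &pat, 8);  // tail store, may overlap head
        return dst;
    }
    if (n >= 4 && n < 8) {
        uint32_t pat32 = static_cast<uint32_t>(pat);
        std::memcpy(p, &pat32, 4);
        std::memcpy(p + n - 4, &pat32, 4);
        return dst;
    }
    for (size_t i = 0; i < n; i++) {      // tiny or large: plain loop
        p[i] = static_cast<unsigned char>(c);
    }
    return dst;
}
```

Real implementations keep widening the same trick (16/32/64-byte stores) before falling back to a loop; the overlap avoids any branch on the exact remainder.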
-