- Jun 01, 2014
-
-
Gleb Natapov authored
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- May 30, 2014
-
-
Glauber Costa authored
We had an implementation of getmntent, but its signature was wrong to begin with. I doubt it was ever working, since it is not such a common function. Even then, we lacked other functions in this family, like setmntent. This patch implements them all. The main matching code comes from musl, but the end result is significantly different. In particular, I didn't really want to mess with creating new virtual proc files, symlinks and the like. Instead, trying to open a dynamic file (like /proc/mounts) or any of its well-known aliases will return a special value. Functions like getmntent will be able to parse that value and act accordingly. The code to show dynamic mounts comes from our old getmntent() implementation, but it is modified here to not use statics for the strings, so we can implement getmntent_r() correctly. Fixes #326

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
CC: Lyor Goldstein <lgoldstein@vmware.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
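As a usage illustration only (a sketch, not code from the patch): the family described above is the standard <mntent.h> interface, and the reentrant getmntent_r() variant can be driven as follows; the file name and buffer size are arbitrary choices for the example.

    #include <mntent.h>
    #include <stdio.h>

    int main()
    {
        // /proc/mounts is one of the dynamic files mentioned above.
        FILE *f = setmntent("/proc/mounts", "r");
        if (!f) {
            return 1;
        }
        struct mntent entry;
        char buf[4096];  // caller-provided scratch space for getmntent_r()
        // getmntent_r() fills caller-owned storage, so no statics are involved.
        while (getmntent_r(f, &entry, buf, sizeof(buf))) {
            printf("%s on %s type %s\n",
                   entry.mnt_fsname, entry.mnt_dir, entry.mnt_type);
        }
        endmntent(f);
        return 0;
    }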
-
Raphael S. Carvalho authored
This patch adds fallocate as a vnode operation, and implements the fallocate() function and the system call.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
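As an illustration of the user-visible API (a sketch, not the patch itself; the file name and size are arbitrary):

    #include <fcntl.h>
    #include <unistd.h>

    // Preallocate 1 MiB for "data.bin" so later writes within that range
    // cannot fail with ENOSPC.
    int preallocate_example()
    {
        int fd = open("data.bin", O_CREAT | O_RDWR, 0644);
        if (fd < 0) {
            return -1;
        }
        // mode 0 is the default behavior: allocate and extend the file size
        // if needed.
        int ret = fallocate(fd, 0, 0, 1024 * 1024);
        close(fd);
        return ret;
    }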
-
- May 29, 2014
-
-
Tomasz Grabiec authored
OpenJDK's SocketInputStream.read() is calling poll() to support timeout. The call site looks like this:

    pfd.fd = s;
    pfd.events = POLLIN | POLLERR;
    poll(&pfd, 1, timeout);

Our current implementation of poll() is quite complex because it needs to handle polling on many files. It also allocates memory in several places:
- in poll() due to std::vector
- in poll_install()
- in net_channel::add_poller
- in net_channel::del_poller

poll() on a single socket can be greatly simplified and we can avoid memory allocation completely. This change adds special casing for that. It reduces the allocation rate by half in the tomcat benchmark with 256 connections.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
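For reference, a self-contained form of the single-descriptor pattern the fast path targets (a sketch; the wrapper name is made up, not an OSv or OpenJDK symbol):

    #include <poll.h>

    // Wait up to timeout_ms for data or an error on one socket.
    // Returns >0 if ready, 0 on timeout, <0 on error.
    int wait_readable(int s, int timeout_ms)
    {
        struct pollfd pfd;
        pfd.fd = s;
        pfd.events = POLLIN | POLLERR;
        pfd.revents = 0;
        // Exactly one pollfd: the case special-cased above to avoid the
        // std::vector allocation and the poller add/remove costs.
        return poll(&pfd, 1, timeout_ms);
    }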
-
Nadav Har'El authored
Before commit 202b2ccc, the scheduler was responsible for saving the FPU state, so we needed to know whether the scheduler itself uses it or not. Now that the FPU state is always saved at interrupt time, we no longer care whether or not the scheduler uses the FPU, so we can drop this flag. Also drop the optional "pseudo-float" (integer-based floating point operations) support from the scheduler. This never had any real advantage over actual floating point, and now that we save the FPU state unconditionally, it makes even less sense to avoid floating point.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Nadav Har'El authored
Before commit 202b2ccc, cpu::reschedule_from_interrupt() needed to know whether we were called from an interrupt (preempt=true) or as an ordinary function (preempt=false), to know whether or not to save the FPU state. As we now save the FPU state in the interrupt code, reschedule_from_interrupt() no longer needs to deal with this, and so this patch removes the unneeded "preempt" parameter from that function. One thing we are losing in this patch is the "sched_preempt" tracepoint, which we previously had when an interrupt caused an actual context switch (not just a reschedule call, but actually switching to a different thread). We still have the "sched_switch" tracepoint which traces all the context switches, which is probably more interesting anyway.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- May 28, 2014
-
-
Avi Kivity authored
Waitqueues fulfill the same role as condition variables, but are much lighter since they rely on an external mutex, which rwlock already provides. Replacing condvars with waitqueues significantly reduces the size of an rwlock, and in addition significantly reduces the number of atomic operations in contended cases.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Avi Kivity authored
Implement facades for rwlock so it can be used with WITH_LOCK, specifying whether we want a read lock or a write lock:

    rwlock my_rwlock;

    WITH_LOCK(my_rwlock.for_read()) {
        read stuff
    }

    WITH_LOCK(my_rwlock.for_write()) {
        write stuff
    }

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
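A minimal sketch of the facade idea, assuming WITH_LOCK-style helpers only need an object exposing lock()/unlock(); the class below wraps std::shared_mutex purely for illustration and is not the OSv rwlock:

    #include <mutex>
    #include <shared_mutex>

    class demo_rwlock {
    public:
        class read_facade {
        public:
            explicit read_facade(std::shared_mutex& m) : _m(m) {}
            void lock()   { _m.lock_shared(); }
            void unlock() { _m.unlock_shared(); }
        private:
            std::shared_mutex& _m;
        };
        class write_facade {
        public:
            explicit write_facade(std::shared_mutex& m) : _m(m) {}
            void lock()   { _m.lock(); }
            void unlock() { _m.unlock(); }
        private:
            std::shared_mutex& _m;
        };
        read_facade for_read()   { return read_facade(_m); }
        write_facade for_write() { return write_facade(_m); }
    private:
        std::shared_mutex _m;
    };

    // Usage with a scoped guard standing in for WITH_LOCK:
    //   demo_rwlock l;
    //   auto r = l.for_read();
    //   std::lock_guard<decltype(r)> guard(r);
    //   ... read stuff ...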
-
Avi Kivity authored
Since we cannot guarantee that the fpu will not be used in interrupts and exceptions, we must save it earlier rather than later. This was discovered with an fpu-based memcpy, but can be triggered in other ways.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- May 27, 2014
-
-
Pekka Enberg authored
The code in <api/x86/reloc.h> is not used. Avi says it's dead code that originates from musl.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Gleb Natapov authored
Not all pages in the write page cache are dirty, since msync() may have already written some of them back. Check for that and do not write back clean pages needlessly.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
The current msync implementation scans all pages in the msync area via the page tables to find dirty pages, but the pagecache already knows which pages are potentially dirty for a given file/offset range, so it can check whether they are dirty via the rmap.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
As Avi pointed out, the ptep_flush and ptep_accessed classes can be replaced by a general map-reduce mechanism with customizable map and reduce functions. This patch implements that.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
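A generic sketch of the map-reduce idea over a range of ptes (illustrative only; the pte type, its methods, and the traversal are stand-ins, not the OSv page-table walker):

    #include <vector>

    template <typename Pte, typename Map, typename Reduce, typename Result>
    Result map_reduce_ptes(std::vector<Pte>& ptes, Map map, Reduce reduce, Result acc)
    {
        for (auto& pte : ptes) {
            // "map" inspects or modifies a single pte (e.g. clear its accessed
            // bit); "reduce" folds the per-pte result into the accumulator.
            acc = reduce(acc, map(pte));
        }
        return acc;
    }

    // Hypothetical use: did any pte have its accessed bit set?
    //   bool any = map_reduce_ptes(ptes,
    //       [](pte_t& p) { return p.test_and_clear_accessed(); },  // stand-in method
    //       [](bool a, bool b) { return a || b; },
    //       false);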
-
- May 26, 2014
-
-
Glauber Costa authored
This is mainly a wrapper around fcntl, so it should work to the extent that fcntl works and fail gracefully where it doesn't. Code is imported from musl with some modifications to allow it to compile as C++ code.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
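A sketch of the general wrapping idea, expressing lockf() operations as fcntl() record locks (simplified relative to the imported musl code, which handles the corner cases more carefully):

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    int lockf_sketch(int fd, int op, off_t size)
    {
        struct flock l = {};
        l.l_type = F_WRLCK;
        l.l_whence = SEEK_CUR;  // the locked region starts at the current offset
        l.l_len = size;
        switch (op) {
        case F_ULOCK:
            l.l_type = F_UNLCK;
            return fcntl(fd, F_SETLK, &l);
        case F_LOCK:
            return fcntl(fd, F_SETLKW, &l);  // blocking acquire
        case F_TLOCK:
            return fcntl(fd, F_SETLK, &l);   // non-blocking acquire
        case F_TEST:
            if (fcntl(fd, F_GETLK, &l) < 0) {
                return -1;
            }
            if (l.l_type == F_UNLCK || l.l_pid == getpid()) {
                return 0;                    // region not locked by someone else
            }
            errno = EACCES;
            return -1;
        default:
            errno = EINVAL;
            return -1;
        }
    }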
-
Tomasz Grabiec authored
Fixes #308. When the per-cpu-pair ring fills up, the freeing thread is blocked and enters a synchronous object hand-off. That synchronous hand-off is the cause of contention. Instead of having a bounded ring we can use an unordered_queue_mpsc which links the freed objects in a chain. In this implementation push() always succeeds and we don't need to block. In a test which allocates 1K blocks on one CPU and has two threads freeing them on two other CPUs, there is a ~40% improvement in free() throughput. I tested various implementations, based on different queues. Statistics of free/sec reported by misc-free-perf (one sample = one run):

    current:                     avg = 8133055.09  stdev = 118322.06 samples = 5
    ring_spsc<1M> (no blocking): avg = 10442665.98 stdev = 476334.93 samples = 5
    unordered_queue_spsc:        avg = 10258212.69 stdev = 418194.22 samples = 5
    unordered_queue_mpsc:        avg = 11701334.99 stdev = 725299.97 samples = 5

Testing showed that unordered_queue_mpsc performs best in this case. Dead objects are collected by a per-CPU worker thread (same as before). The thread is woken up once every 256 frees. That threshold was chosen so that the behavior would more or less correspond to what was there before.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
-
Tomasz Grabiec authored
There is one allocating thread and two freeing threads. Each thread runs on a different core. The test measures the throughput of objects freed by both threads.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
-
Tomasz Grabiec authored
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
-
Tomasz Grabiec authored
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
-
Tomasz Grabiec authored
It is meant to provide both the speed of a ring buffer and the non-blocking properties of linked queues by combining the two. Unlike ring_spsc, push() is always guaranteed to succeed.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
-
Tomasz Grabiec authored
It's like queue_mpsc with two improvements:

* Consumer and producer links are cache-line aligned to avoid false sharing. I was tempted to apply this to queue_mpsc too, but then discovered that that queue is embedded in a mutex, and doing so would greatly bloat the mutex size, so I gave up on this idea.
* The contract of pop() is relaxed to return items in no particular order, so that we can avoid the cost of reversing the chain.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
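An illustrative sketch of the MPSC "unordered" scheme: producers push by swapping the head of an intrusive chain, and the single consumer detaches the whole chain and walks it without reversing, so items come out in no particular order. Names are stand-ins, not the OSv lockfree classes:

    #include <atomic>

    struct node {
        node* next = nullptr;
    };

    class unordered_mpsc_sketch {
    public:
        // Always succeeds: link the new node in front of the current head.
        void push(node* n) {
            node* old = _head.load(std::memory_order_relaxed);
            do {
                n->next = old;
            } while (!_head.compare_exchange_weak(old, n,
                         std::memory_order_release,
                         std::memory_order_relaxed));
        }

        // Single consumer: detach the whole chain at once and walk it.
        template <typename Fn>
        void drain(Fn fn) {
            node* n = _head.exchange(nullptr, std::memory_order_acquire);
            while (n) {
                node* next = n->next;
                fn(n);   // e.g. return the object to the allocator
                n = next;
            }
        }

    private:
        // A real implementation would also cache-line align producer- and
        // consumer-side state to avoid false sharing, as described above.
        std::atomic<node*> _head{nullptr};
    };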
-
- May 23, 2014
-
-
Gleb Natapov authored
Run a thread in the background to scan the pagecache for accessed pages and propagate them to the ARC. The thread may take anywhere from 0.1% to 20% of CPU time. There is no hard science behind how the current CPU usage is determined; it uses the page access rate to calculate how hard the pagecache should currently be scanned. It can be improved by taking the eviction rate into account too.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 22, 2014
-
-
Glauber Costa authored
preadv, pwritev.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
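As a usage illustration of the added calls (not code from the patch; names and sizes are arbitrary):

    #include <fcntl.h>
    #include <sys/uio.h>
    #include <unistd.h>

    // preadv() reads into several buffers at an explicit file offset without
    // moving the file position; pwritev() is the symmetric write call.
    ssize_t read_header_and_body(int fd, char* hdr, size_t hlen,
                                 char* body, size_t blen, off_t off)
    {
        struct iovec iov[2];
        iov[0].iov_base = hdr;
        iov[0].iov_len = hlen;
        iov[1].iov_base = body;
        iov[1].iov_len = blen;
        return preadv(fd, iov, 2, off);
    }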
-
- May 21, 2014
-
-
Claudio Fontana authored
The thread_control_block structure needs to be different between x64 and AArch64. For AArch64's local-exec implementation, try to match the layout in glibc and the generated code. Do not align the .tdata and .tbss sections with ".tdata : ALIGN(64)", or it will affect the TLS loads.

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
Cc: Glauber Costa <glommer@cloudius-systems.com>
Cc: Will Newton <will.newton@linaro.org>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Gleb Natapov authored
Java uses a PROT_READ page to synchronize threads, so it is worthwhile to catch this as fast as possible without taking vma_list_mutex. The patch does this by checking that the pte is not marked as cow during a write fault on a present pte, since cow or PROT_READ are the only reasons for a pte to be write protected. The problem is that to get the pte we need to walk the page table, but access to the page table is currently protected by vma_list_mutex. The patch uses RCU to free intermediate page table levels, which makes it possible to get to the pte without taking the vma_list_mutex lock.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Gleb Natapov authored
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Gleb Natapov authored
Java sometimes uses accesses to a PROT_NONE region to stop threads, so it is worthwhile to catch this as fast as possible without taking vma_list_mutex. The patch does this by setting the reserved bit on all ptes in a PROT_NONE VMA, which causes the RSVD bit to be set in the page fault error code. It is enough to check that bit to know that the access is to a valid VMA but permission is lacking.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- May 19, 2014
-
-
Tomasz Grabiec authored
memory_order_acquire does not prevent previous stores from moving past the barrier, so if the _migration_lock_counter increment is split into two accesses, this is eligible:

    tmp = _migration_lock_counter;
    atomic_signal_fence(std::memory_order_acquire); // load-load, load-store
    <critical instructions here>
    _migration_lock_counter = tmp + 1; // was moved past the barrier

To prevent this, we need to order previous stores with future loads and stores, which is given only by std::memory_order_seq_cst.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
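A minimal sketch of the resulting pattern, assuming a per-thread counter and compiler-only fences (the names are stand-ins, not the OSv definitions):

    #include <atomic>

    static thread_local unsigned _migration_lock_counter;

    inline void migrate_disable_sketch()
    {
        _migration_lock_counter++;
        // seq_cst signal fence: the compiler may move neither the increment
        // nor any earlier store past this point, closing the hole above.
        std::atomic_signal_fence(std::memory_order_seq_cst);
    }

    inline void migrate_enable_sketch()
    {
        std::atomic_signal_fence(std::memory_order_seq_cst);
        _migration_lock_counter--;
    }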
-
- May 18, 2014
-
-
Avi Kivity authored
Instead of forcing a reload (and a flush) of all variables in memory, use the minimum required barrier via std::atomic_signal_fence().

Reviewed-by: Tomasz Grabiec <tgrabiec@gmail.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Vlad Zolotarov authored
Proper memory ordering should be applied to loads and stores of the _begin field. Otherwise they may be reordered with the corresponding stores and loads to/from the _ring array, and in a corner case when the ring is full this may lead to ring data corruption.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Reported-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
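An illustrative single-producer/single-consumer ring sketch showing the ordering the fix is about; _begin/_end/_ring mirror the names in the message, but the details are not the actual ring code:

    #include <atomic>
    #include <cstddef>

    template <typename T, size_t N>
    class spsc_ring_sketch {
    public:
        bool push(const T& v) {   // producer side only
            size_t end = _end.load(std::memory_order_relaxed);
            // Acquire on _begin: the consumer's release guarantees its read of
            // the slot finished before we may overwrite that slot.
            size_t begin = _begin.load(std::memory_order_acquire);
            if (end - begin == N) {
                return false;     // full
            }
            _ring[end % N] = v;
            _end.store(end + 1, std::memory_order_release);
            return true;
        }

        bool pop(T& v) {          // consumer side only
            size_t begin = _begin.load(std::memory_order_relaxed);
            size_t end = _end.load(std::memory_order_acquire);
            if (begin == end) {
                return false;     // empty
            }
            v = _ring[begin % N];
            // Release on _begin: the read of the slot is complete before the
            // slot is handed back to the producer.
            _begin.store(begin + 1, std::memory_order_release);
            return true;
        }

    private:
        T _ring[N];
        std::atomic<size_t> _begin{0};
        std::atomic<size_t> _end{0};
    };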
-
Tomasz Grabiec authored
These functions are used to demarcate a critical section and should follow a contract which says that no operation inside the critical section may be moved before migrate_disable() or after migrate_enable(). These functions are declared inline, and the compiler could theoretically move instructions across them. Spotted during code contemplation.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- May 16, 2014
-
-
Nadav Har'El authored
thread::current()->thread_clock() returns the CPU time consumed by this thread. A thread that wishes to measure the amount of CPU time consumed by some short section of code will want this clock to have high resolution, but in the existing code it was only updated on context switches, so shorter durations could not be measured with it. This patch fixes thread_clock() to also add the time that has passed since the current time slice started. When running thread_clock() on *another* thread (not thread::current()), we still return a cpu time snapshot from the last context switch - even if the thread happens to be running now (on another CPU). Fixing that case is quite difficult (and will probably require additional memory-ordering guarantees), and anyway not very important: usually we don't need a high-resolution estimate of a different thread's cpu time. Fixes #302.

Reviewed-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
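A sketch of the clock computation described above (the field names are stand-ins, not the OSv thread internals):

    #include <chrono>

    using sched_clock = std::chrono::steady_clock;

    struct thread_clock_sketch {
        sched_clock::duration total_at_last_switch{}; // accumulated at context switch
        sched_clock::time_point running_since{};      // start of the current time slice
        bool is_current = true;

        sched_clock::duration thread_clock() const {
            if (is_current) {
                // High resolution: add the part of the current slice already used.
                return total_at_last_switch + (sched_clock::now() - running_since);
            }
            // For another thread, still return the snapshot taken at the last
            // context switch, as explained above.
            return total_at_last_switch;
        }
    };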
-
Glauber Costa authored
Again, we are currently calling a function every time we disable/enable preemption (actually a pair of functions), where simple mov instructions would do.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
We are heavily using this function to grab the address of the current thread. That means a function call will be issued every time that is done, where a simple mov instruction would do. For objects outside the main ELF, we don't want this to be inlined, since that would mean the resolution would have to go through an expensive __tls_get_addr. So what we do is not present the symbol as inline for them, and make sure the symbol is always generated.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 15, 2014
-
-
Vlad Zolotarov authored
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Vlad Zolotarov authored
This class is the heart of the per-CPU Tx framework. Except for a constructor it has two public methods:

- xmit(buff): push the packet descriptor downstream, either to the HW or into the per-CPU queue if there is contention.
- poll_until(cond): this is the main function of a worker thread that will consume packet descriptors from the per-CPU queue(s) and send them to the output iterator (which is responsible for ensuring their successful delivery to the HW channel).

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
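An interface sketch of the idea described above (type names, members and signatures are stand-ins, not the OSv classes):

    #include <functional>

    template <typename PacketDescriptor, typename PerCpuQueue, typename OutputIt>
    class cpu_xmit_sketch {
    public:
        cpu_xmit_sketch(PerCpuQueue& q, OutputIt out) : _queue(q), _out(out) {}

        // Fast path: try to hand the descriptor straight to the HW; if the
        // device is contended, park it in this CPU's queue instead.
        void xmit(PacketDescriptor buff) {
            if (!try_send_to_hw(buff)) {
                _queue.push(buff);
            }
        }

        // Worker-thread loop: drain the per-CPU queue into the output iterator
        // until the supplied condition tells it to stop.
        void poll_until(std::function<bool()> stop) {
            while (!stop()) {
                PacketDescriptor d;
                while (_queue.pop(d)) {
                    *_out++ = d;  // the iterator is responsible for delivery to the HW channel
                }
            }
        }

    private:
        bool try_send_to_hw(PacketDescriptor&) { return false; }  // placeholder
        PerCpuQueue& _queue;
        OutputIt _out;
    };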
-
Vlad Zolotarov authored
This class will represent a single per-CPU Tx queue. These queues will be subject to merging by the nway_merger class in order to address the reordering issue. Therefore this class will implement the following methods/classes:

- push(val)
- empty()
- front(), which will return an iterator that implements:
  - operator *() to access the underlying value
- erase(it), which will pop the front element.

If the producer fails to push a new element into the queue (the queue is full), then it may start "waiting for the queue": request to be woken when the queue is not full anymore (when the consumer frees some entries from the queue):

- push_new_waiter() method.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Vlad Zolotarov authored
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Vlad Zolotarov authored
This class allows efficiently merging n sorted containers. It supports both single-call merging with a merge() method and iterator-like semantics with a pop() method. In both cases the merged stream (or the next element) is streamed to the output iterator.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
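A sketch of an n-way merge over sorted containers using a min-heap of cursors (illustrative only, not the OSv nway_merger):

    #include <queue>
    #include <vector>

    template <typename Container, typename OutputIt>
    void nway_merge_sketch(std::vector<Container*>& inputs, OutputIt out)
    {
        using It = typename Container::iterator;
        struct cursor { It pos, end; };
        auto greater = [](const cursor& a, const cursor& b) {
            return *a.pos > *b.pos;      // min-heap on the current head element
        };
        std::priority_queue<cursor, std::vector<cursor>, decltype(greater)>
            heap(greater);
        for (auto* c : inputs) {
            if (!c->empty()) {
                heap.push({c->begin(), c->end()});
            }
        }
        while (!heap.empty()) {
            cursor cur = heap.top();
            heap.pop();
            *out++ = *cur.pos;           // emit the smallest head element
            if (++cur.pos != cur.end) {
                heap.push(cur);          // keep merging from this container
            }
        }
    }

    // Hypothetical use (requires <list>, <iterator>, <iostream>):
    //   std::list<int> a{1, 4, 7}, b{2, 5}, c{3, 6};
    //   std::vector<std::list<int>*> inputs{&a, &b, &c};
    //   nway_merge_sketch(inputs, std::ostream_iterator<int>(std::cout, " "));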
-
Vlad Zolotarov authored
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Vlad Zolotarov authored
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-