- May 26, 2014
-
Tomasz Grabiec authored
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
-
Tomasz Grabiec authored
It is meant to provide both the speed of a ring buffer and the non-blocking properties of a linked queue by combining the two. Unlike with ring_spsc, push() is always guaranteed to succeed.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
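A minimal sketch of the combination described above (the class name and the Ring/Linked building blocks are hypothetical stand-ins, not OSv's actual types): a bounded ring is tried first for speed, and an unbounded linked queue absorbs the overflow so push() can never fail. Cross-structure FIFO ordering is glossed over here; the real implementation has to deal with it.
    // C++ sketch, assuming Ring provides try_push()/try_pop() and Linked
    // provides push()/try_pop() (both single-producer/single-consumer).
    #include <utility>

    template <typename T, typename Ring, typename Linked>
    class ring_plus_linked_queue {
    public:
        void push(T v) {
            if (!_ring.try_push(v)) {          // fast path: fixed-size ring
                _overflow.push(std::move(v));  // slow path: allocates, never fails
            }
        }
        bool pop(T& v) {
            if (_ring.try_pop(v)) {
                return true;                   // drain the ring first
            }
            return _overflow.try_pop(v);       // then the overflow queue
        }
    private:
        Ring _ring;                            // e.g. a ring_spsc-style buffer
        Linked _overflow;                      // e.g. a linked SPSC queue
    };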
-
Tomasz Grabiec authored
It's like queue_mpsc with two improvements:
* Consumer and producer links are cache-line aligned to avoid false sharing. I was tempted to apply this to queue_mpsc too, but then discovered that this queue is embedded in a mutex, and doing so would greatly bloat the mutex size, so I gave up on this idea.
* The contract of pop() is relaxed to return items in no particular order, so that we can avoid the cost of reversing the chain.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
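A hedged sketch of the two points above, with illustrative names rather than the real class: the producer-side head and the consumer-side chain pointer are kept on separate cache lines, and pop() simply walks the detached chain in whatever order the producers left it, instead of reversing it into FIFO order first.
    #include <atomic>

    template <typename T>
    class queue_mpsc_unordered_sketch {
    public:
        struct node {
            node* next;
            T value;
        };

        void push(node* n) {                       // multiple producers
            n->next = _head.load(std::memory_order_relaxed);
            while (!_head.compare_exchange_weak(n->next, n,
                                                std::memory_order_release,
                                                std::memory_order_relaxed)) {
            }
        }

        node* pop() {                              // single consumer
            if (!_chain) {
                // grab everything the producers pushed so far, unreversed
                _chain = _head.exchange(nullptr, std::memory_order_acquire);
            }
            node* n = _chain;
            if (n) {
                _chain = n->next;
            }
            return n;
        }

    private:
        alignas(64) std::atomic<node*> _head{nullptr};  // producers' cache line
        alignas(64) node* _chain = nullptr;             // consumer's cache line
    };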
-
- May 23, 2014
-
Gleb Natapov authored
Run a background thread to scan the pagecache for accessed pages and propagate them to the ARC. The thread may take anywhere from 0.1% to 20% of CPU time. There is no hard science behind how the current CPU usage is determined: it uses the page access rate to calculate how aggressively the pagecache should currently be scanned. It could be improved by taking the eviction rate into account too.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 22, 2014
-
Glauber Costa authored
preadv, pwritev.
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 21, 2014
-
Claudio Fontana authored
The thread_control_block structure needs to be different between x64 and AArch64. For AArch64's local-exec implementation, try to match the layout in glibc and the generated code. Do not align the .tdata and .tbss sections with .tdata : ALIGN(64), or it will affect the TLS loads.
Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
Cc: Glauber Costa <glommer@cloudius-systems.com>
Cc: Will Newton <will.newton@linaro.org>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Gleb Natapov authored
Java uses a PROT_READ page to synchronize threads, so it is worthwhile to be able to catch this as fast as possible without taking vma_list_mutex. The patch does this by checking, during a write fault on a present pte, that the pte is not marked as COW, since COW or PROT_READ are the only reasons for a pte to be write-protected. The problem is that to get to the pte we need to walk the page table, but access to the page table is currently protected by vma_list_mutex. The patch uses RCU to free intermediate page-table levels, which makes it possible to reach the pte without taking vma_list_mutex.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
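A self-contained sketch of the fast-path decision described above (the bit positions and names are made up for illustration; the real pte layout differs). The point is that once the page-table walk itself is safe under an RCU read-side lock, a single pte test distinguishes genuine COW from a PROT_READ mapping without vma_list_mutex.
    #include <cstdint>

    struct pte_sketch {
        uint64_t raw;
        bool present()  const { return raw & (1ull << 0); }
        bool writable() const { return raw & (1ull << 1); }
        bool cow()      const { return raw & (1ull << 52); }  // software bit, illustrative
    };

    // True when a write fault on a present pte can be resolved without
    // vma_list_mutex: write-protected but not COW can only mean a PROT_READ
    // vma, so the fault can be reported immediately.
    inline bool write_fault_fast_path(pte_sketch entry) {
        return entry.present() && !entry.writable() && !entry.cow();
    }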
-
Gleb Natapov authored
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Gleb Natapov authored
Java sometimes uses accesses to a PROT_NONE region to stop threads, so it is worthwhile to be able to catch this as fast as possible without taking vma_list_mutex. The patch does this by setting a reserved bit on all ptes in a PROT_NONE VMA, which causes the RSVD bit to be set in the page fault error code. Checking that bit is enough to know that the access was to a valid VMA but lacked permission.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
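A conceptual sketch of the check (the constant reflects the x86 page-fault error code, where bit 3 reports a reserved-bit violation; the function name is illustrative): seeing RSVD set is enough to know the access hit a deliberately poisoned pte of a PROT_NONE vma, so no vma lookup is needed.
    constexpr unsigned pf_rsvd = 1u << 3;   // RSVD bit of the x86 error code

    // A reserved-bit fault can only come from a pte we poisoned when mapping
    // a PROT_NONE vma, so the access is to a valid vma and only lacks permission.
    inline bool fault_is_prot_none(unsigned error_code) {
        return (error_code & pf_rsvd) != 0;
    }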
-
- May 19, 2014
-
Tomasz Grabiec authored
memory_order_acquire does not prevent previous stores from moving past the barrier, so if the _migration_lock_counter increment is split into two accesses, the following is possible:
    tmp = _migration_lock_counter;
    atomic_signal_fence(std::memory_order_acquire); // load-load, load-store
    <critical instructions here>
    _migration_lock_counter = tmp + 1; // was moved past the barrier
To prevent this, we need to order previous stores with future loads and stores, which only std::memory_order_seq_cst gives.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
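A sketch of the resulting code, assuming a per-thread counter standing in for the real one: with a seq_cst signal fence the compiler may not sink the counter store below the critical section, which the acquire fence allowed.
    #include <atomic>

    static thread_local unsigned _migration_lock_counter_sketch = 0;

    inline void migrate_disable_sketch() {
        ++_migration_lock_counter_sketch;
        // full compiler barrier: orders the store above against everything below
        std::atomic_signal_fence(std::memory_order_seq_cst);
        // ... critical section follows ...
    }

    inline void migrate_enable_sketch() {
        std::atomic_signal_fence(std::memory_order_seq_cst);
        --_migration_lock_counter_sketch;
    }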
-
- May 18, 2014
-
Avi Kivity authored
Instead of forcing a reload (and a flush) of all variables in memory, use the minimum required barrier via std::atomic_signal_fence().
Reviewed-by: Tomasz Grabiec <tgrabiec@gmail.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Vlad Zolotarov authored
Proper memory ordering should be applied to the loads and stores of the _begin field. Otherwise they may be reordered with the corresponding stores and loads to/from the _ring array, and in a corner case, when the ring is full, this may lead to corruption of the ring data.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Reported-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
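A minimal SPSC ring sketch showing the ordering the fix asks for (member names follow the message; the real class is more elaborate): each index update is released only after the slot access, and each side reads the other side's index with acquire.
    #include <atomic>
    #include <cstddef>

    template <typename T, size_t N>
    struct ring_sketch {
        T _ring[N];
        std::atomic<size_t> _begin{0};   // consumer index
        std::atomic<size_t> _end{0};     // producer index

        bool push(const T& v) {          // producer side
            size_t end = _end.load(std::memory_order_relaxed);
            if (end - _begin.load(std::memory_order_acquire) == N) {
                return false;            // full
            }
            _ring[end % N] = v;          // write the slot first ...
            _end.store(end + 1, std::memory_order_release);   // ... then publish
            return true;
        }

        bool pop(T& v) {                 // consumer side
            size_t beg = _begin.load(std::memory_order_relaxed);
            if (_end.load(std::memory_order_acquire) == beg) {
                return false;            // empty
            }
            v = _ring[beg % N];          // read the slot first ...
            _begin.store(beg + 1, std::memory_order_release); // ... then release it
            return true;
        }
    };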
-
Tomasz Grabiec authored
These functions are used to demarcate a critical section and should follow a contract which says that no operation inside the critical section may be moved before migrate_disable() or after migrate_enable(). These functions are declared inline and the compiler could theoretically move instructions across them. Spotted during code contemplation.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- May 16, 2014
-
Nadav Har'El authored
thread::current()->thread_clock() returns the CPU time consumed by this thread. A thread that wishes to measure the amount of CPU time consumed by some short section of code will want this clock to have high resolution, but in the existing code it was only updated on context switches, so shorter durations could not be measured with it. This patch fixes thread_clock() to also add the time that has passed since the current time slice started. When running thread_clock() on *another* thread (not thread::current()), we still return a CPU time snapshot from the last context switch - even if the thread happens to be running now (on another CPU). Fixing that case is quite difficult (and will probably require additional memory-ordering guarantees), and anyway not very important: usually we don't need a high-resolution estimate of a different thread's CPU time. Fixes #302.
Reviewed-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
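A sketch of the idea using standard clocks and illustrative parameters (the real code uses the scheduler's own accounting fields): for the current thread, top up the total recorded at the last context switch with the time spent in the running slice.
    #include <chrono>

    std::chrono::nanoseconds thread_clock_sketch(
            std::chrono::nanoseconds total_at_last_switch,
            std::chrono::steady_clock::time_point slice_started,
            bool is_current_thread)
    {
        using namespace std::chrono;
        if (is_current_thread) {
            // high resolution: include the partially consumed time slice
            return total_at_last_switch +
                   duration_cast<nanoseconds>(steady_clock::now() - slice_started);
        }
        // other threads: return the (possibly slightly stale) snapshot
        return total_at_last_switch;
    }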
-
Glauber Costa authored
Again, we are currently calling a function (actually a pair of functions) every time we disable or enable preemption, where simple mov instructions would do.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
We use this function heavily to grab the address of the current thread, which means a function call is issued every time that is done, where a simple mov instruction would do. For objects outside the main ELF, we don't want it to be inlined, since then the resolution would have to go through an expensive __tls_get_addr. So we don't present the symbol as inline for them, and make sure the symbol is always generated.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 15, 2014
-
Vlad Zolotarov authored
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Vlad Zolotarov authored
This class is the heart of the per-CPU Tx framework. Apart from the constructor it has two public methods:
- xmit(buff): push the packet descriptor downstream, either to the HW or into the per-CPU queue if there is contention.
- poll_until(cond): the main function of a worker thread that consumes packet descriptors from the per-CPU queue(s) and sends them to the output iterator (which is responsible for ensuring their successful delivery to the HW channel).
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Vlad Zolotarov authored
This class will represent a single per-CPU Tx queue. These queues will be subject to merging by the nway_merger class in order to address the reordering issue. Therefore this class will implement the following methods/classes:
- push(val)
- empty()
- front(), which will return an iterator that implements:
  - operator *() to access the underlying value
- erase(it), which will pop the front element.
If the producer fails to push a new element into the queue (the queue is full), it may start "waiting for the queue": request to be woken when the queue is no longer full (when the consumer frees some entries from the queue):
- push_new_waiter() method.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Vlad Zolotarov authored
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Vlad Zolotarov authored
This class allows efficient merging of n sorted containers. It supports both single-call merging with the merge() method and iterator-like semantics with the pop() method. In both cases the merged stream/next element is streamed to the output iterator.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
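A conceptual sketch of the one-shot merge (names are illustrative, and the incremental pop() side is omitted): keep the head of each sorted container in a min-heap and repeatedly emit the smallest head to the output iterator.
    #include <queue>
    #include <vector>

    template <typename Container, typename OutIt>
    void nway_merge_sketch(std::vector<Container*>& inputs, OutIt out) {
        using It = typename Container::iterator;
        struct head { It cur, end; };
        // min-heap on the value each head iterator currently points to
        auto greater = [](const head& a, const head& b) { return *b.cur < *a.cur; };
        std::priority_queue<head, std::vector<head>, decltype(greater)> heap(greater);

        for (auto* c : inputs) {
            if (!c->empty()) {
                heap.push({c->begin(), c->end()});
            }
        }
        while (!heap.empty()) {
            head h = heap.top();
            heap.pop();
            *out++ = *h.cur;            // emit the smallest head element
            if (++h.cur != h.end) {
                heap.push(h);           // re-insert with the advanced iterator
            }
        }
    }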
-
Vlad Zolotarov authored
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Vlad Zolotarov authored
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Pawel Dziepak authored
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pawel Dziepak <pdziepak@quarnos.org>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Pawel Dziepak authored
This patch implements lockfree_queue (which is used as incoming_wakeup_queue) so that it doesn't need exchange or compare_exchange operations. The idea is to use a linked list, but interleave the actual objects stored in the queue with helper objects (lockless_queue_helper) which are just pointers to the next element. Each object in the queue owns the helper that precedes it (and they are dequeued together), while the last helper, which does not precede any object, is owned by the queue itself. When a new object is enqueued it gains ownership of the last helper in the queue in exchange for the helper it owned before, which now becomes the new tail of the list. Unlike the original implementation, this version of lockfree_queue really requires that there is no more than one concurrent producer and no more than one concurrent consumer.
The results of tests/misc-ctxs on my test machine are as follows (the values are medians of five runs):
    before: colocated: 332 ns, apart: 590 ns
    after:  colocated: 313 ns, apart: 558 ns
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pawel Dziepak <pdziepak@quarnos.org>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 14, 2014
-
Tomasz Grabiec authored
This introduces a simple timer-based sampling profiler which reuses our tracing infrastructure to collect samples.
To enable the sampler from run.py, run it like this:
    $ scripts/run.py ... --sampler [frequency]
Where 'frequency' is an optional parameter for overriding the sampling frequency. The default is 1000 (ticks per second). The bigger the frequency, the bigger the sampling overhead; values that are too low will hurt profile accuracy. Ad-hoc sampler enabling is planned; the code already takes that into account.
To see the profile you first need to extract the trace:
    $ trace extract
And then show it like this:
    $ trace prof
All 'prof' options can be applied; for example, you can group by CPU:
    $ trace prof -g cpu
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
The sampler will need to set and later restore the value of this option.
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Takuya ASADA authored
lookup_name_demangled() looks up a symbol name, demangles it, then snprintfs it onto a preallocated buffer.
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 13, 2014
-
Glauber Costa authored
While running one of the redis benchmarks, I saw around 23k calls to malloc_large. Among those, ~10-11k were 2-page sized. I managed to track it down to the creation of net channels. The problem is that the net channel structure is slightly larger than half a page - the maximum size for small-object pools - which throws all those allocations into malloc_large. Besides being slow, this also wastes a page for every net channel created, since malloc_large includes an extra page at the beginning of each allocation. This patch fixes that by overloading the operators new and delete for the net channel structure, so that we use the more efficient and less wasteful alloc_page.
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
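A sketch of the approach; OSv's page allocator is stood in for here by aligned_alloc/free so the example is self-contained, and net_channel_like is just an illustrative stand-in for the real structure.
    #include <cassert>
    #include <cstddef>
    #include <cstdlib>

    static void* page_alloc_stub()          { return std::aligned_alloc(4096, 4096); }
    static void  page_free_stub(void* page) { std::free(page); }

    struct net_channel_like {
        char payload[2200];                  // just over half a 4K page, like net_channel

        static void* operator new(std::size_t size) {
            assert(size <= 4096);            // the trick is only valid up to one page
            return page_alloc_stub();        // a whole page, no malloc_large header page
        }
        static void operator delete(void* ptr) {
            page_free_stub(ptr);
        }
    };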
-
- May 12, 2014
-
Glauber Costa authored
Export the shrinker interface to C users.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 08, 2014
-
Nadav Har'El authored
OSv is currently limited to 64 vCPUs, because we use a 64-bit bitmask for wakeups (see max_cpus in sched.cc). Having exactly 64 CPUs *should* work, but unfortunately didn't because of a bug: cpu_set::operator++ first incremented the index, and then called advance() to find the following one-bit. When the index was 63 we expect operator++ to return 64 (end(), signaling the end of the iteration), but what happened was that after it incremented the index to 64, advance() wrongly handled the case idx=64 (1<<64 returns 1, unexpectedly) and moved it back to idx=63. The patch fixes operator++ to not call advance() when idx=64 is reached, so it now works correctly also for idx=63, and booting with 64 CPUs works. Fixes #234.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
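A sketch of the fixed iterator logic with illustrative names (max_cpus, mask, idx): the only change that matters here is that operator++ skips advance() once idx has reached max_cpus, so idx=63 correctly steps to end() instead of being pulled back by the 1<<64 overflow.
    struct cpu_set_iterator_sketch {
        static constexpr unsigned max_cpus = 64;
        unsigned long mask;                 // the 64-bit wakeup bitmask
        unsigned idx;                       // current position; max_cpus means end()

        void advance() {
            // find the next one-bit at or after idx; only valid for idx < max_cpus
            while (idx < max_cpus && !(mask & (1UL << idx))) {
                ++idx;
            }
        }
        cpu_set_iterator_sketch& operator++() {
            ++idx;
            if (idx < max_cpus) {           // the fix: never call advance() at idx == 64
                advance();
            }
            return *this;
        }
    };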
-
Jaspal Singh Dhillon authored
This patch changes the definition of __assert_fail() in api/assert.h, which allows it and other header files which include it (such as debug.hh) to be used in mgmt submodules. Fixes a conflict with the declaration of __assert_fail() in external/x64/glibc.bin/usr/include/assert.h.
Signed-off-by: Jaspal Singh Dhillon <jaspal.iiith@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 07, 2014
-
Jani Kokkonen authored
The class construction of the page_table_root must happen before priority "mempool", or all the work done in arch-setup will be destroyed by the class constructor. Problem noticed while working on the page fault handler for AArch64.
Signed-off-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- May 05, 2014
-
Tomasz Grabiec authored
The synchronizer allows any thread to block on it until it is unlocked. It is unlocked once count_down() has been called a given number of times.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
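A minimal stand-in built on standard primitives (the OSv synchronizer presumably uses its own wait machinery; this only illustrates the contract): any number of threads may block in await() until count_down() has been called the configured number of times.
    #include <condition_variable>
    #include <mutex>

    class countdown_sketch {
    public:
        explicit countdown_sketch(unsigned count) : _count(count) {}

        void count_down() {
            std::lock_guard<std::mutex> lock(_mtx);
            if (_count > 0 && --_count == 0) {
                _cv.notify_all();          // last count releases all waiters
            }
        }
        void await() {
            std::unique_lock<std::mutex> lock(_mtx);
            _cv.wait(lock, [this] { return _count == 0; });
        }
    private:
        std::mutex _mtx;
        std::condition_variable _cv;
        unsigned _count;
    };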
-
Takuya ASADA authored
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Takuya ASADA authored
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
The current tracepoint coverage does not handle all situations well. In particular:
* It does not cover link-layer devices other than virtio-net. This change fixes that by tracing in more abstract layers.
* It records incoming packets at enqueue time, whereas sometimes it is better to trace at handling time. This can be very useful when correlating TCP state changes with incoming packets. A new tracepoint was introduced for that: net_packet_handling.
* It does not record the protocol of the buffer. For non-ethernet protocols we should set the appropriate protocol type when reconstructing the ethernet frame for dumping to PCAP.
We now have the following tracepoints:
* net_packet_in - for incoming packets, enqueued or handled directly.
* net_packet_out - for outgoing packets hitting the link layer (not loopback).
* net_packet_handling - for packets which have been queued and are now being handled.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 04, 2014
-
Tomasz Grabiec authored
Currently a tracepoint's signature string is encoded into a u64, which imposes an 8-character limit on the signature. When the signature does not fit into that limit, only the first 8 characters are preserved. This patch fixes the problem by storing the signature as a C string of arbitrary length. Fixes #288.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- May 03, 2014
-
Gleb Natapov authored
An attempt to get a read ARC buffer for a hole in a file results in a temporary ARC buffer which is destroyed immediately after use. This means that mapping such a buffer is impossible: it is unmapped before the page fault handler returns to the application. The patch solves this by detecting that a hole in the file is being accessed and mapping a special zero page instead. It is mapped as COW, so on a write attempt a new page is allocated.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- Apr 29, 2014
-
Claudio Fontana authored
Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-