- Apr 24, 2014
-
Glauber Costa authored
The jemalloc memory allocator makes intense use of MADV_DONTNEED to flush pages it is no longer using. Respect that advice. Let's keep returning -1 for the remaining cases so we don't fool anybody.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
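For illustration, a minimal sketch of that policy; release_range() is a hypothetical stand-in for the page-freeing path, not OSv's actual API:

    #include <sys/mman.h>
    #include <cerrno>
    #include <cstddef>

    void release_range(void* addr, size_t length); // hypothetical page-freeing helper

    // Honor MADV_DONTNEED; keep failing everything else with EINVAL.
    extern "C" int madvise(void* addr, size_t length, int advice)
    {
        if (advice == MADV_DONTNEED) {
            release_range(addr, length); // drop backing pages; the range stays mapped
            return 0;
        }
        errno = EINVAL; // unimplemented advice: don't pretend we honored it
        return -1;
    }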
-
Glauber Costa authored
MongoDB wants it. In general, I am providing the information that is easy to get, and ignoring the information which is not - with the exception of the process count, which seemed easy enough to implement. This is the kind of thing Mongo does with it:

    2014-04-15T09:54:12.322+0000 [clientcursormon] mem (MB) res:670160 virt:25212
    2014-04-15T09:54:12.323+0000 [clientcursormon] mapped (incl journal view):160
    2014-04-15T09:54:12.324+0000 [clientcursormon] connections:0

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
This is one of the statistics that shows up in /proc/self/stat under Linux, and it is generally interesting for applications. Since we don't have separate kernel and user modes, it is very hard to differentiate between "time spent in userspace" and "kernel time spent on behalf of the process". Therefore, we will always present system time as 0. If we wanted, we could at least clearly account OSv-specific threads as system time, but there is no need to go through the trouble now.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
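A hedged sketch of what such reporting could look like; write_stat_times() is a hypothetical helper, and the tick unit of 100 per second is an assumption matching the usual Linux stat(5) convention:

    #include <cstdio>
    #include <ctime>

    // Report all CPU time as user time and hard-code system time to 0.
    static void write_stat_times(FILE* out)
    {
        timespec ts;
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
        unsigned long long utime = ts.tv_sec * 100ULL + ts.tv_nsec / 10000000ULL;
        fprintf(out, "%llu %llu", utime, 0ULL); // utime stime
    }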
-
Glauber Costa authored
It will be used for procfs compatibility. Applications may want to know how much memory is potentially available through mmap mappings (not necessarily allocated).

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
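A minimal sketch of the kind of bookkeeping this implies; all names here are illustrative, not OSv's actual ones:

    #include <atomic>
    #include <cstddef>

    // Global counter adjusted as mappings come and go, later exported via procfs.
    static std::atomic<size_t> total_mapped{0};

    void account_map(size_t bytes)   { total_mapped.fetch_add(bytes, std::memory_order_relaxed); }
    void account_unmap(size_t bytes) { total_mapped.fetch_sub(bytes, std::memory_order_relaxed); }
    size_t mapped_bytes()            { return total_mapped.load(std::memory_order_relaxed); }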
-
- Apr 22, 2014
-
Glauber Costa authored
While we take pride in having no spinlocks in the system, if an application wants to use them, who are we to deny it this god-given right? Some applications implement spinlocks through the pthread interface, which is what I implement here. We did not have any standard trylock mechanism, so one is provided. Other than that, the interface is pretty trivial, except that it seems to provide some protection against deadlocks. We will just ignore that for the moment and assume a well-behaved application.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
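A minimal sketch of such a spinlock, including the trylock mechanism, built on a portable atomic flag; the spin_* names are illustrative, and the pshared argument and deadlock protection are deliberately omitted:

    #include <atomic>
    #include <cerrno>

    struct spinlock_t { std::atomic_flag locked = ATOMIC_FLAG_INIT; };

    int spin_lock(spinlock_t* s)
    {
        while (s->locked.test_and_set(std::memory_order_acquire)) {
            // busy-wait; a real version would add a CPU pause hint here
        }
        return 0;
    }

    int spin_trylock(spinlock_t* s)
    {
        // return EBUSY, like pthread_spin_trylock, if the lock is already held
        return s->locked.test_and_set(std::memory_order_acquire) ? EBUSY : 0;
    }

    int spin_unlock(spinlock_t* s)
    {
        s->locked.clear(std::memory_order_release);
        return 0;
    }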
-
Glauber Costa authored
MongoDB expects that call and would like it to guarantee allocation of blocks in the file. It does have a fallback, so for the time being I am just providing the symbol. I have opened Issue #265 to track this.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
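A symbol-only stub could look like the sketch below; this illustrates the approach, not necessarily the committed code:

    #include <cerrno>
    #include <sys/types.h>

    // Symbol-only stub: MongoDB has a fallback path when the call fails.
    extern "C" int fallocate(int fd, int mode, off_t offset, off_t len)
    {
        (void)fd; (void)mode; (void)offset; (void)len;
        errno = ENOSYS;
        return -1;
    }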
-
Glauber Costa authored
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Avi Kivity authored
The debug allocator can allocate non-contiguous memory for large requests, but since b7de9871 it uses only one sg entry for the entire buffer. One possible fix is to allocate contiguous memory even under the debug allocator, but in the future we may wish to allow discontiguous allocation when not enough contiguous space is available. So instead we implement a virt_to_phys() variant that takes a range and outputs the physical segments that make it up, and use that to construct a minimal sg list depending on the input.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
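A sketch of how such a range variant might merge pages into segments, assuming an existing single-address virt_to_phys() and 4K pages; names and signatures are illustrative, not the committed interface:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <functional>

    using phys = uint64_t;
    phys virt_to_phys(void* va);        // assumed existing single-address translation
    constexpr size_t page_size = 4096;

    // Walk the buffer page by page, merging physically adjacent pages, and
    // emit one (address, length) segment per contiguous run.
    void virt_to_phys_range(void* va, size_t len,
                            std::function<void(phys, size_t)> emit)
    {
        char* p = static_cast<char*>(va);
        while (len) {
            phys start = virt_to_phys(p);
            size_t off = reinterpret_cast<uintptr_t>(p) & (page_size - 1);
            size_t seg = std::min(len, page_size - off);
            // extend the segment while the next page is physically contiguous
            while (seg < len && virt_to_phys(p + seg) == start + seg) {
                seg += std::min(len - seg, page_size);
            }
            emit(start, seg);
            p += seg;
            len -= seg;
        }
    }

A fully contiguous buffer thus yields a single sg entry, while a fragmented one yields one entry per physical run.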
-
Nadav Har'El authored
Normally, symbol binding in shared objects is lazy, using the PLTGOT mechanism. This means that a symbol is resolved only when first used. This is great because it speeds up object load, and also allows us never to implement symbols which aren't actually used in any real code path. However, as issue #256 shows, symbols which are used in DSOs from a preemption-disabled context cannot be resolved on first use, because symbol resolution may sleep. Two important examples of this are sched::thread::wait() and sched::thread::stop_wait(), both used by wait_until() while it is in preempt_disable. This patch adds the missing support for the standard DT_BIND_NOW tag. This tag can be added to an object with the "-z now" ld option. When an object has this tag, all its symbols are resolved at load time, instead of lazily (when first used). Bug #256 can be fixed by linking tst-mmap.so with "-z now" (this will be a separate patch).

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
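For reference, a sketch of the detection step only; depending on the linker, "-z now" may be expressed as DT_BIND_NOW, DF_BIND_NOW or DF_1_NOW, so a robust check looks at all three. The eager pass itself would then walk the PLT relocations and resolve each one at load time:

    #include <elf.h>

    // Scan the dynamic table for any of the "bind now" markers.
    static bool wants_bind_now(const Elf64_Dyn* dyn)
    {
        bool now = false;
        for (; dyn->d_tag != DT_NULL; ++dyn) {
            if (dyn->d_tag == DT_BIND_NOW ||
                (dyn->d_tag == DT_FLAGS && (dyn->d_un.d_val & DF_BIND_NOW)) ||
                (dyn->d_tag == DT_FLAGS_1 && (dyn->d_un.d_val & DF_1_NOW))) {
                now = true;
            }
        }
        return now;
    }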
-
- Apr 20, 2014
-
Avi Kivity authored
The debug allocator can allocate non-contiguous memory for large requests, but since b7de9871 it uses only one sg entry for the entire buffer. One possible fix is to allocate contiguous memory even under the debug allocator, but in the future we may wish to allow discontiguous allocation when not enough contiguous space is available. So instead we implement a virt_to_phys() variant that takes a range and outputs the physical segments that make it up, and use that to construct a minimal sg list depending on the input.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- Apr 17, 2014
-
Calle Wilund authored
Per-CPU trace buffers. The actual buffer space is kept at roughly the "same" as previously for up to 4 vCPUs; above that, the space used will be higher. Does not handle vCPUs appearing or disappearing at runtime. Trace events are allocated with a "not done" terminator marker, which is finalized when the event is written; this should prevent any partial data from messing up extraction. Fixes #146.

Signed-off-by: Calle Wilund <calle@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
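A sketch of the terminator idea, with an illustrative record layout (not the actual trace buffer format):

    #include <atomic>
    #include <cstdint>
    #include <cstring>

    struct trace_record {
        std::atomic<uint32_t> done; // 0 = "not done" terminator, 1 = finalized
        uint32_t len;
        char payload[56];
    };

    // Extraction skips any record whose 'done' flag is still 0, so a
    // half-written event can never be read back as valid data.
    void write_event(trace_record* r, const void* data, uint32_t len)
    {
        r->done.store(0, std::memory_order_relaxed); // claim slot, mark unfinished
        r->len = len;
        std::memcpy(r->payload, data, len);
        r->done.store(1, std::memory_order_release); // finalize for readers
    }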
-
- Apr 16, 2014
-
Claudio Fontana authored
Also enable core/pagecache.cc in the AArch64 build.

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
-
Jani Kokkonen authored
Signed-off-by: Jani Kokkonen <jani.kokkonen@huawei.com>
[claudio: some fixes]
Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
-
Jani Kokkonen authored
Add the APIs to flush a single processor's TLB or all TLBs in the cluster.

Signed-off-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
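The two flavors typically map to the local and inner-shareable forms of the AArch64 tlbi instruction; a hedged sketch with illustrative function names:

    // "vmalle1" invalidates the local core's TLB entries; the "is"
    // (inner-shareable) variant broadcasts the invalidation to every
    // core in the cluster.
    static inline void tlb_flush_local()
    {
        asm volatile("tlbi vmalle1\n\t"
                     "dsb nsh\n\t"
                     "isb" ::: "memory");
    }

    static inline void tlb_flush_all()
    {
        asm volatile("tlbi vmalle1is\n\t"
                     "dsb ish\n\t"
                     "isb" ::: "memory");
    }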
-
- Apr 15, 2014
-
Pawel Dziepak authored
The GNU_RELRO segment is used to inform the dynamic linker which sections need to be writable only while relocating the ELF file, and can be made read-only afterwards. Usually GNU_RELRO overlaps with a standard LOAD segment that contains readable and writable data, but the ELF file is generated in such a way that it is possible to properly set per-page permissions.

Signed-off-by: Pawel Dziepak <pdziepak@quarnos.org>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
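A sketch of the post-relocation step this enables, assuming a 4K page size; protect_relro() is an illustrative name, not the committed code:

    #include <elf.h>
    #include <sys/mman.h>
    #include <cstdint>

    // After relocations are applied, clamp the RELRO region to page
    // granularity and drop its write permission.
    void protect_relro(const Elf64_Phdr& ph, uintptr_t load_base)
    {
        constexpr uintptr_t mask = ~uintptr_t(4095);
        uintptr_t start = (load_base + ph.p_vaddr) & mask;
        uintptr_t end = (load_base + ph.p_vaddr + ph.p_memsz) & mask;
        if (end > start) {
            mprotect(reinterpret_cast<void*>(start), end - start, PROT_READ);
        }
    }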
-
Pawel Dziepak authored
Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pawel Dziepak <pdziepak@quarnos.org>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Pawel Dziepak authored
.tracepoint_patch_sites contains pointers to locations in the .text section which need to be relocated.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pawel Dziepak <pdziepak@quarnos.org>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- Apr 14, 2014
-
Glauber Costa authored
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- Apr 13, 2014
-
Avi Kivity authored
This is a singly-linked list, suitable for building an RCU hash table. Only a minimal interface is implemented so far. The list exposes two list-like interfaces: one for a mutating owner, the other for read-only RCU access.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
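A sketch of the general shape of such a list; the node layout and for_each() are illustrative, not the committed interface:

    #include <atomic>

    template <typename T>
    struct rcu_slist_node {
        T data;
        std::atomic<rcu_slist_node*> next{nullptr};
    };

    // Read-side traversal; must run inside an RCU read-side critical
    // section so nodes cannot be freed mid-walk. The owner mutates the
    // list under its own lock and publishes nodes with release stores.
    template <typename T, typename F>
    void for_each(rcu_slist_node<T>* head, F visit)
    {
        for (auto* n = head; n; n = n->next.load(std::memory_order_consume)) {
            visit(n->data);
        }
    }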
-
Avi Kivity authored
Add a callback to force all currently queued callbacks to execute. This is useful for global state changes, such as removing a loaded module, and for unit tests.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Avi Kivity authored
This allows shared objects that disable preemption to work correctly, since faulting in executable pages with preemption disabled is not supported.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Avi Kivity authored
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- Apr 08, 2014
-
Avi Kivity authored
When waitqueue::wake_all() wakes up waiting threads, it calls sched::thread::wake_lock() to enqueue those waiting threads on the mutex protecting the waitqueue, thus avoiding needless contention on the mutex. However, if a thread is already waking, we let it wake naturally and acquire the mutex itself. The problem is that the waitqueue code (wait_object<waitqueue>::poll()) examines the wait_record it sleeps on to see if it has woken, and if not, goes back to sleep. Since nothing in that thread-already-awake path clears the wait_record, that is what happens, and the thread stalls until a timeout occurs. Fix by clearing the wait record. As it is protected by the mutex, no extra synchronization is needed. Observed with iperf -P 64 against the guest. Likely triggered by net channels waking up the thread, and then, before it has a chance to wake up, a FIN packet arrives and is processed in the driver thread; so when the packets are consumed the thread is in the waking state.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
The lookup_opcode() function was incorrect: it mishandled DHCP_OPTION_PAD, which does not have a following length byte. Also, the while condition reads the 'op' value, which never changes; this may result in reads beyond the packet size. Since this function is unused, the best fix is to remove it.

Reviewed-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
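For contrast, a sketch of what a correct option walk looks like; find_option() is illustrative, not code from the tree:

    #include <cstdint>

    enum { DHCP_OPTION_PAD = 0, DHCP_OPTION_END = 255 };

    // PAD is a lone byte with no length field; every length read is
    // bounds-checked against the end of the packet.
    const uint8_t* find_option(const uint8_t* p, const uint8_t* end, uint8_t wanted)
    {
        while (p < end) {
            uint8_t op = *p;
            if (op == DHCP_OPTION_PAD) { ++p; continue; }
            if (op == DHCP_OPTION_END) { break; }
            if (p + 2 > end || p + 2 + p[1] > end) { break; } // truncated option
            if (op == wanted) { return p; }
            p += 2 + p[1]; // skip tag byte, length byte, and payload
        }
        return nullptr;
    }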
-
- Apr 07, 2014
-
Claudio Fontana authored
Lots of MMU-related changes require fixups for AArch64.

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- Apr 03, 2014
-
Nadav Har'El authored
For a long time we've had the bug summarized in issue #178, where very rarely but consistently, in various runs such as Cassandra, Netperf and tst-queue-mpsc.so, we saw OSv crashing because of some corruption in the timer list, such as arming an already armed timer, or canceling an already canceled timer. It turns out the problem was the schedule() function, which basically did cpu::current()->schedule(). The problem is that if we're unlucky enough, the thread can be migrated right after calling cpu::current(), but before the irq disable in schedule(), which causes us to do a rescheduling for one CPU on a different CPU - a big faux pas. This can cause us, for example, to mess with one CPU's preemption_timer from a different CPU, causing the timer-related races and crashes we've seen in issue #178. Clearly, we shouldn't have a *method* cpu->schedule() which can operate on any cpu at all. Rather, we should have only a *function* (class-static) cpu::schedule() which operates on the current cpu - and finds that current CPU within the IRQ lock, to ensure (among other things) that the thread cannot get migrated. Another benefit of this patch is that it actually simplifies the code, with one less function called "schedule". Fixes #178.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
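A sketch of the shape of the fix; irq_disable()/irq_enable() and reschedule_from_interrupt() are stand-ins for OSv's actual IRQ-lock and rescheduling machinery:

    void irq_disable();   // hypothetical helpers standing in for the
    void irq_enable();    // real irq_save_lock / WITH_LOCK machinery

    struct cpu {
        static cpu* current();
        void reschedule_from_interrupt();

        static void schedule()  // class-static: always acts on the current CPU
        {
            irq_disable();
            // the CPU is looked up only after interrupts are off, so no
            // migration can intervene between the lookup and the reschedule
            cpu::current()->reschedule_from_interrupt();
            irq_enable();
        }
    };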
-
- Apr 02, 2014
-
Claudio Fontana authored
include/api/x64/atomic.h is not used anywhere.

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
-
Claudio Fontana authored
The implementation includes only the few relocation types that are actually encountered during elf::get_init().

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
-
Claudio Fontana authored
Protect x64 machine code with #ifdef __x86_64__.

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
-
Claudio Fontana authored
The functions debug_early, debug_early_u64 and debug_early_entry can be used very early, before premain.

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
-
Claudio Fontana authored
Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
-
Gleb Natapov authored
This reverts commit 64889277. The structure is no longer used.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
This patch adds the write page cache implementation. On a read fault, pages are initially mapped directly from the ARC, but marked read-only in the page table. On a write fault, pages are copied into a small write page cache for shared mappings, or into anonymous pages for private mappings. Pages are removed from the write cache and written back to the file in FIFO order.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
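A sketch of the fault-handling policy described above; every helper here (arc_lookup, write_cache_copy, anon_copy, map_page) is a hypothetical stand-in for the real ARC and pagecache machinery:

    #include <cstdint>

    enum perm { perm_read = 1, perm_write = 2 };
    struct file_vma { bool shared() const; };

    void* arc_lookup(file_vma& vma, uintptr_t addr); // assumed ARC page lookup
    void* write_cache_copy(void* arc_page);          // FIFO write-back cache copy
    void* anon_copy(void* arc_page);                 // private COW copy
    void map_page(uintptr_t addr, void* page, int perm);

    void handle_file_fault(file_vma& vma, uintptr_t addr, bool write)
    {
        void* arc_page = arc_lookup(vma, addr);
        if (!write) {
            // read fault: map the ARC page read-only; a write faults again later
            map_page(addr, arc_page, perm_read);
        } else if (vma.shared()) {
            map_page(addr, write_cache_copy(arc_page), perm_read | perm_write);
        } else {
            map_page(addr, anon_copy(arc_page), perm_read | perm_write);
        }
    }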
-
Gleb Natapov authored
The file page allocator needs to know the fault type (read/write) and mapping type (shared/private) to handle page allocation correctly. It also needs a way to communicate to a caller that some pages need to be mapped read-only for COW. This patch adds the required functionality without using it yet.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
The page allocator may return a shared page that needs to be COWed on write access. There is currently no way to notify the page mapper that such a page should be mapped read-only. This patch allows the page allocator to control that.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Currently flags are saved for anon_vma, but not for file_vma. Examining those flags in gdb proved to be helpful. Remove _shared, since it is no longer needed: the information is in the flags.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Currently we track the mapping between ARC pages and virtual addresses, which requires us to walk the page table to get to a pte pointer when invalidation is required. Change that to track pointers to the pte directly.

Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
arch-mmu.hh currently depends on mmu.hh, which makes it impossible to include it in a header that mmu.hh itself includes. mmu.hh is a pretty heavy header containing many definitions and code, so move the bare definitions into a separate header file.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
- Apr 01, 2014
-
Tomasz Grabiec authored
This is a wrapper around timer_task which should be used if atomicity of callback tasks and timer operations is required. The class accepts an external lock to serialize all operations. It provides sufficient abstraction to replace callouts in the network stack. Unfortunately, it requires some cooperation from the callback code (see try_fire()). That's because I couldn't extract in_pcb lock acquisition out of the callback code in the TCP stack: there are other locks taken before it, and doing so _could_ result in lock order inversion problems and hence deadlocks. If we can prove these to be safe, then the API could be simplified. It may also be worthwhile to propagate the lock passed to serial_timer_task down to timer_task to save an extra CAS.
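A hedged usage sketch; the constructor and method signatures here are assumptions based on this description, not verified against the tree:

    #include <chrono>

    // assumed shape of the API: the callback must confirm with try_fire()
    // under the same external lock before doing its work
    mutex lock;
    serial_timer_task retransmit(lock, [&] (serial_timer_task& t) {
        WITH_LOCK(lock) {            // the same external serializing lock
            if (!t.try_fire()) {
                return;              // lost a race with cancel()/reschedule()
            }
            // ... the actual timer work, e.g. a TCP retransmit ...
        }
    });

    WITH_LOCK(lock) {
        retransmit.reschedule(std::chrono::milliseconds(200));
    }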
-
Tomasz Grabiec authored
The design behind timer_task

timer_task was designed to make cancel() and reschedule() scale well with the number of threads and CPUs in the system. These methods may be called frequently and from different CPUs. A task scheduled on one CPU may be rescheduled later from another CPU. To avoid expensive coordination between CPUs, a lockfree per-CPU worker was implemented.

Every CPU has a worker (async_worker) which has a task registry and a thread to execute them. Most of the worker's state may only be changed from the CPU on which it runs. When timer_task is rescheduled, it registers its percpu part in the current CPU's worker. When it is then rescheduled from another CPU, the previous registration is marked as invalid and a new percpu part is registered. When a percpu task fires, it checks whether it is the last registration - only then may it fire.

Because timer_task's state is scattered across CPUs, some extra housekeeping needs to be done before it can be destroyed. We need to make sure that no percpu task will try to access the timer_task object after it is destroyed. To ensure that, we walk the list of registrations of a given timer_task and atomically flip their state from ACTIVE to RELEASED. If that succeeds, the task is now revoked and the worker will not try to execute it. If it fails, the task is in the middle of firing and we need to wait for it to finish. When a per-CPU task is moved to the RELEASED state, it is appended to the worker's queue of released percpu tasks using a lockfree MPSC queue. These objects may later be reused for registrations.
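A sketch of that revocation step, simplified to a single registration; reg_state and try_revoke() are illustrative names, not the committed ones:

    #include <atomic>

    enum class reg_state { ACTIVE, FIRING, RELEASED };

    // Flip ACTIVE -> RELEASED; a failed CAS means the percpu task is firing
    // right now, and the destructor must wait for it to finish instead.
    bool try_revoke(std::atomic<reg_state>& s)
    {
        auto expected = reg_state::ACTIVE;
        return s.compare_exchange_strong(expected, reg_state::RELEASED);
    }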
-