- Apr 08, 2014
-
-
Avi Kivity authored
When waitqueue::wake_all() wakes up waiting threads, it calls sched::thread::wake_lock() to enqueue those waiting threads on the mutex protecting the waitqueue, thus avoiding needless contention on the mutex. However, if a thread is already waking, we let it wake naturally and acquire the mutex itself. The problem is that the waitqueue code (wait_object<waitqueue>::poll()) examines the wait_record it sleeps on to see if it has woken, and if not, goes back to sleep. Since nothing in the thread-already-awake path clears the wait_record, that is exactly what happens, and the thread stalls until a timeout occurs. Fix by clearing the wait record. As it is protected by the mutex, no extra synchronization is needed. Observed with iperf -P 64 against the guest. Likely triggered by net channels waking up the thread, and then, before it has a chance to wake up, a FIN packet arrives and is processed in the driver thread; so when the packets are consumed the thread is in the waking state. Reviewed-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
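A minimal sketch of the idea (illustrative types only, not the actual OSv wait_record/waitqueue API): poll() decides whether to go back to sleep by looking at the record, so the wake path must clear the record under the waitqueue's mutex even when the thread is already waking on its own.

    #include <mutex>

    struct thread;                         // stand-in for the waiting thread

    struct wait_record {
        thread* waiter = nullptr;          // non-null while the wait is pending
        void clear()       { waiter = nullptr; }   // mark as woken
        bool woken() const { return waiter == nullptr; }
    };

    // poll() in the waitqueue wait_object: if the record still looks
    // pending, the thread goes back to sleep.
    bool poll(const wait_record& wr) {
        return wr.woken();
    }

    // Wake path, holding the mutex that protects the waitqueue: clear the
    // record even when the thread is already in the "waking" state, so a
    // later poll() sees it as woken instead of sleeping until a timeout.
    void wake(wait_record& wr, std::mutex& wq_mutex, bool already_waking) {
        std::lock_guard<std::mutex> guard(wq_mutex);
        wr.clear();                        // the step the patch adds
        if (!already_waking) {
            // in the real code: enqueue the thread on the mutex (wake_lock)
        }
    }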
-
Tomasz Grabiec authored
The lookup_opcode() function is incorrect. It mishandles DHCP_OPTION_PAD, which is not followed by a length byte. Also, the while condition reads the 'op' value, which never changes; this may result in reads beyond the packet size. Since the function is unused, the best fix is to remove it. Reviewed-by:
Vlad Zolotarov <vladz@cloudius-systems.com> Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
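For reference, a bounds-checked walk over DHCP options would look roughly like the sketch below (generic illustration, not code added to OSv): PAD (0) and END (255) are single-byte options with no length field, and every access is checked against the packet length.

    #include <cstdint>
    #include <cstddef>

    constexpr uint8_t DHCP_OPTION_PAD = 0;
    constexpr uint8_t DHCP_OPTION_END = 255;

    // Returns a pointer to the start of the requested option, or nullptr.
    const uint8_t* find_option(const uint8_t* options, size_t len, uint8_t wanted)
    {
        size_t i = 0;
        while (i < len) {
            uint8_t op = options[i];
            if (op == DHCP_OPTION_END) {
                return nullptr;
            }
            if (op == DHCP_OPTION_PAD) {       // no length byte follows
                ++i;
                continue;
            }
            if (i + 1 >= len) {                // truncated option header
                return nullptr;
            }
            uint8_t optlen = options[i + 1];
            if (i + 2 + optlen > len) {        // data would overrun the packet
                return nullptr;
            }
            if (op == wanted) {
                return &options[i];
            }
            i += 2 + optlen;                   // skip code, length and data
        }
        return nullptr;
    }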
-
- Apr 07, 2014
-
-
Claudio Fontana authored
lots of mmu related changes require fixups for AArch64. Signed-off-by:
Claudio Fontana <claudio.fontana@huawei.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Apr 03, 2014
-
-
Nadav Har'El authored
For a long time we've had the bug summarized in issue #178, where very rarely but consistently, in various runs such as Cassandra, Netperf and tst-queue-mpsc.so, we saw OSv crashing because of some corruption in the timer list, such as arming an already armed timer, or canceling an already canceled timer. It turns out the problem was the schedule() function, which basically did cpu::current()->schedule(). The problem is that if we're unlucky enough, the thread can be migrated right after calling cpu::current(), but before the irq disable in schedule(), which causes us to perform a reschedule for one CPU while running on a different CPU, which is a big faux pas. This can cause us, for example, to mess with one CPU's preemption_timer from a different CPU, causing the timer-related races and crashes we've seen in issue #178. Clearly, we shouldn't at all have a *method* cpu->schedule() which can operate on any cpu. Rather, we should have only a *function* (class-static) cpu::schedule() which operates on the current cpu - and finds that current CPU within the IRQ lock to ensure (among other things) the thread cannot get migrated. Another benefit of this patch is that it actually simplifies the code, with one less function called "schedule". Fixes #178. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
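A simplified sketch of the race (placeholder names, not the actual OSv scheduler): the bug and the fix differ only in where the current CPU is sampled relative to disabling interrupts.

    #include <cstdio>

    struct cpu_t {
        int id;
        void reschedule_locally() { std::printf("reschedule on cpu %d\n", id); }
    };

    // Placeholder for a per-CPU pointer; in a real kernel this reads a
    // per-CPU register, and the result changes if the thread migrates.
    thread_local cpu_t this_cpu{0};
    cpu_t* current_cpu() { return &this_cpu; }

    // Placeholder RAII interrupt-off section.
    struct irq_disable_guard {
        irq_disable_guard()  { /* disable interrupts */ }
        ~irq_disable_guard() { /* restore interrupts */ }
    };

    // Buggy shape: the thread can migrate between current_cpu() and the
    // irq-off section, so it may reschedule CPU A's run queue from CPU B.
    void schedule_buggy()
    {
        cpu_t* c = current_cpu();     // may point at the wrong CPU a moment later
        irq_disable_guard irq;
        c->reschedule_locally();
    }

    // Fixed shape (class-static style): sample the current CPU only after
    // interrupts are disabled, so migration can no longer slip in between.
    void schedule_fixed()
    {
        irq_disable_guard irq;
        current_cpu()->reschedule_locally();
    }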
-
- Apr 02, 2014
-
-
Claudio Fontana authored
include/api/x64/atomic.h is not used anywhere. Signed-off-by:
Claudio Fontana <claudio.fontana@huawei.com>
-
Claudio Fontana authored
The implementation covers only the few relocations (rels) that are actually encountered during elf::get_init(). Signed-off-by:
Claudio Fontana <claudio.fontana@huawei.com>
-
Claudio Fontana authored
protect x64 machine code in #ifdef __x86_64__ Signed-off-by:
Claudio Fontana <claudio.fontana@huawei.com>
-
Claudio Fontana authored
the functions debug_early, debug_early_u64 and debug_early_entry can be used very early, before premain. Signed-off-by:
Claudio Fontana <claudio.fontana@huawei.com>
-
Claudio Fontana authored
Signed-off-by:
Claudio Fontana <claudio.fontana@huawei.com>
-
Gleb Natapov authored
This reverts commit 64889277. The structure is no longer used. Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
This patch adds a write page cache implementation. On a read fault, pages are initially mapped from the ARC directly, but are marked RO in the page table. On a write fault, pages are copied into a small write page cache for shared mappings, or into anonymous pages for private mappings. Pages are removed from the write cache and written back to the file in FIFO order. Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com>
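A greatly simplified sketch of the shape described above (hypothetical types, not the OSv implementation): on a write fault to a shared file mapping, the read-only ARC-backed page is copied into a cache page, and when the cache is full the oldest page is written back first.

    #include <cstdint>
    #include <cstddef>
    #include <deque>
    #include <vector>

    constexpr size_t page_size = 4096;

    struct cached_page {
        uint64_t file_offset;
        std::vector<uint8_t> data;        // private, writable copy of the page
    };

    class write_page_cache {
    public:
        explicit write_page_cache(size_t max_pages) : _max(max_pages) {}

        // Called on a write fault against a shared mapping: copy the
        // read-only source page and hand back a writable copy.
        cached_page& push(uint64_t file_offset, const uint8_t* ro_page) {
            if (_fifo.size() == _max) {
                writeback(_fifo.front());  // FIFO order: oldest page first
                _fifo.pop_front();
            }
            _fifo.push_back({file_offset,
                             std::vector<uint8_t>(ro_page, ro_page + page_size)});
            return _fifo.back();
        }

    private:
        void writeback(const cached_page& p) {
            // placeholder: write p.data back to the file at p.file_offset
            (void)p;
        }
        size_t _max;
        std::deque<cached_page> _fifo;
    };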
-
Gleb Natapov authored
The file page allocator needs to know the fault type (read/write) and the mapping type (shared/private) to handle page allocation correctly. It also needs a way to communicate to a caller that some pages need to be mapped RO for COW. This patch adds the required functionality without using it yet. Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com>
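A sketch of the kind of interface this implies (hypothetical names, not the actual OSv API): the allocator sees both the fault type and the mapping type, and returns whether the page may be mapped writable.

    #include <cstdint>

    enum class fault_type   { read, write };
    enum class mapping_type { shared, private_ };

    struct page_result {
        void* page;      // page to map (placeholders below return nullptr)
        bool  map_rw;    // false => caller must map the page read-only (COW)
    };

    page_result get_file_page(uint64_t offset, fault_type f, mapping_type m)
    {
        (void)offset; (void)m;
        // A read fault can map the shared (ARC) page directly, but only
        // read-only, so a later write still faults and can be COWed.
        if (f == fault_type::read) {
            return { nullptr, false };
        }
        // A write fault gets a private copy (anonymous page for private
        // mappings, write-cache page for shared ones) and may be mapped RW.
        return { nullptr, true };
    }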
-
Gleb Natapov authored
The page allocator may return a shared page that needs to be COWed on write access. There is currently no way to notify the page mapper that such a page should be mapped read-only. This patch allows the page allocator to control that. Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Currently flags are saved for anon_vma, but not for file_vma. Examining those flags in gdb proved to be helpful. Remove _shared since it is no longer needed; the information is in the flags. Reviewed-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Currently we track the mapping between ARC pages and virtual addresses, which requires us to walk the page table to get to a pte pointer when invalidation is required. Change that to track pointers to the pte directly. Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com>
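The gain can be pictured with a small sketch (hypothetical types, not the OSv code): once pte pointers are stored, invalidation is a walk over stored pointers rather than a page table walk per virtual address.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    using pte_t = uint64_t;               // placeholder page-table-entry type

    class arc_mapping_tracker {
    public:
        void track(const void* arc_buf, pte_t* ptep) {
            _ptes[arc_buf].push_back(ptep);
        }

        // Invalidate every mapping of the buffer without walking the page
        // table's intermediate levels: we already hold direct pte pointers.
        void invalidate(const void* arc_buf) {
            auto it = _ptes.find(arc_buf);
            if (it == _ptes.end()) return;
            for (pte_t* ptep : it->second) {
                *ptep = 0;                // clear the entry (then flush the TLB)
            }
            _ptes.erase(it);
        }

    private:
        std::unordered_map<const void*, std::vector<pte_t*>> _ptes;
    };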
-
Gleb Natapov authored
arch-mmu.hh currently depends on mmu.hh, which makes it impossible to include in a header that mmu.hh itself includes. mmu.hh is a pretty heavy header containing many definitions and code, so move the bare definitions into a separate header file. Reviewed-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com>
-
- Apr 01, 2014
-
-
Tomasz Grabiec authored
This is a wrapper around timer_task which should be used if atomicity of callback tasks and timer operations is required. The class accepts an external lock to serialize all operations. It provides sufficient abstraction to replace callouts in the network stack. Unfortunately, it requires some cooperation from the callback code (see try_fire()). That's because I couldn't extract the in_pcb lock acquisition out of the callback code in the TCP stack: there are other locks taken before it, and doing so _could_ result in lock order inversion problems and hence deadlocks. If we can prove these to be safe, the API could be simplified. It may also be worthwhile to propagate the lock passed to serial_timer_task down to timer_task to save an extra CAS.
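A hypothetical sketch of the serialization idea (not the real serial_timer_task API): cancel()/reschedule() and the firing callback all go through one external lock, and the callback calls try_fire() under that lock to find out whether the timer is still current before doing any work.

    #include <mutex>

    class serial_timer_task_sketch {
    public:
        explicit serial_timer_task_sketch(std::mutex& lock) : _lock(lock) {}

        void reschedule() {
            std::lock_guard<std::mutex> g(_lock);
            ++_generation;             // invalidates any firing already queued
            _armed = true;
        }

        void cancel() {
            std::lock_guard<std::mutex> g(_lock);
            _armed = false;
        }

        // Called by the callback, which already holds the external lock;
        // returns false if the task was cancelled or rescheduled since the
        // firing was queued, in which case the callback should just return.
        bool try_fire(unsigned generation_at_queue_time) const {
            return _armed && generation_at_queue_time == _generation;
        }

        // Snapshot taken (under the lock) when the firing is queued.
        unsigned generation() const { return _generation; }

    private:
        std::mutex& _lock;
        bool _armed = false;
        unsigned _generation = 0;
    };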
-
Tomasz Grabiec authored
timer_task was designed to make cancel() and reschedule() scale well with the number of threads and CPUs in the system. These methods may be called frequently and from different CPUs. A task scheduled on one CPU may later be rescheduled from another CPU. To avoid expensive coordination between CPUs, a lock-free per-CPU worker was implemented. Every CPU has a worker (async_worker) which has a task registry and a thread to execute the tasks. Most of the worker's state may only be changed from the CPU on which it runs. When a timer_task is rescheduled, it registers its per-CPU part with the current CPU's worker. When it is then rescheduled from another CPU, the previous registration is marked as invalid and a new per-CPU part is registered. When a per-CPU task fires, it checks whether it is the last registration - only then may it fire. Because a timer_task's state is scattered across CPUs, some extra housekeeping needs to be done before it can be destroyed. We need to make sure that no per-CPU task will try to access the timer_task object after it is destroyed. To ensure that, we walk the list of registrations of the given timer_task and atomically flip their state from ACTIVE to RELEASED. If that succeeds, the task is now revoked and the worker will not try to execute it. If it fails, the task is in the middle of firing and we need to wait for it to finish. When a per-CPU task is moved to the RELEASED state it is appended to the worker's queue of released per-CPU tasks using a lock-free MPSC queue. These objects may later be reused for registrations.
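A hypothetical sketch of the revocation handshake (not the OSv async worker itself): a per-CPU registration can only fire if it is still ACTIVE; releasing the task tries to flip ACTIVE to RELEASED, and if the CAS fails the registration is already firing and must be waited for.

    #include <atomic>

    enum class reg_state { ACTIVE, FIRING, RELEASED };

    struct percpu_registration {
        std::atomic<reg_state> state{reg_state::ACTIVE};

        // Called from the owning CPU's worker when the timer expires.
        bool begin_fire() {
            reg_state expected = reg_state::ACTIVE;
            return state.compare_exchange_strong(expected, reg_state::FIRING);
        }

        // Called when the timer_task is destroyed or rescheduled away.
        // Returns true if the registration was revoked before it could fire.
        bool try_release() {
            reg_state expected = reg_state::ACTIVE;
            if (state.compare_exchange_strong(expected, reg_state::RELEASED)) {
                return true;           // worker will skip this registration
            }
            return false;              // it is firing; caller must wait for it
        }
    };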
-
Tomasz Grabiec authored
This can be useful when there is a need to perform operations on per-CPU structure(s) which must all execute on the same CPU, but there is code in between which may sleep (e.g. malloc). For example, this can be used to ensure that a dynamically allocated object is always freed on the same CPU on which it was allocated:

    WITH_LOCK(migration_lock) {
        auto _owner = *percpu_owner;
        auto x = new X();
        _owner->enqueue(x);
    }
-
Tomasz Grabiec authored
It is needed by the new async framework.
-
Glauber Costa authored
One of the main problems we face with the balloon is the fact that by the time the JVM needs more memory, it may be too late. We have many mechanisms in place to try to mitigate this by trying to release memory back to the JVM before it even notices it needs it. This is yet another of them: a soft maximum is defined. Unlike a hard maximum, OSv can create as many balloons as it sees fit. However, if we ever go beyond this maximum, we start putting pressure through the reclaimer in order to react more quickly if pressure builds up from the JVM side. Also, when we are over this limit and need to release a balloon, we keep releasing until we are back in a safe zone, regardless of how many we were asked for. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
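A sketch of the policy reduced to its two decisions (hypothetical names, not the patch itself): crossing the soft maximum triggers reclaimer pressure, and a release request while over the soft maximum is enlarged until we are back in the safe zone.

    #include <cstddef>

    struct balloon_policy_sketch {
        size_t soft_max;          // balloons beyond this apply extra pressure
        size_t nr_balloons = 0;

        bool over_soft_max() const { return nr_balloons > soft_max; }

        // Ballooning is never refused (no hard maximum), but crossing the
        // soft maximum starts applying pressure through the reclaimer.
        void on_balloon_created(void (*apply_pressure)()) {
            ++nr_balloons;
            if (over_soft_max()) {
                apply_pressure();
            }
        }

        // When asked to release while over the soft maximum, keep releasing
        // until we are back in the safe zone, regardless of the request size.
        size_t balloons_to_release(size_t requested) const {
            if (!over_soft_max()) {
                return requested;
            }
            size_t to_safe_zone = nr_balloons - soft_max;
            return to_safe_zone > requested ? to_safe_zone : requested;
        }
    };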
-
Glauber Costa authored
In case we are throttling, try to serve memory from the non-aggressive shrinker before going all-in. In particular, the current waiter may be allocating memory on behalf of the balloon (for instance, its stack), so it is a bad idea to block. We may have more memory soon enough. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
-
Glauber Costa authored
When we create a new balloon, memory will soon be available. However, until we actually create it, high-rate ARC users will continue to fill the ARC at will, which can easily lead to memory exhaustion. Blocking allocations until the balloon is created is not an option: in order to balloon, we need to allocate memory, and the source of this memory is quite unpredictable: backing heap, stack, and even files, if an object destruction needs to flush something out. Luckily, the ARC has a mechanism to avoid growing the cache. If the function arc_evict_needed() returns true, it will free buffers to make room for new ones instead of growing the cache. The most natural way of making that function return true would be to implement the function vm_paging_needed(), which is already there in generic code and would require zero ARC changes. However, doing this would imply that arc_reclaim_needed() would now return true as well. This would not only evict some buffers, but actually put the ARC in aggressive mode (especially because the allocations will keep coming). The solution I've chosen is to implement a new function in the ARC - wrapped in an ifdef so we don't lose track of it - that allows us to communicate that we should throttle new buffers due to a VM request (in our case, because we are waiting for the balloon). Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
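A hypothetical sketch of the idea (not the ZFS code and not the patch itself): a flag the VM layer sets while the balloon is being created, which the ARC's "evict instead of grow" decision consults in addition to its usual criteria, without tripping the aggressive reclaim path.

    #include <atomic>

    std::atomic<bool> vm_throttle_arc{false};   // hypothetical OSv-side flag

    // Placeholder for the ARC's existing growth/eviction criteria.
    bool arc_cache_over_target() { return false; }

    // Shape of the check: evict (rather than grow) while the VM has asked
    // for throttling, e.g. while a balloon is being inflated.
    bool arc_should_evict_sketch()
    {
        return vm_throttle_arc.load(std::memory_order_relaxed)
            || arc_cache_over_target();
    }

    // The balloon path would set vm_throttle_arc before starting to inflate
    // and clear it once the balloon has been created.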
-
Glauber Costa authored
There are some shrinkers, like the ARC, that may become very aggressive when called more than once in a row. When we have no waiters, we can be more relaxed and just go for best-effort shrinking. This patch introduces soft-mode shrinking that does just that. In the future, we may introduce another threshold, smaller than the minimum, at which we go into hard mode even before we have waiters. For now, this will do. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
-
Glauber Costa authored
The balloon is no longer synchronous, and now lives in its own thread. We used to have the requirement that the reclaimer needed a full pthread API, but that is no longer the case. Make it a normal thread now. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
-
Glauber Costa authored
Instead of just looking at which region our start address falls in, we can iterate through all the vmas that are affected by installing the new vma, and close the region earlier. This design has one major advantage: split() will never happen, and if that is the case, the balloon becomes a lot more predictable - which at this point is a welcome dose of sanity in the whole thing. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
-
Glauber Costa authored
Synchronously ballooning unfortunately won't work: as usual, the garbage collector can run in order to release memory for the balloon, and then we will block. Since we can't make the balloon worker synchronous, we set its priority so that Java threads are guaranteed not to run before it. This will help in really fast allocation situations. During those, between the notification and the time this thread gets to run, the JVM may very well have exhausted all the remaining memory. Notice that this priority value is guaranteed to be lower than that of any JVM-created thread, since for JVM threads the priority is only set through the setpriority interface, which yields 0.2 for its maximum priority.
-
Glauber Costa authored
The current ballooning mechanism has a very serious design flaw, which became obvious once testing with it advanced: if we wait until memory is short to release memory, the JVM itself may block. In particular, the blocking thread can be running the garbage collector. More likely, it can start running the GC to open up space for an object as big as the balloon. I don't believe this is something we should fix with a quick hack, and I am therefore pushing for a redesign: the solution is to use a reservation system, which balloons beforehand, when the amount of reserved memory goes too low. This system works by noting that when we balloon, we are effectively limiting the maximum amount of memory Java will ever use. We already have accounts of both the amount of free memory and the amount of memory allocated to the heap. With those figures, we can easily find out when the amount of used memory plus a full heap would exceed the amount of available memory. When that happens, we balloon. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
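The reservation check reduces to a small piece of arithmetic; the sketch below uses hypothetical names and is only an illustration of the rule stated above.

    #include <cstddef>

    struct jvm_balloon_reservation_sketch {
        size_t total_memory;      // memory available to the guest
        size_t heap_size;         // maximum Java heap size
        size_t heap_used;         // heap currently in use

        // How much the heap could still consume in the worst case.
        size_t heap_headroom() const { return heap_size - heap_used; }

        // If everything currently used plus a fully grown heap would not
        // fit, inflate a balloon now instead of waiting for memory to run
        // short while the GC is blocked.
        bool should_balloon(size_t used_memory) const {
            return used_memory + heap_headroom() > total_memory;
        }
    };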
-
Glauber Costa authored
There are situations in which some garbage collectors will copy a large object in parallel, using multiple threads, each being responsible for a part of the object. In particular, that easily happens under high memory pressure with the Parallel and ParallelOld collectors. If such a copy happens, the simple balloon move algorithm will break. However, because offset 'x' in the source will always be copied to offset 'x' in the destination, we can still calculate the final destination object. The problem is that we cannot open the new balloon yet. Since the JVM believes it is copying only a part of the object, the destination may (and usually will) contain valid objects, which need to be moved somewhere else before we can install our object there. Also, we can't close the object fully when someone writes to it: because part of the object is now already freed, the JVM may and will go ahead and copy another object to this location. This is a quite tricky problem, but the patch below solves it by keeping the source balloon open and counting the number of partial bytes copied. It makes the assumption that the JVM always needs to copy the whole object to a different location, meaning that when it does, the balloon can be closed. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
-
Glauber Costa authored
Right now, we are starting the JVM with the maximum amount of memory we can. This is, however, terribly bad. The reason is that if we go into reclaim during a heap allocation, this will easily crash the system. That is all but guaranteed if the allocation turns out to be done on behalf of the GC: in that case, no other Java threads will make progress, meaning the balloon cannot be activated. Even though the balloon is no longer part of the shrinker infrastructure, it may still be a problem, because we may need to use it to deflate memory. The best solution is to just limit heap usage to 90% of the available memory, which is the reclaim criterion. Also, it is pointless to activate the balloon under very low memory conditions, so if our total memory is less than 1 GB, just leave it alone. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
-
Glauber Costa authored
After Avi's last comments, here is a new take on the reclaimer's wake queue. It does not abuse semaphores in any way, so there is no need to extend them. (After this patch is applied, the semaphore extension patch can be reverted.) In this new approach, we not only check free memory as a scalar, but instead traverse the remaining page ranges (notice that in case of fragmentation this can be a *large* set) and try to find a range that would allow our allocation to happen. Also, because we implement both the wait and the "post" methods, the waking of the reclaimer happens atomically, and the problem that Tomek described (wake before waiters is set) no longer exists. Finally, another advantage of this approach is that we no longer need to keep track of the amount of memory we have freed when making OOM decisions: if we can wake someone up, we can continue; otherwise, we OOM. This will be especially helpful for the FS shrinkers, since they may not always be able to synchronously return a meaningful number (due to, for instance, writeback). Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
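A hypothetical sketch of the wait/post pairing (not OSv's waitqueue or page-range code): a waiter is only woken if some free range can actually satisfy its request, and the check and the wake happen under one lock, so the "wake before the waiter is registered" race cannot occur.

    #include <condition_variable>
    #include <cstddef>
    #include <list>
    #include <mutex>

    struct free_range { size_t bytes; };

    class reclaim_waiter_sketch {
    public:
        // Called by the shrinker side after freeing memory ("post").
        void post(const std::list<free_range>& ranges) {
            std::lock_guard<std::mutex> g(_mtx);
            if (_waiting && satisfiable(ranges, _wanted)) {
                _waiting = false;
                _cv.notify_one();
            }
        }

        // Called by an allocating thread that could not be served ("wait").
        void wait(size_t wanted, const std::list<free_range>& ranges) {
            std::unique_lock<std::mutex> g(_mtx);
            if (satisfiable(ranges, wanted)) {
                return;                 // a suitable range already exists
            }
            _wanted = wanted;
            _waiting = true;
            _cv.wait(g, [&] { return !_waiting; });
        }

    private:
        static bool satisfiable(const std::list<free_range>& ranges, size_t wanted) {
            for (const auto& r : ranges) {     // may be a large set if fragmented
                if (r.bytes >= wanted) return true;
            }
            return false;
        }
        std::mutex _mtx;
        std::condition_variable _cv;
        bool _waiting = false;
        size_t _wanted = 0;
    };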
-
Glauber Costa authored
The old explicit pattern would break if the vma that holds the balloon was already destroyed. Use a shared pointer to avoid the complications arising from that. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
-
Glauber Costa authored
While working on the reclaim policy, I came across an unhandled balloon case that seems to happen under certain low memory conditions: the balloon is moved to the same address it was at before. While this seems weird at first, it is perfectly plausible if we consider that the space before the balloon can be almost a page in size. If the JVM is only shifting the elements slightly, they will end up in the same ballooned region. The problem is that evacuate() is called first - the vma is still in place, so it will be found, and it will call delete. When the JVM fault handler executes, it will also call delete and we will have a double free. This patch avoids the delete, but the problem is actually more intricate than that: if we just follow the normal path, we will remove the newly created address from the candidates list right after we insert it (because the addresses are the same). Also, if the address is *not* the same, but falls in the same region (this is theoretical), evacuate will be called and split the region in two. Our code handling finish_move would have to be significantly complicated to deal with that. This patch also handles this theoretical case by making sure that the whole region is always disposed of at once (by finishing the old balloon manually before evacuate is called). Why am I using a new mmu flag? When we search for the old vma, we need to identify which type of vma we are dealing with: an anonymous or a ballooned area. I use an extra flag to avoid a type field or anything like that. Couldn't we just take the vma out of the list before we delete it? In that case the region would be empty, and we could assume it was a ballooned region. That would also be a solution, but because we are copying the flags from the previous vma, we would have to fall back to a default flag set. The solution of marking the JVM vma and leaving it in place allows the flag information to be carried over. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
-
- Mar 27, 2014
-
-
Glauber Costa authored
Gleb has noticed that ARC buffers can be unshared too early. This happens because we call the UNMAP operation on every put(). That is certainly not what we want, since the buffer only has to be unshared when the last reference is gone. Design decisions: 1) We obviously can't use the ARC's natural reference count for that, since bumping it would make the buffer unevictable. 2) We could modify the arc_buf structure itself to add another refcnt (minimum 4 bytes). However, I am trying to keep core ZFS modifications to a minimum, and only for places where they are totally unavoidable. Therefore, the solution is to add another hash, which hashes the whole buffer instead of the physaddr like the one we have currently. In terms of memory usage, it adds only 8 bytes per buffer (each buffer being roughly 128k), which makes for a memory usage of 64k per mapped GB compared to the arc refcount solution. This is a good trade-off. I am also avoiding adding a new vop_map/unmap style operation just to query the buffer address from its file attributes (needed for the put side). Instead, I adopt the convention that an empty iovec means query, and a filled iovec means unshare. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
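A hypothetical sketch of the side reference count (not the actual hash added to OSv): each mapping of a buffer bumps a count keyed by the buffer, and only the put() that drops the count to zero actually unshares it.

    #include <cstddef>
    #include <mutex>
    #include <unordered_map>

    class arc_share_count_sketch {
    public:
        void get(const void* buf) {
            std::lock_guard<std::mutex> g(_mtx);
            ++_count[buf];
        }

        // Returns true when this was the last reference, i.e. when the
        // caller should actually unshare (unmap) the buffer.
        bool put(const void* buf) {
            std::lock_guard<std::mutex> g(_mtx);
            auto it = _count.find(buf);
            if (it == _count.end()) return false;
            if (--it->second == 0) {
                _count.erase(it);
                return true;
            }
            return false;
        }

    private:
        std::mutex _mtx;
        std::unordered_map<const void*, size_t> _count;
    };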
-
- Mar 24, 2014
-
-
Takuya ASADA authored
Signed-off-by:
Takuya ASADA <syuu@cloudius-systems.com> [ penberg: fix formatting ] Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Gleb Natapov authored
Rescheduling in a nested exception is not supported since the nested exception stack is per-CPU, but without the assert such a reschedule causes stack corruption that is hard to debug. Reviewed-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Mar 21, 2014
-
-
Glauber Costa authored
Gleb has noticed a problem with the buffer mapping logic that would cause unrelated mappings adjacent to the file mapping to be invalidated as well. This is because we were invalidating an area as large as the buffer. If the mapping, for instance, ends in the middle of the buffer, we would also invalidate adjacent regions. The solution is to change the code to handle per-page, not per-buffer, mappings. We will likely need something like that anyway for COW maps, so let's just start handling it now. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
-
Glauber Costa authored
This case is not supposed to be terribly common, but we have to handle it anyway. This implementation is slightly naive, but it should be okay if the number of mappings is small. We should revisit it if this number starts growing too much. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
-
Glauber Costa authored
Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
-
Glauber Costa authored
To better deal with shared buffers, we should expand the get_page/put_page interface. Three changes are needed: first, the start of the virtual address is passed to both. Together with the offset, it can be used to uniquely identify the virtual address being mapped. That will be used to manage the mapping list. Regarding the offset itself, there is a need to also pass the file offset downwards, the one specified by the mapping. Reading that from the file pointer is not enough, because not only is it not what the user asked for, it can also change during the map's lifetime. It should also be a separate parameter from the in-mapping offset, not summed together with it; doing otherwise would make it impossible for the mapping function to differentiate between those two concepts. Also, put_page is called from the free operation, but does not receive the actual address from it. Knowing the physical address will be important to tear down our mapping list. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
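A sketch of what such an expanded interface could look like (hypothetical signatures, not the actual OSv operations): the mapping's start address, the in-mapping offset, and the file offset given at map time are all kept separate, and put_page also receives the page so the mapping list can be torn down.

    #include <cstdint>
    #include <cstddef>

    class file_page_ops_sketch {
    public:
        // virt_start + offset identifies the virtual address being mapped;
        // file_offset is what the user asked for at map time and may differ
        // from the file pointer's current position.
        virtual void* get_page(void* virt_start, uintptr_t offset,
                               uint64_t file_offset) = 0;

        // put_page also receives the page so the mapping list for the
        // buffer can be torn down when the mapping goes away.
        virtual void put_page(void* virt_start, uintptr_t offset,
                              uint64_t file_offset, void* page) = 0;

        virtual ~file_page_ops_sketch() = default;
    };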
-