- Feb 12, 2014
-
-
Asias He authored
MSI is now supported in the PCI layer. Enable MSI support in interrupt_manager. Signed-off-by:
Asias He <asias@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Feb 11, 2014
-
-
Nadav Har'El authored
This patch adds support for epoll()'s edge-triggered mode, EPOLLET. Fixes #188.

As explained in issue #188, Boost's asio uses EPOLLET heavily, and we use that library in our management http server, and also in our image creation tool (cpiod.so). By ignoring EPOLLET, as we did until now, the code worked, but unnecessarily wasted CPU, as epoll_wait() always returned immediately instead of waiting for a new event.

This patch works within the confines of our existing poll mechanisms - where epoll() calls poll(). We do not change this in this patch, and it should be changed in the future (see issue #17).

In this patch we add to each struct file a field "poll_wake_count", which, as its name suggests, counts the number of poll_wake()s done on this file. Additionally, epoll remembers the last value it saw of this counter, so that in poll_scan(), if we see that an fp (polled with EPOLLET) has a counter unchanged from last time, we do not return readiness on this fp, regardless of whether or not it has readable data.

We have a complication with EPOLLET on sockets. These have an "SB_SEL" optimization, which avoids calling poll_wake() when it thinks the new data is not interesting because the old data was not yet consumed, and also avoids calling poll_wake() if fp->poll() was not previously done. This optimization is counter-productive for EPOLLET (and causes missed wakeups), so we need to work around it in the EPOLLET case.

This patch also adds a test for the EPOLLET case in tst-epoll.cc. The test runs on both OSv and Linux, and confirms that in the tested scenarios, Linux and OSv behave the same, including even one shared false positive: when epoll_wait() tells us there is data in a pipe, and we don't read it, but then more data comes on the pipe, epoll_wait() will again return a new event, even though this is not really an edge event (the pipe didn't change from empty to not-empty, as it was previously not-empty as well).

Concluding remarks: The primary goal of this implementation is to stop EPOLLET epoll_wait() from returning immediately even though nothing has happened on the file. That is what caused the 100% CPU use before this patch. That being said, the goal of this patch is NOT to avoid all false positives or unnecessary wakeups; when events do occur on the file, we may be doing a few more wakeups than strictly necessary. I think this is acceptable (our epoll() has worse problems), but for posterity, I want to explain: I already mentioned above one false positive that also happens on Linux. Another false-positive wakeup that remains is in one of EPOLLET's classic use cases: consider several threads sleeping on epoll() on the same socket (e.g., a TCP listening socket, or a UDP socket). When one packet arrives, normal level-triggered epoll() will wake all the threads, but only one will read the packet and the rest will find they have nothing to read. With edge-triggered epoll, only one thread should be woken and the rest would not. But in our implementation, poll_wake() wakes up *all* the pollers on this file, so we cannot currently support this optimization. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
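A minimal sketch of the counter-based edge detection described above. The types and names here (file, epoll_entry, scan_entry) are simplified stand-ins, not OSv's actual structures:

```cpp
#include <atomic>

// Hypothetical, simplified stand-ins for OSv's file and epoll structures.
struct file {
    std::atomic<unsigned> poll_wake_count{0};  // incremented by every poll_wake()
    int ready_events = 0;                      // level-triggered readiness bits
};

struct epoll_entry {
    file* fp;
    bool edge_triggered;       // EPOLLET was requested for this fd
    unsigned last_seen = 0;    // counter value when we last reported events
};

// Sketch of the poll_scan() decision: with EPOLLET, report readiness only if
// a new poll_wake() has happened since the last report.
int scan_entry(epoll_entry& e) {
    int events = e.fp->ready_events;
    if (e.edge_triggered) {
        unsigned now = e.fp->poll_wake_count.load(std::memory_order_acquire);
        if (now == e.last_seen)
            return 0;          // no edge since last time, even if data remains
        e.last_seen = now;
    }
    return events;
}
```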
-
Vlad Zolotarov authored
Instead of binding all msix interrupts to cpu 0, have them chase the interrupt service routine thread and pin themselves to the same cpu. This patch is based on a patch from Avi Kivity <avi@cloudius-systems.com> and uses some ideas from Nadav Har'El <nyh@cloudius-systems.com>. It improves the performance of the single-thread Rx netperf test by 16%: before - 25694 Mbps, after - 29875 Mbps.

New in V2:
- Dropped the functor class - use a lambda instead.
- Fixed the race in the waking flow.
- Added some comments.
- Added the performance numbers to the patch description.

Signed-off-by:
Vlad Zolotarov <vladz@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
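A rough sketch of the "chasing" idea, under the assumption that each vector can be rebound on delivery. All names here (cpu, isr_thread, msix_vector, deliver) are hypothetical stand-ins, not OSv's real interrupt or scheduler interfaces:

```cpp
#include <atomic>

// Hypothetical, simplified stand-ins; OSv's real msix/sched interfaces differ.
struct cpu { int id; };

struct isr_thread {
    std::atomic<cpu*> current_cpu{nullptr};   // where the ISR thread last ran
    void wake() { /* wake the interrupt service routine thread */ }
};

struct msix_vector {
    std::atomic<cpu*> bound_cpu{nullptr};
    void set_affinity(cpu*) { /* reprogram the vector's destination CPU */ }
};

// Instead of leaving every vector bound to cpu 0, each delivery checks where
// its service thread runs and, if the vector is bound elsewhere, migrates
// (pins) the vector to that cpu before waking the thread.
inline void deliver(msix_vector& v, isr_thread& t) {
    cpu* thread_cpu = t.current_cpu.load(std::memory_order_relaxed);
    if (thread_cpu && thread_cpu != v.bound_cpu.load(std::memory_order_relaxed)) {
        v.set_affinity(thread_cpu);
        v.bound_cpu.store(thread_cpu, std::memory_order_relaxed);
    }
    t.wake();
}
```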
-
Avi Kivity authored
Deadlock between two lock-acquisition paths:

Path 1: poll() takes the file lock; file::poll_install() then takes the socket lock.
Path 2: sowakeup() (holding the socket lock) calls so_wake_poll(), which takes the file lock.

Fix by running poll_install() outside the file lock (which isn't really needed). Reviewed-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
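A minimal, self-contained sketch of the lock-order reversal and the fix, using plain std::mutex stand-ins rather than OSv's file/socket locks:

```cpp
#include <mutex>

// Hypothetical locks illustrating the two paths in the commit message.
std::mutex file_lock;
std::mutex socket_lock;

void poll_install() {                       // needs the socket lock internally
    std::lock_guard<std::mutex> g(socket_lock);
    // register the poll request with the socket...
}

// Before: poll() held the file lock across poll_install(), acquiring
// file_lock -> socket_lock, while the wakeup path acquires
// socket_lock -> file_lock - a classic lock-order reversal.
void poll_fixed() {
    poll_install();                         // now runs outside the file lock
    std::lock_guard<std::mutex> g(file_lock);
    // scan the polled fds...
}
```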
-
Raphael S. Carvalho authored
Found the problem while running tst-resolve.so; here is the output:

Success: nonexistant = 0
Failed: debug = -443987883
Failed: condvar_wait = -443987883
The time: 1392070630
2 failures.

Bisect pointed to commit 1dc81fe5. After understanding the actual purpose of the changes introduced by this commit, I figured out that program::lookup simply lacks a return when the target symbol is found in the underlying module. Reviewed-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Raphael S. Carvalho <raphaelsc@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
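An illustrative sketch of this class of bug (not OSv's actual program::lookup code): a lookup loop that finds the symbol but is missing the early return:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an ELF object with a symbol table.
struct module {
    std::map<std::string, void*> symbols;
    void* lookup(const std::string& name) const {
        auto it = symbols.find(name);
        return it == symbols.end() ? nullptr : it->second;
    }
};

// Before the fix, the symbol was found in the underlying module but the
// function fell through without returning it, so the caller saw a failure
// even though the lookup had succeeded.
void* program_lookup(const std::vector<module*>& modules, const std::string& name) {
    for (auto m : modules) {
        if (void* sym = m->lookup(name)) {
            return sym;   // the missing early return the commit adds
        }
    }
    return nullptr;
}
```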
-
- Feb 10, 2014
-
-
Avi Kivity authored
The scenario (run elf image -> demand page fault -> allocation -> leak detector tracking -> backtrace() -> access to dwarf tables) leads to a nested demand page fault, which we don't (and probably can't) support. Switch to backtrace_safe(), which is of lower quality, but safer. Signed-off-by:
Avi Kivity <avi@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Gleb Natapov authored
When one does not care about the dirty bit in a page table, it is beneficial to map everything as dirty, since that relieves the HW from setting the bit on first access. But since file mappings now use the dirty bit to reduce write-back during sync, let's map them clean initially - though only if the file mapping is shared, since sync is disabled for private file mappings. Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
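A small sketch of the policy described above, using a hypothetical PTE flag helper (the constant and struct names are illustrative, not OSv's):

```cpp
#include <cstdint>

// Hypothetical PTE flag helpers, just to illustrate the mapping policy.
constexpr uint64_t pte_dirty = 1ull << 6;

struct mapping_info {
    bool file_backed;
    bool shared;       // MAP_SHARED
};

// Anonymous and private file mappings: pre-set the dirty bit so the HW never
// has to take the extra walk to set it on first write. Shared file mappings:
// start clean, so sync/writeback can rely on the dirty bit to find what to flush.
inline uint64_t initial_pte_flags(const mapping_info& m, uint64_t base_flags) {
    if (m.file_backed && m.shared)
        return base_flags;            // leave clean
    return base_flags | pte_dirty;    // map as dirty up front
}
```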
-
Gleb Natapov authored
Now everything is ready to enable on-demand file paging, so disable the unconditional populate on mmap for files and provide fault() for file-backed vmas. In fact, the file vma fault() is no different from the anon vma fault(), so provide one general vma::fault() handler for both. Reviewed-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Gleb Natapov authored
Walk the page table and write out only dirty pages during file sync, instead of writing back the entire mapping. Reviewed-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
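A simplified sketch of a dirty-only writeback walk. The pte struct and the flat vector standing in for a page table are assumptions made for illustration; the real code walks the hardware page table:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical flattened "page table" entry.
struct pte {
    uint64_t phys;
    bool dirty;
};

// Write back only pages whose dirty bit is set, instead of the whole mapping,
// and mark them clean afterwards.
void sync_mapping(std::vector<pte>& ptes, size_t page_size, uint64_t file_off,
                  const std::function<void(uint64_t off, uint64_t phys, size_t len)>& write_back) {
    for (size_t i = 0; i < ptes.size(); ++i) {
        if (!ptes[i].dirty)
            continue;                       // clean page: nothing to flush
        write_back(file_off + i * page_size, ptes[i].phys, page_size);
        ptes[i].dirty = false;              // page is clean again after writeback
    }
}
```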
-
Gleb Natapov authored
Currently a map_file_page object can be used only once, since its internal state changes during the mapping operation. For demand paging we do not want to allocate/delete an object each time a mapping is needed, so, to make the map_file_page object reusable, this patch resets its state to the starting one during finalize(). Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Gleb Natapov authored
Currently pages are allocated just before they are mapped, but sometimes we want to map preallocated pages (for a shared mapping, or when mapping a page-cache page, for instance). This patch moves the page acquisition process into a separate class. This will allow us to add subclasses that provide pages from a shared pages pool or from a page cache. Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
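A sketch of the separation described above: the mapper asks a provider object for pages instead of allocating them itself, so a later subclass can hand out pre-existing pages. Class names here are hypothetical:

```cpp
#include <cstdlib>

// Hypothetical page-provider hierarchy (illustrative names).
class page_allocator {
public:
    virtual ~page_allocator() = default;
    virtual void* alloc_page() = 0;
    virtual void free_page(void* p) = 0;
};

// Default behavior: hand out freshly allocated pages, as the mapper did before.
class fresh_page_allocator : public page_allocator {
public:
    void* alloc_page() override { return std::aligned_alloc(4096, 4096); }
    void free_page(void* p) override { std::free(p); }
};

// A shared-pool or page-cache provider would override alloc_page() to return
// an already-populated page instead of newly allocated memory.
```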
-
Gleb Natapov authored
Makes code easier to understand. Reviewed-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Gleb Natapov authored
Currently fill_file_page is used to populate an entire vma, so the starting offset and the mapped file offset are always the same, but if fill_file_page is used to populate only part of a vma, this will no longer necessarily be true. Keep track of the lowest file offset populated and use it for the read. Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Gleb Natapov authored
A value passed to the std::vector constructor increases not only the vector's capacity() but also its size(). We want only the former, not the latter, and that is achieved with the reserve() function. Also remove prev_off, which is a leftover from debug code. Reviewed-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
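A tiny illustration of the difference between the sizing constructor and reserve():

```cpp
#include <cassert>
#include <vector>

int main() {
    std::vector<int> a(100);   // size() == 100, elements value-initialized
    std::vector<int> b;
    b.reserve(100);            // capacity() >= 100, but size() == 0

    assert(a.size() == 100);
    assert(b.size() == 0 && b.capacity() >= 100);
}
```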
-
Gleb Natapov authored
The generic page mapper code provides an offset to the mapper's small/huge_page callbacks, but to make this offset meaningful it should be an offset from the beginning of the vma that is mapped; currently it is an offset from the starting address of the current mapping operation, which means it will always be zero if mapping is done page by page. Fix this by passing the starting vma address to the page mapper for correct offset calculation. Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
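A one-line sketch of the offset calculation the fix switches to (function names are illustrative only):

```cpp
#include <cstdint>

// Fixed: the offset handed to the small/huge page callbacks is relative to
// the vma start, so it stays meaningful even when mapping page by page.
inline uint64_t page_offset(uintptr_t addr, uintptr_t vma_start) {
    return addr - vma_start;
}

// Broken variant the commit replaces: relative to the start of the current
// mapping operation, which is always zero when mapping one page at a time.
inline uint64_t page_offset_broken(uintptr_t addr, uintptr_t op_start) {
    return addr - op_start;
}
```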
-
Gleb Natapov authored
Trying to execute an unmapped page is not a punishable offense as long as the vma permissions are correct, so check them. Reviewed-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Gleb Natapov <gleb@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
Now that we are showing names in gdb, it helps to name all of them. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Feb 09, 2014
-
-
Avi Kivity authored
Add an additional, optional, list of poll requests that the net channel producer awakens. This will be used by poll() to ensure it is awakened if there is net channel activity. Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Avi Kivity authored
A net channel can cause poll() to wake up due to a received packet, yet that packet may not cause any change to a polled fd's state. In that case, we need to loop again, to avoid returning 0, which would indicate a timeout that has not actually happened. Triggered by netserver, which expects select() to return 1. Reviewed-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Avi Kivity authored
Some file types may be interested in knowing that they are being polled. Add hooks for this. Needed for net channels, as the polling thread must wake up when there is activity on the channel. Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Avi Kivity authored
Reviewed-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Avi Kivity authored
Let them take the normal path, to avoid complicating the locking rules. Reviewed-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Avi Kivity authored
Add a structure that holds a hash table of net channels, and is able to classify tcp/ipv4 packets and dispatch them via the channel. Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Avi Kivity authored
This adds a net_channel class that connects a packet producer with a packet consumer. Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
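A very rough, self-contained sketch of the producer/consumer idea. The real OSv net_channel is lock-free and carries mbufs; this mutex-and-condvar version is only an illustration:

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>

// Illustrative channel: a producer pushes packets, a consumer drains them,
// and each push wakes whoever consumes the channel.
template <typename Packet>
class net_channel {
    std::mutex _mtx;
    std::condition_variable _cv;
    std::deque<Packet> _queue;
public:
    void push(Packet p) {                  // producer side (e.g. the driver)
        {
            std::lock_guard<std::mutex> g(_mtx);
            _queue.push_back(std::move(p));
        }
        _cv.notify_one();                  // wake the consumer
    }
    Packet pop() {                         // consumer side (e.g. the TCP thread)
        std::unique_lock<std::mutex> lk(_mtx);
        _cv.wait(lk, [&] { return !_queue.empty(); });
        Packet p = std::move(_queue.front());
        _queue.pop_front();
        return p;
    }
};
```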
-
- Feb 07, 2014
-
-
Nadav Har'El authored
This patch reduces our BSS size from 9MB to less than 1MB, and reduces the minimum amount of memory required to run the "rogue" image from 55 MB to just 47 MB. Together with the previous patch, this fixes #96.

Our almost-lock-free SMP allocator uses cpu-to-cpu rings to send freed memory from one CPU to another, so for N cpus we need N^2 of these buffers. But in an apparent attempt to support CPU hotplugging, the current code allocated queues for 64 CPUs instead of the real number of CPUs we have. So even with a single CPU, we always have 64*64 of these buffers, each a little over 2K, totaling a whopping 8MB, while only a tiny fraction of this will ever be used.

This patch changes the code to only allocate the number of these queues we really need. It does not support CPU hot-plugging, but a lot of other places in the code don't support hot-plugging either, so I left a FIXME to remind us where we need to fix it. Reviewed-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
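A sketch of the sizing change: allocate cpu_count^2 rings based on the detected CPU count instead of a hard-coded 64*64. The free_ring struct and make_rings function are illustrative stand-ins:

```cpp
#include <memory>
#include <vector>

// Hypothetical cpu-to-cpu free ring, roughly 2KB each as the commit notes.
struct free_ring { char buf[2048]; };

std::vector<std::unique_ptr<free_ring>>
make_rings(unsigned cpu_count /* detected at boot, not a hard-coded 64 */) {
    std::vector<std::unique_ptr<free_ring>> rings;
    rings.reserve(static_cast<size_t>(cpu_count) * cpu_count);  // N^2, not 64^2
    for (unsigned i = 0; i < cpu_count * cpu_count; ++i)
        rings.push_back(std::make_unique<free_ring>());
    // FIXME: like the patch, this ignores CPU hot-plug.
    return rings;
}
```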
-
Nadav Har'El authored
Ever since debug() started defaulting to putting the message in memory, not on the console, it is not useful to call debug() before abort(). Instead, pass the message to abort() itself, and it will do the right thing (TM). Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Nadav Har'El authored
When program::lookup_function (used, for example, in our java.cc) can't find the function it throws an std::runtime_error exception. It used the not-very-useful message "symbol not found". This patch adds the symbol's demangled name, so if this exception is printed (e.g., when not caught), we get a more informative message. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
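A sketch of producing the more informative message, demangling the symbol name with the C++ ABI helper before building the exception; the function name and flow are illustrative, not OSv's exact code:

```cpp
#include <cxxabi.h>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Build a "symbol not found" error that names the missing symbol,
// demangled when possible.
[[noreturn]] void throw_symbol_not_found(const char* mangled) {
    int status = 0;
    char* demangled = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    std::string name = (status == 0 && demangled) ? demangled : mangled;
    std::free(demangled);   // __cxa_demangle returns malloc()ed memory
    throw std::runtime_error("symbol not found: " + name);
}
```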
-
Glauber Costa authored
I am proposing a mechanism here that will give us a better idea of how much time we spend booting, and how much each piece contributes to that time. For that, we need to be able to get time stamps really early, in places where tracepoints may not be available, and a clock most definitely won't be. With my proposal, one should be able to register events. After the system boots, we will calculate the total time since the first event, as well as the delta since the last event. If the first event is early enough, that should produce a very good picture of our boot time. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Reviewed-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
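A minimal sketch of the event-registration idea: record a name and a raw tsc stamp at interesting points, then report totals and deltas once the system is up. All names here are hypothetical, and the conversion factor is assumed to come from the pvclock interface in the following commit:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>
#include <x86intrin.h>

struct boot_event { const char* name; uint64_t tsc; };
static std::vector<boot_event> boot_events;

// Called at interesting points very early in boot: no clock needed, just tsc.
void boot_event_register(const char* name) {
    boot_events.push_back({name, __rdtsc()});
}

// Called after boot, once tsc ticks can be converted to nanoseconds.
void boot_events_report(double ns_per_tsc) {
    for (size_t i = 1; i < boot_events.size(); ++i) {
        uint64_t total = boot_events[i].tsc - boot_events[0].tsc;
        uint64_t delta = boot_events[i].tsc - boot_events[i - 1].tsc;
        std::printf("%-20s total %.2f ms, delta %.2f ms\n", boot_events[i].name,
                    total * ns_per_tsc / 1e6, delta * ns_per_tsc / 1e6);
    }
}
```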
-
Glauber Costa authored
This patch provides a way to acquire a nanosecond figure given a measurement from the tsc. It works only for the xen and kvm pvclocks, and I intend to use it for acquiring early boot figures. It is possible to measure the tsc frequency and, with that, figure out how to convert a tsc read to nanoseconds, but I don't think we should pay that price. Most of the pvclock drivers already provide that functionality, and we are not planning on that many users of this interface anyway. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
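A sketch of the pvclock-style tsc-to-nanoseconds conversion: the hypervisor supplies a shift and a 32.32 fixed-point multiplier, so no frequency measurement is needed. Field names loosely follow the KVM/Xen pvclock layout but the struct here is only an illustration (and uses GCC's __int128 extension):

```cpp
#include <cstdint>

// Illustrative scale parameters as exposed by a pvclock.
struct pvclock_scale {
    int8_t   shift;       // power-of-two pre-scale
    uint32_t mul_frac;    // 32.32 fixed-point multiplier
};

inline uint64_t tsc_to_ns(uint64_t tsc, const pvclock_scale& s) {
    if (s.shift >= 0)
        tsc <<= s.shift;
    else
        tsc >>= -s.shift;
    // 64x32-bit multiply; keep the upper 64 bits of the 32.32 product.
    return static_cast<uint64_t>((static_cast<unsigned __int128>(tsc) * s.mul_frac) >> 32);
}
```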
-
Nadav Har'El authored
program::lookup and program::lookup_addr were written for optimal performance and lock-freedom, using the RCU lock to protect the short module iteration. However, with demand-paged file mapping, the actual objects are demand-paged, and we cannot assume that object memory can be read without sleeping. So switch these functions to use with_modules, the more general, and somewhat slower, technique to iterate over modules. With with_modules, the RCU lock is only used to copy the list of modules, and the protection against individual modules being deleted is done separately (by incrementing a reference count preventing any module from being deleted). This makes symbol lookup a bit slower, but symbol lookups are relatively infrequent (they usually occur during early application startup, the first time each symbol is used), and measuring Java's startup time, as an example, I didn't see any measurable slowdown. Fixes #169. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Feb 06, 2014
-
-
Glauber Costa authored
The JVM may unmap certain areas of the heap completely, which was confirmed by code inspection by Gleb. In that case, the current balloon code will break. This is because we were deleting the vma from finish_move(), and recreating the old mapping implicitly in the process. With this new patch, the teardown of the jvm balloon mapping is done by a separate function. Unmapping or evacuating the region won't trigger it. It still needs to communicate to the balloon code that this address is out of the balloons list. We do that by calling the page fault handler with an empty frame. jvm_balloon_fault is patched to interpret an empty frame correctly. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Nadav Har'El authored
Also fix concurrent use of _module_index_list (for the per-module TLS). Use a new _module_index_list mutex to protect it. We could probably have done something with RCU instead, but just adding a new mutex is a lot easier. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Nadav Har'El authored
After the above patches, one race remains in the dynamic linker: if an object is *unloaded* while some symbol resolution or object iteration (dl_iterate_phdr) is in progress, the function in progress may reach this object after it is already unmapped from memory, and crash. Therefore, we need to delay unmapping of objects while any object iteration is going on. We need to allow the object to be deleted from the _modules and _files lists (so that new calls will not find it) but temporarily delay the actual freeing of the object's memory. The cleanest way to achieve this would have been to increment each module's reference count in the RCU section of modules_get(), so they won't get deleted while still in use. However, this would significantly slow down users like backtrace() with dozens of atomic operations. So we chose a different solution: keep a counter, _modules_delete_disable, which when non-zero causes all module deletion to be delayed until the counter drops back to zero. with_modules() now only needs to increment this single counter, not every separate module. Fixes #176. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Nadav Har'El authored
In the current code, when a shared-object's reference count drops to zero, shared_ptr<object> calls delete on this object, which calls the object's destructor, which (see file::~file) calls program::remove_object() to remove this object from the program's _files list. This order was always awkward and unexpected, and I even had a comment in ~file noting that "we don't delete(ef) here - the contrary, delete(ef) calls us". But for the next patch, we can no longer live with this order, which is not just awkward but also wrong: In the next patch we'll want a shared object to be immediately removed from the program (so it is no longer found when resolving symbols, etc.), but the actual deletion of this object to be delayed till a more convenient time. For this to work, we cannot have the deletion of the object start before its removal from the program: We need the opposite - the removal of the object from the program will delete the object. Luckily, it's easy to do this properly using a shared_ptr deleter lambda. This tells shared_ptr to call remove_object(), not delete, when the reference count drops to zero. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
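A sketch of the deleter-lambda idea: when the last reference drops, the object is first removed from the program's list, and removal is what frees it, rather than the object's destructor calling remove_object(). The types here are simplified stand-ins and assume the program outlives its objects:

```cpp
#include <algorithm>
#include <memory>
#include <vector>

struct object { const char* name; };

struct program {
    std::vector<object*> files;

    void remove_object(object* obj) {
        files.erase(std::remove(files.begin(), files.end(), obj), files.end());
        // ...the actual unmapping/freeing can be deferred here if iterations
        // are in progress (see the following patch)...
        delete obj;
    }

    std::shared_ptr<object> add_object(const char* name) {
        auto* obj = new object{name};
        files.push_back(obj);
        // Custom deleter: refcount 0 triggers remove_object(), not a bare delete.
        return std::shared_ptr<object>(obj, [this](object* o) { remove_object(o); });
    }
};
```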
-
Nadav Har'El authored
This patch addresses the bugs of *use* of the dynamic linker - looking up symbols or iterating the list of loaded objects - in parallel with new libraries being loaded with get_library(). The underlying problem is that we have an unprotected "_modules" vector of loaded objects, which we need to iterate to look up symbols, but this list of modules can change when a new shared object is loaded. We decided *not* to solve this problem by using the same mutex protecting object load/unload: _mutex. That would make boot slower, as threads using new symbols would be blocked just because another thread is concurrently loading some unrelated shared object (not a big problem with demand-paged file mmaps). Using a mutex can also cause deadlocks in the leak detector, because of lock order reversal between malloc's and elf's mutexes: malloc() takes its lock first and then backtrace() takes elf's lock, while on the other hand elf can take its lock and then call malloc, taking malloc's lock. Instead, this patch uses RCU to allow lock-free reading of the modules list. As in RCU, writing (adding or removing an object from the list) manufactures a new list, deferring the freeing of the old one, allowing reads to continue using the old object list. Note that after this patch, concurrent lookups and get_library() will work correctly, but concurrent lookups and object *unload* will still not be correct, because we need to defer an object's unloading from memory while lookups are in progress. This will be solved in a following patch. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Nadav Har'El authored
The previous patch fixed most of the concurrent shared-object load/unload bugs except one: We still have a rare race between loading and unloading the *same* library. In other words, it is possible that just when a library's reference count went down to 0, and its destruction will promptly start (the mutex is not yet taken), get_library() gets called to load the same library. We cannot return the already-loaded one (because it is in the process of being destructed). So in that case we need to ignore the library we found, and load it again. Luckily, std::weak_ptr's notion of "expired" shared pointers (that already went down to 0, and can't be lock()ed) and its atomic implementation thereof, makes fixing this problem easy and natural. The existing code already had an assert() in place to protect against this race (and it was so rare we never actually saw it), but this patch should fix this race once and for all. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
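A sketch of the get_library() race fix using std::weak_ptr: if the cached entry has already expired (its refcount hit zero and destruction is pending), ignore it and load the library again instead of asserting. The cache layout and names are illustrative:

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

struct object { std::string path; };

std::mutex cache_mutex;
std::unordered_map<std::string, std::weak_ptr<object>> cache;

std::shared_ptr<object> get_library(const std::string& path) {
    std::lock_guard<std::mutex> g(cache_mutex);
    auto it = cache.find(path);
    if (it != cache.end()) {
        if (auto existing = it->second.lock())   // atomically fails if expired
            return existing;
        // Expired: the old copy is being torn down concurrently - fall through
        // and load the library again.
    }
    auto fresh = std::make_shared<object>(object{path});
    cache[path] = fresh;
    return fresh;
}
```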
-
Nadav Har'El authored
Our current dynamic-linker code (elf.cc) is not thread safe, and all sorts of disasters can happen if shared objects are loaded, unloaded and/or used concurrently. This and the following patches solve this problem in stages. The first stage, in this patch, is to protect concurrent shared-library loads and unloads (if the dynamic linker is also in use concurrently, this will still cause problems, which will be solved in the next patches). Library load and unload use a bunch of shared data without protection, so concurrency can cause disaster. For example, two concurrent loads can pick the same address to map the objects in. We solve this by using a mutex to ensure only one shared object is loaded or unloaded at a time. Instead of this coarse-grained locking, we could have used finer-grained locks to allow several library loads to proceed in parallel, protecting just the actual shared data. However, the benefits would be very small, because with demand-paged file mmaps, "loading" a library just sets up the memory map, very quickly, and the object will only actually be read from disk later, when its pages get used. Fixes #175. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Feb 05, 2014
-
-
Nadav Har'El authored
In the current code, trace_log - a 4MB array which stores tracepoint data - is a global array. In other words, as explained in issue #96, it is part of the BSS, which always adds 4MB of memory usage to OSv - whether or not tracepoints are actually enabled. This patch changes trace_log to be a dynamically-allocated array, allocated when the first tracepoint is enabled (e.g., with a --trace=... boot option). This will also allow us (in the future) to choose the size of the trace log at run time rather than at compile time. This patch removes 4 MB of BSS (from 13MB to 9MB), and correspondingly lowers our minimum memory use by 4 MB. Whereas the minimum memory requirement of the "rogue" image used to be 59 MB, now it is 55 MB. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
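A minimal sketch of the lazy-allocation change (illustrative names, not OSv's actual tracing code): the 4MB buffer is no longer a global array in BSS, and is only allocated when the first tracepoint gets enabled:

```cpp
#include <cstddef>
#include <memory>

constexpr size_t trace_log_size = 4 << 20;       // 4MB
static std::unique_ptr<char[]> trace_log;        // no BSS cost until allocated

void enable_tracepoint(/* ... */) {
    if (!trace_log)
        trace_log = std::make_unique<char[]>(trace_log_size);  // pay 4MB only now
    // ... mark the tracepoint enabled ...
}
```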
-
- Feb 03, 2014
-
-
Avi Kivity authored
wake_lock() prevents a race with the target thread waking up after wake_lock() and attempting to take the mutex by itself, by changing the thread status to sending_lock. However, it doesn't prevent the reverse race:

    t1                          t2
    =========================   ======================
    wait_for()
                                mutex::unlock()
      thread::wait()              t1->wake()
    mutex::lock()
                                mutex::lock()        [A]
      thread::wait()
                                  t1->wake_lock()    [B]

After [A], t1 is waiting on the mutex, and after [B], it is waiting twice, which is impossible and aborts on an assertion failure. Fix by detecting, in send_lock(), that we're already waiting, and aborting the send_lock(). Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
- Feb 02, 2014
-
-
Avi Kivity authored
Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-