- May 16, 2014
-
-
Nadav Har'El authored
thread::current()->thread_clock() returns the CPU time consumed by this thread. A thread that wishes to measure the amount of CPU time consumed by some short section of code will want this clock to have high resolution, but in the existing code it was only updated on context switches, so shorter durations could not be measured with it. This patch fixes thread_clock() to also add the time that has passed since the current time slice started. When running thread_clock() on *another* thread (not thread::current()), we still return a CPU time snapshot from the last context switch - even if the thread happens to be running now (on another CPU). Fixing that case is quite difficult (and will probably require additional memory-ordering guarantees), and anyway not very important: usually we don't need a high-resolution estimate of a different thread's CPU time. Fixes #302.
Reviewed-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
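A minimal sketch of the kind of measurement this enables, using the portable POSIX per-thread CPU clock rather than OSv's internal thread_clock() API (an illustration of the idea, not the patch itself):

    // Measure the CPU time consumed by a short section of code on the current thread.
    #include <time.h>
    #include <cstdio>

    static long long thread_cpu_ns() {
        timespec ts;
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    int main() {
        long long start = thread_cpu_ns();
        volatile long sum = 0;
        for (int i = 0; i < 1000000; i++) sum += i;   // the short section being measured
        long long end = thread_cpu_ns();
        printf("section consumed %lld ns of CPU time\n", end - start);
        return 0;
    }

Without a high-resolution per-thread clock, a section this short would often appear to consume zero time.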
-
- May 15, 2014
-
-
Vlad Zolotarov authored
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 14, 2014
-
-
Tomasz Grabiec authored
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 08, 2014
-
-
Nadav Har'El authored
Implement three of the Linux functions we're still missing (see issue #77 for a list of more things to do). These are pretty easy to add, because we just need to correctly call the lower-layer functions, sys_*, which already exist. This patch also adds tests for the three functions to tst-readdir.cc.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
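The commit does not name the three functions, but since the tests went into tst-readdir.cc they are presumably directory-related libc calls. Purely as an illustration (not the patch itself), this is the kind of standard directory API such wrappers sit in front of:

    // Walk a directory with the standard POSIX directory API.
    #include <dirent.h>
    #include <cstdio>

    int main() {
        DIR* d = opendir("/tmp");
        if (!d) { perror("opendir"); return 1; }
        long pos = telldir(d);                  // remember the current position
        struct dirent* e;
        while ((e = readdir(d)) != nullptr)
            printf("entry: %s\n", e->d_name);
        seekdir(d, pos);                        // rewind to the remembered position
        closedir(d);
        return 0;
    }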
-
- May 05, 2014
-
-
Tomasz Grabiec authored
The test fails when the problem described in issue #283 exists.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Takuya ASADA authored
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- May 04, 2014
-
-
Tomasz Grabiec authored
Exploits issue #288.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Tomasz Grabiec authored
Currently a tracepoint's signature string is encoded into a u64, which limits the signature to 8 characters. When a signature does not fit into that limit, only the first 8 characters are preserved. This patch fixes the problem by storing the signature as a C string of arbitrary length. Fixes #288.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
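A small illustration of the limitation being fixed (the packing function here is made up, not OSv's actual tracepoint code): a signature squeezed into a u64 simply cannot carry more than 8 bytes.

    #include <algorithm>
    #include <cstdint>
    #include <cstring>
    #include <cstdio>

    // Pack a signature string into a u64 - anything past 8 characters is lost.
    static uint64_t pack_signature(const char* sig) {
        uint64_t packed = 0;
        std::memcpy(&packed, sig, std::min(std::strlen(sig), sizeof(packed)));
        return packed;
    }

    int main() {
        char out[9] = {};
        uint64_t v = pack_signature("ddqqppddq");        // a 9-character signature
        std::memcpy(out, &v, sizeof(v));
        printf("round-tripped signature: '%s'\n", out);  // only 8 characters survive
        return 0;
    }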
-
- May 03, 2014
-
-
Raphael S. Carvalho authored
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- Apr 29, 2014
-
-
Tomasz Grabiec authored
This test passes on Linux but failed on OSv before the patch which changed solisten_proto() to always use SOMAXCONN.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Glauber Costa authored
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- Apr 28, 2014
-
-
Gleb Natapov authored
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Gleb Natapov authored
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Gleb Natapov authored
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Nadav Har'El authored
When a small allocation is requested with large alignment, we ignored the alignment, and as a consequence posix_memalign() or alloc_phys_contiguous_aligned() could crash when it failed to achieve the desired alignment. This is not a common case (usually size >= alignment, and the new C11 aligned_alloc() even supports only this case), but it can still happen, and we saw it in cloudius-systems/capstan#75. When size < alignment, this patch changes the size so we can achieve the desired alignment. For small alignments, this means setting size=alignment, so for example to get an alignment of 1024 bytes we need at least a 1024-byte allocation. This is a waste of memory, but as these allocations are rare, we expect this to be acceptable. For large alignments, e.g., alignment=8192, we don't need size=alignment, but we do need size to be large enough that we'll use malloc_large() (malloc_large() already supports arbitrarily large alignments). This patch also adds test cases to tst-align.so to test alignments larger than the requested size. Fixes #271 and cloudius-systems/capstan#75.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
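A minimal sketch of the size adjustment described above. The names and the threshold are made up for illustration; OSv's real logic lives inside its malloc layer.

    #include <algorithm>
    #include <cstddef>

    // Assumed cutoff above which requests go to the large/page allocator.
    static constexpr size_t large_threshold = 4096;

    static size_t adjust_size_for_alignment(size_t size, size_t alignment) {
        if (size >= alignment)
            return size;                      // common case: nothing to do
        if (alignment <= large_threshold)
            return alignment;                 // e.g. 1024-byte alignment costs a 1024-byte allocation
        // Large alignment: make the request big enough to take the
        // large-allocation path, which can honor arbitrary alignments.
        return std::max(size, large_threshold + 1);
    }

    int main() {
        // e.g. a 100-byte request with 8192-byte alignment gets bumped past the threshold
        return adjust_size_for_alignment(100, 8192) > large_threshold ? 0 : 1;
    }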
-
- Apr 25, 2014
-
-
Tomasz Grabiec authored
A tracepoint argument which extends 'blob_tag' will be interpreted as a range of byte-sized values. The storage required to serialize such an object is proportional to its size. I need it to implement storage-friendly packet capturing using the tracing layer. It could also be used to capture variable-length strings. The current limit (50 chars) is too short for some paths passed to vfs calls. With variable-length encoding, we could set a more generous limit.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- Apr 24, 2014
-
-
Nadav Har'El authored
Posix allows read() on directories in some filesystems. However, Linux always returns EISDIR in this case, and because we're emulating Linux, so should we, for every filesystem. All our filesystems except ZFS (e.g., ramfs) already return EISDIR when reading a directory, but ZFS doesn't, so this patch adds the missing check in ZFS. This patch is related to issue #94: the first step to fixing #94 is to return the right error when reading a directory. This patch also adds a test case, which can be compiled both on OSv and Linux, to verify they both have the same behavior. Before the patch, the test succeeded on Linux but failed on OSv when the directory is on ZFS. Instead of fixing zfs_read like I do in this patch, I could have also fixed sys_read() in vfs_syscalls.cc, which is the top layer of all read() operations, and added there: if (fp->f_dentry && fp->f_dentry->d_vnode->v_type == VDIR) { return EISDIR; } to cover all the filesystems. I decided not to do that, because all filesystems except ZFS already have this check, and because the lower layers like zfs_read() already have more natural access to d_vnode.
Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
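A minimal sketch of the behavior the new test verifies - on Linux, and on OSv after this patch, read() on a directory file descriptor fails with EISDIR regardless of the filesystem:

    #include <fcntl.h>
    #include <unistd.h>
    #include <cerrno>
    #include <cassert>

    int main() {
        int fd = open("/", O_RDONLY);           // any directory will do
        assert(fd >= 0);
        char buf[16];
        ssize_t n = read(fd, buf, sizeof(buf));
        assert(n == -1 && errno == EISDIR);     // Linux semantics, now matched by ZFS on OSv
        close(fd);
        return 0;
    }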
-
- Apr 15, 2014
-
-
Pawel Dziepak authored
Signed-off-by: Pawel Dziepak <pdziepak@quarnos.org>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Pawel Dziepak authored
This code is going to be shared by tst-mmap and tst-elf-permissions.
Signed-off-by: Pawel Dziepak <pdziepak@quarnos.org>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- Apr 14, 2014
-
-
Tomasz Grabiec authored
The test is supposed to trigger the problem from issue #259. I was not able to trigger the problem using guest-local communication, hence the client is external to the guest.
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Avi Kivity authored
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Nadav Har'El authored
Our existing implementation of posix_memalign() and the C11 aligned_alloc() used our regular malloc(), so it only worked up to an alignment of 4096 bytes - and crashed when it failed to achieve a higher desired alignment. Some applications do ask for higher alignment - for example MongoDB allocates a large buffer at 8192-byte alignment, and accordingly crashes half of the time, when the desired alignment is not achieved (the other half of the time, it is achieved by chance). This patch makes our support for alignment better organized, and fixes the alignment > 4096 case: the alignment is no longer known only to the outer function like posix_memalign(). Rather, it is passed down to lower-level allocation functions like malloc_large(), which allocates whole pages - and this function now knows how to pick pages which start at a properly aligned boundary. This patch does not improve the wastefulness of our malloc_large(), so an overhaul of it would still be welcome. Case in point, malloc_large() always adds a full page to any allocation larger than half a page. Multiple allocations with posix_memalign(8192, 8192), rather than being tightly packed, each take 3 pages and are separated by a free page. This page is not wasted, but causes fragmentation of the heap. Note that after this patch, we still have one other bug in posix_memalign(size, align) - for small sizes and large alignments. For small sizes, we use a pool allocator with "size" alignment, and may not achieve the desired alignment (causing an assertion failure). This bug can also be fixed, but is unrelated to this patch. This patch also adds a test for posix_memalign(), checking all alignments, including the large alignments which are the topic of this patch. The tests for small *sizes*, which as explained above are still buggy, are commented out, because they fail. Fixes #266.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@avi.cloudius>
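A minimal sketch of the kind of check the new test performs (an illustration, not the actual test code): every pointer returned by posix_memalign() must be aligned to the requested boundary, including boundaries larger than a page.

    #include <stdlib.h>
    #include <cstdint>
    #include <cassert>

    int main() {
        for (size_t align = 8; align <= 256 * 1024; align *= 2) {
            void* p = nullptr;
            int r = posix_memalign(&p, align, 4 * align);   // size >= alignment, the working case
            assert(r == 0 && p != nullptr);
            assert(reinterpret_cast<uintptr_t>(p) % align == 0);
            free(p);
        }
        return 0;
    }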
-
- Apr 09, 2014
-
-
Glauber Costa authored
We have recently discovered a bug through which we fail to unmap a valid region. This is fixed now, and this patch adds the failing condition to the test suite.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Nadav Har'El authored
Our read() and write(), and their variants (pread, pwrite, readv, writev, preadv, pwritev), all shared the same bug when it comes to a partial read or write: they returned EWOULDBLOCK (EAGAIN) instead of returning successfully with the number of bytes actually written or read, as they should have. In the internals of the BSD read and write operations (e.g., sosend_generic) each operation returns *both* an error number and a number of bytes left. But at the end, the system call is expected to return just one of them - either an error *or* a number of bytes. The existing read()/write() code, when it saw the internals returning an error code, always returned it and ignored the number of bytes. This was wrong: when the error is EWOULDBLOCK and the number of bytes is non-zero, we should return this number of bytes (i.e., a successful partial write), *not* the EWOULDBLOCK error. This bug went unnoticed almost since the dawn of OSv, because partial reads and writes are not common. For example, a write() to a blocking socket will always return after the entire write is successful, and will not partially succeed. Only when we write to an O_NONBLOCK socket will it be possible to see a partial write - but even then, we would need a pretty large write() to see it only partially succeeding. But this bug is very noticeable when running the Jetty Web server (see issue #257): at some point it's like the response was restarted (complete with a second copy of the headers). In Jetty's demo this was seen as half-shown images, as well as corrupt output when fetching large text files like /test/da.txt. Turns out that Jetty sends static responses in a surprisingly efficient (for Java code...) way, using a single system call for the entire response: it mmap()s the file it wishes to send, and then uses one writev() call to send two arrays: the HTTP headers (built in malloc()ed memory), and the file itself (from mmapped memory). So Jetty tries to write even a 1MB file in one huge writev() call. But there's an added twist: it does so with the socket configured to O_NONBLOCK. So for large writes, the write will only partially succeed (empirically, only about 50KB will succeed), and Jetty will notice the partial write and continue writing the rest - until the whole file is sent. With the bug we had, part of the response will have been written, but Jetty still thought the write didn't write anything, so it would start writing again from the beginning - causing the weird sort of response corruption we've been seeing. This patch also includes a test case which confirms this bug, and its fix. In this test (tst-tcp-nbwrite), two threads communicate over a TCP socket (on the loopback interface): one thread write()s a very large buffer and the other receives what it can. We try this twice - once on a blocking socket and once on a non-blocking socket. In each case we expect the number of bytes written by one thread (the return from write()) and the number read by the second thread (the return from read()) to be the same. With the bug we had, in the non-blocking case we saw write() returning -1 (with errno=EWOULDBLOCK) but read() returning over 50,000 bytes, causing the test to fail. Fixes #257.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
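A minimal sketch of the semantics the fix restores, using a Unix-domain socketpair for brevity instead of the loopback TCP socket used by the actual tst-tcp-nbwrite test: a large write() on a non-blocking socket should report a partial write, not fail with EWOULDBLOCK after having already queued data.

    #include <sys/socket.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    int main() {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) { perror("socketpair"); return 1; }
        fcntl(sv[0], F_SETFL, O_NONBLOCK);              // writer side is non-blocking

        std::vector<char> buf(1 << 20, 'x');            // 1MB - far more than the socket buffers hold
        ssize_t n = write(sv[0], buf.data(), buf.size());
        if (n >= 0)
            printf("partial write of %zd bytes\n", n);  // correct behavior: bytes actually queued
        else
            perror("write");                            // the old bug: EWOULDBLOCK despite progress
        close(sv[0]);
        close(sv[1]);
        return 0;
    }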
-
- Apr 03, 2014
-
-
Raphael S. Carvalho authored
Specific output:
  Reading directory entries at /proc...
  dentry name: .
  dentry name: ..
  dentry name: self
  Reading directory entries at /proc/self...
  dentry name: .
  dentry name: ..
  dentry name: maps
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- Apr 01, 2014
-
-
Tomasz Grabiec authored
This is a wrapper around timer_task which should be used when atomicity of callback tasks and timer operations is required. The class accepts an external lock to serialize all operations. It provides sufficient abstraction to replace callouts in the network stack. Unfortunately, it requires some cooperation from the callback code (see try_fire()). That's because I couldn't extract the in_pcb lock acquisition out of the callback code in the TCP stack: there are other locks taken before it, and doing so _could_ result in lock-order-inversion problems and hence deadlocks. If we can prove these to be safe, then the API could be simplified. It may also be worthwhile to propagate the lock passed to serial_timer_task down to timer_task to save an extra CAS.
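A hypothetical usage sketch of the pattern described above - the type and method names here are invented stand-ins, not OSv's actual serial_timer_task API. The point is the contract: every operation happens under one external lock, and the callback uses try_fire() to detect that it lost a race with cancel() or reschedule().

    #include <mutex>
    #include <cstdio>

    struct toy_serial_timer_task {                    // invented name, not OSv code
        unsigned long generation = 0;                 // bumped by every reschedule()/cancel()
        bool armed = false;

        void reschedule() { ++generation; armed = true; }
        void cancel()     { ++generation; armed = false; }
        bool try_fire(unsigned long fired_gen) {      // called by the timer callback
            if (!armed || fired_gen != generation) return false;
            armed = false;
            return true;
        }
    };

    int main() {
        std::mutex conn_lock;                         // the "external lock", e.g. a pcb lock
        toy_serial_timer_task timer;
        unsigned long snap;

        { std::lock_guard<std::mutex> g(conn_lock); timer.reschedule(); snap = timer.generation; }
        { std::lock_guard<std::mutex> g(conn_lock); timer.cancel(); }   // races with the firing
        { std::lock_guard<std::mutex> g(conn_lock);
          printf("callback should run: %s\n", timer.try_fire(snap) ? "yes" : "no"); }
        return 0;
    }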
-
Tomasz Grabiec authored
The design behind timer_task: timer_task was designed to make cancel() and reschedule() scale well with the number of threads and CPUs in the system. These methods may be called frequently and from different CPUs. A task scheduled on one CPU may be rescheduled later from another CPU. To avoid expensive coordination between CPUs, a lockfree per-CPU worker was implemented. Every CPU has a worker (async_worker) which has a task registry and a thread to execute the tasks. Most of the worker's state may only be changed from the CPU on which it runs. When a timer_task is rescheduled, it registers its percpu part in the current CPU's worker. When it is then rescheduled from another CPU, the previous registration is marked as no longer valid and a new percpu part is registered. When a percpu task fires, it checks if it is the last registration - only then may it fire. Because timer_task's state is scattered across CPUs, some extra housekeeping needs to be done before it can be destroyed: we need to make sure that no percpu task will try to access the timer_task object after it is destroyed. To ensure that, we walk the list of registrations of the given timer_task and atomically flip their state from ACTIVE to RELEASED. If that succeeds, it means the task is now revoked and the worker will not try to execute it. If that fails, it means the task is in the middle of firing and we need to wait for it to finish. When a per-CPU task is moved to the RELEASED state, it is appended to the worker's queue of released percpu tasks using a lockfree MPSC queue. These objects may later be reused for registrations.
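A hypothetical sketch of the revocation step described above (the names are invented for illustration, not OSv's actual code): flipping a per-CPU registration from ACTIVE to RELEASED with a single compare-and-swap. Success means the task is revoked; failure means the worker is firing it and must be waited for.

    #include <atomic>
    #include <cstdio>

    enum class reg_state { ACTIVE, FIRING, RELEASED };

    struct percpu_registration {
        std::atomic<reg_state> state{reg_state::ACTIVE};
    };

    // Returns true if the registration was revoked before it could fire;
    // false means it is currently firing and the caller must wait for it.
    static bool try_revoke(percpu_registration& reg) {
        reg_state expected = reg_state::ACTIVE;
        return reg.state.compare_exchange_strong(expected, reg_state::RELEASED);
    }

    int main() {
        percpu_registration reg;
        printf("revoked: %s\n", try_revoke(reg) ? "yes" : "no (firing - wait for it)");
        return 0;
    }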
-
Glauber Costa authored
This is a test in which two threads compete for resources. One of them will (hopefully) trigger memory allocations that are served by the heap, while the other will stress the filesystem through reads and/or writes (no mappings). This is designed to test how well the balloon code works together with the ARC reclaimer. There are three main goals I expect OSv to achieve when running this test:
1) When there is no filesystem activity, the balloon should never trigger, and the ARC cache should be reduced to its minimum.
2) When there is no Java activity, we should balloon as much as we can, leaving the memory available to the filesystem (this one is trickier because the IO code is itself a Java application - on purpose - so we eventually have to stop).
3) When both are happening in tandem, the system should stabilize at reasonable values and not spend useless cycles switching memory back and forth.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
-
- Mar 30, 2014
-
-
Dmitry Fleytman authored
Test misc-bdev-rw introduced. The test writes buffers of various lengths to a block device, reads the data back, and verifies that the data read is the same as the data written.
Signed-off-by: Dmitry Fleytman <dmitry@daynix.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
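A minimal sketch of the write/read/verify pattern the test follows, run here against a throwaway regular file (the path is made up) rather than a real block device:

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstring>
    #include <cassert>
    #include <vector>

    int main() {
        int fd = open("/tmp/bdev-rw-demo", O_CREAT | O_RDWR | O_TRUNC, 0600);
        assert(fd >= 0);
        for (size_t len : {512u, 4096u, 65536u}) {                    // buffers of various lengths
            char fill = char('A' + len % 26);
            std::vector<char> out(len, fill), in(len);
            assert(pwrite(fd, out.data(), len, 0) == (ssize_t)len);   // write the buffer
            assert(pread(fd, in.data(), len, 0) == (ssize_t)len);     // read it back
            assert(std::memcmp(out.data(), in.data(), len) == 0);     // verify contents match
        }
        close(fd);
        return 0;
    }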
-
Dmitry Fleytman authored
The Xen block driver now supports a new feature called indirect descriptors. This feature allows putting more data into each ring cell, but it activates only for "long" reads and writes - longer than 11 pages. With this patch the test by default runs two scenarios:
* 1-page buffers
* 32-page buffers
Also introduced a command-line parameter to specify the buffer size explicitly.
Signed-off-by: Dmitry Fleytman <dmitry@daynix.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- Mar 27, 2014
-
-
Glauber Costa authored
We have seen bugs with mmap shared/file handling for small files. This patch tests some of the corner scenarios to find those problems.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
-
- Mar 25, 2014
-
-
Nadav Har'El authored
tst-kill runs various signal handlers, which we run in separate threads. When the test completes, we may be unlucky enough for the last signal handler to still be running, at which point when the module's memory is unmapped (e.g., in test.py -s each test is unmapped when it ends) we can get a page fault and a crash. This patch sleeps for a second at the end of tst-kill, to make sure that the signal handler has completed. This sleep is a bit ugly, but I can't think of a cleaner way - Posix provides no way to check if there's a running handler, and I wouldn't like to add a new API just for this test. Fixes #249.
Reviewed-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- Mar 21, 2014
-
-
Glauber Costa authored
It will create a file on disk twice as large as memory, and then map it entirely in memory. The file is then read using 3 different sequential patterns, and later using 2 threaded patterns. This test does not handle writes. It goes in misc because it takes a very long time to run (especially with the random pattern). Example output:
  Total Ram 586 Mb
  Write done
  Double Pass OK (13.6323 usec / page)
  Recency OK (3.35954 usec / page)
  Random Access OK (640.926 usec / page)
  Threaded pass 1 address ended OK
  Threaded pass many addresses ended OK
  PASSED
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
-
Nadav Har'El authored
Add support for the SA_RESETHAND signal handler flag, which means that the signal handler is reset to the default one after handling the signal once. I admit it's not a very useful feature (our default handler is powering off the system...), but there's no reason not to support it.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
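A minimal sketch of the SA_RESETHAND semantics described above: after the handler runs once, the signal's disposition reverts to SIG_DFL.

    #include <signal.h>
    #include <cstdio>

    static void handler(int sig) {
        // (only async-signal-safe calls belong here in real code; printf is used for brevity)
        printf("caught signal %d once\n", sig);
    }

    int main() {
        struct sigaction sa = {};
        sa.sa_handler = handler;
        sa.sa_flags = SA_RESETHAND;              // one-shot handler
        sigemptyset(&sa.sa_mask);
        sigaction(SIGUSR1, &sa, nullptr);

        raise(SIGUSR1);                          // first delivery runs handler()
        sigaction(SIGUSR1, nullptr, &sa);        // query the current disposition
        printf("disposition after first delivery: %s\n",
               sa.sa_handler == SIG_DFL ? "SIG_DFL (reset)" : "still the custom handler");
        return 0;
    }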
-
- Mar 17, 2014
-
-
Gleb Natapov authored
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
- Mar 10, 2014
-
-
Raphael S. Carvalho authored
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Pekka Enberg authored
Commit 951cca45 ("Revert "vfs/zfs: Sync vnode and znode refcounts"") forgot to disable the ZFS reference count test case.
Reported-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- Mar 03, 2014
-
-
Vlad Zolotarov authored
Running 'make check' after the fix:
  ... TEST tst-ring-spsc-wraparound.so OK (8.817 s) ...
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Vlad Zolotarov authored
We are about to kill ring_mpsc - delete the corresponding tests.
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
- Feb 25, 2014
-
-
Nadav Har'El authored
This patch adds two more load-balancing tests to tests/misc-loadbalance.cc:
1. Three threads on two CPUs. If load-balancing is working correctly, this should slow down all threads x1.5 equally, and not get two x2 threads and one x1. Our performance on this test is fairly close to the expected result.
2. Three threads on two CPUs, but one thread has priority 0.5, meaning it should get twice the CPU time of the two other threads. Fair load balancing would keep the priority-0.5 thread on its own CPU, and the two normal-priority threads together on the second CPU - so at the end the priority-0.5 thread gets twice the CPU time of the other threads. Unfortunately, this test currently gets bad results (x0.93, x0.94, x1.14 instead of x1, x1, x1), because our load balancer doesn't take thread priorities into account: it thinks the CPU running the priority-0.5 thread has load 1, while it should be considered to have load 2.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-