- Apr 02, 2014
-
-
Nadav Har'El authored
Changes in v3, following Avi's review:
* Use WITH_LOCK(migration_lock) instead of migrate_disable()/enable().
* Make the global RCU "generation" counter a static class variable, instead of a static function variable. Rename it "next_generation" (the name "generation" was grossly overloaded previously).
* In rcu_synchronize(), use migration_lock to be sure we wake up the thread to which we just added work.
* Use thread_handle, instead of thread*, for percpu_quiescent_state_thread. This is safer (an atomic variable, so we can't see it half-set on some esoteric CPU) and cleaner (no need to check t != 0). thread_handle is a bit of an overkill here, but this is not a performance-sensitive area.

The existing rcu_defer() used a global list of deferred work, protected by a global mutex. It also woke up the cleanup thread on every call. These decisions made rcu_dispose() noticeably slower than a regular delete, to the point that when commit 70502950 introduced an rcu_dispose() into every poll() call, we saw the performance of UDP memcached, which calls poll() on every request, drop by as much as 40%.

The slowness of rcu_defer() was even more apparent in an artificial benchmark which repeatedly calls new and rcu_dispose from one or several concurrent threads. While on my machine a new/delete pair takes 24 ns, a new/rcu_dispose pair from a single thread (on a 4-CPU VM) takes a whopping 330 ns. Worse, when we have 4 threads on 4 CPUs in a tight new/rcu_dispose loop, the mutex contention, the fact that we free the memory on the "wrong" CPU, and the excessive context switches all bring the measurement to as much as 12,000 ns. With this patch the new/rcu_dispose numbers are down to 60 ns on a single thread (on 4 CPUs) and 111 ns on 4 concurrent threads (on 4 CPUs). This is a x5.5 - x120 speedup :-)

This patch replaces the single list of functions with a per-CPU list. rcu_defer() can add more callbacks to this per-CPU list without a mutex, and instead of a single "garbage collection" thread running these callbacks, the per-CPU RCU thread, which we already had, is the one that runs the work deferred on this CPU's list. This per-CPU work is particularly effective for free() work (i.e., rcu_dispose()), because it is faster to free memory on the same CPU where it was allocated. This patch also eliminates the single "garbage collection" thread which the previous code needed.

The per-CPU work queue has a fixed size, currently set to 2000 functions. It is actually a double buffer, so we can continue to accumulate more work while cleaning up; if rcu_defer() is used so quickly that it outpaces the cleanup, rcu_defer() will wait until the buffer is no longer full. The choice of buffer size is a tradeoff between speed and memory: a larger buffer means fewer context switches (between the thread doing rcu_defer() and the RCU thread doing the cleanup), but also more memory temporarily being used by unfreed objects.

Unlike the previous code, we do not wake up the cleanup thread after every rcu_defer(). When the RCU cleanup work is frequent but still small relative to the main work of the application (e.g., a memcached server), the RCU cleanup thread would always have a low runtime, which meant we suffered a context switch on almost every wakeup of this thread by rcu_defer(). In this patch, we only wake up the cleanup thread when the buffer becomes full, so we have far fewer context switches. This means that currently rcu_defer() may delay the cleanup for an unbounded amount of time. This is normally not a problem, and when it is, namely in rcu_synchronize(), we wake up the thread immediately.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
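A minimal sketch of the rcu_defer() fast path this commit describes, using the WITH_LOCK(migration_lock) idiom from the v3 notes; the per-CPU buffer object and its methods (percpu_deferred_work, full(), push(), wake_cleanup_thread(), wait_until_not_full()) are hypothetical stand-ins, not the actual OSv definitions:

    // Illustrative only -- the buffer type and its method names are made up.
    void rcu_defer(std::function<void()> fn)
    {
        WITH_LOCK(migration_lock) {            // stay on this CPU while touching its buffer
            auto& buf = *percpu_deferred_work; // hypothetical per-CPU double buffer
            while (buf.full()) {
                buf.wake_cleanup_thread();     // we outpaced the cleanup: push it forward
                buf.wait_until_not_full();     //   and wait for room
            }
            buf.push(std::move(fn));
            if (buf.full()) {
                buf.wake_cleanup_thread();     // wake only when a buffer fills up,
            }                                  //   not on every rcu_defer() call
        }
    }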
-
Avi Kivity authored
https://github.com/tgrabiec/osv
"After net channel merge in commit 2828ef50 the performance of tomcat benchmark dropped significantly. Investigation revealed that the biggest bottleneck was the callout subsystem, which was using global mutex to protect its operations. This series improves the performance by replacing use of callouts inside the TCP stack with a new framework which is supposed to scale better." Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
- Apr 01, 2014
-
-
Avi Kivity authored
This reverts commit d24cda2c. It requires migration_lock to be merged first. Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Nadav Har'El authored
The existing rcu_defer() used a global list of deferred work, protected by a global mutex. It also woke up the cleanup thread on every call. These decisions made rcu_dispose() noticeably slower than a regular delete, to the point that when commit 70502950 introduced an rcu_dispose() into every poll() call, we saw the performance of UDP memcached, which calls poll() on every request, drop by as much as 40%.

The slowness of rcu_defer() was even more apparent in an artificial benchmark which repeatedly calls new and rcu_dispose from one or several concurrent threads. While on my machine a new/delete pair takes 24 ns, a new/rcu_dispose pair from a single thread (on a 4-CPU VM) takes a whopping 330 ns. Worse, when we have 4 threads on 4 CPUs in a tight new/rcu_dispose loop, the mutex contention, the fact that we free the memory on the "wrong" CPU, and the excessive context switches all bring the measurement to as much as 12,000 ns. With this patch the new/rcu_dispose numbers are down to 60 ns on a single thread (on 4 CPUs) and 111 ns on 4 concurrent threads (on 4 CPUs). This is a x5.5 - x120 speedup :-)

This patch replaces the single list of functions with a per-CPU list. rcu_defer() can add more callbacks to this per-CPU list without a mutex, and instead of a single "garbage collection" thread running these callbacks, the per-CPU RCU thread, which we already had, is the one that runs the work deferred on this CPU's list. This per-CPU work is particularly effective for free() work (i.e., rcu_dispose()), because it is faster to free memory on the same CPU where it was allocated. This patch also eliminates the single "garbage collection" thread which the previous code needed.

The per-CPU work queue has a fixed size, currently set to 2000 functions. It is actually a double buffer, so we can continue to accumulate more work while cleaning up; if rcu_defer() is used so quickly that it outpaces the cleanup, rcu_defer() will wait until the buffer is no longer full. The choice of buffer size is a tradeoff between speed and memory: a larger buffer means fewer context switches (between the thread doing rcu_defer() and the RCU thread doing the cleanup), but also more memory temporarily being used by unfreed objects.

Unlike the previous code, we do not wake up the cleanup thread after every rcu_defer(). When the RCU cleanup work is frequent but still small relative to the main work of the application (e.g., a memcached server), the RCU cleanup thread would always have a low runtime, which meant we suffered a context switch on almost every wakeup of this thread by rcu_defer(). In this patch, we only wake up the cleanup thread when the buffer becomes full, so we have far fewer context switches. This means that currently rcu_defer() may delay the cleanup for an unbounded amount of time. This is normally not a problem, and when it is, namely in rcu_synchronize(), we wake up the thread immediately.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
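A simplified, self-contained model of the per-CPU double buffer described above, using plain standard C++ stand-ins; the real code relies on OSv's per-CPU variables, its scheduler and an actual RCU grace period rather than the placeholder callback used here:

    #include <cstddef>
    #include <functional>
    #include <vector>

    struct deferred_work {
        static constexpr std::size_t max = 2000;      // fixed buffer size from the commit message
        std::vector<std::function<void()>> bufs[2];   // double buffer
        int active = 0;                               // the half rcu_defer() currently fills

        bool full() const { return bufs[active].size() >= max; }

        // Producer side: called from rcu_defer() on the owning CPU, no mutex needed.
        void push(std::function<void()> fn) {
            bufs[active].push_back(std::move(fn));
        }

        // Consumer side: the per-CPU RCU thread swaps buffers so new work keeps
        // accumulating, waits for a grace period, then runs the old buffer's
        // callbacks locally (so frees happen on the CPU that allocated the memory).
        void collect(const std::function<void()>& wait_for_grace_period) {
            auto& done = bufs[active];
            active = 1 - active;
            wait_for_grace_period();
            for (auto& fn : done) {
                fn();
            }
            done.clear();
        }
    };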
-
Tomasz Grabiec authored
The callout subsystem is using a shared global lock for most of its operations. This became a bottleneck after merging net channels. The explanation for this phenomenon is that before net channels were merged, packet processing on the receive side was done from one (virtio) thread and there was no contention on that lock. After the merge, packets started to be processed from many CPUs, which made taking the lock expensive. The new framework uses only per-timer locks; the worker is lock-free.

Below are measurements of the improvement. The measurements (both before and after) were taken with Nadav's per-CPU rcu_defer() improvement applied, because it was also a bottleneck. The value measured was the HTTP request-response throughput of a tomcat server as reported by the wrk tool. Server and client on different machines, 4 vCPUs, 3 GB of guest memory.

=== 16 connections ===
Before: avg = 39272.61, stdev = 3611.84
After:  avg = 52701.82, stdev = 3953.76
Improvement: 34%

=== 256 connections ===
Before: avg = 35225.19, stdev = 2504.27
After:  avg = 50576.67, stdev = 3533.39
Improvement: 43%

One challenge in integrating the new framework with the TCP stack was proper teardown of timers. The current code assumed that after calling callout_cancel() it is safe to free the timer's memory. This was not correct, because the timer may have already fired and would then try to access memory which has been freed. The TCP stack had a workaround for this race: each timer checked the inp field of the tcpcb block for NULL, which was supposed to indicate that the block had been freed. It still was not perfect, though, because the timer may have performed the check before the field was nulled out in tcp_discardcb, and then blocked on a mutex which would be promptly freed. The solution I went for is to delegate the release of memory to an async deferred task, which will be executed as soon as possible but in a safe context, in which we can wait until all timers are done and then free the memory.
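A rough sketch of that teardown approach; the async::run_later() helper, cancel_sync() and the member names are hypothetical placeholders, not the actual API added by this series:

    // Illustrative pattern only -- helper and member names are invented.
    struct tcp_cb {
        serial_timer_task* rexmt;      // retransmit timer
        serial_timer_task* keep;       // keepalive timer
        serial_timer_task* delack;     // delayed-ACK timer
    };

    void discard_tcp_cb(tcp_cb* cb)
    {
        async::run_later([cb] {        // runs "as soon as possible, but in a safe context"
            // Here we may sleep, so we can wait for timers that already fired
            // and are still running their callbacks, instead of racing them.
            cb->rexmt->cancel_sync();
            cb->keep->cancel_sync();
            cb->delack->cancel_sync();
            delete cb->rexmt;
            delete cb->keep;
            delete cb->delack;
            delete cb;                 // no timer can touch *cb anymore
        });
    }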
-
Tomasz Grabiec authored
The new async API accepts a lock of type 'mutex', so I need to convert the in_pcb lock type, which will be used to synchronize callbacks.
-
Tomasz Grabiec authored
-
Tomasz Grabiec authored
-
Tomasz Grabiec authored
This is a wrapper of timer_task which should be used if atomicity of callback tasks and timer operations is required. The class accepts an external lock to serialize all operations. It provides sufficient abstraction to replace callouts in the network stack. Unfortunately, it requires some cooperation from the callback code (see try_fire()). That's because I couldn't extract the in_pcb lock acquisition out of the callback code in the TCP stack, because there are other locks taken before it and doing so _could_ result in lock order inversion problems and hence deadlocks. If we can prove these to be safe, then the API could be simplified. It may also be worthwhile to propagate the lock passed to serial_timer_task down to timer_task to save an extra CAS.
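A sketch of the callback-side cooperation mentioned above (see try_fire()): the callback acquires the protecting in_pcb lock itself, possibly after other locks, and only then asks the timer whether this firing is still valid. The lock accessor and the callback name are illustrative, not taken from the patch:

    // Illustrative only -- inp_lock() and the callback name are made up.
    void tcp_timer_rexmt_cb(serial_timer_task& timer, struct inpcb* inp)
    {
        mutex& lock = inp_lock(inp);   // the external lock given to serial_timer_task
        WITH_LOCK(lock) {
            if (!timer.try_fire()) {   // cancelled or rescheduled while we waited
                return;                //   for the lock: this firing is stale
            }
            // ... the actual retransmit work, serialized by 'lock' ...
        }
    }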
-
Tomasz Grabiec authored
The design behind timer_task: timer_task was designed to make cancel() and reschedule() scale well with the number of threads and CPUs in the system. These methods may be called frequently and from different CPUs. A task scheduled on one CPU may be rescheduled later from another CPU. To avoid expensive coordination between CPUs, a lock-free per-CPU worker was implemented. Every CPU has a worker (async_worker) which has a task registry and a thread to execute them. Most of the worker's state may only be changed from the CPU on which it runs.

When a timer_task is rescheduled, it registers its percpu part in the current CPU's worker. When it is then rescheduled from another CPU, the previous registration is marked as no longer valid and a new percpu part is registered. When a percpu task fires, it checks if it is the last registration - only then can it fire.

Because timer_task's state is scattered across CPUs, some extra housekeeping needs to be done before it can be destroyed. We need to make sure that no percpu task will try to access the timer_task object after it is destroyed. To ensure that, we walk the list of registrations of a given timer_task and atomically flip their state from ACTIVE to RELEASED. If that succeeds, it means the task is now revoked and the worker will not try to execute it. If that fails, it means the task is in the middle of firing and we need to wait for it to finish. When the per-CPU task is moved to the RELEASED state, it is appended to the worker's queue of released percpu tasks using a lock-free MPSC queue. These objects may later be reused for registrations.
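A minimal, standard-C++ sketch of the ACTIVE -> RELEASED flip described above; the real registration objects, their per-CPU lists and the wait-for-firing path are more involved than this:

    #include <atomic>

    enum class reg_state { ACTIVE, FIRING, RELEASED };

    struct percpu_registration {
        std::atomic<reg_state> state{reg_state::ACTIVE};
    };

    // Returns true if the registration was revoked before it could fire.
    // false means the worker is firing it right now, and the caller must
    // wait for the callback to finish before destroying the timer_task.
    bool try_revoke(percpu_registration& reg)
    {
        auto expected = reg_state::ACTIVE;
        return reg.state.compare_exchange_strong(expected, reg_state::RELEASED);
    }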
-
Tomasz Grabiec authored
This can be useful when there's a need to perform operations on per-CPU structure(s) which all need to be executed on the same CPU, but there is code in between which may sleep (e.g. malloc). For example, this can be used to ensure that a dynamically allocated object is always freed on the same CPU on which it was allocated:

WITH_LOCK(migration_lock) {
    auto _owner = *percpu_owner;
    auto x = new X();
    _owner->enqueue(x);
}
-
Tomasz Grabiec authored
It is needed by the new async framework.
-
Pekka Enberg authored
Add an OpenJDK/OSv base image for developers who want to use Capstan to package and run their Java applications. Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Pekka Enberg authored
This adds our own memcached server to an OSv release that is pushed to the Capstan S3 repository. Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Nadav Har'El authored
When halt() is called very early, before smp_launch(), it crashes when calling crash_other_processors(), because the other processors' IDT was not yet set up. For example, in loader.cc's prepare_commands() we call abort() when we fail to parse the command line, and this caused a crash reported in issue #252. With this patch, crash_other_processors() does nothing when the other processors have not yet been set up. This is normally the case before smp_launch(), but note that on a single-vcpu VM, it will remain the case throughout the run. Fixes #252. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
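A sketch of the guard this fix adds; the flag name used here ('smp_started') is a placeholder for however OSv actually records that the application processors were brought up:

    // Illustrative only -- 'smp_started' is a made-up flag name.
    void crash_other_processors()
    {
        if (!smp_started) {
            // Before smp_launch() (and forever on a single-vCPU VM) the other
            // CPUs have no IDT set up, so sending them a crash IPI would itself
            // fault. There is nothing to do in that case.
            return;
        }
        // ... send the crash IPI to every other CPU, as before ...
    }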
-
- Mar 31, 2014
-
-
Tomasz Grabiec authored
You'll run into this when you have nested trace samples with undefined backtrace elements. Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Glauber Costa authored
Useful to figure out which thread is waiting on what. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Pekka Enberg authored
Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Raphael S. Carvalho authored
Signed-off-by:
Raphael S. Carvalho <raphaelsc@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Mar 30, 2014
-
-
Pekka Enberg authored
This adds a script for uploading an OSv release to the Capstan S3 bucket using "s3cmd sync". Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Pekka Enberg authored
This adds a script for building a Capstan repository for an OSv release with the following images:
- OSv core (no API, no CLI)
- Default (with API, CLI)
- Cassandra virtual appliance (with API)
- Tomcat virtual appliance (with API)
- Iperf virtual appliance (with API)
Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Pekka Enberg authored
This adds a script for creating Capstan-compatible images that can be uploaded to an S3 bucket for distribution. To use it to build a base image, for example, type:

$ ./scripts/build-capstan-img cloudius/osv-base empty "OSv base image"

which builds a Capstan image named "cloudius/osv-base" with "make image=empty" and places QEMU and VirtualBox images under "build/capstan/cloudius/osv-base" together with an index file. Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Takuya ASADA authored
Add a cursor update callback to libtsm, and update the cursor position register on the VGA device from that callback. Fixes #220. Signed-off-by:
Takuya ASADA <syuu@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
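For reference, a text-mode cursor position update goes through the VGA CRT controller's cursor-location registers; a sketch of the kind of callback this commit wires into libtsm (the port-I/O helper and the 80-column assumption are illustrative, and the real driver goes through OSv's console/VGA code rather than raw port I/O like this):

    #include <cstdint>

    static inline void outb(uint16_t port, uint8_t val)
    {
        asm volatile("outb %0, %1" : : "a"(val), "Nd"(port));
    }

    void vga_update_cursor(unsigned row, unsigned col)
    {
        uint16_t pos = row * 80 + col;   // 80-column text mode assumed
        outb(0x3D4, 0x0F);               // CRTC index: cursor location, low byte
        outb(0x3D5, pos & 0xFF);
        outb(0x3D4, 0x0E);               // CRTC index: cursor location, high byte
        outb(0x3D5, (pos >> 8) & 0xFF);
    }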
-
Dmitry Fleytman authored
Indirect descriptors for Xen PV disks are a newly introduced protocol feature that increases block device performance for long transfers. Detailed description: http://blog.xen.org/index.php/2013/08/07/indirect-descriptors-for-xen-pv-disks/

Measurement results (RAM-backed storage):

Xen w/o indirect descriptors:
1046.251 Mb/s
1097.931 Mb/s
1149.672 Mb/s
1137.696 Mb/s
1135.481 Mb/s
1249.163 Mb/s
1115.417 Mb/s
1118.063 Mb/s
Wrote 11344.750 MB in 10.00 s = 1134.165 Mb/s

Xen w/ indirect descriptors:
bdev-write test offset limit: 250000000 byte(s)
1234.715 Mb/s
1360.013 Mb/s
1323.663 Mb/s
1336.916 Mb/s
1342.617 Mb/s
1332.882 Mb/s
1302.094 Mb/s
Wrote 13233.250 MB in 10.04 s = 1318.639 Mb/s

KVM:
bdev-write test offset limit: 250000000 byte(s)
674.729 Mb/s
698.736 Mb/s
627.677 Mb/s
630.209 Mb/s
742.977 Mb/s
759.205 Mb/s
681.476 Mb/s
655.717 Mb/s
713.231 Mb/s
688.478 Mb/s
Wrote 6872.750 MB in 10.00 s = 687.124 Mb/s

Signed-off-by:
Dmitry Fleytman <dmitry@daynix.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Dmitry Fleytman authored
This is a refactoring commit to simplify future indirect descriptors code. Signed-off-by:
Dmitry Fleytman <dmitry@daynix.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Dmitry Fleytman authored
This is a refactoring commit to simplify future indirect descriptors code. Signed-off-by:
Dmitry Fleytman <dmitry@daynix.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Dmitry Fleytman authored
This is a refactoring commit that isolates some xenstore access logic to make it reusable. Signed-off-by:
Dmitry Fleytman <dmitry@daynix.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
Dmitry Fleytman authored
This introduces the misc-bdev-rw test. The test writes buffers of various lengths to a block device, reads the data back, and verifies that the data read is the same as the data written. Signed-off-by:
Dmitry Fleytman <dmitry@daynix.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
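The shape of that write/read-back/verify check, sketched with plain POSIX file I/O; the real test drives OSv's bdev layer with a range of buffer lengths, and the function name, pattern and device path here are only examples:

    #include <cstring>
    #include <fcntl.h>
    #include <unistd.h>
    #include <vector>

    // Write a deterministic pattern, read it back, and compare.
    bool write_read_verify(const char* dev, off_t offset, size_t len)
    {
        std::vector<char> wbuf(len), rbuf(len);
        for (size_t i = 0; i < len; ++i) {
            wbuf[i] = static_cast<char>((offset + i) * 7);
        }
        int fd = open(dev, O_RDWR);
        if (fd < 0) {
            return false;
        }
        bool ok = pwrite(fd, wbuf.data(), len, offset) == (ssize_t)len
               && fsync(fd) == 0
               && pread(fd, rbuf.data(), len, offset) == (ssize_t)len
               && std::memcmp(wbuf.data(), rbuf.data(), len) == 0;
        close(fd);
        return ok;
    }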
-
Dmitry Fleytman authored
The Xen block driver now supports a new feature called indirect descriptors. This feature allows putting more data into each ring cell, but it activates only for "long" reads and writes - longer than 11 pages. With this patch the test by default runs 2 scenarios:
* 1-page buffers
* 32-page buffers
Also introduced a command line parameter to specify the buffer size explicitly. Signed-off-by:
Dmitry Fleytman <dmitry@daynix.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>
-
- Mar 28, 2014
-
-
Zifei Tong authored
Add an explicit conversion from the bytes array returned by Popen.communicate() to a string. Use the Python 3 style print function. Signed-off-by:
Zifei Tong <zifeitong@gmail.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Mar 27, 2014
-
-
Raphael S. Carvalho authored
Remove the embedded driver, and start using the one from the drivers sub-system. This still relies on the manual creation of /etc/mnttab, as the upload manifest hasn't yet been processed. Reviewed-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Raphael S. Carvalho <raphaelsc@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Raphael S. Carvalho authored
Previously, the zfs device was provided only to allow the use of the commands needed to create the zpool, and thus the file system. At the time, doing so was quite enough; however, making the zfs device, i.e. /dev/zfs, part of every OSv instance allows us to use commands that help with analysing, debugging and tuning the zpool and the file systems it contains. The basic explanation is that those commands use libzfs, which in turn relies on /dev/zfs to communicate with the zfs code. Example commands: zpool, zfs, zdb - the last one not yet ported to OSv. This patch will also be helpful for the ongoing ztest porting. Reviewed-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Raphael S. Carvalho <raphaelsc@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Raphael S. Carvalho authored
/etc/mnttab is required by libzfs to run properly, so let's create it as an empty file. ryao from zfsonlinux and openzfs told me that an empty /etc/mnttab is used on Linux. Reading the libzfs code also shows that /etc/mnttab is mostly used for management of the file itself - nothing that would prevent libzfs functionality from working. Reviewed-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Raphael S. Carvalho <raphaelsc@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Pekka Enberg authored
The tst-dns-resolver.so test fails spuriously. Blacklist it until the problem is fixed, to keep Jenkins builds running. Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Pekka Enberg authored
-
Raphael S. Carvalho authored
The root dataset and ZFS are mounted during the mkfs phase, but they aren't unmounted afterwards. Running mkfs with the VERBOSE flag enabled shows the following:

Running mkfs...
VFS: mounting zfs at /zfs
zfs: mounting osv from device osv
VFS: mounting zfs at /zfs/zfs
zfs: mounting osv/zfs from device osv/zfs

The first mount happens when issuing:
{"zpool", "create", "-f", "-R", "/zfs", "osv", "/dev/vblk0.1"}, &ret);
It creates a pool called osv and mounts the root dataset at /zfs.

The latter mount happens when issuing:
{"zfs", "create", "osv/zfs"}
It creates a file system called zfs in the osv pool and automatically mounts it at the root dataset mountpoint.

No data inconsistency problem has been seen so far because both mkfs.so and cpiod.so do an explicit sync() at the end, thus ensuring everything is correctly flushed out to stable storage. There is an expression in Dutch that says: prevention is better than cure. Thus, this patch changes cpiod.so to unmount both mount points when the /zfs/zfs prefix was passed. It cannot be done in mkfs.so itself because cpiod.so is called afterwards in the same OSv instance. Signed-off-by:
Raphael S. Carvalho <raphaelsc@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Zifei Tong authored
Python 3 no longer allows implicit conversion from bytes to string; add an explicit decode() to convert the input bytes. Tested with both Python 2 and Python 3. Reviewed-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Zifei Tong <zifeitong@gmail.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
Gleb has noticed that the ARC buffers can go unshared too early. This happens because we call the UNMAP operation on every put(). That is certainly not what we want, since the buffer only has to be unshared when the last reference is gone.

Design decisions:
1) We obviously can't use the ARC's natural reference count for this, since bumping it would make the buffer unevictable.
2) We could modify the arc_buf structure itself to add another refcnt (minimum 4 bytes). However, I am trying to keep core-ZFS modifications to a minimum, and only to places where it is totally unavoidable.

Therefore, the solution is to add another hash, which hashes the whole buffer instead of the physaddr like the one we have currently. In terms of memory usage, it adds only 8 bytes per buffer (+/- 128k each buffer), which makes for a memory usage of 64k per mapped GB compared to the arc refcount solution. This is a good trade-off. I am also avoiding adding a new vop_map/unmap style operation just to query the buffer address from its file attributes (needed for the put side). Instead, I adopt the convention that an empty iovec means query, and a filled iovec means unshare. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
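A much-simplified illustration of the counting scheme described above: a separate map keyed by the buffer address tracks how many mappings reference it, and only the final put() triggers the actual unshare (locking and the ZFS/ARC integration are omitted, and all names are illustrative):

    #include <unordered_map>

    struct shared_buf_tracker {
        std::unordered_map<void*, unsigned> refs;   // keyed by the buffer address

        void get(void* buf) { ++refs[buf]; }

        // Returns true only for the last put(), i.e. when it is finally
        // safe to unshare/unmap the buffer.
        bool put(void* buf) {
            auto it = refs.find(buf);
            if (it == refs.end() || --it->second > 0) {
                return false;
            }
            refs.erase(it);
            return true;
        }
    };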
-
Zifei Tong authored
debugf() used to write the log message using the length of the format string. This caused messages to be wrongly truncated. Also change the confusing variable names: 'fmt' and 'msg' were swapped. Reviewed-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Zifei Tong <zifeitong@gmail.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
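A sketch of the bug pattern being fixed: after formatting, the number of bytes written must come from the formatted message, not from the format string. The buffer size and the log sink here are illustrative, not the actual OSv logging path:

    #include <cstdarg>
    #include <cstdio>
    #include <cstring>

    void debugf(const char* fmt, ...)
    {
        char msg[512];
        va_list ap;
        va_start(ap, fmt);
        vsnprintf(msg, sizeof(msg), fmt, ap);
        va_end(ap);
        // Buggy version: fwrite(msg, 1, strlen(fmt), stderr);  // truncates to the format's length
        fwrite(msg, 1, strlen(msg), stderr);                    // fixed: use the formatted length
    }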
-
Glauber Costa authored
Spotted by code review. Gleb had spotted one improper use of "i", but there was another. In this case we iterate over nothing, and i is always 0. It is uninitialized to begin with, and the code works only because it happens to be set to 0 by luck. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com>
-