  1. Apr 02, 2014
    • v3 RCU: Per-CPU rcu_defer() · e5fc1f1b
      Nadav Har'El authored
      
      Changes in v3, following Avi's review:
      * Use WITH_LOCK(migration_lock) instead of migrate_disable()/enable().
      * Make the global RCU "generation" counter a static class variable,
        instead of a static function variable. Rename it "next_generation"
        (the name "generation" was grossly overloaded previously).
      * In rcu_synchronize(), use migration_lock to be sure we wake up the
        thread to which we just added work.
      * Use thread_handle, instead of thread*, for percpu_quiescent_state_thread.
        This is safer (atomic variable, so we can't see it half-set on some
        esoteric CPU) and cleaner (no need to check t!=0). thread_handle is
        a bit of overkill here, but it's not in a performance-sensitive area.
      
      The existing rcu_defer() used a global list of deferred work, protected by
      a global mutex. It also woke up the cleanup thread on every call. These
      decisions made rcu_dispose() noticeably slower than a regular delete, to the
      point that when commit 70502950 introduced
      an rcu_dispose() to every poll() call, we saw performance of UDP memcached,
      which calls poll() on every request, drop by as much as 40%.
      
      The slowness of rcu_defer() was even more apparent in an artificial benchmark
      which repeatedly calls new and rcu_dispose from one or several concurrent
      threads. While on my machine a new/delete pair takes 24 ns, a new/rcu_dispose
      from a single thread (on a 4-CPU VM) takes a whopping 330 ns, and worse:
      when we have 4 threads on 4 CPUs in a tight new/rcu_dispose loop, the mutex
      contention, the fact that we free the memory on the "wrong" CPU, and the
      excessive context switches all bring the measurement to as much as 12,000 ns.
      
      With this patch the new/rcu_dispose numbers are down to 60 ns on a single
      thread (on 4 CPUs) and 111 ns on 4 concurrent threads (on 4 CPUs). This is
      a 5.5x to 120x speedup :-)
      
      This patch replaces the single list of functions with a per-cpu list.
      rcu_defer() can add more callbacks to this per-cpu list without a mutex,
      and instead of a single "garbage collection" thread running these callbacks,
      the per-cpu RCU thread, which we already had, is the one that runs the work
      deferred on this cpu's list. This per-cpu work is particularly effective
      for free() work (i.e., rcu_dispose()) because it is faster to free memory
      on the same CPU where it was allocated. This patch also eliminates the
      single "garbage collection" thread which the previous code needed.
      
      The per-CPU work queue has a fixed size, currently set to 2000 functions.
      It is actually a double-buffer, so we can continue to accumulate more work
      while cleaning up; if rcu_defer() is used so quickly that it outpaces the
      cleanup, rcu_defer() will wait until the buffer is no longer full.
      The choice of buffer size is a tradeoff between speed and memory: a larger
      buffer means fewer context switches (between the thread doing rcu_defer()
      and the RCU thread doing the cleanup), but also more memory temporarily
      being used by unfreed objects.
      
      Unlike the previous code, we do not wake up the cleanup thread after
      every rcu_defer(). When the RCU cleanup work is frequent but still small
      relative to the main work of the application (e.g., memcached server),
      the RCU cleanup thread would always have low runtime, which meant we suffered
      a context switch on almost every wakeup of this thread by rcu_defer().
      In this patch, we only wake up the cleanup thread when the buffer becomes
      full, so we have far fewer context switches. This means that currently
      rcu_defer() may delay the cleanup an unbounded amount of time. This is
      normally not a problem, and when it is, namely in rcu_synchronize(),
      we wake up the thread immediately.
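
      The mechanism described above can be illustrated with a minimal,
      self-contained C++ sketch (not OSv's actual code; the class and method
      names are invented): a double-buffered queue of deferred functions whose
      cleanup thread is woken only when the active buffer fills up, or when
      flush() is called, mirroring rcu_synchronize(). In OSv the producer side
      is per-CPU and needs no lock; the mutex below merely keeps this portable
      sketch correct.

        #include <condition_variable>
        #include <cstddef>
        #include <functional>
        #include <mutex>
        #include <thread>
        #include <vector>

        // Hypothetical stand-in for one CPU's deferred-work queue.
        class deferred_queue {
            static constexpr std::size_t max_size = 2000;  // same order as above
            std::vector<std::function<void()>> bufs[2];    // double buffer
            int active = 0;                                // buffer defer() fills
            bool stop = false;
            std::mutex mtx;              // stands in for "per-CPU, so no lock"
            std::condition_variable work_ready, space_free;
            std::thread cleaner{[this] { run(); }};        // per-CPU "RCU thread"
        public:
            void defer(std::function<void()> fn) {
                std::unique_lock<std::mutex> lk(mtx);
                // If we outpace the cleanup, wait until the buffer has room.
                space_free.wait(lk, [this] { return bufs[active].size() < max_size; });
                bufs[active].push_back(std::move(fn));
                if (bufs[active].size() == max_size) {
                    work_ready.notify_one();   // wake the cleaner only when full
                }
            }
            void flush() {                     // rcu_synchronize()-style wakeup
                std::lock_guard<std::mutex> lk(mtx);
                work_ready.notify_one();
            }
            ~deferred_queue() {
                { std::lock_guard<std::mutex> lk(mtx); stop = true; }
                work_ready.notify_one();
                cleaner.join();
            }
        private:
            void run() {
                std::unique_lock<std::mutex> lk(mtx);
                for (;;) {
                    work_ready.wait(lk, [this] {
                        return stop || !bufs[active].empty();
                    });
                    if (stop && bufs[active].empty()) {
                        return;
                    }
                    int full = active;
                    active = 1 - active;       // defer() keeps filling the other one
                    space_free.notify_all();
                    lk.unlock();
                    for (auto& fn : bufs[full]) {
                        fn();                  // run the deferred work (e.g. frees)
                    }
                    bufs[full].clear();
                    lk.lock();
                }
            }
        };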
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • Merge branch 'async' of https://github.com/tgrabiec/osv · 2271a65b
      Avi Kivity authored
      
      "After net channel merge in commit 2828ef50
      the performance of tomcat benchmark dropped significantly. Investigation
      revealed that the biggest bottleneck was the callout subsystem, which was
      using global mutex to protect its operations. This series improves
      the performance by replacing use of callouts inside the TCP stack with
      a new framework which is supposed to scale better."
      
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
  2. Apr 01, 2014
    • Revert "rcu: Per-CPU rcu_defer()" · 6d68d1ab
      Avi Kivity authored
      
      This reverts commit d24cda2c.  It requires migration_lock to be merged
      first.
      
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • rcu: Per-CPU rcu_defer() · d24cda2c
      Nadav Har'El authored
      
      The existing rcu_defer() used a global list of deferred work, protected by
      a global mutex. It also woke up the cleanup thread on every call. These
      decisions made rcu_dispose() noticeably slower than a regular delete, to the
      point that when commit 70502950 introduced
      an rcu_dispose() to every poll() call, we saw performance of UDP memcached,
      which calls poll() on every request, drop by as much as 40%.
      
      The slowness of rcu_defer() was even more apparent in an artificial benchmark
      which repeatedly calls new and rcu_dispose from one or several concurrent
      threads. While on my machine a new/delete pair takes 24 ns, a new/rcu_dispose
      from a single thread (on a 4-CPU VM) takes a whopping 330 ns, and worse:
      when we have 4 threads on 4 CPUs in a tight new/rcu_dispose loop, the mutex
      contention, the fact that we free the memory on the "wrong" CPU, and the
      excessive context switches all bring the measurement to as much as 12,000 ns.
      
      With this patch the new/rcu_dispose numbers are down to 60 ns on a single
      thread (on 4 CPUs) and 111 ns on 4 concurrent threads (on 4 CPUs). This is
      a 5.5x to 120x speedup :-)
      
      This patch replaces the single list of functions with a per-cpu list.
      rcu_defer() can add more callbacks to this per-cpu list without a mutex,
      and instead of a single "garbage collection" thread running these callbacks,
      the per-cpu RCU thread, which we already had, is the one that runs the work
      deferred on this cpu's list. This per-cpu work is particularly effective
      for free() work (i.e., rcu_dispose()) because it is faster to free memory
      on the same CPU where it was allocated. This patch also eliminates the
      single "garbage collection" thread which the previous code needed.
      
      The per-CPU work queue has a fixed size, currently set to 2000 functions.
      It is actually a double-buffer, so we can continue to accumulate more work
      while cleaning up; if rcu_defer() is used so quickly that it outpaces the
      cleanup, rcu_defer() will wait until the buffer is no longer full.
      The choice of buffer size is a tradeoff between speed and memory: a larger
      buffer means fewer context switches (between the thread doing rcu_defer()
      and the RCU thread doing the cleanup), but also more memory temporarily
      being used by unfreed objects.
      
      Unlike the previous code, we do not wake up the cleanup thread after
      every rcu_defer(). When the RCU cleanup work is frequent but still small
      relative to the main work of the application (e.g., memcached server),
      the RCU cleanup thread would always have low runtime, which meant we suffered
      a context switch on almost every wakeup of this thread by rcu_defer().
      In this patch, we only wake up the cleanup thread when the buffer becomes
      full, so we have far fewer context switches. This means that currently
      rcu_defer() may delay the cleanup an unbounded amount of time. This is
      normally not a problem, and when it is, namely in rcu_synchronize(),
      we wake up the thread immediately.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • net: replace callouts with the new async framework · 782de281
      Tomasz Grabiec authored
      The callout subsystem uses a shared global lock for most of its
      operations. This became a bottleneck after merging net channels.
      The explanation for this phenomenon is that before net channels
      were merged, packet processing on the receive side was done from one
      (virtio) thread and there was no contention on that lock. After the
      merge the packets started to be processed from many CPUs, which made
      taking the lock expensive.
      
      The new framework uses only per-timer locks; the worker is lock-free.
      
      Below are measurements of the improvement. The measurements (both
      before and after) were taken with Nadav's per-CPU rcu_defer()
      improvement applied, because it was also a bottleneck.
      
      The value measured was the HTTP request-response throughput of a tomcat
      server, as reported by the wrk tool. Server and client ran on different
      machines, with 4 vCPUs and 3 GB of guest memory.
      
      === 16 connections ===
      
      Before:
      
        avg = 39272.61
        stdev = 3611.84
      
      After:
      
        avg = 52701.82
        stdev = 3953.76
      
      Improvement: 34%
      
      === 256 connections ===
      
      Before:
      
       avg = 35225.19
       stdev = 2504.27
      
      After:
      
       avg = 50576.67
       stdev = 3533.39
      
      Improvement: 43%
      
      One challenge in integrating the new framework with the TCP stack was a
      proper teardown of timers. The current code assumed that after calling
      callout_cancel() it is safe to free the timer's memory. This was not
      correct because the timer may have already fired and would then try to
      access memory which had been freed. The TCP stack had a workaround for
      this race: each timer checked the inp field of the tcpcb block for
      NULL, which was supposed to indicate that the block had been freed. It
      still was not perfect, though, because the timer may have performed the
      check before the field was nulled out in tcp_discardcb() and then
      blocked on a mutex which would be promptly freed. The solution I went
      for is to delegate the release of memory to an async deferred task,
      which will be executed as soon as possible but in a safe context, in
      which we can wait until all timers are done and then free the
      memory.
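
      The teardown pattern can be sketched in a self-contained way (the names
      below are invented for illustration; this is not the OSv/BSD code): the
      deferred release task marks the timers cancelled, waits for any callback
      that is already running, and only then frees the memory.

        #include <condition_variable>
        #include <mutex>

        // Hypothetical per-connection timer bookkeeping.
        struct cb_timers {
            std::mutex mtx;
            std::condition_variable idle;
            int running = 0;        // callbacks currently executing
            bool cancelled = false;

            // The timer thread wraps the real callback with this.
            template <typename F>
            void fire(F&& callback) {
                {
                    std::lock_guard<std::mutex> lk(mtx);
                    if (cancelled) {
                        return;     // revoked before we started: do nothing
                    }
                    ++running;
                }
                callback();
                {
                    std::lock_guard<std::mutex> lk(mtx);
                    --running;
                }
                idle.notify_all();
            }

            // Runs in the deferred release task, which may sleep, so it can wait.
            void cancel_and_drain() {
                std::unique_lock<std::mutex> lk(mtx);
                cancelled = true;
                idle.wait(lk, [this] { return running == 0; });
            }
        };

        // The deferred task then does, in order:
        //   timers->cancel_and_drain();  // no callback can still be running
        //   delete cb;                   // only now is it safe to free the block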
    • net: convert in_pcb lock from struct mtx to mutex · e70b0276
      Tomasz Grabiec authored
      The new async API accepts a lock of type 'mutex', so I need to convert
      the in_pcb lock type, which will be used to synchronize callbacks.
    • net: remove dead code · a87583d1
      Tomasz Grabiec authored
    • net: add tracepoints for inpcb life cycle · d7fa401b
      Tomasz Grabiec authored
    • core: introduce serial_timer_task · bd179712
      Tomasz Grabiec authored
      This is a wrapper around timer_task which should be used if atomicity of
      callback tasks and timer operations is required. The class accepts an
      external lock to serialize all operations. It provides sufficient
      abstraction to replace callouts in the network stack.
      
      Unfortunately, it requires some cooperation from the callback code
      (see try_fire()). That's because I couldn't extract the in_pcb lock
      acquisition out of the callback code in the TCP stack: there are
      other locks taken before it, and doing so _could_ result in lock-order
      inversion problems and hence deadlocks. If we can prove these to be
      safe then the API could be simplified.
      
      It may also be worthwhile to propagate the lock passed to
      serial_timer_task down to timer_task to save an extra CAS.
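
      As an illustration of that cooperation (only serial_timer_task and
      try_fire() come from this series; the callback shape below is assumed),
      a TCP timer callback takes the same external lock that was handed to the
      wrapper and bails out if try_fire() reports that this invocation is no
      longer current:

        // Hypothetical callback: 'lock' is the external lock given to the
        // serial_timer_task wrapper, here the in_pcb lock mentioned above.
        static void tcp_timer_fired(serial_timer_task& timer, mutex& lock)
        {
            WITH_LOCK(lock) {
                if (!timer.try_fire()) {
                    // Cancelled or rescheduled concurrently; this invocation
                    // must not do the timer's work.
                    return;
                }
                // ... the actual timer work, serialized by 'lock' ...
            }
        }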
    • core: introduce deferred work framework · 34620ff0
      Tomasz Grabiec authored
      The design behind timer_task
      
      timer_task was designed to make cancel() and reschedule() scale well
      with the number of threads and CPUs in the system. These methods may
      be called frequently and from different CPUs. A task scheduled on one
      CPU may be rescheduled later from another CPU. To avoid expensive
      coordination between CPUs, a lock-free per-CPU worker was implemented.
      
      Every CPU has a worker (async_worker) which has a task registry and a
      thread to execute the tasks. Most of the worker's state may only be
      changed from the CPU on which it runs.
      
      When a timer_task is rescheduled it registers its percpu part in the
      current CPU's worker. When it is then rescheduled from another CPU, the
      previous registration is marked as no longer valid and a new percpu part
      is registered. When a percpu task fires it checks whether it is the last
      registration; only then can it fire.
      
      Because timer_task's state is scattered across CPUs, some extra
      housekeeping needs to be done before it can be destroyed.  We need to
      make sure that no percpu task will try to access the timer_task object
      after it is destroyed. To ensure that, we walk the list of
      registrations of the given timer_task and atomically flip their state
      from ACTIVE to RELEASED. If that succeeds, it means the task is now
      revoked and the worker will not try to execute it. If that fails, it
      means the task is in the middle of firing and we need to wait for it to
      finish. When a per-CPU task is moved to the RELEASED state it is
      appended to the worker's queue of released percpu tasks using a
      lock-free mpsc queue. These objects may later be reused for
      registrations.
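
      The revocation step can be sketched with standard atomics (the state
      names come from the description above; the struct and method names are
      invented for illustration): a single compare-and-swap decides whether
      the revoker or the firing worker owns a registration.

        #include <atomic>

        enum class reg_state { ACTIVE, FIRING, RELEASED };

        // Hypothetical per-CPU registration of a timer_task.
        struct percpu_registration {
            std::atomic<reg_state> state{reg_state::ACTIVE};

            // Called while destroying the timer_task. Returns true if the
            // registration was revoked before it fired; false means the worker
            // is firing it right now and the caller must wait for it to finish.
            bool try_revoke() {
                auto expected = reg_state::ACTIVE;
                return state.compare_exchange_strong(expected, reg_state::RELEASED);
            }

            // Called by the worker just before it executes the task.
            bool start_firing() {
                auto expected = reg_state::ACTIVE;
                return state.compare_exchange_strong(expected, reg_state::FIRING);
            }
        };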
    • sched: introduce thread migration lock · 6c8a861d
      Tomasz Grabiec authored
      This can be useful when there's a need to perform operations on
      per-CPU structure(s) which all need to be executed on the same CPU, but
      there is code in between which may sleep (e.g. malloc).
      
      For example, this can be used to ensure that a dynamically allocated
      object is always freed on the same CPU on which it was allocated:
      
        WITH_LOCK(migration_lock) {
          auto _owner = *percpu_owner;
          auto x = new X();
          _owner->enqueue(x);
        }
    • sched: add atomic reset() operation to timer_base · b122a924
      Tomasz Grabiec authored
      It is needed by the new async framework.
    • build-osv-release: OpenJDK/OSv base image · 44d8450a
      Pekka Enberg authored
      
      Add an OpenJDK/OSv base image for developers who want to use Capstan to
      package and run their Java applications.
      
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • build-osv-release: OSv memcached server · 16d13511
      Pekka Enberg authored
      
      This adds our own memcached server to an OSv release that is pushed to
      the Capstan S3 repository.
      
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • x64: fix early halt() · 8192b61a
      Nadav Har'El authored
      
      When halt() is called very early, before smp_launch(), it crashes when
      calling crash_other_processors() because the other processors' IDT was
      not yet set up. For example, in loader.cc's prepare_commands() we call
      abort() when we fail to parse the command line, and this caused a
      crash reported in issue #252.
      
      With this patch, crash_other_processors() does nothing when the other
      processors have not yet been set up. This is normally the case before
      smp_launch(), but note that on a single-vCPU VM it will remain the case
      throughout the run.
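
      A hedged sketch of the guard (the flag name is assumed; the exact
      condition the patch tests is not shown in this log):

        void crash_other_processors()
        {
            // Before smp_launch() the APs have no IDT set up, so sending them
            // an IPI would fault; in that case there is nothing to crash.
            if (!smp_started) {      // hypothetical "APs are up" flag
                return;
            }
            // ... send the crash IPI to the other CPUs, as before ...
        }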
      
      Fixes #252.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>