  1. Nov 26, 2013
    • libc: implement the GNU variant of strerror_r() · 7053ac3a
      Avi Kivity authored
      
      We previously had the POSIX variant only.  Implement the GNU variant as well,
      and update the header to point to the correct function based on the dialect
      selected.
      
      The POSIX variant is renamed to __xpg_strerror_r() to conform to the ABI
      standards.
      
      This fixes calls to strerror_r() from binaries which were compiled with
      _GNU_SOURCE (libboost_system.a) but preserves the correct behaviour for
      BSD-derived source.
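
      As a rough illustration only (a sketch of the dialect split, not OSv's
      actual libc header), the two variants and the dispatch look like this:

        #include <stddef.h>

        #ifdef _GNU_SOURCE
        /* GNU dialect: returns a pointer to the message, which may be buf or
           an immutable static string. */
        char *strerror_r(int errnum, char *buf, size_t buflen);
        #else
        /* POSIX dialect: writes the message into buf and returns 0 or an
           error number.  The POSIX entry point keeps its ABI name: */
        int __xpg_strerror_r(int errnum, char *buf, size_t buflen);
        #define strerror_r __xpg_strerror_r
        #endif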
      
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • build: source dialect control · 3e1e86c4
      Avi Kivity authored
      
      Some functions (strerror_r()) are defined differently based on the source
      dialect.  We need to provide both dialects since we have mixed source.
      
      Add a source-dialect macro (defaulting to _GNU_SOURCE) and override it
      as appropriate.
      
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • sched: Doxygen documentation of a bit of the scheduler · 6f825816
      Nadav Har'El authored
      
      Started adding Doxygen documentation for the scheduler. Currently
      only set_priority() and priority() are documented.
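
      For flavour, a hypothetical sketch of the Doxygen style (not the exact
      comment text that was added; the parameter type shown is illustrative):

        /**
         * \brief Change this thread's scheduling priority.
         *
         * (Sketch only; the wording and details of the real comment differ.)
         * \param priority the new priority value for this thread.
         */
        void set_priority(float priority);

        /** \brief Return this thread's current scheduling priority. */
        float priority() const;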
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • sched: New scheduler algorithm · dbc0d507
      Nadav Har'El authored
      This patch replaces the algorithm which the scheduler uses to keep track of
      threads' runtime, and to choose which thread to run next and for how long.
      
      The previous algorithm used the raw cumulative runtime of a thread as its
      runtime measure. But comparing these numbers directly was impossible: e.g.,
      should a thread that slept for an hour now get an hour of uninterrupted CPU
      time? This resulted in a hodgepodge of heuristics which "modified" and
      "fixed" the runtime. These heuristics did work quite well in our test cases,
      but we were forced to add more and more unjustified heuristics and constants
      to fix scheduling bugs as they were discovered. The existing scheduler was
      especially problematic with thread migration (moving a thread from one CPU
      to another) as the runtime measure on one CPU was meaningless in another.
      This bug, if not corrected (e.g., by the patch which I sent a month
      ago), could cause crucial threads to acquire exceedingly high runtimes by
      mistake, and resulted in the tst-loadbalance test using only one CPU on
      a two-CPU guest.
      
      The new scheduling algorithm follows a much more rigorous design,
      proposed by Avi Kivity in:
      https://docs.google.com/document/d/1W7KCxOxP-1Fy5EyF2lbJGE2WuKmu5v0suYqoHas1jRM/edit?usp=sharing
      
      
      
      To make a long story short (read the document if you want all the
      details), the new algorithm is based on a runtime measure R which
      is the running decaying average of the thread's running time.
      It is a decaying average in the sense that the thread's act of running or
      sleeping in recent history is given more weight than its behavior
      a long time ago. This measure R can tell us which of the runnable
      threads to run next (the one with the lowest R), and using some
      highschool-level mathematics, we can calculate for how long to run
      this thread until it should be preempted by the next one. R carries
      the same meaning on all CPUs, so CPU migration becomes trivial.
      
      The actual implementation uses a normalized version of R, called R''
      (Rtt in the code), which is also explained in detail in the document.
      This Rtt allows updating just the running thread's runtime - not all
      threads' runtime - as time passes, making the whole calculation much
      more tractable.
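
      As a rough illustration only (hypothetical names; the real code works
      with the normalized Rtt and uses approximations), an exponentially
      decaying runtime average can be maintained like this:

        #include <cmath>

        struct runtime_measure {
            static constexpr double tau = 0.2;  // decay time constant (seconds)
            double R = 0;                       // decaying average of running time

            // Account for an interval of dt seconds in which the thread either
            // ran continuously or slept continuously.
            void update(double dt, bool running) {
                double decay = std::exp(-dt / tau);
                R *= decay;                     // old history fades away
                if (running)
                    R += tau * (1 - decay);     // recent running time is added
            }
        };

        // Among runnable threads, the one with the lowest R runs next.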
      
      The benefits of the new scheduler code over the existing one are:
      
      1. A more rigorous design with fewer unjustified heuristics.
      
      2. A thread's runtime measurement correctly survives a migration to a
      different CPU, unlike the existing code (which sometimes botches
      it up, leading to threads hanging). In particular, tst-loadbalance
      now gives good results for the "intermittent thread" test, unlike
      the previous code which in 50% of the runs caused one CPU to be
      completely wasted (when the load-balancing thread hung).
      
      3. The new algorithm can look at a much longer runtime history than the
      previous algorithm did. With the default tau=200ms, the one-cpu
      intermittent thread test of tst-scheduler now provides good
      fairness for sleep durations of 1ms-32ms.
      The previous algorithm was never fair in any of those tests.
      
      4. The new algorithm is more deterministic in its use of timers
      (with thyst=2_ms: up to 500 timers a second), resulting in less
      varied performance in high-context-switch benchmarks like tst-ctxsw.
      
      This scheduler does very well on the fairness tests tst-scheduler and
      fairly well on tst-loadbalance. Even better performance on that second
      test will require an additional patch for the idle thread to wake other
      CPUs' load-balancing threads.
      
      As expected the new scheduler is somewhat slower than the existing one
      (as we now do some relatively complex calculations instead of trivial
      integer operations), but thanks to using approximations when possible
      and to various other optimizations, the difference is relatively small:
      
      On my laptop, tst-ctxsw.so measures "context switch" time (actually also
      including the time spent on the mutex and condvar which this test uses to
      cause context switching). On the "colocated" test I measured 355 ns with
      the old scheduler and 382 ns with the new scheduler, meaning that the
      new scheduler adds 27 ns of overhead to every context switch. To see that
      this penalty is minor, consider that tst-ctxsw is an extreme example,
      doing 3 million context switches a second, and even there it only slows
      down the workload by 7%.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • sched: No need for "yield" parameter of schedule() · e1722351
      Nadav Har'El authored
      
      The schedule() and cpu::schedule() functions had a "yield" parameter.
      This parameter was inconsistently used (it's not clear why specific
      places called it with "true" and other with "false"), but moreover, was
      always ignored!
      
      So this patch removes the parameter of schedule(). If you really want
      a yield, call yield(), not schedule().
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • sched: Use schedule(), not yield() in idle thread · da583f27
      Nadav Har'El authored
      
      The idle thread cpu::idle() waits for other threads to become runnable,
      and then lets them run. It used to yield the CPU by calling yield(),
      because in early OSv history we didn't have an idle priority so simply
      calling schedule() would not guarantee that the new thread, not the idle
      thread, will run.
      
      But now we actually do have an idle priority; if the run queue is not
      empty, we are sure that calling schedule() will run another thread,
      not the idle thread. So this patch calls schedule(), which is simpler,
      faster, and more reliable than yield().
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • sched: Don't change runtime of a queued thread · e60ebaf3
      Nadav Har'El authored
      
      The scheduler (reschedule_from_interrupt()) changes the runtime of the
      current thread. This assumes that the current thread is not in the
      runqueue - because the runqueue is sorted by runtime, and modifying the
      runtime of a thread which is already in the runqueue ruins the sorted
      tree's invariants.
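
      The invariant at stake, in generic terms (an illustrative sketch, not
      OSv's actual runqueue type):

        #include <set>

        struct thread { double runtime; };
        struct by_runtime {
            bool operator()(const thread* a, const thread* b) const {
                return a->runtime < b->runtime;
            }
        };
        using runqueue_t = std::multiset<thread*, by_runtime>;

        // A sorted container keyed on runtime assumes the key never changes
        // while an element is linked in, so the runtime may only be modified
        // while the thread is unlinked:
        void add_runtime(runqueue_t& rq, runqueue_t::iterator it, double delta) {
            thread* t = *it;
            rq.erase(it);          // unlink first...
            t->runtime += delta;   // ...then it is safe to modify the key...
            rq.insert(t);          // ...and re-insert at the right position.
        }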
      
      Unfortunately, the existing code broke this assumption in two places:
      
      1.  When handle_incoming_wakeups() wakes up the current thread (i.e., a
      thread that prepared to wait but was woken before it could go to sleep),
      the current thread was queued. Instead, we need to simply return
      the thread to the "running" state.
      
      2.  yield() queued the current thread. Rather, it needs to just change
      its runtime, and reschedule_from_interrupt() will decide to queue this
      thread.
      
      This patch fixes the first problem. The second problem will be solved
      by a yield() rewrite which is part of the new scheduler in a later
      patch.
      
      By the way, after we fix both problems, we can also be sure that the
      strange if(n != thread::current()) in the scheduler is always true.
      This is because n, picked up from the run queue, could never be the
      current thread, because the current thread isn't in the run queue.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
    • test.py: add utimes.so · 64d97a06
      Pekka Enberg authored
      
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • tests: Add tst-utimes · 2570b30c
      Raphael S. Carvalho authored
      
      v2: Let's convert everything to std::chrono::timepoint (Avi Kivity)
      v3: Use the to_timeptr approach suggested by Nadav Har'El
      
      This test checks the functionality of the utimes support.
      
      Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • vfs: Add the utimes system call · 832bba6e
      Raphael S. Carvalho authored
      
      v2: Check limit of microseconds, among other minor changes (Nadav Har'El, Avi Kivity).
      v3: Get rid of goto & label by adding an else clause (Nadav Har'El).
      
      - This patch adds utimes support.
      - This patch addresses issue #93.
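
      For reference, a minimal usage sketch of the interface being added (a
      generic libc-level example; the file name is made up, and the [0, 999999]
      microsecond range is the usual POSIX limit):

        #include <sys/time.h>
        #include <stdio.h>

        int main() {
            struct timeval times[2];
            times[0].tv_sec = 1000000; times[0].tv_usec = 0;       // access time
            times[1].tv_sec = 1000000; times[1].tv_usec = 500000;  // modification time
            // tv_usec must stay within [0, 999999]; otherwise the call fails.
            if (utimes("/tmp/somefile", times) != 0)
                perror("utimes");
            return 0;
        }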
      
      Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Tested-by: Tomasz Grabiec <tgrabiec@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • vfs: Unify attribute flags into a common place · 1519d3d1
      Raphael S. Carvalho authored
      
      Attribute flags were moved from 'bsd/sys/cddl/compat/opensolaris/sys/vnode.h'
      to 'include/osv/vnode_attr.h'
      
      'bsd/sys/cddl/compat/opensolaris/sys/vnode.h' now includes 'include/osv/vnode_attr.h'
      exactly at the place the flags were previously located.
      
      'fs/vfs/vfs.h' includes 'include/osv/vnode_attr.h' because functions that rely on the setattr
      feature must specify the flags corresponding to the attr fields that are going to be changed.
      
      Approach suggested by Nadav Har'El.
      
      Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Tested-by: Tomasz Grabiec <tgrabiec@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • devfs/ramfs: Change vop_null to vop_eperm · fe3c7df0
      Raphael S. Carvalho authored
      
      Use vop_eperm instead to warn the caller about the lack of support (Glauber Costa).
      
      Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Tested-by: Tomasz Grabiec <tgrabiec@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • Warn about incorrect use of percpu<> / PERCPU(..). · 8add1b91
      Nadav Har'El authored
      
      This patch causes incorrect usage of percpu<>/PERCPU() to cause
      compilation errors instead of silent runtime corruptions.
      
      Thanks to Dmitry for first noticing this issue in xen_intr.cc (see his
      separate patch), and to Avi for suggesting a compile-time fix.
      
      With this patch:
      
      1. Using percpu<...> to *define* a per-cpu variable fails compilation.
         Instead, PERCPU(...) must be used for the definition, which is important
         because it places the variable in the ".percpu" section.
      
      2. If a *declaration* is needed additionally (e.g., for a static class
         member), percpu<...> must be used, not PERCPU().
         Trying to use PERCPU() for declaration will cause a compilation error.
      
      3. PERCPU() only works on statically-constructed objects - global variables,
         static function-variables and static class-members. Trying to use it
         on a dynamically-constructed object - stack variable, class field,
         or operator new - will cause a compilation error.
      
      With this patch, the bug in xen_intr.cc would have been caught at
      compile time.
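
      A minimal usage sketch (the variable is hypothetical, and the exact
      macro spelling should be checked against OSv's percpu header):

        #include <osv/percpu.hh>   // assumed location of percpu<>/PERCPU()

        // In a header, a *declaration* uses percpu<> (e.g. a static class member):
        //     static percpu<long> packets_received;

        // In exactly one .cc file, the *definition* must use PERCPU() so the
        // object is placed in the ".percpu" section:
        PERCPU(long, packets_received);

        // PERCPU() only works for statically-constructed objects; using it on
        // a stack variable, a class field or operator new fails to compile.
        void count_packet()
        {
            ++(*packets_received);   // access the current CPU's copy
        }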
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • xen: move per-cpu interrupt threads to .percpu section · 63d2e472
      Dmitry Fleytman authored
      
      The bug fixed by this patch made OSv crash on Xen during boot.
      The problem started to show up after commit:
      
        commit ed808267
        Author: Nadav Har'El <nyh@cloudius-systems.com>
        Date:   Mon Nov 18 23:01:09 2013 +0200
      
            percpu: Reduce size of .percpu section
      
      Signed-off-by: Dmitry Fleytman <dmitry@daynix.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
  2. Nov 25, 2013
    • release-ec2: Introduce image and version parameters · cf482bcc
      Dmitry Fleytman authored
      
      This feature will be used to release images
      with preinstalled applications.
      
      Signed-off-by: Dmitry Fleytman <dmitry@daynix.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • Start up shell and management web in parallel · c29222c6
      Amnon Heiman authored
      
      Start up shell and management web in parallel to make boot faster.  Note
      that we also switch to latest mgmt.git which decouples JRuby and CRaSH
      startup.
      
      Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • java: Support for loading multiple mains · 10d6f18b
      Amnon Heiman authored
      
      When using the MultiJarLoader as the main class, it will use a
      configuration file for the java loading.  Each line in the file will be
      used to start a main; you can use -jar in each line or specify a main
      class.
      
      Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
      Reviewed-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • tests: mincore() tests for demand paging · 20aad632
      Pekka Enberg authored
      
      As suggested by Nadav, add tests for mincore() interaction with demand
      paging.
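
      The kind of check such a test makes, in sketch form (illustrative only,
      not the test code itself):

        #include <sys/mman.h>
        #include <cstddef>
        #include <cassert>

        int main() {
            size_t len = 4096;
            void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            unsigned char vec[1];

            // With demand paging, a freshly mmap'd page is not yet resident...
            mincore(p, len, vec);
            assert(!(vec[0] & 1));

            // ...but becomes resident once it is touched and faulted in.
            static_cast<char*>(p)[0] = 1;
            mincore(p, len, vec);
            assert(vec[0] & 1);

            munmap(p, len);
            return 0;
        }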
      
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • tests: Anonymous demand paging microbenchmark · d4bcf559
      Pekka Enberg authored
      
      This adds a simple mmap microbenchmark that can be run on both OSv and
      Linux.  The benchmark mmaps memory for various sizes and touches the
      mmap'd memory in 4K increments to fault in memory.  The benchmark also
      repeats the same tests using MAP_POPULATE for reference.
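
      Roughly, the measured loop looks like this sketch (illustrative, not the
      exact test source):

        #include <sys/mman.h>
        #include <chrono>
        #include <cstdio>

        // Map len bytes, touch every 4K page to fault the memory in, and
        // return the elapsed time in seconds.
        static double touch_pages(size_t len, int extra_flags) {
            auto start = std::chrono::steady_clock::now();
            char* p = static_cast<char*>(mmap(nullptr, len, PROT_READ | PROT_WRITE,
                                              MAP_PRIVATE | MAP_ANONYMOUS | extra_flags,
                                              -1, 0));
            for (size_t off = 0; off < len; off += 4096)
                p[off] = 1;                          // one write per page
            munmap(p, len);
            return std::chrono::duration<double>(
                std::chrono::steady_clock::now() - start).count();
        }

        int main() {
            for (size_t mib = 1; mib <= 1024; mib *= 2) {
                size_t len = mib << 20;
                std::printf("%5zu %.3f  %.3f\n", mib,
                            touch_pages(len, 0),             // demand paging
                            touch_pages(len, MAP_POPULATE)); // pre-populated
            }
            return 0;
        }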
      
      OSv page faults are slightly slower than Linux on the first iteration but
      faster on subsequent iterations, after the host operating system has
      faulted in memory for the guest.
      
      I've included full numbers on a 2-core Sandy Bridge i7 for an OSv guest, a
      Linux guest, and the Linux host below:
      
        OSv guest
        ---------
      
        Iteration 1
      
             time (seconds)
         MiB demand populate
           1 0.004  0.000
           2 0.000  0.000
           4 0.000  0.000
           8 0.001  0.000
          16 0.003  0.000
          32 0.007  0.000
          64 0.013  0.000
         128 0.024  0.000
         256 0.052  0.001
         512 0.229  0.002
        1024 0.587  0.005
      
        Iteration 2
      
             time (seconds)
         MiB demand populate
           1 0.001  0.000
           2 0.000  0.000
           4 0.000  0.000
           8 0.001  0.000
          16 0.002  0.000
          32 0.004  0.000
          64 0.010  0.000
         128 0.019  0.001
         256 0.036  0.001
         512 0.069  0.002
        1024 0.137  0.005
      
        Iteration 3
      
             time (seconds)
         MiB demand populate
           1 0.001  0.000
           2 0.000  0.000
           4 0.000  0.000
           8 0.001  0.000
          16 0.002  0.000
          32 0.005  0.000
          64 0.010  0.000
         128 0.020  0.000
         256 0.039  0.001
         512 0.087  0.002
        1024 0.138  0.005
      
        Iteration 4
      
             time (seconds)
         MiB demand populate
           1 0.001  0.000
           2 0.000  0.000
           4 0.000  0.000
           8 0.001  0.000
          16 0.002  0.000
          32 0.004  0.000
          64 0.012  0.000
         128 0.025  0.001
         256 0.040  0.001
         512 0.082  0.002
        1024 0.138  0.005
      
        Iteration 5
      
             time (seconds)
         MiB demand populate
           1 0.001  0.000
           2 0.000  0.000
           4 0.000  0.000
           8 0.001  0.000
          16 0.002  0.000
          32 0.004  0.000
          64 0.012  0.000
         128 0.028  0.001
         256 0.040  0.001
         512 0.082  0.002
        1024 0.166  0.005
      
        Linux guest
        -----------
      
        Iteration 1
      
             time (seconds)
         MiB demand populate
           1 0.001  0.000
           2 0.001  0.000
           4 0.002  0.000
           8 0.003  0.000
          16 0.005  0.000
          32 0.008  0.000
          64 0.015  0.000
         128 0.151  0.001
         256 0.090  0.001
         512 0.266  0.003
        1024 0.401  0.006
      
        Iteration 2
      
             time (seconds)
         MiB demand populate
           1 0.000  0.000
           2 0.000  0.000
           4 0.001  0.000
           8 0.001  0.000
          16 0.002  0.000
          32 0.005  0.000
          64 0.009  0.000
         128 0.019  0.001
         256 0.037  0.001
         512 0.072  0.003
        1024 0.144  0.006
      
        Iteration 3
      
             time (seconds)
         MiB demand populate
           1 0.000  0.000
           2 0.001  0.000
           4 0.001  0.000
           8 0.001  0.000
          16 0.002  0.000
          32 0.005  0.000
          64 0.010  0.000
         128 0.019  0.001
         256 0.037  0.001
         512 0.072  0.003
        1024 0.143  0.006
      
        Iteration 4
      
             time (seconds)
         MiB demand populate
           1 0.000  0.000
           2 0.001  0.000
           4 0.001  0.000
           8 0.001  0.000
          16 0.003  0.000
          32 0.005  0.000
          64 0.010  0.000
         128 0.020  0.001
         256 0.038  0.001
         512 0.073  0.003
        1024 0.143  0.006
      
        Iteration 5
      
             time (seconds)
         MiB demand populate
           1 0.000  0.000
           2 0.001  0.000
           4 0.001  0.000
           8 0.001  0.000
          16 0.003  0.000
          32 0.005  0.000
          64 0.010  0.000
         128 0.020  0.001
         256 0.037  0.001
         512 0.072  0.003
        1024 0.144  0.006
      
        Linux host
        ----------
      
        Iteration 1
      
             time (seconds)
         MiB demand populate
           1 0.000  0.000
           2 0.001  0.000
           4 0.001  0.000
           8 0.001  0.000
          16 0.002  0.000
          32 0.005  0.000
          64 0.009  0.000
         128 0.019  0.001
         256 0.035  0.001
         512 0.152  0.003
        1024 0.286  0.011
      
        Iteration 2
      
             time (seconds)
         MiB demand populate
           1 0.000  0.000
           2 0.000  0.000
           4 0.001  0.000
           8 0.001  0.000
          16 0.002  0.000
          32 0.004  0.000
          64 0.010  0.000
         128 0.018  0.001
         256 0.035  0.001
         512 0.192  0.003
        1024 0.334  0.011
      
        Iteration 3
      
             time (seconds)
         MiB demand populate
           1 0.000  0.000
           2 0.000  0.000
           4 0.001  0.000
           8 0.001  0.000
          16 0.002  0.000
          32 0.004  0.000
          64 0.010  0.000
         128 0.018  0.001
         256 0.035  0.001
         512 0.194  0.003
        1024 0.329  0.011
      
        Iteration 4
      
             time (seconds)
         MiB demand populate
           1 0.000  0.000
           2 0.000  0.000
           4 0.001  0.000
           8 0.001  0.000
          16 0.002  0.000
          32 0.004  0.000
          64 0.010  0.000
         128 0.018  0.001
         256 0.036  0.001
         512 0.138  0.003
        1024 0.341  0.011
      
        Iteration 5
      
             time (seconds)
         MiB demand populate
           1 0.000  0.000
           2 0.000  0.000
           4 0.001  0.000
           8 0.001  0.000
          16 0.002  0.000
          32 0.004  0.000
          64 0.010  0.000
         128 0.018  0.001
         256 0.035  0.001
         512 0.135  0.002
        1024 0.324  0.011
      
      Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • mmu: Anonymous memory demand paging · c1d5fccb
      Pekka Enberg authored
      
      Switch to demand paging for anonymous virtual memory.
      
      I used SPECjvm2008 to verify the performance impact. The numbers are mostly
      the same with a few exceptions, most visibly in the 'serial' benchmark.
      However, there's quite a lot of variance between SPECjvm2008 runs, so I
      wouldn't read too much into them.
      
      As we need the demand paging mechanism and the performance numbers
      suggest that the implementation is reasonable, I'd merge the patch as-is
      and optimize it later.
      
        Before:
      
          Running specJVM2008 benchmarks on an OSV guest.
          Score on compiler.compiler: 331.23 ops/m
          Score on compiler.sunflow: 131.87 ops/m
          Score on compress: 118.33 ops/m
          Score on crypto.aes: 41.34 ops/m
          Score on crypto.rsa: 204.12 ops/m
          Score on crypto.signverify: 196.49 ops/m
          Score on derby: 170.12 ops/m
          Score on mpegaudio: 70.37 ops/m
          Score on scimark.fft.large: 36.68 ops/m
          Score on scimark.lu.large: 13.43 ops/m
          Score on scimark.sor.large: 22.29 ops/m
          Score on scimark.sparse.large: 29.35 ops/m
          Score on scimark.fft.small: 195.19 ops/m
          Score on scimark.lu.small: 233.95 ops/m
          Score on scimark.sor.small: 90.86 ops/m
          Score on scimark.sparse.small: 64.11 ops/m
          Score on scimark.monte_carlo: 145.44 ops/m
          Score on serial: 94.95 ops/m
          Score on sunflow: 73.24 ops/m
          Score on xml.transform: 207.82 ops/m
          Score on xml.validation: 343.59 ops/m
      
        After:
      
          Score on compiler.compiler: 346.78 ops/m
          Score on compiler.sunflow: 132.58 ops/m
          Score on compress: 116.05 ops/m
          Score on crypto.aes: 40.26 ops/m
          Score on crypto.rsa: 206.67 ops/m
          Score on crypto.signverify: 194.47 ops/m
          Score on derby: 175.22 ops/m
          Score on mpegaudio: 76.18 ops/m
          Score on scimark.fft.large: 34.34 ops/m
          Score on scimark.lu.large: 15.00 ops/m
          Score on scimark.sor.large: 24.80 ops/m
          Score on scimark.sparse.large: 33.10 ops/m
          Score on scimark.fft.small: 168.67 ops/m
          Score on scimark.lu.small: 236.14 ops/m
          Score on scimark.sor.small: 110.77 ops/m
          Score on scimark.sparse.small: 121.29 ops/m
          Score on scimark.monte_carlo: 146.03 ops/m
          Score on serial: 87.03 ops/m
          Score on sunflow: 77.33 ops/m
          Score on xml.transform: 205.73 ops/m
          Score on xml.validation: 351.97 ops/m
      
      Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • mmu: Optimistic locking in populate() · 7e568ba0
      Pekka Enberg authored
      
      Use optimistic locking in populate() to make it robust against
      concurrent page faults.
      
      Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • mmu: VMA permission flags · 8a56dc8c
      Pekka Enberg authored
      
      Add permission flags to VMAs. They will be used by mprotect() and the
      page fault handler.
      
      Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • loader.py: add commands for function duration analysis · af723084
      Tomasz Grabiec authored
      
      Duration analysis is based on trace pairs which follow the convention
      in which function entry generates a trace named X and function exit
      generates either X_ret or X_err. Traces which do not have an accompanying
      return tracepoint are ignored.
      
      New commands:
      
        osv trace summary
      
            Prints execution time statistics for traces
      
        osv trace duration {function}
      
            Prints timed traces sorted by duration in descending order.
            Optionally narrowed down to a specified function
      
      gdb$ osv trace summary
      Execution times [ms]:
      name          count      min      50%      90%      99%    99.9%      max    total
      vfs_pwritev       3    0.682    1.042    1.078    1.078    1.078    1.078    2.801
      vfs_pwrite       32    0.006    1.986    3.313    6.816    6.816    6.816   53.007
      
      gdb$ osv trace duration
      0xffffc000671f0010  1    1385318632.103374   6.816 vfs_pwrite
      0xffffc0003bbef010  0    1385318637.929424   3.923 vfs_pwrite
      
      Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • loader.py: add wrapper for intrusive list · 6cc939a6
      Tomasz Grabiec authored
      
      The iteration logic was duplicated in two places. The patches yet to
      come would add yet another place, so let's refactor first.
      
      Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • libc/network: feof shouldn't be used on a closed file · df6278fe
      Raphael S. Carvalho authored
      
      Calling feof on a closed file isn't safe, and the result is undefined.
      Found while auditing the code.
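
      The general shape of the problem, for illustration (not the exact code
      that was fixed):

        #include <stdio.h>

        int read_all(const char* path) {
            FILE* f = fopen(path, "r");
            if (!f)
                return -1;
            char buf[256];
            while (fgets(buf, sizeof(buf), f)) {
                /* consume the line */
            }
            // The end-of-file status must be checked while the stream is still
            // open; calling feof() after fclose() is undefined behaviour.
            int hit_eof = feof(f);
            fclose(f);
            return hit_eof ? 0 : -1;
        }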
      
      Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • sched: fix iteration across timer list · 9c3308f1
      Avi Kivity authored
      
      We iterate over the timer list using an iterator, but the timer list can
      change during iteration due to timers being re-inserted.
      
      Switch to just looking at the head of the list instead, maintaining no
      state across loop iterations.
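
      The shape of the fix, roughly (a sketch of the pattern, not the actual
      OSv timer code):

        #include <list>

        struct timer {
            long when;                      // expiry time
            void (*callback)(timer&);       // may re-arm and re-insert the timer
            bool expired(long now) const { return when <= now; }
        };

        // Always look at the current head of the time-ordered list and keep no
        // iterator state across iterations, so a callback that re-inserts a
        // timer cannot invalidate anything we hold.
        void expire_timers(std::list<timer*>& timers, long now) {
            while (!timers.empty() && timers.front()->expired(now)) {
                timer* t = timers.front();
                timers.pop_front();         // unlink before firing
                t->callback(*t);            // the callback may re-insert the timer
            }
        }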
      
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
      Tested-by: Pekka Enberg <penberg@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • sched: prevent a re-armed timer from being ignored · 870d8410
      Avi Kivity authored
      
      When a hardware timer fires, we walk over the timer list, expiring timers
      and erasing them from the list.
      
      This is all well and good, except that a timer may rearm itself in its
      callback (this only holds for timer_base clients, not sched::timer, which
      consumes its own callback).  If it does, we end up erasing it even though
      it wants to be triggered.
      
      Fix by checking for the armed state before erasing.
      
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
      Tested-by: Pekka Enberg <penberg@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • Fix possible deadlock in condvar · 15a32ac8
      Nadav Har'El authored
      
      When a condvar's timeout and wakeup race, we wait for the concurrent
      wakeup to complete, so it won't crash. We did this wr.wait() with
      the condvar's internal mutex (m) locked, which was fine when this code
      was written; But now that we have wait morphing, wr.wait() waits not
      just for the wakeup to complete, but also for the user_mutex to become
      available. With m locked and us waiting for user_mutex, we're now in
      deadlock territory - because a common idiom of using a condvar is to
      do the locks in opposite order: lock user_mutex first and then use the
      condvar, which locks m.
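
      The idiom in question, schematically (shown with the POSIX API for
      familiarity; the names are illustrative):

        #include <pthread.h>

        pthread_mutex_t user_mutex = PTHREAD_MUTEX_INITIALIZER;
        pthread_cond_t  cv         = PTHREAD_COND_INITIALIZER;
        bool ready = false;

        void* waiter(void*) {
            // The user's mutex is taken first; the condvar's own internal lock
            // (OSv's "m") is only taken inside the wait call.  The timeout path
            // described above holds m and, with wait morphing, also waits for
            // user_mutex - the opposite acquisition order, hence the deadlock risk.
            pthread_mutex_lock(&user_mutex);
            while (!ready)
                pthread_cond_wait(&cv, &user_mutex);
            pthread_mutex_unlock(&user_mutex);
            return nullptr;
        }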
      
      I can't think of an easy way to actually demonstrate this deadlock,
      short of having a locked condvar_wait timeout racing with a concurrent
      condvar_wake_one and an additional locked condvar operation coming in
      at the same time, so I don't have a test case demonstrating this.
      I am hoping it will fix the lockups that Pekka is seeing in his
      Cassandra tests (which are the reason I looked for possible condvar
      deadlocks in the first place).
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Tested-by: Pekka Enberg <penberg@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
    • sched: delay initialization of early threads · d91d7799
      Glauber Costa authored
      
      The problem with sleep is that we can initialize early threads before the
      cpu itself is initialized. If we note what goes on in init_on_cpu, it should
      become clear:
      
      void cpu::init_on_cpu()
      {
          arch.init_on_cpu();
          clock_event->setup_on_cpu();
      }
      
      When we finally initialize the clock_event, it can get lost if we already have
      pending timers of any kind - which we may, if we have early threads being
      start()ed before that. I have played with many potential solutions, but in the
      end, I think the most sensible thing to do is to delay initialization of early
      threads to the point when we are first idle. That is the best way to guarantee
      that everything will be properly initialized and running.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>