Commit dbc0d507 authored by Nadav Har'El, committed by Avi Kivity

sched: New scheduler algorithm

This patch replaces the algorithm which the scheduler uses to keep track of
threads' runtime, and to choose which thread to run next and for how long.

The previous algorithm used the raw cumulative runtime of a thread as its
runtime measure. But comparing these numbers directly was impossible: e.g.,
should a thread that slept for an hour now get an hour of uninterrupted CPU
time? This resulted in a hodgepodge of heuristics which "modified" and
"fixed" the runtime. These heuristics did work quite well in our test cases,
but we were forced to add more and more unjustified heuristics and constants
to fix scheduling bugs as they were discovered. The existing scheduler was
especially problematic with thread migration (moving a thread from one CPU
to another), as the runtime measure on one CPU was meaningless on another.
This bug, if not corrected (e.g., by the patch I sent a month ago),
could cause crucial threads to acquire exceedingly high runtimes by
mistake, and resulted in the tst-loadbalance test using only one CPU on
a two-CPU guest.

The new scheduling algorithm follows a much more rigorous design,
proposed by Avi Kivity in:
https://docs.google.com/document/d/1W7KCxOxP-1Fy5EyF2lbJGE2WuKmu5v0suYqoHas1jRM/edit?usp=sharing

To make a long story short (read the document if you want all the
details), the new algorithm is based on a runtime measure R which
is the running decaying average of the thread's running time.
It is a decaying average in the sense that the thread's act of running or
sleeping in recent history is given more weight than its behavior
a long time ago. This measure R can tell us which of the runnable
threads to run next (the one with the lowest R), and using some
high-school-level mathematics, we can calculate for how long to run
this thread until it should be preempted by the next one. R carries
the same meaning on all CPUs, so CPU migration becomes trivial.
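The decaying-average idea can be sketched as follows. This is an
illustrative model, not OSv's actual code: the names thread_info,
account, and pick_next are mine, and the closed-form update assumes the
thread's running time is spread uniformly over each accounting window:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative sketch (not OSv's code): a decaying-average runtime
// measure R with time constant tau. Recent running time is weighted
// more heavily than old running time, which fades out as e^{-age/tau}.
struct thread_info {
    double R = 0.0;  // decaying average of past running time (0..1)
};

constexpr double tau = 0.2;  // 200ms decay time constant, as in the patch

// Advance a thread's measure over a window of length dt seconds,
// during which the thread ran for 'ran' seconds (0 <= ran <= dt).
void account(thread_info& t, double dt, double ran) {
    double decay = std::exp(-dt / tau);
    // Decay old history, then mix in the fraction of this window
    // that the thread spent running.
    t.R = t.R * decay + (ran / dt) * (1.0 - decay);
}

// The scheduler runs the runnable thread with the lowest R next.
thread_info* pick_next(std::vector<thread_info>& runnable) {
    return &*std::min_element(runnable.begin(), runnable.end(),
        [](const thread_info& a, const thread_info& b) {
            return a.R < b.R;
        });
}
```

A thread that runs continuously converges toward R = 1, a sleeping
thread decays toward R = 0, and both converge with time constant tau,
which is why R carries the same meaning on every CPU.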

The actual implementation uses a normalized version of R, called R''
(Rtt in the code), which is also explained in detail in the document.
This Rtt allows updating just the running thread's runtime - not all
threads' runtime - as time passes, making the whole calculation much
more tractable.
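The normalization trick can be illustrated with a short calculation
(again a sketch with my own names, not OSv's code). Defining
R'' = R * e^{t/tau} makes a sleeping thread's R'' constant, because its
R decays by exactly the factor by which e^{t/tau} grows; only the
running thread's R'' advances, and its increment has a closed form:

```cpp
#include <cmath>

// Illustrative sketch of the R'' (Rtt) normalization; names are mine.
// Model: while running, dR/dt = (1 - R)/tau; while sleeping,
// dR/dt = -R/tau. With R''(t) = R(t) * exp(t/tau):
//   sleeping: dR''/dt = 0                  (nothing to update)
//   running:  dR''/dt = exp(t/tau) / tau
// so only the running thread's R'' changes, by the closed form below.
constexpr double tau = 0.2;  // 200ms decay constant, as in the patch

// R'' accrued by the single running thread between absolute times
// t1 and t2 (every other thread's R'' is untouched).
double rtt_increment(double t1, double t2) {
    return std::exp(t2 / tau) - std::exp(t1 / tau);
}
```

Since R'' is compared only between threads at the same instant, the
shared factor e^{t/tau} cancels and ordering by R'' matches ordering by
R. The factor grows without bound, though, so a real implementation must
periodically renormalize all threads' R'' to keep it in range.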

The benefits of the new scheduler code over the existing one are:

1. A more rigorous design with fewer unjustified heuristics.

2. A thread's runtime measurement correctly survives a migration to a
different CPU, unlike the existing code (which sometimes botches
it up, leading to threads hanging). In particular, tst-loadbalance
now gives good results for the "intermittent thread" test, unlike
the previous code which in 50% of the runs caused one CPU to be
completely wasted (when the load-balancing thread hung).

3. The new algorithm can look at a much longer runtime history than the
previous algorithm did. With the default tau=200ms, the one-CPU
intermittent thread test of tst-scheduler now provides good
fairness for sleep durations of 1ms-32ms.
The previous algorithm was never fair in any of those tests.

4. The new algorithm is more deterministic in its use of timers
(with thyst=2_ms: up to 500 timers a second), resulting in less
varied performance in high-context-switch benchmarks like tst-ctxsw.

This scheduler does very well on the fairness tests tst-scheduler and
fairly well on tst-loadbalance. Even better performance on that second
test will require an additional patch for the idle thread to wake other
CPUs' load-balancing threads.

As expected the new scheduler is somewhat slower than the existing one
(as we now do some relatively complex calculations instead of trivial
integer operations), but thanks to using approximations when possible
and to various other optimizations, the difference is relatively small:

On my laptop, tst-ctxsw.so, which measures context-switch time (actually
also including the mutex and condvar operations the test uses to cause
context switching), measured 355 ns on the "colocated" test with the
old scheduler and 382 ns with the new one - i.e., the new scheduler adds
27 ns of overhead to every context switch. To see that this penalty is
minor, consider that tst-ctxsw is an extreme example, doing 3 million
context switches a second, and even there it slows down the workload by
only 7%.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
parent e1722351