Commit ee92f736, authored by Nadav Har'El, committed by Avi Kivity

sched: fix rare crashes caused by reschedule running on the wrong CPU


For a long time we've had the bug summarized in issue #178, where very
rarely but consistently, in various runs such as Cassandra, Netperf and
tst-queue-mpsc.so, we saw OSv crashing because of some corruption in the
timer list, such as arming an already armed timer, or canceling an already
canceled timer.

It turns out the problem was the schedule() function, which basically did
cpu::current()->schedule(). The problem is that if we're unlucky enough,
the thread can be migrated right after calling cpu::current(), but before
the irq disable in schedule(), which causes us to do a rescheduling for
one CPU on a different CPU, which is a big faux pas. This can cause us,
for example, to mess with one CPU's preemption_timer from a different CPU,
causing the timer-related races and crashes we've seen in issue #178.

Clearly, we shouldn't at all have a *method* cpu->schedule() which can
operate on any cpu. Rather, we should have only a *function* (class-static)
cpu::schedule() which operates on the current cpu - and makes sure we find
that current CPU within the IRQ lock to ensure (among other things) the
thread cannot get migrated.
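
The ordering problem described above can be illustrated with a small,
single-threaded toy model. This is only a sketch and not the actual OSv code:
every name in it (the `toy` namespace, `preemption_point()`, `schedule_buggy()`,
`schedule_fixed()`, the `irq_enabled` flag) is hypothetical, and migration is
modeled as a deterministic flip between two CPUs that can only happen while
"interrupts" are enabled.

```cpp
#include <cassert>

// Toy model only: all names below are hypothetical stand-ins, not OSv's API.
namespace toy {

struct cpu {
    int id;
    int wrong_cpu_reschedules;  // times reschedule() ran for a CPU we were not on
};

cpu cpus[2] = {{0, 0}, {1, 0}};
int running_on = 0;        // index of the CPU the calling "thread" is on
bool irq_enabled = true;   // models the IRQ lock: false means interrupts are off

cpu* current() { return &cpus[running_on]; }

// Models the unlucky moment: the thread is migrated to the other CPU,
// which can only happen while interrupts are enabled.
void preemption_point() {
    if (irq_enabled) {
        running_on = 1 - running_on;
    }
}

// The rescheduling work. It is only valid when c is the CPU we are running on.
void reschedule(cpu* c) {
    if (c != current()) {
        ++c->wrong_cpu_reschedules;
    }
}

// Old pattern: look up the current CPU first, disable interrupts afterwards.
void schedule_buggy() {
    cpu* c = current();
    preemption_point();    // migration slips in between lookup and irq disable
    irq_enabled = false;
    reschedule(c);         // c is now stale: we reschedule the CPU we left
    irq_enabled = true;
}

// New pattern: disable interrupts first, then look up the current CPU.
void schedule_fixed() {
    irq_enabled = false;
    preemption_point();    // no effect: migration is impossible here
    cpu* c = current();
    reschedule(c);
    irq_enabled = true;
}

int total_wrong() {
    return cpus[0].wrong_cpu_reschedules + cpus[1].wrong_cpu_reschedules;
}

void reset() {
    cpus[0].wrong_cpu_reschedules = 0;
    cpus[1].wrong_cpu_reschedules = 0;
    running_on = 0;
    irq_enabled = true;
}

}  // namespace toy
```

In this model, calling `toy::schedule_buggy()` after `toy::reset()` records one
reschedule performed for the wrong CPU, while `toy::schedule_fixed()` records
none, because the CPU lookup happens only after migration has been ruled out.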

Another benefit of this patch is that it actually simplifies the code,
with one less function called "schedule".

Fixes #178.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
parent 955ab861