sched: fix rare crashes caused by reschedule running on the wrong CPU

For a long time we've had the bug summarized in issue #178, where very rarely but consistently, in various runs such as Cassandra, Netperf and tst-queue-mpsc.so, we saw OSv crashing because of some corruption in the timer list, such as arming an already armed timer, or canceling an already canceled timer.

It turns out the problem was the schedule() function, which basically did cpu::current()->schedule(). The problem is that if we're unlucky enough, the thread can be migrated right after calling cpu::current(), but before the irq disable in schedule(). We then do a rescheduling for one CPU while running on a different CPU, which is a big faux pas. This can cause us, for example, to mess with one CPU's preemption_timer from a different CPU, causing the timer-related races and crashes we've seen in issue #178.

Clearly, we shouldn't have a *method* cpu->schedule() which can operate on any cpu at all. Rather, we should have only a *function* (class-static) cpu::schedule() which operates on the current cpu - and which finds that current CPU within the IRQ lock to ensure (among other things) that the thread cannot get migrated.

Another benefit of this patch is that it actually simplifies the code, with one less function called "schedule".

Fixes #178.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
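
Illustration only, not part of the patch: the ordering problem described above can be sketched as a small single-threaded model. Everything below is a hypothetical stand-in (the irqs_enabled flag, irq_guard, try_migrate(), and the two schedule variants are assumptions made up for this sketch, not OSv's real code). It only models the one property that matters here: a thread can be migrated while interrupts are enabled, and cannot be while they are disabled.

```cpp
#include <cassert>

// Hypothetical model state; none of this is OSv's actual implementation.
static bool irqs_enabled = true;
static int running_on = 0;      // index of the CPU currently running the thread

struct irq_guard {              // RAII: "disable interrupts" for a scope
    bool saved;
    irq_guard() : saved(irqs_enabled) { irqs_enabled = false; }
    ~irq_guard() { irqs_enabled = saved; }
};

struct cpu {
    int id;
    int reschedules;            // stands in for per-CPU state such as a timer
    explicit cpu(int i) : id(i), reschedules(0) {}
    static cpu* current();
};

static cpu cpus[2] = { cpu(0), cpu(1) };

cpu* cpu::current() { return &cpus[running_on]; }

// Model of a preemption that migrates the thread; it can only take
// effect while interrupts are enabled.
static bool try_migrate(int target) {
    if (!irqs_enabled) {
        return false;
    }
    running_on = target;
    return true;
}

// Old shape: look up the CPU first, disable interrupts afterwards.
// Returns true if the CPU we touched is the one we are running on.
static bool racy_schedule(int migrate_to) {
    cpu* c = cpu::current();
    try_migrate(migrate_to);    // preemption lands in the window and succeeds
    irq_guard guard;
    ++c->reschedules;           // may modify another CPU's state
    return c == cpu::current();
}

// New shape: disable interrupts first, then look up the current CPU.
static bool safe_schedule(int migrate_to) {
    irq_guard guard;
    assert(!irqs_enabled);
    cpu* c = cpu::current();
    try_migrate(migrate_to);    // refused: interrupts are disabled
    ++c->reschedules;
    return c == cpu::current();
}
```

In this model, racy_schedule(1) called while running on CPU 0 increments CPU 0's counter even though the thread has already moved to CPU 1, which is the same kind of cross-CPU access that corrupted the timer list. safe_schedule() always touches the CPU it is actually running on, because the lookup happens after interrupts are off.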