Nadav Har'El authored
This patch fixes a bug in which the CLI and memcached crash on startup
when running on two vcpus. The cause of the crash is this: Java is running
two threads. One loads a new shared library (in this example, libnio.so),
while the second runs normally and calls a function it hasn't called
before (pthread_cond_destroy()). When our on-demand resolver code tries
to resolve this function name, it iterates over the module list and sees
libnio.so, but this object hasn't been completely set up yet (we put it in
the list first - see program::add_object()), so looking up a symbol in it
crashes.

Why hasn't this problem been noticed before the recent link-order change?
Because before that change, the half-loaded library was always last in the
list (OSV itself was the first), so existing symbols were always found before
reaching the partially-set-up object. Now OSV, with many symbols, is last, and
the half-set-up object is in the middle, so the problem is common. It
could also have happened previously if we had unresolved symbols (e.g.,
weak symbols), but these were probably rare enough that the bug did not
show up in practice.

The fix in this patch is "hacky", because I wanted to avoid restructuring
the whole code. The problem is that the functions called in add_object()
(including relocate_rela(), nested add_object(), etc.) all assume that
they can look up symbols in the being-set-up object, while we don't want
this object to be visible to other threads. So we do exactly this:
each object gets a "visibility" field. If "visibility" is 0, all threads
can use this object, but if visibility is a thread pointer, only that
thread searches in this object. So add_object() starts the object
with visibility set to its own thread, and only when add_object() is done
does it set the visibility to 0 so all threads can see it.
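
The visibility rule can be sketched roughly as follows. This is an
illustration only, not the code in this patch: the object struct, the
lookup() and publish() helpers, and current_thread() (which uses the address
of a thread_local variable as a stand-in for a thread pointer) are all
hypothetical names chosen for the example.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a thread pointer: the address of a thread_local
// variable is distinct for every thread that is alive at the same time.
inline const void* current_thread() {
    static thread_local char marker;
    return &marker;
}

// Hypothetical, much simplified model of a loaded object.
struct object {
    std::string name;
    std::unordered_map<std::string, int> symbols;
    // nullptr (the "0" case): visible to all threads.
    // Otherwise: visible only to the thread that is still setting it up.
    std::atomic<const void*> visibility{nullptr};

    bool visible() const {
        const void* v = visibility.load(std::memory_order_acquire);
        return v == nullptr || v == current_thread();
    }
};

// Search the module list, skipping objects the calling thread may not see.
// Returns the symbol's value, or -1 if no visible object defines it.
int lookup(const std::vector<object*>& modules, const std::string& sym) {
    for (const object* obj : modules) {
        if (!obj->visible()) {
            continue;  // half-set-up object owned by another thread
        }
        auto it = obj->symbols.find(sym);
        if (it != obj->symbols.end()) {
            return it->second;
        }
    }
    return -1;
}

// Called by the loading thread once setup is complete.
void publish(object& obj) {
    assert(obj.visibility.load(std::memory_order_relaxed) == current_thread());
    obj.visibility.store(nullptr, std::memory_order_release);
}

// Helper for demonstration: run the same lookup from a different thread.
int lookup_from_other_thread(const std::vector<object*>& modules,
                             const std::string& sym) {
    int result = -1;
    std::thread t([&] { result = lookup(modules, sym); });
    t.join();
    return result;
}
```

In this sketch the loading thread can resolve symbols in its own
half-set-up object, while any other thread skips that object until
publish() has been called.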

While this solves the common bug, note that this patch still leaves
some room for SMP bugs, because it doesn't add locking to _modules,
so a lookup during an add_object() can see a broken vector for a short
duration. We should fix this remaining problem later, using RCU.
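
To illustrate the direction of that future fix, here is a minimal sketch of
the copy-and-publish pattern that RCU-style updates are built on. It is not
real RCU and not part of this patch: it approximates the idea with
std::shared_ptr and a short mutex, and the module_list class and its methods
are hypothetical names. The point is only that a reader iterates over an
immutable snapshot, so it never observes a vector in the middle of being
modified.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical module list that publishes immutable snapshots.
class module_list {
public:
    using snapshot = std::shared_ptr<const std::vector<std::string>>;

    // Readers take a reference to the current snapshot and may iterate over
    // it for as long as they like; it is never modified after publication.
    snapshot read() const {
        std::lock_guard<std::mutex> guard(_mutex);
        return _current;
    }

    // The writer copies the current vector, modifies the private copy, and
    // then swaps the pointer. Old snapshots are freed once their last
    // reader drops its reference.
    void add(const std::string& name) {
        std::lock_guard<std::mutex> guard(_mutex);
        auto next = std::make_shared<std::vector<std::string>>(*_current);
        next->push_back(name);
        _current = std::move(next);
    }

private:
    mutable std::mutex _mutex;
    snapshot _current = std::make_shared<const std::vector<std::string>>();
};
```

A real RCU implementation would avoid even the short lock on the read side;
the mutex here only protects the pointer copy, not the iteration.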
d703ec00