  1. Dec 11, 2013
    • Pekka Enberg's avatar
      mmu: Use addr_range for vma constructors · bbec1a18
      Pekka Enberg authored
      
      Make vma constructors more strongly typed by using the addr_range type.
      
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
      bbec1a18
    • Pekka Enberg's avatar
      core: vma abstract base class · d83db0c9
      Pekka Enberg authored
      
      Separate the common vma code into an abstract base class that's inherited
      by anon_vma and file_vma.
      
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
      d83db0c9
    • Glauber Costa's avatar
      mmu: fix allocate_intermediate_level · 3e6763f7
      Glauber Costa authored
      
      We have recently seen a problem where an occasional page fault outside
      the application would occur.
      
      I managed to track that down to my huge page failure patch, but wasn't
      really sure what was going on. Kudos to Raphael, then, who figured
      out that the problem happened when allocate_intermediate_level was called
      from split_huge_page.
      
      The problem here is that in that case we do *not* enter
      allocate_intermediate_level with the pte emptied, while we were previously
      expecting the write of the new pte to happen unconditionally. The
      compare_exchange broke that, because the exchange never actually happens.
      
      There are many ways to fix this issue, but the least confusing of them,
      given that there are other callers to this function that could
      potentially exhibit this problem, is to do some defensive programming
      and clearly separate the semantics of the two types of callers.
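
      A minimal sketch of the separation described above, using std::atomic as a
      stand-in for OSv's page-table entry type (all names here are illustrative,
      not the actual OSv code):

          #include <atomic>
          #include <cstdint>

          using pt_element = std::uint64_t;   // simplified stand-in

          // Case 1: the caller guarantees the entry is currently empty, so a
          // compare-exchange from zero is appropriate (it may fail harmlessly
          // if another CPU raced us and already installed the level).
          inline void allocate_intermediate_level_empty(
              std::atomic<pt_element>& ptep, pt_element new_level)
          {
              pt_element expected = 0;
              ptep.compare_exchange_strong(expected, new_level);
          }

          // Case 2: the caller (e.g. split_huge_page) arrives with a *non-empty*
          // entry and expects the new level to be written unconditionally; a
          // compare-exchange from "empty" would silently do nothing here.
          inline void allocate_intermediate_level_replace(
              std::atomic<pt_element>& ptep, pt_element new_level)
          {
              ptep.store(new_level);
          }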
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Tested-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
      3e6763f7
    • Nadav Har'El's avatar
      Verify slow page fault only happens when preemption is allowed · b7620ca2
      Nadav Har'El authored
      
      Once page_fault() checks that this is not a fast fixup (see safe_load()),
      we reach the page-fault slow path, which needs to allocate memory or
      even read from disk, and might sleep.
      
      If we ever get such a slow page-fault inside kernel code which has
      preemption or interrupts disabled, this is a serious bug, because the
      code in question thinks it cannot sleep. So this patch adds two
      assertions to verify this.
      
      The preemptable() assertion is easily triggered if stacks are demand-paged
      as explained in commit 41efdc1c (I have
      a patch to solve this, but it won't fit in the margin).
      However, I've also seen this assertion fire without demand-paged stacks, when
      running all tests together through testrunner.so. So I'm hoping these
      assertions will be helpful in hunting down some elusive bugs we still have.
      
      This patch adds a third use of the "0x200" constant (the ninth bit of
      the rflags register is the interrupt flag), so it replaces these uses with a
      new symbolic name, processor::rflags_if.
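
      A minimal sketch of the two checks described above (sched::preemptable() is
      assumed here as the scheduler's preemption query; the exact OSv identifiers
      and call site may differ):

          #include <cassert>

          namespace processor {
              // Bit 9 of the rflags register is the interrupt-enable flag (IF).
              constexpr unsigned long rflags_if = 0x200;
          }

          namespace sched { bool preemptable(); }   // provided by the scheduler

          // Called on the slow page-fault path, after the safe_load() fast
          // fixup has been ruled out and before we may sleep.
          inline void assert_fault_may_sleep(unsigned long fault_rflags)
          {
              assert(fault_rflags & processor::rflags_if);  // interrupts enabled
              assert(sched::preemptable());                 // preemption allowed
          }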
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
      b7620ca2
    • Glauber Costa's avatar
      vma_fault: propagate exception frame to fault handlers · 7ab5f9e8
      Glauber Costa authored
      
      We currently stop propagating the exception frame partway down the vma_fault
      path. There is no reason not to propagate it further, other than the fact
      that there are currently no users. Besides making the frame passing more
      consistent, I intend to use it for the JVM balloon.
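
      A sketch of what such propagation looks like (simplified; the actual OSv
      vma class and its fault handler signature may differ in detail):

          #include <cstdint>

          struct exception_frame;   // CPU state saved when the fault occurred

          class vma {
          public:
              virtual ~vma() = default;
              // The exception frame is passed all the way down, so handlers
              // that need the saved CPU state (e.g. a future JVM balloon
              // handler) can inspect it.
              virtual void fault(std::uintptr_t addr, exception_frame* ef) = 0;
          };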
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
      7ab5f9e8
  2. Dec 10, 2013
  3. Dec 09, 2013
    • Glauber Costa's avatar
      mmu: don't bail out on huge page failure · eeeaf888
      Glauber Costa authored
      
      Addressing that FIXME, as part of my memory reclamation series. But this
      is ready to go already. The goal is to retry serving the allocation if a
      huge page allocation fails, filling the range with 4k pages instead.
      
      The simplest and most robust way I've found to do that was to propagate the
      error up until we reach operate(). Once there, all we need to do is to
      re-walk the range with 4k pages instead of 2MB ones.
      
      We could theoretically just bail out on huge pages and move hp_end, but,
      especially when we have reclaim, it is likely that one operation will fail
      while the upcoming ones succeed.
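
      A minimal sketch of the retry described above (illustrative only; the real
      OSv page-table walker and its operate() machinery are structured differently):

          #include <cstddef>

          enum class page_size { small_4k, huge_2m };

          // Returns false if a huge-page allocation fails partway through.
          bool map_range(void* start, std::size_t len, page_size ps);

          void operate(void* start, std::size_t len)
          {
              // Try to cover the range with 2MB huge pages first; if any
              // huge-page allocation fails, re-walk the whole range with 4KB
              // pages instead of giving up on the mapping.
              if (!map_range(start, len, page_size::huge_2m)) {
                  map_range(start, len, page_size::small_4k);
              }
          }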
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      [ penberg: s/NULL/nullptr/ ]
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
      eeeaf888
  4. Dec 08, 2013
    • Glauber Costa's avatar
      sched: implement pthread_detach · afcf4735
      Glauber Costa authored
      
      I needed to call detach in some test code of mine, and it isn't implemented.
      The code I wrote to use it may or may not stay in the end, but nevertheless,
      let's implement it.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
      afcf4735
    • Glauber Costa's avatar
      sched: standardize call to _cleanup · d754d662
      Glauber Costa authored
      
      set_cleanup is quite a complicated piece of code. It is very easy to get it to
      race with other thread destruction sites, which was made abundantly clear when
      we tried to implement pthread detach.
      
      This patch tries to make it easier by restricting how and when set_cleanup can
      be called. The trick here is that currently, a thread may or may not have a
      cleanup function, and through a call to set_cleanup, our decision to clean up
      may change.
      
      From this point on, set_cleanup will only tell us *how* to clean up. If and
      when is a decision that we will make ourselves. For instance, if a thread
      is block-local, the destructor will be called at the end of the block. In
      that case, the _cleanup function will be there anyhow; we'll just not call
      it.
      
      We set a default cleanup function here for all created threads, which
      just deletes the current thread object. Anything coming from pthread will
      override it to also delete the pthread object. And again, it is important
      to note that these cleanup functions are set up unconditionally.
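
      A minimal sketch of the scheme described above (simplified, hypothetical
      names; the actual OSv thread and pthread classes differ):

          #include <functional>
          #include <utility>

          class thread {
          public:
              // Default cleanup for every thread: delete the thread object.
              thread() : _cleanup([this] { delete this; }) {}
              virtual ~thread() = default;

              // Only records *how* to clean up; whether and when the cleanup
              // actually runs is decided by the thread lifetime rules, not by
              // the caller of set_cleanup().
              void set_cleanup(std::function<void()> fn) { _cleanup = std::move(fn); }

          private:
              std::function<void()> _cleanup;
          };

          // A pthread wrapper would override the default unconditionally, e.g.
          //   t->set_cleanup([t, pt] { delete pt; delete t; });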
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
      d754d662
    • Glauber Costa's avatar
      sched: Use an integer for thread ids · 5c652796
      Glauber Costa authored
      
      Linux uses a 32-bit integer for pid_t, so let's do it as well. This is because
      there are functions in which we have to return our id back to the application.
      One example is gettid, which we already have in the tree.
      
      Theoretically, we could come up with a mapping between our 64-bit id and the
      Linux one, but since we have to maintain the mapping anyway, we might as well
      just use the Linux pids as our default IDs. The max size for that is 32 bits.
      That is not enough if we're just allocating pids by bumping a counter, but
      again, since we will have to maintain the bitmaps anyway, 32 bits allow us as
      many as 4 billion PIDs.
      
      avi: remove unneeded #include
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
      5c652796
    • Glauber Costa's avatar
      sched: initialize clock later · 1d31d9c3
      Glauber Costa authored
      
      Right now we are taking a clock measure very early for cpu initialization.
      That forces an unnecessary dependency between sched and clock initializations.
      
      Since that clock reading is used to determine for how long the cpu has been
      running, we can initialize the runtime later, when we init the idle thread.
      Nothing should be running before it. After doing this, we can move the sched
      initialization a bit earlier.
      
      Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
      1d31d9c3
  5. Dec 06, 2013
  6. Dec 05, 2013
  7. Dec 04, 2013
    • Nadav Har'El's avatar
      Add a few missing __*_chk functions · 2f4b8777
      Nadav Har'El authored
      
      When source code is compiled with -D_FORTIFY_SOURCE on Linux, various
      functions are sometimes replaced by __*_chk variants (e.g., __strcpy_chk)
      which can help avoid buffer overflows when the compiler knows the buffer's
      size during compilation.
      
      If we want to run code compiled on Linux with -D_FORTIFY_SOURCE (either
      deliberately or unintentionally - see issue #111), we need to implement
      these functions; otherwise the program will crash because of a missing
      symbol. We already implement a bunch of _chk functions, but we are
      definitely missing some more.
      
      This patch implements 6 more _chk functions which are needed to run
      the "rogue" program (mentioned in issue #111) when compiled with
      -D_FORTIFY_SOURCE=1.
      
      Following the philosophy of our existing *_chk functions, we do not
      aim for either ultimate performance or iron-clad security for our
      implementation of these functions. If this becomes important, we
      should revisit all our *_chk functions.
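
      A minimal sketch of what one such wrapper can look like, in the spirit
      described above, using __strcpy_chk as the example (illustrative; not
      necessarily one of the six functions added by this patch, nor the exact
      OSv implementation):

          #include <cstddef>
          #include <cstdlib>
          #include <cstring>

          extern "C" char* __strcpy_chk(char* dest, const char* src,
                                        size_t dest_len)
          {
              size_t n = std::strlen(src) + 1;   // bytes about to be copied
              if (n > dest_len) {
                  std::abort();                  // would overflow the known buffer
              }
              return static_cast<char*>(std::memcpy(dest, src, n));
          }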
      
      When compiled with -D_FORTIFY_SOURCE=2, rogue still doesn't work, but this
      time not because of a missing symbol: it fails to read the terminfo file
      for an as-yet-unknown reason (a patch for that issue will be sent
      separately).
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
      2f4b8777
    • Avi Kivity's avatar
      008d5245
  8. Dec 03, 2013
  9. Dec 01, 2013
    • Nadav Har'El's avatar
      Fix crash on malformed command line · 082ff373
      Nadav Har'El authored
      
      Before this patch, OSv crashes or continuously reboots when given unknown
      command line parameters, e.g.,
      
              scripts/run.py -c1 -e "--help --z a"
      
      With this patch, it says, as expected, that the "--z" option is not
      recognized, and displays the list of known options:
      
          unrecognised option '--z'
          OSv options:
            --help                show help text
            --trace arg           tracepoints to enable
            --trace-backtrace     log backtraces in the tracepoint log
            --leak                start leak detector after boot
            --nomount             don't mount the file system
            --noshutdown          continue running after main() returns
            --env arg             set Unix-like environment variable (putenv())
            --cwd arg             set current working directory
          Aborted
      
      The problem was that to parse the command line options, we used Boost,
      which throws an exception when an unrecognized option is seen. We need
      to catch this exception, and show a message accordingly.
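
      A minimal sketch of the parse-and-catch pattern this refers to, using
      boost::program_options directly (illustrative; OSv's actual option table
      and error handling live elsewhere):

          #include <cstdlib>
          #include <iostream>
          #include <boost/program_options.hpp>

          namespace po = boost::program_options;

          po::variables_map parse_cmdline(int ac, char** av,
                                          const po::options_description& desc)
          {
              po::variables_map vars;
              try {
                  po::store(po::parse_command_line(ac, av, desc), vars);
                  po::notify(vars);
              } catch (const po::error& e) {
                  // e.g. "unrecognised option '--z'", followed by the option list
                  std::cout << e.what() << "\n" << desc << "\n";
                  std::abort();
              }
              return vars;
          }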
      
      But before this patch, C++ exceptions did not work correctly during this
      stage of the boot process, because exceptions use elf::program(), and we
      only set it up later. So this patch moves the setup of the elf::program()
      object earlier in the boot, to the beginning of main_cont().
      
      Now we'll be able to use C++ exceptions throughout main_cont(), not just
      in command line parsing.
      
      This patch also removes the unused "filesystem" parameter of
      elf::program(), rather than moving the initialization of this empty object
      as well.
      
      Fixes #103.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
      082ff373
  10. Nov 28, 2013
  11. Nov 27, 2013
  12. Nov 26, 2013
    • Nadav Har'El's avatar
      sched: New scheduler algorithm · dbc0d507
      Nadav Har'El authored

      This patch replaces the algorithm which the scheduler uses to keep track of
      threads' runtime, and to choose which thread to run next and for how long.
      
      The previous algorithm used the raw cumulative runtime of a thread as its
      runtime measure. But comparing these numbers directly was impossible: e.g.,
      should a thread that slept for an hour now get an hour of uninterrupted CPU
      time? This resulted in a hodgepodge of heuristics which "modified" and
      "fixed" the runtime. These heuristics did work quite well in our test cases,
      but we were forced to add more and more unjustified heuristics and constants
      to fix scheduling bugs as they were discovered. The existing scheduler was
      especially problematic with thread migration (moving a thread from one CPU
      to another) as the runtime measure on one CPU was meaningless in another.
      This bug, if not corrected (e.g., by the patch which I sent a month
      ago), can cause crucial threads to acquire exceedingly high runtimes by
      mistake, and resulted in the tst-loadbalance test using only one CPU on
      a two-CPU guest.
      
      The new scheduling algorithm follows a much more rigorous design,
      proposed by Avi Kivity in:
      https://docs.google.com/document/d/1W7KCxOxP-1Fy5EyF2lbJGE2WuKmu5v0suYqoHas1jRM/edit?usp=sharing
      
      
      
      To make a long story short (read the document if you want all the
      details), the new algorithm is based on a runtime measure R which
      is the running decaying average of the thread's running time.
      It is a decaying average in the sense that the thread's act of running or
      sleeping in recent history is given more weight than its behavior
      a long time ago. This measure R can tell us which of the runnable
      threads to run next (the one with the lowest R), and using some
      high-school-level mathematics, we can calculate for how long to run
      this thread until it should be preempted by the next one. R carries
      the same meaning on all CPUs, so CPU migration becomes trivial.
      
      The actual implementation uses a normalized version of R, called R''
      (Rtt in the code), which is also explained in detail in the document.
      This Rtt allows updating just the running thread's runtime - not all
      threads' runtime - as time passes, making the whole calculation much
      more tractable.
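
      A minimal sketch of the decaying-average idea behind R (illustrative only;
      the real scheduler works with the normalized R'' described above, and its
      types and names differ):

          #include <cmath>

          struct runtime_estimator {
              double tau;       // decay time constant (the patch mentions 200ms)
              double R = 0.0;   // decaying average of time spent running

              // Account for an interval of length dt during which the thread
              // either ran or slept: older history decays by exp(-dt/tau), and
              // time actually spent running adds to R.
              void update(double dt, bool running) {
                  double decay = std::exp(-dt / tau);
                  R = R * decay + (running ? tau * (1.0 - decay) : 0.0);
              }
          };

          // The scheduler picks the runnable thread with the lowest R next.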
      
      The benefits of the new scheduler code over the existing one are:
      
      1. A more rigorous design with fewer unjustified heuristics.
      
      2. A thread's runtime measurement correctly survives a migration to a
      different CPU, unlike the existing code (which sometimes botches
      it up, leading to threads hanging). In particular, tst-loadbalance
      now gives good results for the "intermittent thread" test, unlike
      the previous code which in 50% of the runs caused one CPU to be
      completely wasted (when the load-balancing thread hung).
      
      3. The new algorithm can look at a much longer runtime history than the
      previous algorithm did. With the default tau=200ms, the one-cpu
      intermittent thread test of tst-scheduler now provides good
      fairness for sleep durations of 1ms-32ms.
      The previous algorithm was never fair in any of those tests.
      
      4. The new algorithm is more deterministic in its use of timers
      (with thyst=2_ms: up to 500 timers a second), resulting in less
      varied performance in high-context-switch benchmarks like tst-ctxsw.
      
      This scheduler does very well on the fairness tests tst-scheduler and
      fairly well on tst-loadbalance. Even better performance on that second
      test will require an additional patch for the idle thread to wake other
      cpus' load-balancing threads.
      
      As expected the new scheduler is somewhat slower than the existing one
      (as we now do some relatively complex calculations instead of trivial
      integer operations), but thanks to using approximations when possible
      and to various other optimizations, the difference is relatively small:
      
      On my laptop, tst-ctxsw.so, which measures "context switch" time (actually,
      also including the time to use mutex and condvar which this test uses to
      cause context switching), on the "colocated" test I measured 355 ns with
      the old scheduler, and 382 ns with the new scheduler - meaning that the
      new scheduler adds 27ns of overhead to every context switch. To see that
      this penalty is minor, consider that tst-ctxsw is an extreme example,
      doing 3 million context switches a second, and even there it only slows
      down the workload by 7%.
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
      dbc0d507
    • Nadav Har'El's avatar
      sched: No need for "yield" parameter of schedule() · e1722351
      Nadav Har'El authored
      
      The schedule() and cpu::schedule() functions had a "yield" parameter.
      This parameter was inconsistently used (it's not clear why specific
      places called it with "true" and others with "false"), but moreover, was
      always ignored!
      
      So this patch removes the parameter of schedule(). If you really want
      a yield, call yield(), not schedule().
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
      e1722351
    • Nadav Har'El's avatar
      sched: Use schedule(), not yield() in idle thread · da583f27
      Nadav Har'El authored
      
      The idle thread cpu::idle() waits for other threads to become runnable,
      and then lets them run. It used to yield the CPU by calling yield(),
      because in early OSv history we didn't have an idle priority so simply
      calling schedule() would not guarantee that the new thread, not the idle
      thread, will run.
      
      But now we actually do have an idle priority; if the run queue is not
      empty, we are sure that calling schedule() will run another thread,
      not the idle thread. So this patch calls schedule(), which is simpler,
      faster, and more reliable than yield().
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
      da583f27
    • Nadav Har'El's avatar
      sched: Don't change runtime of a queued thread · e60ebaf3
      Nadav Har'El authored
      
      The scheduler (reschedule_from_interrupt()) changes the runtime of the
      current thread. This assumes that the current thread is not in the
      runqueue - because the runqueue is sorted by runtime, and modifying the
      runtime of a thread which is already in the runqueue ruins the sorted
      tree's invariants.
      
      Unfortunately, the existing code broke this assumption in two places:
      
      1.  When handle_incoming_wakeups() wakes up the current thread (i.e., a
      thread that prepared to wait but was woken before it could go to sleep),
      the current thread was queued. Instead, we need to simply return
      the thread to the "running" state.
      
      2.  yield() queued the current thread. Rather, it needs to just change
      its runtime, and reschedule_from_interrupt() will decide to queue this
      thread.
      
      This patch fixes the first problem. The second problem will be solved
      by a yield() rewrite which is part of the new scheduler in a later
      patch.
      
      By the way, after we fix both problems, we can also be sure that the
      strange if(n != thread::current()) in the scheduler is always true.
      This is because n, picked up from the run queue, could never be the
      current thread, because the current thread isn't in the run queue.
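
      A minimal sketch of the fix to problem 1 above (illustrative; the actual
      handle_incoming_wakeups() and OSv's thread/runqueue types differ):

          enum class thread_status { waiting, running, queued };

          struct thread {
              thread_status status = thread_status::waiting;
          };

          template <typename Runqueue>
          void wake(thread& t, thread& current, Runqueue& runqueue)
          {
              if (&t == &current) {
                  // The running thread is never kept in the runqueue (its
                  // runtime is about to change, which would break the
                  // runqueue's ordering), so just mark it running again.
                  t.status = thread_status::running;
              } else {
                  t.status = thread_status::queued;
                  runqueue.insert(t);   // runqueue is sorted by runtime
              }
          }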
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
      e60ebaf3
  13. Nov 25, 2013
    • Pekka Enberg's avatar
    • Pekka Enberg's avatar
      mmu: Anonymous memory demand paging · c1d5fccb
      Pekka Enberg authored
      
      Switch to demand paging for anonymous virtual memory.
      
      I used SPECjvm2008 to verify the performance impact. The numbers are mostly
      the same, with a few exceptions, most visible in the 'serial' benchmark.
      However, there's quite a lot of variance between SPECjvm2008 runs, so I
      wouldn't read too much into them.

      As we need the demand paging mechanism and the performance numbers
      suggest that the implementation is reasonable, I'd merge the patch as-is
      and optimize it later.
      
        Before:
      
          Running specJVM2008 benchmarks on an OSV guest.
          Score on compiler.compiler: 331.23 ops/m
          Score on compiler.sunflow: 131.87 ops/m
          Score on compress: 118.33 ops/m
          Score on crypto.aes: 41.34 ops/m
          Score on crypto.rsa: 204.12 ops/m
          Score on crypto.signverify: 196.49 ops/m
          Score on derby: 170.12 ops/m
          Score on mpegaudio: 70.37 ops/m
          Score on scimark.fft.large: 36.68 ops/m
          Score on scimark.lu.large: 13.43 ops/m
          Score on scimark.sor.large: 22.29 ops/m
          Score on scimark.sparse.large: 29.35 ops/m
          Score on scimark.fft.small: 195.19 ops/m
          Score on scimark.lu.small: 233.95 ops/m
          Score on scimark.sor.small: 90.86 ops/m
          Score on scimark.sparse.small: 64.11 ops/m
          Score on scimark.monte_carlo: 145.44 ops/m
          Score on serial: 94.95 ops/m
          Score on sunflow: 73.24 ops/m
          Score on xml.transform: 207.82 ops/m
          Score on xml.validation: 343.59 ops/m
      
        After:
      
          Score on compiler.compiler: 346.78 ops/m
          Score on compiler.sunflow: 132.58 ops/m
          Score on compress: 116.05 ops/m
          Score on crypto.aes: 40.26 ops/m
          Score on crypto.rsa: 206.67 ops/m
          Score on crypto.signverify: 194.47 ops/m
          Score on derby: 175.22 ops/m
          Score on mpegaudio: 76.18 ops/m
          Score on scimark.fft.large: 34.34 ops/m
          Score on scimark.lu.large: 15.00 ops/m
          Score on scimark.sor.large: 24.80 ops/m
          Score on scimark.sparse.large: 33.10 ops/m
          Score on scimark.fft.small: 168.67 ops/m
          Score on scimark.lu.small: 236.14 ops/m
          Score on scimark.sor.small: 110.77 ops/m
          Score on scimark.sparse.small: 121.29 ops/m
          Score on scimark.monte_carlo: 146.03 ops/m
          Score on serial: 87.03 ops/m
          Score on sunflow: 77.33 ops/m
          Score on xml.transform: 205.73 ops/m
          Score on xml.validation: 351.97 ops/m
      
      Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
      c1d5fccb
    • Pekka Enberg's avatar
      mmu: Optimistic locking in populate() · 7e568ba0
      Pekka Enberg authored
      
      Use optimistic locking in populate() to make it robust against
      concurrent page faults.
      
      Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
      7e568ba0
    • Pekka Enberg's avatar
      mmu: VMA permission flags · 8a56dc8c
      Pekka Enberg authored
      
      Add permission flags to VMAs. They will be used by mprotect() and the
      page fault handler.
      
      Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
      8a56dc8c
    • Avi Kivity's avatar
      sched: fix iteration across timer list · 9c3308f1
      Avi Kivity authored
      
      We iterate over the timer list using an iterator, but the timer list can
      change during iteration due to timers being re-inserted.
      
      Switch to just looking at the head of the list instead, maintaining no
      state across loop iterations.
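
      A minimal sketch of the head-of-list pattern described above (illustrative;
      OSv's timer list is an intrusive set ordered by expiration time):

          #include <list>

          struct timer {
              unsigned long when = 0;              // absolute expiration time
              void (*callback)(timer&) = nullptr;
          };

          void fire_expired(std::list<timer*>& timers, unsigned long now)
          {
              // No iterator is held across a callback: each pass re-examines
              // the head of the list, so timers re-inserted by callbacks cannot
              // invalidate any saved iteration state.
              while (!timers.empty() && timers.front()->when <= now) {
                  timer* t = timers.front();
                  timers.pop_front();
                  t->callback(*t);                 // may re-arm and re-insert t
              }
          }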
      
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
      Tested-by: Pekka Enberg <penberg@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
      9c3308f1
    • Avi Kivity's avatar
      sched: prevent a re-armed timer from being ignored · 870d8410
      Avi Kivity authored
      
      When a hardware timer fires, we walk over the timer list, expiring timers
      and erasing them from the list.
      
      This is all well and good, except that a timer may rearm itself in its
      callback (this only holds for timer_base clients, not sched::timer, which
      consumes its own callback).  If it does, we end up erasing it even though
      it wants to be triggered.
      
      Fix by checking for the armed state before erasing.
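
      A minimal sketch of the armed-state check described above, as it would sit
      in the expiration step (illustrative; OSv's timer_base differs in detail):

          struct timer {
              bool armed = false;
              void (*callback)(timer&) = nullptr;
          };

          template <typename TimerList>
          void expire_one(TimerList& timers, timer& t)
          {
              t.armed = false;
              t.callback(t);        // a timer_base client may re-arm t here,
                                    // setting armed = true with a new deadline
              if (!t.armed) {
                  timers.erase(t);  // erase only if the callback did not re-arm
              }
          }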
      
      Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
      Tested-by: Pekka Enberg <penberg@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
      870d8410
    • Nadav Har'El's avatar
      Fix possible deadlock in condvar · 15a32ac8
      Nadav Har'El authored
      
      When a condvar's timeout and wakeup race, we wait for the concurrent
      wakeup to complete, so it won't crash. We did this wr.wait() with
      the condvar's internal mutex (m) locked, which was fine when this code
      was written; but now that we have wait morphing, wr.wait() waits not
      just for the wakeup to complete, but also for the user_mutex to become
      available. With m locked and us waiting for user_mutex, we're now in
      deadlock territory - because a common idiom of using a condvar is to
      do the locks in opposite order: lock user_mutex first and then use the
      condvar, which locks m.
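
      A minimal sketch of one way to break such a cycle: drop the condvar's
      internal mutex before waiting for the concurrent wakeup (illustrative,
      using std::mutex; the patch's actual fix and OSv's wait_record type may
      differ in detail):

          #include <mutex>

          struct wait_record {
              void wait() { /* blocks until the concurrent waker is done */ }
          };

          void timeout_path(std::mutex& m, wait_record& wr)
          {
              std::unique_lock<std::mutex> lock(m);   // condvar's internal mutex
              // ... detect that a concurrent wakeup is in flight ...
              lock.unlock();   // release m first: with wait morphing, wr.wait()
                               // may also wait for user_mutex, which another
                               // thread could hold while trying to lock m.
              wr.wait();
          }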
      
      I can't think of an easy way to actually demonstrate this deadlock,
      short of having a locked condvar_wait timeout race with condvar_wake_one
      while an additional locked condvar operation comes in concurrently,
      so I don't have a test case demonstrating this.
      I am hoping it will fix the lockups that Pekka is seeing in his
      Cassandra tests (which are the reason I looked for possible condvar
      deadlocks in the first place).
      
      Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
      Tested-by: Pekka Enberg <penberg@cloudius-systems.com>
      Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
      15a32ac8