- Jan 10, 2014
-
Glauber Costa authored
We are currently only answering requests for CLOCK_REALTIME, but we could easily handle: * CLOCK_REALTIME_COARSE, which is effectively the same as CLOCK_REALTIME but faster. In our case, all time sources are equally fast. * CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID, since we can easily get runtimes for our threads and publish that. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
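A minimal sketch of what a caller gains from the extra clock IDs described above. The IDs are standard POSIX/Linux; the helper names here are illustrative, not OSv code.

```cpp
#include <ctime>

// Hedged sketch, not OSv code: helpers a client might use once the extra
// clock IDs are answered. The clock IDs themselves are standard POSIX/Linux.
static long realtime_coarse_sec() {
    timespec ts{};
    // Effectively the same as CLOCK_REALTIME, possibly cheaper to read.
    clock_gettime(CLOCK_REALTIME_COARSE, &ts);
    return ts.tv_sec;
}

static long process_cputime_ns() {
    timespec ts{};
    // CPU time consumed by this process, published per thread/process.
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return ts.tv_sec * 1000000000L + ts.tv_nsec;
}
```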
-
Pekka Enberg authored
The "APIC base" message is not very useful to users. Drop it. Reviewed-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Pekka Enberg authored
Currently, OSv prints out the following at boot:

acpi 0 apic 0
acpi 1 apic 1
acpi 2 apic 2
acpi 3 apic 3

Replace that with a simpler message:

4 CPUs detected

We do lose the ACPI ID -> CPU ID mapping but it is not terribly important for users. Suggested-by:
Nadav Har'El <nyh@cloudius-systems.com> Reviewed-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Pekka Enberg authored
Simplify networking boot initialization message as suggested by Tzach. Suggested-by:
Tzach Livyatan <tzach@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Pekka Enberg authored
It's 2014 now. Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
We respect -Xmx when instructed by the user, but when that is left blank, we set it to be all remaining memory that we have. That is not 100% perfect because the JVM itself will use some memory, but it should be a good enough estimate. Especially given that some of the memory currently in use by OSv could potentially be freed in the future should we need it. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Reviewed-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
The biggest problem I am seeing with the balloon is that right now the only time we call the balloon is when we're seeing memory pressure. If pressure is coming from the JVM, we can livelock in quite interesting ways. We need to detect that and disable the balloon in those situations, since ballooning when the pressure comes from the JVM will only thrash our workloads. It's not yet working reliably, but this is the direction I plan to start from. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
To make informed reclaim decisions, we need to have as much relevant information as possible about our reclaim targets. Specifically, it is useful to know how much memory is currently used by the JVM heap. The reasoning behind this is that if pressure is coming from the heap, ballooning will harm us instead of helping us. Note: This is really just a first approximation. Ideally, total memory shouldn't matter, but rather the memory delta since a last common event. But counting memory is the first step for both. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
To find out which vmas hold the Java heap, we will use a technique that is very close to ballooning (in the implementation, it is effectively the same). What we will do is insert a very small element (2 pages), and mark the vma where the object is present as containing the JVM heap. Due to the way the JVM allocates objects, that will end up in the young generation. As time passes, the object will move the same way the balloon moves, and every new vma that is seen will be marked as holding the JVM heap. That mechanism should work for every generational GC, which should encompass most of the JDK7 GCs (if not all). It shouldn't work with the G1GC, but that debuts in JDK8, and for that we can do something a lot simpler, namely: having the JVM tell us in advance which map areas contain the heap. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
The best possible criterion for deflating balloons is heap pressure: whenever there is pressure in the JVM, we should give back memory so pressure stops. To accomplish that, we need to somehow tap into the JVM. This patch registers an MXBean that will send us notifications about collections. We will ignore minor collections and act upon major collections by deflating any existing balloons. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
There are restrictions on when and how a shrinker can run. For instance, if we have no balloons inflated, there is nothing to deflate (the relaxer should, then, be deactivated). Likewise, when the JVM fails to allocate memory for an extra balloon, it is pointless to keep trying (which would only lead to unnecessary spins) until *at least* the next garbage collection phase. I believe this behavior of activation / deactivation ought to be shrinker specific. The reclaiming framework will only provide the infrastructure to do so. In this patch, the JVM balloon uses that to inform the reclaimer when it makes sense for the shrinker or relaxer to be called. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
This patch implements the JVM balloon driver, which is responsible for borrowing memory from the JVM when OSv is short on memory, and giving it back when we are plentiful. It works by allocating a java byte array, and then unmapping a large page-aligned region inside it (as big as our size allows). This array is good to go until the GC decides to move us. When that happens, we need to carefully emulate the memcpy fault and put things back in place. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
After carrying out some testing, I quickly realized that the old fixup-only solution I was attempting for the ballooning was not really flying. The reason is that we would take a fault, figure out the fixup address, and return. If that wasn't a JVM fault, we were forced to take another fault (since we were already out of fault context). Once demand paging is a reality, the vast majority of faults are for non-balloon addresses, so we were effectively doubling our number of page faults for no reason. I have decided to go with the VMA (+ fixups for instruction decoding) route after all. This is way more efficient and it seems to be working fine. The JVM vma is really close to the normal anonymous VMA, except that it can never hold pages, and its fault handler calls into the JVM balloon facilities for decoding. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
This patch introduces the memory reclaimer thread, which I hope to use to dispose of unused memory when pressure kicks in. "Pressure" right now is defined to be when we have only 20% of total memory available, but that can be revisited. The way it will work is that each memory user that is able to dispose of its memory will register a shrinker, and the reclaimer will loop through them. However, the current "loop through all" only "works" because we have only one shrinker being registered. When others appear, we need better policies to drive how much to take, and from whom. Memory allocation will now wait if memory is not available, instead of aborting. The decision of aborting should belong to the reclaimer and no one else. We should never expect to have an unbounded and, more importantly, all-opaque number of shrinkers like Linux does. We have control of who they are and how they behave, so I expect that we will be able to make a lot better decisions in the long run. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
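A minimal sketch of the register-a-shrinker / loop-through-all scheme described above. Names and signatures are illustrative, not OSv's actual reclaimer API.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch, not OSv code: each memory user that can dispose of
// memory registers a shrinker; the reclaimer loops through them in turn.
struct shrinker {
    // Try to release up to 'target' bytes; return how many bytes were freed.
    std::function<std::size_t(std::size_t)> release;
};

class reclaimer {
    std::vector<shrinker> _shrinkers;
public:
    void register_shrinker(shrinker s) { _shrinkers.push_back(std::move(s)); }

    // The naive "loop through all" policy: ask each shrinker until the
    // target is met. Better policies are needed once more shrinkers exist.
    std::size_t reclaim(std::size_t target) {
        std::size_t freed = 0;
        for (auto& s : _shrinkers) {
            if (freed >= target) break;
            freed += s.release(target - freed);
        }
        return freed;
    }
};
```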
-
Glauber Costa authored
Following an early suggestion from Nadav, I am trying to use semaphores for the balloon instead of keeping our own queue. For that to work, I need a bit more functionality that may not belong in the main balloon class. Namely: 1) I need to query for the presence of waiters (and maybe in the future for the number of waiters); 2) I need a special post that would allow me to make sure that we post at most as much as we're waiting for, and nothing more. This patch transforms the post method into an unlocked version (and exposes a trivial version that just locks around it) and makes the other changes necessary to allow subclassing. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
This will be useful when we shrink, so we know how much memory we newly released for system consumption. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
operate() so far works on a page range and at the very most sets a success flag somewhere. I am here extending the API to allow it to return how much data it manipulated. So as an example, if we fault in 2MB in an empty range, it will return 2 << 20. But if we fault in the same 2MB in a range that already contained some sparse 4k pages, we will return (2 << 20) - previous_pages. That will be useful to count memory usage in certain VMAs. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
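The return-value convention above can be sketched in a few lines. This is illustrative only (the function name is hypothetical, not OSv's page-table code): an operation over a range reports how many bytes it newly manipulated, discounting pages that were already there.

```cpp
#include <cstddef>

constexpr std::size_t huge_page  = std::size_t(2) << 20;  // 2MB
constexpr std::size_t small_page = 4096;                  // 4k

// Hypothetical sketch of the accounting convention: faulting in a range
// returns the bytes newly populated, discounting preexisting sparse pages.
inline std::size_t bytes_newly_faulted(std::size_t range_bytes,
                                       std::size_t preexisting_small_pages) {
    return range_bytes - preexisting_small_pages * small_page;
}
```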
-
Glauber Costa authored
When we start using the JVM balloon, our memcpy could fail for valid reasons when the JVM is moving memory that is now in an unmapped region. To handle that, register a fixup that will trigger a JVM call when the fault happens. If all goes well, we will be able to continue normally. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Takuya ASADA authored
On VMware, pci_readw(PCI_CFG_DEVICE_ID) returns the *vendor ID*. pci_readw(PCI_CFG_VENDOR_ID) returns the vendor ID as well. Compare to the FreeBSD implementation of PCI config space read/write: FreeBSD masks the lower bits of the offset when writing to PCI_CONFIG_ADDRESS, and adds the lower bits of the offset to PCI_CONFIG_DATA. http://fxr.watson.org/fxr/source/amd64/pci/pci_cfgreg.c#L206 This patch changes the accessing method in OSv to the FreeBSD way. Tested on QEMU/KVM and VMware. Reviewed-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Takuya ASADA <syuu@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
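The FreeBSD-style split the patch adopts can be sketched as pure address arithmetic (constants per the PCI local bus spec; the function names are hypothetical, not OSv's driver code): mask the low bits out of the address register, add them back when choosing the data port.

```cpp
#include <cstdint>

// Standard legacy PCI configuration-mechanism ports.
constexpr std::uint16_t PCI_CONFIG_ADDRESS = 0xcf8;
constexpr std::uint16_t PCI_CONFIG_DATA    = 0xcfc;

// Value written to PCI_CONFIG_ADDRESS: the low two bits of the register
// offset are masked off, so the address is always dword-aligned...
inline std::uint32_t pci_cfg_address(std::uint8_t bus, std::uint8_t dev,
                                     std::uint8_t fn, std::uint8_t offset) {
    return 0x80000000u | (std::uint32_t(bus) << 16) | (std::uint32_t(dev) << 11)
         | (std::uint32_t(fn) << 8) | (offset & ~0x03u);
}

// ...and re-applied when picking the I/O port for the data read, so a 16-bit
// read at offset 2 (the device ID) hits the right half of the dword.
inline std::uint16_t pci_cfg_data_port(std::uint8_t offset) {
    return PCI_CONFIG_DATA + (offset & 0x03u);
}
```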
-
Nadav Har'El authored
This patch starts to solve both issue #142 ("Support MONOTONIC_CLOCK") and issue #81 (use <chrono> for time). First, it adds an uptime() function to the "clock" interface, and implements it for kvm/xen/hpet by returning the system time from which we subtract the system time at boot (but not adding any correction for wallclock). Second, it adds a new std::chrono-based interface to this clock, in a new header file <osv/clock.hh>. Instead of the old-style clock::get()->uptime(), one should prefer osv::clock::uptime::now(). This returns a std::chrono::time_point which is type-safe, in the sense that: 1. It knows what its epoch is (i.e., that it belongs to osv::clock::uptime), and 2. It knows what its units are (nanoseconds). This allows the compiler to prevent a user from confusing measurements from this clock with those from other clocks, or making mistakes in its units. Third, this patch implements clock_gettime(MONOTONIC_CLOCK), using the new osv::clock::uptime::now(). Note that though the new osv::clock::uptime is almost identical to std::chrono::steady_clock, they should not be confused. The former is actually OSv's implementation of the latter: steady_clock is implemented by the C++11 standard library using the Posix clock_gettime, and that is implemented (in this patch) using osv::clock::uptime. With this patch, we're *not* done with either issues #142 or #81. For issue #142, i.e., for supporting MONOTONIC_CLOCK in timerfd, we need OSv's timers to work on uptime(), not on clock::get()->time(). For issue #81, we should add a osv::clock::wall type too (similar to what clock::get()->time() does today, but more correctly), and use either osv::clock::wall or osv::clock::uptime everywhere that clock::get()->time() is currently used in the code. clock::get()->time() should be removed. Reviewed-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
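The type-safety argument can be seen with std::chrono::steady_clock as a stand-in for osv::clock::uptime (the commit notes the latter is what actually backs the former on OSv): the time_point carries its clock and units in its type, so mixing clocks is a compile error rather than a silent bug.

```cpp
#include <chrono>

// Illustration only, using the standard steady_clock rather than OSv's
// osv::clock::uptime: same type-safety property the commit describes.
static long elapsed_ns() {
    auto t0 = std::chrono::steady_clock::now();
    auto t1 = std::chrono::steady_clock::now();
    // Subtracting two time_points of the same clock yields a duration with
    // known units; subtracting time_points of different clocks won't compile.
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```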
-
Tomasz Grabiec authored
Previously the parameter was read from the generated Makefile, which was not re-generated on incremental build. The fix is to move the default to build.mk; this way the default will always be picked unless masked by a command line argument. Fixes #153 Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Jan 09, 2014
-
Tomasz Grabiec authored
To start netserver inside OSv just do:

make image=netperf
sudo scripts/run.py -nv

Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Raphael S. Carvalho authored
This problem was found when running 'tests/tst-zfs-mount.so' multiple times. The first time, all tests succeed; however, a subsequent run would fail at the test 'mkdir /foo/bar', with the error message reporting that the target file already exists. The test basically creates a directory /foo/bar, renames it to /foo/bar2, then removes /foo/bar2. How could /foo/bar still be there? Quite simple. Our shutdown function calls unmount_rootfs(), which will attempt to unmount zfs with the flag MNT_FORCE; however, the flag is not being passed to zfs_unmount(), nor does unmount_rootfs() itself test the return status (which was always getting failures previously). So OSv is really being shut down while there is remaining data waiting to be synced with the backing store. As a result, inconsistency. This problem was fixed by passing the flag to VFS_UNMOUNT, which will now unmount the fs properly on sudden shutdowns. Signed-off-by:
Raphael S. Carvalho <raphaelsc@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
Processing of this manifest was inside JVM-specific code, which meant the manifest was not processed if there was no java application in the image. For example:

make image=empty check
...
run_main(): cannot execute tests/tst-af-local.so. Powering off.
Test tst-af-local.so FAILED
make: *** [check] Error 1

Let's move it to the main manifest processing function. Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Jan 08, 2014
-
Tomasz Grabiec authored
In some workloads it noticeably improves performance. I measured a 6% increase in netperf throughput on my laptop. Object file size is only slightly bloated:

loader.elf (O2): 47246227
loader.elf (O3): 51272625 (+8.5%)

Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Pekka Enberg authored
Changes:
- web: Added /upload view class
- Shell: Rewrite 'ls' and add formatting/sort flags
- Update the jvm API to be more verbose
- Add REST API specification: api, os, jvm
Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
The current implementation of memmove is a PITA (I mean the bread, of course) to decode if a fault happens. We have very little control of where exactly in the code the fault happens, therefore it is difficult to reason about it. This patch implements memmove in terms of memcpy + memcpy_backwards. For those, we can have specific fixups at the possible fault sites, which will allow us to decode the faults with ease. Note that originally, the only reason why the first branch was not a memcpy was that we would like to handle alignment. Since our implementation of memcpy is fast enough, we can just ignore that and we will end up being even faster. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
This patch provides a backwards version of memcpy. It works all the same, but starts the copy from the end (dst + n, src + n) and works down, instead of from the beginning (dst, src). That is needed by memmove when the source and destination operands overlap. Being a nonstandard interface, I've just named it "memcpy_backwards". Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
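The scheme in the two patches above can be sketched in plain C++: memmove expressed as a forward memcpy or a backwards copy, depending on overlap. The real OSv versions are optimized and carry per-site fault fixups; this is just the decomposition.

```cpp
#include <cstddef>
#include <cstring>

// Sketch only: byte-at-a-time backwards copy, from the last byte down to
// the first. OSv's real memcpy_backwards is an optimized assembly routine.
static void* memcpy_backwards(void* dst, const void* src, std::size_t n) {
    auto d = static_cast<char*>(dst);
    auto s = static_cast<const char*>(src);
    while (n--) {
        d[n] = s[n];
    }
    return dst;
}

// memmove in terms of memcpy + memcpy_backwards, as the patch describes.
static void* memmove_sketch(void* dst, const void* src, std::size_t n) {
    // If dst overlaps the tail of src, a forward copy would clobber source
    // bytes before reading them, so copy backwards; otherwise forward is safe.
    if (dst > src && static_cast<const char*>(src) + n > dst) {
        return memcpy_backwards(dst, src, n);
    }
    return memcpy(dst, src, n);
}
```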
-
Glauber Costa authored
There was a small bug in the free memory tracking code that I've only hit recently. I was wrong in assuming that in the first branch for huge page allocation, where we erase the entire range, we should account for N bytes. This assumption came from my - wrong - understanding that we would do that when the range is exactly N bytes. Looking at the code with fresh eyes, that is definitely not what happens. In my previous stress test we were hitting the second branch all the time, so this bug lived on. Turns out that we will delete the entire page range, which may be bigger than N, the allocation size. Therefore, the whole range should be discounted from our calculation. The remainder (bigger than N part) will be accounted for later when we reinsert it in the page range, in the same way it is for the second branch of this code. Signed-off-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
Fixes issue with JVM failing when started with a debugger with the following message:

NPT ERROR: Cannot find nptInitialize

Missing openjdk files in usr.manifest were a fertile source of issues. This patch aims at making them less likely by adding all files except blacklisted ones to the image. It skips two files from the JRE which are broken links and whose inclusion would cause manifest upload failure:
- jre/lib/audio/default.sf2
- jre/lib/security/cacerts
These should be fixed incrementally. Reported-by:
Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
If a module has 'usr_files' or 'bootfs_files' declared, then their values will be interpreted as FileMaps and appended to the appropriate manifests. Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
When plain manifests are not enough this is a concise alternative with improved expressiveness. It allows declaring exclude and include patterns. It's python based. Example:

m = FileMap()
m.add('${OSV_BUILD_PATH}/tests').to('/tests') \
    .include('**/*.so') \
    .exclude('host/**')

Declared mappings can be saved in manifest form or be subject to further processing. To save in manifest format:

save_as_manifest(m, 'my.manifest')

Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
This patch ensures java files are copied to the guest image only when the 'java' module is included. Modules can pull it in explicitly by stating: require('java') or implicitly, by creating api.run_java() run configurations. In the future we could consider moving api.run_java() into a java meta-module. Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
No functional changes, just renames to more adequate names. Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
No need to create an empty bootfs.manifest anymore. Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
Currently importing a module from a module definition would fail because we cannot import a module with the same name (module.py) recursively; __import__ will complain that we removed 'module' from sys.modules. There is a simple solution to this problem: we can use runpy.run_path(), which works like a charm. In addition, we cache loaded modules so that we don't have to load the file twice. Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
Signed-off-by:
Tomasz Grabiec <tgrabiec@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Nadav Har'El authored
In his review of timerfd.cc, Avi asked that I simplify the implementation by having a single "timerfd" object (instead of two I had - timerfd_file and timerfd_object), and by using a single mutex instead of the complex combination of mutexes and atomic variable. This new version indeed does this. It should be easier to understand this code, and it is 30 lines shorter. The performance of this code is slightly inferior to the previous one - in particular poll() now locks and unlocks a mutex - but this should be negligible in practice. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
Raphael S. Carvalho authored
This patch improves the command by adding useful info for debugging ZFS in general, and also addresses some stylistic issues. The new output is as follows:

(gdb) osv zfs
:: ZFS TUNABLES ::
zil_replay_disable: 0
zfs_nocacheflush: 0
zfs_prefetch_disable: 0
zfs_no_write_throttle: 0
zfs_txg_timeout: 5
zfs_write_limit_override: 0
vdev_min_pending: 4
vdev_max_pending: 10
:: ARC SIZES ::
Actual ARC Size: 122905056
Target size of ARC: 1341923840
Min Target size of ARC: 167740480
Max Target size of ARC: 1341923840
Most Recently Used (MRU) size: 670961920 (50.00%)
Most Frequently Used (MFU) size: 670961920 (50.00%)
:: ARC EFFICIENCY ::
Total ARC accesses: 42662
ARC hits: 41615 (97.55%)
ARC MRU hits: 12550 (30.16%)
Ghost Hits: 0
ARC MFU hits: 29045 (69.79%)
Ghost Hits: 0
ARC misses: 1047 (2.45%)
Prefetch workload ratio: 0.0097%
Prefetch total: 412
Prefetch hits: 20
Prefetch misses: 392
Total Hash elements: 1053
Max Hash elements: 1053
Hash collisions: 13
Hash chains: 11

Reviewed-by:
Glauber Costa <glommer@cloudius-systems.com> Signed-off-by:
Raphael S. Carvalho <raphaelsc@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-
- Jan 07, 2014
-
Nadav Har'El authored
A previous patch renamed mutex.cc to spinlock.cc. This fixes the build.mk dependency to make the code compile again... Sorry about that. Signed-off-by:
Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Pekka Enberg <penberg@cloudius-systems.com>
-