- Apr 28, 2014
-
-
Gleb Natapov authored
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Gleb Natapov authored
If a dentry has an elevated reference count during rename (for instance, a file is open), it is not destroyed and stays hashed under the old path. As a result, the renamed file can still be accessed by its old name.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Takuya ASADA authored
Move the static variables for alarm() and related functions into an itimer class. The itimer class offers the interfaces needed to implement setitimer()/getitimer(), so setitimer()/getitimer() are added as well. alarm() is now rewritten as just a wrapper function that calls setitimer().
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Takuya ASADA authored
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Nadav Har'El authored
When a small allocation is requested with large alignment, we ignored the alignment, and as a consequence posix_memalign() or alloc_phys_contiguous_aligned() could crash when it failed to achieve the desired alignment. This is not a common case (usually, size >= alignment, and the new C11 aligned_alloc() even supports only this case), but still it might happen, and we saw it in cloudius-systems/capstan#75. When size < alignment, this patch changes the size so we can achieve the desired alignment. For small alignments, this means setting size=alignment, so for example to get an alignment of 1024 bytes we need at least a 1024-byte allocation. This is a waste of memory, but as these allocations are rare, we expect this to be acceptable. For large alignments, e.g., alignment=8192, we don't need size=alignment but we do need size to be large enough so we'll use malloc_large() (malloc_large() already supports arbitrarily large alignments). This patch also adds test cases to tst-align.so to test alignments larger than the desired size. Fixes #271 and cloudius-systems/capstan#75.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
Asias He authored
The AHCI spec says cmd_table needs to be aligned to 128 bytes; respect it. The AHCI spec does not specify the alignment requirement of _cmd_list and _recv_fis. To be safe, we align them to their size, i.e. 1024 and 256 bytes. This fixes the issue reported by Roman Shaposhnik:
$ capstan run -p vbox
Created instance: i1398297574
OSv v0.07
Assertion failed: !(reinterpret_cast<uintptr_t>(ret) & (align - 1)) (/home/penberg/osv/core/mempool.cc: alloc_phys_contiguous_aligned: 1378)
[backtrace]
0x398316 <memory::alloc_phys_contiguous_aligned(unsigned long, unsigned long)+214>
0x3542bb <ahci::port::setup()+283>
0x3549ae <ahci::port::port(unsigned int, ahci::hba*)+142>
0x354a5f <ahci::hba::scan()+111>
0x354d34 <ahci::hba::hba(pci::device&)+148>
0x354e07 <ahci::hba::probe(hw::hw_device*)+119>
0x33e65c <hw::driver_manager::register_driver(std::function<hw::hw_driver* (hw::hw_device*)>)+188>
0x33c8c3 <hw::device_manager::for_each_device(std::function<void (hw::hw_device*)>)+51>
0x33e3de <hw::driver_manager::load_all()+78>
0x20edf5 <do_main_thread(void*)+629>
0x3ff0a5 <sync+69>
0x3a3eaa <thread_main_c+26>
0x361585 <thread_main+7>
Signed-off-by: Asias He <asias@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- Apr 27, 2014
-
-
Avi Kivity authored
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
-
- Apr 25, 2014
-
-
Dmitry Fleytman authored
There are two types of HVM AMIs on Amazon:
1. Windows-based, which support all instance types
2. Linux-based, which support only instances larger than medium
Linux-based AMIs suit OSv better because their default security policy exactly fits the OSv model (SSH access) and doesn't require reconfiguration. Windows-based AMIs require reconfiguration of the default security policy in order to allow SSH access, but on the other hand allow running OSv on free-tier-eligible micro instances, which is good for enthusiasts and initial evaluations. In order to support both options, this patch extends the release scripts with the ability to choose between Windows- and Linux-based templates during AMI creation.
Signed-off-by: Dmitry Fleytman <dmitry@daynix.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Pekka Enberg authored
This reverts commit 532abb89 which makes Mr. Jenkins unhappy.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
The test verifies that:
* samples are collected by OSv
* extraction via `trace extract` succeeds
* `trace summary` shows the expected tracepoint names
* listing via `trace list` shows the expected tracepoints
* network packet parsing works
This test is run as part of the standard 'make check'. Note: this single test can be executed as easily as:
$ scripts/test.py --test tracing_smoke_test
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
The command runs GDB, which is very chatty. This will spoil our clean output of 'make check' once the tracing test is added. Let's hide GDB output unless it fails or the tracefile is not created.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
The --tcpdump switch enables in-line decoding of net_packet* samples so that packet content is displayed in human-readable form next to the other samples. This makes it easy to correlate packets with other samples.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
The 'scripts/trace.py pcap-dump' command processes all packet-capturing samples and outputs a pcap stream, the format used by tcpdump. It can be used together with tcpdump to display packets in human-readable form like this:
scripts/trace.py pcap-dump | tcpdump -r - | less
Because the pcap stream is so often piped to tcpdump, I also introduce a shorthand for the above:
scripts/trace.py tcpdump
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
There was no way to sniff packets going through OSv's loopback interface, and I faced a need to debug in-guest TCP traffic. Packets are logged using the tracing infrastructure. Packet data is serialized as sample data up to a limit, which is currently hardcoded to 128 bytes. To enable capturing of packets, just enable the tracepoints named:
- net_packet_loopback
- net_packet_eth
Raw data can be seen in `trace list` output. Better presentation methods will be added in the following patches. This may also become useful when debugging network problems in the cloud, where we have no ability to run tcpdump on the host.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Tomasz Grabiec authored
A tracepoint argument which extends 'blob_tag' will be interpreted as a range of byte-sized values. The storage required to serialize such an object is proportional to its size. I need this to implement storage-friendly packet capturing using the tracing layer. It could also be used to capture variable-length strings. The current limit (50 chars) is too short for some paths passed to vfs calls; with variable-length encoding, we could set a more generous limit.
Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Pekka Enberg authored
-
Avi Kivity authored
If the rcu threads need memory, let them have it, since they will use it to free even more memory.
Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Avi Kivity authored
malloc() must wait for memory, and since page table operations can allocate memory, it must be able to dip into the reserve pool. free() should indicate it is a reclaimer.
Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Avi Kivity authored
We already have a facility to indicate that a thread is a reclaimer and should be allowed to allocate reserve memory (since that memory will be used to free memory). Extend it to allow indicating that a particular code section is used to free memory, rather than the entire thread.
Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Nadav Har'El authored
After the previous patches, when we try to run an executable we cannot read (e.g., a directory - see issue #94), a "struct error" exception is thrown out of osv::run, and nobody catches it, so the user sees a somewhat puzzling "uncaught exception" error. With this patch, we catch the read-error exception inside osv::run(), and when it happens, just return a normal load failure (nullptr). E.g., trying to run a directory now results in a normal failure:
$ scripts/run.py -e /
OSv v0.07-39-g03feb99
run_main(): cannot execute /. Powering off.
Fixes #94.
The osv::run() API currently (before this patch, and also after it) doesn't have any way to say *why* the loading failed - it could be that the executable was a directory, that it was not an ELF shared object, or that it was a shared object without a main - in all cases the return value is nullptr. In the future this should probably change.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Nadav Har'El authored
read(fileref, ...) and write(fileref, ...), when they notice an error, currently assert(), causing a crash which looks to non-expert users like an OSv bug. This caused, for example, bug #94, where doing osv::run() on a directory, instead of an executable, caused such an assertion failure. With this patch, read() and write() throw an exception (a struct error) instead of asserting. This fixes #94, as the user now sees an uncaught exception instead of a buggy-looking assertion failure. In the next patch we'll catch this exception, so the user won't even see that.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Gleb Natapov authored
No longer used.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
The bug that caused it to be disabled should be fixed now.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Now vma_list_mutex is used to protect against races between ARC buffer mapping by the MMU and eviction by ZFS. The problem is that MMU code calls into ZFS with vma_list_mutex held, so on that path all ZFS-related locks are taken after vma_list_mutex. An attempt to acquire vma_list_mutex during ARC buffer eviction, while many of the same ZFS locks are already held, causes a deadlock. This was solved by using trylock() and skipping an eviction if vma_list_mutex could not be acquired, but it appears that some mmapped buffers are destroyed not during eviction, but after writeback, and this destruction cannot be delayed. That calls for a locking-scheme redesign. This patch introduces arc_lock, which has to be held during access to read_cache. It prevents simultaneous eviction and mapping. arc_lock should be the innermost lock held on any code path, and the code is changed to adhere to this rule. For that, the patch replaces the ARC_SHARED_BUF flag with a new b_mmaped field. The reason is that access to the b_flags field is guarded by hash_lock, and it is impossible to guarantee the same order between hash_lock and arc_lock on all code paths. Dropping the need for hash_lock is a nice solution.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Currently page_allocator returns a page to a page mapper and the latter populates a pte with it. Sometimes page allocation and pte population need to appear atomic, though. For instance, in the case of a pagecache we want to prevent page eviction before the pte is populated, since page eviction clears the pte; but if allocation and mapping are not atomic, the pte can be populated with stale data after eviction. With the current approach, a very wide-scoped lock is needed to guarantee atomicity. Moving pte population into page_allocator allows for much simpler locking.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Current code assumes that for the same file and the same offset ZFS will always return the same ARC buffer, but this appears not to be the case: ZFS may create a new ARC buffer while an old one is undergoing writeback. It means that we need to track the mapping between file/offset and mmapped ARC buffers ourselves, which is exactly what this patch is about. It adds a new kind of cached page that holds pointers to an ARC buffer, and stores them in a new read_cache map.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
All pagecache functions run under vma_list_lock, so no additional locking is needed.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Unmap pages as soon as possible instead of waiting for max_pages to accumulate. This will allow freeing pages outside of vma_list_mutex in the future.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Useful for debugging cache-related problems.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
A sys_read() error should be propagated to the application, but for now assert here rather than allowing silent memory corruption.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
rmap will usually contain only one element, so using an unordered_set for it is a little heavy. boost::variant allows us to use a direct pointer in the common case and fall back to unordered_set only when more than one element is added.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
- Apr 24, 2014
-
-
Gleb Natapov authored
Write permission should not be granted to ptes that have no write permission because they are COW, but currently there is no way to distinguish between write protection due to vma permissions and write protection due to COW. Use a bit reserved for software use in the pte as a marker for COW ptes and check it during permission changes.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Gleb Natapov authored
Mapping a page as dirty saves the CPU from doing an additional memory access to write out the dirty bit when the access succeeds.
Signed-off-by: Gleb Natapov <gleb@cloudius-systems.com>
-
Nadav Har'El authored
Posix allows read() on directories in some filesystems. However, Linux always returns EISDIR in this case, and because we're emulating Linux, so should we, for every filesystem. All our filesystems except ZFS (e.g., ramfs) already return EISDIR when reading a directory, but ZFS doesn't, so this patch adds the missing check in ZFS. This patch is related to issue #94: the first step to fixing #94 is to return the right error when reading a directory. This patch also adds a test case, which can be compiled both on OSv and Linux, to verify they both have the same behavior. Before the patch, the test succeeded on Linux but failed on OSv when the directory is on ZFS. Instead of fixing zfs_read like I do in this patch, I could have also fixed sys_read() in vfs_syscalls.cc, which is the top layer of all read() operations, and done there:
if (fp->f_dentry && fp->f_dentry->d_vnode->v_type == VDIR) { return EISDIR; }
to cover all the filesystems. I decided not to do that, because all filesystems except ZFS already have this check, and because the lower layers like zfs_read() already have more natural access to d_vnode.
Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Takuya ASADA authored
txq_encap decrements txq.avail for each mbuf, so we need to increment by the same number here.
Reviewed-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
This patch implements a version of flock that only performs error checking. It assumes any valid operation succeeds, since we are using a single-process model.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-
Glauber Costa authored
The jemalloc memory allocator makes intense use of MADV_DONTNEED to flush pages it is no longer using. Respect that advice. Let's keep returning -1 for the remaining cases so we don't fool anybody.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
-