Commits · 4513537e3b1bad7bc232385868d56fa89b6421f6 · Verlässliche Systemsoftware / projects / osv

Jan 27, 2014

clock: condvar::wait with a time point · 0f5992ca

Nadav Har'El authored 11 years ago


Replace the old function condvar::wait(mutex*, uint64_t) with one taking
a timepoint. This timepoint can use any clock which the timer supports,
namely osv::clock::uptime or osv::clock::wall (as usual, wall-clock timers
are not recommended, and are converted to an uptime timer at the point
of instantiation).

Leave a C-only function condvar_wait(condvar*, mutex*, s64) but comment on
what it takes.
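
The conversion of a wall-clock deadline into an uptime deadline "at the point of instantiation" can be sketched roughly as follows. This is only an illustration under assumptions: std::chrono::system_clock and std::chrono::steady_clock stand in for osv::clock::wall and osv::clock::uptime, and to_uptime_deadline is a hypothetical helper, not a function from the OSv tree.

```cpp
#include <cassert>
#include <chrono>

// Stand-ins (assumption): the standard clocks play the role of
// osv::clock::wall and osv::clock::uptime.
using wall_clock = std::chrono::system_clock;
using uptime_clock = std::chrono::steady_clock;

// Hypothetical helper: turn a wall-clock deadline into an uptime deadline,
// given one sample of each clock taken at (nearly) the same instant.
// After the conversion, later wall-clock adjustments no longer move the deadline.
inline uptime_clock::time_point to_uptime_deadline(
    wall_clock::time_point wall_deadline,
    wall_clock::time_point wall_now,
    uptime_clock::time_point uptime_now)
{
    // Time left until the deadline; negative if it has already passed.
    auto remaining = wall_deadline - wall_now;
    return uptime_now +
           std::chrono::duration_cast<uptime_clock::duration>(remaining);
}
```

A caller would sample both clocks once, convert, and from then on compare only against the uptime clock.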

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

clock: Use new clock APIs in sbwait implementation · ab098254

Nadav Har'El authored 11 years ago


Fix sbwait implementation to use the new <osv/clock.hh> APIs and the
monotonic clock.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

clock: Use new clock APIs in BSD time functions · ec81df3d

Nadav Har'El authored 11 years ago


Reimplement the BSD functions getmicrotime(9), getmicrouptime(9)
and variable "ticks", using the new clock APIs.

getmicrotime() returns the system time ("wall clock"), while getmicrouptime
and ticks return the time since boot.

I believe this is the correct implementation according to the FreeBSD
documentation, but our previous implementation didn't quite do this and
it also worked ;-) The previous implementation pretended, according to
getmicrouptime() and get_ticks(), that the system is up since 1970,
and yet the variable "time_uptime" (which FreeBSD has) is never updated,
and is fixed at 1 second :-)
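
The split between wall-clock time and time since boot described above can be sketched as follows. This is an illustration under assumptions, not the OSv code: std::chrono::system_clock and std::chrono::steady_clock stand in for the OSv clocks, timeval_like stands in for struct timeval, the value of hz is hypothetical, and the epoch of steady_clock is unspecified by the standard (it is often, but not necessarily, boot time).

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <ratio>

// Hypothetical tick rate; the commit message does not state the value of hz.
constexpr int hz = 1000;

// Minimal stand-in for the BSD struct timeval.
struct timeval_like {
    std::int64_t tv_sec;
    std::int64_t tv_usec;
};

// Split a non-negative duration into whole seconds and leftover microseconds.
template <class Rep, class Period>
timeval_like to_timeval(std::chrono::duration<Rep, Period> d)
{
    using namespace std::chrono;
    auto us = duration_cast<microseconds>(d);
    auto s = duration_cast<seconds>(us);
    return timeval_like{static_cast<std::int64_t>(s.count()),
                        static_cast<std::int64_t>((us - s).count())};
}

// Wall-clock flavour, in the spirit of getmicrotime().
inline timeval_like getmicrotime_sketch()
{
    return to_timeval(std::chrono::system_clock::now().time_since_epoch());
}

// Uptime flavour, in the spirit of getmicrouptime().
inline timeval_like getmicrouptime_sketch()
{
    return to_timeval(std::chrono::steady_clock::now().time_since_epoch());
}

// "ticks": uptime expressed in units of 1/hz seconds.
inline std::int64_t ticks_from_uptime(std::chrono::nanoseconds uptime)
{
    using tick = std::chrono::duration<std::int64_t, std::ratio<1, hz>>;
    return std::chrono::duration_cast<tick>(uptime).count();
}
```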

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

clock: Use new clock APIs in msleep implementation · 99eafa7e

Nadav Har'El authored 11 years ago


Fix msleep implementation to use the new <osv/clock.hh> APIs and the
monotonic clock.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

clock: Use new clock APIs in callout implementation · 3c30aa33

Nadav Har'El authored 11 years ago


Change callout implementation to use the new <osv/clock.hh> APIs and the
monotonic clock.

Since _callout.h now uses the C++ type osv::clock::uptime::time_point,
it can only be used from C++ code. All the relevant code is already C++.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

Jan 23, 2014

network: fix compile error · 1d641f38

Zhi Yong Wu authored 11 years ago


BUILD SUCCESSFUL

Total time: 35.396 secs
make -r -C build/release/ all
make[1]: Entering directory `/home/zwu/osv/build/release'
  CXX loader.o
  CXX runtime.o
  CXX drivers/vga.o
  CXX bsd/net.o
  CXX bsd/porting/networking.o
/home/zwu/osv/bsd/porting/networking.cc: In function ‘int osv::if_set_mtu(std::string, u16)’:
/home/zwu/osv/bsd/porting/networking.cc:43:32: error: missing braces around initializer for ‘char [16]’ [-Werror=missing-braces]
cc1plus: all warnings being treated as errors
make[1]: *** [bsd/porting/networking.o] Error 1
make[1]: Leaving directory `/home/zwu/osv/build/release'
make: *** [all] Error 2

Signed-off-by: Zhi Yong Wu <zwu.kernel@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

zfs: Fix zfs_inactive on unlinked znode cases · 08290fd5

Raphael S. Carvalho authored 11 years ago


This patch addresses a corner-case in our zfs_inactive which can potentially
leak a znode object.

*** Some background on znode/zfs_inactive ***
- Used to deallocate fs-specific data.

- Before destroying the znode, a DMU transaction is created to sync the znode
to the backing store *if* its z_atime_dirty is set (Not relevant to this
patch though).

- When removing a link, zfs_remove sets the field zp->z_unlinked of the
underlying znode if the number of links reached 0 (Simply put, not present in
the fs anymore).

*** The problem ***
The actual problem shows up when zfs_inactive is used on znodes with the
unlinked field set.

The code this patch wraps was originally added to speed up the call to
vrecycle, whose name partially explains itself: it first eliminates all
activity associated with the vnode, then puts the vnode back on a list of
free vnodes.

The OSv VFS layer doesn't support vrecycle, and the vrecycle call itself was
removed, yet our zfs_inactive still acts as if it were supported.

*** Solution ***
Let's fix this problem by simply wrapping around the test which prevented
zfs_inactive from working properly on unlinked znodes, thus leaking references
to the underlying mount point afterwards.

The commentary added into zfs_inactive also explains why these changes are
needed. It would also make things easier when people look at it in the future,
and try to understand why things are the way they are.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

zfs: Fix znode reference count leaks · 76c0caa7

Raphael S. Carvalho authored 11 years ago

The zfs_remove() function calls zfs_dirent_lock, which in turn calls
zfs_zget() which bumps up the underlying znode reference count once.

However, neither zfs_remove() nor zfs_rmdir() releases the reference
after using it. This prevents zfs_zinactive(), which is used to destroy
the znode object, from working properly. Another consequence is that each
znode holds a reference to the underlying mount point, keeping it busy
for unmount.

Fix the znode refcnt by calling zfs_zinactive after znode usage.
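
The leak pattern can be illustrated with a toy reference count. Everything in this sketch is hypothetical (znode_like, znode_hold, znode_release, remove_leaky, remove_fixed); it only models "take a reference, use the object, forget to drop it" and is not the ZFS code.

```cpp
#include <cassert>

// Toy model of a reference-counted object (not the real znode).
struct znode_like {
    int refcount = 0;
    bool destroyed = false;
};

// Models the reference bump done on lookup.
void znode_hold(znode_like& zp)
{
    ++zp.refcount;
}

// Models dropping a reference; teardown only happens on the last drop.
void znode_release(znode_like& zp)
{
    assert(zp.refcount > 0);
    if (--zp.refcount == 0) {
        zp.destroyed = true;
    }
}

// Before the fix: the reference taken for the operation is never dropped.
void remove_leaky(znode_like& zp)
{
    znode_hold(zp);
    // ... use zp ...
}

// After the fix: the reference is dropped once the operation is done.
void remove_fixed(znode_like& zp)
{
    znode_hold(zp);
    // ... use zp ...
    znode_release(zp);
}
```

In the leaky variant the count never returns to zero, so the object is never torn down and whatever it pins (here, by analogy, the mount point) stays busy.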

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

netfront: get IOCTL definitions from proper place · 75311f28

Dmitry Fleytman authored 11 years ago


There were two places with ioctl definitions, and the Xen netfront driver
was compiled with the IOCTL definitions from the wrong place.

Fixed by changing the include and deleting the file with the improper definitions.

Signed-off-by: Dmitry Fleytman <dmitry@daynix.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

Jan 22, 2014
- include: Move debug.hh to include/osv · 7809519b
  Pekka Enberg authored 11 years ago
  
  Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
- include: Move mempool.hh to include/osv · 9c95f49d
  Pekka Enberg authored 11 years ago
  
  Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
- include: Move preempt-lock.hh to include/osv · 5e374b7f
  Pekka Enberg authored 11 years ago
  
  Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
- include: Move mmio.hh to include/osv · 078d4732
  Pekka Enberg authored 11 years ago
  
  Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
- include: Move mmu.hh to include/osv · 9cb900b7
  Pekka Enberg authored 11 years ago
  
  Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
- include: Move align.hh to include/osv · 4473f2ca
  Pekka Enberg authored 11 years ago
  
  Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
- include: Move sched.hh to include/osv · fae5693e
  Pekka Enberg authored 11 years ago
  
  Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
Jan 21, 2014

net: optimize sblock(), sbunlock() · 1283d800

Avi Kivity authored 11 years ago


Instead of acquiring sockbuf::sb_mtx inside sblock() and sbunlock(), rely
on the caller to take the lock for us.  Expand existing lock hold regions
in callers to make it so.  This reduces acquisitions of sb_mtx.

As a side effect, copies to and from userspace are done under the lock.
This can affect MSG_NOWAIT with demand paging major faults, but these are
screwed anyway.
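
The effect on lock acquisitions can be sketched with a toy model. All names here are hypothetical (counting_mutex, sockbuf_like, the *_old and *_new functions), and the sketch leaves out the waiting that a real sblock() may have to do; it only counts how often the mutex is taken in each style.

```cpp
#include <cassert>
#include <mutex>

// Toy mutex that counts acquisitions (single-threaded illustration only).
struct counting_mutex {
    int acquisitions = 0;
    bool held = false;
    void lock()   { assert(!held); held = true; ++acquisitions; }
    void unlock() { assert(held); held = false; }
};

struct sockbuf_like {
    counting_mutex sb_mtx;
    bool locked = false;  // the "buffer in use" state managed by sblock()
};

// Old style: sblock()/sbunlock() take sb_mtx themselves.
inline void sblock_old(sockbuf_like& sb)
{
    std::lock_guard<counting_mutex> guard(sb.sb_mtx);
    sb.locked = true;
}

inline void sbunlock_old(sockbuf_like& sb)
{
    std::lock_guard<counting_mutex> guard(sb.sb_mtx);
    sb.locked = false;
}

// New style: the caller must already hold sb_mtx.
inline void sblock_new(sockbuf_like& sb)
{
    assert(sb.sb_mtx.held);
    sb.locked = true;
}

inline void sbunlock_new(sockbuf_like& sb)
{
    assert(sb.sb_mtx.held);
    sb.locked = false;
}

// One "transaction" in each style; returns how often sb_mtx was acquired.
inline int receive_old(sockbuf_like& sb)
{
    sblock_old(sb);
    {
        std::lock_guard<counting_mutex> guard(sb.sb_mtx);
        // ... touch the buffer under the lock ...
    }
    sbunlock_old(sb);
    return sb.sb_mtx.acquisitions;
}

inline int receive_new(sockbuf_like& sb)
{
    std::lock_guard<counting_mutex> guard(sb.sb_mtx);  // widened hold region
    sblock_new(sb);
    // ... touch the buffer under the lock ...
    sbunlock_new(sb);
    return sb.sb_mtx.acquisitions;
}
```

In this model the old style acquires the mutex three times per transaction and the new style once, at the price of a longer hold region.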

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

net: replace sockbuf::sb_rwlock with a waitqueue · ac5e19c6

Avi Kivity authored 11 years ago

sb_rwlock is used to serialize concurrent writers (or readers) to the same
socket buffer, but is quite expensive as it requires 4 atomic operations
per transaction, even if there is no contention.

Replace it with a waitqueue, and use the sockbuf::sb_mtx for serialization.
This still has exactly the same cost, but we can later move sblock() and
sbunlock() into contexts where the sockbuf::sb_mtx is already acquired.

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

net: fix sblock() · ec5c91e7

Avi Kivity authored 11 years ago


sblock() takes sb_rwlock for reading, and is the only such locker, so it
clearly has no effect.

Change it to acquire the lock for writing, so it serializes access to the
socket buffer as intended.

Bug introduced in 6296cbab.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

net: enable ifnet constructor/destructor · e0e4e11a

Avi Kivity authored 11 years ago


This allows placing C++ objects in ifnet.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

net: add comparison operator for in_addr · a81ca9e6

Avi Kivity authored 11 years ago


Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

net: add ntoh()/hton() for in_addr · 5f36ebb7

Avi Kivity authored 11 years ago


Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

bsd: reconcile sys/param.h with system header · ef235552

Avi Kivity authored 11 years ago


Include the system header and remove duplicate definitions.

Change some solaris imports to use _GNU_SOURCE to make rlim64_t defined
consistently.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

bsd: convert netport.c to C++ · a24adb4b

Avi Kivity authored 11 years ago


Allows making ifnet a C++ class.

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

Jan 20, 2014

net: convert sockbuf waiting to use waitqueues · d125e4a2

Avi Kivity authored 11 years ago

Add sockbuf::sb_cc_wq to use instead of msleep(&sb->sb_cc). This paves
the way for lockless wakeups for net channels, as the wait primitive is now
thread::wait_for() instead of msleep(). In addition this reduces lock
acquisition and improves netperf bandwidth by about 10%.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

net: remove sockbuf::test_fn() · 0e69a2d0

Avi Kivity authored 11 years ago


This was used to ensure all socket-using code was converted to C++, but not
cleaned up later.

Clean it up now.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

Jan 17, 2014

DHCP: Support MTU option · 69bf74a7

Dmitry Fleytman authored 11 years ago

This patch introduces support for the MTU option as described in
RFC 2132, section 5.1 (Interface MTU Option).

Amazon EC2 networking uses this option in some cases and it gives
throughput improvement of about 250% on big instances with 10G networking.

Netperf results for hi1.4xlarge instances, TCP_MAERTS test, OSv runs netserver:

Send buffer size   Throughput w/ patch (Mbps)   Throughput w/o patch (Mbps)   Improvement (%)
              32                      4912.29                       1386.28               254
              64                      4832.01                       1385.99               249
             128                      4835.09                       1401.46               245
             256                      4746.41                       1382.28               243
             512                      4849.04                       1375.23               253
            1024                       4631.8                       1356.69               241
            2048                      4859.59                       1371.92               254
            4096                      4864.99                       1383.67               252
            8192                      4627.07                       1364.05               239
           16384                      4868.73                       1366.48               256
           32768                      4822.69                       1366.63               253
           65536                      4837.67                       1353.87               257
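
The improvement column appears to be the relative gain of the patched throughput over the unpatched one, rounded to a whole percent. A small sketch of that arithmetic, with improvement_percent as a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Relative gain of the patched throughput over the unpatched one, in percent.
inline double improvement_percent(double with_patch_mbps,
                                  double without_patch_mbps)
{
    assert(without_patch_mbps > 0.0);
    return (with_patch_mbps / without_patch_mbps - 1.0) * 100.0;
}
```

For the first row, 4912.29 / 1386.28 is about 3.54, so the gain is about 254%.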

Signed-off-by: Dmitry Fleytman <dmitry@daynix.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

xen: Fix "error: unable to find string literal operator" · 6b5b6696

Zhi Yong Wu authored 11 years ago


Total time: 30.77 secs
make -r -C build/release/ all
make[1]: Entering directory `/home/zwu/osv/build/release'
  CXX bsd/sys/xen/gnttab.o
In file included from /home/zwu/osv/bsd/sys/xen/hypervisor.h:40:0,
                 from /home/zwu/osv/bsd/sys/xen/gnttab.cc:29:
/home/zwu/osv/bsd/machine/xen/hypercall.h: In function ‘int HYPERVISOR_set_trap_table(const trap_info_t*)’:
/home/zwu/osv/bsd/machine/xen/hypercall.h:146:9: error: unable to find string literal operator ‘operator"" STR’
/home/zwu/osv/bsd/machine/xen/hypercall.h: In function ‘int HYPERVISOR_mmu_update(mmu_update_t*, unsigned int, unsigned int*, domid_t)’:
/home/zwu/osv/bsd/machine/xen/hypercall.h:154:9: error: unable to find string literal operator ‘operator"" STR’
/home/zwu/osv/bsd/machine/xen/hypercall.h: In function ‘int HYPERVISOR_mmuext_op(mmuext_op*, unsigned int, unsigned int*, domid_t)’:
......
make[1]: *** [bsd/sys/xen/gnttab.o] Error 1
make[1]: Leaving directory `/home/zwu/osv/build/release'
make: *** [all] Error 2

Reviewed-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Zhi Yong Wu <zwu.kernel@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

vfs: 'struct file' to VOP_READ · 9f68c2d9

Pekka Enberg authored 11 years ago


Add 'struct file' to VOP_READ API. This is needed for procfs which
generates file contents at open() time and read() must operate on it,
not the vnode.
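
Why read() needs the open file rather than the vnode can be sketched with a toy model. The types and functions below (vnode_like, file_like, open_sketch, read_sketch, snapshot_text) are hypothetical and do not reflect the actual OSv VFS signatures: contents are generated once at open time and stored per open file, so each reader sees its own snapshot.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Toy vnode: knows how to generate contents, but holds none itself.
struct vnode_like {
    std::string (*generate)();
};

// Toy open file: carries the snapshot generated at open() time.
struct file_like {
    vnode_like* vp;
    std::string data;
    std::size_t offset;
};

inline file_like open_sketch(vnode_like& vp)
{
    return file_like{&vp, vp.generate(), 0};
}

// read() works on the per-file snapshot, not on the vnode.
inline std::size_t read_sketch(file_like& fp, char* buf, std::size_t len)
{
    std::size_t n = std::min(len, fp.data.size() - fp.offset);
    std::memcpy(buf, fp.data.data() + fp.offset, n);
    fp.offset += n;
    return n;
}

// Example generator whose output changes on every call.
inline std::string snapshot_text()
{
    static int calls = 0;
    return "snapshot " + std::to_string(++calls) + "\n";
}
```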

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

Jan 15, 2014

net: fix socket send/receive timeout · 9b9aa6cb

Avi Kivity authored 11 years ago


Microsecond values were passed as is instead of being scaled by hz.
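
The scaling in question can be sketched as below. This is an assumption-laden illustration: usec_to_ticks is a hypothetical helper, hz is passed in as a parameter because the message does not state its value, and rounding up (so that a short non-zero timeout never becomes zero ticks) is a design choice of this sketch.

```cpp
#include <cassert>

// Convert a timeout in microseconds to ticks of 1/hz seconds, rounding up.
inline long long usec_to_ticks(long long usec, int hz)
{
    assert(usec >= 0 && hz > 0);
    const long long usec_per_sec = 1000000;
    return (usec * hz + usec_per_sec - 1) / usec_per_sec;
}
```

With a hypothetical hz of 1000, a 1.5 second timeout is 1500 ticks; passing the raw value 1500000 as ticks would make the timeout a thousand times too long.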

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

debug: Change calls to printf on boot messages · f37b7e53

Eduardo Piva authored 11 years ago


Change some printf calls in boot messages so that they call
the appropriate debug function. This will enable OSv
to operate in silent mode.

Added debug.h header so we can link debug functions to C files.

Fixes #118

Signed-off-by: Eduardo Piva <efpiva@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

Jan 13, 2014

ZFS: remove one OSv ifdef · 622d8bba

Glauber Costa authored 11 years ago

For every OSv specific ifdef we remove in ZFS, God resurrects a kitten.

After some recent additions and fixes, that piece of code can now be
compiled out. It is the reclaimer code, so it is very welcome.

But beware: that does not mean we are reclaiming yet. That only means that we
wired up the ZFS ARC reclaiming process to the BSD notification system.

We now need to somehow wire that notification system with the OSv shrinking
infrastructure. That is the easy part. And after that, of course, balance the
calls between ARC and balloon. That is the hard part.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

reclaim: export address of the OSV reclaimer · 5a60e13d

Glauber Costa authored 11 years ago


ZFS will perform some checks to determine if the current calling "process"
is the reclaimer. Export the address of the reclaimer thread so that test
can work.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

bsd glue: simplify curproc so it returns a pointer · 36c0cebc

Glauber Costa authored 11 years ago


There is (currently commented out) code in ZFS that checks things like:

    if (curproc == pageproc) {
        /* Do something really great */
    }

The problem is that with our current implementation of curproc (designed for Xen)
that will break, because we will return a pointer to an on-stack variable that is
created on demand and only contains the pid of the process.

Returning the thread address will make those checks work, but we will be forced
to give up on accessing fields inside it altogether. If we *really* must, we can
have a structure that has the fields at the same offsets as our thread class.

But our thread class is defined in a .hh file, so *good luck* calculating the
offset of a field (say, id) at compile time so we can include in this other .h
file that contains exclusively C code. Since xen is the only user of the PID test,
and our resistance to changing the xen code is quite low (if at all), I'll just go
ahead and change it: storing the address of the process itself should allow us to
do compare tests the way ZFS does and get everything working.
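
The pointer-identity check that this change enables can be sketched as follows. All names are hypothetical stand-ins (thread_like, current_thread, curproc_sketch, running_in_reclaimer); only the comparison against pageproc mirrors the snippet quoted above.

```cpp
#include <cassert>

// Toy thread object; the real thread class is far richer.
struct thread_like {
    int id;
};

thread_like reclaimer_thread{1};
thread_like worker_thread{2};

// Stand-in for "the currently running thread".
thread_like* current_thread = &worker_thread;

// Stand-in for the exported address of the reclaimer thread.
thread_like* const pageproc = &reclaimer_thread;

// curproc as a stable address: the same thread always yields the same pointer.
thread_like* curproc_sketch()
{
    return current_thread;
}

// The ZFS-style identity check now works by plain pointer comparison.
bool running_in_reclaimer()
{
    return curproc_sketch() == pageproc;
}
```

A pointer to an object created on demand for each call would never compare equal to pageproc, which is why the address of the thread itself is used.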

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

msleep: make it accept any kind of mutex · 72d4d8f4

Glauber Costa authored 11 years ago

First of all, I am sorry. I am sorry Avi, Dor, Pekka, God, Dennis Ritchie, et caterva.
I am so very sorry. This is probably one of the ugliest things ever written by a C
programmer in the history of programming.

The story is: ZFS defines its own mutex of type kmutex_t, which is basically just an OSv
mutex in our implementation. In a piece of code currently commented out (not for long), it calls:

msleep(&needfree, &arc_reclaim_thr_lock, 0, "zfs:lowmem", 0);

The problem is that our msleep implementation expects a "struct mtx" which is our own
wrapper around mutex (Maybe that should be changed? Does anybody remember why it was
done this way?)

Keep in mind that we are going to great lengths not to change ZFS (ifdefing code out
is generally fine), so the casting solution could not be used. I've tried to change the
for-ZFS definitions of mutex in the BSD glue code, but then, after a couple of hours
I was still resolving conflicts with all the other parts that would break because they
were expecting a certain type that was now changed.

I eventually settled for the current ugly but functional solution: code msleep in a
way that it can accept any kind of mutex. That is really ugly because by "any
kind of mutex" I really mean any kind of crap the user passes and good bye type
safety altogether. But it works with minimal changes, and more importantly, with
all the changes being in *our* glue code.

If anybody has other ideas, I would be happy to try them out. But at this time,
I believe that to be the best compromise.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

msleep: wait in interruptible manner · 407301a5

Dmitry Fleytman authored 11 years ago


Signed-off-by: Dmitry Fleytman <dmitry@daynix.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

msleep: prepare for interruptible sleep implementation · b03f081c

Dmitry Fleytman authored 11 years ago


synch_port::msleep: merge time-out and non-time-out cases
into one conditional branch to avoid code duplication.

This both simplifies the code and makes future implementation of
interruption handling code for interruptible sleeps easier.

Signed-off-by: Dmitry Fleytman <dmitry@daynix.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

Jan 10, 2014

bsd: Simplify networking init message · c69e2d34

Pekka Enberg authored 11 years ago

Simplify networking boot initialization message as suggested by Tzach.

Suggested-by: Tzach Livyatan <tzach@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

Jan 09, 2014

zfs: Fix on-disk data inconsistency on shutdown · 2d93af3b

Raphael S. Carvalho authored 11 years ago

This problem was found when running 'tests/tst-zfs-mount.so' multiple times.
The first time, all tests succeed; however, a subsequent run would
fail at the test 'mkdir /foo/bar', with an error message reporting
that the target file already exists.

The test basically creates a directory /foo/bar, renames it to /foo/bar2,
then removes /foo/bar2. How could /foo/bar still be there?

Quite simple. Our shutdown function calls unmount_rootfs(), which
attempts to unmount zfs with the flag MNT_FORCE; however, the flag is not
passed to zfs_unmount(), nor does unmount_rootfs() itself check the
return status (which was always a failure previously).
So OSv is really being shut down while there is remaining data waiting to
be synced with the backing store. The result is inconsistency.

This problem was fixed by passing the flag to VFS_UNMOUNT which will now
unmount the fs properly on sudden shutdowns.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>

Jan 03, 2014

vfs: change the approach of releasing dentries during unmount · af466dbc

Raphael S. Carvalho authored 11 years ago


Currently, vflush is used in the unmount process to release remaining
dentries. vflush in turn calls vevict that is releasing dentries that
it doesn't own.
This behavior is neither correct nor good for the future of the VFS.

So Avi suggested switching to a different approach. We could only
release those dentries owned by the mountpoint when unmounting it as
there wouldn't be anything else in the dcache (given its functionality).

The problem was fixed by doing the following steps:
 - Drop vflush calls in sys_umount2, make vevict an empty function,
and remove vevict.

 - Created the function release_mp_dentries to release dentries of a mount
point which will be called by VFS_UNMOUNT. It cannot be called before
VFS_UNMOUNT as failures must be considered, neither after as the mount point
would be considered busy.
If this "rule" is not respected, the previously seen ZFS replay transaction
error would happen.
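
The idea of releasing only the dentries owned by the mount point being unmounted can be sketched with a toy dcache. The types and the function below (mount_like, dentry_like, release_mp_dentries_sketch) are hypothetical and much simpler than the real VFS structures; the actual release_mp_dentries may look quite different.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Toy mount point and dentry; each dentry remembers which mount owns it.
struct mount_like {
    std::string path;
};

struct dentry_like {
    std::string name;
    const mount_like* mp;
};

// Drop every dentry owned by mp and leave all others untouched.
// Returns the number of dentries released.
inline std::size_t release_mp_dentries_sketch(std::vector<dentry_like>& dcache,
                                              const mount_like& mp)
{
    auto first = std::remove_if(
        dcache.begin(), dcache.end(),
        [&mp](const dentry_like& d) { return d.mp == &mp; });
    auto released =
        static_cast<std::size_t>(std::distance(first, dcache.end()));
    dcache.erase(first, dcache.end());
    return released;
}
```

Unlike a blanket eviction, this never touches dentries that belong to another mount point.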

NOTE: vflush is currently duplicated in zfs unmount cases to address the problem
above. This patch fixes this duplication as well.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
