build.mk · 4aa2675eb6415e96945634ce27da4e5ee0c40c83 · Verlässliche Systemsoftware / projects / osv

10 years ago

vfs: fix partial non-blocking write · ef169330

Nadav Har'El authored 10 years ago


Our read() and write(), and their variants (pread, pwrite, readv, writev,
preadv, pwritev) all shared the same bug when it comes to a partial read or
write: they returned EWOULDBLOCK (EAGAIN) instead of returning successfully
with the number of bytes actually written or read, as they should have.

In the internals of the BSD read and write operations (e.g., sosend_generic)
each operation returns *both* an error number and a number of bytes left.
But at the end, the system call is expected to return just one of them -
either an error *or* a number of bytes. The existing read()/write() code,
when it saw the internals returning an error code, always returned it and
ignored the number of bytes. This was wrong: When the error is EWOULDBLOCK
and the number of bytes is non-zero, we should return this number of bytes
(i.e., a successful partial write), *not* the EWOULDBLOCK error.

This bug went unnoticed almost since the dawn of OSv, because partial reads
and writes are not common. For example, a write() to a blocking socket will
always return after the entire write is successful, and will not partially
succeed. Only when we write to an O_NONBLOCK socket, will it be possible to
see a partial write - But even then, we would need a pretty large write()
to see it only partially succeeding.

But this bug is very noticable when running the Jetty Web server (see issue
At some point it's like the response was restarted (complete with a second
copy of the headers). In Jetty's demo this was seen as half-shown images,
as well as corrupt output when fetching large text files like /test/da.txt.

Turns out that Jetty sends static responses in a surprisingly efficient
(for Java code...) way, using a single system call for the entire response:
It mmap()s the file it wishes to send, and then uses one writev() call to
send two arrays: The HTTP headers (built in malloc()ed memory), and the
file itself (from mmapped memory). So Jetty tries to write even a 1MB file
in one huge writev() call. But there's an added twist: It does so with the
socket configured to O_NONBLOCK. So for large writes, the write will only
partially succeed (empirically, only about 50KB will succeed), and Jetty
will notice the partial write and continue writing the rest - until the
whole file is sent. With the bug we had, part of the request will have been
written, but Jetty still thought the write didn't write anything so it would
start writing again from the beginning - causing the weird sort of response
corruption we've been seeing.

This patch also includes a test case which confirms this bug, and its fix.
In this test (tst-tcp-nbwrite), two threads communicate over a TCP socket
(on the loopback interface), one thread write()s a very large buffer and
the other receives what it can. We try this two times - once on a blocking
socket and once on a non-blocking socket. In each case we expect the number
of bytes written by one thread (return from write()) and the number read
by the second thread (return from read()) to be the same. With the bug we
had, in the non-blocking case we saw write() returning -1 (with
errno=EWOULDBLOCK) but read returned over 50,000 bytes, causing the test
to fail.

Fixes #257.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>

ef169330

History

vfs: fix partial non-blocking write

Nadav Har'El authored 10 years ago


Our read() and write(), and their variants (pread, pwrite, readv, writev,
preadv, pwritev) all shared the same bug when it comes to a partial read or
write: they returned EWOULDBLOCK (EAGAIN) instead of returning successfully
with the number of bytes actually written or read, as they should have.

In the internals of the BSD read and write operations (e.g., sosend_generic)
each operation returns *both* an error number and a number of bytes left.
But at the end, the system call is expected to return just one of them -
either an error *or* a number of bytes. The existing read()/write() code,
when it saw the internals returning an error code, always returned it and
ignored the number of bytes. This was wrong: When the error is EWOULDBLOCK
and the number of bytes is non-zero, we should return this number of bytes
(i.e., a successful partial write), *not* the EWOULDBLOCK error.

This bug went unnoticed almost since the dawn of OSv, because partial reads
and writes are not common. For example, a write() to a blocking socket will
always return after the entire write is successful, and will not partially
succeed. Only when we write to an O_NONBLOCK socket, will it be possible to
see a partial write - But even then, we would need a pretty large write()
to see it only partially succeeding.

But this bug is very noticable when running the Jetty Web server (see issue
At some point it's like the response was restarted (complete with a second
copy of the headers). In Jetty's demo this was seen as half-shown images,
as well as corrupt output when fetching large text files like /test/da.txt.

Turns out that Jetty sends static responses in a surprisingly efficient
(for Java code...) way, using a single system call for the entire response:
It mmap()s the file it wishes to send, and then uses one writev() call to
send two arrays: The HTTP headers (built in malloc()ed memory), and the
file itself (from mmapped memory). So Jetty tries to write even a 1MB file
in one huge writev() call. But there's an added twist: It does so with the
socket configured to O_NONBLOCK. So for large writes, the write will only
partially succeed (empirically, only about 50KB will succeed), and Jetty
will notice the partial write and continue writing the rest - until the
whole file is sent. With the bug we had, part of the request will have been
written, but Jetty still thought the write didn't write anything so it would
start writing again from the beginning - causing the weird sort of response
corruption we've been seeing.

This patch also includes a test case which confirms this bug, and its fix.
In this test (tst-tcp-nbwrite), two threads communicate over a TCP socket
(on the loopback interface), one thread write()s a very large buffer and
the other receives what it can. We try this two times - once on a blocking
socket and once on a non-blocking socket. In each case we expect the number
of bytes written by one thread (return from write()) and the number read
by the second thread (return from read()) to be the same. With the bug we
had, in the non-blocking case we saw write() returning -1 (with
errno=EWOULDBLOCK) but read returned over 50,000 bytes, causing the test
to fail.

Fixes #257.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>