Nadav Har'El
authored
Our read() and write(), and their variants (pread, pwrite, readv, writev, preadv, pwritev) all shared the same bug when it comes to a partial read or write: they returned EWOULDBLOCK (EAGAIN) instead of returning successfully with the number of bytes actually written or read, as they should have. In the internals of the BSD read and write operations (e.g., sosend_generic) each operation returns *both* an error number and a number of bytes left. But at the end, the system call is expected to return just one of them - either an error *or* a number of bytes. The existing read()/write() code, when it saw the internals returning an error code, always returned it and ignored the number of bytes. This was wrong: When the error is EWOULDBLOCK and the number of bytes is non-zero, we should return this number of bytes (i.e., a successful partial write), *not* the EWOULDBLOCK error. This bug went unnoticed almost since the dawn of OSv, because partial reads and writes are not common. For example, a write() to a blocking socket will always return after the entire write is successful, and will not partially succeed. Only when we write to an O_NONBLOCK socket, will it be possible to see a partial write - But even then, we would need a pretty large write() to see it only partially succeeding. But this bug is very noticable when running the Jetty Web server (see issue At some point it's like the response was restarted (complete with a second copy of the headers). In Jetty's demo this was seen as half-shown images, as well as corrupt output when fetching large text files like /test/da.txt. Turns out that Jetty sends static responses in a surprisingly efficient (for Java code...) way, using a single system call for the entire response: It mmap()s the file it wishes to send, and then uses one writev() call to send two arrays: The HTTP headers (built in malloc()ed memory), and the file itself (from mmapped memory). So Jetty tries to write even a 1MB file in one huge writev() call. But there's an added twist: It does so with the socket configured to O_NONBLOCK. So for large writes, the write will only partially succeed (empirically, only about 50KB will succeed), and Jetty will notice the partial write and continue writing the rest - until the whole file is sent. With the bug we had, part of the request will have been written, but Jetty still thought the write didn't write anything so it would start writing again from the beginning - causing the weird sort of response corruption we've been seeing. This patch also includes a test case which confirms this bug, and its fix. In this test (tst-tcp-nbwrite), two threads communicate over a TCP socket (on the loopback interface), one thread write()s a very large buffer and the other receives what it can. We try this two times - once on a blocking socket and once on a non-blocking socket. In each case we expect the number of bytes written by one thread (return from write()) and the number read by the second thread (return from read()) to be the same. With the bug we had, in the non-blocking case we saw write() returning -1 (with errno=EWOULDBLOCK) but read returned over 50,000 bytes, causing the test to fail. Fixes #257. Signed-off-by:Nadav Har'El <nyh@cloudius-systems.com> Signed-off-by:
Avi Kivity <avi@cloudius-systems.com>