Skip to content
Snippets Groups Projects
  • Nadav Har'El's avatar
    ef169330
    vfs: fix partial non-blocking write · ef169330
    Nadav Har'El authored
    
    Our read() and write(), and their variants (pread, pwrite, readv, writev,
    preadv, pwritev) all shared the same bug when it comes to a partial read or
    write: they returned EWOULDBLOCK (EAGAIN) instead of returning successfully
    with the number of bytes actually written or read, as they should have.
    
    In the internals of the BSD read and write operations (e.g., sosend_generic)
    each operation returns *both* an error number and a number of bytes left.
    But at the end, the system call is expected to return just one of them -
    either an error *or* a number of bytes. The existing read()/write() code,
    when it saw the internals returning an error code, always returned it and
    ignored the number of bytes. This was wrong: When the error is EWOULDBLOCK
    and the number of bytes is non-zero, we should return this number of bytes
    (i.e., a successful partial write), *not* the EWOULDBLOCK error.
    
    This bug went unnoticed almost since the dawn of OSv, because partial reads
    and writes are not common. For example, a write() to a blocking socket will
    always return after the entire write is successful, and will not partially
    succeed. Only when we write to an O_NONBLOCK socket, will it be possible to
    see a partial write - But even then, we would need a pretty large write()
    to see it only partially succeeding.
    
    But this bug is very noticable when running the Jetty Web server (see issue
    At some point it's like the response was restarted (complete with a second
    copy of the headers). In Jetty's demo this was seen as half-shown images,
    as well as corrupt output when fetching large text files like /test/da.txt.
    
    Turns out that Jetty sends static responses in a surprisingly efficient
    (for Java code...) way, using a single system call for the entire response:
    It mmap()s the file it wishes to send, and then uses one writev() call to
    send two arrays: The HTTP headers (built in malloc()ed memory), and the
    file itself (from mmapped memory). So Jetty tries to write even a 1MB file
    in one huge writev() call. But there's an added twist: It does so with the
    socket configured to O_NONBLOCK. So for large writes, the write will only
    partially succeed (empirically, only about 50KB will succeed), and Jetty
    will notice the partial write and continue writing the rest - until the
    whole file is sent. With the bug we had, part of the request will have been
    written, but Jetty still thought the write didn't write anything so it would
    start writing again from the beginning - causing the weird sort of response
    corruption we've been seeing.
    
    This patch also includes a test case which confirms this bug, and its fix.
    In this test (tst-tcp-nbwrite), two threads communicate over a TCP socket
    (on the loopback interface), one thread write()s a very large buffer and
    the other receives what it can. We try this two times - once on a blocking
    socket and once on a non-blocking socket. In each case we expect the number
    of bytes written by one thread (return from write()) and the number read
    by the second thread (return from read()) to be the same. With the bug we
    had, in the non-blocking case we saw write() returning -1 (with
    errno=EWOULDBLOCK) but read returned over 50,000 bytes, causing the test
    to fail.
    
    Fixes #257.
    
    Signed-off-by: default avatarNadav Har'El <nyh@cloudius-systems.com>
    Signed-off-by: default avatarAvi Kivity <avi@cloudius-systems.com>
    ef169330
    History
    vfs: fix partial non-blocking write
    Nadav Har'El authored
    
    Our read() and write(), and their variants (pread, pwrite, readv, writev,
    preadv, pwritev) all shared the same bug when it comes to a partial read or
    write: they returned EWOULDBLOCK (EAGAIN) instead of returning successfully
    with the number of bytes actually written or read, as they should have.
    
    In the internals of the BSD read and write operations (e.g., sosend_generic)
    each operation returns *both* an error number and a number of bytes left.
    But at the end, the system call is expected to return just one of them -
    either an error *or* a number of bytes. The existing read()/write() code,
    when it saw the internals returning an error code, always returned it and
    ignored the number of bytes. This was wrong: When the error is EWOULDBLOCK
    and the number of bytes is non-zero, we should return this number of bytes
    (i.e., a successful partial write), *not* the EWOULDBLOCK error.
    
    This bug went unnoticed almost since the dawn of OSv, because partial reads
    and writes are not common. For example, a write() to a blocking socket will
    always return after the entire write is successful, and will not partially
    succeed. Only when we write to an O_NONBLOCK socket, will it be possible to
    see a partial write - But even then, we would need a pretty large write()
    to see it only partially succeeding.
    
    But this bug is very noticable when running the Jetty Web server (see issue
    At some point it's like the response was restarted (complete with a second
    copy of the headers). In Jetty's demo this was seen as half-shown images,
    as well as corrupt output when fetching large text files like /test/da.txt.
    
    Turns out that Jetty sends static responses in a surprisingly efficient
    (for Java code...) way, using a single system call for the entire response:
    It mmap()s the file it wishes to send, and then uses one writev() call to
    send two arrays: The HTTP headers (built in malloc()ed memory), and the
    file itself (from mmapped memory). So Jetty tries to write even a 1MB file
    in one huge writev() call. But there's an added twist: It does so with the
    socket configured to O_NONBLOCK. So for large writes, the write will only
    partially succeed (empirically, only about 50KB will succeed), and Jetty
    will notice the partial write and continue writing the rest - until the
    whole file is sent. With the bug we had, part of the request will have been
    written, but Jetty still thought the write didn't write anything so it would
    start writing again from the beginning - causing the weird sort of response
    corruption we've been seeing.
    
    This patch also includes a test case which confirms this bug, and its fix.
    In this test (tst-tcp-nbwrite), two threads communicate over a TCP socket
    (on the loopback interface), one thread write()s a very large buffer and
    the other receives what it can. We try this two times - once on a blocking
    socket and once on a non-blocking socket. In each case we expect the number
    of bytes written by one thread (return from write()) and the number read
    by the second thread (return from read()) to be the same. With the bug we
    had, in the non-blocking case we saw write() returning -1 (with
    errno=EWOULDBLOCK) but read returned over 50,000 bytes, causing the test
    to fail.
    
    Fixes #257.
    
    Signed-off-by: default avatarNadav Har'El <nyh@cloudius-systems.com>
    Signed-off-by: default avatarAvi Kivity <avi@cloudius-systems.com>