    Improve performance of Buf::get_*() (#195) · 51e435b7
    kohensu authored
    The new implementation tries to read the data directly from bytes(), which is
    possible most of the time. If bytes() does not hold enough data, it falls back
    to the previous code: copy the needed bytes into a temporary buffer before
    returning the data.
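
A minimal sketch of the fast-path/fallback idea described above, not the actual patch. `MiniBuf`, `get_u32_be`, `SliceBuf` and `OneByteBuf` are hypothetical names introduced for illustration; only `bytes()`, `remaining()` and `advance()` come from the commit message.

```rust
/// Hypothetical stand-in for the three core `Buf` methods named in this commit.
trait MiniBuf {
    fn remaining(&self) -> usize;
    fn bytes(&self) -> &[u8];
    fn advance(&mut self, cnt: usize);

    /// Read a big-endian u32 (illustrative name, assumed for this sketch).
    fn get_u32_be(&mut self) -> u32 {
        const SIZE: usize = 4;
        assert!(self.remaining() >= SIZE, "buffer underflow");

        // Fast path: the current chunk already holds all the bytes we need,
        // so decode straight from it without an intermediate copy.
        let fast = {
            let chunk = self.bytes();
            if chunk.len() >= SIZE {
                Some(u32::from_be_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]))
            } else {
                None
            }
        };
        if let Some(value) = fast {
            self.advance(SIZE);
            return value;
        }

        // Fallback: gather the bytes chunk by chunk into a temporary buffer.
        let mut tmp = [0u8; SIZE];
        let mut filled = 0;
        while filled < SIZE {
            let n = {
                let chunk = self.bytes();
                let n = chunk.len().min(SIZE - filled);
                tmp[filled..filled + n].copy_from_slice(&chunk[..n]);
                n
            };
            assert!(n > 0, "bytes() returned an empty chunk");
            self.advance(n);
            filled += n;
        }
        u32::from_be_bytes(tmp)
    }
}

/// Contiguous buffer: `bytes()` exposes everything that is left.
struct SliceBuf<'a>(&'a [u8]);

impl<'a> MiniBuf for SliceBuf<'a> {
    fn remaining(&self) -> usize { self.0.len() }
    fn bytes(&self) -> &[u8] { self.0 }
    fn advance(&mut self, cnt: usize) { self.0 = &self.0[cnt..]; }
}

/// Fragmented buffer: `bytes()` exposes at most one byte per call.
struct OneByteBuf<'a>(&'a [u8]);

impl<'a> MiniBuf for OneByteBuf<'a> {
    fn remaining(&self) -> usize { self.0.len() }
    fn bytes(&self) -> &[u8] { &self.0[..self.0.len().min(1)] }
    fn advance(&mut self, cnt: usize) { self.0 = &self.0[cnt..]; }
}
```

Both buffer types return the same value; `SliceBuf` takes the fast path, while `OneByteBuf` always ends up in the fallback loop.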
    
    Here are the bench results:
                                   Before                After           x-faster
    get_f32::cursor             64 ns/iter (+/- 0)    20 ns/iter (+/- 0)    3.2
    get_f32::tbuf_1             77 ns/iter (+/- 1)    34 ns/iter (+/- 0)    2.3
    get_f32::tbuf_1_costly      87 ns/iter (+/- 0)    62 ns/iter (+/- 0)    1.4
    get_f32::tbuf_2            151 ns/iter (+/- 18)  160 ns/iter (+/- 1)    0.9
    get_f32::tbuf_2_costly     180 ns/iter (+/- 2)   187 ns/iter (+/- 2)    1.0
    
    get_f64::cursor             67 ns/iter (+/- 0)    21 ns/iter (+/- 0)    3.2
    get_f64::tbuf_1             80 ns/iter (+/- 0)    35 ns/iter (+/- 0)    2.3
    get_f64::tbuf_1_costly      82 ns/iter (+/- 3)    60 ns/iter (+/- 0)    1.4
    get_f64::tbuf_2            154 ns/iter (+/- 1)   164 ns/iter (+/- 0)    0.9
    get_f64::tbuf_2_costly     170 ns/iter (+/- 2)   187 ns/iter (+/- 1)    0.9
    
    get_u16::cursor             66 ns/iter (+/- 0)    20 ns/iter (+/- 0)    3.3
    get_u16::tbuf_1             77 ns/iter (+/- 0)    35 ns/iter (+/- 0)    2.2
    get_u16::tbuf_1_costly      85 ns/iter (+/- 2)    62 ns/iter (+/- 0)    1.4
    get_u16::tbuf_2            147 ns/iter (+/- 0)   154 ns/iter (+/- 0)    1.0
    get_u16::tbuf_2_costly     160 ns/iter (+/- 1)   177 ns/iter (+/- 0)    0.9
    
    get_u32::cursor             64 ns/iter (+/- 0)    20 ns/iter (+/- 0)    3.2
    get_u32::tbuf_1             77 ns/iter (+/- 0)    35 ns/iter (+/- 0)    2.2
    get_u32::tbuf_1_costly      91 ns/iter (+/- 2)    63 ns/iter (+/- 0)    1.4
    get_u32::tbuf_2            151 ns/iter (+/- 40)  157 ns/iter (+/- 0)    1.0
    get_u32::tbuf_2_costly     162 ns/iter (+/- 0)   180 ns/iter (+/- 0)    0.9
    
    get_u64::cursor             67 ns/iter (+/- 0)    20 ns/iter (+/- 0)    3.4
    get_u64::tbuf_1             78 ns/iter (+/- 0)    35 ns/iter (+/- 1)    2.2
    get_u64::tbuf_1_costly      87 ns/iter (+/- 1)    59 ns/iter (+/- 1)    1.5
    get_u64::tbuf_2            154 ns/iter (+/- 0)   160 ns/iter (+/- 0)    1.0
    get_u64::tbuf_2_costly     168 ns/iter (+/- 0)   184 ns/iter (+/- 0)    0.9
    
    get_u8::cursor              64 ns/iter (+/- 0)    19 ns/iter (+/- 0)    3.4
    get_u8::tbuf_1              77 ns/iter (+/- 0)    35 ns/iter (+/- 0)    2.2
    get_u8::tbuf_1_costly       68 ns/iter (+/- 0)    51 ns/iter (+/- 0)    1.3
    get_u8::tbuf_2              85 ns/iter (+/- 0)    43 ns/iter (+/- 0)    2.0
    get_u8::tbuf_2_costly       75 ns/iter (+/- 0)    61 ns/iter (+/- 0)    1.2
    get_u8::option              77 ns/iter (+/- 0)    59 ns/iter (+/- 0)    1.3
    
    Improvements for the basic std::io::Cursor implementation are clearly visible.
    
    The other implementations are specific to the bench tests and just map a static
    slice. The variants are:
     - tbuf_1: only one call to 'bytes()' is needed.
     - tbuf_2: two calls to 'bytes()' are needed to read more than one byte.
     - _costly versions are implemented with #[inline(never)] on 'bytes()',
       'remaining()' and 'advance()'.
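
The variants listed above might look roughly like the following. This is an assumed reconstruction, not the benchmark code from the repository: `MiniBuf`, `TBuf`, `TBufCostly`, `max_chunk` and `calls_needed` are hypothetical names. The `max_chunk` field limits how many bytes one `bytes()` call exposes, and the costly variant only adds `#[inline(never)]`.

```rust
/// Hypothetical stand-in for the three `Buf` methods the benchmarks exercise.
trait MiniBuf {
    fn remaining(&self) -> usize;
    fn bytes(&self) -> &[u8];
    fn advance(&mut self, cnt: usize);
}

/// Test buffer over a static slice. `max_chunk` caps how many bytes a single
/// `bytes()` call exposes (large for a tbuf_1-like buffer, 1 for tbuf_2-like).
struct TBuf {
    data: &'static [u8],
    pos: usize,
    max_chunk: usize,
}

impl MiniBuf for TBuf {
    fn remaining(&self) -> usize { self.data.len() - self.pos }
    fn bytes(&self) -> &[u8] {
        let end = self.pos.saturating_add(self.max_chunk).min(self.data.len());
        &self.data[self.pos..end]
    }
    fn advance(&mut self, cnt: usize) {
        assert!(cnt <= self.remaining());
        self.pos += cnt;
    }
}

/// Same behaviour, but every call is kept out of line to simulate a buffer
/// whose methods are expensive to call.
struct TBufCostly(TBuf);

impl MiniBuf for TBufCostly {
    #[inline(never)]
    fn remaining(&self) -> usize { self.0.remaining() }
    #[inline(never)]
    fn bytes(&self) -> &[u8] { self.0.bytes() }
    #[inline(never)]
    fn advance(&mut self, cnt: usize) { self.0.advance(cnt) }
}

/// Count how many `bytes()` calls are needed to consume `n` bytes.
fn calls_needed<B: MiniBuf>(buf: &mut B, n: usize) -> usize {
    assert!(buf.remaining() >= n);
    let mut calls = 0;
    let mut left = n;
    while left > 0 {
        let take = buf.bytes().len().min(left);
        assert!(take > 0, "bytes() returned an empty chunk");
        calls += 1;
        buf.advance(take);
        left -= take;
    }
    calls
}
```

With `max_chunk` set to 1, reading a two-byte value takes two `bytes()` calls, which is the situation where the temporary-buffer fallback is used.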
    
    The cases that are (slightly) slower correspond to implementations that are not
    really realistic: they never make more than one byte available at a time.