Zsh Mailing List Archive
Re: Slurping a file (was: more splitting travails)
- X-seq: zsh-users 29472
- From: Roman Perepelitsa <roman.perepelitsa@xxxxxxxxx>
- To: Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx>
- Cc: Zsh Users <zsh-users@xxxxxxx>
- Subject: Re: Slurping a file (was: more splitting travails)
- Date: Sun, 14 Jan 2024 11:34:00 +0100
- Archived-at: <https://zsh.org/users/29472>
- In-reply-to: <CAH+w=7aT-gbt7PRo=uvPK5=+rR3X-PE7nEssOkh+=fxwdeG_7w@mail.gmail.com>
- List-id: <zsh-users.zsh.org>
- References: <ca1761f1-6d8f-452a-b16d-2bfce9076e25@eastlink.ca> <CAH+w=7ZJsr7hGRvD8f-wUogPcGt0DMOcPyiYMpcwCsbBNkRwuQ@mail.gmail.com> <CAA=-s3zc5a+PA7draaA=FmXtwU9K8RrHbb70HbQN8MhmuXTYrQ@mail.gmail.com> <CAH+w=7bAWOF-v36hdNjaxBB-5rhjsp97mAtyESyR2OcojcEFUQ@mail.gmail.com> <205735b2-11e1-4b5e-baa2-7418753f591f@eastlink.ca> <CAH+w=7Y5_oQL20z7mkMUGSLnsdc9ceJ3=QqdAHVRF9jDZ_hZoQ@mail.gmail.com> <CAA=-s3x4nkLST56mhpWqb9OXUQR8081ew63p+5sEsyw5QmMdpw@mail.gmail.com> <CAH+w=7Yi+M1vthseF3Awp9JJh5KuFoCbFjLa--a22BGJgEJK_g@mail.gmail.com> <CAN=4vMpexntEq=hZcmsiXySy-2ptXMvBKunJ1knDkkS+4sYYLA@mail.gmail.com> <CAH+w=7aT-gbt7PRo=uvPK5=+rR3X-PE7nEssOkh+=fxwdeG_7w@mail.gmail.com>
On Sat, Jan 13, 2024 at 9:02 PM Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Jan 12, 2024 at 9:39 PM Roman Perepelitsa
> <roman.perepelitsa@xxxxxxxxx> wrote:
> >
> > The standard trick here is to print an extra character after the
> > content of the file and then remove it. This works when capturing
> > stdout of commands, too.
>
> This actually led me to the best (?) solution:
>
> IFS= read -rd '' file_content <file
>
> With IFS set to empty, newlines are not stripped. Of course this
> still only works if the file does not contain nul bytes; the -d
> delimiter has to be something that's not in the file.
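To spell out the sentinel trick I mentioned above (a minimal sketch;
the trailing x is arbitrary and "file" is a placeholder):

  # Command substitution strips trailing newlines, so print one extra
  # byte after the content and remove it after the assignment.
  content="$(cat -- file; print -n x)" && content=${content%x}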
Besides being unable to read files with nul bytes, the read-based
solution has further drawbacks:
- It's impossible to distinguish EOF from I/O error.
- It's slow when reading from non-file file descriptors.
- It's slower than the optimized sysread-based slurp (see below) for
larger files.
Conversely, sysread-based slurp can read the full content of any file
descriptor quickly and report success if and only if it manages to
read until EOF. Its only downside is that it can be up to 2x slower
for tiny files.
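The EOF/error distinction comes from sysread's return status: it
returns 5 when it reads zero bytes (end of data) and other non-zero
values on genuine errors. A minimal sketch, inside a function:

  zmodload zsh/system
  local buf
  if sysread buf; then
    :  # read at least one byte into buf
  elif (( $? == 5 )); then
    :  # clean EOF: zero bytes left to read
  else
    :  # a genuine I/O (or usage) error
  fi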
> > sysread 'content[$#content+1]' && continue
>
> You can speed this up a little by using the -c option to sysread to
> get back a count of bytes read, and accumulate that in another var to
> avoid having to re-calculate $#content on every loop.
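Something like this, presumably (an untested sketch inside a function;
n and len are illustrative names):

  local content
  local -i n len
  # -c n stores the number of bytes just read in n, so len tracks the
  # end of content without recomputing $#content on every iteration.
  while sysread -c n 'content[len+1]'; do
    (( len += n ))
  done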
Indeed, this would be faster, but the code would still have quadratic
time complexity: every append to the growing scalar copies the entire
accumulated string, so n chunks cost O(n^2) in total. Storing each
chunk in its own array element and joining them once at the end avoids
this. Here's a version with linear time complexity:
function slurp() {
  emulate -L zsh -o no_multibyte
  zmodload zsh/system || return
  local -a content
  local -i i
  while true; do
    # Store each chunk in its own array element: O(1) per chunk.
    sysread 'content[++i]' && continue
    # sysread returns 5 on EOF; anything else is a genuine error.
    (( $? == 5 )) || return
    break
  done
  # Join all chunks once at the end.
  typeset -g REPLY=${(j::)content}
}
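Usage looks like this (REPLY holds the content on success; the file
and command names are placeholders):

  slurp <file && file_content=$REPLY
  some-command | slurp && output=$REPLY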
(I am not certain it's linear. I've benchmarked it for files up to
512MB in size, and it is linear in practice.)
I've benchmarked read and slurp for reading files and pipes.
emulate -L zsh -o pipe_fail -o no_multibyte
zmodload zsh/datetime || return

local -i i len

function bench() {
  local REPLY
  local -F start end
  # Assigning to a float evaluates the RHS as a math expression.
  start=EPOCHREALTIME
  eval $1
  end=EPOCHREALTIME
  # Verify that the full content was read.
  (( $#REPLY == len )) || return
  # zsh's printf evaluates %d arguments as math expressions.
  printf ' %10d' '1e6 * (end - start)' || return
}

printf '%2s %7s %10s %10s %10s %10s\n' \
  n size read-file slurp-file read-pipe slurp-pipe || return

for ((i = 1; i != 26; ++i)); do
  len='i == 1 ? 0 : 1 << (i - 2)'
  # Create a file named $i with len random bytes and no nuls.
  head -c $len </dev/urandom | tr '\0' x >$i || return
  # Read the file once to warm the cache.
  <$i >/dev/null || return
  printf '%2d %7d' i len || return
  # read-file
  bench 'IFS= read -rd "" <$i' || return
  # slurp-file
  bench 'slurp <$i || return' || return
  # read-pipe
  bench '<$i | IFS= read -rd ""' || return
  # slurp-pipe
  bench '<$i | slurp || return' || return
  print || return
done
Here's the output (best viewed with a fixed-width font):
 n    size  read-file slurp-file  read-pipe slurp-pipe
 1       0         74        107       1908       2068
 2       1         52        126       2182       1931
 3       2         52        111       1863       2471
 4       4         65        150       2097       2028
 5       8         58        159       1849       2073
 6      16         61        118       1934       2089
 7      32         73        123       1867       2235
 8      64         73        120       2067       2033
 9     128        102        122       1904       2172
10     256        129        115       2025       2114
11     512        254        123       2070       2089
12    1024        372        137       2441       2190
13    2048        762        156       2624       2132
14    4096       1306        177       3488       2500
15    8192       2486        263       4446       2540
16   16384       4718        390       6565       3140
17   32768      13919        953      13524       4323
18   65536      20965       1195      21532       5195
19  131072      41741       2124     127089      11325
20  262144      81777       4214     461189      12515
21  524288     161077       8342    1068388      21149
22 1048576     312015      16330    2321501      37422
23 2097152     606270      31752    4773261      67625
24 4194304    1291121      61298   10253544     154340
25 8388608    2534093     135694   19551480     264041
The second column is the file size in bytes, ranging from 0 to 8MB.
The remaining four columns list the time it takes, in microseconds, to
read the file with each of the four methods.
Observations from the data:
- All routines appear to have linear time complexity.
- For small files, read is up to twice as fast as slurp.
- For files over 256 bytes in size, slurp is faster.
- With slurp, reading from a pipe takes about 2x as long as reading
from a file. With read, the penalty is 8x.
- For an 8MB file, slurp is 20 times faster than read when reading
from a file, and 70 times faster when reading from a pipe.
I am tempted to declare slurp the winner here.
Roman.