Zsh Mailing List Archive
Re: Slurping a file (was: more splitting travails)
- X-seq: zsh-users 29472
- From: Roman Perepelitsa <roman.perepelitsa@xxxxxxxxx>
- To: Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx>
- Cc: Zsh Users <zsh-users@xxxxxxx>
- Subject: Re: Slurping a file (was: more splitting travails)
- Date: Sun, 14 Jan 2024 11:34:00 +0100
- Archived-at: <https://zsh.org/users/29472>
- In-reply-to: <CAH+w=7aT-gbt7PRo=uvPK5=+rR3X-PE7nEssOkh+=fxwdeG_7w@mail.gmail.com>
- List-id: <zsh-users.zsh.org>
- References: <ca1761f1-6d8f-452a-b16d-2bfce9076e25@eastlink.ca> <CAH+w=7ZJsr7hGRvD8f-wUogPcGt0DMOcPyiYMpcwCsbBNkRwuQ@mail.gmail.com> <CAA=-s3zc5a+PA7draaA=FmXtwU9K8RrHbb70HbQN8MhmuXTYrQ@mail.gmail.com> <CAH+w=7bAWOF-v36hdNjaxBB-5rhjsp97mAtyESyR2OcojcEFUQ@mail.gmail.com> <205735b2-11e1-4b5e-baa2-7418753f591f@eastlink.ca> <CAH+w=7Y5_oQL20z7mkMUGSLnsdc9ceJ3=QqdAHVRF9jDZ_hZoQ@mail.gmail.com> <CAA=-s3x4nkLST56mhpWqb9OXUQR8081ew63p+5sEsyw5QmMdpw@mail.gmail.com> <CAH+w=7Yi+M1vthseF3Awp9JJh5KuFoCbFjLa--a22BGJgEJK_g@mail.gmail.com> <CAN=4vMpexntEq=hZcmsiXySy-2ptXMvBKunJ1knDkkS+4sYYLA@mail.gmail.com> <CAH+w=7aT-gbt7PRo=uvPK5=+rR3X-PE7nEssOkh+=fxwdeG_7w@mail.gmail.com>
On Sat, Jan 13, 2024 at 9:02 PM Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Jan 12, 2024 at 9:39 PM Roman Perepelitsa
> <roman.perepelitsa@xxxxxxxxx> wrote:
> >
> > The standard trick here is to print an extra character after the
> > content of the file and then remove it. This works when capturing
> > stdout of commands, too.
>
> This actually led me to the best (?) solution:
>
> IFS= read -rd '' file_content <file
>
> With IFS set to empty, newlines are not stripped. Of course this
> still only works if the file does not contain nul bytes; the -d
> delimiter has to be something that's not in the file.
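To spell out the sentinel trick I mentioned above (a minimal sketch;
the trailing x is arbitrary and "file" is a placeholder):

  # Command substitution strips trailing newlines, so print one extra
  # byte after the content and remove it after the assignment.
  content="$(cat -- file; print -n x)" && content=${content%x}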
Besides being unable to read files with nul bytes, the read-based
solution has further drawbacks:
- It's impossible to distinguish EOF from I/O error.
- It's slow when reading from non-file file descriptors.
- It's slower than the optimized sysread-based slurp (see below) for
larger files.
Conversely, sysread-based slurp can read the full content of any file
descriptor quickly and report success if and only if it manages to
read until EOF. Its only downside is that it can be up to 2x slower
for tiny files.
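The EOF/error distinction comes from sysread's return status: it
returns 5 when it reads zero bytes (end of data) and other non-zero
values on genuine errors. A minimal sketch, inside a function:

  zmodload zsh/system
  local buf
  if sysread buf; then
    :  # read at least one byte into buf
  elif (( $? == 5 )); then
    :  # clean EOF: zero bytes left to read
  else
    :  # a genuine I/O (or usage) error
  fi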
> > sysread 'content[$#content+1]' && continue
>
> You can speed this up a little by using the -c option to sysread to
> get back a count of bytes read, and accumulate that in another var to
> avoid having to re-calculate $#content on every loop.
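Something like this, presumably (an untested sketch inside a function;
n and len are illustrative names):

  local content
  local -i n len
  # -c n stores the number of bytes just read in n, so len tracks the
  # end of content without recomputing $#content on every iteration.
  while sysread -c n 'content[len+1]'; do
    (( len += n ))
  done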
Indeed, this would be faster, but the code would still have quadratic
time complexity: every append to the growing scalar copies the entire
accumulated string, so n chunks cost O(n^2) in total. Storing each
chunk in its own array element and joining them once at the end avoids
this. Here's a version with linear time complexity:
function slurp() {
  emulate -L zsh -o no_multibyte
  zmodload zsh/system || return
  local -a content
  local -i i
  while true; do
    # Store each chunk in its own array element: O(1) per chunk.
    sysread 'content[++i]' && continue
    # sysread returns 5 on EOF; anything else is a genuine error.
    (( $? == 5 )) || return
    break
  done
  # Join all chunks once at the end.
  typeset -g REPLY=${(j::)content}
}
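Usage looks like this (REPLY holds the content on success; the file
and command names are placeholders):

  slurp <file && file_content=$REPLY
  some-command | slurp && output=$REPLY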
(I am not certain it's linear. I've benchmarked it for files up to
512MB in size, and it is linear in practice.)
I've benchmarked read and slurp for reading files and pipes.
emulate -L zsh -o pipe_fail -o no_multibyte
zmodload zsh/datetime || return

local -i i len

function bench() {
  local REPLY
  local -F start end
  # Assigning to a float evaluates the RHS as a math expression.
  start=EPOCHREALTIME
  eval $1
  end=EPOCHREALTIME
  # Verify that the full content was read.
  (( $#REPLY == len )) || return
  # zsh's printf evaluates %d arguments as math expressions.
  printf ' %10d' '1e6 * (end - start)' || return
}

printf '%2s %7s %10s %10s %10s %10s\n' \
  n size read-file slurp-file read-pipe slurp-pipe || return

for ((i = 1; i != 26; ++i)); do
  len='i == 1 ? 0 : 1 << (i - 2)'
  # Create a file named $i with len random bytes and no nuls.
  head -c $len </dev/urandom | tr '\0' x >$i || return
  # Read the file once to warm the cache.
  <$i >/dev/null || return
  printf '%2d %7d' i len || return
  # read-file
  bench 'IFS= read -rd "" <$i' || return
  # slurp-file
  bench 'slurp <$i || return' || return
  # read-pipe
  bench '<$i | IFS= read -rd ""' || return
  # slurp-pipe
  bench '<$i | slurp || return' || return
  print || return
done
Here's the output (best viewed with a fixed-width font):
 n    size  read-file slurp-file  read-pipe slurp-pipe
 1       0         74        107       1908       2068
 2       1         52        126       2182       1931
 3       2         52        111       1863       2471
 4       4         65        150       2097       2028
 5       8         58        159       1849       2073
 6      16         61        118       1934       2089
 7      32         73        123       1867       2235
 8      64         73        120       2067       2033
 9     128        102        122       1904       2172
10     256        129        115       2025       2114
11     512        254        123       2070       2089
12    1024        372        137       2441       2190
13    2048        762        156       2624       2132
14    4096       1306        177       3488       2500
15    8192       2486        263       4446       2540
16   16384       4718        390       6565       3140
17   32768      13919        953      13524       4323
18   65536      20965       1195      21532       5195
19  131072      41741       2124     127089      11325
20  262144      81777       4214     461189      12515
21  524288     161077       8342    1068388      21149
22 1048576     312015      16330    2321501      37422
23 2097152     606270      31752    4773261      67625
24 4194304    1291121      61298   10253544     154340
25 8388608    2534093     135694   19551480     264041
The second column is the file size in bytes, ranging from 0 to 8MB.
The remaining four columns list the time it takes, in microseconds, to
read the file with each of the four methods.
Observations from the data:
- All routines appear to have linear time complexity.
- For small files, read is up to twice as fast as slurp.
- For files over 256 bytes in size, slurp is faster.
- With slurp, reading from a pipe takes about 2x as long as reading
from a file. With read, the penalty is 8x.
- For an 8MB file, slurp is 20 times faster than read when reading
from a file, and 70 times faster when reading from a pipe.
I am tempted to declare slurp the winner here.
Roman.