Zsh Mailing List Archive
Re: Slurping a file (was: more spllitting travails)
- X-seq: zsh-users 29487
- From: Roman Perepelitsa <roman.perepelitsa@xxxxxxxxx>
- To: Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx>
- Cc: Zsh Users <zsh-users@xxxxxxx>
- Subject: Re: Slurping a file (was: more spllitting travails)
- Date: Mon, 15 Jan 2024 09:53:05 +0100
- Archived-at: <https://zsh.org/users/29487>
On Sun, Jan 14, 2024 at 11:10 PM Bart Schaefer
<schaefer@xxxxxxxxxxxxxxxx> wrote:
>
> On Sun, Jan 14, 2024 at 2:34 AM Roman Perepelitsa
> <roman.perepelitsa@xxxxxxxxx> wrote:
> >
> > On Sat, Jan 13, 2024 at 9:02 PM Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx> wrote:
> > >
> > > IFS= read -rd '' file_content <file
> >
> > In addition to being unable to read files with nul bytes, this
> > solution suffers from additional drawbacks:
> >
> > - It's impossible to distinguish EOF from I/O error.
>
> Pretty sure you can do that by examining $ERRNO on nonzero status?
I wouldn't do that except for debugging. In general, you can examine
errno only after calling functions that explicitly document how they
set it. If this part is not documented, you have to assume the function
may set errno to anything, on success as well as on error. Most libc
functions are also allowed to set errno to anything on success.
In this specific case perhaps `read` calls `malloc` after an I/O
error, which may trash errno. Or perhaps at the end of `read <file`
the file descriptor is closed, which again may trash errno. I haven't
verified either of these things. I am merely suggesting why `read`
conceivably could fail to propagate errno from an I/O error in the
absence of explicit guarantees in the docs.
> I'm curious whether
>   setopt nomultibyte
>   read -u 0 -k 8192 ...
> is actually that much slower in a slurp-like loop.
It is slightly *faster*. For smaller files the difference is about
25%. From 512KB and up there is no discernible difference.
> Another thought: Use -c count option to get number of bytes read and
> -s $size option to specify buffer size. If (( $count == $size )) then
> double $size for the next read.
This does not seem to help, although this might be dependent on the
device and filesystem. Here's a benchmark for various file sizes
(rows) and various fixed buffer sizes (columns):
 n    fsize     1KB     2KB     4KB     8KB    16KB    32KB    64KB
 1        0      41      43      43      43      51      52      53
 2        1      47      48      49      48      57      57      59
 3        2      48      48      48      48      56      57      58
 4        4      49      49      48      49      62      61      59
 5        8      74      75      51      49      62      61      63
 6       16      47      51      49      49      57      61      63
 7       32      47      50      49      50      58      58      59
 8       64      54      53      49      50      59      58      71
 9      128      50      50      51      51      59      60      61
10      256      49      52      51      51      60      61      63
11      512      53      55      55      54      64      64      65
12     1024      58      61      60      61      57      68      71
13     2048      77      72      71      74      83      83      83
14     4096     112     102      88      89     107     100     108
15     8192     188     153     152     145     161     163     140
16    16384     343     290     270     259     265     240     225
17    32768     658     577     427     471     499     495     489
18    65536    1281    1082     983     771     938     827     937
19   131072    2659    2214    2046    1952    1893    1928    1506
20   262144    4818    4608    4195    4254    3810    3955    3043
21   524288   10174    8967    7502    6382    7632    6142    7148
22  1048576   21591   18205   16424   15691   15243   14327   14889
23  2097152   41156   36087   32731   31840   30104   30090   29913
24  4194304   89814   72949   66447   62716   60998   60252   59485
25  8388608  191579  147195  125987  116327  121544  122384  122631
4KB and 8KB buffers perform best in this benchmark across all file
sizes. Given that 8KB is the default for sysread, there is no apparent
reason to use `-s`.
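
For reference, a minimal sketch of a slurp loop that relies on the default sysread buffer size might look like the following. This is an illustration, not the exact function discussed earlier in the thread: the name `zslurp` and the variable names are assumptions. It leans on the return statuses documented for sysread in the zsh/system module (0 when at least one byte was read, 5 when zero bytes were read, which usually means end of file), so EOF and a read error can be told apart without inspecting $ERRNO after `read`.

```zsh
# Sketch only: zslurp is a hypothetical helper name, not a zsh builtin.
zmodload zsh/system

# Read all of stdin into the global parameter REPLY.
zslurp() {
  emulate -L zsh
  local -a content
  local chunk ret
  while true; do
    sysread chunk                 # default buffer size, no -s
    ret=$?
    case $ret in
      0) content+=("$chunk") ;;   # at least one byte was read
      5) break ;;                 # zero bytes read: end of file
      *) return $ret ;;           # parameter error or read error
    esac
  done
  typeset -g REPLY=${(j::)content}
}
```

It could be used as `zslurp <file && print -rn -- "$REPLY"`. Collecting chunks in an array and joining once at the end is just one way to build the result.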
> > typeset -g REPLY=${(j::)content}
>
> Why the typeset here? Just assign?
Just a habit from using warn_create_global in my scripts. It catches
typos and missing `local` declarations quite well.
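
A small sketch of what that habit buys, with made-up function and parameter names (`leaky`, `explicit`, `reply_typo`): under warn_create_global, a plain assignment that creates a global from inside a function produces a warning on stderr, while `typeset -g` marks the global as intentional and stays quiet.

```zsh
setopt warn_create_global

# Hypothetical example functions, for illustration only.
leaky() {
  reply_typo=oops        # no `local`: zsh warns that a global was created
}

explicit() {
  typeset -g REPLY=ok    # deliberate global: no warning
}
```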
> Sadly there's another utility named "slurp":
>
>   slurp
>   cli utility to select a region in a Wayland compositor
That's too bad: "slurp" is a well-known moniker for reading the full
content of a file (https://www.google.com/search?q=file+slurp).
Perhaps zslurp?
Roman.