Zsh Mailing List Archive Messages sorted by: Reverse Date, Date, Thread, Author

Re: read -d $'\200' doesn't work with set +o multibyte (and [PATCH])

X-seq: zsh-workers 51165
From: Stephane Chazelas <stephane@xxxxxxxxxxxx>
To: Oliver Kiddle <opk@xxxxxxx>
Cc: Zsh hackers list <zsh-workers@xxxxxxx>
Subject: Re: read -d $'\200' doesn't work with set +o multibyte (and [PATCH])
Date: Sat, 10 Dec 2022 09:06:26 +0000
Archived-at: <https://zsh.org/workers/51165>
In-reply-to: <99492-1670616302.663548@1brw.o7tP.wgJL>
List-id: <zsh-workers.zsh.org>
Mail-followup-to: Oliver Kiddle <opk@xxxxxxx>, Zsh hackers list <zsh-workers@xxxxxxx>
References: <20221209154225.2z3lbtf422ypnmjx@chazelas.org> <99492-1670616302.663548@1brw.o7tP.wgJL>

2022-12-09 21:05:02 +0100, Oliver Kiddle:
> Stephane Chazelas wrote:
> > Even in a locale with a single-byte charmap, when multibyte is
> > off, I can't make read -d work when the delimiter is a byte >=
> > 0x80.
> 
> In my testing, it does work in a single-byte locale. I tested on
> multiple systems.
> 
> Looking at the multibyte implementation of read, the approach taken
> is to use a wchar_t for the delimiter and then maintain mbstate_t for
> the input. This supports a delimiter that can be any single unicode
> codepoint. In my testing this is working as intended. But note that \351
> alone is incomplete in UTF-8 terms so what wchar_t value should that be
> mapped to.

Note that here I'm talking of the case where multibyte is
*disabled* (zsh +o multibyte), and where UTF-8 (or any other
multibyte charset) is nowhere in the picture. As I said, with
multibyte on, it works for valid characters; in iso8859-15 on
GNU systems, that's 0..0x7f, 0xa0..0xff.

IIRC In other areas of the code, bytes that can't be decoded
into characters are decoded as 0xdc00 + byte.

$ grep -rnwi 0xdc00 .
./ChangeLog:12625:      invalid multibyte characters to 0xDC00 + index which is invalid
./Src/pattern.c:242:    ((wchar_t) (0xDC00 + STOUC(ch)))

See workers/36411 workers/36415

It would be great if something like that was done everywhere so
we can always deal with arbitrary arrays of bytes regardless of
the locale.

> Also interesting to consider is the range \x7f to \x9f in an ISO-8859-x
> locale. Those are duplicates of the control characters. In my testing
> with a single-byte locale \x89 as a delimiter will end input at a tab
> character but the converse (\t as a delimiter) will not terminate at
> \x89 in the input.

I can't reproduce here:

~$ LC_ALL=en_GB.iso885915 zsh +o multibyte -c "IFS= read -rd $'\x89' a <<< $'a\tb'; print -rn -- \$a" | hd
00000000  61 09 62 0a                                       |a.b.|
00000004
~$ LC_ALL=en_GB.iso885915 zsh -c "IFS= read -rd $'\x89' a <<< $'a\tb'; print -rn -- \$a" | hd
00000000  61 09 62 0a                                       |a.b.|
00000004
~$ LC_ALL=en_GB.UTF-8 zsh -c "IFS= read -rd $'\x89' a <<< $'a\tb'; print -rn -- \$a" | hd
00000000  61 09 62 0a                                       |a.b.|
00000004
~$ LC_ALL=en_GB.UTF-8 zsh +o multibyte -c "IFS= read -rd $'\x89' a <<< $'a\tb'; print -rn -- \$a" | hd
00000000  61 09 62 0a                                       |a.b.|
00000004

> My understanding of the proposed POSIX wording is that it requires
> the individual octet, regardless of any character mapping to be the
> delimiter. Does anyone track the austin list? Would be good if they can
> be persuaded to relax what they specify. The part I especially object to
> is requiring that the input does not contain null bytes. The fact that
> zsh can cope with nulls is often really useful. Why can't they leave
> that unspecified? I can understand wanting to standardise a lowest
> common denominator but that is punishing an existing richer
> implementation.

Not sure what you mean. The proposed text has:

 -d delim

    If delim consists of one single-byte character, that byte
    shall be used as the logical line delimiter. If delim is the
    null string, the logical line delimiter shall be the null
    byte. Otherwise, the behavior is unspecified.

That's added alongside xargs -0 and find's -prin0 to be able to
deal with arbitrary file names, so the point is for it to work
on input with NULs.

The:

 If the -d delim option is specified and delim consists of one
 single-byte character other than <newline>, the standard input
 shall contain zero or more characters, shall not contain any
 null bytes, and (if not empty) shall end with delim.

Is a requirement on the *application*, not the implementation.
That is, it only specifies what's meant to happen when the input
doesn't contain NULs.

So I think we're good here.

I'm susbscribed to both austin-group-l and zsh-workers but don't
follow them very closely. I try to mention things relevant to
zsh here when I spot them on austin-group-l and I try to argue
there about things that would conflict with the zsh way for no
good reason.

austin-group-l is not large volume, I would recommend at least
one zsh developer get in there. I see the maintainers of FreeBSD
sh, NetBSD sh, mksh, bash at least occasionally contributing
there. You can also get an account on their bug tracker. I've
got one and I'm not the maintainer of any software relevant to
POSIX.

Changes in the bug tracker are posted to the ML. It's often
preferable to add a comment on a ticket than post on the ML.

> One way forward would be to take the argument to -d as a literal and
> potentially multi-byte delimiter. UTF-8 has the property that a valid
> sequence can't occur within a longer sequence so for UTF-8 you would not
> need to worry about it finding a delimiter within a different
> character. This is not the case with combining characters but the
> current implementation will also stop at the uncombined character.
> There are other multi-byte encodings for which this is not true. I've
> no idea how relevant things like EUC-JP and Shift-JIS still are.

Things like Shift-JIS are unworkable. I don't expect anyone to
still be using them.

GB18030 and BIG5/BIG5-HKSCS may still be relevant. They don't
work on Shift state like Shift-JIS, but many of their characters
have bytes <= 0x7f, and zsh doesn't really work with them for
that reason.

$ echo αε | iconv -t BIG5-HKSCS | hd
00000000  a3 5c a3 60 0a                                    |.\.`.|
00000005

Simply *having* locales with those charsets opens your system up
to security vulnerabilities as you have alpha characters which
contain \ and `, special to the shell and many other things.

in practice not many things work with them, it's not just zsh;
I've noticed newer Debian/Ubuntu doesn't offer locales with them
any longer (though you can still generate some if you like).

[...]
> Should we document the fact that -d '' works like -d $'\0'? Perhaps mark
> this as being for compatibility with other shells? Fortunately, it does
> work as specified but this may only be by accident. When the -d feature
> was added, it was probably only checked that the behaviour with an empty
> delimiter was sane.

Yes, I agree it's worth documenting. AFAIK, read -d is from
ksh93. read -d '' likely works in bash (added there in 2.04) by
accident as well (first byte of a NUL-delimited string), it
didn't in ksh93.

IFS= read -rd '' is a well known coding pattern in bash.
read -d '' now works in ksh93u+m and mksh.

-d is likely used much more often with '' than with anything
else.

> > $ LC_ALL=en_GB.iso885915 ./Src/zsh +o multibyte
> > $ locale charmap
> > ISO-8859-15
> 
> What do you get with the following, I'd sooner trust this:
>   zmodload zsh/langinfo; echo $langinfo[CODESET]

Same ISO-8859-15

$ LC_ALL=en_GB.iso885915  ./Src/zsh +o multibyte -c "IFS= read -rd $'\351' a <<< a$'\351'b; print -rn -- \$a" | hd
00000000  61 e9 62 0a                                       |a.b.|
00000004

gdb under LC_ALL=en_GB.iso885915 luit

6402        if (OPT_ISSET(ops,'d')) {
(gdb)
6403            char *delimstr = OPT_ARG(ops,'d');
(gdb)
6407            if (isset(MULTIBYTE)) {
(gdb)
6412                wi = WEOF;
(gdb)
6413            if (wi != WEOF)
(gdb)
6416                delim = (wchar_t)((delimstr[0] == Meta) ?
(gdb)
6417                                  delimstr[1] ^ 32 : delimstr[0]);
(gdb)
6416                delim = (wchar_t)((delimstr[0] == Meta) ?
(gdb)
6421            if (SHTTY != -1) {
(gdb) p delim
$1 = -23 L'\xffffffe9'
(gdb) p delimstr
$2 = 0x7ffff7fa1790 "é"

(as delimstr is a signed char* instead of unsigned char I guess).

It works better after:

diff --git a/Src/builtin.c b/Src/builtin.c
index a7b7755a7..d650ca750 100644
--- a/Src/builtin.c
+++ b/Src/builtin.c
@@ -6414,9 +6414,9 @@ bin_read(char *name, char **args, Options ops, UNUSED(int func))
 	    delim = (wchar_t)wi;
 	else
 	    delim = (wchar_t)((delimstr[0] == Meta) ?
-			      delimstr[1] ^ 32 : delimstr[0]);
+			      STOUC(delimstr[1]) ^ 32 : STOUC(delimstr[0]));
 #else
-        delim = (delimstr[0] == Meta) ? delimstr[1] ^ 32 : delimstr[0];
+        delim = (delimstr[0] == Meta) ? STOUC(delimstr[1]) ^ 32 : STOUC(delimstr[0]);
 #endif
 	if (SHTTY != -1) {
 	    struct ttyinfo ti;

(I don't know if it's the proper way to cast, my C is rusty)

Including for bytes that don't map to any character in ISO8859-15:

$ LC_ALL=en_GB.iso885915 zsh -c "IFS= read -rd $'\x80' a <<< $'a\x80b'; print -rn -- \$a" | hd
00000000  61                                                |a|
00000001

So I guess that's the fix for my bug.

-- 
Stephane

Follow-Ups:
- Re: read -d $'\200' doesn't work with set +o multibyte (and [PATCH])
  - From: Oliver Kiddle
- Re: read -d $'\200' doesn't work with set +o multibyte (and [PATCH])
  - From: Jun T

References:
- read -d $'\200' doesn't work with set +o multibyte
  - From: Stephane Chazelas
- Re: read -d $'\200' doesn't work with set +o multibyte
  - From: Oliver Kiddle

Messages sorted by: Reverse Date, Date, Thread, Author