Zsh Mailing List Archive

Re: read -d $'\200' doesn't work with set +o multibyte (and [PATCH])



> 2022/12/16 17:29, Oliver Kiddle <opk@xxxxxxx> wrote:
> 
>>> +  read -ed $'\xc2'
>>> +0:read delimited by a single byte terminates if the byte is part of a multibyte character
>>> +<one£two
>>> +>one
>> 
>> Is this really what the standard requires (or will require)?
>> Breaking in the middle of a valid multibyte character looks
>> rather odd to me.
> 
> The proposed standard wording appears to only talk about the case of the
> delimiter consisting of "one single-byte character". $'\xc2' is not a
> valid UTF-8 character so my interpretation is that they are leaving this
> undefined.
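
A byte dump shows why the test input above splits where it does: in UTF-8, "£" is encoded as the two bytes 0xC2 0xA3, so the delimiter $'\xc2' is the lead byte of that character rather than a complete character. This is a minimal sketch, assuming a POSIX printf, od and tr; hex_bytes is a hypothetical helper name, not something from the patch.

```shell
# Hypothetical helper: print the bytes of a printf-style string as hex.
# The argument is used as the printf format so that octal escapes such
# as \302 are expanded into raw bytes.
hex_bytes() {
    printf "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

# "one£two", with the pound sign written as its UTF-8 bytes \302\243
hex_bytes 'one\302\243two'
# prints 6f6e65c2a374776f
```

The c2 a3 pair in the middle is the pound sign, so a reader that stops at the raw byte c2 leaves "one" before it and a3 followed by "two" after it.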

I thought the "one single-byte character" wording applied only when the
C or POSIX locale is in use.

> Behaviour that treats the input as raw bytes for a raw byte delimiter
> is consistent. This retains compatibility with the way things
> work for a non-multibyte locale. Not all files are valid UTF-8 and it
> can be useful to force things to work at a raw byte level.

I was thinking it would be enough if we could do 'byte-by-byte' analysis
by using the C/POSIX locale (or by setting the MULTIBYTE option to off).
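
As a rough sketch of what byte-by-byte handling under the C locale looks like, the following splits a string at a raw byte using plain parameter expansion instead of read -d, so it does not depend on how any particular shell version implements read. The helper name head_before_byte and its octal-code argument are assumptions made for this illustration only.

```shell
# Hypothetical helper: print the part of $1 that precedes the first
# occurrence of the single byte whose octal code is $2.
# The body runs in a subshell so the locale change does not leak out.
head_before_byte() (
    LC_ALL=C                    # force byte semantics for matching
    export LC_ALL
    delim=$(printf "\\$2")      # build the raw byte from its octal code
    printf '%s\n' "${1%%"$delim"*}"
)

# "one£two" in UTF-8, cut at the byte 0xC2 (octal 302)
head_before_byte "$(printf 'one\302\243two')" 302
```

With the locale forced to C, the shell has no notion of a two-byte "£", so the match on 0xC2 is an ordinary single-byte comparison.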

On the web page Stephane mentioned:
https://austingroupbugs.net/view.php?id=243#c6091

"When the current locale is not the C or POSIX locale, pathnames can contain bytes that do not form part of a valid character, and therefore portable applications need to ensure that the current locale is the C or POSIX locale when using read with arbitrary pathnames as input."

But I'm not familiar with this type of document.




