Zsh Mailing List Archive Messages sorted by: Reverse Date, Date, Thread, Author

Re: num_in_chars incremented after each mbrtowc()

X-seq: zsh-workers 37327
From: Sebastian Gniazdowski <sgniazdowski@xxxxxxxxx>
To: Peter Stephenson <p.w.stephenson@xxxxxxxxxxxx>
Subject: Re: num_in_chars incremented after each mbrtowc()
Date: Sun, 6 Dec 2015 17:40:33 +0100
Cc: Zsh hackers list <zsh-workers@xxxxxxx>
Dkim-signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type:content-transfer-encoding; bh=NZZsJTViSK/1JQXIWm1SI9V/ROSPxT6FNLPN/MIGXtg=; b=KoUd+LHEzp0FS1kfhceVQJpRCQS6QVUoePsRYMBGcK8vlgWnazzCz83XWF2ThKMCRO RKkyBLeenZVAjEEoPK6VdFKx6CIwOIf4qNMxf6QGoMOCCr4xC5ydBTPDROpbWuMpiGXJ VrUsiHC7JBjfBUgDmlDFlXSBjp3FmB/Kn63YmiDJVyw5DZ40dYUgTmK2eWNP/flFSKbt a5loH5dlgPRg+yZlWrWzBiH+YPVb+mfSa6Xr6ZhPA3dnTraP0OByw81C1G8Bv9n69Ixm wc3SQKJDMLazAIetSvwuMYObdbPj6Utc5oCwlfHDJ20AIEjS1DCfVh0pF7Ngtx6lb8qz MC1w==
In-reply-to: <20151206154956.104b10c6@ntlworld.com>
List-help: <mailto:zsh-workers-help@zsh.org>
List-id: Zsh Workers List <zsh-workers.zsh.org>
List-post: <mailto:zsh-workers@zsh.org>
Mailing-list: contact zsh-workers-help@xxxxxxx; run by ezmlm
References: <CAKc7PVCmApbPrfcArqxJcDX0nHN9NmuE7zqcrSZb51bSfGyuuQ@mail.gmail.com> <20151206154956.104b10c6@ntlworld.com>

I start with the second patch (which you haven't look at?) as it is
more justified. To put it simply – ptr is incremented, num_in_char
isn't. That's wrong.

To put it less simply, I've written a function that returns raw
pointer into string and sets prevcharlen and nextcharlen. These two
values come from num_in_char. If I don't add one more incrementation
of num_in_char to the "if (*ptr == Meta) {" block, characters are
displayed as errors (i.e. the question marks symbols).

The first patch is less justified. Only bytes of incomplete characters
are counted in num_in_chars and that is what the code wants. But if
one defines the variable as "holds current byte width of character" it
is clear that *every* call to mbrtowc() should increment that
variable. Currently last byte will not be counted because there will
be no MB_INCOMPLETE returned.

To put it simply, call to mbrtowc() must be followed by num_in_char++,
it cannot be anything different.

Best regards,
Sebastian Gniazdowski

On 6 December 2015 at 16:49, Peter Stephenson
<p.w.stephenson@xxxxxxxxxxxx> wrote:
> On Sun, 6 Dec 2015 10:37:21 +0100
> Sebastian Gniazdowski <sgniazdowski@xxxxxxxxx> wrote:
>> Hello,
>> while working my hands off on implementing display width handling in
>> params.c rather than subst.c I encountered a bug in mb_metastrlenend.
>> It will reveal itself only on improper unicode strings.
>
> I don't understand your patch: the change is to increment num_in_char in
> exactly the cases where it is deliberately set to 0 later to reflect the
> fact we've now got a complete character and the effect is included in
> num instead.
>
> Somebody complained about this function a couple of months ago and I
> explained, then, too; it suggests it needs some more comments, so I've
> added some.
>
> It may be the real difficulty is with the API, in which case you'll need
> to say (in words, not videos, please) what you're expecting.  As long as
> this is consistently dealt with in callers of the function it might be
> possible to change --- I guess you're only worried about the case for
> returning a width, which is uncommon in the code and indeed doesn't
> really have a well defined result for incomplete/invalid characters.
> Maybe you have a particular strategy in mind.
>
> Consequently I haven't looked at your other patch.
>
> pws
>
> diff --git a/Src/utils.c b/Src/utils.c
> index d131383..45f8286 100644
> --- a/Src/utils.c
> +++ b/Src/utils.c
> @@ -5179,6 +5179,17 @@ mb_metastrlenend(char *ptr, int width, char *eptr)
>         ret = mbrtowc(&wc, &inchar, 1, &mb_shiftstate);
>
>         if (ret == MB_INCOMPLETE) {
> +           /*
> +            * "num_in_char" is only used for incomplete characters.  The
> +            * assumption is that we will output this octet as a single
> +            * character (of single width) if we don't get a complete
> +            * character; if we do get a complete character, num_in_char
> +            * becomes irrelevant and is set to zero.
> +            *
> +            * This is in contrast to "num" which counts the characters
> +            * or widths in complete characters.  The two are summed,
> +            * so we don't count characters twice.
> +            */
>             num_in_char++;
>         } else {
>             if (ret == MB_INVALID) {

Follow-Ups:
- Re: num_in_chars incremented after each mbrtowc()
  - From: Peter Stephenson

References:
- num_in_chars incremented after each mbrtowc()
  - From: Sebastian Gniazdowski
- Re: num_in_chars incremented after each mbrtowc()
  - From: Peter Stephenson

Messages sorted by: Reverse Date, Date, Thread, Author