Zsh Mailing List Archive
Messages sorted by:
Reverse Date,
Date,
Thread,
Author
Re: num_in_chars incremented after each mbrtowc()
- X-seq: zsh-workers 37327
- From: Sebastian Gniazdowski <sgniazdowski@xxxxxxxxx>
- To: Peter Stephenson <p.w.stephenson@xxxxxxxxxxxx>
- Subject: Re: num_in_chars incremented after each mbrtowc()
- Date: Sun, 6 Dec 2015 17:40:33 +0100
- Cc: Zsh hackers list <zsh-workers@xxxxxxx>
- Dkim-signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type:content-transfer-encoding; bh=NZZsJTViSK/1JQXIWm1SI9V/ROSPxT6FNLPN/MIGXtg=; b=KoUd+LHEzp0FS1kfhceVQJpRCQS6QVUoePsRYMBGcK8vlgWnazzCz83XWF2ThKMCRO RKkyBLeenZVAjEEoPK6VdFKx6CIwOIf4qNMxf6QGoMOCCr4xC5ydBTPDROpbWuMpiGXJ VrUsiHC7JBjfBUgDmlDFlXSBjp3FmB/Kn63YmiDJVyw5DZ40dYUgTmK2eWNP/flFSKbt a5loH5dlgPRg+yZlWrWzBiH+YPVb+mfSa6Xr6ZhPA3dnTraP0OByw81C1G8Bv9n69Ixm wc3SQKJDMLazAIetSvwuMYObdbPj6Utc5oCwlfHDJ20AIEjS1DCfVh0pF7Ngtx6lb8qz MC1w==
- In-reply-to: <20151206154956.104b10c6@ntlworld.com>
- List-help: <mailto:zsh-workers-help@zsh.org>
- List-id: Zsh Workers List <zsh-workers.zsh.org>
- List-post: <mailto:zsh-workers@zsh.org>
- Mailing-list: contact zsh-workers-help@xxxxxxx; run by ezmlm
- References: <CAKc7PVCmApbPrfcArqxJcDX0nHN9NmuE7zqcrSZb51bSfGyuuQ@mail.gmail.com> <20151206154956.104b10c6@ntlworld.com>
I start with the second patch (which you haven't look at?) as it is
more justified. To put it simply – ptr is incremented, num_in_char
isn't. That's wrong.
To put it less simply, I've written a function that returns raw
pointer into string and sets prevcharlen and nextcharlen. These two
values come from num_in_char. If I don't add one more incrementation
of num_in_char to the "if (*ptr == Meta) {" block, characters are
displayed as errors (i.e. the question marks symbols).
The first patch is less justified. Only bytes of incomplete characters
are counted in num_in_chars and that is what the code wants. But if
one defines the variable as "holds current byte width of character" it
is clear that *every* call to mbrtowc() should increment that
variable. Currently last byte will not be counted because there will
be no MB_INCOMPLETE returned.
To put it simply, call to mbrtowc() must be followed by num_in_char++,
it cannot be anything different.
Best regards,
Sebastian Gniazdowski
On 6 December 2015 at 16:49, Peter Stephenson
<p.w.stephenson@xxxxxxxxxxxx> wrote:
> On Sun, 6 Dec 2015 10:37:21 +0100
> Sebastian Gniazdowski <sgniazdowski@xxxxxxxxx> wrote:
>> Hello,
>> while working my hands off on implementing display width handling in
>> params.c rather than subst.c I encountered a bug in mb_metastrlenend.
>> It will reveal itself only on improper unicode strings.
>
> I don't understand your patch: the change is to increment num_in_char in
> exactly the cases where it is deliberately set to 0 later to reflect the
> fact we've now got a complete character and the effect is included in
> num instead.
>
> Somebody complained about this function a couple of months ago and I
> explained, then, too; it suggests it needs some more comments, so I've
> added some.
>
> It may be the real difficulty is with the API, in which case you'll need
> to say (in words, not videos, please) what you're expecting. As long as
> this is consistently dealt with in callers of the function it might be
> possible to change --- I guess you're only worried about the case for
> returning a width, which is uncommon in the code and indeed doesn't
> really have a well defined result for incomplete/invalid characters.
> Maybe you have a particular strategy in mind.
>
> Consequently I haven't looked at your other patch.
>
> pws
>
> diff --git a/Src/utils.c b/Src/utils.c
> index d131383..45f8286 100644
> --- a/Src/utils.c
> +++ b/Src/utils.c
> @@ -5179,6 +5179,17 @@ mb_metastrlenend(char *ptr, int width, char *eptr)
> ret = mbrtowc(&wc, &inchar, 1, &mb_shiftstate);
>
> if (ret == MB_INCOMPLETE) {
> + /*
> + * "num_in_char" is only used for incomplete characters. The
> + * assumption is that we will output this octet as a single
> + * character (of single width) if we don't get a complete
> + * character; if we do get a complete character, num_in_char
> + * becomes irrelevant and is set to zero.
> + *
> + * This is in contrast to "num" which counts the characters
> + * or widths in complete characters. The two are summed,
> + * so we don't count characters twice.
> + */
> num_in_char++;
> } else {
> if (ret == MB_INVALID) {
Messages sorted by:
Reverse Date,
Date,
Thread,
Author