Zsh Mailing List Archive
Re: ZWJ paste from clipboard problem (unicode)
- X-seq: zsh-users 30241
- From: Mikael Magnusson <mikachu@xxxxxxxxx>
- To: Oliver Kiddle <opk@xxxxxxx>
- Cc: Daniil Rozanov <personal@xxxxxxxxxxxx>, Zsh users <zsh-users@xxxxxxx>
- Subject: Re: ZWJ paste from clipboard problem (unicode)
- Date: Tue, 1 Apr 2025 14:46:20 +0200
- Archived-at: <https://zsh.org/users/30241>
- In-reply-to: <33942-1743506608.001547@3ros.h6PJ.3Nhw>
- List-id: <zsh-users.zsh.org>
- References: <D8UOVIME9HDS.1I0WLZVJ6FPQ7@rozanov.info> <CAH+w=7atQJ9Hqp=MXpin3CMQ=N_=fvhYs1aWfFh2fpHNLbEvzg@mail.gmail.com> <D8URMP63LOAZ.35Q236JMFWTLZ@rozanov.info> <CAHYJk3Tg7LL2AwjX0CD=CgM2fOVZuiQd85a3TOAyhPB1h9vABw@mail.gmail.com> <D8V5D34J1TWE.3QJ3E59BL2J8X@rozanov.info> <33942-1743506608.001547@3ros.h6PJ.3Nhw>
On Tue, Apr 1, 2025 at 1:23 PM Oliver Kiddle <opk@xxxxxxx> wrote:
>
> "Daniil Rozanov" wrote:
> > As far as I know it is possible on a number of terminals. This
> > capability is called "mode 2027" and is briefly described here:
> >
> > https://mitchellh.com/writing/grapheme-clusters-in-terminals
> >
> > So zsh can at least test whether the terminal supports grapheme
> > clustering and work around this.
>
> The problem for zsh remains that wcwidth is the libc interface that
> provides a way to find a correspondence between wide characters (i.e.
> unicode code points) and graphemes. There are a lot of limitations in
> what libc provides. Most terminals also use wcwidth(), so they share the
> same limitations, which means that behaviour is at least consistent.
>
> So zsh would need to either depend on some additional external library
> or we need to fetch tables of unicode code points and generate our own
> tables to classify them. libICU is the long-standing C library for this
> but it's a huge monster dating from a time when people expected 16-bits
> to be enough. I have used libutf8proc before and my impression was
> positive but I didn't think it covered this particular functionality.
> However, foot is using it (utf8proc) for grapheme-clustering and the
> licence (MIT "expat" variant) is compatible so perhaps it would be an
> option. Or perhaps someone else is more knowledgeable on this than me
> or has strong objections to adding a dependency given zsh currently has
> nearly none.
>
> It still needs someone to do the work of adding this. If someone is
> motivated to do so, the areas of code that need changing might be fairly
> easily identified by looking for where the COMBINING_CHARS option is
> checked.
>
> I'm aware of mode 2027 having come across it when looking at the more
> general possibilities involved with terminal queries that subscribers to
> -workers will have seen.
>
> Is changing COMBINING_CHARS to be enabled by default (at least for zsh
> emulation) something that might sensibly be changed for a future
> (major?) release?

The problem with combined emoji in particular is that the result
depends both on which sequences the terminal decides to combine and on
the font the user has chosen, and probably on many other factors, so
there is no way for a library to magically know which combined emoji
the terminal will render as one glyph and which it will render as
separate glyphs.
https://www.unicode.org/emoji/charts/emoji-zwj-sequences.html has an
insanely long list of some possible combinations, but there is still
no way for zsh to know what the terminal might decide to do with
whatever fonts or other images are installed.
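
To make the mismatch concrete, here is a minimal sketch of a per-code-point width count. It is only an approximation of what a wcwidth()-style lookup does, built on Python's unicodedata tables; the function name naive_width is made up for this illustration and nothing here is zsh code.

```python
import unicodedata

def naive_width(text):
    """Sum an approximate cell width for each code point independently."""
    total = 0
    for ch in text:
        # Combining marks and format characters (ZWJ is category Cf)
        # are treated as occupying no cells.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # East Asian Wide / Fullwidth code points count as two cells.
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total

# MAN + ZWJ + WOMAN + ZWJ + GIRL, one of the listed ZWJ sequences.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
print(naive_width(family))
```

The per-code-point count comes out as 6 cells, while a terminal that combines the sequence may draw a single two-cell glyph. Which of those two happens is exactly what cannot be known in advance.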

As I understand it, mode 2027 is only used for enabling or disabling
grapheme clustering as a whole in the terminal and doesn't really do
anything to solve the problem (other than proposing to query the
cursor position _after_ printing the text, when it will in many cases
be too late).
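
For reference, a rough sketch of what testing for the mode could look like, assuming the terminal answers the standard DECRQM request (CSI ? 2027 $ p) with a DECRPM report (CSI ? 2027 ; Ps $ y), as the linked article describes. The helper names are hypothetical, and the reply is passed in as a string rather than read from a real tty.

```python
import re

MODE = 2027
# DECRQM request for a private mode: CSI ? <mode> $ p
QUERY = f"\x1b[?{MODE}$p"

# DECRPM report: CSI ? <mode> ; <state> $ y
_REPLY = re.compile(r"\x1b\[\?(\d+);(\d)\$y")

def parse_decrpm(reply, mode=MODE):
    """Return the reported state (0-4), or None if the reply does not match."""
    m = _REPLY.fullmatch(reply)
    if m is None or int(m.group(1)) != mode:
        return None
    return int(m.group(2))

def clustering_enabled(reply):
    # State 1 means set and 3 means permanently set; 0 means the mode
    # is not recognized, 2 and 4 mean reset.
    return parse_decrpm(reply) in (1, 3)
```

Even a positive answer only says that clustering is switched on. It does not say which sequences this terminal and font will actually combine.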

If anything, I would suggest that we don't print the ZWJ code point
when COMBINING_CHARS is enabled, which will effectively force the
terminal to render the fallback representation, without any "<200d>"
between the glyphs. This way the spec is followed, users don't have to
look at the string "<200d>", and we don't get corrupted output due to
losing track of the cursor.
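
As an illustration only (this is Python, not a patch against the zsh sources, and strip_zwj is a made-up name), the idea amounts to dropping U+200D from the text before it is sent to the terminal:

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

def strip_zwj(text):
    """Remove every ZWJ so the terminal sees the component emoji separately."""
    return text.replace(ZWJ, "")

# MAN + ZWJ + WOMAN + ZWJ + GIRL
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
fallback = strip_zwj(family)
print(len(family), len(fallback))
```

The output string holds the three component emoji with nothing between them, so the terminal has no sequence left to combine and each one is drawn on its own.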
--
Mikael Magnusson