Zsh Mailing List Archive
Re: Idea for optimization (use case: iterate string with index parameter)
- X-seq: zsh-workers 42237
- From: Sebastian Gniazdowski <psprint@xxxxxxxxxxx>
- To: Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx>, "zsh-workers@xxxxxxx" <zsh-workers@xxxxxxx>
- Subject: Re: Idea for optimization (use case: iterate string with index parameter)
- Date: Sat, 6 Jan 2018 06:16:12 +0100
- In-reply-to: <CAH+w=7ZyKsNCqfO=EQPapmrSh+VPb-EFFhTXvBbt85fOR4DjAw@mail.gmail.com>
- List-help: <mailto:zsh-workers-help@zsh.org>
- List-id: Zsh Workers List <zsh-workers.zsh.org>
- List-post: <mailto:zsh-workers@zsh.org>
- List-unsubscribe: <mailto:zsh-workers-unsubscribe@zsh.org>
- Mailing-list: contact zsh-workers-help@xxxxxxx; run by ezmlm
- References: <etPan.5a4f7fdd.52e15119.14e5a@zdharma.org> <CAH+w=7ZyKsNCqfO=EQPapmrSh+VPb-EFFhTXvBbt85fOR4DjAw@mail.gmail.com>
On 5 Jan 2018 at 23:23:57, Bart Schaefer (schaefer@xxxxxxxxxxxxxxxx) wrote:
> On Fri, Jan 5, 2018 at 5:38 AM, Sebastian Gniazdowski
> wrote:
> > iterating string with index parameter is quite slow, because unicode characters are
> > skipped and counted using mbrtowc().
>
> I can't remember the last time I needed to do that kind of iteration.
Maybe it's indeed not that common. It's one of the basic things one can do with strings, but in practice, hmm. I would still add this optimization to the pile, though: the overall speedup is starting to show, even though it is largely composed of individually disappointing optimizations.
> typeset -a iter=(${(s//)string})
> for ((i=1; i <= $#iter; i++)); do something with $iter[i]; done
> string=${(j//)iter} # if needed
>
> That is more memory-intensive, of course, but it also assists with
> cases of unordered access into the array of characters.
It might help. I was blindly following the "for letter in $iter" path and missed the obvious $iter[i] way; without an index, "for letter ..." couldn't replace the existing code.
> > In general, the array would hold #N (5-10 or so) last string-index requests. If new request
> > would target the same string, but index greater by 1, getarg() would call mbrtowc() once
> > (via MB_METACHARLEN macro) reusing the previous in-string pointer.
>
> Why only when greater by 1? If greater, scan to and record the next
> needed position. Same number of mbrtowc() conversions, overall.
Yes, it should be generalized this way; I didn't want to complicate the example.
I recalled yesterday that for ASCII there's a short path that returns 1 and doesn't call mbrtowc() to compute the size of the character. A discussion on IRC led to the conclusion that the cache should probably hold only 1 element, because anything more would be overkill for simple $string[2]-style indexing. This way the code should be very simple. The part of params.c in question is:
https://github.com/zsh-users/zsh/blob/c2cc8b0fbefc9868fa83537f5b6d90fc1ec438dd/Src/params.c#L1478-L1489
I'm a little afraid that getarg() might be called in some more general situations, but heck, it shouldn't be called for a="$b", so the cache might well survive in many typical loops. And maybe a 2-element cache would not add much code or slow down simple indexing.
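
To make the 1-element cache concrete, here is a rough standalone sketch in C. It is not the params.c code: char_len(), struct idx_cache, char_offset() and the steps counter are all hypothetical names, and char_len() is a hand-rolled UTF-8 length function standing in for MB_METACHARLEN/mbrtowc() so the sketch does not depend on the locale. It assumes the string is neither modified nor reallocated between requests; a real cache would need some invalidation for that.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for MB_METACHARLEN: byte length of the UTF-8
 * character starting at s (s must not point at the terminating NUL). */
static size_t char_len(const char *s)
{
    unsigned char c = (unsigned char)s[0];
    size_t want, n;

    if (c < 0x80)
        return 1;                       /* ASCII short path, no decoding */
    want = (c & 0xE0) == 0xC0 ? 2
         : (c & 0xF0) == 0xE0 ? 3
         : (c & 0xF8) == 0xF0 ? 4 : 1;  /* invalid lead byte: one byte */
    /* Count only the continuation bytes that are really there. */
    for (n = 1; n < want && ((unsigned char)s[n] & 0xC0) == 0x80; n++)
        ;
    return n;
}

/* The 1-element cache: the last (string, index) request and its offset. */
struct idx_cache {
    const char *str;    /* string the cached position belongs to */
    size_t index;       /* 1-based character index */
    size_t offset;      /* byte offset of that character */
};

static struct idx_cache cache;
static unsigned long steps;     /* char_len() calls, for illustration only */

/* Byte offset of the idx-th (1-based) character of str, or the offset of
 * the terminating NUL if str has fewer than idx characters. */
static size_t char_offset(const char *str, size_t idx)
{
    size_t i = 1, off = 0;

    assert(idx >= 1);
    /* Same string and an index at or past the cached one: resume there. */
    if (cache.str == str && cache.index <= idx) {
        i = cache.index;
        off = cache.offset;
    }
    while (i < idx && str[off] != '\0') {
        off += char_len(str + off);
        steps++;
        i++;
    }
    cache.str = str;
    cache.index = i;
    cache.offset = off;
    return off;
}
```

Calling char_offset(s, i) for i = 1, 2, 3, ... on the same string then costs one char_len() call per request instead of i - 1, while a request for a smaller index simply rescans from the start, as it would without the cache.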
--
Sebastian Gniazdowski
psprint /at/ zdharma.org