Re: more splitting

Adding to my previous reply which I inadvertently sent only to Ray.

On Fri, Apr 17, 2026 at 9:49 AM Ray Andrews <rayandrews@xxxxxxxxxxx> wrote:

I understand, but it's too bad that chars that were formerly part of the data are now thrown away:

% print -rn $var | od -vAx -tx1 -tc
000000 61 20 62 20 63 0a 64 20 65 20 66 20 67 20 68 20
a b c \n d e f g h
^ This is data!

2 /aWorking/Zsh/Source/Wk 7 % print -rn $var2 | od -vAx -tx1 -tc
000000 61 20 62 20 63 20 64 20 65 20 66 20 67 20 68 20
a b c d e f g h
^ Now it's a space -- a separator.

Yes, because you told the shell to turn it into one! By using the (f) expansion flag, you said "hey, shell, I know this is an array of individual values, but I want you to ignore that and jam them all together into a single flat string, and then split that string on newlines". If you split a piece of text on newlines you shouldn't be surprised that the newlines disappear from the resulting data.

I know you're not much of a scripting-language person, but maybe this will be clearer if we translate exactly the same operation into another language with more traditional notation. Like JavaScript:

Welcome to Node.js v25.9.0.
Type ".help" for more information.
> var1 = ["a b", "c\nd e f g h", "ij"]
[ 'a b', 'c\nd e f g h', 'ij' ]
> var2 = var1.join(" ").split("\n")
[ 'a b c', 'd e f g h ij' ]

I could do Perl, PowerShell, Python, Raku, Ruby... they all look pretty similar. The point is that there are no newlines in any of the strings in the result array, because the newline is the delimiter you're splitting on.

You do have the option of keeping the delimiters around, but they show up as separate elements in between the others in the result array. That's mostly useful when you have a selection of possible delimiters and need to know which one you got at each position; less so when you know that there had to have been a newline there because it was the only thing you were splitting on.
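
To make the "keeping the delimiters around" option concrete, here is a minimal sketch in the same Node.js notation. The sample string and variable names are made up for illustration; the only real mechanism used is that String.prototype.split includes the text matched by a capturing group in its result.

```javascript
// Illustrative input with two possible delimiters: newline and comma.
const sample = "a b c\nd e f,g h";

// Usual behaviour: the delimiters are consumed and disappear.
const dropped = sample.split(/[\n,]/);

// With a capturing group, split() also emits what it matched, so each
// delimiter shows up as its own element between the pieces.
const kept = sample.split(/([\n,])/);

console.log(dropped); // [ 'a b c', 'd e f', 'g h' ]
console.log(kept);    // [ 'a b c', '\n', 'd e f', ',', 'g h' ]
```

With a single possible delimiter the extra elements carry no information, which is why the plain form is the common one.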

Zsh's slightly-older sister Bash has a built-in function called readarray that splits input on a delimiter and puts the resulting strings into an array; it is unusual in that it does keep the delimiter around by default at the end of each element. But I very rarely see code using it that way; most of the time it's called with the -t option to disable that behavior and leave the delimiters out.
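
For comparison, here is a rough JavaScript imitation of those two behaviors. readArrayLike is a hypothetical helper of my own, not a real API, and it only models newline-delimited input; its keepDelimiter parameter stands in for the absence or presence of -t.

```javascript
// Hypothetical helper: imitates how readarray treats the newline delimiter.
// keepDelimiter = true  ~ plain readarray (newline stays on each element)
// keepDelimiter = false ~ readarray -t   (newline stripped)
function readArrayLike(input, keepDelimiter = true) {
  if (input === "") return []; // empty input gives an empty array
  // Split at the position just after each newline, so the newline
  // stays attached to the element it terminates.
  const lines = input.split(/(?<=\n)/);
  return keepDelimiter ? lines : lines.map((s) => s.replace(/\n$/, ""));
}

const textInput = "a b c\nd e f g h\n";
console.log(readArrayLike(textInput));        // [ 'a b c\n', 'd e f g h\n' ]
console.log(readArrayLike(textInput, false)); // [ 'a b c', 'd e f g h' ]
```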

On Fri, Apr 17, 2026 at 9:49 AM Ray Andrews <rayandrews@xxxxxxxxxxx> wrote:

On 2026-04-17 05:15, Philippe Altherr wrote:

As Bart pointed out, the two hex versions are not interchangeable.

Thanks, I'll digest this at length, for now:

It all depends on what you want to do. For instance, if you want to see exactly what's in a variable,

Yes. Nothing less, nothing more. All the time. Every time. Exactly. Raw.

Here is a first example of failure:

% var=""; hex-var var; hex-doors $var

--- var ---------------------
''

-----------------------------

hex-var correctly shows that var contains the empty string but hex-doors sees nothing at the front door. The "problem" is that Zsh entirely drops $var if it expands to an empty string. Thus, in this case, hex-doors is called with no arguments instead of with a single empty string.

Right, that's understandable. That zsh processes the input is what you want 99.9% of the time.

You can fix the "problem" by quoting $var, then hex-doors also sees the empty string:

% var=""; hex-var var; hex-doors "$var"

Since it's necessary, too bad it can't be automatic.

Here is another example of where $var fails to pass the right value(s) to hex-doors:

It really is impossible to know from the pipe version where the input
array was split, the information no longer exists.

Indeed, in a similar way, I can't tell whether you typed your reply with your hands, or with your feet, whether you used some voice-to-text tool, or whether it was your cat that randomly jumped on your keyboard. The only thing I see is the resulting mail.

Still, there is an ambiguity:

% var=("a b" c$'\n''d e f'' ''g h' ij)
% var2=( ${(f)var} )

It seems that the splitting char is always removed, thus one can't look
at $var2 and determine that it was created via split on newlines,

Obviously not. What you have to understand is that var2 doesn't store the literal expression ${(f)var}. It stores the result of its expansion.

I understand, but it's too bad that chars that were formerly part of the data are now thrown away:

% print -rn $var | od -vAx -tx1 -tc
000000 61 20 62 20 63 0a 64 20 65 20 66 20 67 20 68 20
a b c \n d e f g h
^ This is data!

2 /aWorking/Zsh/Source/Wk 7 % print -rn $var2 | od -vAx -tx1 -tc
000000 61 20 62 20 63 20 64 20 65 20 66 20 67 20 68 20
a b c d e f g h
^ Now it's a space -- a separator.

Indeed, there are plenty of ways to generate the same result. Here are 6 different ways of initializing var2 to the same values, namely "aa", "bb", and "cc"

Nice list. I'll save that.

This seems to me an
imperfection, tho obviously one would have to go back to the 70s to have
done anything about it. In the interests of perfect information, I
wonder if there's any way that even the 'od' output could inform us that
the output of $var2 is an array split on newlines? I suspect that 'od'
has no way of knowing.

How should that work? How could var2 remember in which of the 6 ways listed above it was initialized? Where should that information be stored? And what for?

There's any number of ways it could be done, of varying degrees of sophistication. And the motive would be that data is preserved irrespective of split. As above, newlines would not be transformed into spaces. I entered the newline as data and it should stay data. Now! I quite understand that keystrokes can be either data or control instructions:

% var=("a b" c$'\n''d e f'' ''g h' ij)

... I know that: " " $ ' \ n " " '
... are not data they are controls. The unquoted spaces are controls too. Hard spaces must be quoted. No issue. But what was formerly data can become a separator, and that bugs me. Anyway this is carved in stone, the thing is to just understand and cope.

I suspect that some of your confusion comes from C.

Usually.

In C, when you call a function with a variable, as in foo(var), the function foo receives exactly one argument which contains the value of var. In Zsh, the equivalent foo $var works very differently. In Zsh, the expression $var is first expanded before foo is called, and foo is then called with the values resulting from that expansion. As we have seen above, depending on the nature and content of the variable var, the expansion of $var may lead to zero, one, or more values, which will translate into as many arguments for foo. When foo is finally called, foo only sees the result of the expansion of $var. It can't know whether its arguments were provided as explicit values, are the result of the expansion of a variable, come from a command substitution, or were produced in some other fashion.
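
The same idea can be sketched in JavaScript as a toy model. This is an illustration under assumptions, not how Zsh is implemented: expandUnquoted, expandQuoted and countArgs are made-up names, a "variable" is modeled as either a string or an array of strings, and only default Zsh options are considered.

```javascript
// Toy model of unquoted $var with default zsh options: empty words
// vanish, each array element becomes its own argument, and scalars
// are NOT split on whitespace.
function expandUnquoted(value) {
  const words = Array.isArray(value) ? value : [value];
  return words.filter((w) => w !== "");
}

// Toy model of quoted "$var": always exactly one argument
// (array elements are joined with a space).
function expandQuoted(value) {
  return [Array.isArray(value) ? value.join(" ") : value];
}

// The callee only ever sees the finished argument list, never the
// expression that produced it.
function countArgs(...args) {
  return args.length;
}

console.log(countArgs(...expandUnquoted("")));            // 0
console.log(countArgs(...expandQuoted("")));              // 1
console.log(countArgs(...expandUnquoted("a b")));         // 1
console.log(countArgs(...expandUnquoted(["a b", "c"])));  // 2
```

Inside countArgs there is no trace of whether the arguments came from a literal, a variable, or anything else, which is the point being made about foo.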

I understand. But there are situations where one might want to know *how* something was put together. When you're rebuilding an engine, *how* is very important. That loops back to my oldest complaint, namely that a function should have the ability to 'know its own tail' -- to know the actual keystrokes via which it was invoked.

Meantime, in shell culture pigs can't fly and that's just the way it is. I'm 99% in understanding, now to re-study the above and come up with some function that will tell me *everything* even on Tuesdays. Empty strings are not 'nothing' they are empty strings.

I suppose with more experience my level of paranoia simply isn't valuable, it's just that when one is teaching oneself zsh, and one's data keeps re-splitting from lines into individual chars then into words -- and then you think you've got it sorted out and then it all goes sproinggggg when a space in a filename shows up or a color code, and one has no idea why, one wants some diagnostic tool and a newline that got transformed into a space can ruin your whole day if you don't know if or when or why it happened.

Dunno, but I'd have thought that our efforts at 'hex' above would be standard equipment.

Thanks! Your pedagogy is equal to Bart's or Mark's.

Oh! One more thing: we can forget about the back door. Since it can *never* give full information it's broken ab ovo. Structure does not survive piping. A piped array is a string. I get it.
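
A small JavaScript sketch of why the piped form cannot carry structure. The names are illustrative, and joining with a single space is an assumption standing in for what print does with its arguments.

```javascript
// Two different arrays...
const firstArray = ["a b", "c"];
const secondArray = ["a", "b c"];

// ...flattened the way a pipe sees them: one string, with the elements
// joined by spaces.
const piped = (words) => words.join(" ");

console.log(piped(firstArray));                       // a b c
console.log(piped(firstArray) === piped(secondArray)); // true
```

Once both arrays have become the same string, nothing downstream can tell which one it started as.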