Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: zargs with -P intermittently failing in zsh 5.9 and macOS



This post is rather long, and divided into (1)-(5).

(1)

> 2022/05/30 5:07, Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx> wrote:
> 
> Irrelevant to the patch, but on the original issue, I keep looking at
> this and wondering how we got an exit code of 19 rather than 19+128
> which would have immediately pointed to SIGCONT.  Is this a difference
> in the way MacOS defines WEXITSTATUS(), or what?


On both macOS and Linux, WEXITSTATUS(status) is equivalent to
(((status) & 0xff00) >> 8). This is so either WIFEXITED() is true or not.
But on both OSes, wait(2) man page says that WEXITSTATUS() can be used only
when WIFEXITED() is true. When WIFCONTINUED() is true we should have used
"0200 | SIGCONT".

(2) Why zrags has the problem only on macOS

The reason seems to be in the way SIGCONT is handled (in the kernel?).
On macOS, if the parent zsh sends SIGCONT to a background child and
calls wait3(), then wait3() returns with WIFCONTINUED even if the child
has not been stopped. This is _quite_ strange, and does not happen on
Linux. You may try the following C program:

#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/resource.h>

int main()
{
	pid_t pid = fork();
	if (pid < 0)
		return -1;
	if (pid > 0) {
		pid_t wpid;
		int state;
		struct rusage ru;
		kill(pid, SIGCONT);
		wpid = wait3(&state, WUNTRACED|WCONTINUED, &ru);
		printf("wait3: %d 0x%04x\n", wpid, state);
		kill(pid, SIGKILL);
	} else {
		sleep(1);
	}
	return 0;
}

On Linux, it outputs, after waiting for 1 second:
wait3: 76819 0x0000
but on macOS it immediately outputs:
wait3: 84313 0x137f
For status=0x137f, WIFCONTINUED(status)=true,
and WEXITSTATUS(x)=0x13=19

(3) zsh on Linux also has a problem

On Linux (and macOS), zsh (without the patch) has the following
problem:

% sleep 20 &
[1] 78570
% kill -STOP 78570
[1]  + 78570 suspended (signal)  sleep 20
% kill -CONT 78570
%
(after 20 seconds)
[1]  + 78570 done       sleep 20
% wait 78570
% echo $?
147        # should be 0

147 = 0200 + SIGSTOP(=19)    (on macOS we get 145)

(4) Another minor problem

In the current zsh (without the patch), when WIFSTOPPED()=true,
addbgstatus() records "0200 | WEXITSTATUS(status)" in bgstatus_list.

But wait(2) manpage says WEXITSTATUS() can be used only if WIFEXITED()
is true, and when WIFSTOPPED() is true WSTOPSIG() should be used.
On both Linux and macOS (and probably on other OSes) WSTOPSIG() is
equivalent to WEXITSTATUS(), so there was no observable problem.

grep WEXITSTATUS in the zsh source tree shows that there are two
more uses of WEXITSTATUS() (in jobs.c) which are better replaced
by WSTOPSIG(), as in the patch shown in (5) below.

And:

>> +    /* See if an entry already exists for the pid */
> 
> Would it be worthwhile to put that entire thing in #ifdef DEBUG ?


Slow down is negligible, as you write, but I think chances are _very_
low that addbgusage() is called more than once. So I enclosed it
by #ifdef DEBUG as you suggested.

Also added a test for the problem (3) above.


(5) Revised patch with test

diff --git a/Src/jobs.c b/Src/jobs.c
index a91ef787f..80893cebf 100644
--- a/Src/jobs.c
+++ b/Src/jobs.c
@@ -414,7 +414,7 @@ storepipestats(Job jn, int inforeground, int fixlastval)
 	jpipestats[i] = (WIFSIGNALED(p->status) ?
 			 0200 | WTERMSIG(p->status) :
 			 (WIFSTOPPED(p->status) ?
-			  0200 | WEXITSTATUS(p->status) :
+			  0200 | WSTOPSIG(p->status) :
 			  WEXITSTATUS(p->status)));
 	if (jpipestats[i])
 	    pipefail = jpipestats[i];
@@ -471,7 +471,7 @@ update_job(Job jn)
 	    val = (WIFSIGNALED(pn->status) ?
 		   0200 | WTERMSIG(pn->status) :
 		   (WIFSTOPPED(pn->status) ?
-		    0200 | WEXITSTATUS(pn->status) :
+		    0200 | WSTOPSIG(pn->status) :
 		    WEXITSTATUS(pn->status)));
 	    signalled = WIFSIGNALED(pn->status);
 	}
@@ -2221,6 +2221,7 @@ addbgstatus(pid_t pid, int status)
 {
     static long child_max;
     Bgstatus bgstatus_entry;
+    LinkNode node;
 
     if (!child_max) {
 #ifdef _SC_CHILD_MAX
@@ -2244,6 +2245,21 @@ addbgstatus(pid_t pid, int status)
 	if (!bgstatus_list)
 	    return;
     }
+#ifdef DEBUG
+    /* See if an entry already exists for the pid */
+    for (node = firstnode(bgstatus_list); node; incnode(node)) {
+	bgstatus_entry = (Bgstatus)getdata(node);
+	if (bgstatus_entry->pid == pid) {
+	    /* In theory this should not happen because addbgstatus() is
+	     * called only once when the process exits or gets killed. */
+	    dputs("addbgstatus called again: status %d -> %d",
+		    bgstatus_entry->status, status);
+	    bgstatus_entry->status = status;
+	    return;
+	}
+    }
+#endif
+    /* Add an entry for the pid */
     if (bgstatus_count == child_max) {
 	/* Overflow.  List is in order, remove first */
 	rembgstatus(firstnode(bgstatus_list));
diff --git a/Src/signals.c b/Src/signals.c
index 5c787e2a8..a61368554 100644
--- a/Src/signals.c
+++ b/Src/signals.c
@@ -576,12 +576,10 @@ wait_for_processes(void)
 	 */
 	if (jn && !(jn->stat & (STAT_CURSH|STAT_BUILTIN)) &&
 	    jn - jobtab != thisjob) {
-	    int val = (WIFSIGNALED(status) ?
-		   0200 | WTERMSIG(status) :
-		   (WIFSTOPPED(status) ?
-		    0200 | WEXITSTATUS(status) :
-		    WEXITSTATUS(status)));
-	    addbgstatus(pid, val);
+	    if (WIFEXITED(status))
+		addbgstatus(pid, WEXITSTATUS(status));
+	    else if (WIFSIGNALED(status))
+		addbgstatus(pid, 0200 | WTERMSIG(status));
 	}
 
 	unqueue_signals();
diff --git a/Test/A05execution.ztst b/Test/A05execution.ztst
index d95ee363c..b257ddf2c 100644
--- a/Test/A05execution.ztst
+++ b/Test/A05execution.ztst
@@ -396,6 +396,13 @@ F:anonymous function, and a descriptor leak when backgrounding a pipeline
 # TBD: the 0 above is believed to be bogus and should also be turned
 # into 127 when the ccorresponding bug is fixed in the main shell.
 
+  sleep 1 & pid=$!
+  kill -STOP $pid
+  sleep 1
+  kill -CONT $pid
+  wait $pid
+0:wait for stopped and continued process
+
 # Without the outer subshell, the test harness reports the pre-46060 behaviour
 # as "skipped" rather than "failed".
  (( exit 130 ) | { sleep 1; echo hello })







Messages sorted by: Reverse Date, Date, Thread, Author