

Since moving to slurm 16.05.9 (though there may have been some of this in 16.05.8 as well), it seems we are getting a lot of nodes dropping off (maybe 2-3 per day) owing to "batch job completion failure". We have a "shared" partition on cori where users can request just a single core on our haswell nodes. This means that up to 32 jobs can run independently on these nodes.

In looking at the accounting data, it seems that in each of these cases more than 20 jobs are trying to start at once (that is an estimate, not a hard number of job starts). Looking in the slurmd log, I see that dup2() is failing and slurmstepd is failing to send a message, followed by the loss of the starting job (and the node being marked down):

error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
error: dup2 over STDIN_FILENO: Bad file descriptor
