Common Job Submission Scripts

checkjob.sh

The “checkjob.sh” script is responsible for checking the status of the job in the queue.

#
# This script attempts to find the state of a specified job. There
# are a number of parameters that could be passed, but this script
# uses:
#    job.id : the id of the job
#    remote.dir : the remote directory where the job was running
#    job.id.filename : the filename that includes the job id
#
# It will return an exit status:
#   0  : script will echo:
#        * The found state from the sacct command
#        * "COMPLETED", under the assmption that the job wasn't found and
#           > 5 minutes since the job submission has elapsed
#        * "" (empty String) : unable to find any information, so
#          implying carry on
#
#  sacct doesn't return any results if pass in a bad job id, such
#    as one that doesn't exist

# outputs the msg from $1 to stdout and stderr without a newline
function err() {
  local msg="$1"
  printf "%s" "${msg}"
  printf "%s" "${msg}" >&2
}

#outputs the msg from $1 to stdout without a newline
function msg() {
  local msg="$1"
  printf "%s" "${msg}"
}

# checks to see if the file is not present or is > 5 minutes old; return
#  1 if either is true, and therefore assume completed,
#  0 otherwise
#
function should_assume_completed() {
  local FL="dart.id"

  if [[ !(-f "${FL}") || -n $(find "${FL}" -mmin +5) ]]; then
    return 1
  else
    return 0
  fi
}

#
# try to get the current status using squeue
#
function set_output_via_squeue {
  local res
  local ec

  res=$(squeue --noheader --jobs=$jobid --format="%.30T" 2>&1)
  ec=$?

  if [[ $ec -eq 0 ]]; then
    OUTPUT=$(echo "$res" | head -1 | sed -e 's/^[[:space:]]*//')
    return 0;
  fi

  return $ec;
}

#
# b/c sacct has so many issues, also make sure even on a non-zero
#  return that it doesn't indicate an error. NOTE: this might
#  not be very locale adjusted; OTOH, sacct might not be either
#
function ensure_sacct_doesnt_say_error {
  local inp="$1"

  if [[ $inp =~ "error:" ]]; then
    SACCT_IS_BEHAVING=1
    return 1
  fi

  return 0
}

#
# if squeue no longer has information about the job, see if
#  sacct knows anything
#
function set_output_via_sacct {
  local res
  local ec

  res=$(sacct --noheader --jobs=$jobid --format="state%30" 2>&1)
  ec=$?

  if [[ $ec -ne 0 ]]; then
    SACCT_IS_BEHAVING=1
    return $ec
  fi

  ensure_sacct_doesnt_say_error "$res"
  if [[ $SACCT_IS_BEHAVING -ne 0 ]]; then
    return 1
  fi

  # we had a good return from sacct, and it didn't say error
  OUTPUT=$(echo "$res" | head -1 | sed -e 's/^[[:space:]]*//')

  return 0
}

######################################################################
#
######################################################################
OUTPUT=""
SACCT_IS_BEHAVING=0

if [[ -z "$jobid" ]]; then
  jobid=$(cat dart.id)
fi

#
# gather the information on the job
#   if squeue returns non-zero, then didn't know about the job
#
set_output_via_squeue
if [[ $? -ne 0 ]]; then
  set_output_via_sacct
fi

#
# if we know sacct has failed us, then we will want to try back
#  in the future
#
if [[ $SACCT_IS_BEHAVING -ne 0 ]]; then
  msg "SACCT_FAILED"
  EXITSTATUS=0
  exit 0
fi

#
# if we didn't get anything back, then the system does not have any
# information about the job. There may be two reasons for this:
#   1. This call has come before the job has had time to be added to the queue
#   2. The job is no longer in the history
#
if [[ -z "${OUTPUT}" ]]; then
  should_assume_completed
  ac=$?
  if [[ $ac -eq 1 ]]; then
    msg "COMPLETED"
  else
    msg "UNKNOWN"
  fi

  EXITSTATUS=0
else
  msg "${OUTPUT}"
fi

status.sh

The “status.sh” script is responsible for checking whether the job, once finished, has completed successfully or not. This is distinct from the role of “checkjob.sh,” which checks the status of the job while it is still in the job queue. The “status.sh” script can be thought of as more of a post-mortem script that inspects one or more output files for clues that everything completed correctly.

#!/bin/bash

if [[ -z "$jobid" ]]; then
  jobid=$(cat dart.id)
fi

checkFilename=slurm-$jobid.out
resultFilename="job.props"

function printResult(){
    if [ $# -eq 0 ] ; then
        return
    fi

    if [ -e $resultFilename ] ; then
        rm $resultFilename
    fi

    line="job.results.status=$1"
    echo "$line" > $resultFilename
    echo $1
}

successString="DAKOTA execution time in seconds:"
failedString="ERROR"

if [ -e $checkFilename ] ; then
    if (grep -q "$successString" $checkFilename) then
        printResult "Successful"
    else
        if (grep -q "$failedString" $checkFilename) then
            printResult "Failed"
        else
            printResult "Undefined"
        fi
    fi
else
    printResult "Undefined"
fi

cancel.sh

The “cancel.sh” script is responsible for stopping the job in the queue if the user stops Next-Gen Workflow.

if [[ -z "$jobid" ]]; then
  jobid=$(cat dart.id)
fi

scancel $jobid