sge_shepherd - Grid Engine single job-controlling agent
sge_shepherd provides the parent process functionality for a single Grid
Engine job. The parent functionality is necessary on UNIX systems to retrieve
resource usage information after a job has finished. In addition,
sge_shepherd forwards signals to the job, such as those for suspension,
enabling, termination, and the Grid Engine checkpointing signal.
sge_shepherd receives information about the job to be started from
sge_execd. During the execution of the job it starts up to 5 child
processes. First, a prolog script is run if this feature is enabled by the
prolog parameter in the cluster configuration. Next, a parallel
environment startup procedure is run if the job is a parallel job.
After that, the job itself is run, followed by a parallel
environment shutdown procedure for parallel jobs, and finally an epilog script
if requested by the epilog
parameter in the cluster configuration. The
prolog and epilog scripts, as well as the parallel environment startup and
shutdown procedures, are to be provided by the Grid Engine administrator and
are intended for site-specific actions to be taken before and after execution
of the actual user job.
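
The order of the child processes described above can be sketched as follows. This is only an illustrative model, not code taken from Grid Engine: the function name and its boolean parameters are hypothetical, and the step names are borrowed from the list of methods given in this page.

```python
from typing import List


def shepherd_child_sequence(prolog_configured: bool,
                            epilog_configured: bool,
                            is_parallel_job: bool) -> List[str]:
    """Return the child processes started for one job, in order.

    Hypothetical helper: it only models the ordering described in the
    text (at most 5 children), it does not start any process.
    """
    steps: List[str] = []
    if prolog_configured:
        steps.append("prolog")          # site-specific setup
    if is_parallel_job:
        steps.append("parallel start")  # parallel environment startup
    steps.append("job")                 # the user job always runs
    if is_parallel_job:
        steps.append("parallel stop")   # parallel environment shutdown
    if epilog_configured:
        steps.append("epilog")          # site-specific cleanup
    return steps


if __name__ == "__main__":
    print(shepherd_child_sequence(True, True, True))
```

A serial job on a cluster without prolog or epilog scripts reduces to the single "job" step, while a parallel job with both scripts configured yields all five steps.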
After the job has finished and the epilog script has been processed,
sge_shepherd retrieves resource usage statistics about the job, places
them in a job-specific subdirectory of the spool directory for reporting,
and exits.
sge_shepherd also places an exit status file in the spool directory. This
exit status can be viewed with qacct -j JobId. It is not the exit status
of sge_shepherd itself but that of one of the methods executed by
sge_shepherd. The exit status can have several meanings, depending on
the method in which an error occurred (if any). The possible methods are:
prolog, parallel start, job, parallel stop, epilog, suspend, restart,
terminate, clean, migrate, and checkpoint.
The following exit values are returned:
0 - All methods: Operation was executed successfully.
99 - Job script, prolog and epilog: When
FORBID_RESCHEDULE is not set in the configuration, the job gets
re-queued. Otherwise see "Other".
100 - Job script, prolog and epilog: When FORBID_APPERROR
is not set in the configuration, the job gets re-queued. Otherwise see
"Other".
Other - Job script: This is the exit status of the job itself. No
action is taken upon this exit status because the meaning of this exit
status is not known.
  Prolog, epilog and parallel start: The queue is set to error state and the
  job is re-queued.
  Parallel stop: The queue is set to error state, but the job is not
  re-queued. It is assumed that the job itself ran successfully and only the
  clean-up script failed.
  Suspend, restart, terminate, clean, and migrate: Always successful.
  Checkpoint: Success, except for kernel checkpointing, where it means that
  the checkpoint was not successful and did not happen (but migration will
  happen).
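
The rules above can be read as a small decision table. The sketch below is one possible reading, not the logic of Grid Engine itself: the function name, its parameters and the returned description strings are hypothetical, and the numeric values 0, 99 and 100 are assumed to be the conventional Grid Engine exit values for the first three entries.

```python
SUCCESS_EXIT = 0       # assumed value for "executed successfully"
RESCHEDULE_EXIT = 99   # assumed value governed by FORBID_RESCHEDULE
APPERROR_EXIT = 100    # assumed value governed by FORBID_APPERROR

ALWAYS_OK = ("suspend", "restart", "terminate", "clean", "migrate")


def interpret_exit(method: str, status: int,
                   forbid_reschedule: bool = False,
                   forbid_apperror: bool = False) -> str:
    """Describe the action taken for a method's exit status (illustrative)."""
    if status == SUCCESS_EXIT:
        return "success"
    if method in ("job", "prolog", "epilog"):
        if status == RESCHEDULE_EXIT and not forbid_reschedule:
            return "job re-queued"
        if status == APPERROR_EXIT and not forbid_apperror:
            return "job re-queued"
    # Everything below corresponds to the "Other" entry.
    if method == "job":
        return "no action"
    if method in ("prolog", "epilog", "parallel start"):
        return "queue set to error state, job re-queued"
    if method == "parallel stop":
        return "queue set to error state, job not re-queued"
    if method in ALWAYS_OK:
        return "treated as success"
    if method == "checkpoint":
        return "treated as success (kernel checkpointing: checkpoint did not happen)"
    raise ValueError(f"unknown method: {method!r}")


if __name__ == "__main__":
    print(interpret_exit("prolog", 99))
    print(interpret_exit("prolog", 99, forbid_reschedule=True))
```

Under these assumptions, setting FORBID_RESCHEDULE makes the value 99 fall through to the "Other" handling for the method concerned.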
The return codes of sge_shepherd itself are interpreted by sge_execd.
sge_shepherd should not be invoked manually, but only by sge_execd.
- Specifies the location of the Grid Engine standard
SGE_CELL - If set, specifies the default Grid Engine cell. To address
a Grid Engine cell, sge_execd uses (in order of precedence):
  1. The name of the cell specified in the
     environment variable SGE_CELL, if it is set.
  2. The name of the default cell, i.e. default.
SGE_ENABLE_COREDUMP - If set, enables core dumps on Linux when the
admin_user is not root. Linux normally disables core dumps when the daemon
has changed uid or gid. Setting SGE_ENABLE_COREDUMP in sge_execd's
environment overrides that behaviour, so that core dumps are available for
debugging if they are otherwise allowed.
This is typically not a big hazard with SGE, since most information is
exposed in the spool area anyhow. Dumps will appear in the qmaster spool
directory, which need not be world-readable.
On Solaris, may be used to enable such dumps.
- If Linux cgroups handling is enabled, this variable names a
directory under the cgroup mount point in which to create job-specific
directories. The default is sge.SGE_CELL.
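
The cell lookup order given for SGE_CELL above amounts to a simple fallback. The following minimal sketch shows that precedence. The function name is hypothetical, and treating an empty SGE_CELL as unset is an assumption, not something this page states.

```python
import os
from typing import Mapping, Optional

DEFAULT_CELL = "default"  # name of the default cell, as stated in the text


def resolve_cell(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the cell name using the documented order of precedence.

    Hypothetical helper: SGE_CELL wins if it is set, otherwise the
    default cell name is used. An empty value is treated as unset
    (an assumption made for this sketch).
    """
    if env is None:
        env = os.environ
    cell = env.get("SGE_CELL")
    return cell if cell else DEFAULT_CELL


if __name__ == "__main__":
    print(resolve_cell({"SGE_CELL": "prod"}))
    print(resolve_cell({}))
```

Passing the environment as a mapping keeps the lookup testable without modifying the environment of the running process.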
The password file contains a list of user names and their corresponding
encrypted passwords. If available, the password file will be used by
sge_shepherd. To change the contents of this file, use the
sgepasswd command. Editing the file by hand is not advised.
<execd_spool>/job_dir/<job_id> job specific directory
Password information used on Microsoft Windows hosts. See
See for a full statement of rights and permissions.