lamssi_checkpoint_restart - overview of LAM's MPI checkpoint / restart SSI modules
The "kind" for checkpoint / restart SSI modules is "cr". Specifically, the
string "cr" (without the quotes) is the prefix to use with the -ssi switch on
the mpirun command line. For example:
- mpirun -ssi cr blcr C my_mpi_program
LAM/MPI can involuntarily checkpoint and restart parallel MPI jobs. Doing so
requires that LAM/MPI was compiled with thread support and that back-end
checkpointing systems are available at run-time. MPI jobs must run
with at least MPI_THREAD_SERIALIZED support. If a job elects to run with
checkpoint/restart support and an available cr module is found, the job's
thread level will automatically be promoted to MPI_THREAD_SERIALIZED. See the
User's Guide for more details.
LAM defines three phases for checkpoint / restart support in each MPI process:
- Checkpoint: When the checkpoint request arrives, before the actual
checkpoint occurs.
- Continue: After a checkpoint has successfully completed, in the same
process in which the checkpoint was invoked.
- Restart: After a checkpoint has successfully completed, in a new /
restarted process.
The Continue and Restart phases are identical except for the process in which
they are invoked -- the Continue phase is invoked in the same process in which
the Checkpoint phase was invoked. The Restart phase is only invoked in newly
restarted processes.
LAM currently has two cr modules: blcr and self. In order for an
MPI job to be able to be checkpointed and restarted, all of its MPI SSI
modules must support checkpoint/restart. Currently, this means using the
RPI module or the gm
RPI module when compiled with
support (see the User's Guide for more details).
The Berkeley Lab Checkpoint/Restart (BLCR) single-node checkpointer is a
software system from Lawrence Berkeley Labs. See the project web page for more
details. The blcr module has one SSI parameter:
- blcr's default priority is 50.
The self module, when used with checkpoint/restart SSI modules, invokes
user-defined functions to save and restore checkpoints. It is simply a
mechanism for user-defined functions to be invoked at LAM's Checkpoint,
Continue, and Restart phases. Hence, the only data saved during the checkpoint
is what the user's checkpoint function writes. No MPI library state is saved
at all.
As such, the model for the self module is slightly different from that of, for
example, the blcr module. Specifically, the Restart function is not invoked in
the same process image as the process that was checkpointed. Instead, the
Restart phase is invoked during MPI_INIT of a new instance of the application
(i.e., it starts over from main()).
Multiple SSI parameters are available:
- cr_self_user_prefix: Specify a string prefix for the name of the checkpoint,
continue, and restart functions that should be invoked by LAM. That is,
specifying "-ssi cr_self_user_prefix foo" means that LAM expects
to find three functions at run-time: int foo_checkpoint(), int
foo_continue(), and int foo_restart(). This is a convenience parameter
that can be used instead of the three parameters listed below.
- Name of the user function to invoke during the Checkpoint phase.
- Name of the user function to invoke during the Continue phase.
- Name of the user function to invoke during the Restart phase.
If none of these parameters are specified and the self module is
selected, it will use the default prefix lam_cr_self.
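As a rough illustration, the three user functions for the self module might
look like the following minimal sketch. The names follow the default prefix
lam_cr_self combined with the _checkpoint, _continue, and _restart suffixes
described above. Everything else is an assumption made for the example and is
not taken from LAM: the app_iteration variable, the state file name, and the
convention of returning 0 on success and -1 on failure.

```c
/* Sketch of user callbacks for the self cr module (default prefix).
 * Assumptions, not LAM-defined: app_iteration, state_file, and the
 * 0 = success / -1 = failure return convention. */
#include <stdio.h>

/* Hypothetical application state that should survive a restart. */
long app_iteration = 0;

/* Hypothetical state file; a real job would likely need one per process. */
static const char *state_file = "self_cr_state.txt";

/* Checkpoint phase: save whatever the application needs to resume.
 * Only the data written here is preserved; no MPI state is saved. */
int lam_cr_self_checkpoint(void)
{
    FILE *fp = fopen(state_file, "w");
    if (fp == NULL) {
        return -1;
    }
    int ok = fprintf(fp, "%ld\n", app_iteration) > 0;
    if (fclose(fp) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}

/* Continue phase: runs in the process that was checkpointed, so the
 * in-memory state is still valid and nothing needs to be reloaded. */
int lam_cr_self_continue(void)
{
    return 0;
}

/* Restart phase: runs in a new instance of the application, so the
 * saved state has to be read back from the file. */
int lam_cr_self_restart(void)
{
    FILE *fp = fopen(state_file, "r");
    if (fp == NULL) {
        return -1;
    }
    int ok = fscanf(fp, "%ld", &app_iteration) == 1;
    fclose(fp);
    return ok ? 0 : -1;
}
```

With functions named this way, a command line such as
"mpirun -ssi cr self C my_mpi_program" should need no further cr_self
parameters, since the default prefix applies.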
Finally, the usual priority SSI parameter is also available:
- self's default priority is 25.
lamssi(7), mpirun(1), LAM User's Guide