The "kind" for checkpoint / restart SSI modules is "cr".
Specifically, the string "cr" (without the quotes) is the prefix that
should be used on the mpirun command line with the -ssi switch. For
example:
mpirun -ssi cr blcr C my_mpi_program
LAM/MPI can involuntarily checkpoint and restart parallel MPI jobs.
Doing so requires that LAM/MPI was compiled with thread support and
that back-end checkpointing systems are available at run-time. MPI
jobs will have to run with at least MPI_THREAD_SERIALIZED support. If
a job elects to run with checkpoint/restart support and an available
cr module is found, the job's thread level will automatically be
promoted to MPI_THREAD_SERIALIZED. See the User's Guide for more
details.
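The promotion rule described above can be modeled in a few lines of C. This is only a sketch of the rule as stated here, not LAM's implementation: the enum and the function effective_thread_level() are hypothetical names. The enum mirrors the ordering of the MPI thread levels (SINGLE < FUNNELED < SERIALIZED < MULTIPLE) and does not use the real MPI constants.

```c
/* Hypothetical stand-ins for the MPI thread levels, in increasing order. */
typedef enum {
    THREAD_SINGLE,
    THREAD_FUNNELED,
    THREAD_SERIALIZED,
    THREAD_MULTIPLE
} thread_level_t;

/*
 * Model of the rule: if the job asked for checkpoint/restart support and
 * a cr module was found, the thread level is raised to at least
 * THREAD_SERIALIZED.  Otherwise the requested level is left unchanged.
 */
thread_level_t effective_thread_level(thread_level_t requested,
                                      int cr_requested,
                                      int cr_module_found)
{
    if (cr_requested && cr_module_found && requested < THREAD_SERIALIZED) {
        return THREAD_SERIALIZED;
    }
    return requested;
}
```

In a real MPI program, the level actually granted can be inspected through the "provided" argument of MPI_Init_thread() or by calling MPI_Query_thread().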
Checkpoint Phases
LAM defines three phases for checkpoint / restart support in each MPI
process:
Checkpoint.
When the checkpoint request arrives, before the actual checkpoint
occurs.
Continue.
After a checkpoint has successfully completed, in the same process in
which the checkpoint was invoked.
Restart.
After a checkpoint has successfully completed, in a new / restarted
process.
The Continue and Restart phases are identical except for the process
in which they are invoked: the Continue phase is invoked in the same
process in which the Checkpoint phase was invoked, whereas the Restart
phase is invoked only in newly restarted processes.
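The relationship between the three phases can be sketched as a small C model. The type cr_phase_t and both functions below are hypothetical and exist only for illustration; they are not part of the LAM/MPI API.

```c
/* Hypothetical labels for the three phases described above. */
typedef enum {
    CR_PHASE_CHECKPOINT,  /* request arrived, checkpoint not yet taken */
    CR_PHASE_CONTINUE,    /* checkpoint done, original process         */
    CR_PHASE_RESTART      /* checkpoint done, newly restarted process  */
} cr_phase_t;

/*
 * Phase that follows a successful checkpoint.  The only input that
 * matters is whether this process is the one that took the checkpoint
 * or a process restarted from it.
 */
cr_phase_t cr_phase_after_checkpoint(int is_restarted_process)
{
    return is_restarted_process ? CR_PHASE_RESTART : CR_PHASE_CONTINUE;
}

/* Human-readable phase name, e.g. for logging. */
const char *cr_phase_name(cr_phase_t phase)
{
    switch (phase) {
    case CR_PHASE_CHECKPOINT: return "Checkpoint";
    case CR_PHASE_CONTINUE:   return "Continue";
    case CR_PHASE_RESTART:    return "Restart";
    }
    return "unknown";
}
```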
AVAILABLE MODULES
LAM currently has two cr modules: blcr and self.
For an MPI job to be checkpointed and restarted, all of its MPI SSI
modules must support checkpoint/restart. Currently, this means using
the crtcp RPI module, or the gm RPI module when compiled with gm_get()
support (see the User's Guide for more details).
blcr CR Module
The Berkeley Lab Checkpoint/Restart (BLCR) single-node checkpointer is
a software system from Lawrence Berkeley Labs. See the project web
page for more details:
http://www.nersc.gov/research/ftg/checkpoint/.
The blcr module has one SSI parameter:
cr_blcr_priority
blcr's default priority is 50.
self CR Module
The self CR module effectively allows application-level checkpointing
by invoking user-specified functions at the Checkpoint, Continue, and
Restart phases of LAM/MPI C/R support.
Multiple SSI parameters are available:
cr_self_user_prefix
Specify a string prefix for the names of the checkpoint, continue, and
restart functions that should be invoked by LAM. That is,
specifying "-ssi cr_self_user_prefix foo" means that LAM expects to
find three functions at run-time: foo_checkpoint(), foo_continue(),
and foo_restart(). This is a convenience parameter that can be used
instead of the three parameters listed below.
cr_self_user_checkpoint
Name of the user function to invoke during the Checkpoint phase.
cr_self_user_continue
Name of the user function to invoke during the Continue phase.
cr_self_user_restart
Name of the user function to invoke during the Restart phase.
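A minimal sketch of three user functions matching the prefix "foo" follows. Several details are assumptions and are not stated on this page: the signature (no arguments, int return with 0 meaning success), the state being saved (a single counter), and the file name foo_state.ckpt. Consult the User's Guide for the signature LAM actually expects.

```c
#include <stdio.h>

/* Hypothetical application state saved and restored by the callbacks. */
static long iteration = 0;

/* Illustrative file name; a real job would use a per-process path. */
static const char *state_file = "foo_state.ckpt";

/* Checkpoint phase: write the state to disk before the checkpoint. */
int foo_checkpoint(void)
{
    FILE *fp = fopen(state_file, "w");
    if (fp == NULL) {
        return -1;
    }
    int ok = fprintf(fp, "%ld\n", iteration) > 0;
    if (fclose(fp) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}

/* Continue phase: same process, state is still in memory. */
int foo_continue(void)
{
    return 0;
}

/* Restart phase: new process, reload the state from disk. */
int foo_restart(void)
{
    FILE *fp = fopen(state_file, "r");
    if (fp == NULL) {
        return -1;
    }
    int ok = fscanf(fp, "%ld", &iteration) == 1;
    fclose(fp);
    return ok ? 0 : -1;
}
```

With functions named this way, the job would be started with the self module selected and "-ssi cr_self_user_prefix foo" on the mpirun command line.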
If none of these parameters is specified and the self module is
selected, the module will abort. Finally, the usual priority SSI
parameter is also available: