lamssi_cr(7)		      LAM SSI CR OVERVIEW		  lamssi_cr(7)

NAME
       LAM  SSI	 checkpoint  /	restart	 -  overview of LAM's MPI checkpoint /
       restart SSI modules

DESCRIPTION
       The "kind" for checkpoint / restart SSI modules is "cr".	 Specifically,
       the  string "cr" (without the quotes) is the prefix that should be used
       with the mpirun command line with the -ssi switch.  For example:

       mpirun -ssi cr blcr C my_mpi_program

       LAM/MPI can involuntarily checkpoint and restart parallel MPI jobs.
       Doing so requires that LAM/MPI be compiled with thread support and
       that back-end checkpointing systems be available at run-time.  MPI
       jobs must run with at least MPI_THREAD_SERIALIZED support.  If a job
       elects to run with checkpoint/restart support and an available cr
       module is found, the job's thread level is automatically promoted to
       MPI_THREAD_SERIALIZED.  See the User's Guide for more details.

   Checkpoint Phases
       LAM defines three phases for checkpoint / restart support in  each  MPI
       process:

       Checkpoint.
           Invoked when the checkpoint request arrives, before the actual
           checkpoint occurs.

       Continue.
           Invoked after a checkpoint has successfully completed, in the
           same process in which the checkpoint was invoked.

       Restart.
           Invoked after a checkpoint has successfully completed, in a new /
           restarted process.

       The Continue and Restart phases are identical except for the process
       in which they are invoked: the Continue phase is invoked in the same
       process in which the Checkpoint phase was invoked, whereas the Restart
       phase is invoked only in newly restarted processes.

AVAILABLE MODULES
       LAM currently has two cr modules: blcr and self.  For an MPI job to be
       checkpointed and restarted, all of its MPI SSI modules must support
       checkpoint/restart.  Currently, this means using the crtcp RPI module,
       or the gm RPI module when compiled with gm_get() support (see the
       User's Guide for more details).

   blcr CR Module
       The Berkeley Lab Checkpoint/Restart (BLCR) single-node checkpointer is
       a software system from Lawrence Berkeley National Laboratory.  See the
       project web page for more details:
       http://www.nersc.gov/research/ftg/checkpoint/.

       The blcr module has one SSI parameter:

       cr_blcr_priority
	   blcr's default priority is 50.

   self CR Module
       The self module invokes user-defined functions to save and restore
       checkpoints.  It is simply a mechanism for user-defined functions to
       be invoked at LAM's Checkpoint, Continue, and Restart phases.  Hence,
       the only data saved during the checkpoint is what the user's
       checkpoint function writes.  No MPI library state is saved at all.

       As such, the model for the self module is slightly different from
       that of, for example, the blcr module.  Specifically, the Restart
       function is not invoked in the process image that was checkpointed.
       Instead, the Restart phase is invoked during MPI_INIT of a new
       instance of the application (i.e., it starts over from main()).

       Multiple SSI parameters are available:

       cr_self_user_prefix
	   Specify a string prefix for the name of the	checkpoint,  continue,
	   and	restart	 functions  that  should  be invoked by LAM.  That is,
	   specifying "-ssi cr_self_user_prefix foo" means that LAM expects to
	   find	  three	 functions  at	run-time:  int	foo_checkpoint(),  int
	   foo_continue(), and	int  foo_restart().   This  is	a  convenience
	   parameter  that  can be used instead of the three parameters listed
	   below.

       cr_self_user_checkpoint
	   Name of the user function to invoke during the Checkpoint phase.

       cr_self_user_continue
	   Name of the user function to invoke during the Continue phase.

       cr_self_user_restart
	   Name of the user function to invoke during the Restart phase.

       If none of these parameters is specified and the self module is
       selected, it uses the default prefix lam_cr_self.

       Finally, the usual priority SSI parameter is also available:

       cr_self_priority
	   self's default priority is 25.
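
       The following is a minimal sketch of the three user functions that
       the self module would look up when given "-ssi cr_self_user_prefix
       foo".  The saved state (app_iteration), the checkpoint file name
       (foo_state.ckpt), and the convention of returning 0 on success are
       assumptions made for illustration; this page does not specify them.
       See the User's Guide for the authoritative details.

```c
#include <stdio.h>

/* Hypothetical application state: the only data this sketch saves. */
static long app_iteration = 0;

/* Hypothetical checkpoint file name; not prescribed by LAM. */
static const char *state_file = "foo_state.ckpt";

/* Checkpoint phase: write the application state to disk.
 * Returning 0 for success is an assumed convention. */
int foo_checkpoint(void)
{
    FILE *fp = fopen(state_file, "w");
    if (fp == NULL)
        return -1;

    int ok = fprintf(fp, "%ld\n", app_iteration) > 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Continue phase: runs in the process that was checkpointed, so the
 * state is still in memory and nothing needs to be reloaded. */
int foo_continue(void)
{
    return 0;
}

/* Restart phase: runs in a new instance of the application, so the
 * state must be read back from the checkpoint file. */
int foo_restart(void)
{
    FILE *fp = fopen(state_file, "r");
    if (fp == NULL)
        return -1;

    long saved = 0;
    int ok = fscanf(fp, "%ld", &saved) == 1;
    fclose(fp);
    if (!ok)
        return -1;

    app_iteration = saved;
    return 0;
}
```

       Assuming these functions are linked into the MPI program, a run
       combining the switches shown earlier in this page might look like
       "mpirun -ssi cr self -ssi cr_self_user_prefix foo C my_mpi_program".
       Because no MPI library state is saved, anything the application
       needs after a restart has to be written by foo_checkpoint() itself.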

SEE ALSO
       lamssi(7), mpirun(1), LAM User's Guide

LAM 7.1.5b2			  June, 2008			  lamssi_cr(7)