checkpoint — Checkpoint an application.
checkpoint [OPTIONS] pid
checkpoint checkpoints a running application identified
by one of its processes given by pid.
Checkpoint/restart consists of storing a snapshot of the current application state, from which the application can later be restarted. It is useful for fault tolerance, scheduling, hardware maintenance and debugging.
Checkpointing an application consists of three steps: freezing the application, saving its state on disk, and unfreezing it. By default, all three steps are run in sequence.
To allow an application to be checkpointed, one must set the CHECKPOINTABLE capability. See krgcr-run(1), krgcapset(1) and kerrighed_capabilities(7) for further details.
In the general case, an application consists of a tree of processes or
threads. The root process of the application must have the CHECKPOINTABLE
capability effective and inheritable before creating other processes or
threads (see fork(2), clone(2),
and pthread_create(3)).
If the root process of the application exits later, all its child processes are still considered processes of the same application.
-h, --help
Print help and exit.
-v, --version
Print version information and exit.
-a, --from-appid
Use pid as an application identifier, not
as a standard process identifier.
-f, --freeze
Freeze an application without checkpointing it. This is useful when you need to save some objects (such as files) before running the checkpoint, without racing with the application.
-u [signal], --unfreeze=[signal]
Unfreeze a previously frozen application without checkpointing
it. Optionally, send the signal given by signal
to all processes of the application just before unfreezing it. The
signal is handled by each process as soon as the process is
woken up. By default, no signal is sent.
signal must be given as a numerical
value. The list of signal numbers can be retrieved with
kill -L.
-c, --ckpt-only
Checkpoint an already frozen application.
-k [signal], --kill=[signal]
Send the signal given by signal to all processes
of the application after checkpointing the running application
and before unfreezing it. The signal is handled
by each process as soon as the process is woken up. By default,
the SIGTERM signal is sent. signal must
be given as a numerical value. The list of signal numbers can be
retrieved with kill -L.
-i, --ignore-unsupported-files
Allow checkpointing an application even if it uses files whose type is not supported by the checkpoint/restart mechanisms (such as sockets). At restart time, the corresponding file descriptors appear to be closed. Without this option, the checkpoint fails with 'Function not implemented' if the application uses any unsupported file.
-d description, --description=description
Associate the text given by description
with the checkpoint. The description is recorded in
description.txt in the
checkpoint folder (see FILES below).
The options --freeze, --ckpt-only,
--unfreeze and --kill are mutually exclusive.
The options --description and
--ignore-unsupported-files only make sense when
actually checkpointing the application.
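Since --unfreeze and --kill only accept numerical signal values, a small lookup helper can spare a manual search through the output of kill -L. The following is a minimal sketch, not part of checkpoint: the function name signum is hypothetical, and it assumes a POSIX shell whose kill -l N prints the name of signal N without the SIG prefix.

```shell
# Hypothetical helper (not part of checkpoint): print the number of a signal
# given its name, with or without the SIG prefix.
signum() {
    name=${1#SIG}
    n=1
    while [ "$n" -le 64 ]; do
        # kill -l N prints the name of signal N, without the SIG prefix.
        if [ "$(kill -l "$n" 2>/dev/null)" = "$name" ]; then
            echo "$n"
            return 0
        fi
        n=$((n + 1))
    done
    echo "unknown signal: $1" >&2
    return 1
}

# Possible use (pid holds the pid of one process of the application):
# checkpoint --unfreeze="$(signum USR1)" "$pid"
```

The loop stops at 64, which covers the standard and real-time signals on Linux; adjust the bound if your system differs.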
Multi-process applications and multithreaded programs are supported.
To be able to restart the application, all process identifiers (including
process group and session identifiers) used by the application must be
unused. The --pid option of restart(1) may
be useful if the process group leader or session leader has not been
checkpointed.
Checkpointing applications with zombie processes is not supported.
Checkpointing applications that use sockets or named pipes (FIFOs) is
not supported (unless --ignore-unsupported-files is used).
Therefore, graphical applications are not supported, since
they use sockets to communicate with the X server.
Checkpointing an application that uses anonymous pipes may work, depending on where the processes are running. Otherwise, it fails at checkpoint time with 'Function not implemented'.
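Before checkpointing, it can help to see which descriptors of a process are sockets, FIFOs or pipes, since these are the ones likely to trigger 'Function not implemented'. The following is a minimal sketch, not part of checkpoint: the function name list_ipc_fds is hypothetical, and it assumes a Linux /proc file system and permission to inspect the process. It has to be run for each process of the application.

```shell
# Hypothetical helper: list the file descriptors of process $1 that are
# sockets or pipes (anonymous pipes and named FIFOs are both reported as
# "pipe"). Assumes a Linux /proc file system.
list_ipc_fds() {
    for fd in /proc/"$1"/fd/*; do
        # Skip the unexpanded pattern (no such process) and stale entries.
        [ -e "$fd" ] || continue
        if [ -S "$fd" ]; then
            kind=socket
        elif [ -p "$fd" ]; then
            kind=pipe
        else
            continue
        fi
        # Print: descriptor number, kind, and the target of the /proc link.
        echo "${fd##*/} $kind $(readlink "$fd")"
    done
}
```

For an anonymous pipe the link target has the form pipe:[inode], whereas for a named FIFO it is the path of the FIFO, which distinguishes the two cases in the output.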
System V IPC objects are not restored, but it is possible to checkpoint an application that is currently waiting on such objects. For instance, you can checkpoint a process waiting to receive a message from a message queue. The process will replay the action after the restart if the IPC objects still exist. The state of System V IPC objects can be saved and restored using ipccheckpoint(1) and ipcrestart(1), respectively. For consistency, the application should be frozen before saving the state of System V IPC objects.
Like System V IPC objects, POSIX shared memory segments (SHM)
are not checkpointed with the application. However, like System V
shared memory segments, they can be saved
independently. Processes are reattached at restart time if the shared
memory segments still exist or have been restored. On Linux, the state
of a POSIX shared memory segment can be saved or restored by copying it
(using cp(1)) from or to
/dev/shm/<shm_name>.
For consistency, the application should be frozen before saving the state
of POSIX shared memory segments.
Files are neither checkpointed nor restored. They are reopened at
restart time and their file pointers are restored. This means that files must be
in a consistent state at restart time; otherwise, the application may
behave unexpectedly. You can use the
--freeze option before the checkpoint to back up
the files manually.
To restart your application, you must run exactly the same kernel as before the checkpoint. Thus, you cannot checkpoint an application, upgrade your kernel, and then restart the application once the upgrade is done.
The following example shows how to checkpoint a running application and restart it.
$ checkpoint --freeze `pgrep mycomputeprogram`
$ ipccheckpoint -s 2 ~/chkpt/ipcsem.bin
$ cp /dev/shm/shm_computeprogram12 ~/chkpt/shm_computeprogram12.bin
$ cp ./compute12_result.log ~/chkpt/compute12_result.log
$ checkpoint --ckpt-only `pgrep mycomputeprogram`
$ checkpoint --unfreeze `pgrep mycomputeprogram`
Later, you may want to restart your application from the last checkpoint.
$ ipcrestart -s ~/chkpt/ipcsem.bin
$ cp ~/chkpt/shm_computeprogram12.bin /dev/shm/shm_computeprogram12
$ cp ~/chkpt/compute12_result.log ./compute12_result.log
Then restart the application using its appid
(1632 in this example):
$ restart --foreground 1632 1
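In the freeze, backup, checkpoint and unfreeze sequence of the example above, a failing copy would leave the application frozen. The following sketch wraps the sequence so that the unfreeze step always runs. It is only an illustration: the function safe_checkpoint and the CHECKPOINT variable are hypothetical names introduced here, and only options documented on this page are used.

```shell
# CHECKPOINT is a hypothetical override so that the sequence can be dry-run
# (for example CHECKPOINT=echo); it defaults to the checkpoint command.
CHECKPOINT=${CHECKPOINT:-checkpoint}

# Hypothetical wrapper.
# $1: pid of one process of the application
# $2: file to back up while the application is frozen
# $3: existing directory receiving the backup
safe_checkpoint() {
    pid=$1
    file=$2
    dest=$3
    $CHECKPOINT --freeze "$pid" || return 1
    status=0
    # Copy the file while the application is frozen, then checkpoint it.
    cp "$file" "$dest"/ && $CHECKPOINT --ckpt-only "$pid" || status=$?
    # Unfreeze whether or not the previous steps succeeded.
    $CHECKPOINT --unfreeze "$pid" || status=$?
    return "$status"
}
```

A call such as safe_checkpoint "$pid" ./compute12_result.log ~/chkpt returns a non-zero status if any step failed, but the application is unfrozen in every case once the freeze itself has succeeded.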
/var/chkpt
This directory is the default location for disk checkpoints.
/var/chkpt/<appid>/v<version>/
This directory contains the nth checkpoint
(with n equal to version)
of the application identified by appid.
To remove a checkpoint from disk, remove this folder.
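Given this layout, the checkpoints of an application can be listed with a few lines of shell. The following is a minimal sketch, not part of checkpoint: the function list_checkpoints and the CHKPT_DIR variable are hypothetical names, and it assumes that description.txt, when present, holds the description on its first line.

```shell
# CHKPT_DIR is a hypothetical variable introduced for this sketch; it
# defaults to the documented location /var/chkpt.
CHKPT_DIR=${CHKPT_DIR:-/var/chkpt}

# Hypothetical helper: print "<version> <description>" for each checkpoint
# of the application whose appid is $1.
list_checkpoints() {
    for dir in "$CHKPT_DIR"/"$1"/v*/; do
        # Skip the unexpanded pattern when the application has no checkpoint.
        [ -d "$dir" ] || continue
        version=${dir%/}            # strip the trailing slash
        version=${version##*/v}     # keep what follows the last "/v"
        if [ -r "$dir/description.txt" ]; then
            desc=$(head -n 1 "$dir/description.txt")
        else
            desc="(no description)"
        fi
        echo "$version $desc"
    done
}
```

Versions are printed in the shell's glob order, which is lexicographic, so v10 is listed before v2.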
Matthieu Fertré <matthieu.fertre@kerlabs.com>,
Renaud Lottiaux <renaud.lottiaux@kerlabs.com>