Name

checkpoint — Checkpoint an application.

Synopsis

checkpoint [OPTIONS] pid

Description

checkpoint checkpoints a running application identified by one of its processes given by pid.

Checkpoint/restart consists of storing a snapshot of the current application state so that the application can later be restarted from this snapshot. It is useful for fault tolerance, scheduling, hardware maintenance and debugging.

Checkpointing an application consists of three steps: freezing the application, saving the application state on disk, and unfreezing the application. By default, all three steps are run in sequence.
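The three steps can also be run one at a time with the --freeze, --ckpt-only and --unfreeze options described below. The following is a minimal sketch of such a sequence that unfreezes the application even when the save step fails. The function name checkpoint_in_steps and the CHECKPOINT_CMD variable are hypothetical and only serve the illustration; CHECKPOINT_CMD lets the sequence be tried as a dry run (for instance with echo) on a machine without Kerrighed.

```shell
#!/bin/sh
# Sketch only: run the three checkpoint steps separately.
# CHECKPOINT_CMD is a hypothetical override used for dry runs;
# it defaults to the real checkpoint command.
CHECKPOINT_CMD=${CHECKPOINT_CMD:-checkpoint}

checkpoint_in_steps() {
    pid=$1
    # Step 1: freeze the application. If this fails, nothing needs undoing.
    $CHECKPOINT_CMD --freeze "$pid" || return 1
    # Step 2: save the state of the frozen application on disk.
    status=0
    $CHECKPOINT_CMD --ckpt-only "$pid" || status=$?
    # Step 3: unfreeze the application, even if step 2 failed.
    $CHECKPOINT_CMD --unfreeze "$pid" || return 1
    return "$status"
}
```

Setting CHECKPOINT_CMD=echo before calling checkpoint_in_steps prints the three commands instead of running them.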

To allow an application to be checkpointed, one must set the CHECKPOINTABLE capability. See krgcr-run(1), krgcapset(1) and kerrighed_capabilities(7) for further details.

In the general case, an application consists of a tree of processes or threads. The root process of the application must have the CHECKPOINTABLE capability effective and inheritable before creating other processes or threads (see fork(2), clone(2), and pthread_create(3)).

If the root process of the application exits later, all its child processes are still considered processes of the same application.

Options

-h, --help

Print help and exit.

-v, --version

Print version information and exit.

-a, --from-appid

Use pid as an application identifier, not as a standard process identifier.

-f, --freeze

Freeze an application without checkpointing it. This is useful if you have to save some objects (such as files) before running the checkpoint, without racing with the application.

-u [signal], --unfreeze=[signal]

Unfreeze a previously frozen application without checkpointing it. Optionally, the signal signal is sent to all processes of the application just before unfreezing it. The signal is handled by each process as soon as the process is woken up. By default, no signal is sent. signal must be given as a numerical value. The list of signal numbers can be retrieved with kill -L.

-c, --ckpt-only

Checkpoint an already frozen application.

-k [signal], --kill=[signal]

Send the signal signal to all processes of the application after checkpointing the running application and before unfreezing it. The signal is handled by each process as soon as the process is woken up. By default, the SIGTERM signal is sent. signal must be given as a numerical value. The list of signal numbers can be retrieved with kill -L.

-i, --ignore-unsupported-files

Allow checkpointing an application even if it is using files whose type is not supported by the checkpoint/restart mechanisms (such as sockets). At restart time, the corresponding file descriptors appear to be closed. Without this option, if there are unsupported files, the checkpoint fails with 'Function not implemented'.

-d description, --description=description

Associate the description description with the checkpoint. The description is recorded in description.txt in the checkpoint folder (see FILES below).

Options --freeze, --ckpt-only, --unfreeze and --kill are mutually exclusive.

Options --description and --ignore-unsupported-files make sense only when actually checkpointing the application.
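Since --unfreeze and --kill only accept numerical signal values, a small helper can translate a signal name into its number. The sketch below is an illustration, not part of checkpoint: sig_num is a hypothetical function name. It first asks the shell's own kill -l, which maps names to numbers in bash and several other shells, and falls back to the handful of numbers that POSIX fixes when the shell does not support that form.

```shell
#!/bin/sh
# Sketch only: translate a signal name (TERM or SIGTERM) into its number.
# sig_num is a hypothetical helper name.
sig_num() {
    name=${1#SIG}
    # A value that is already numeric is passed through unchanged.
    case $name in
        '') return 1 ;;
        *[!0-9]*) ;;
        *) printf '%s\n' "$name"; return 0 ;;
    esac
    # Ask the shell's kill; not every shell accepts a name here.
    num=$(kill -l "$name" 2>/dev/null) || num=
    case $num in
        ''|*[!0-9]*)
            # Fallback: signal numbers that are the same on POSIX systems.
            case $name in
                HUP) num=1 ;; INT) num=2 ;; QUIT) num=3 ;; ABRT) num=6 ;;
                KILL) num=9 ;; ALRM) num=14 ;; TERM) num=15 ;;
                *) return 1 ;;
            esac ;;
    esac
    printf '%s\n' "$num"
}
```

With this helper, the output of sig_num TERM can be substituted for the signal argument of --kill or --unfreeze.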

Supported applications and limitations

Multi-process applications and multithreaded programs are supported. To be able to restart the application, all process identifiers (including process group and session identifiers) used by the application must be unused. Option --pid of restart(1) may be useful if the process group leader or session leader has not been checkpointed.

Checkpointing applications with zombie processes is not supported.

Checkpointing of applications using socket(s) or named pipe(s) (FIFOs) is not supported (unless --ignore-unsupported-files is used). Therefore, graphical applications are not supported, since they use sockets to communicate with the X server.

Checkpointing of applications using anonymous pipe(s) may work, depending on where the processes are running. Otherwise, it fails at checkpoint time with 'Function not implemented'.

System V IPC objects are not restored, but it is possible to checkpoint an application that is currently waiting on such IPC objects. For instance, you can checkpoint a process waiting to receive a message from a message queue. The process will replay the action after the restart if the IPC objects still exist. The state of System V IPC objects can be saved and restored using ipccheckpoint(1) and ipcrestart(1) respectively. For consistency, the application should be frozen before saving the state of System V IPC objects.

Similarly to System V IPC objects, POSIX shared memory segments (SHM) are not checkpointed with the application. However, like System V shared memory segments, POSIX shared memory segments can be saved independently. Processes will be reattached at restart time if the shared memory segments still exist or if they have been restored. In Linux, the state of a POSIX shared memory segment can be saved or restored by copying (using cp(1)) from or to /dev/shm/<shm_name>. For consistency, the application should be frozen before saving the state of POSIX shared memory segments.
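When an application uses several POSIX shared memory segments, the copies can be wrapped in two small functions. This is a minimal sketch under the assumption that the application is already frozen; save_shm, restore_shm and the SHM_DIR variable are hypothetical names introduced for the illustration. SHM_DIR defaults to /dev/shm and can be pointed at another directory to try the functions safely.

```shell
#!/bin/sh
# Sketch only: save and restore POSIX shared memory segments with cp(1).
# SHM_DIR is a hypothetical override; on Linux segments live in /dev/shm.
SHM_DIR=${SHM_DIR:-/dev/shm}

# save_shm DEST NAME...: copy each named segment to DEST/NAME.bin.
save_shm() {
    dest=$1; shift
    mkdir -p "$dest" || return 1
    for name in "$@"; do
        cp "$SHM_DIR/$name" "$dest/$name.bin" || return 1
    done
}

# restore_shm SRC NAME...: copy each SRC/NAME.bin back to its segment.
restore_shm() {
    src=$1; shift
    for name in "$@"; do
        cp "$src/$name.bin" "$SHM_DIR/$name" || return 1
    done
}
```

Both functions stop at the first failed copy, so a non-zero return value means the saved or restored set may be incomplete.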

Files are neither checkpointed nor restored. The files are reopened at restart time and file pointers are restored. This means that files must be in a consistent state at restart time. Otherwise, your application may behave unexpectedly. You can use the --freeze option before the checkpoint to back up the files manually.

To restart your application, you must run exactly the same kernel as before the checkpoint. Thus, you cannot checkpoint an application before upgrading your kernel and restart it once the upgrade is done.

Example

The following example shows how to start an application, checkpoint it and restart it.

Start the application

$ krgcr-run ./mycomputeprogram 12 1024 58

Checkpoint the application

Freeze the application:

$ checkpoint --freeze `pgrep mycomputeprogram`

Save related System V IPC objects:

$ ipccheckpoint -s 2 ~/chkpt/ipcsem.bin

Save related POSIX shared memory segments:

$ cp /dev/shm/shm_computeprogram12 ~/chkpt/shm_computeprogram12.bin

Save related files:

$ cp ./compute12_result.log ~/chkpt/compute12_result.log

Dump state of application processes:

$ checkpoint --ckpt-only `pgrep mycomputeprogram`

Unfreeze the application:

$ checkpoint --unfreeze `pgrep mycomputeprogram`

Restart the application

Later, you may want to restart your application from the last checkpoint.

Restore related System V IPC objects:

$ ipcrestart -s ~/chkpt/ipcsem.bin

Restore related POSIX shared memory segments:

$ cp ~/chkpt/shm_computeprogram12.bin /dev/shm/shm_computeprogram12

Restore related files:

$ cp ~/chkpt/compute12_result.log ./compute12_result.log

Finally, restart the program using its appid (1632 in this example):

$ restart --foreground 1632 1

Files

/var/chkpt

This directory is the default location for disk checkpoints.

/var/chkpt/<appid>/v<version>/

This directory contains the nth checkpoint (with n equal to version) of the application identified by appid.

To remove a checkpoint from disk, remove this folder.
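Given this layout, the most recent checkpoint of an application can be found by listing its version folders. The sketch below assumes that versions are plain integers, as in the example above; latest_version and the CHKPT_DIR variable are hypothetical names, and CHKPT_DIR defaults to /var/chkpt.

```shell
#!/bin/sh
# Sketch only: print the highest checkpoint version stored for an appid.
# CHKPT_DIR is a hypothetical override of the default checkpoint location.
CHKPT_DIR=${CHKPT_DIR:-/var/chkpt}

latest_version() {
    appid=$1
    latest=
    for dir in "$CHKPT_DIR/$appid"/v*/; do
        [ -d "$dir" ] || continue
        # Strip the trailing slash, then everything up to the leading "v".
        v=${dir%/}
        v=${v##*/v}
        # Skip folders whose suffix is not a plain integer.
        case $v in
            ''|*[!0-9]*) continue ;;
        esac
        if [ -z "$latest" ] || [ "$v" -gt "$latest" ]; then
            latest=$v
        fi
    done
    # Return non-zero when no checkpoint exists for this appid.
    [ -n "$latest" ] && printf '%s\n' "$latest"
}
```

The versions are compared numerically, so v10 is considered newer than v2.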

Authors

Matthieu Fertré, Renaud Lottiaux