Checkpoint/Restart
Differences between Checkpointing/Restart and Migration regarding kddms
task: repair checkpointing/restart functionality for a single-threaded application
relation of checkpointing/restart and migration
- share same subtasks:
- extracting all process related data (kernel structure values and process address space)
- rebuilding a process using the extracted data
- => the same subset of functions can be used (export_*/import_* functions)
- migration is a kind of “atomic” operation
- extracting process data and rebuilding a process happen in sequence and should not be interrupted
- checkpoint and restart are two separate “atomic” operations: extracting process data and rebuilding a process are separate tasks
- => checkpoint data should be persistent so that the process can be restarted after a reboot or after it has been killed
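The shared export/import path can be sketched as follows. This is only an illustration, not Kerrighed code: toy_ghost, toy_proc, export_toy_proc, import_toy_proc and toy_migrate are hypothetical names, and an in-memory buffer stands in for both the file ghost and the network ghost so that the sketch runs in user space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a ghost: a byte sink/source.
 * The kind field only labels which backend the ghost represents. */
enum ghost_kind { GHOST_FILE, GHOST_NETWORK };

struct toy_ghost {
    enum ghost_kind kind;
    unsigned char buf[256]; /* backing store for this sketch */
    size_t len;             /* bytes written so far */
    size_t pos;             /* read cursor */
};

void toy_ghost_init(struct toy_ghost *g, enum ghost_kind kind)
{
    assert(g != NULL);
    memset(g, 0, sizeof(*g));
    g->kind = kind;
}

/* Append n bytes; returns 0 on success, -1 if the buffer is full. */
int toy_ghost_write(struct toy_ghost *g, const void *data, size_t n)
{
    assert(g != NULL && data != NULL);
    if (n > sizeof(g->buf) - g->len)
        return -1;
    memcpy(g->buf + g->len, data, n);
    g->len += n;
    return 0;
}

/* Read the next n bytes; returns 0 on success, -1 on a short read. */
int toy_ghost_read(struct toy_ghost *g, void *data, size_t n)
{
    assert(g != NULL && data != NULL);
    if (n > g->len - g->pos)
        return -1;
    memcpy(data, g->buf + g->pos, n);
    g->pos += n;
    return 0;
}

/* Toy process state; the real export covers many kernel structures. */
struct toy_proc { int pid; int tgid; };

/* The same export/import pair serves both use cases. */
int export_toy_proc(struct toy_ghost *g, const struct toy_proc *p)
{
    return toy_ghost_write(g, p, sizeof(*p));
}

int import_toy_proc(struct toy_ghost *g, struct toy_proc *p)
{
    return toy_ghost_read(g, p, sizeof(*p));
}

/* Migration: export and import back to back through one ghost. */
int toy_migrate(const struct toy_proc *src, struct toy_proc *dst)
{
    struct toy_ghost g;

    toy_ghost_init(&g, GHOST_NETWORK);
    if (export_toy_proc(&g, src) != 0)
        return -1;
    return import_toy_proc(&g, dst);
}
```

For checkpoint/restart the caller would invoke export_toy_proc and import_toy_proc separately on a GHOST_FILE ghost that outlives the process, which is why the ghost contents have to be persistent.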
kddms involved in migration and checkpoint/restart
- system kddms: task, child, signal_struct, sighand_struct, app_struct, pid
- kddms: memory
locating the corresponding kddm object in the various system kddms
- the PID is the object id of the corresponding task kddm object
- the TGID is the object id of the corresponding children kddm object
- children kddm object can also be retrieved by task_struct via member children_obj
- the TGID is the object id of the corresponding signal_struct kddm object
- a custom unique id is the object id of sighand_struct kddm object
- sighand_struct kddm object can also be retrieved by the task kddm object via member sighand_struct_id
- sighand_struct kddm object can be retrieved by task_struct via member sighand->krg_objid
- the app_struct kddm object can be retrieved via member app_id of struct app_struct
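The id rules above can be summarised in a small sketch. The types below are hypothetical user-space stand-ins (toy_task, toy_sighand, toy_kddm_ids, objid_t); only the member name krg_objid and the PID/TGID rules are taken from the list.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long objid_t; /* assumed id type, for illustration only */

/* Only the fields named in the list above; the real structures are larger. */
struct toy_sighand { objid_t krg_objid; };

struct toy_task {
    int pid;
    int tgid;
    struct toy_sighand *sighand;
};

struct toy_kddm_ids {
    objid_t task;     /* task kddm object: keyed by the PID */
    objid_t children; /* children kddm object: keyed by the TGID */
    objid_t signal;   /* signal_struct kddm object: keyed by the TGID */
    objid_t sighand;  /* sighand_struct kddm object: custom unique id */
};

/* Collect the object ids under which a task's system kddm objects live. */
struct toy_kddm_ids toy_kddm_ids_for_task(const struct toy_task *t)
{
    struct toy_kddm_ids ids;

    assert(t != NULL && t->sighand != NULL);
    ids.task = (objid_t)t->pid;
    ids.children = (objid_t)t->tgid;
    ids.signal = (objid_t)t->tgid;
    ids.sighand = t->sighand->krg_objid; /* via sighand->krg_objid */
    return ids;
}
```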
reason for checkpoint/restart failure examined
In case of an “atomic” migration, system kddm objects are still within the cluster. If they are not yet locally present, they can be fetched using the kddm get/grab interface.
In case of checkpoint/restart, all process-related (system) kddm entries vanish after a reboot or after the process has been killed. This leads to an unsuccessful restart, since data from the system kddm entries is needed during the process rebuilding phase.
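The failure can be reproduced with a toy model. The sketch below is an assumption-laden illustration, not the kddm API: toy_kddm_set, toy_get, toy_put and toy_reboot are hypothetical, and a fixed array replaces the distributed set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_SET_SIZE 8

struct toy_entry {
    int used;
    unsigned long objid;
    int payload; /* stands in for e.g. a signal_struct */
};

struct toy_kddm_set { struct toy_entry e[TOY_SET_SIZE]; };

/* Look up objid; returns NULL when no entry exists. */
struct toy_entry *toy_get(struct toy_kddm_set *s, unsigned long objid)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < TOY_SET_SIZE; i++)
        if (s->e[i].used && s->e[i].objid == objid)
            return &s->e[i];
    return NULL;
}

/* Create or overwrite the entry for objid; returns NULL if the set is full. */
struct toy_entry *toy_put(struct toy_kddm_set *s, unsigned long objid, int payload)
{
    size_t i;
    struct toy_entry *e = toy_get(s, objid);

    for (i = 0; e == NULL && i < TOY_SET_SIZE; i++)
        if (!s->e[i].used)
            e = &s->e[i];
    if (e == NULL)
        return NULL;
    e->used = 1;
    e->objid = objid;
    e->payload = payload;
    return e;
}

/* Model a reboot (or the process being killed): all entries vanish. */
void toy_reboot(struct toy_kddm_set *s)
{
    assert(s != NULL);
    memset(s, 0, sizeof(*s));
}
```

After toy_reboot, toy_get returns NULL for every id, so a restart that relies on the entry fails unless the payload was saved at checkpoint time and is put back first.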
1. Checkpoint - save the corresponding (system) kddm objects:
When extracting the kernel structure values of a process, the system kddm objects have to be saved as well (in migration they do not vanish).
Save kernel structure values persistently: a file ghost is used within the export_* functions (instead of a network ghost, as is the case in migration).
system kddm objects
- no need to save task kddm object data
- in case of single process checkpointing no need to save children kddm object data
- function export_signal_struct (epm/g_signal.c)
- save object id
- save struct signal_struct
- function export_sighand_struct (epm/ghost_process_management.c)
- save object id
- save struct sighand_struct
- function save_app_struct_ctnr (epm/application_checkpoint.c)
- save object id
- number of checkpoint
- user id
- group id
- array of nodes which are participating in the application
- when extracting the process address space, a memory kddm is set up in case of migration
- the process address space is deleted after killing the process or after a reboot
- save the process address space persistently using a file ghost
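As an illustration of saving an object id together with its data, here is a minimal sketch of writing and reading back the app_struct fields listed earlier (object id, number of checkpoint, user id, group id, node array). It is not the code of save_app_struct_ctnr: the record layout, the field types, TOY_MAX_NODES and the function names are assumptions, and a plain stdio FILE stands in for the file ghost.

```c
#include <assert.h>
#include <stdio.h>

#define TOY_MAX_NODES 4 /* hypothetical bound on the node array */

struct toy_app_record {
    unsigned long objid;                /* object id of the app_struct kddm object */
    int chkpt_number;                   /* the "number of checkpoint" value */
    unsigned int uid;                   /* user id */
    unsigned int gid;                   /* group id */
    unsigned char nodes[TOY_MAX_NODES]; /* 1 if the node participates */
};

/* Write the record field by field; returns 0 on success, -1 on error. */
int save_toy_app_record(FILE *ghost, const struct toy_app_record *r)
{
    assert(ghost != NULL && r != NULL);
    if (fwrite(&r->objid, sizeof(r->objid), 1, ghost) != 1 ||
        fwrite(&r->chkpt_number, sizeof(r->chkpt_number), 1, ghost) != 1 ||
        fwrite(&r->uid, sizeof(r->uid), 1, ghost) != 1 ||
        fwrite(&r->gid, sizeof(r->gid), 1, ghost) != 1 ||
        fwrite(r->nodes, sizeof(r->nodes), 1, ghost) != 1)
        return -1;
    return 0;
}

/* Read the fields back in the same order; returns 0 on success, -1 on error. */
int load_toy_app_record(FILE *ghost, struct toy_app_record *r)
{
    assert(ghost != NULL && r != NULL);
    if (fread(&r->objid, sizeof(r->objid), 1, ghost) != 1 ||
        fread(&r->chkpt_number, sizeof(r->chkpt_number), 1, ghost) != 1 ||
        fread(&r->uid, sizeof(r->uid), 1, ghost) != 1 ||
        fread(&r->gid, sizeof(r->gid), 1, ghost) != 1 ||
        fread(r->nodes, sizeof(r->nodes), 1, ghost) != 1)
        return -1;
    return 0;
}
```

Storing the object id first lets the restart side know under which id the object has to be recreated before the remaining fields are read.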
2. Restart - rebuild the corresponding (system) kddm objects
recreate all the system kddm objects using the saved data:
- function import_krg_structs (epm/ghost_process_management.c)
- create an entry (first touch)
- initialize the task kddm object using current values from the task being recreated
- no need to save task kddm object values in the file ghost
- create an entry (first touch)
- merely initialize the list of children; when checkpointing a single-process application there is no need to fill the list with children
- signal_struct kddm object
- function import_signal_struct (epm/g_signal.c)
- create an entry (first touch) – use saved object id?
- initialize signal_struct object using saved data
- sighand_struct kddm object
- function import_sighand_struct (epm/ghost_process_management.c)
- create an entry (first touch) – use saved object id?
- initialize sighand_struct object using saved data
- At the time of restart, the parent process may have disappeared. That is why the process is currently reparented to the init process.
- In addition, the process group id and session id are probably free, so they need to be handled as well. Currently, pgid and sid are forced to be equal to the pid at restart time.
- Remote procedure call to allocate/reserve the PID on the originating node
err = sync_remote_service_call(ORIG_NODE(pid), APP_RESERVE_PID,
                               APPLICATION_CR_CHAN, &pid,
                               sizeof(pid));
- if the process has been migrated before the checkpoint, we need to create a pid kddm object
- Moreover, the task kddm object and the pid kddm object must be linked on the originating node so that the process is visible to the ps command. This is done after import_pids and import_krg_structs by calling an RPC (APP_LINK_PID_TASK) from the function restart_process_from_ghost.
- function import_pid_for_restart (proc/pid_management.c)
- force PGID and SESSION_ID to be equal to the PID: this is bad for a multi-threaded or multi-process application
- if the process is on the originating node, simply link the task_struct to the struct pid and do not manage a pid kddm object
- else link to the pid kddm object in the same way as at the end of a migration
- function load_app_struct_ctnr (epm/application_restart.c)
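The id handling at restart described in the list above (reparent to init, force pgid and sid to the pid, use a pid kddm object only when not on the originating node) can be sketched as below. This is a simplified illustration with hypothetical names (toy_restart_ids, toy_fixup_ids_for_restart); in particular, the originating node is passed in as a plain field here instead of being derived from the pid.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_INIT_PID 1 /* pid of the init process */

struct toy_restart_ids {
    int pid;
    int pgid;
    int sid;
    int parent_pid;
    int orig_node;      /* node on which the pid was allocated (assumed given) */
    int needs_pid_kddm; /* output: 1 if a pid kddm object must be linked */
};

/* Apply the current restart policy to the ids read from the checkpoint. */
void toy_fixup_ids_for_restart(struct toy_restart_ids *r, int local_node)
{
    assert(r != NULL);

    /* The former parent may be gone: reparent to init. */
    r->parent_pid = TOY_INIT_PID;

    /* The old pgid and sid may no longer exist: force both to the pid.
     * This is the part that is bad for multi-threaded or multi-process
     * applications. */
    r->pgid = r->pid;
    r->sid = r->pid;

    /* On the originating node the task is linked to the struct pid directly;
     * elsewhere a pid kddm object is used, as at the end of a migration. */
    r->needs_pid_kddm = (r->orig_node != local_node);
}
```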
recreate the process address space:
- set up a memory kddm
- in case of a distributed multithreaded application!
- for System V segment
- if there has been one before the checkpoint?
- Mfertre: Why? If pages are in the memory kddm only because of a migration, it is not mandatory to restore them in the memory kddm; they can be restored as local-only pages.
- do not set up a memory kddm (usual unshared process address space)
- if there has not been one before the checkpoint?
Contributions
This page has been initiated by :
Both working for the XtreemOS European project.