[PATCH 10/11] restart: fix support for nested pid namespaces [message #41571 is a reply to message #41565]
Mon, 07 February 2011 17:21
Oren Laadan
Adapt restart code to the new pids handling in kernel-cr that handles
pids as a proper shared object.
DISCLAIMER: this patch is big and intrusive ... Here is a
summary of the changes that it makes:
1) The main change is that we read the 'ckpt_pids' that hold the
actual pids numbers, and then everything else uses tags that refer to
these objects. Since the ctx->pids_arr is an array of variable-length
entries, it is inconvenient to point into it with an index. So we use
another array, ctx->pids, that maps from a linear index to the offset
in the ctx->pids_arr where the data is found.
2) Now all pids other than those in 'ckpt_pids' are indices into that
array (more precisely, into the ctx->pids array), so the variables now have
a "_ind" suffix, e.g. "pid_ind" instead of "pid". There are helpers
to translate from index to pids structure.
3) Document the data structures used to track pids and tasks within
the restart code.
4) To support (linearly) nested-pids, the pids hash table was extended
to have depth, so that if we need to allocate a new (dummy) pid, we
can choose unique pids at all pid-ns levels, not just the top.
5) Accordingly, dummy pid allocation is done at all possible depths in
the hash.
6) Throw away ckpt_{read/write/assign}_vpids - it is no longer needed.
Instead, the sequence of calls is now:
ckpt_read_pids()
ckpt_read_tree()
ckpt_build_pids()
ckpt_build_tree()
7) Disallow restart with --no-pids if there are nested pid-ns, because
it is quite complicated to find out the pids of all tasks at all
nested levels from userspace.
8) If the root task is not a session leader (it must be from a subtree
checkpoint), then it should now inherit its sid from the coordinator.
Furthermore, other tasks with sid/pgid inherited from above the root
task should also do the same. For this to work we use a special value
for their {sid,pgid}_ind: we can't use 0, because that already means
a pid from an ancestor pid-ns; instead we mark it with CKPT_PID_ROOT,
and the kernel code knows how to handle it.
NOTE: this is only necessary when the root task is not a session
leader. Otherwise, we can just add a placeholder task to accomplish
the same effect (recall it's a subtree). But a placeholder cannot be
placed above the root task...
NOTE2: by doing this, we squash all the sids/pgids from above the
root task into a single common value at restart, even though they
may have been distinct at checkpoint. This is considered a feature
until someone really needs this to behave differently ...
9) Fix a subtle bug in the session-propagation logic: we don't
need a placeholder if we reach the root task _and_ we are a child of
the root task (because we will inherit the sid from the root task).
10) In ckpt_fork_child() we can use the 'ckpt_pids' structure for the
pids rather than manually building one.
11) In adjust_pids() and --no-pids we only try to update the numbers[0]
of the pid; we don't support nested pid-ns for --no-pids.
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
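Note on item 1: the index-to-offset mapping over variable-length entries can be sketched roughly as below. This is only an illustration, not the code in this patch. 'struct demo_pids', demo_build_index() and demo_pids_of_index() are hypothetical names, and the layout (a depth field followed by depth + 1 pid numbers) is an assumption; the real 'struct ckpt_pids' may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for 'struct ckpt_pids'; the real layout may differ. */
struct demo_pids {
	int depth;	/* nesting level relative to the root task */
	int numbers[];	/* assumed: depth + 1 pid values, one per level */
};

/* Size in bytes of one variable-length entry of the given depth. */
static size_t demo_entry_size(int depth)
{
	return sizeof(struct demo_pids) + ((size_t)depth + 1) * sizeof(int);
}

/*
 * Walk a packed buffer holding 'nr' entries and record the byte offset
 * of each one. Returns a malloc'ed array of 'nr' offsets, or NULL if
 * the buffer is malformed or memory runs out.
 */
static int *demo_build_index(const char *arr, size_t len, int nr)
{
	size_t off = 0;
	int *index;
	int i;

	if (nr <= 0)
		return NULL;
	index = malloc((size_t)nr * sizeof(*index));
	if (!index)
		return NULL;

	for (i = 0; i < nr; i++) {
		int depth;

		if (off + sizeof(struct demo_pids) > len)
			goto bad;
		memcpy(&depth, arr + off + offsetof(struct demo_pids, depth),
		       sizeof(depth));
		if (depth < 0 || off + demo_entry_size(depth) > len)
			goto bad;
		index[i] = (int)off;
		off += demo_entry_size(depth);
	}
	return index;
bad:
	free(index);
	return NULL;
}

/* Translate a linear index into the entry it refers to.
 * The buffer is assumed to be suitably aligned for int access. */
static const struct demo_pids *demo_pids_of_index(const char *arr,
						  const int *index, int i)
{
	return (const struct demo_pids *)(arr + index[i]);
}
```

With such a table, everything outside the pids array can carry a small integer index, and only the lookup helper needs to know that entries differ in size.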
restart.c | 1192 ++++++++++++++++++++++++++++++++++++++++---------------------
1 files changed, 787 insertions(+), 405 deletions(-)
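Note on items 4 and 5: one way to picture a pids hash with depth is one table per pid-ns level, with a dummy pid picked independently at every level. The sketch below is a simplified illustration under assumptions, not the code in this patch. All demo_* names, the fixed maximum depth, the bucket count and the pid range are hypothetical, and the per-level hint only loosely mirrors the 'hash_last_pid' field added to the context.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_MAX_DEPTH	8	/* assumed cap on pid-ns nesting */
#define DEMO_HASH_SIZE	64	/* buckets per level, arbitrary */
#define DEMO_PID_MAX	32768	/* assumed pid range */

struct demo_ent {
	struct demo_ent *next;
	int pid;
};

struct demo_pidhash {
	int depth;	/* number of pid-ns levels in use */
	struct demo_ent *table[DEMO_MAX_DEPTH][DEMO_HASH_SIZE];
	int last_pid[DEMO_MAX_DEPTH];	/* allocation hint per level */
};

/* Return 1 if 'pid' is already taken at 'level', 0 otherwise. */
static int demo_lookup(const struct demo_pidhash *h, int level, int pid)
{
	const struct demo_ent *e;

	for (e = h->table[level][pid % DEMO_HASH_SIZE]; e; e = e->next)
		if (e->pid == pid)
			return 1;
	return 0;
}

/* Mark 'pid' as taken at 'level'. Returns 0 on success, -1 on failure. */
static int demo_insert(struct demo_pidhash *h, int level, int pid)
{
	struct demo_ent *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->pid = pid;
	e->next = h->table[level][pid % DEMO_HASH_SIZE];
	h->table[level][pid % DEMO_HASH_SIZE] = e;
	return 0;
}

/* Pick a pid that is unused at 'level', scanning upward from the hint. */
static int demo_alloc_pid(struct demo_pidhash *h, int level)
{
	int pid = h->last_pid[level];
	int tries;

	for (tries = 0; tries < DEMO_PID_MAX; tries++) {
		pid = (pid + 1 < DEMO_PID_MAX) ? pid + 1 : 2;
		if (demo_lookup(h, level, pid))
			continue;
		if (demo_insert(h, level, pid) < 0)
			return -1;
		h->last_pid[level] = pid;
		return pid;
	}
	return -1;	/* every pid at this level is taken */
}

/* Allocate a dummy pid at every level; numbers[] gets one pid per level. */
static int demo_alloc_dummy(struct demo_pidhash *h, int *numbers)
{
	int level;

	for (level = 0; level < h->depth; level++) {
		numbers[level] = demo_alloc_pid(h, level);
		if (numbers[level] < 0)
			return -1;
	}
	return 0;
}
```

Keeping a separate table and hint per level means a dummy task gets a pid that is unique in each namespace it is visible from, which a single top-level table cannot guarantee.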
diff --git a/restart.c b/restart.c
index 01566c2..f64e508 100644
--- a/restart.c
+++ b/restart.c
@@ -45,6 +45,12 @@
#include "common.h"
/*
+ * To re-create the tasks tree in user space, 'restart' reads the
+ * header and tree data from the checkpoint image tree. It makes up
+ * for the data that was consumed by using a helper process that
+ * provides the data back to the restart syscall, followed by the rest
+ * of the checkpoint image stream.
+ *
* By default, 'restart' creates a new pid namespace in which the
* restart takes place, using the original pids from the time of the
* checkpoint. This requires that CLONE_NEWPID and eclone() be enabled.
@@ -54,19 +60,15 @@
* by default, 'restart' creates an equivalen tree without restoring
* the original pids, assuming that the application can tolerate this.
* For this, the 'ckpt_pids' array is transformed on-the-fly before it
- * is fed to the kernel.
+ * is fed to the kernel. This mode of operation is permitted only if
+ * all the restarting tasks belong to a single pid-namespace (i.e. no
+ * pid-namespace nesting).
*
- * By default, "--pids" implied "--pidns" and vice-versa. The user can
+ * By default, "--pids" implies "--pidns" and vice-versa. The user can
* use "--pids --no-pidns" for a restart in the currnet namespace -
* 'restart' will attempt to create the new tree with the original pids
* from the time of the checkpoint, if possible. This requires that
* eclone() be enabled.
- *
- * To re-create the tasks tree in user space, 'restart' reads the
- * header and tree data from the checkpoint image tree. It makes up
- * for the data that was consumed by using a helper process that
- * provides the data back to the restart syscall, followed by the rest
- * of the checkpoint image stream.
*/
struct hashent {
@@ -78,6 +80,75 @@ struct hashent {
struct task;
struct ckpt_ctx;
+/*
+ * The following data structures are used to track pids:
+ *
+ * ctx->pids_arr[]:
+ * Array of (variable sized) 'struct ckpt_pids' from the checkpoint
+ * image, each entry indicates the level (depth) relative to the
+ * root task, and the pids at each level. NOTE: the order of pids
+ * matches order of adding them to the objhash during checkpoint
+ * (hence their tags).
+ *
+ * ctx->pids_copy[]:
+ * Array used to hold a copy of pids_arr[] during --no-pids restart
+ * when converting the task's pids from the original values from
+ * the checkpoint image, to the real pids produced by forks.
+ *
+ * ctx->pids_new[]:
+ * Array of (variable sized) 'struct ckpt_pids' to hold new pids
+ * objects allocated by the MakeForest algorithm for the restart.
+ *
+ * ctx->pids_index[]:
+ * Array of integers that provides mapping from a pid object (tag)
+ * to the byte offset inside ctx->pids_arr where that pid object
+ * is. It is useful since the entries in the latter are of variable
+ * size.
+ *
+ * ctx->tasks_arr[]:
+ * Array of 'struct ckpt_task_pids' from the checkpoint image, each
+ * entry indicates a task's pids (pid,tgid,pgid,sid,ppid) and the
+ * pid-namespace nesting level. NOTE: the pids store the tags of the
+ * corresponding pid objects (and thus their order in ctx->pids_arr)
+ * rather than the pid values themselves.
+ *
+ * ctx->tasks[]:
+ * Array of 'struct task' that holds information about all the tasks
+ * that need to be created in userspace (the input and output of the
+ * DumpForest and CreateForest algorithms). NOTE: the pids here also
+ * store the tags of the corresponding pid objects.
+ *
+ * When the restart algorithm needs to create dead tasks or produce dummy
+ * tasks, it stores new 'ckpt_pids' objects in ctx->pids_new[], and
+ * extends ctx->pids[] and ctx->tasks[] to store index to new pids
+ * and new tasks, respectively.
+ *
+ * ctx->pids_nr: (original) size of ctx->pids_arr
+ * ctx->pids_cnt: current size of ctx->pids_index
+ * ctx->pids_max: maximum size of ctx->pids_index
+ * ctx->pids_off: current offset in ctx->pids_new[]
+ * ctx->pids_len: maximum offset in ctx->pids_new[]
+ *
+ * ctx->tasks_nr: size of ctx->tasks_arr
+ * ctx->tasks_cnt: current size of ctx->tasks
+ * ctx->tasks_max: maximum size of ctx->tasks
+ *
+ * Given a byte offset in ctx->pids_arr, to get the 'ckpt_pids':
+ * pids = pid_at_index(ctx, @offset)
+ *
+ * Given a pid-index from ctx->tasks/ctx->tasks_arr, to get the byte
+ * offset of the matching 'ckpt_pids' in ctx->pids_arr:
+ * ctx->pids_index[@index]
+ *
+ * And to get the 'ckpt_pids' from an index:
+ * pids = pids_of_index(@index)
+ *
+ *
+ * ctx->tasks_pids[]:
+ * Array of pid values indicating the next hint for pid allocation
+ * at each nesting level of pid-namespace.
+ */
+
struct task {
int flags; /* state and (later) actions */
@@ -91,13 +162,12 @@ struct task {
int vidx; /* index into vpid array, -1 if none */
int piddepth;
- pid_t pid; /* process IDs, our bread-&-butter */
- pid_t ppid;
- pid_t tgid;
- pid_t sid;
+ /* Following are INDEX values into ctx->pids_index */
+ int pid_ind; /* process IDs, our bread-&-butter */
+ int ppid_ind;
+ int tgid_ind;
+ int sid_ind;
- pid_t rpid; /* [restart without vpids] actual (real) pid */
-
struct ckpt_ctx *ctx; /* points back to the c/r context */
pid_t real_parent; /* pid of task's real parent */
@@ -127,32 +197,45 @@ struct ckpt_ctx {
int error;
int success;
- pid_t root_pid;
int pipe_in;
int pipe_out;
- int pids_nr;
- int vpids_nr;
int pipe_child[2]; /* for children to report status */
int pipe_feed[2]; /* for feeder to provide input */
int pipe_coord[2]; /* for coord to report status (if needed) */
+ int root_pid;
+ int pid_offset;
+
struct ckpt_pids *pids_arr;
- struct ckpt_pids *copy_arr;
- __s32 *vpids_arr;
+ struct ckpt_pids *pids_new;
+ struct ckpt_pids *pids_copy;
+ int *pids_index;
+
+ int pids_nr;
+ int vpids_nr;
+ int pids_cnt;
+ int pids_max;
+ int pids_off;
+ int pids_len;
+ struct ckpt_task_pids *tasks_arr;
struct task *tasks;
+
int tasks_nr;
+ int tasks_cnt;
int tasks_max;
- int tasks_pid;
- struct hashent **hash_arr;
+ /* an array of pid hash-tables: one hash-table per pidns level */
+ struct hashent ***hash_arr;
+ int *hash_last_pid;
+ int hash_depth;
char header[BUFSIZE];
char header_arch[BUFSIZE];
char container[BUFSIZE];
char tree[BUFSIZE];
- char vpids[BUFSIZE];
+ char pids[BUFSIZE];
char buf[BUFSIZE];
struct cr_restart_args *args;
@@ -194,9 +277,9 @@ int global_send_sigint = -1;
static int c
...