OpenVZ Forum


Home » Mailing lists » Devel » Re: [PATCH 4/5] c/r: checkpoint and restart pids objects
Re: [PATCH 4/5] c/r: checkpoint and restart pids objects [message #41534] Sat, 05 February 2011 21:43 Go to previous message
Sukadev Bhattiprolu is currently offline  Sukadev Bhattiprolu
Messages: 413
Registered: August 2006
Senior Member
Oren:

I am still reviewing this patchset, but have a few questions/comments
below on this patch.

| From: Oren Laadan <orenl@cs.columbia.edu>
| Subject: [PATCH 4/5] c/r: checkpoint and restart pids objects
|
| Make use of (shared) pids objects instead of simply saving the pid_t
| numbers in both checkpoint and restart.
|
| The motivation for this change is twofold. First, since pid-ns came to
| life pid's in the kenrel _are_ shared objects and should be treated as
| such. This is useful e.g. for tty handling and also file-ownership
| (the latter waiting for this feature). Second, to properly support
| nested namesapces we need to report with each pid the entire list of
| pid numbers, not only a single pid. While current we do that for all
| "live" pids (those that belong to live tasks), we didn't do it for
| "dead" pids (to be assigned to ghost restarting tasks).
|
| Note, that ideally the list of vpids of a pid object should also
| include the pid-ns to which each level belongs; however, in this patch
| we don't yet hanlde that. So only linear pid-nesting works well and
| not arbitrary tree.
|
| DICLAIMER: this patch is big and intrusive! Here is a summary of the
| changes that it makes:
|
| CHECKPOINT:
|
| 1) Modified the data structures used to describe pids and tasks' pids:
| struct ckpt_pids - for the data of a pids object (depth, numbers)
| (pids objects are collected in the order found, and are assigned
| tags sequentially, starting from 1)
| struct ckpt_task_pids - for a task's pids, holding the _tag_ of the
| corresponding pids object rather than their pid numbers themselves.
|
| 2) Accordingly, two arrays are used to hold this information:
| ctx->pids_arr - array of 'struct ckpt_pids' collected from the tasks
| and the pids they reference. Entries are of variable size depending
| on the pid-ns nesting level.
| ctx->tasks_arr - array of 'struct ckpt_task_pids' collected from teh
| tasks. Entries are of fixed size, and hold the objref tags to the
| shared pids objects rather than actual pid numbers.
| (the old vpids_arr is no longer needed, nor written separately).
|
| 3) We now first write the pids information, then tasks' pids.
|
| 4) checkpoint_pids() builds and writes the ctx->pids_arr:
| checkpoint_pids_build() - iterates over the tasks and collects the
| unique pids in a flex_array (also inserts them into the objhash)
| checkpoint_pids_dumps() - dumps the data from the flex_array in
| the format of ctx->tasks_arr
|
| 5) checkpoint_tree() dumps the tasks' pids information, by scanning
| all the tasks and writing out tags of the pids they reference. If
| a pgid/sid is zero, i.e. from an ancestor pid-ns, then the tag will
| be zero.
|
| 6) In container checkpoint, pids out of our namesapce are disallwed.
| We don't do leak detection on pids objects (should we ?).
|
| RESTART:
|
| 1) We first call prepare_descendants() to set the ->checkpoint_ctx
| of the restarting tasks, and _then_ read the pids data followed by
| the tasks' pids data. We validate both against existing tasks.
|
| 2) restore_read_pids() reads the pids data, validates that each pid
| exists (*) and adds the pids to the objhash. Verify that the owner
| task is within our restart context.
|
| (*) We validate them from the root task's point of view, by seeing
| that the task has the correct 'struct pid' pointer. NOTE: user-cr
| does not support restart --no-pids when there are nested pis-ns,
| because is it quite complicated to find ou the pids of all tasks
| at all nested levels from userspace.
|
| 3) restore_read_tasks() reads the tasks' pids data, validates each
| task and adds it to the ctx->tasks_arr. Verify that the task is
| within our restart context.
|
| 4) We track the array of restarting _tasks_ and the active _task_
| instead an array of restarting pids and the active pid. The helpers
| to wake-up, sync, check active task etc were modified accordingly.
| It improves and simplifies the logic, e.g. restore_activate_next().
|
| 5) There are two special values for pgid/sid tags:
| 0 - means that it is from an ancestor namespace, so we verify that
| this is the case. For sid, user-cr should have created the task
| properly; for pgid, use the coordinator's (or coordinator's parent)
| pid if from differnet namespace, or fail.
| CKPT_PID_ROOT - means that we want to reuse the root task's sid,
| useful for when the root task is _not_ a conatiner init (e.g. in
| subtree c/r) and its session (like our pgrp) was inherited from
| somewhere above).
|
| 6) Restoring of a task's pgid was moved to when task is validated,
| as this should be part of the validation.
|
| NOTE: the patch does not yet allow non-linear nesting of pid-ns.
| This would require to make pid-ns a shared object and track it by
| the 'struct ckpt_pids' on the kernel side, and in userspace we'll
| need to update the logic of MakeForest algorithm to be pid-ns aware
| (probably similarly to how sid constraints are handled).
|
| Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
| ---
| include/linux/checkpoint_hdr.h | 22 ++-
| include/linux/checkpoint_types.h | 10 +-
| kernel/checkpoint/checkpoint.c | 440 +++++++++++++++++++++----------
| kernel/checkpoint/process.c | 108 +-------
| kernel/checkpoint/restart.c | 551 ++++++++++++++++++++++++++++++--------
| kernel/checkpoint/sys.c | 5 -
| 6 files changed, 773 insertions(+), 363 deletions(-)
|
| diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
| index 922eff0..c0a548a 100644
| --- a/include/linux/checkpoint_hdr.h
| +++ b/include/linux/checkpoint_hdr.h
| @@ -107,7 +107,9 @@ enum {
| CKPT_HDR_SECURITY,
| #define CKPT_HDR_SECURITY CKPT_HDR_SECURITY
|
| - CKPT_HDR_TREE = 101,
| + CKPT_HDR_PIDS = 101,
| +#define CKPT_HDR_PIDS CKPT_HDR_PIDS
| + CKPT_HDR_TREE,
| #define CKPT_HDR_TREE CKPT_HDR_TREE
| CKPT_HDR_TASK,
| #define CKPT_HDR_TASK CKPT_HDR_TASK
| @@ -358,20 +360,32 @@ struct ckpt_hdr_container {
| */
| } __attribute__((aligned(8)));;
|
| +/* pids array */
| +struct ckpt_hdr_pids {
| + struct ckpt_hdr h;
| + __u32 nr_pids;
| + __u32 nr_vpids;
| +} __attribute__((aligned(8)));

For consistency can we call this ckpt_hdr_pids_tree ?

| +
| +struct ckpt_pids {
| + __u32 depth;
| + __s32 numbers[1];
| +} __attribute__((aligned(8)));
| +

This actually corresponds to _one_ 'struct pid' right ? Can we rename to
'struct ckpt_pid' or ckpt_struct_pid ?

| /* task tree */
| struct ckpt_hdr_tree {
| struct ckpt_hdr h;
| - __s32 nr_tasks;
| + __u32 nr_tasks;
| } __attribute__((aligned(8)));

And this to, ckpt_hdr_task_tree ?

|
| -struct ckpt_pids {
| +struct ckpt_task_pids {
| /* These pids are in the root_nsproxy's pid ns */
| __s32 vpid;
| __s32 vppid;
| __s32 vtgid;
| __s32 vpgid;
| __s32 vsid;
| - __s32 depth; /* pid namespace depth relative to container init */
| + __u32 depth;
| } __attribute__((aligned(8)));
|
| /* pids */
| diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
| index 87a569a..60c664f 100644
| --- a/include/linux/checkpoint_types.h
| +++ b/include/linux/checkpoint_types.h
| @@ -68,16 +68,14 @@ struct ckpt_ctx {
|
| int nr_vpids; /* total count of vpids */
|
| - /* [checkpoint] */
| - struct task_struct *tsk; /* current target task */
| struct task_struct **tasks_arr; /* array of all tasks */
| int nr_tasks; /* size of tasks array */
|
| + /* [checkpoint] */
| + struct task_struct *tsk; /* current target task */
| +
| /* [restart] */
| - struct pid_namespace *coord_pidns;/* coordinator pid_ns */
| - struct ckpt_pids *pids_arr; /* array of all pids [restart] */
| - int nr_pids; /* size of pids array */
| - int active_pid; /* (next) position in pids array */
| + int active_task; /* (next) position in pids array */
| atomic_t nr_total; /* total tasks count */
| struct completion complete; /* completion for container root */
| wait_queue_head_t waitq; /* waitqueue for restarting tasks */
| diff --git a/kernel/checkpoint/checkpoint.c b/kernel/checkpoint/checkpoint.c
| index 42de30a..f351f49 100644
| --- a/kernel/checkpoint/checkpoint.c
| +++ b/kernel/checkpoint/checkpoint.c
| @@ -313,133 +313,6 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
| return ret;
| }
|
| -#define CKPT_HDR_PIDS_CHUNK 256
| -
| -/*
| - * Write the pids in ctx->root_nsproxy->pidns. This info is
| - * needed at restart to unambiguously dereference tasks.
| - */
| -static int checkpoint_pids(struct ckpt_ctx *ctx)
| -{
| - struct ckpt_pids *h;
| - struct pid_namespace *root_pidns;
| - struct task_struct *task;
| - struct task_struct **tasks_arr;
| - int nr_tasks, n, pos = 0, ret = 0;
| -
| - root_pidns = ctx->root_nsproxy->pid_ns;
| - tasks_arr = ctx->tasks_arr;
| - nr_tasks = ctx->nr_tasks;
| - BUG_ON(nr_tasks <= 0);
| -
| - ret = ckpt_write_obj_type(ctx, NULL,
| - sizeof(*h) * nr_tasks,
| - CKPT_HDR_BUFFER);
| - if (ret < 0)
| - return ret;
| -
| - h = ckpt_hdr_get(ctx, sizeof(*h) * CKPT_HDR_PIDS_CHUNK);
| - if (!h)
| - return -ENOMEM;
| -
| - do {
| - rcu_read_lock();
| - for (n = 0; n < min(nr_tasks, CKPT_HDR_PIDS_CHUNK); n++) {
| - struct pid_namespace *task_pidns;
| - task = tasks_arr[pos];
| -
| - h[n].vpid = task_pid_nr_ns(task, root_pidns);
| - h[n].vtgid = task_tgid_nr_ns(task, root_pidns);
| - h[n].vpgid = task_pgrp_nr_ns(task, root_pidns);
| - h[n].vsid = task_session_nr_ns(task, root_pidns);
| - h[n].vppid = task_tgid_nr_ns(task->real_parent,
| - root_pidns);
| - task_pidns = task_active_pid_ns(task);
| - h[n].depth = task_pidns->level - root_pidns->level;
| -
| - ckpt_debug("task[%d]: vpid %d vtgid %d parent %d\n",
| - pos, h[n].vpid, h[n].vtgid, h[n].vppid);
| - ctx->nr_vpids += h[n].depth;
| - pos++;
...

 
Read Message
Read Message
Previous Topic: Re: [PATCH v2] cgroup/freezer: add per freezer duty ratio control
Next Topic: Re: [PATCH, v3 2/2] cgroups: introduce timer slack subsystem
Goto Forum:
  


Current Time: Sun Oct 26 11:41:21 GMT 2025

Total time taken to generate the page: 0.10841 seconds