Home » Mailing lists » Devel » [PATCH 0/6] linux-cr: make pids a proper shared object (v2)
[PATCH 5/6] c/r: checkpoint and restart pids objects [message #41577 is a reply to message #41559] |
Mon, 07 February 2011 17:18  |
Oren Laadan
Messages: 71 Registered: August 2007
|
Member |
|
|
Make use of (shared) pids objects instead of simply saving the pid_t
numbers in both checkpoint and restart.
The motivation for this change is twofold. First, since pid-ns came to
life pid's in the kenrel _are_ shared objects and should be treated as
such. This is useful e.g. for tty handling and also file-ownership
(the latter waiting for this feature). Second, to properly support
nested namesapces we need to report with each pid the entire list of
pid numbers, not only a single pid. While current we do that for all
"live" pids (those that belong to live tasks), we didn't do it for
"dead" pids (to be assigned to ghost restarting tasks).
Note, that ideally the list of vpids of a pid object should also
include the pid-ns to which each level belongs; however, in this patch
we don't yet hanlde that. So only linear pid-nesting works well and
not arbitrary tree.
DICLAIMER: this patch is big and intrusive! Here is a summary of the
changes that it makes:
CHECKPOINT:
1) Modified the data structures used to describe pids and tasks' pids:
struct ckpt_pids - for the data of a pids object (depth, numbers)
(pids objects are collected in the order found, and are assigned
tags sequentially, starting from 1)
struct ckpt_task_pids - for a task's pids, holding the _tag_ of the
corresponding pids object rather than their pid numbers themselves.
2) Accordingly, two arrays are used to hold this information:
ctx->pids_arr - array of 'struct ckpt_pids' collected from the tasks
and the pids they reference. Entries are of variable size depending
on the pid-ns nesting level.
ctx->tasks_arr - array of 'struct ckpt_task_pids' collected from teh
tasks. Entries are of fixed size, and hold the objref tags to the
shared pids objects rather than actual pid numbers.
(the old vpids_arr is no longer needed, nor written separately).
3) We now first write the pids information, then tasks' pids.
4) checkpoint_pids() builds and writes the ctx->pids_arr:
checkpoint_pids_build() - iterates over the tasks and collects the
unique pids in a flex_array (also inserts them into the objhash)
checkpoint_pids_dumps() - dumps the data from the flex_array in
the format of ctx->tasks_arr
5) checkpoint_tree() dumps the tasks' pids information, by scanning
all the tasks and writing out tags of the pids they reference. If
a pgid/sid is zero, i.e. from an ancestor pid-ns, then the tag will
be zero.
6) In container checkpoint, pids out of our namesapce are disallwed.
We don't do leak detection on pids objects (should we ?).
RESTART:
1) We first call prepare_descendants() to set the ->checkpoint_ctx
of the restarting tasks, and _then_ read the pids data followed by
the tasks' pids data. We validate both against existing tasks.
2) restore_read_pids() reads the pids data, validates that each pid
exists (*) and adds the pids to the objhash. Verify that the owner
task is within our restart context.
(*) We validate them from the root task's point of view, by seeing
that the task has the correct 'struct pid' pointer. NOTE: user-cr
does not support restart --no-pids when there are nested pis-ns,
because is it quite complicated to find ou the pids of all tasks
at all nested levels from userspace.
3) restore_read_tasks() reads the tasks' pids data, validates each
task and adds it to the ctx->tasks_arr. Verify that the task is
within our restart context.
4) We track the array of restarting _tasks_ and the active _task_
instead an array of restarting pids and the active pid. The helpers
to wake-up, sync, check active task etc were modified accordingly.
It improves and simplifies the logic, e.g. restore_activate_next().
5) There are two special values for pgid/sid tags:
0 - means that it is from an ancestor namespace, so we verify that
this is the case. For sid, user-cr should have created the task
properly; for pgid, use the coordinator's (or coordinator's parent)
pid if from differnet namespace, or fail.
CKPT_PID_ROOT - means that we want to reuse the root task's sid,
useful for when the root task is _not_ a conatiner init (e.g. in
subtree c/r) and its session (like our pgrp) was inherited from
somewhere above).
6) Restoring of a task's pgid was moved to when task is validated,
as this should be part of the validation.
NOTE: the patch does not yet allow non-linear nesting of pid-ns.
This would require to make pid-ns a shared object and track it by
the 'struct ckpt_pids' on the kernel side, and in userspace we'll
need to update the logic of MakeForest algorithm to be pid-ns aware
(probably similarly to how sid constraints are handled).
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
include/linux/checkpoint_hdr.h | 23 ++-
include/linux/checkpoint_types.h | 10 +-
kernel/checkpoint/checkpoint.c | 450 +++++++++++++++++++++---------
kernel/checkpoint/process.c | 108 +-------
kernel/checkpoint/restart.c | 568 +++++++++++++++++++++++++++++++-------
kernel/checkpoint/sys.c | 5 -
kernel/signal.c | 8 +-
7 files changed, 806 insertions(+), 366 deletions(-)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 922eff0..6f991c6 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -107,7 +107,9 @@ enum {
CKPT_HDR_SECURITY,
#define CKPT_HDR_SECURITY CKPT_HDR_SECURITY
- CKPT_HDR_TREE = 101,
+ CKPT_HDR_PIDS = 101,
+#define CKPT_HDR_PIDS CKPT_HDR_PIDS
+ CKPT_HDR_TREE,
#define CKPT_HDR_TREE CKPT_HDR_TREE
CKPT_HDR_TASK,
#define CKPT_HDR_TASK CKPT_HDR_TASK
@@ -358,20 +360,33 @@ struct ckpt_hdr_container {
*/
} __attribute__((aligned(8)));;
+/* pids array */
+struct ckpt_hdr_pids {
+ struct ckpt_hdr h;
+ __u32 nr_pids;
+ __u32 nr_vpids;
+ __u32 offset;
+} __attribute__((aligned(8)));
+
+struct ckpt_pids {
+ __u32 depth;
+ __s32 numbers[1];
+} __attribute__((aligned(8)));
+
/* task tree */
struct ckpt_hdr_tree {
struct ckpt_hdr h;
- __s32 nr_tasks;
+ __u32 nr_tasks;
} __attribute__((aligned(8)));
-struct ckpt_pids {
+struct ckpt_task_pids {
/* These pids are in the root_nsproxy's pid ns */
__s32 vpid;
__s32 vppid;
__s32 vtgid;
__s32 vpgid;
__s32 vsid;
- __s32 depth; /* pid namespace depth relative to container init */
+ __u32 depth;
} __attribute__((aligned(8)));
/* pids */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 87a569a..60c664f 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -68,16 +68,14 @@ struct ckpt_ctx {
int nr_vpids; /* total count of vpids */
- /* [checkpoint] */
- struct task_struct *tsk; /* current target task */
struct task_struct **tasks_arr; /* array of all tasks */
int nr_tasks; /* size of tasks array */
+ /* [checkpoint] */
+ struct task_struct *tsk; /* current target task */
+
/* [restart] */
- struct pid_namespace *coord_pidns;/* coordinator pid_ns */
- struct ckpt_pids *pids_arr; /* array of all pids [restart] */
- int nr_pids; /* size of pids array */
- int active_pid; /* (next) position in pids array */
+ int active_task; /* (next) position in pids array */
atomic_t nr_total; /* total tasks count */
struct completion complete; /* completion for container root */
wait_queue_head_t waitq; /* waitqueue for restarting tasks */
diff --git a/kernel/checkpoint/checkpoint.c b/kernel/checkpoint/checkpoint.c
index a850423..342590b 100644
--- a/kernel/checkpoint/checkpoint.c
+++ b/kernel/checkpoint/checkpoint.c
@@ -26,6 +26,7 @@
#include <linux/utsname.h>
#include <linux/magic.h>
#include <linux/hrtimer.h>
+#include <linux/flex_array.h>
#include <linux/deferqueue.h>
#include <linux/checkpoint.h>
#include <linux/pid_namespace.h>
@@ -312,133 +313,6 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
return ret;
}
-#define CKPT_HDR_PIDS_CHUNK 256
-
-/*
- * Write the pids in ctx->root_nsproxy->pidns. This info is
- * needed at restart to unambiguously dereference tasks.
- */
-static int checkpoint_pids(struct ckpt_ctx *ctx)
-{
- struct ckpt_pids *h;
- struct pid_namespace *root_pidns;
- struct task_struct *task;
- struct task_struct **tasks_arr;
- int nr_tasks, n, pos = 0, ret = 0;
-
- root_pidns = ctx->root_nsproxy->pid_ns;
- tasks_arr = ctx->tasks_arr;
- nr_tasks = ctx->nr_tasks;
- BUG_ON(nr_tasks <= 0);
-
- ret = ckpt_write_obj_type(ctx, NULL,
- sizeof(*h) * nr_tasks,
- CKPT_HDR_BUFFER);
- if (ret < 0)
- return ret;
-
- h = ckpt_hdr_get(ctx, sizeof(*h) * CKPT_HDR_PIDS_CHUNK);
- if (!h)
- return -ENOMEM;
-
- do {
- rcu_read_lock();
- for (n = 0; n < min(nr_tasks, CKPT_HDR_PIDS_CHUNK); n++) {
- struct pid_namespace *task_pidns;
- task = tasks_arr[pos];
-
- h[n].vpid = task_pid_nr_ns(task, root_pidns);
- h[n].vtgid = task_tgid_nr_ns(task, root_pidns);
- h[n].vpgid = task_pgrp_nr_ns(task, root_pidns);
- h[n].vsid = task_session_nr_ns(task, root_pidns);
- h[n].vppid = task_tgid_nr_ns(task->real_parent,
- root_pidns);
- task_pidns = task_active_pid_ns(task);
- h[n].depth = task_pidns->level - root_pidns->level;
-
- ckpt_debug("task[%d]: vpid %d vtgid %d parent %d\n",
- pos, h[n].vpid, h[n].vtgid, h[n].vppid);
- ctx->nr_vpids += h[n].depth;
- pos++;
- }
- rcu_read_unlock();
-
- n = min(nr_tasks, CKPT_HDR_PIDS_CHUNK);
- ret = ckpt_kwrite(ctx, h, n * sizeof(*h));
- if (ret < 0)
- break;
-
- nr_tasks -= n;
- } while (nr_tasks > 0);
-
- _ckpt_hdr_put(ctx, h, sizeof(*h) * CKPT_HDR_PIDS_CHUNK);
- return ret;
-}
-
-static int checkpoint_vpids(struct ckpt_ctx *ctx)
-{
- __s32 *h; /* vpid array */
- struct pid_namespace *root_pidns, *task_pidns = NULL, *active_pidns;
- struct task_struct *task;
- int ret, nr_tasks = ctx->nr_tasks;
- int tidx = 0; /* index into task array */
- int hidx = 0; /* pids wr
...
|
|
|
Goto Forum:
Current Time: Fri Aug 15 14:44:51 GMT 2025
Total time taken to generate the page: 0.58714 seconds
|