OpenVZ Forum


Home » Mailing lists » Devel » [RFC][PATCH 0/15] Pid namespaces
[RFC][PATCH 0/15] Pid namespaces [message #15293] Thu, 26 July 2007 14:45 Go to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
(Comment is taken from Sukadev's patchset-v3)

A pid namespace is a "view" of a particular set of tasks on the system.
They work in a similar way to filesystem namespaces. A file (or a process)
can be accessed in multiple namespaces, but it may have a different name
in each. In a filesystem, this name might be /etc/passwd in one namespace,
but /chroot/etc/passwd in another.

For processes, a process may have pid 1234 in one namespace, but be pid 1
in another. This allows new pid namespaces to have basically arbitrary
pids, and not have to worry about what pids exist in other namespaces.
This is essential for checkpoint/restart where a restarted process's pid
might collide with an existing process on the system's pid.

In this particular implementation, pid namespaces have a parent-child
relationship, just like processes. A process in a pid namespace may see
all of the processes in the same namespace, as well as all of the processes
in all of the namespaces which are children of its namespace. Processes may
not, however, see others which are in their parent's namespace, but not in
their own. The same goes for sibling namespaces.

This set is based on my patches, I sent before, but it includes some comments
and patches that I received from Sukadev. Sukadev, please, add your Acked-by,
Signed-off-by or From, to patches you want (everybody is also welcome :) ).

The set is based on 2.6.23-rc1-mm1, which already has some preparation patches
for pid namespaces. After the review and fixing all the comments, this set
will be benchmarked and sent to Andrew for inclusion in -mm tree.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
[PATCH 1/15] Move exit_task_namespaces() [message #15295 is a reply to message #15293] Thu, 26 July 2007 14:46 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Make task release its namespaces after it has reparented all his
children to child_reaper, but before it notifies its parent about
its death.

The reason to release namespaces after reparenting is that when task
exits it may send a signal to its parent (SIGCHLD), but if the parent
has already exited its namespaces there will be no way to decide what
pid to dever to him - parent can be from different namespace.

The reason to release namespace before notifying the parent it that
when task sends a SIGCHLD to parent it can call wait() on this taks
and release it. But releasing the mnt namespace implies dropping
of all the mounts in the mnt namespace and NFS expects the task to
have valid sighand pointer.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---

exit.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletion(-)

diff -upr linux-2.6.23-rc1-mm1.orig/kernel/exit.c linux-2.6.23-rc1-mm1-7/kernel/exit.c
--- linux-2.6.23-rc1-mm1.orig/kernel/exit.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/kernel/exit.c 2007-07-26 16:36:37.000000000 +0400
@@ -788,6 +804,10 @@ static void exit_notify(struct task_stru
BUG_ON(!list_empty(&tsk->children));
BUG_ON(!list_empty(&tsk->ptrace_children));

+ write_unlock_irq(&tasklist_lock);
+ exit_task_namespaces(tsk);
+ write_lock_irq(&tasklist_lock);
+
/*
* Check to see if any process groups have become orphaned
* as a result of our exiting, and if they have any stopped
@@ -999,7 +1021,6 @@ fastcall NORET_TYPE void do_exit(long co

tsk->exit_code = code;
proc_exit_connector(tsk);
- exit_task_namespaces(tsk);
exit_notify(tsk);
#ifdef CONFIG_NUMA
mpol_free(tsk->mempolicy);
[PATCH 2/15] Introduce MS_KERNMOUNT flag [message #15296 is a reply to message #15293] Thu, 26 July 2007 14:47 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
This flag tells the .get_sb callback that this is a kern_mount() call
so that it can trust *data pointer to be valid in-kernel one.

Running a few steps forward - this will be needed for proc to create the
superblock and store a valid pid namespace on it during the namespace
creation. The reason, why the namespace cannot live without proc mount
is described in the appropriate patch.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---

fs/namespace.c | 3 ++-
fs/super.c | 6 +++---
include/linux/fs.h | 4 +++-
3 files changed, 8 insertions(+), 5 deletions(-)

diff -upr linux-2.6.23-rc1-mm1.orig/fs/namespace.c linux-2.6.23-rc1-mm1-7/fs/namespace.c
--- linux-2.6.23-rc1-mm1.orig/fs/namespace.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/fs/namespace.c 2007-07-26 16:36:36.000000000 +0400
@@ -1579,7 +1579,8 @@ long do_mount(char *dev_name, char *dir_
mnt_flags |= MNT_NOMNT;

flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
- MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_NOMNT);
+ MS_NOATIME | MS_NODIRATIME | MS_RELATIME |
+ MS_NOMNT | MS_KERNMOUNT);

/* ... and get the mountpoint */
retval = path_lookup(dir_name, LOOKUP_FOLLOW, &nd);
diff -upr linux-2.6.23-rc1-mm1.orig/fs/super.c linux-2.6.23-rc1-mm1-7/fs/super.c
--- linux-2.6.23-rc1-mm1.orig/fs/super.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/fs/super.c 2007-07-26 16:36:36.000000000 +0400
@@ -944,9 +944,9 @@ do_kern_mount(const char *fstype, int fl
return mnt;
}

-struct vfsmount *kern_mount(struct file_system_type *type)
+struct vfsmount *kern_mount_data(struct file_system_type *type, void *data)
{
- return vfs_kern_mount(type, 0, type->name, NULL);
+ return vfs_kern_mount(type, MS_KERNMOUNT, type->name, data);
}

-EXPORT_SYMBOL(kern_mount);
+EXPORT_SYMBOL_GPL(kern_mount_data);
diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/fs.h linux-2.6.23-rc1-mm1-7/include/linux/fs.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/fs.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/fs.h 2007-07-26 16:36:36.000000000 +0400
@@ -129,6 +129,7 @@ extern int dir_notify_enable;
#define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
#define MS_SETUSER (1<<23) /* set mnt_uid to current user */
#define MS_NOMNT (1<<24) /* don't allow unprivileged submounts */
+#define MS_KERNMOUNT (1<<25) /* this is a kern_mount call */
#define MS_ACTIVE (1<<30)
#define MS_NOUSER (1<<31)

@@ -1459,7 +1460,8 @@ void unnamed_dev_init(void);

extern int register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);
-extern struct vfsmount *kern_mount(struct file_system_type *);
+extern struct vfsmount *kern_mount_data(struct file_system_type *, void *data);
+#define kern_mount(type) kern_mount_data(type, NULL)
extern int may_umount_tree(struct vfsmount *);
extern int may_umount(struct vfsmount *);
extern void umount_tree(struct vfsmount *, int, struct list_head *);
[PATCH 3/15] kern_siginfo helper [message #15297 is a reply to message #15293] Thu, 26 July 2007 14:48 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
From: Sukadev Bhattiprolu <sukadev@us.ibm.com>

get_signal_to_deliver() checks to ensure that /sbin/init does not
receive any unwanted signals. With implementation of multiple pid
namespaces, we need to extend this check for all "container-init"
processes (processes with pid == 1 in the pid namespace)

IOW, A container-init process, must not receive unwanted signals from
within the container. However, the container-init must receive and
honor any signals it receives from an ancestor pid namespace (i.e it
must appear like a normal process in its parent namespace).

i.e for correct processing of the signal, either:

- the recepient of the signal (in get_signal_to_deliver()) must
have some information (such as pid namespace) of the sender

- or the sender of the signal (in send_signal()) should know
whether the signal is an "unwanted" signal from the recepient's
POV.

This patch provides a simple mechanism, to pass additional information from
the sender to the recepient (from send_signal() to get_signal_to_deliver()).

This patch is just a helper and a follow-on patch will use this infrastructure
to actually pass in information about the sender.

Implementation note:

Most signal interfaces in the kernel operate on 'siginfo_t' data
structure which is user-visible and cannot be easily modified/
extended. So this patch defines a wrapper, 'struct kern_siginfo',
around the 'siginfo_t' and allows extending this wrapper for future
needs.

TODO: This is more an exploratory patch and modifies only interfaces
necessary to implement correct signal semantics in pid namespaces.

If the approach is feasible, we could consistently use 'kern_siginfo'
in other signal interfaces and possibly in 'struct sigqueue'.

We could modify dequeue_signal() to directly work with 'kern_siginfo'
and remove dequeue_signal_kern_info().

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>

---

include/linux/signal.h | 7 +++++++
kernel/signal.c | 38 +++++++++++++++++++++++++++++++-------
2 files changed, 38 insertions(+), 7 deletions(-)

diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/signal.h linux-2.6.23-rc1-mm1-7/include/linux/signal.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/signal.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/signal.h 2007-07-26 16:36:37.000000000 +0400
@@ -7,6 +7,13 @@
#ifdef __KERNEL__
#include <linux/list.h>

+struct kern_siginfo {
+ siginfo_t *info;
+ int flags;
+};
+
+#define KERN_SIGINFO_CINIT 0x1 /* True iff signalling container-init */
+
/*
* Real Time signals may be queued.
*/
diff -upr linux-2.6.23-rc1-mm1.orig/kernel/signal.c linux-2.6.23-rc1-mm1-7/kernel/signal.c
--- linux-2.6.23-rc1-mm1.orig/kernel/signal.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/kernel/signal.c 2007-07-26 16:36:37.000000000 +0400
@@ -299,10 +299,12 @@ unblock_all_signals(void)
spin_unlock_irqrestore(&current->sighand->siglock, flags);
}

-static int collect_signal(int sig, struct sigpending *list, siginfo_t *info)
+static int collect_signal(int sig, struct sigpending *list,
+ struct kern_siginfo *kinfo)
{
struct sigqueue *q, *first = NULL;
int still_pending = 0;
+ siginfo_t *info = kinfo->info;

if (unlikely(!sigismember(&list->signal, sig)))
return 0;
@@ -343,7 +348,7 @@ static int collect_signal(int sig, struc
}

static int __dequeue_signal(struct sigpending *pending, sigset_t *mask,
- siginfo_t *info)
+ struct kern_siginfo *kinfo)
{
int sig = next_signal(pending, mask);

@@ -357,7 +364,7 @@ static int __dequeue_signal(struct sigpe
}
}

- if (!collect_signal(sig, pending, info))
+ if (!collect_signal(sig, pending, kinfo))
sig = 0;
}

@@ -370,18 +377,20 @@ static int __dequeue_signal(struct sigpe
*
* All callers have to hold the siglock.
*/
-int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
+int dequeue_signal_kern_info(struct task_struct *tsk, sigset_t *mask,
+ struct kern_siginfo *kinfo)
{
int signr = 0;
+ siginfo_t *info = kinfo->info;

/* We only dequeue private signals from ourselves, we don't let
* signalfd steal them
*/
if (tsk == current)
- signr = __dequeue_signal(&tsk->pending, mask, info);
+ signr = __dequeue_signal(&tsk->pending, mask, kinfo);
if (!signr) {
signr = __dequeue_signal(&tsk->signal->shared_pending,
- mask, info);
+ mask, kinfo);
/*
* itimer signal ?
*
@@ -441,6 +450,22 @@ int dequeue_signal(struct task_struct *t
}

/*
+ * Dequeue a signal and return the element to the caller, which is
+ * expected to free it.
+ *
+ * All callers have to hold the siglock.
+ */
+int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
+{
+ struct kern_siginfo kinfo;
+
+ kinfo.info = info;
+ kinfo.flags = 0;
+
+ return dequeue_signal_kern_info(tsk, mask, &kinfo);
+}
+
+/*
* Tell a process that it has a new active signal..
*
* NOTE! we rely on the previous spin_lock to
@@ -1779,6 +1873,10 @@ int get_signal_to_deliver(siginfo_t *inf
{
sigset_t *mask = &current->blocked;
int signr = 0;
+ struct kern_siginfo kinfo;
+
+ kinfo.info = info;
+ kinfo.flags = 0;

try_to_freeze();

@@ -1791,7 +1889,7 @@ relock:
handle_group_stop())
goto relock;

- signr = dequeue_signal(current, mask, info);
+ signr = dequeue_signal_kern_info(current, mask, &kinfo);

if (!signr)
break; /* will return 0 */
[PATCH 4/15] Make proc_flust_task() flush entries from multiple proc trees [message #15298 is a reply to message #15293] Thu, 26 July 2007 14:48 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Since a task will appear in more than one proc tree we need to shrink many
trees. For this case we pass the struct pid to proc_flush_task() and shrink
the mounts of all the namespaces this pid belongs to.

The NULL passed to it means that only global mount is to be flushed.
This is a preparation for a real flushing patch.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---

fs/proc/base.c | 18 +++++++++++++++---
include/linux/proc_fs.h | 6 ++++--
kernel/exit.c | 2 +-
3 files changed, 20 insertions(+), 6 deletions(-)

diff -upr linux-2.6.23-rc1-mm1.orig/fs/proc/base.c linux-2.6.23-rc1-mm1-7/fs/proc/base.c
--- linux-2.6.23-rc1-mm1.orig/fs/proc/base.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/fs/proc/base.c 2007-07-26 16:36:37.000000000 +0400
@@ -74,6 +74,7 @@
#include <linux/nsproxy.h>
#include <linux/oom.h>
#include <linux/elf.h>
+#include <linux/pid_namespace.h>
#include "internal.h"

/* NOTE:
@@ -2115,7 +2116,7 @@ static const struct inode_operations pro
* that no dcache entries will exist at process exit time it
* just makes it very unlikely that any will persist.
*/
-void proc_flush_task(struct task_struct *task)
+static void proc_flush_task_mnt(struct task_struct *task, struct vfsmount *mnt)
{
struct dentry *dentry, *leader, *dir;
char buf[PROC_NUMBUF];
@@ -2123,7 +2124,7 @@ void proc_flush_task(struct task_struct

name.name = buf;
name.len = snprintf(buf, sizeof(buf), "%d", task->pid);
- dentry = d_hash_and_lookup(proc_mnt->mnt_root, &name);
+ dentry = d_hash_and_lookup(mnt->mnt_root, &name);
if (dentry) {
shrink_dcache_parent(dentry);
d_drop(dentry);
@@ -2135,7 +2136,7 @@ void proc_flush_task(struct task_struct

name.name = buf;
name.len = snprintf(buf, sizeof(buf), "%d", task->tgid);
- leader = d_hash_and_lookup(proc_mnt->mnt_root, &name);
+ leader = d_hash_and_lookup(mnt->mnt_root, &name);
if (!leader)
goto out;

@@ -2161,6 +2162,17 @@ out:
return;
}

+/*
+ * when flushing dentries from proc one need to flush them from global
+ * proc (proc_mnt) and from all the namespaces' procs this task was seen
+ * in. this call is supposed to make all this job.
+ */
+
+void proc_flush_task(struct task_struct *task, struct pid *pid)
+{
+ proc_flush_task_mnt(task, proc_mnt);
+}
+
static struct dentry *proc_pid_instantiate(struct inode *dir,
struct dentry * dentry,
struct task_struct *task, const void *ptr)
diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/proc_fs.h linux-2.6.23-rc1-mm1-7/include/linux/proc_fs.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/proc_fs.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/proc_fs.h 2007-07-26 16:36:36.000000000 +0400
@@ -111,7 +111,7 @@ extern void proc_misc_init(void);

struct mm_struct;

-void proc_flush_task(struct task_struct *task);
+void proc_flush_task(struct task_struct *task, struct pid *pid);
struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *);
int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir);
unsigned long task_vsize(struct mm_struct *);
@@ -223,7 +227,9 @@ static inline void proc_net_remove(const
#define proc_net_create(name, mode, info) ({ (void)(mode), NULL; })
static inline void proc_net_remove(const char *name) {}

-static inline void proc_flush_task(struct task_struct *task) { }
+static inline void proc_flush_task(struct task_struct *task, struct pid *pid)
+{
+}

static inline struct proc_dir_entry *create_proc_entry(const char *name,
mode_t mode, struct proc_dir_entry *parent) { return NULL; }
diff -upr linux-2.6.23-rc1-mm1.orig/kernel/exit.c linux-2.6.23-rc1-mm1-7/kernel/exit.c
--- linux-2.6.23-rc1-mm1.orig/kernel/exit.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/kernel/exit.c 2007-07-26 16:36:37.000000000 +0400
@@ -184,7 +199,7 @@ repeat:
}

write_unlock_irq(&tasklist_lock);
- proc_flush_task(p);
+ proc_flush_task(p, NULL);
release_thread(p);
call_rcu(&p->rcu, delayed_put_task_struct);
[PATCH 5/15] Introduce struct upid [message #15299 is a reply to message #15293] Thu, 26 July 2007 14:49 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
From: Sukadev Bhattiprolu <sukadev@us.ibm.com>

Since task will be visible from different pid namespaces each of them
have to be addressed by multiple pids. struct upid is to store the
information about which id refers to which namespace.

The constuciton looks like this. Each struct pid carried the reference
counter and the list of tasks attached to this pid. At its end it has
a variable length array of struct upid-s. Each struct upid has a numerical
id (pid itself), pointer to the namespace, this ID is valid in and is
hashed into a pid_hash for searching the pids.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---

init_task.h | 9 ++++++---
pid.h | 12 +++++++++---
2 files changed, 15 insertions(+), 6 deletions(-)

diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/init_task.h linux-2.6.23-rc1-mm1-7/include/linux/init_task.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/init_task.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/init_task.h 2007-07-26 16:36:36.000000000 +0400
@@ -93,15 +93,18 @@ extern struct group_info init_groups;

#define INIT_STRUCT_PID { \
.count = ATOMIC_INIT(1), \
- .nr = 0, \
- /* Don't put this struct pid in pid_hash */ \
- .pid_chain = { .next = NULL, .pprev = NULL }, \
.tasks = { \
{ .first = &init_task.pids[PIDTYPE_PID].node }, \
{ .first = &init_task.pids[PIDTYPE_PGID].node }, \
{ .first = &init_task.pids[PIDTYPE_SID].node }, \
}, \
.rcu = RCU_HEAD_INIT, \
+ .level = 0, \
+ .numbers = { { \
+ .nr = 0, \
+ .ns = &init_pid_ns, \
+ .pid_chain = { .next = NULL, .pprev = NULL }, \
+ }, } \
}

#define INIT_PID_LINK(type) \
diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/pid.h linux-2.6.23-rc1-mm1-7/include/linux/pid.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/pid.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/pid.h 2007-07-26 16:36:37.000000000 +0400
@@ -40,15 +40,21 @@ enum pid_type
* processes.
*/

-struct pid
-{
- atomic_t count;
+struct upid {
/* Try to keep pid_chain in the same cacheline as nr for find_pid */
int nr;
+ struct pid_namespace *ns;
struct hlist_node pid_chain;
+};
+
+struct pid
+{
+ atomic_t count;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
struct rcu_head rcu;
+ int level;
+ struct upid numbers[1];
};

extern struct pid init_struct_pid;
[PATCH 6/15] Make alloc_pid(), free_pid() and put_pid() work with struct upid [message #15300 is a reply to message #15293] Thu, 26 July 2007 14:50 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Each struct upid element of struct pid has to be initialized properly,
i.e. its nr mst be allocated from appropriate pidmap and it must be
inserted into the hash.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---

include/linux/pid.h | 2 +-
kernel/pid.c | 52 +++++++++++++++++++++++++++++++++++++---------------
2 files changed, 38 insertions(+), 16 deletions(-)

diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/pid.h linux-2.6.23-rc1-mm1-7/include/linux/pid.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/pid.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/pid.h 2007-07-26 16:36:37.000000000 +0400
@@ -83,7 +92,7 @@ extern void FASTCALL(detach_pid(struct t
extern struct pid *find_get_pid(int nr);
extern struct pid *find_ge_pid(int nr);

-extern struct pid *alloc_pid(void);
+extern struct pid *alloc_pid(struct pid_namespace *ns);
extern void FASTCALL(free_pid(struct pid *pid));

static inline pid_t pid_nr(struct pid *pid)
diff -upr linux-2.6.23-rc1-mm1.orig/kernel/pid.c linux-2.6.23-rc1-mm1-7/kernel/pid.c
--- linux-2.6.23-rc1-mm1.orig/kernel/pid.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/kernel/pid.c 2007-07-26 16:36:37.000000000 +0400
@@ -28,7 +28,8 @@
#include <linux/pid_namespace.h>
#include <linux/init_task.h>

-#define pid_hashfn(nr) hash_long((unsigned long)nr, pidhash_shift)
+#define pid_hashfn(nr, ns) \
+ hash_long((unsigned long)nr + (unsigned long)ns, pidhash_shift)
static struct hlist_head *pid_hash;
static int pidhash_shift;
struct pid init_struct_pid = INIT_STRUCT_PID;
@@ -187,11 +202,13 @@ fastcall void put_pid(struct pid *pid)
if (!pid)
return;

- /* FIXME - this must be the namespace this pid lives in */
- ns = &init_pid_ns;
+ ns = pid->numbers[pid->level].ns;
if ((atomic_read(&pid->count) == 1) ||
- atomic_dec_and_test(&pid->count))
+ atomic_dec_and_test(&pid->count)) {
kmem_cache_free(ns->pid_cachep, pid);
+ if (ns != &init_pid_ns)
+ put_pid_ns(ns);
+ }
}
EXPORT_SYMBOL_GPL(put_pid);

@@ -204,45 +221,64 @@ static void delayed_put_pid(struct rcu_h
fastcall void free_pid(struct pid *pid)
{
/* We can be called with write_lock_irq(&tasklist_lock) held */
+ int i;
unsigned long flags;

spin_lock_irqsave(&pidmap_lock, flags);
- hlist_del_rcu(&pid->pid_chain);
+ for (i = 0; i <= pid->level; i++)
+ hlist_del_rcu(&pid->numbers[i].pid_chain);
spin_unlock_irqrestore(&pidmap_lock, flags);

- free_pidmap(&init_pid_ns, pid->nr);
+ for (i = 0; i <= pid->level; i++)
+ free_pidmap(pid->numbers[i].ns, pid->numbers[i].nr);
+
call_rcu(&pid->rcu, delayed_put_pid);
}

-struct pid *alloc_pid(void)
+struct pid *alloc_pid(struct pid_namespace *ns)
{
struct pid *pid;
enum pid_type type;
- int nr = -1;
- struct pid_namespace *ns;
+ int i, nr;
+ struct pid_namespace *tmp;

- ns = task_active_pid_ns(current);
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
if (!pid)
goto out;

- nr = alloc_pidmap(ns);
- if (nr < 0)
- goto out_free;
+ tmp = ns;
+ for (i = ns->level; i >= 0; i--) {
+ nr = alloc_pidmap(tmp);
+ if (nr < 0)
+ goto out_free;
+
+ pid->numbers[i].nr = nr;
+ pid->numbers[i].ns = tmp;
+ tmp = tmp->parent;
+ }

+ if (ns != &init_pid_ns)
+ get_pid_ns(ns);
+
+ pid->level = ns->level;
atomic_set(&pid->count, 1);
- pid->nr = nr;
for (type = 0; type < PIDTYPE_MAX; ++type)
INIT_HLIST_HEAD(&pid->tasks[type]);

spin_lock_irq(&pidmap_lock);
- hlist_add_head_rcu(&pid->pid_chain, &pid_hash[pid_hashfn(pid->nr)]);
+ for (i = pid->level; i >= 0; i--)
+ hlist_add_head_rcu(&pid->numbers[i].pid_chain,
+ &pid_hash[pid_hashfn(pid->numbers[i].nr,
+ pid->numbers[i].ns)]);
spin_unlock_irq(&pidmap_lock);

out:
return pid;

out_free:
+ for (i++; i <= ns->level; i++)
+ free_pidmap(pid->numbers[i].ns, pid->numbers[i].nr);
+
kmem_cache_free(ns->pid_cachep, pid);
pid = NULL;
goto out;
[PATCH 7/15] Helpers to obtain pid numbers [message #15301 is a reply to message #15293] Thu, 26 July 2007 14:51 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
When showing pid to user or getting the pid numerical id for in-kernel
use the value of this id may differ depending on the namespace.

This set of helpers is used to get the global pid nr, the virtual (i.e.
seen by task in its namespace) nr and the nr as it is seen from the
specified namespace.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---

include/linux/pid.h | 24 ++++++++++-
include/linux/sched.h | 108 +++++++++++++++++++++++++++++++++++++++++++++-----
kernel/pid.c | 8 +++
3 files changed, 129 insertions(+), 11 deletions(-)

diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/pid.h linux-2.6.23-rc1-mm1-7/include/linux/pid.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/pid.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/pid.h 2007-07-26 16:36:37.000000000 +0400
@@ -83,12 +92,34 @@ extern void FASTCALL(detach_pid(struct t

extern struct pid *alloc_pid(struct pid_namespace *ns);
extern void FASTCALL(free_pid(struct pid *pid));
+
+/*
+ * the helpers to get the pid's id seen from different namespaces
+ *
+ * pid_nr() : global id, i.e. the id seen from the init namespace;
+ * pid_vnr() : virtual id, i.e. the id seen from the namespace this pid
+ * belongs to. this only makes sence when called in the
+ * context of the task that belongs to the same namespace;
+ * pid_nr_ns() : id seen from the ns specified.
+ *
+ * see also task_xid_nr() etc in include/linux/sched.h
+ */

static inline pid_t pid_nr(struct pid *pid)
{
pid_t nr = 0;
if (pid)
- nr = pid->nr;
+ nr = pid->numbers[0].nr;
+ return nr;
+}
+
+pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns);
+
+static inline pid_t pid_vnr(struct pid *pid)
+{
+ pid_t nr = 0;
+ if (pid)
+ nr = pid->numbers[pid->level].nr;
return nr;
}

diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/sched.h linux-2.6.23-rc1-mm1-7/include/linux/sched.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/sched.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/sched.h 2007-07-26 16:36:37.000000000 +0400
@@ -1221,16 +1222,6 @@ static inline int rt_task(struct task_st
return rt_prio(p->prio);
}

-static inline pid_t task_pgrp_nr(struct task_struct *tsk)
-{
- return tsk->signal->pgrp;
-}
-
-static inline pid_t task_session_nr(struct task_struct *tsk)
-{
- return tsk->signal->__session;
-}
-
static inline void set_task_session(struct task_struct *tsk, pid_t session)
{
tsk->signal->__session = session;
@@ -1256,6 +1247,104 @@ static inline struct pid *task_session(s
return task->group_leader->pids[PIDTYPE_SID].pid;
}

+struct pid_namespace;
+
+/*
+ * the helpers to get the task's different pids as they are seen
+ * from various namespaces
+ *
+ * task_xid_nr() : global id, i.e. the id seen from the init namespace;
+ * task_xid_vnr() : virtual id, i.e. the id seen from the namespace the task
+ * belongs to. this only makes sence when called in the
+ * context of the task that belongs to the same namespace;
+ * task_xid_nr_ns() : id seen from the ns specified;
+ *
+ * set_task_vxid() : assigns a virtual id to a task;
+ *
+ * task_ppid_nr_ns() : the parent's id as seen from the namespace specified.
+ * the result depends on the namespace and whether the
+ * task in question is the namespace's init. e.g. for the
+ * namespace's init this will return 0 when called from
+ * the namespace of this init, or appropriate id otherwise.
+ *
+ *
+ * see also pid_nr() etc in include/linux/pid.h
+ */
+
+static inline pid_t task_pid_nr(struct task_struct *tsk)
+{
+ return tsk->pid;
+}
+
+static inline pid_t task_pid_nr_ns(struct task_struct *tsk,
+ struct pid_namespace *ns)
+{
+ return pid_nr_ns(task_pid(tsk), ns);
+}
+
+static inline pid_t task_pid_vnr(struct task_struct *tsk)
+{
+ return pid_vnr(task_pid(tsk));
+}
+
+
+static inline pid_t task_tgid_nr(struct task_struct *tsk)
+{
+ return tsk->tgid;
+}
+
+static inline pid_t task_tgid_nr_ns(struct task_struct *tsk,
+ struct pid_namespace *ns)
+{
+ return pid_nr_ns(task_tgid(tsk), ns);
+}
+
+static inline pid_t task_tgid_vnr(struct task_struct *tsk)
+{
+ return pid_vnr(task_tgid(tsk));
+}
+
+
+static inline pid_t task_pgrp_nr(struct task_struct *tsk)
+{
+ return tsk->signal->pgrp;
+}
+
+static inline pid_t task_pgrp_nr_ns(struct task_struct *tsk,
+ struct pid_namespace *ns)
+{
+ return pid_nr_ns(task_pgrp(tsk), ns);
+}
+
+static inline pid_t task_pgrp_vnr(struct task_struct *tsk)
+{
+ return pid_vnr(task_pgrp(tsk));
+}
+
+
+static inline pid_t task_session_nr(struct task_struct *tsk)
+{
+ return tsk->signal->__session;
+}
+
+static inline pid_t task_session_nr_ns(struct task_struct *tsk,
+ struct pid_namespace *ns)
+{
+ return pid_nr_ns(task_session(tsk), ns);
+}
+
+static inline pid_t task_session_vnr(struct task_struct *tsk)
+{
+ return pid_vnr(task_session(tsk));
+}
+
+
+static inline pid_t task_ppid_nr_ns(struct task_struct *tsk,
+ struct pid_namespace *ns)
+{
+ return pid_nr_ns(task_pid(rcu_dereference(tsk->real_parent)), ns);
+}
+
/**
* pid_alive - check that a task structure is not stale
* @p: Task structure to be checked.
diff -upr linux-2.6.23-rc1-mm1.orig/kernel/pid.c linux-2.6.23-rc1-mm1-7/kernel/pid.c
--- linux-2.6.23-rc1-mm1.orig/kernel/pid.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/kernel/pid.c 2007-07-26 16:36:37.000000000 +0400
@@ -352,6 +436,14 @@ struct pid *find_get_pid(pid_t nr)
return pid;
}

+pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
+{
+ pid_t nr = 0;
+ if (pid && ns->level <= pid->level)
+ nr = pid->numbers[ns->level].nr;
+ return nr;
+}
+
/*
* Used by proc to find the first pid that is greater then or equal to nr.
*
[PATCH 8/15] Helpers to find the task by its numerical ids [message #15302 is a reply to message #15293] Thu, 26 July 2007 14:51 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
When searching the task by numerical id on may need to find
it using global pid (as it is done now in kernel) or by its
virtual id, e.g. when sending a signal to a task from one
namespace the sender will specify the task's virtual id.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---

include/linux/pid.h | 16 ++++++++++++++--
include/linux/sched.h | 31 +++++++++++++++++++++++++++++--
kernel/pid.c | 32 +++++++++++++++++---------------
3 files changed, 60 insertions(+), 19 deletions(-)

diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/pid.h linux-2.6.23-rc1-mm1-7/include/linux/pid.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/pid.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/pid.h 2007-07-26 16:36:37.000000000 +0400
@@ -83,17 +92,29 @@ extern void FASTCALL(detach_pid(struct t
extern void FASTCALL(transfer_pid(struct task_struct *old,
struct task_struct *new, enum pid_type));

+struct pid_namespace;
+extern struct pid_namespace init_pid_ns;
+
/*
* look up a PID in the hash table. Must be called with the tasklist_lock
* or rcu_read_lock() held.
+ *
+ * find_pid_ns() finds the pid in the namespace specified
+ * find_pid() find the pid by its global id, i.e. in the init namespace
+ * find_vpid() finr the pid by its virtual id, i.e. in the current namespace
+ *
+ * see also find_task_by_pid() set in include/linux/sched.h
*/
-extern struct pid *FASTCALL(find_pid(int nr));
+extern struct pid *FASTCALL(find_pid_ns(int nr, struct pid_namespace *ns));
+
+#define find_vpid(pid) find_pid_ns(pid, current->nsproxy->pid_ns)
+#define find_pid(pid) find_pid_ns(pid, &init_pid_ns)

/*
* Lookup a PID in the hash table, and return with it's count elevated.
*/
extern struct pid *find_get_pid(int nr);
-extern struct pid *find_ge_pid(int nr);
+extern struct pid *find_ge_pid(int nr, struct pid_namespace *);

extern struct pid *alloc_pid(struct pid_namespace *ns);
extern void FASTCALL(free_pid(struct pid *pid));
diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/sched.h linux-2.6.23-rc1-mm1-7/include/linux/sched.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/sched.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/sched.h 2007-07-26 16:36:37.000000000 +0400
@@ -1453,8 +1542,35 @@ extern struct task_struct init_task;

extern struct mm_struct init_mm;

-#define find_task_by_pid(nr) find_task_by_pid_type(PIDTYPE_PID, nr)
-extern struct task_struct *find_task_by_pid_type(int type, int pid);
+extern struct pid_namespace init_pid_ns;
+
+/*
+ * find a task by one of its numerical ids
+ *
+ * find_task_by_pid_type_ns():
+ * it is the most generic call - it finds a task by all id,
+ * type and namespace specified
+ * find_task_by_pid_ns():
+ * finds a task by its pid in the specified namespace
+ * find_task_by_pid_type():
+ * finds a task by its global id with the specified type, e.g.
+ * by global session id
+ * find_task_by_pid():
+ * finds a task by its global pid
+ *
+ * see also find_pid() etc in include/linux/pid.h
+ */
+
+extern struct task_struct *find_task_by_pid_type_ns(int type, int pid,
+ struct pid_namespace *ns);
+
+#define find_task_by_pid_ns(nr, ns) \
+ find_task_by_pid_type_ns(PIDTYPE_PID, nr, ns)
+#define find_task_by_pid_type(type, nr) \
+ find_task_by_pid_type_ns(type, nr, &init_pid_ns)
+#define find_task_by_pid(nr) \
+ find_task_by_pid_type(PIDTYPE_PID, nr)
+
extern void __set_special_pids(pid_t session, pid_t pgrp);

/* per-UID process charging. */
diff -upr linux-2.6.23-rc1-mm1.orig/kernel/pid.c linux-2.6.23-rc1-mm1-7/kernel/pid.c
--- linux-2.6.23-rc1-mm1.orig/kernel/pid.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/kernel/pid.c 2007-07-26 16:36:37.000000000 +0400
@@ -204,19 +221,20 @@ static void delayed_put_pid(struct rcu_h
goto out;
}

-struct pid * fastcall find_pid(int nr)
+struct pid * fastcall find_pid_ns(int nr, struct pid_namespace *ns)
{
struct hlist_node *elem;
- struct pid *pid;
+ struct upid *pnr;
+
+ hlist_for_each_entry_rcu(pnr, elem,
+ &pid_hash[pid_hashfn(nr, ns)], pid_chain)
+ if (pnr->nr == nr && pnr->ns == ns)
+ return container_of(pnr, struct pid,
+ numbers[ns->level]);

- hlist_for_each_entry_rcu(pid, elem,
- &pid_hash[pid_hashfn(nr)], pid_chain) {
- if (pid->nr == nr)
- return pid;
- }
return NULL;
}
-EXPORT_SYMBOL_GPL(find_pid);
+EXPORT_SYMBOL_GPL(find_pid_ns);

/*
* attach_pid() must be called with the tasklist_lock write-held.
@@ -318,12 +355,13 @@ struct task_struct * fastcall pid_task(s
/*
* Must be called under rcu_read_lock() or with tasklist_lock read-held.
*/
-struct task_struct *find_task_by_pid_type(int type, int nr)
+struct task_struct *find_task_by_pid_type_ns(int type, int nr,
+ struct pid_namespace *ns)
{
- return pid_task(find_pid(nr), type);
+ return pid_task(find_pid_ns(nr, ns), type);
}

-EXPORT_SYMBOL(find_task_by_pid_type);
+EXPORT_SYMBOL(find_task_by_pid_type_ns);

struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
{
@@ -342,7 +426,7 @@ struct pid *find_get_pid(pid_t nr)
struct pid *pid;

rcu_read_lock();
- pid = get_pid(find_pid(nr));
+ pid = get_pid(find_vpid(nr));
rcu_read_unlock();

return pid;
@@ -352,15 +436,15 @@ struct pid *find_get_pid(pid_t nr)
*
* If there is a pid at nr this function is exactly the same as find_pid.
*/
-struct pid *find_ge_pid(int nr)
+struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
{
struct pid *pid;

do {
- pid = find_pid(nr);
+ pid = find_pid_ns(nr, ns);
if (pid)
break;
- nr = next_pidmap(task_active_pid_ns(current), nr);
+ nr = next_pidmap(ns, nr);
} while (nr > 0);

return pid;
[PATCH 9/15] Move alloc_pid() after the namespace is cloned [message #15303 is a reply to message #15293] Thu, 26 July 2007 14:52 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
When we create new namespace we will need to allocate the struct pid,
that will have one extra struct upid in array, comparing to the parent.
Thus we need to know the new namespace (if any) in alloc_pid() to init
this struct upid properly, so move the alloc_pid() call lower in
copy_process().

This is a fix for Sukadev's patch that moved the alloc_pid() call from
do_fork() into copy_process().

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---

fork.c | 59 ++++++++++++++++++++++++++++++++++++++---------------------
1 files changed, 38 insertions(+), 21 deletions(-)

diff -upr linux-2.6.23-rc1-mm1.orig/kernel/fork.c linux-2.6.23-rc1-mm1-7/kernel/fork.c
--- linux-2.6.23-rc1-mm1.orig/kernel/fork.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/kernel/fork.c 2007-07-26 16:36:37.000000000 +0400
@@ -1028,16 +1029,9 @@ static struct task_struct *copy_process(
if (p->binfmt && !try_module_get(p->binfmt->module))
goto bad_fork_cleanup_put_domain;

- if (pid != &init_struct_pid) {
- pid = alloc_pid();
- if (!pid)
- goto bad_fork_put_binfmt_module;
- }
-
p->did_exec = 0;
delayacct_tsk_init(p); /* Must remain after dup_task_struct() */
copy_flags(clone_flags, p);
- p->pid = pid_nr(pid);
INIT_LIST_HEAD(&p->children);
INIT_LIST_HEAD(&p->sibling);
p->vfork_done = NULL;
@@ -1112,10 +1106,6 @@ static struct task_struct *copy_process(
p->blocked_on = NULL; /* not blocked yet */
#endif

- p->tgid = p->pid;
- if (clone_flags & CLONE_THREAD)
- p->tgid = current->tgid;
-
if ((retval = security_task_alloc(p)))
goto bad_fork_cleanup_policy;
if ((retval = audit_alloc(p)))
@@ -1141,6 +1131,17 @@ static struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_namespaces;

+ if (pid != &init_struct_pid) {
+ pid = alloc_pid(task_active_pid_ns(p));
+ if (!pid)
+ goto bad_fork_cleanup_namespaces;
+ }
+
+ p->pid = pid_nr(pid);
+ p->tgid = p->pid;
+ if (clone_flags & CLONE_THREAD)
+ p->tgid = current->tgid;
+
p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
/*
* Clear TID on mm_release()?
@@ -1237,7 +1242,7 @@ static struct task_struct *copy_process(
spin_unlock(&current->sighand->siglock);
write_unlock_irq(&tasklist_lock);
retval = -ERESTARTNOINTR;
- goto bad_fork_cleanup_namespaces;
+ goto bad_fork_free_pid;
}

if (clone_flags & CLONE_THREAD) {
@@ -1266,11 +1271,22 @@ static struct task_struct *copy_process(
__ptrace_link(p, current->parent);

if (thread_group_leader(p)) {
- p->signal->tty = current->signal->tty;
- p->signal->pgrp = task_pgrp_nr(current);
- set_task_session(p, task_session_nr(current));
- attach_pid(p, PIDTYPE_PGID, task_pgrp(current));
- attach_pid(p, PIDTYPE_SID, task_session(current));
+ if (clone_flags & CLONE_NEWPID) {
+ p->nsproxy->pid_ns->child_reaper = p;
+ p->signal->tty = NULL;
+ p->signal->pgrp = p->pid;
+ set_task_session(p, p->pid);
+ attach_pid(p, PIDTYPE_PGID, pid);
+ attach_pid(p, PIDTYPE_SID, pid);
+ } else {
+ p->signal->tty = current->signal->tty;
+ p->signal->pgrp = task_pgrp_nr(current);
+ set_task_session(p, task_session_nr(current));
+ attach_pid(p, PIDTYPE_PGID,
+ task_pgrp(current));
+ attach_pid(p, PIDTYPE_SID,
+ task_session(current));
+ }

list_add_tail_rcu(&p->tasks, &init_task.tasks);
__get_cpu_var(process_counts)++;
@@ -1288,6 +1304,9 @@ static struct task_struct *copy_process(
container_post_fork(p);
return p;

+bad_fork_free_pid:
+ if (pid != &init_struct_pid)
+ free_pid(pid);
bad_fork_cleanup_namespaces:
exit_task_namespaces(p);
bad_fork_cleanup_keys:
@@ -1322,9 +1341,6 @@ bad_fork_cleanup_container:
#endif
container_exit(p, container_callbacks_done);
delayacct_tsk_free(p);
- if (pid != &init_struct_pid)
- free_pid(pid);
-bad_fork_put_binfmt_module:
if (p->binfmt)
module_put(p->binfmt->module);
bad_fork_cleanup_put_domain:
@@ -1406,7 +1422,13 @@ long do_fork(unsigned long clone_flags,
if (!IS_ERR(p)) {
struct completion vfork;

- nr = pid_nr(task_pid(p));
+ /*
+ * this is enough to call pid_nr_ns here, but this if
+ * improves optimisation of regular fork()
+ */
+ nr = (clone_flags & CLONE_NEWPID) ?
+ task_pid_nr_ns(p, current->nsproxy->pid_ns) :
+ task_pid_vnr(p);

if (clone_flags & CLONE_VFORK) {
p->vfork_done = &vfork;
[PATCH 10/15] Make each namespace has its own proc tree [message #15304 is a reply to message #15293] Thu, 26 July 2007 14:54 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
The idea is the following: when a namespace is created the
new proc superblock, pointing to this namespace, is created.
When user accesses the proc tree from some pountmoint it oberves
the tasks that live in the namespace, pointer by the according
spuperblock.

When task dies, its pids must be flushed from several proc trees,
i.e. the ones that are owned by the namespaces that can see the
taks's struct pid.

Having the namespaces live without the proc mount is racy, so
during the namespace creation we kern_mount the new proc instance.

We cannot just get the reference on the struct pid_namespace from
the superblock, otherwise we'll have the ns->proc_mnt->sb->ns
loop. To break this, when the init task dies, it releases the
kern_mount-ed instance.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---

fs/proc/base.c | 12 ++++++
fs/proc/inode.c | 3 +
fs/proc/root.c | 83 +++++++++++++++++++++++++++++++++++++++---
include/linux/pid_namespace.h | 3 +
include/linux/proc_fs.h | 15 +++++++
kernel/exit.c | 18 ++++++++-
kernel/fork.c | 5 ++
kernel/pid.c | 1
8 files changed, 133 insertions(+), 7 deletions(-)

diff -upr linux-2.6.23-rc1-mm1.orig/fs/proc/base.c linux-2.6.23-rc1-mm1-7/fs/proc/base.c
--- linux-2.6.23-rc1-mm1.orig/fs/proc/base.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/fs/proc/base.c 2007-07-26 16:36:37.000000000 +0400
@@ -2161,7 +2162,19 @@ out:

void proc_flush_task(struct task_struct *task, struct pid *pid)
{
+ int i;
+ struct upid *upid;
+
proc_flush_task_mnt(task, proc_mnt);
+ if (pid == NULL)
+ return;
+
+ for (i = 1; i <= pid->level; i++)
+ proc_flush_task_mnt(task, pid->numbers[i].ns->proc_mnt);
+
+ upid = &pid->numbers[pid->level];
+ if (upid->nr == 1)
+ pid_ns_release_proc(upid->ns);
}

static struct dentry *proc_pid_instantiate(struct inode *dir,
diff -upr linux-2.6.23-rc1-mm1.orig/fs/proc/inode.c linux-2.6.23-rc1-mm1-7/fs/proc/inode.c
--- linux-2.6.23-rc1-mm1.orig/fs/proc/inode.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/fs/proc/inode.c 2007-07-26 16:36:36.000000000 +0400
@@ -16,6 +16,7 @@
#include <linux/init.h>
#include <linux/module.h>
#include <linux/smp_lock.h>
+#include <linux/pid_namespace.h>

#include <asm/system.h>
#include <asm/uaccess.h>
@@ -427,7 +428,7 @@ out_mod:
return NULL;
}

-int proc_fill_super(struct super_block *s, void *data, int silent)
+int proc_fill_super(struct super_block *s)
{
struct inode * root_inode;

diff -upr linux-2.6.23-rc1-mm1.orig/fs/proc/root.c linux-2.6.23-rc1-mm1-7/fs/proc/root.c
--- linux-2.6.23-rc1-mm1.orig/fs/proc/root.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/fs/proc/root.c 2007-07-26 16:36:37.000000000 +0400
@@ -18,32 +18,90 @@
#include <linux/bitops.h>
#include <linux/smp_lock.h>
#include <linux/mount.h>
+#include <linux/pid_namespace.h>

#include "internal.h"

struct proc_dir_entry *proc_net, *proc_net_stat, *proc_bus, *proc_root_fs, *proc_root_driver;

+static int proc_test_super(struct super_block *sb, void *data)
+{
+ return sb->s_fs_info == data;
+}
+
+static int proc_set_super(struct super_block *sb, void *data)
+{
+ struct pid_namespace *ns;
+
+ ns = (struct pid_namespace *)data;
+ sb->s_fs_info = get_pid_ns(ns);
+ return set_anon_super(sb, NULL);
+}
+
static int proc_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data, struct vfsmount *mnt)
{
+ int err;
+ struct super_block *sb;
+ struct pid_namespace *ns;
+ struct proc_inode *ei;
+
if (proc_mnt) {
/* Seed the root directory with a pid so it doesn't need
* to be special in base.c. I would do this earlier but
* the only task alive when /proc is mounted the first time
* is the init_task and it doesn't have any pids.
*/
- struct proc_inode *ei;
ei = PROC_I(proc_mnt->mnt_sb->s_root->d_inode);
if (!ei->pid)
ei->pid = find_get_pid(1);
}
- return get_sb_single(fs_type, flags, data, proc_fill_super, mnt);
+
+ if (flags & MS_KERNMOUNT)
+ ns = (struct pid_namespace *)data;
+ else
+ ns = current->nsproxy->pid_ns;
+
+ sb = sget(fs_type, proc_test_super, proc_set_super, ns);
+ if (IS_ERR(sb))
+ return PTR_ERR(sb);
+
+ if (!sb->s_root) {
+ sb->s_flags = flags;
+ err = proc_fill_super(sb);
+ if (err) {
+ up_write(&sb->s_umount);
+ deactivate_super(sb);
+ return err;
+ }
+
+ ei = PROC_I(sb->s_root->d_inode);
+ if (!ei->pid) {
+ rcu_read_lock();
+ ei->pid = get_pid(find_pid_ns(1, ns));
+ rcu_read_unlock();
+ }
+
+ sb->s_flags |= MS_ACTIVE;
+ ns->proc_mnt = mnt;
+ }
+
+ return simple_set_mnt(mnt, sb);
+}
+
+static void proc_kill_sb(struct super_block *sb)
+{
+ struct pid_namespace *ns;
+
+ ns = (struct pid_namespace *)sb->s_fs_info;
+ kill_anon_super(sb);
+ put_pid_ns(ns);
}

static struct file_system_type proc_fs_type = {
.name = "proc",
.get_sb = proc_get_sb,
- .kill_sb = kill_anon_super,
+ .kill_sb = proc_kill_sb,
};

void __init proc_root_init(void)
@@ -54,12 +112,13 @@ void __init proc_root_init(void)
err = register_filesystem(&proc_fs_type);
if (err)
return;
- proc_mnt = kern_mount(&proc_fs_type);
+ proc_mnt = kern_mount_data(&proc_fs_type, &init_pid_ns);
err = PTR_ERR(proc_mnt);
if (IS_ERR(proc_mnt)) {
unregister_filesystem(&proc_fs_type);
return;
}
+
proc_misc_init();
proc_net = proc_mkdir("net", NULL);
proc_net_stat = proc_mkdir("net/stat", NULL);
@@ -153,6 +212,22 @@ struct proc_dir_entry proc_root = {
.parent = &proc_root,
};

+int pid_ns_prepare_proc(struct pid_namespace *ns)
+{
+ struct vfsmount *mnt;
+
+ mnt = kern_mount_data(&proc_fs_type, ns);
+ if (IS_ERR(mnt))
+ return PTR_ERR(mnt);
+
+ return 0;
+}
+
+void pid_ns_release_proc(struct pid_namespace *ns)
+{
+ mntput(ns->proc_mnt);
+}
+
EXPORT_SYMBOL(proc_symlink);
EXPORT_SYMBOL(proc_mkdir);
EXPORT_SYMBOL(create_proc_entry);
diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/pid_namespace.h linux-2.6.23-rc1-mm1-7/include/linux/pid_namespace.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/pid_namespace.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/pid_namespace.h 2007-07-26 16:36:36.000000000 +0400
@@ -16,6 +15,9 @@ struct pidmap {
int last_pid;
struct task_struct *child_reaper;
struct kmem_cache *pid_cachep;
+#ifdef CONFIG_PROC_FS
+ struct vfsmount *proc_mnt;
+#endif
};

extern struct pid_namespace init_pid_ns;
diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/proc_fs.h linux-2.6.23-rc1-mm1-7/include/linux/proc_fs.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/proc_fs.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/proc_fs.h 2007-07-26 16:36:36.000000000 +0400
@@ -126,7 +126,8 @@ extern struct proc_dir_entry *create_pro
extern void remove_proc_entry(const char *name, struct proc_dir_entry *parent);

extern struct vfsmount *proc_mnt;
-extern int proc_fill_super(struct super_block *,void *,int);
+struct pid_namespace;
+extern int proc_fill_super(struct super_block *);
extern struct inode *proc_get_inode(struct super_block *, unsigned int, struct proc_dir_entry *);

/*
@@ -143,6 +144,9 @@ extern const struct file_operations proc
extern const struct file_operations proc_kmsg_operations;
extern const struct file_operations ppc_htab_operations;

+extern int pid_ns_prepare_proc(struct pid_namespace *ns);
+extern void pid_ns_release_proc(struct pid_namespace *ns);
+
/*
* proc_tty.c
*/
@@ -248,6 +254,15 @@ static inline void proc_tty_unregister_d

extern struct proc_dir_entry proc_root;

+static inline int pid_ns_prepare_proc(struct pid_namespace *ns)
+{
+ return 0;
+}
+
+static inline void pid_ns_release_proc(struct pid_namespace *ns)
+{
+}
+
#endif /* CONFIG_PROC_FS */

#if !defined(CONFIG_PROC_KCORE)
diff -upr linux-2.6.23-rc1-mm1.orig/kernel/exit.c linux-2.6.23-rc1-mm1-7/kernel/exit.c
--- linux-2.6.23-rc1-mm1.orig/kernel/exit.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/kernel/exit.c 2007-07-26 16:36:37.000000000 +0400
@@ -153,6 +153,7 @@ static void delayed_put_task_struct(stru

void release_task(struct task_struct * p)
{
+ struct pid *pid;
struct task_struct *leader;
int zap_leader;
repeat:
@@ -160,6 +161,20 @@ repeat:
write_lock_irq(&tasklist_lock);
ptrace_unlink(p);
BUG_ON(!list_empty(&p->ptrace_list) || !list_empty(&p->ptrace_children));
+ /*
+ * we have to keep this pid till proc_flush_task() to make
+ * it possible to flush all dentries holding it. pid will
+ * be put ibidem
+ *
+ * however if the pid belogs to init namespace only, we can
+ * optimize this out
+ */
+ pid = task_pid(p);
+ if (pid->level != 0)
+ get_pid(pid);
+ else
+ pid = NULL;
+
__exit_signal(p);

/*
@@ -184,7 +199,8 @@ repeat:
}

write_unlock_irq(&tasklist_lock);
- proc_flush_task(p, NULL);
+ proc_flush_task(p, pid);
+ put_pid(pid);
release_thread(p);
call_rcu(&p->rcu, delayed_put_task_struct);

diff -upr linux-2.6.23-rc1-mm1.orig/kernel/fork.c linux-2.6.23-rc1-mm1-7/kernel/fork.c
--- linux-2.6.23-rc1-mm1.orig/kernel/fork.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/kernel/fork.c 2007-07-26 16:36:37.000000000 +0400
@@ -50,6 +50,7 @@
#include <linux/taskstats_kern.h>
#include <linux/random.h>
#include <linux/tty.h>
+#include <linux/proc_fs.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -1141,6 +1131,10 @@ static struct task_struct *copy_process(
pid = alloc_pid(task_active_pid_ns(p));
if (!pid)
goto bad_fork_cleanup_namespaces;
+
+ if (clone_flags & CLONE_NEWPID)
+ if (pid_ns_prepare_proc(task_active_pid_ns(
...

[PATCH 11/15] Signal semantics [message #15305 is a reply to message #15293] Thu, 26 July 2007 14:55 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
From: Sukadev Bhattiprolu <sukadev@us.ibm.com>

With support for multiple pid namespaces, each pid namespace has a
separate child reaper and this process needs some special handling
of signals.

- The child reaper should appear like a normal process to other
processes in its ancestor namespaces and so should be killable
(or not) in the usual way.

- The child reaper should receive, from processes in it's active
and decendent namespaces, only those signals for which it has
installed a signal handler.

- System-wide signals (eg: kill signum -1) from within a child namespace
should only affect processes within that namespace and descendant
namespaces. They should not be posted to processes in ancestor or
sibling namespaces.

- If the sender of a signal does not have a pid_t in the receiver's
namespace (eg: a process in init_pid_ns sends a signal to a process
in a descendant namespace), the sender's pid should appear as 0
in the signal's siginfo structure.

- Existing rules for SIGIO delivery still apply and a process can
choose any other process in its namespace and descendant namespaces
to receive the SIGIO signal.

The following appears to be incorrect in the fcntl() man page for
F_SETOWN.

Sending a signal to the owner process (group) specified by
F_SETOWN is subject to the same permissions checks as are
described for kill(2), where the sending process is the one that
employs F_SETOWN (but see BUGS below).


Current behavior is that the SIGIO signal is delivered on behalf of
the process that caused the event (eg: made data available on the
file) and not the process that called fcntl().

To implement the above requirements, we:

- Add a check in check_kill_permission() for a process within a
namespace sending the fast-pathed, SIGKILL signal.

- We use a flag, SIGQUEUE_CINIT, to tell the container-init if
a signal posted to its queue is from a process within its own
namespace. The flag is set in send_signal() if a process attempts
to send a signal to its container-init.

The SIGQUEUE_CINIT flag is checked in collect_signal() - if
the flag is set, collect_signal() sets the KERN_SIGINFO_CINIT
flag in the kern_siginfo. The KERN_SIGINFO_CINIT flag indicates
that the sender is from within the namespace and the container-init
can choose to ignore the signal.

If the KERN_SIGINFO_CINIT flag is clear in get_signal_to_deliver(),
the signal originated from an ancestor namespace and so the
container-init honors the signal.


Note: We currently use two flags, SIGQUEUE_CINIT, KERN_SIGINFO_CINIT to
avoid modifying 'struct sigqueue'. If 'kern_siginfo' approach is
feasible, we could use 'kern_siginfo' in sigqueue and eliminate
SIGQUEUE_CINIT.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---

include/linux/pid.h | 3 ++
include/linux/signal.h | 1
kernel/pid.c | 46 +++++++++++++++++++++++++++++++++++
kernel/signal.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 112 insertions(+), 1 deletion(-)

diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/pid.h linux-2.6.23-rc1-mm1-7/include/linux/pid.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/pid.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/pid.h 2007-07-26 16:36:37.000000000 +0400
@@ -71,6 +77,9 @@ extern struct task_struct *FASTCALL(pid_
extern struct task_struct *FASTCALL(get_pid_task(struct pid *pid,
enum pid_type));

+extern int task_visible_in_pid_ns(struct task_struct *tsk,
+ struct pid_namespace *ns);
+extern int pid_ns_equal(struct task_struct *tsk);
extern struct pid *get_task_pid(struct task_struct *task, enum pid_type type);

/*
diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/signal.h linux-2.6.23-rc1-mm1-7/include/linux/signal.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/signal.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/signal.h 2007-07-26 16:36:37.000000000 +0400
@@ -20,6 +27,7 @@ struct sigqueue {

/* flags values. */
#define SIGQUEUE_PREALLOC 1
+#define SIGQUEUE_CINIT 2

struct sigpending {
struct list_head list;
diff -upr linux-2.6.23-rc1-mm1.orig/kernel/pid.c linux-2.6.23-rc1-mm1-7/kernel/pid.c
--- linux-2.6.23-rc1-mm1.orig/kernel/pid.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/kernel/pid.c 2007-07-26 16:36:37.000000000 +0400
@@ -318,6 +355,52 @@ struct task_struct * fastcall pid_task(s
}

/*
+ * Return TRUE if the task @p is visible in the pid namespace @ns
+ *
+ * Note: @p is visible in @ns if the active-pid-ns of @p is either equal to
+ * @ns or is a descendant of @ns.
+ *
+ * @p is not visible in @ns if active-pid-ns of @p is an ancestor of @ns.
+ * Eg: Processes in init-pid-ns are not visible in child pid namespaces.
+ * They should not receive any system-wide signals from a child-pid-
+ * namespace for instance.
+ */
+int task_visible_in_pid_ns(struct task_struct *p, struct pid_namespace *ns)
+{
+ int i;
+ struct pid *pid = task_pid(p);
+
+ if (!pid)
+ return 0;
+
+ for (i = 0; i <= pid->level; i++) {
+ if (pid->numbers[i].ns == ns)
+ return 1;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(task_visible_in_pid_ns);
+
+/*
+ * Return TRUE if the active pid namespace of @tsk is same as active
+ * pid namespace of 'current'
+ */
+
+static inline struct pid_namespace *pid_active_ns(struct pid *pid)
+{
+ if (pid == NULL)
+ return NULL;
+
+ return pid->numbers[pid->level].ns;
+}
+
+int pid_ns_equal(struct task_struct *tsk)
+{
+ return pid_active_ns(task_pid(tsk)) == pid_active_ns(task_pid(current));
+}
+
+/*
* Must be called under rcu_read_lock() or with tasklist_lock read-held.
*/
struct task_struct *find_task_by_pid_type_ns(int type, int nr,
diff -upr linux-2.6.23-rc1-mm1.orig/kernel/signal.c linux-2.6.23-rc1-mm1-7/kernel/signal.c
--- linux-2.6.23-rc1-mm1.orig/kernel/signal.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/kernel/signal.c 2007-07-26 16:36:37.000000000 +0400
@@ -323,6 +325,9 @@ static int collect_signal(int sig, struc
if (first) {
list_del_init(&first->list);
copy_siginfo(info, &first->info);
+ if (first->flags & SIGQUEUE_CINIT)
+ kinfo->flags |= KERN_SIGINFO_CINIT;
+
__sigqueue_free(first);
if (!still_pending)
sigdelset(&list->signal, sig);
@@ -343,6 +348,8 @@ static int collect_signal(int sig, struc
{
int sig = next_signal(pending, mask);

+ kinfo->flags &= ~KERN_SIGINFO_CINIT;
+
if (sig) {
if (current->notifier) {
if (sigismember(current->notifier_mask, sig)) {
@@ -522,6 +547,20 @@ static int rm_from_queue(unsigned long m
return 1;
}

+static int deny_signal_to_container_init(struct task_struct *tsk, int sig)
+{
+ /*
+ * If receiver is the container-init of sender and signal is SIGKILL
+ * reject it right-away. If signal is any other one, let the container
+ * init decide (in get_signal_to_deliver()) whether to handle it or
+ * ignore it.
+ */
+ if (is_container_init(tsk) && (sig == SIGKILL) && pid_ns_equal(tsk))
+ return -EPERM;
+
+ return 0;
+}
+
/*
* Bad permissions for sending the signal
*/
@@ -545,6 +584,10 @@ static int check_kill_permission(int sig
&& !capable(CAP_KILL))
return error;

+ error = deny_signal_to_container_init(t, sig);
+ if (error)
+ return error;
+
return security_task_kill(t, info, sig, 0);
}

@@ -659,6 +702,34 @@ static void handle_stop_signal(int sig,
}
}

+static void encode_sender_info(struct task_struct *t, struct sigqueue *q)
+{
+ /*
+ * If sender (i.e 'current') and receiver have the same active
+ * pid namespace and the receiver is the container-init, set the
+ * SIGQUEUE_CINIT flag. This tells the container-init that the
+ * signal originated in its own namespace and so it can choose
+ * to ignore the signal.
+ *
+ * If the receiver is the container-init of a pid namespace,
+ * but the sender is from an ancestor pid namespace, the
+ * container-init cannot ignore the signal. So clear the
+ * SIGQUEUE_CINIT flag in this case.
+ *
+ * Also, if the sender does not have a pid_t in the receiver's
+ * active pid namespace, set si_pid to 0 and pretend it originated
+ * from the kernel.
+ */
+ if (pid_ns_equal(t)) {
+ if (is_container_init(t)) {
+ q->flags |= SIGQUEUE_CINIT;
+ }
+ } else {
+ q->info.si_pid = 0;
+ q->info.si_code = SI_KERNEL;
+ }
+}
+
static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
struct sigpending *signals)
{
@@ -710,6 +781,7 @@ static int send_signal(int sig, struct s
copy_siginfo(&q->info, info);
break;
}
+ encode_sender_info(t, q);
} else if (!is_si_special(info)) {
if (sig >= SIGRTMIN && info->si_code != SI_USER)
/*
@@ -1149,6 +1221,8 @@ EXPORT_SYMBOL_GPL(kill_pid_info_as_uid);
static int kill_something_info(int sig, struct siginfo *info, int pid)
{
int ret;
+ struct pid_namespace *my_ns = task_active_pid_ns(current);
+
rcu_read_lock();
if (!pid) {
ret = kill_pgrp_info(sig, info, task_pgrp(current));
@@ -1158,6 +1232,13 @@ static int kill_something_info(int sig,

read_lock(&tasklist_lock);
for_each_process(p) {
+ /*
+ * System-wide signals apply only to the sender's
+ * pid namespace, unless issued from init_pid_ns.
+ */
+ if (!task_visible_in_pid_ns(p, my_ns))
+ continue;
+
if (p->pid > 1 && p->tgid != current->tgid) {
int err = group_send_sig_info(sig, info, p);
++count;
@@ -1852,7 +1950,7 @@ relock:
* within that pid space. It can of course get signals from
* its parent pid space.
*/
- if (current == task_child_reaper(current))
+ if (kinfo.flags & KERN_SIGINFO_CINIT)
continue;

...

[PATCH 12/15] Miscelaneous stuff for pid namespaces [message #15306 is a reply to message #15293] Thu, 26 July 2007 14:56 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
This includes
* the headers dependency fix
* task_child_reaper() correct declaration
* fixes for is_global_init() and is_container_init()

Maybe this should come before all the other stuff...

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---

include/linux/pid_namespace.h | 4 ++--
include/linux/sched.h | 4 ++--
kernel/pid.c | 16 +++++++++++++---
3 files changed, 17 insertions(+), 7 deletions(-)

diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/pid_namespace.h linux-2.6.23-rc1-mm1-7/include/linux/pid_namespace.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/pid_namespace.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/pid_namespace.h 2007-07-26 16:36:36.000000000 +0400
@@ -4,7 +4,6 @@
#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/threads.h>
-#include <linux/pid.h>
#include <linux/nsproxy.h>
#include <linux/kref.h>

@@ -46,7 +53,8 @@ static inline struct pid_namespace *task

static inline struct task_struct *task_child_reaper(struct task_struct *tsk)
{
- return init_pid_ns.child_reaper;
+ BUG_ON(tsk != current);
+ return tsk->nsproxy->pid_ns->child_reaper;
}

#endif /* _LINUX_PID_NS_H */
diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/sched.h linux-2.6.23-rc1-mm1-7/include/linux/sched.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/sched.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/sched.h 2007-07-26 16:36:37.000000000 +0400
@@ -1277,13 +1366,13 @@ static inline int pid_alive(struct task_
*
* TODO: We should inline this function after some cleanups in pid_namespace.h
*/
-extern int is_global_init(struct task_struct *tsk);
+extern int is_container_init(struct task_struct *tsk);

/*
* is_container_init:
* check whether in the task is init in its own pid namespace.
*/
-static inline int is_container_init(struct task_struct *tsk)
+static inline int is_global_init(struct task_struct *tsk)
{
return tsk->pid == 1;
}
diff -upr linux-2.6.23-rc1-mm1.orig/kernel/pid.c linux-2.6.23-rc1-mm1-7/kernel/pid.c
--- linux-2.6.23-rc1-mm1.orig/kernel/pid.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/kernel/pid.c 2007-07-26 16:36:37.000000000 +0400
@@ -60,11 +62,21 @@ static inline int mk_pid(struct pid_name
};
EXPORT_SYMBOL(init_pid_ns);

-int is_global_init(struct task_struct *tsk)
+int is_container_init(struct task_struct *tsk)
{
- return tsk == init_pid_ns.child_reaper;
+ int ret;
+ struct pid *pid;
+
+ ret = 0;
+ rcu_read_lock();
+ pid = task_pid(tsk);
+ if (pid != NULL && pid->numbers[pid->level].nr == 1)
+ ret = 1;
+ rcu_read_unlock();
+
+ return ret;
}
-EXPORT_SYMBOL(is_global_init);
+EXPORT_SYMBOL(is_container_init);

/*
* Note: disable interrupts while the pidmap_lock is held as an
[PATCH 13/15] Clone the pid namespace [message #15307 is a reply to message #15293] Thu, 26 July 2007 14:56 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
When clone() is invoked with CLONE_NEWPID, create a new pid namespace
Since the active pid namespace is special and expected to be the first
entry in pid->upid_list, preserve the order of pid namespaces.

Pid namespaces can be nested and the nesting depth is unlimited.
When a process clones its pid namespace, we create additional pid caches
as necessary and use the pid cache to allocate 'struct pids' for that depth.

TODO:
One of the reasons the free_work() was introduced was to cleanup
the proc in non-atomic context, but since proc is now released
from proc_flush_task() this looks to be unneeded, but I have to
recheck this.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---

include/linux/pid_namespace.h | 7 ++
include/linux/sched.h | 1
kernel/nsproxy.c | 3 -
kernel/pid.c | 102 +++++++++++++++++++++++++++++++++++++-----
4 files changed, 101 insertions(+), 12 deletions(-)

diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/pid_namespace.h linux-2.6.23-rc1-mm1-7/include/linux/pid_namespace.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/pid_namespace.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/pid_namespace.h 2007-07-26 16:36:36.000000000 +0400
@@ -16,11 +15,16 @@ struct pidmap {
#define PIDMAP_ENTRIES ((PID_MAX_LIMIT + 8*PAGE_SIZE - 1)/PAGE_SIZE/8)

struct pid_namespace {
- struct kref kref;
+ union {
+ struct kref kref;
+ struct work_struct free_work;
+ };
struct pidmap pidmap[PIDMAP_ENTRIES];
int last_pid;
struct task_struct *child_reaper;
struct kmem_cache *pid_cachep;
+ int level;
+ struct pid_namespace *parent;
#ifdef CONFIG_PROC_FS
struct vfsmount *proc_mnt;
#endif
diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/sched.h linux-2.6.23-rc1-mm1-7/include/linux/sched.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/sched.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/sched.h 2007-07-26 16:36:37.000000000 +0400
@@ -27,6 +27,7 @@
#define CLONE_NEWUTS 0x04000000 /* New utsname group? */
#define CLONE_NEWIPC 0x08000000 /* New ipcs */
#define CLONE_NEWUSER 0x10000000 /* New user namespace */
+#define CLONE_NEWPID 0x20000000 /* New pids */

/*
* Scheduling policies
diff -upr linux-2.6.23-rc1-mm1.orig/kernel/nsproxy.c linux-2.6.23-rc1-mm1-7/kernel/nsproxy.c
--- linux-2.6.23-rc1-mm1.orig/kernel/nsproxy.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/kernel/nsproxy.c 2007-07-26 16:36:36.000000000 +0400
@@ -132,7 +132,8 @@ int copy_namespaces(unsigned long flags,

get_nsproxy(old_ns);

- if (!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER)))
+ if (!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
+ CLONE_NEWUSER | CLONE_NEWPID)))
return 0;

if (!capable(CAP_SYS_ADMIN)) {
diff -upr linux-2.6.23-rc1-mm1.orig/kernel/pid.c linux-2.6.23-rc1-mm1-7/kernel/pid.c
--- linux-2.6.23-rc1-mm1.orig/kernel/pid.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/kernel/pid.c 2007-07-26 16:36:37.000000000 +0400
@@ -60,14 +62,17 @@ static inline int mk_pid(struct pid_name
* the scheme scales to up to 4 million PIDs, runtime.
*/
struct pid_namespace init_pid_ns = {
- .kref = {
- .refcount = ATOMIC_INIT(2),
+ {
+ .kref = {
+ .refcount = ATOMIC_INIT(2),
+ },
},
.pidmap = {
[ 0 ... PIDMAP_ENTRIES-1] = { ATOMIC_INIT(BITS_PER_PAGE), NULL }
},
.last_pid = 0,
- .child_reaper = &init_task
+ .level = 0,
+ .child_reaper = &init_task,
};
EXPORT_SYMBOL(init_pid_ns);

@@ -409,8 +501,8 @@ static struct kmem_cache *create_pid_cac

snprintf(pcache->name, sizeof(pcache->name), "pid_%d", nr_ids);
cachep = kmem_cache_create(pcache->name,
- /* FIXME add numerical ids here */
- sizeof(struct pid), 0, SLAB_HWCACHE_ALIGN, NULL);
+ sizeof(struct pid) + (nr_ids - 1) * sizeof(struct upid),
+ 0, SLAB_HWCACHE_ALIGN, NULL);
if (cachep == NULL)
goto err_cachep;

@@ -428,11 +520,89 @@ err_alloc:
return NULL;
}

-struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old_ns)
+static struct pid_namespace *create_pid_namespace(int level)
+{
+ struct pid_namespace *ns;
+ int i;
+
+ ns = kmalloc(sizeof(struct pid_namespace), GFP_KERNEL);
+ if (ns == NULL)
+ goto out;
+
+ ns->pidmap[0].page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!ns->pidmap[0].page)
+ goto out_free;
+
+ ns->pid_cachep = create_pid_cachep(level + 1);
+ if (ns->pid_cachep == NULL)
+ goto out_free_map;
+
+ kref_init(&ns->kref);
+ ns->last_pid = 0;
+ ns->child_reaper = NULL;
+ ns->level = level;
+
+ set_bit(0, ns->pidmap[0].page);
+ atomic_set(&ns->pidmap[0].nr_free, BITS_PER_PAGE - 1);
+ get_pid_ns(ns);
+
+ for (i = 1; i < PIDMAP_ENTRIES; i++) {
+ ns->pidmap[i].page = 0;
+ atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
+ }
+
+ return ns;
+
+out_free_map:
+ kfree(ns->pidmap[0].page);
+out_free:
+ kfree(ns);
+out:
+ return ERR_PTR(-ENOMEM);
+}
+
+static void destroy_pid_namespace(struct pid_namespace *ns)
{
- BUG_ON(!old_ns);
- get_pid_ns(old_ns);
- return old_ns;
+ int i;
+
+ for (i = 0; i < PIDMAP_ENTRIES; i++)
+ kfree(ns->pidmap[i].page);
+ kfree(ns);
+}
+
+ struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old_ns)
+ {
+ struct pid_namespace *new_ns;
+
+ BUG_ON(!old_ns);
+ new_ns = get_pid_ns(old_ns);
+ if (!(flags & CLONE_NEWPID))
+ goto out;
+
+ new_ns = ERR_PTR(-EINVAL);
+ if (flags & CLONE_THREAD)
+ goto out_put;
+
+ new_ns = create_pid_namespace(old_ns->level + 1);
+ if (new_ns != NULL)
+ new_ns->parent = get_pid_ns(old_ns);
+
+out_put:
+ put_pid_ns(old_ns);
+out:
+ return new_ns;
+}
+
+static void do_free_pid_ns(struct work_struct *w)
+{
+ struct pid_namespace *ns, *parent;
+
+ ns = container_of(w, struct pid_namespace, free_work);
+ parent = ns->parent;
+ destroy_pid_namespace(ns);
+
+ if (parent != NULL)
+ put_pid_ns(parent);
}

void free_pid_ns(struct kref *kref)
@@ -440,7 +648,8 @@ void free_pid_ns(struct kref *kref)
struct pid_namespace *ns;

ns = container_of(kref, struct pid_namespace, kref);
- kfree(ns);
+ INIT_WORK(&ns->free_work, do_free_pid_ns);
+ schedule_work(&ns->free_work);
}

/*
[PATCH 14/15] Destroy pid namespace on init's death [message #15308 is a reply to message #15293] Thu, 26 July 2007 14:57 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
From: Sukadev Bhattiprolu <sukadev@us.ibm.com>

Terminate all processes in a namespace when the reaper of the namespace
is exiting. We do this by walking the pidmap of the namespace and sending
SIGKILL to all processes.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>

---

include/linux/pid.h | 1 +
include/linux/wait.h | 4 ++++
kernel/exit.c | 10 ++++++----
kernel/pid.c | 38 ++++++++++++++++++++++++++++++++++++++
4 files changed, 49 insertions(+), 4 deletions(-)

diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/pid.h linux-2.6.23-rc1-mm1-7/include/linux/pid.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/pid.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/pid.h 2007-07-26 16:36:37.000000000 +0400
@@ -83,6 +92,7 @@ extern void FASTCALL(detach_pid(struct t

extern struct pid *alloc_pid(struct pid_namespace *ns);
extern void FASTCALL(free_pid(struct pid *pid));
+extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);

/*
* the helpers to get the pid's id seen from different namespaces
diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/wait.h linux-2.6.23-rc1-mm1-7/include/linux/wait.h
--- linux-2.6.23-rc1-mm1.orig/include/linux/wait.h 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/include/linux/wait.h 2007-07-26 16:36:36.000000000 +0400
@@ -22,9 +22,13 @@
#include <linux/list.h>
#include <linux/stddef.h>
#include <linux/spinlock.h>
+#include <linux/resource.h>
+#include <asm/siginfo.h>
#include <asm/system.h>
#include <asm/current.h>

+long do_wait(pid_t pid, int options, struct siginfo __user *infop,
+ int __user *stat_addr, struct rusage __user *ru);
typedef struct __wait_queue wait_queue_t;
typedef int (*wait_queue_func_t)(wait_queue_t *wait, unsigned mode, int sync, void *key);
int default_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
diff -upr linux-2.6.23-rc1-mm1.orig/kernel/exit.c linux-2.6.23-rc1-mm1-7/kernel/exit.c
--- linux-2.6.23-rc1-mm1.orig/kernel/exit.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/kernel/exit.c 2007-07-26 16:36:37.000000000 +0400
@@ -895,6 +915,7 @@ fastcall NORET_TYPE void do_exit(long co
{
struct task_struct *tsk = current;
int group_dead;
+ struct pid_namespace *pid_ns = tsk->nsproxy->pid_ns;

profile_task_exit(tsk);

@@ -905,9 +926,10 @@ fastcall NORET_TYPE void do_exit(long co
if (unlikely(!tsk->pid))
panic("Attempted to kill the idle task!");
if (unlikely(tsk == task_child_reaper(tsk))) {
- if (task_active_pid_ns(tsk) != &init_pid_ns)
- task_active_pid_ns(tsk)->child_reaper =
- init_pid_ns.child_reaper;
+ if (pid_ns != &init_pid_ns) {
+ zap_pid_ns_processes(pid_ns);
+ pid_ns->child_reaper = init_pid_ns.child_reaper;
+ }
else
panic("Attempted to kill init!");
}
@@ -1518,7 +1548,7 @@ static inline int my_ptrace_child(struct
return (p->parent != p->real_parent);
}

-static long do_wait(pid_t pid, int options, struct siginfo __user *infop,
+long do_wait(pid_t pid, int options, struct siginfo __user *infop,
int __user *stat_addr, struct rusage __user *ru)
{
DECLARE_WAITQUEUE(wait, current);
diff -upr linux-2.6.23-rc1-mm1.orig/kernel/pid.c linux-2.6.23-rc1-mm1-7/kernel/pid.c
--- linux-2.6.23-rc1-mm1.orig/kernel/pid.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/kernel/pid.c 2007-07-26 16:36:37.000000000 +0400
@@ -428,6 +520,44 @@ err_alloc:
return new_ns;
}

+void zap_pid_ns_processes(struct pid_namespace *pid_ns)
+{
+ int i;
+ int nr;
+ int nfree;
+ int options = WNOHANG|WEXITED|__WALL;
+
+repeat:
+ /*
+ * We know pid == 1 is terminating. Find remaining pid_ts
+ * in the namespace, signal them and then wait for them
+ * exit.
+ */
+ nr = next_pidmap(pid_ns, 1);
+ while (nr > 0) {
+ kill_proc_info(SIGKILL, SEND_SIG_PRIV, nr);
+ nr = next_pidmap(pid_ns, nr);
+ }
+
+ nr = next_pidmap(pid_ns, 1);
+ while (nr > 0) {
+ do_wait(nr, options, NULL, NULL, NULL);
+ nr = next_pidmap(pid_ns, nr);
+ }
+
+ nfree = 0;
+ for (i = 0; i < PIDMAP_ENTRIES; i++)
+ nfree += atomic_read(&pid_ns->pidmap[i].nr_free);
+
+ /*
+ * If pidmap has entries for processes other than 0 and 1, retry.
+ */
+ if (nfree < (BITS_PER_PAGE * PIDMAP_ENTRIES - 2))
+ goto repeat;
+
+ return;
+}
+
static void do_free_pid_ns(struct work_struct *w)
{
struct pid_namespace *ns, *parent;
[PATCH 15/15] Hooks over the code to show correct values to user [message #15309 is a reply to message #15293] Thu, 26 July 2007 14:58 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.

The idea is:
- all in-kernel data structures must store either struct pid itself
or the pid's global nr, obtained with pid_nr() call;
- when seeking the task from kernel code with the stored id one
should use find_task_by_pid() call that works with global pids;
- when showing pid's numerical value to the user the virtual one
should be used, but however when one shows task's pid outside this
task's namespace the global one is to be used;
- when getting the pid from userspace one need to consider this as
the virtual one and use appropriate task/pid-searching functions.

Many thanks to Sukadev for pointing out many places I lost.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---

arch/ia64/kernel/signal.c | 4 +--
arch/parisc/kernel/signal.c | 2 -
arch/sparc/kernel/sys_sunos.c | 2 -
arch/sparc64/kernel/sys_sunos32.c | 2 -
drivers/char/tty_io.c | 7 +++---
fs/binfmt_elf.c | 16 +++++++-------
fs/binfmt_elf_fdpic.c | 16 +++++++-------
fs/exec.c | 4 +--
fs/fcntl.c | 5 ++--
fs/ioprio.c | 5 ++--
fs/proc/array.c | 25 +++++++++++++++-------
fs/proc/base.c | 37 ++++++++++++++++++++-------------
include/net/scm.h | 4 ++-
ipc/mqueue.c | 7 ++++--
ipc/msg.c | 6 ++---
ipc/sem.c | 8 +++----
ipc/shm.c | 6 ++---
kernel/capability.c | 14 +++++++-----
kernel/exit.c | 31 ++++++++++++++++++----------
kernel/fork.c | 2 -
kernel/futex.c | 20 +++++++++---------
kernel/ptrace.c | 4 ++-
kernel/sched.c | 3 +-
kernel/signal.c | 39 +++++++++++++++++++++++------------
kernel/sys.c | 42 ++++++++++++++++++++++----------------
kernel/sysctl.c | 2 -
kernel/timer.c | 7 +++---
net/core/scm.c | 4 ++-
net/unix/af_unix.c | 6 ++---
29 files changed, 198 insertions(+), 132 deletions(-)

diff -upr linux-2.6.23-rc1-mm1.orig/arch/ia64/kernel/signal.c linux-2.6.23-rc1-mm1-7/arch/ia64/kernel/signal.c
--- linux-2.6.23-rc1-mm1.orig/arch/ia64/kernel/signal.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/arch/ia64/kernel/signal.c 2007-07-26 16:36:36.000000000 +0400
@@ -227,7 +227,7 @@ ia64_rt_sigreturn (struct sigscratch *sc
si.si_signo = SIGSEGV;
si.si_errno = 0;
si.si_code = SI_KERNEL;
- si.si_pid = current->pid;
+ si.si_pid = task_pid_vnr(current);
si.si_uid = current->uid;
si.si_addr = sc;
force_sig_info(SIGSEGV, &si, current);
@@ -332,7 +332,7 @@ force_sigsegv_info (int sig, void __user
si.si_signo = SIGSEGV;
si.si_errno = 0;
si.si_code = SI_KERNEL;
- si.si_pid = current->pid;
+ si.si_pid = task_pid_vnr(current);
si.si_uid = current->uid;
si.si_addr = addr;
force_sig_info(SIGSEGV, &si, current);
diff -upr linux-2.6.23-rc1-mm1.orig/arch/parisc/kernel/signal.c linux-2.6.23-rc1-mm1-7/arch/parisc/kernel/signal.c
--- linux-2.6.23-rc1-mm1.orig/arch/parisc/kernel/signal.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/arch/parisc/kernel/signal.c 2007-07-26 16:36:36.000000000 +0400
@@ -181,7 +181,7 @@ give_sigsegv:
si.si_signo = SIGSEGV;
si.si_errno = 0;
si.si_code = SI_KERNEL;
- si.si_pid = current->pid;
+ si.si_pid = task_pid_vnr(current);
si.si_uid = current->uid;
si.si_addr = &frame->uc;
force_sig_info(SIGSEGV, &si, current);
diff -upr linux-2.6.23-rc1-mm1.orig/arch/sparc/kernel/sys_sunos.c linux-2.6.23-rc1-mm1-7/arch/sparc/kernel/sys_sunos.c
--- linux-2.6.23-rc1-mm1.orig/arch/sparc/kernel/sys_sunos.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/arch/sparc/kernel/sys_sunos.c 2007-07-26 16:36:36.000000000 +0400
@@ -866,7 +866,7 @@ asmlinkage int sunos_killpg(int pgrp, in
rcu_read_lock();
ret = -EINVAL;
if (pgrp > 0)
- ret = kill_pgrp(find_pid(pgrp), sig, 0);
+ ret = kill_pgrp(find_vpid(pgrp), sig, 0);
rcu_read_unlock();

return ret;
diff -upr linux-2.6.23-rc1-mm1.orig/arch/sparc64/kernel/sys_sunos32.c linux-2.6.23-rc1-mm1-7/arch/sparc64/kernel/sys_sunos32.c
--- linux-2.6.23-rc1-mm1.orig/arch/sparc64/kernel/sys_sunos32.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/arch/sparc64/kernel/sys_sunos32.c 2007-07-26 16:36:36.000000000 +0400
@@ -831,7 +831,7 @@ asmlinkage int sunos_killpg(int pgrp, in
rcu_read_lock();
ret = -EINVAL;
if (pgrp > 0)
- ret = kill_pgrp(find_pid(pgrp), sig, 0);
+ ret = kill_pgrp(find_vpid(pgrp), sig, 0);
rcu_read_unlock();

return ret;
diff -upr linux-2.6.23-rc1-mm1.orig/drivers/char/tty_io.c linux-2.6.23-rc1-mm1-7/drivers/char/tty_io.c
--- linux-2.6.23-rc1-mm1.orig/drivers/char/tty_io.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/drivers/char/tty_io.c 2007-07-26 16:36:36.000000000 +0400
@@ -103,6 +103,7 @@
#include <linux/selection.h>

#include <linux/kmod.h>
+#include <linux/nsproxy.h>

#undef TTY_DEBUG_HANGUP

@@ -3080,7 +3081,7 @@ static int tiocgpgrp(struct tty_struct *
*/
if (tty == real_tty && current->signal->tty != real_tty)
return -ENOTTY;
- return put_user(pid_nr(real_tty->pgrp), p);
+ return put_user(pid_vnr(real_tty->pgrp), p);
}

/**
@@ -3114,7 +3115,7 @@ static int tiocspgrp(struct tty_struct *
if (pgrp_nr < 0)
return -EINVAL;
rcu_read_lock();
- pgrp = find_pid(pgrp_nr);
+ pgrp = find_vpid(pgrp_nr);
retval = -ESRCH;
if (!pgrp)
goto out_unlock;
@@ -3151,7 +3152,7 @@ static int tiocgsid(struct tty_struct *t
return -ENOTTY;
if (!real_tty->session)
return -ENOTTY;
- return put_user(pid_nr(real_tty->session), p);
+ return put_user(pid_vnr(real_tty->session), p);
}

/**
diff -upr linux-2.6.23-rc1-mm1.orig/fs/binfmt_elf.c linux-2.6.23-rc1-mm1-7/fs/binfmt_elf.c
--- linux-2.6.23-rc1-mm1.orig/fs/binfmt_elf.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/fs/binfmt_elf.c 2007-07-26 16:36:36.000000000 +0400
@@ -1339,10 +1339,10 @@ static void fill_prstatus(struct elf_prs
prstatus->pr_info.si_signo = prstatus->pr_cursig = signr;
prstatus->pr_sigpend = p->pending.signal.sig[0];
prstatus->pr_sighold = p->blocked.sig[0];
- prstatus->pr_pid = p->pid;
- prstatus->pr_ppid = p->parent->pid;
- prstatus->pr_pgrp = task_pgrp_nr(p);
- prstatus->pr_sid = task_session_nr(p);
+ prstatus->pr_pid = task_pid_vnr(p);
+ prstatus->pr_ppid = task_pid_vnr(p->parent);
+ prstatus->pr_pgrp = task_pgrp_vnr(p);
+ prstatus->pr_sid = task_session_vnr(p);
if (thread_group_leader(p)) {
/*
* This is the record for the group leader. Add in the
@@ -1385,10 +1385,10 @@ static int fill_psinfo(struct elf_prpsin
psinfo->pr_psargs[i] = ' ';
psinfo->pr_psargs[len] = 0;

- psinfo->pr_pid = p->pid;
- psinfo->pr_ppid = p->parent->pid;
- psinfo->pr_pgrp = task_pgrp_nr(p);
- psinfo->pr_sid = task_session_nr(p);
+ psinfo->pr_pid = task_pid_vnr(p);
+ psinfo->pr_ppid = task_pid_vnr(p->parent);
+ psinfo->pr_pgrp = task_pgrp_vnr(p);
+ psinfo->pr_sid = task_session_vnr(p);

i = p->state ? ffz(~p->state) + 1 : 0;
psinfo->pr_state = i;
diff -upr linux-2.6.23-rc1-mm1.orig/fs/binfmt_elf_fdpic.c linux-2.6.23-rc1-mm1-7/fs/binfmt_elf_fdpic.c
--- linux-2.6.23-rc1-mm1.orig/fs/binfmt_elf_fdpic.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/fs/binfmt_elf_fdpic.c 2007-07-26 16:36:36.000000000 +0400
@@ -1342,10 +1342,10 @@ static void fill_prstatus(struct elf_prs
prstatus->pr_info.si_signo = prstatus->pr_cursig = signr;
prstatus->pr_sigpend = p->pending.signal.sig[0];
prstatus->pr_sighold = p->blocked.sig[0];
- prstatus->pr_pid = p->pid;
- prstatus->pr_ppid = p->parent->pid;
- prstatus->pr_pgrp = task_pgrp_nr(p);
- prstatus->pr_sid = task_session_nr(p);
+ prstatus->pr_pid = task_pid_vnr(p);
+ prstatus->pr_ppid = task_pid_vnr(p->parent);
+ prstatus->pr_pgrp = task_pgrp_vnr(p);
+ prstatus->pr_sid = task_session_vnr(p);
if (thread_group_leader(p)) {
/*
* This is the record for the group leader. Add in the
@@ -1391,10 +1391,10 @@ static int fill_psinfo(struct elf_prpsin
psinfo->pr_psargs[i] = ' ';
psinfo->pr_psargs[len] = 0;

- psinfo->pr_pid = p->pid;
- psinfo->pr_ppid = p->parent->pid;
- psinfo->pr_pgrp = task_pgrp_nr(p);
- psinfo->pr_sid = task_session_nr(p);
+ psinfo->pr_pid = task_pid_vnr(p);
+ psinfo->pr_ppid = task_pid_vnr(p->parent);
+ psinfo->pr_pgrp = task_pgrp_vnr(p);
+ psinfo->pr_sid = task_session_vnr(p);

i = p->state ? ffz(~p->state) + 1 : 0;
psinfo->pr_state = i;
diff -upr linux-2.6.23-rc1-mm1.orig/fs/exec.c linux-2.6.23-rc1-mm1-7/fs/exec.c
--- linux-2.6.23-rc1-mm1.orig/fs/exec.c 2007-07-26 16:34:45.000000000 +0400
+++ linux-2.6.23-rc1-mm1-7/fs/exec.c 2007-07-26 16:36:36.000000000 +0400
@@ -1462,7 +1462,7 @@ static int format_corename(char *corenam
case 'p':
pid_in_pattern = 1;
rc = snprintf(out_ptr, out_end - out_ptr,
- "%d", current->tgid);
+ "%d", task_tgid_vnr(current));
if (rc > out_end - out_ptr)
goto out;
out_ptr += rc;
@@ -1534,7 +1534,7 @@ static int format_corename(char *corenam
if (!ispipe && !pid_in_pattern
&& (core_uses_pid || atomic_read(&current->mm->mm_users) != 1)) {
rc = snprintf(out_ptr, out_end - out_ptr,
- &quo
...

Re: [PATCH 1/15] Move exit_task_namespaces() [message #15320 is a reply to message #15295] Thu, 26 July 2007 16:47 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/26, Pavel Emelyanov wrote:
>
> Make task release its namespaces after it has reparented all his
> children to child_reaper, but before it notifies its parent about
> its death.
>
> The reason to release namespaces after reparenting is that when task
> exits it may send a signal to its parent (SIGCHLD), but if the parent
> has already exited its namespaces there will be no way to decide what
> pid to dever to him - parent can be from different namespace.
>
> The reason to release namespace before notifying the parent it that
> when task sends a SIGCHLD to parent it can call wait() on this taks
> and release it. But releasing the mnt namespace implies dropping
> of all the mounts in the mnt namespace and NFS expects the task to
> have valid sighand pointer.
>
> Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
>
> ---
>
> exit.c | 5 ++++-
> 1 files changed, 4 insertions(+), 1 deletion(-)
>
> diff -upr linux-2.6.23-rc1-mm1.orig/kernel/exit.c
> linux-2.6.23-rc1-mm1-7/kernel/exit.c
> --- linux-2.6.23-rc1-mm1.orig/kernel/exit.c 2007-07-26
> 16:34:45.000000000 +0400
> +++ linux-2.6.23-rc1-mm1-7/kernel/exit.c 2007-07-26
> 16:36:37.000000000 +0400
> @@ -788,6 +804,10 @@ static void exit_notify(struct task_stru
> BUG_ON(!list_empty(&tsk->children));
> BUG_ON(!list_empty(&tsk->ptrace_children));
>
> + write_unlock_irq(&tasklist_lock);
> + exit_task_namespaces(tsk);
> + write_lock_irq(&tasklist_lock);

No.

We "cleared" our ->children/->ptrace_children lists. Now suppose that
another thread dies, and its forget_original_parent() choose us as a
new reaper before we re-take tasklist.

I'll try to read other patches tomorrow, but I can't avoid a stupid
question: can we have a CONFIG_ for that? This series adds a lot of
complications.

OK, I guess the answer is "it is very difficult t achieve", but can't
help myself :)

Oleg.
Re: [PATCH 1/15] Move exit_task_namespaces() [message #15323 is a reply to message #15295] Thu, 26 July 2007 16:10 Go to previous messageGo to next message
Dave Hansen is currently offline  Dave Hansen
Messages: 240
Registered: October 2005
Senior Member
On Thu, 2007-07-26 at 18:46 +0400, Pavel Emelyanov wrote:
> + write_unlock_irq(&tasklist_lock);
> + exit_task_namespaces(tsk);
> + write_lock_irq(&tasklist_lock);

Are there any other side-effects of doing this? What was the
tasklist_lock protecting here when it was released?

-- Dave
Re: [PATCH 8/15] Helpers to find the task by its numerical ids [message #15324 is a reply to message #15302] Thu, 26 July 2007 19:05 Go to previous messageGo to next message
Dave Hansen is currently offline  Dave Hansen
Messages: 240
Registered: October 2005
Senior Member
On Thu, 2007-07-26 at 18:51 +0400, Pavel Emelyanov wrote:
> +extern struct task_struct *find_task_by_pid_type_ns(int type, int pid,
> + struct pid_namespace *ns);
> +
> +#define find_task_by_pid_ns(nr, ns) \
> + find_task_by_pid_type_ns(PIDTYPE_PID, nr, ns)
> +#define find_task_by_pid_type(type, nr) \
> + find_task_by_pid_type_ns(type, nr, &init_pid_ns)
> +#define find_task_by_pid(nr) \
> + find_task_by_pid_type(PIDTYPE_PID, nr)
> +
> extern void __set_special_pids(pid_t session, pid_t pgrp);

Do these _have_ to be macros?

> /* per-UID process charging. */
> diff -upr linux-2.6.23-rc1-mm1.orig/kernel/pid.c linux-2.6.23-rc1-mm1-7/kernel/pid.c
> --- linux-2.6.23-rc1-mm1.orig/kernel/pid.c 2007-07-26 16:34:45.000000000 +0400
> +++ linux-2.6.23-rc1-mm1-7/kernel/pid.c 2007-07-26 16:36:37.000000000 +0400
> @@ -204,19 +221,20 @@ static void delayed_put_pid(struct rcu_h
> goto out;
> }
>
> -struct pid * fastcall find_pid(int nr)
> +struct pid * fastcall find_pid_ns(int nr, struct pid_namespace *ns)
> {
> struct hlist_node *elem;
> - struct pid *pid;
> + struct upid *pnr;
> +
> + hlist_for_each_entry_rcu(pnr, elem,
> + &pid_hash[pid_hashfn(nr, ns)], pid_chain)
> + if (pnr->nr == nr && pnr->ns == ns)
> + return container_of(pnr, struct pid,
> + numbers[ns->level]);

Do we do this loop anywhere else? Should we have a for_each_pid() that
makes this a little less messy?

> - hlist_for_each_entry_rcu(pid, elem,
> - &pid_hash[pid_hashfn(nr)], pid_chain) {
> - if (pid->nr == nr)
> - return pid;
> - }
> return NULL;
> }
> -EXPORT_SYMBOL_GPL(find_pid);
> +EXPORT_SYMBOL_GPL(find_pid_ns);
>
> /*
> * attach_pid() must be called with the tasklist_lock write-held.
> @@ -318,12 +355,13 @@ struct task_struct * fastcall pid_task(s
> /*
> * Must be called under rcu_read_lock() or with tasklist_lock read-held.
> */
> -struct task_struct *find_task_by_pid_type(int type, int nr)
> +struct task_struct *find_task_by_pid_type_ns(int type, int nr,
> + struct pid_namespace *ns)
> {
> - return pid_task(find_pid(nr), type);
> + return pid_task(find_pid_ns(nr, ns), type);
> }
>
> -EXPORT_SYMBOL(find_task_by_pid_type);
> +EXPORT_SYMBOL(find_task_by_pid_type_ns);
>
> struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
> {
> @@ -342,7 +426,7 @@ struct pid *find_get_pid(pid_t nr)
> struct pid *pid;
>
> rcu_read_lock();
> - pid = get_pid(find_pid(nr));
> + pid = get_pid(find_vpid(nr));
> rcu_read_unlock();

OK, I think this is really confusing. If find_get_pid() finds vpids,
should we not call it find_get_vpid()?

-- Dave
Re: [PATCH 7/15] Helpers to obtain pid numbers [message #15325 is a reply to message #15301] Thu, 26 July 2007 19:03 Go to previous messageGo to next message
Dave Hansen is currently offline  Dave Hansen
Messages: 240
Registered: October 2005
Senior Member
On Thu, 2007-07-26 at 18:51 +0400, Pavel Emelyanov wrote:
>
> + * pid_nr() : global id, i.e. the id seen from the init namespace;
> + * pid_vnr() : virtual id, i.e. the id seen from the namespace this pid
> + * belongs to. this only makes sence when called in the
> + * context of the task that belongs to the same namespace;

Can we give these some better names? I think "virtual" is pretty bad,
especially if you consider the multiple level of pid namespaces that we
might have some day. Processes can belong to multiple pid namespaces,
and thus have multiple "virtual" ids.

Even though it will make the names longer, I think we need something in
the names to say that "pid_nr()" is the top-level, global, init_pid_ns
number. "pid_vnr()" is the pid for the lowest pid namespace in the
hierarchy. Suka called this an "active pid namespace" because that is
where the task actively interacts with its peers. But, I'm open to
other suggestions, too.

When writing code, people are going to need to know which one to use:
pid_nr() or pid_vnr(). We can document the functions, but the names
will help much more than any documentation.


-- Dave
Re: [RFC][PATCH 0/15] Pid namespaces [message #15329 is a reply to message #15293] Fri, 27 July 2007 04:22 Go to previous messageGo to next message
Sukadev Bhattiprolu is currently offline  Sukadev Bhattiprolu
Messages: 413
Registered: August 2006
Senior Member
Pavel,

We seem to have a memory leak. Its new in this patchset (i.e the
following test ran fine on the 2.6.22-rc6-mm1 patchset).

To repro: run "pidns_exec ./mypid" in a tight-loop - where mypid.c
is:

#include <stdio.h>
#include <unistd.h>

main()
{
printf("Pid %d, Ppid %d, Pgid %d, Sid %d\n",
getpid(), getppid(), getpgid(0), getsid(0));
}

I ran into OOM in about 30 mins. I am still investigating.

BTW, can we include a simple test program like the pidns_exec in this
patch-0 for whoever want to play with pidns ?

Suka

Pavel Emelianov [xemul@openvz.org] wrote:
| (Comment is taken from Sukadev's patchset-v3)
|
| A pid namespace is a "view" of a particular set of tasks on the system.
| They work in a similar way to filesystem namespaces. A file (or a process)
| can be accessed in multiple namespaces, but it may have a different name
| in each. In a filesystem, this name might be /etc/passwd in one namespace,
| but /chroot/etc/passwd in another.
|
| For processes, a process may have pid 1234 in one namespace, but be pid 1
| in another. This allows new pid namespaces to have basically arbitrary
| pids, and not have to worry about what pids exist in other namespaces.
| This is essential for checkpoint/restart where a restarted process's pid
| might collide with an existing process on the system's pid.
|
| In this particular implementation, pid namespaces have a parent-child
| relationship, just like processes. A process in a pid namespace may see
| all of the processes in the same namespace, as well as all of the processes
| in all of the namespaces which are children of its namespace. Processes may
| not, however, see others which are in their parent's namespace, but not in
| their own. The same goes for sibling namespaces.
|
| This set is based on my patches, I sent before, but it includes some
| comments
| and patches that I received from Sukadev. Sukadev, please, add your
| Acked-by,
| Signed-off-by or From, to patches you want (everybody is also welcome :) ).
|
| The set is based on 2.6.23-rc1-mm1, which already has some preparation
| patches
| for pid namespaces. After the review and fixing all the comments, this set
| will be benchmarked and sent to Andrew for inclusion in -mm tree.
|
| Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
| Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Re: [PATCH 15/15] Hooks over the code to show correct values to user [message #15330 is a reply to message #15309] Fri, 27 July 2007 05:57 Go to previous messageGo to next message
Sukadev Bhattiprolu is currently offline  Sukadev Bhattiprolu
Messages: 413
Registered: August 2006
Senior Member
Pavel Emelianov [xemul@openvz.org] wrote:
| diff -upr linux-2.6.23-rc1-mm1.orig/fs/proc/array.c
| linux-2.6.23-rc1-mm1-7/fs/proc/array.c
| --- linux-2.6.23-rc1-mm1.orig/fs/proc/array.c 2007-07-26
| 16:34:45.000000000 +0400
| +++ linux-2.6.23-rc1-mm1-7/fs/proc/array.c 2007-07-26
| 16:36:36.000000000 +0400
| @@ -77,6 +77,7 @@
| #include <linux/cpuset.h>
| #include <linux/rcupdate.h>
| #include <linux/delayacct.h>
| +#include <linux/pid_namespace.h>
|
| #include <asm/pgtable.h>
| #include <asm/processor.h>
| @@ -161,7 +162,9 @@ static inline char *task_state(struct ta
| struct group_info *group_info;
| int g;
| struct fdtable *fdt = NULL;
| + struct pid_namespace *ns;
|
| + ns = current->nsproxy->pid_ns;
| rcu_read_lock();
| buffer += sprintf(buffer,
| "State:\t%s\n"
| @@ -172,9 +175,12 @@ static inline char *task_state(struct ta
| "Uid:\t%d\t%d\t%d\t%d\n"
| "Gid:\t%d\t%d\t%d\t%d\n",
| get_task_state(p),
| - p->tgid, p->pid,
| - pid_alive(p) ? rcu_dereference(p->real_parent)->tgid : 0,
| - pid_alive(p) && p->ptrace ? rcu_dereference(p->parent)->pid
| : 0,
| + task_tgid_nr_ns(p, ns),
| + task_pid_nr_ns(p, ns),
| + pid_alive(p) ?
| + task_ppid_nr_ns(p, ns) : 0,
| + pid_alive(p) && p->ptrace ?
| + task_tgid_nr_ns(rcu_dereference(p->parent), ns) : 0,

Hmm. I think we have this reversed here. We should print the 'tgid' of
task->real_parent and 'pid' of task->parent.
Re: [PATCH 7/15] Helpers to obtain pid numbers [message #15331 is a reply to message #15325] Fri, 27 July 2007 06:40 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Dave Hansen wrote:
> On Thu, 2007-07-26 at 18:51 +0400, Pavel Emelyanov wrote:
>> + * pid_nr() : global id, i.e. the id seen from the init namespace;
>> + * pid_vnr() : virtual id, i.e. the id seen from the namespace this pid
>> + * belongs to. this only makes sence when called in the
>> + * context of the task that belongs to the same namespace;
>
> Can we give these some better names? I think "virtual" is pretty bad,
> especially if you consider the multiple level of pid namespaces that we
> might have some day. Processes can belong to multiple pid namespaces,
> and thus have multiple "virtual" ids.

We do have them now (multiple levels). The "virtual" stands for "the one
that task sees by his own".

> Even though it will make the names longer, I think we need something in
> the names to say that "pid_nr()" is the top-level, global, init_pid_ns
> number. "pid_vnr()" is the pid for the lowest pid namespace in the
> hierarchy.

That's it.

> Suka called this an "active pid namespace" because that is
> where the task actively interacts with its peers. But, I'm open to
> other suggestions, too.
>
> When writing code, people are going to need to know which one to use:
> pid_nr() or pid_vnr(). We can document the functions, but the names
> will help much more than any documentation.

I do not mind, but what other names can we have? pid_nr() for init namespace,
pid_nr_ns for arbitrary namespace and pid_<what> for the lower-level namespace?

>
> -- Dave
>
>

Thanks,
Pavel
Re: [PATCH 8/15] Helpers to find the task by its numerical ids [message #15332 is a reply to message #15324] Fri, 27 July 2007 06:43 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Dave Hansen wrote:
> On Thu, 2007-07-26 at 18:51 +0400, Pavel Emelyanov wrote:
>> +extern struct task_struct *find_task_by_pid_type_ns(int type, int pid,
>> + struct pid_namespace *ns);
>> +
>> +#define find_task_by_pid_ns(nr, ns) \
>> + find_task_by_pid_type_ns(PIDTYPE_PID, nr, ns)
>> +#define find_task_by_pid_type(type, nr) \
>> + find_task_by_pid_type_ns(type, nr, &init_pid_ns)
>> +#define find_task_by_pid(nr) \
>> + find_task_by_pid_type(PIDTYPE_PID, nr)
>> +
>> extern void __set_special_pids(pid_t session, pid_t pgrp);
>
> Do these _have_ to be macros?

No, but why not?
I can make them static inline functions, but why are macros that bad?

>> /* per-UID process charging. */
>> diff -upr linux-2.6.23-rc1-mm1.orig/kernel/pid.c linux-2.6.23-rc1-mm1-7/kernel/pid.c
>> --- linux-2.6.23-rc1-mm1.orig/kernel/pid.c 2007-07-26 16:34:45.000000000 +0400
>> +++ linux-2.6.23-rc1-mm1-7/kernel/pid.c 2007-07-26 16:36:37.000000000 +0400
>> @@ -204,19 +221,20 @@ static void delayed_put_pid(struct rcu_h
>> goto out;
>> }
>>
>> -struct pid * fastcall find_pid(int nr)
>> +struct pid * fastcall find_pid_ns(int nr, struct pid_namespace *ns)
>> {
>> struct hlist_node *elem;
>> - struct pid *pid;
>> + struct upid *pnr;
>> +
>> + hlist_for_each_entry_rcu(pnr, elem,
>> + &pid_hash[pid_hashfn(nr, ns)], pid_chain)
>> + if (pnr->nr == nr && pnr->ns == ns)
>> + return container_of(pnr, struct pid,
>> + numbers[ns->level]);
>
> Do we do this loop anywhere else? Should we have a for_each_pid() that
> makes this a little less messy?

No. Iteration over the hash chain happens here only.

>> - hlist_for_each_entry_rcu(pid, elem,
>> - &pid_hash[pid_hashfn(nr)], pid_chain) {
>> - if (pid->nr == nr)
>> - return pid;
>> - }
>> return NULL;
>> }
>> -EXPORT_SYMBOL_GPL(find_pid);
>> +EXPORT_SYMBOL_GPL(find_pid_ns);
>>
>> /*
>> * attach_pid() must be called with the tasklist_lock write-held.
>> @@ -318,12 +355,13 @@ struct task_struct * fastcall pid_task(s
>> /*
>> * Must be called under rcu_read_lock() or with tasklist_lock read-held.
>> */
>> -struct task_struct *find_task_by_pid_type(int type, int nr)
>> +struct task_struct *find_task_by_pid_type_ns(int type, int nr,
>> + struct pid_namespace *ns)
>> {
>> - return pid_task(find_pid(nr), type);
>> + return pid_task(find_pid_ns(nr, ns), type);
>> }
>>
>> -EXPORT_SYMBOL(find_task_by_pid_type);
>> +EXPORT_SYMBOL(find_task_by_pid_type_ns);
>>
>> struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
>> {
>> @@ -342,7 +426,7 @@ struct pid *find_get_pid(pid_t nr)
>> struct pid *pid;
>>
>> rcu_read_lock();
>> - pid = get_pid(find_pid(nr));
>> + pid = get_pid(find_vpid(nr));
>> rcu_read_unlock();
>
> OK, I think this is really confusing. If find_get_pid() finds vpids,
> should we not call it find_get_vpid()?

I'd better make it find_get_pid_ns() with two arguments and made a couple
of macros (or static inlines) for global search and local search.

> -- Dave
>
>

Thanks,
Pavel
Re: [PATCH 15/15] Hooks over the code to show correct values to user [message #15333 is a reply to message #15330] Fri, 27 July 2007 06:44 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
sukadev@us.ibm.com wrote:
> Pavel Emelianov [xemul@openvz.org] wrote:
> | diff -upr linux-2.6.23-rc1-mm1.orig/fs/proc/array.c
> | linux-2.6.23-rc1-mm1-7/fs/proc/array.c
> | --- linux-2.6.23-rc1-mm1.orig/fs/proc/array.c 2007-07-26
> | 16:34:45.000000000 +0400
> | +++ linux-2.6.23-rc1-mm1-7/fs/proc/array.c 2007-07-26
> | 16:36:36.000000000 +0400
> | @@ -77,6 +77,7 @@
> | #include <linux/cpuset.h>
> | #include <linux/rcupdate.h>
> | #include <linux/delayacct.h>
> | +#include <linux/pid_namespace.h>
> |
> | #include <asm/pgtable.h>
> | #include <asm/processor.h>
> | @@ -161,7 +162,9 @@ static inline char *task_state(struct ta
> | struct group_info *group_info;
> | int g;
> | struct fdtable *fdt = NULL;
> | + struct pid_namespace *ns;
> |
> | + ns = current->nsproxy->pid_ns;
> | rcu_read_lock();
> | buffer += sprintf(buffer,
> | "State:\t%s\n"
> | @@ -172,9 +175,12 @@ static inline char *task_state(struct ta
> | "Uid:\t%d\t%d\t%d\t%d\n"
> | "Gid:\t%d\t%d\t%d\t%d\n",
> | get_task_state(p),
> | - p->tgid, p->pid,
> | - pid_alive(p) ? rcu_dereference(p->real_parent)->tgid : 0,
> | - pid_alive(p) && p->ptrace ? rcu_dereference(p->parent)->pid
> | : 0,
> | + task_tgid_nr_ns(p, ns),
> | + task_pid_nr_ns(p, ns),
> | + pid_alive(p) ?
> | + task_ppid_nr_ns(p, ns) : 0,
> | + pid_alive(p) && p->ptrace ?
> | + task_tgid_nr_ns(rcu_dereference(p->parent), ns) : 0,
>
> Hmm. I think we have this reversed here. We should print the 'tgid' of
> task->real_parent and 'pid' of task->parent.
>

You mean smth like task_tgid_nr_ns(p->real_parent) ? Yes, thanks :)
Re: [PATCH 12/15] Miscelaneous stuff for pid namespaces [message #15334 is a reply to message #15306] Fri, 27 July 2007 06:53 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
sukadev@us.ibm.com wrote:
> Pavel Emelianov [xemul@openvz.org] wrote:
> | This includes
> | * the headers dependency fix
> | * task_child_reaper() correct declaration
> | * fixes for is_global_init() and is_container_init()
> |
> | Maybe this should come before all the other stuff...
> |
> | Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
> |
> | ---
> |
> | include/linux/pid_namespace.h | 4 ++--
> | include/linux/sched.h | 4 ++--
> | kernel/pid.c | 16 +++++++++++++---
> | 3 files changed, 17 insertions(+), 7 deletions(-)
> |
> | diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/pid_namespace.h
> | linux-2.6.23-rc1-mm1-7/include/linux/pid_namespace.h
> | --- linux-2.6.23-rc1-mm1.orig/include/linux/pid_namespace.h 2007-07-26
> | 16:34:45.000000000 +0400
> | +++ linux-2.6.23-rc1-mm1-7/include/linux/pid_namespace.h 2007-07-26
> | 16:36:36.000000000 +0400
> | @@ -4,7 +4,6 @@
> | #include <linux/sched.h>
> | #include <linux/mm.h>
> | #include <linux/threads.h>
> | -#include <linux/pid.h>
> | #include <linux/nsproxy.h>
> | #include <linux/kref.h>
> |
> | @@ -46,7 +53,8 @@ static inline struct pid_namespace *task
> |
> | static inline struct task_struct *task_child_reaper(struct task_struct *tsk)
> | {
> | - return init_pid_ns.child_reaper;
> | + BUG_ON(tsk != current);
> | + return tsk->nsproxy->pid_ns->child_reaper;
> | }
> |
> | #endif /* _LINUX_PID_NS_H */
> | diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/sched.h
> | linux-2.6.23-rc1-mm1-7/include/linux/sched.h
> | --- linux-2.6.23-rc1-mm1.orig/include/linux/sched.h 2007-07-26
> | 16:34:45.000000000 +0400
> | +++ linux-2.6.23-rc1-mm1-7/include/linux/sched.h 2007-07-26
> | 16:36:37.000000000 +0400
> | @@ -1277,13 +1366,13 @@ static inline int pid_alive(struct task_
> | *
> | * TODO: We should inline this function after some cleanups in
> | pid_namespace.h
> | */
>
> We can now remove this TODO.

Ok.

> | -extern int is_global_init(struct task_struct *tsk);
> | +extern int is_container_init(struct task_struct *tsk);
> |
> | /*
> | * is_container_init:
> | * check whether in the task is init in its own pid namespace.
> | */
>
> The function header describes is_container_init() but the function
> is actually is_global_init(). The opp is true for the function
> heaer of is_container_init() above.

:)

> | -static inline int is_container_init(struct task_struct *tsk)
> | +static inline int is_global_init(struct task_struct *tsk)
> | {
> | return tsk->pid == 1;
> | }
> | diff -upr linux-2.6.23-rc1-mm1.orig/kernel/pid.c
> | linux-2.6.23-rc1-mm1-7/kernel/pid.c
> | --- linux-2.6.23-rc1-mm1.orig/kernel/pid.c 2007-07-26
> | 16:34:45.000000000 +0400
> | +++ linux-2.6.23-rc1-mm1-7/kernel/pid.c 2007-07-26
> | 16:36:37.000000000 +0400
> | @@ -60,11 +62,21 @@ static inline int mk_pid(struct pid_name
> | };
> | EXPORT_SYMBOL(init_pid_ns);
> |
> | -int is_global_init(struct task_struct *tsk)
> | +int is_container_init(struct task_struct *tsk)
> | {
> | - return tsk == init_pid_ns.child_reaper;
> | + int ret;
>
> Initialize ret = 0 here would save a line of code below :-)

Well, I don't actually like initializing variables right when
they are declared. Even with trivial values :)

> | + struct pid *pid;
> | +
> | + ret = 0;
> | + rcu_read_lock();
> | + pid = task_pid(tsk);
> | + if (pid != NULL && pid->numbers[pid->level].nr == 1)
> | + ret = 1;
> | + rcu_read_unlock();
> | +
> | + return ret;
> | }
> | -EXPORT_SYMBOL(is_global_init);
> | +EXPORT_SYMBOL(is_container_init);
> |
> | /*
> | * Note: disable interrupts while the pidmap_lock is held as an
>
Re: [PATCH 1/15] Move exit_task_namespaces() [message #15335 is a reply to message #15323] Fri, 27 July 2007 06:38 Go to previous messageGo to next message
xemul is currently offline  xemul
Messages: 248
Registered: November 2005
Senior Member
Dave Hansen wrote:
> On Thu, 2007-07-26 at 18:46 +0400, Pavel Emelyanov wrote:
>> + write_unlock_irq(&tasklist_lock);
>> + exit_task_namespaces(tsk);
>> + write_lock_irq(&tasklist_lock);
>
> Are there any other side-effects of doing this? What was the

Looks like there are :( Oleg pointed out, that if any thread dies
it can re-parent all its children to this one, who already released
its namespaces... :(

> tasklist_lock protecting here when it was released?
>
> -- Dave
>
>
Re: [PATCH 12/15] Miscelaneous stuff for pid namespaces [message #15337 is a reply to message #15306] Fri, 27 July 2007 06:22 Go to previous messageGo to next message
Sukadev Bhattiprolu is currently offline  Sukadev Bhattiprolu
Messages: 413
Registered: August 2006
Senior Member
Pavel Emelianov [xemul@openvz.org] wrote:
| This includes
| * the headers dependency fix
| * task_child_reaper() correct declaration
| * fixes for is_global_init() and is_container_init()
|
| Maybe this should come before all the other stuff...
|
| Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
|
| ---
|
| include/linux/pid_namespace.h | 4 ++--
| include/linux/sched.h | 4 ++--
| kernel/pid.c | 16 +++++++++++++---
| 3 files changed, 17 insertions(+), 7 deletions(-)
|
| diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/pid_namespace.h
| linux-2.6.23-rc1-mm1-7/include/linux/pid_namespace.h
| --- linux-2.6.23-rc1-mm1.orig/include/linux/pid_namespace.h 2007-07-26
| 16:34:45.000000000 +0400
| +++ linux-2.6.23-rc1-mm1-7/include/linux/pid_namespace.h 2007-07-26
| 16:36:36.000000000 +0400
| @@ -4,7 +4,6 @@
| #include <linux/sched.h>
| #include <linux/mm.h>
| #include <linux/threads.h>
| -#include <linux/pid.h>
| #include <linux/nsproxy.h>
| #include <linux/kref.h>
|
| @@ -46,7 +53,8 @@ static inline struct pid_namespace *task
|
| static inline struct task_struct *task_child_reaper(struct task_struct *tsk)
| {
| - return init_pid_ns.child_reaper;
| + BUG_ON(tsk != current);
| + return tsk->nsproxy->pid_ns->child_reaper;
| }
|
| #endif /* _LINUX_PID_NS_H */
| diff -upr linux-2.6.23-rc1-mm1.orig/include/linux/sched.h
| linux-2.6.23-rc1-mm1-7/include/linux/sched.h
| --- linux-2.6.23-rc1-mm1.orig/include/linux/sched.h 2007-07-26
| 16:34:45.000000000 +0400
| +++ linux-2.6.23-rc1-mm1-7/include/linux/sched.h 2007-07-26
| 16:36:37.000000000 +0400
| @@ -1277,13 +1366,13 @@ static inline int pid_alive(struct task_
| *
| * TODO: We should inline this function after some cleanups in
| pid_namespace.h
| */

We can now remove this TODO.

| -extern int is_global_init(struct task_struct *tsk);
| +extern int is_container_init(struct task_struct *tsk);
|
| /*
| * is_container_init:
| * check whether in the task is init in its own pid namespace.
| */

The function header describes is_container_init() but the function
is actually is_global_init(). The opp is true for the function
heaer of is_container_init() above.

| -static inline int is_container_init(struct task_struct *tsk)
| +static inline int is_global_init(struct task_struct *tsk)
| {
| return tsk->pid == 1;
| }
| diff -upr linux-2.6.23-rc1-mm1.orig/kernel/pid.c
| linux-2.6.23-rc1-mm1-7/kernel/pid.c
| --- linux-2.6.23-rc1-mm1.orig/kernel/pid.c 2007-07-26
| 16:34:45.000000000 +0400
| +++ linux-2.6.23-rc1-mm1-7/kernel/pid.c 2007-07-26
| 16:36:37.000000000 +0400
| @@ -60,11 +62,21 @@ static inline int mk_pid(struct pid_name
| };
| EXPORT_SYMBOL(init_pid_ns);
|
| -int is_global_init(struct task_struct *tsk)
| +int is_container_init(struct task_struct *tsk)
| {
| - return tsk == init_pid_ns.child_reaper;
| + int ret;

Initialize ret = 0 here would save a line of code below :-)

| + struct pid *pid;
| +
| + ret = 0;
| + rcu_read_lock();
| + pid = task_pid(tsk);
| + if (pid != NULL && pid->numbers[pid->level].nr == 1)
| + ret = 1;
| + rcu_read_unlock();
| +
| + return ret;
| }
| -EXPORT_SYMBOL(is_global_init);
| +EXPORT_SYMBOL(is_container_init);
|
| /*
| * Note: disable interrupts while the pidmap_lock is held as an
Re: [PATCH 1/15] Move exit_task_namespaces() [message #15338 is a reply to message #15320] Fri, 27 July 2007 08:07 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/26, Oleg Nesterov wrote:
>
> On 07/26, Pavel Emelyanov wrote:
> >
> > Make task release its namespaces after it has reparented all his
> > children to child_reaper, but before it notifies its parent about
> > its death.
> >
> > The reason to release namespaces after reparenting is that when task
> > exits it may send a signal to its parent (SIGCHLD), but if the parent
> > has already exited its namespaces there will be no way to decide what
> > pid to dever to him - parent can be from different namespace.
> >
> > The reason to release namespace before notifying the parent it that
> > when task sends a SIGCHLD to parent it can call wait() on this taks
> > and release it. But releasing the mnt namespace implies dropping
> > of all the mounts in the mnt namespace and NFS expects the task to
> > have valid sighand pointer.
> >
> > Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
> >
> > ---
> >
> > exit.c | 5 ++++-
> > 1 files changed, 4 insertions(+), 1 deletion(-)
> >
> > diff -upr linux-2.6.23-rc1-mm1.orig/kernel/exit.c
> > linux-2.6.23-rc1-mm1-7/kernel/exit.c
> > --- linux-2.6.23-rc1-mm1.orig/kernel/exit.c 2007-07-26
> > 16:34:45.000000000 +0400
> > +++ linux-2.6.23-rc1-mm1-7/kernel/exit.c 2007-07-26
> > 16:36:37.000000000 +0400
> > @@ -788,6 +804,10 @@ static void exit_notify(struct task_stru
> > BUG_ON(!list_empty(&tsk->children));
> > BUG_ON(!list_empty(&tsk->ptrace_children));
> >
> > + write_unlock_irq(&tasklist_lock);
> > + exit_task_namespaces(tsk);
> > + write_lock_irq(&tasklist_lock);
>
> No.
>
> We "cleared" our ->children/->ptrace_children lists. Now suppose that
> another thread dies, and its forget_original_parent() choose us as a
> new reaper before we re-take tasklist.

Perhaps, we can do something like the patch below. Roland, what do you
think?

We can check PF_EXITING instead of ->exit_state while choosing the new
parent. Note that tasklits_lock acts as a barrier, everyone who takes
tasklist after us (when forget_original_parent() drops it) must see
PF_EXITING.

Oleg.

--- t/kernel/exit.c~ 2007-07-27 11:32:21.000000000 +0400
+++ t/kernel/exit.c 2007-07-27 11:59:09.000000000 +0400
@@ -686,11 +686,14 @@ reparent_thread(struct task_struct *p, s
* the child reaper process (ie "init") in our pid
* space.
*/
-static void
-forget_original_parent(struct task_struct *father, struct list_head *to_release)
+static void forget_original_parent(struct task_struct *father)
{
struct task_struct *p, *reaper = father;
- struct list_head *_p, *_n;
+ struct list_head *ptrace_dead, *_p, *_n;
+
+ INIT_LIST_HEAD(&ptrace_dead);
+
+ write_lock_irq(&tasklist_lock);

do {
reaper = next_thread(reaper);
@@ -698,7 +701,7 @@ forget_original_parent(struct task_struc
reaper = child_reaper(father);
break;
}
- } while (reaper->exit_state);
+ } while (reaper->flags & PF_EXITING);

/*
* There are only two places where our children can be:
@@ -736,13 +739,25 @@ forget_original_parent(struct task_struc
* while it was being traced by us, to be able to see it in wait4.
*/
if (unlikely(ptrace && p->exit_state == EXIT_ZOMBIE && p->exit_signal == -1))
- list_add(&p->ptrace_list, to_release);
+ list_add(&p->ptrace_list, &ptrace_dead);
}
+
list_for_each_safe(_p, _n, &father->ptrace_children) {
p = list_entry(_p, struct task_struct, ptrace_list);
choose_new_parent(p, reaper);
reparent_thread(p, father, 1);
}
+
+ write_unlock_irq(&tasklist_lock);
+ BUG_ON(!list_empty(&tsk->children));
+ BUG_ON(!list_empty(&tsk->ptrace_children));
+
+ list_for_each_safe(_p, _n, &ptrace_dead) {
+ list_del_init(_p);
+ t = list_entry(_p, struct task_struct, ptrace_list);
+ release_task(t);
+ }
+
}

/*
@@ -753,7 +768,6 @@ static void exit_notify(struct task_stru
{
int state;
struct task_struct *t;
- struct list_head ptrace_dead, *_p, *_n;
struct pid *pgrp;

if (signal_pending(tsk) && !(tsk->signal->flags & SIGNAL_GROUP_EXIT)
@@ -776,8 +790,6 @@ static void exit_notify(struct task_stru
read_unlock(&tasklist_lock);
}

- write_lock_irq(&tasklist_lock);
-
/*
* This does two things:
*
@@ -786,12 +798,9 @@ static void exit_notify(struct task_stru
* as a result of our exiting, and if they have any stopped
* jobs, send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2)
*/
+ forget_original_parent(tsk);

- INIT_LIST_HEAD(&ptrace_dead);
- forget_original_parent(tsk, &ptrace_dead);
- BUG_ON(!list_empty(&tsk->children));
- BUG_ON(!list_empty(&tsk->ptrace_children));
-
+ write_lock_irq(&tasklist_lock);
/*
* Check to see if any process groups have become orphaned
* as a result of our exiting, and if they have any stopped
@@ -801,9 +810,8 @@ static void exit_notify(struct task_stru
* and we were the only connection outside, so our pgrp
* is about to become orphaned.
*/
-
t = tsk->real_parent;
-
+
pgrp = task_pgrp(tsk);
if ((task_pgrp(t) != pgrp) &&
(task_session(t) == task_session(tsk)) &&
@@ -826,9 +834,8 @@ static void exit_notify(struct task_stru
* If our self_exec id doesn't match our parent_exec_id then
* we have changed execution domain as these two values started
* the same after a fork.
- *
*/
-
+
if (tsk->exit_signal != SIGCHLD && tsk->exit_signal != -1 &&
( tsk->parent_exec_id != t->self_exec_id ||
tsk->self_exec_id != tsk->parent_exec_id)
@@ -856,12 +863,6 @@ static void exit_notify(struct task_stru

write_unlock_irq(&tasklist_lock);

- list_for_each_safe(_p, _n, &ptrace_dead) {
- list_del_init(_p);
- t = list_entry(_p, struct task_struct, ptrace_list);
- release_task(t);
- }
-
/* If the process is dead, release it - nobody will wait for it */
if (state == EXIT_DEAD)
release_task(tsk);
Re: [PATCH 1/15] Move exit_task_namespaces() [message #15339 is a reply to message #15338] Fri, 27 July 2007 08:24 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> On 07/26, Oleg Nesterov wrote:
>> On 07/26, Pavel Emelyanov wrote:
>>> Make task release its namespaces after it has reparented all his
>>> children to child_reaper, but before it notifies its parent about
>>> its death.
>>>
>>> The reason to release namespaces after reparenting is that when task
>>> exits it may send a signal to its parent (SIGCHLD), but if the parent
>>> has already exited its namespaces there will be no way to decide what
>>> pid to dever to him - parent can be from different namespace.
>>>
>>> The reason to release namespace before notifying the parent it that
>>> when task sends a SIGCHLD to parent it can call wait() on this taks
>>> and release it. But releasing the mnt namespace implies dropping
>>> of all the mounts in the mnt namespace and NFS expects the task to
>>> have valid sighand pointer.
>>>
>>> Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
>>>
>>> ---
>>>
>>> exit.c | 5 ++++-
>>> 1 files changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff -upr linux-2.6.23-rc1-mm1.orig/kernel/exit.c
>>> linux-2.6.23-rc1-mm1-7/kernel/exit.c
>>> --- linux-2.6.23-rc1-mm1.orig/kernel/exit.c 2007-07-26
>>> 16:34:45.000000000 +0400
>>> +++ linux-2.6.23-rc1-mm1-7/kernel/exit.c 2007-07-26
>>> 16:36:37.000000000 +0400
>>> @@ -788,6 +804,10 @@ static void exit_notify(struct task_stru
>>> BUG_ON(!list_empty(&tsk->children));
>>> BUG_ON(!list_empty(&tsk->ptrace_children));
>>>
>>> + write_unlock_irq(&tasklist_lock);
>>> + exit_task_namespaces(tsk);
>>> + write_lock_irq(&tasklist_lock);
>> No.
>>
>> We "cleared" our ->children/->ptrace_children lists. Now suppose that
>> another thread dies, and its forget_original_parent() choose us as a
>> new reaper before we re-take tasklist.
>
> Perhaps, we can do something like the patch below. Roland, what do you
> think?
>
> We can check PF_EXITING instead of ->exit_state while choosing the new

Heh :) I've came to the same conclusion and now I'm checking for it.
But my patch is much simpler that yours - it just checks for PF_EXITING
in forget_original_parent:

--- ./kernel/exit.c.exitfix 2007-07-27 12:13:25.000000000 +0400
+++ ./kernel/exit.c 2007-07-27 12:15:35.000000000 +0400
@@ -712,7 +712,7 @@ forget_original_parent(struct task_struc
reaper = task_child_reaper(father);
break;
}
- } while (reaper->exit_state);
+ } while (reaper->flags & PF_EXITING);

/*
* There are only two places where our children can be:


> parent. Note that tasklits_lock acts as a barrier, everyone who takes
> tasklist after us (when forget_original_parent() drops it) must see
> PF_EXITING.
>
> Oleg.
>
> --- t/kernel/exit.c~ 2007-07-27 11:32:21.000000000 +0400
> +++ t/kernel/exit.c 2007-07-27 11:59:09.000000000 +0400
> @@ -686,11 +686,14 @@ reparent_thread(struct task_struct *p, s
> * the child reaper process (ie "init") in our pid
> * space.
> */
> -static void
> -forget_original_parent(struct task_struct *father, struct list_head *to_release)
> +static void forget_original_parent(struct task_struct *father)
> {
> struct task_struct *p, *reaper = father;
> - struct list_head *_p, *_n;
> + struct list_head *ptrace_dead, *_p, *_n;
> +
> + INIT_LIST_HEAD(&ptrace_dead);
> +
> + write_lock_irq(&tasklist_lock);
> do {
> reaper = next_thread(reaper);
> @@ -698,7 +701,7 @@ forget_original_parent(struct task_struc
> reaper = child_reaper(father);
> break;
> }
> - } while (reaper->exit_state);
> + } while (reaper->flags & PF_EXITING);
>
> /*
> * There are only two places where our children can be:
> @@ -736,13 +739,25 @@ forget_original_parent(struct task_struc
> * while it was being traced by us, to be able to see it in wait4.
> */
> if (unlikely(ptrace && p->exit_state == EXIT_ZOMBIE && p->exit_signal == -1))
> - list_add(&p->ptrace_list, to_release);
> + list_add(&p->ptrace_list, &ptrace_dead);
> }
> +
> list_for_each_safe(_p, _n, &father->ptrace_children) {
> p = list_entry(_p, struct task_struct, ptrace_list);
> choose_new_parent(p, reaper);
> reparent_thread(p, father, 1);
> }
> +
> + write_unlock_irq(&tasklist_lock);
> + BUG_ON(!list_empty(&tsk->children));
> + BUG_ON(!list_empty(&tsk->ptrace_children));
> +
> + list_for_each_safe(_p, _n, &ptrace_dead) {
> + list_del_init(_p);
> + t = list_entry(_p, struct task_struct, ptrace_list);
> + release_task(t);
> + }
> +
> }
>
> /*
> @@ -753,7 +768,6 @@ static void exit_notify(struct task_stru
> {
> int state;
> struct task_struct *t;
> - struct list_head ptrace_dead, *_p, *_n;
> struct pid *pgrp;
>
> if (signal_pending(tsk) && !(tsk->signal->flags & SIGNAL_GROUP_EXIT)
> @@ -776,8 +790,6 @@ static void exit_notify(struct task_stru
> read_unlock(&tasklist_lock);
> }
>
> - write_lock_irq(&tasklist_lock);
> -
> /*
> * This does two things:
> *
> @@ -786,12 +798,9 @@ static void exit_notify(struct task_stru
> * as a result of our exiting, and if they have any stopped
> * jobs, send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2)
> */
> + forget_original_parent(tsk);
>
> - INIT_LIST_HEAD(&ptrace_dead);
> - forget_original_parent(tsk, &ptrace_dead);
> - BUG_ON(!list_empty(&tsk->children));
> - BUG_ON(!list_empty(&tsk->ptrace_children));
> -
> + write_lock_irq(&tasklist_lock);
> /*
> * Check to see if any process groups have become orphaned
> * as a result of our exiting, and if they have any stopped
> @@ -801,9 +810,8 @@ static void exit_notify(struct task_stru
> * and we were the only connection outside, so our pgrp
> * is about to become orphaned.
> */
> -
> t = tsk->real_parent;
> -
> +
> pgrp = task_pgrp(tsk);
> if ((task_pgrp(t) != pgrp) &&
> (task_session(t) == task_session(tsk)) &&
> @@ -826,9 +834,8 @@ static void exit_notify(struct task_stru
> * If our self_exec id doesn't match our parent_exec_id then
> * we have changed execution domain as these two values started
> * the same after a fork.
> - *
> */
> -
> +
> if (tsk->exit_signal != SIGCHLD && tsk->exit_signal != -1 &&
> ( tsk->parent_exec_id != t->self_exec_id ||
> tsk->self_exec_id != tsk->parent_exec_id)
> @@ -856,12 +863,6 @@ static void exit_notify(struct task_stru
>
> write_unlock_irq(&tasklist_lock);
>
> - list_for_each_safe(_p, _n, &ptrace_dead) {
> - list_del_init(_p);
> - t = list_entry(_p, struct task_struct, ptrace_list);
> - release_task(t);
> - }
> -
> /* If the process is dead, release it - nobody will wait for it */
> if (state == EXIT_DEAD)
> release_task(tsk);
>
>
Re: [PATCH 1/15] Move exit_task_namespaces() [message #15340 is a reply to message #15339] Fri, 27 July 2007 08:35 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/27, Pavel Emelyanov wrote:
>
> Oleg Nesterov wrote:
> >
> >Perhaps, we can do something like the patch below. Roland, what do you
> >think?
> >
> >We can check PF_EXITING instead of ->exit_state while choosing the new
>
> Heh :) I've came to the same conclusion and now I'm checking for it.
> But my patch is much simpler that yours - it just checks for PF_EXITING
> in forget_original_parent:
>
> --- ./kernel/exit.c.exitfix 2007-07-27 12:13:25.000000000 +0400
> +++ ./kernel/exit.c 2007-07-27 12:15:35.000000000 +0400
> @@ -712,7 +712,7 @@ forget_original_parent(struct task_struc
> reaper = task_child_reaper(father);
> break;
> }
> - } while (reaper->exit_state);
> + } while (reaper->flags & PF_EXITING);

Yes, other changes are just cleanups. They just move some code from exit_notify
to forget_original_parent(). It is a bit silly to declare ptrace_dead in
exit_notify(), take tasklist, pass ptrace_dead to forget_original_parent(),
unlock-lock-unlock tasklist, and then use ptrace_dead.

Oleg.
Re: [PATCH 1/15] Move exit_task_namespaces() [message #15342 is a reply to message #15340] Fri, 27 July 2007 08:37 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> On 07/27, Pavel Emelyanov wrote:
>> Oleg Nesterov wrote:
>>> Perhaps, we can do something like the patch below. Roland, what do you
>>> think?
>>>
>>> We can check PF_EXITING instead of ->exit_state while choosing the new
>> Heh :) I've came to the same conclusion and now I'm checking for it.
>> But my patch is much simpler that yours - it just checks for PF_EXITING
>> in forget_original_parent:
>>
>> --- ./kernel/exit.c.exitfix 2007-07-27 12:13:25.000000000 +0400
>> +++ ./kernel/exit.c 2007-07-27 12:15:35.000000000 +0400
>> @@ -712,7 +712,7 @@ forget_original_parent(struct task_struc
>> reaper = task_child_reaper(father);
>> break;
>> }
>> - } while (reaper->exit_state);
>> + } while (reaper->flags & PF_EXITING);
>
> Yes, other changes are just cleanups. They just move some code from exit_notify
> to forget_original_parent(). It is a bit silly to declare ptrace_dead in
> exit_notify(), take tasklist, pass ptrace_dead to forget_original_parent(),
> unlock-lock-unlock tasklist, and then use ptrace_dead.

:) Ok. Thanks, I then will check with your patch. If you don't mind
I will carry it together with this set.

> Oleg.

Thanks,
Pavel
Re: [PATCH 11/15] Signal semantics [message #15346 is a reply to message #15305] Fri, 27 July 2007 12:30 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
Damn. I don't have time to read these patches today (will try tomorrow),
but when I glanced at this patch yesterday I had some suspicions...

On 07/26, Pavel Emelyanov wrote:
>
> +++ linux-2.6.23-rc1-mm1-7/kernel/signal.c 2007-07-26
> 16:36:37.000000000 +0400
> @@ -323,6 +325,9 @@ static int collect_signal(int sig, struc
> if (first) {
> list_del_init(&first->list);
> copy_siginfo(info, &first->info);
> + if (first->flags & SIGQUEUE_CINIT)
> + kinfo->flags |= KERN_SIGINFO_CINIT;
> +
>
> [...snip...]
>
> @@ -1852,7 +1950,7 @@ relock:
> * within that pid space. It can of course get signals from
> * its parent pid space.
> */
> - if (current == task_child_reaper(current))
> + if (kinfo.flags & KERN_SIGINFO_CINIT)
> continue;

I think the whole idea is broken, it assumes the sender put something into
"struct sigqueue".

Suppose that /sbin/init has no handler for (say) SIGTERM, and we send this
signal from the same namespace. send_signal() sets SIGQUEUE_CINIT, but it
is lost because __group_complete_signal() silently "converts" sig_fatal()
signals to SIGKILL using sigaddset().

> +static void encode_sender_info(struct task_struct *t, struct sigqueue *q)
> +{
> + /*
> + * If sender (i.e 'current') and receiver have the same active
> + * pid namespace and the receiver is the container-init, set the
> + * SIGQUEUE_CINIT flag. This tells the container-init that the
> + * signal originated in its own namespace and so it can choose
> + * to ignore the signal.
> + *
> + * If the receiver is the container-init of a pid namespace,
> + * but the sender is from an ancestor pid namespace, the
> + * container-init cannot ignore the signal. So clear the
> + * SIGQUEUE_CINIT flag in this case.
> + *
> + * Also, if the sender does not have a pid_t in the receiver's
> + * active pid namespace, set si_pid to 0 and pretend it originated
> + * from the kernel.
> + */
> + if (pid_ns_equal(t)) {
> + if (is_container_init(t)) {
> + q->flags |= SIGQUEUE_CINIT;

Ironically, this change carefully preserves the bug we already have :)

This doesn't protect init from "bad" signal if we send it to sub-thread
of init. Actually, this make the behaviour a bit worse compared to what
we currently have. Currently, at least the main init's thread survives
if we send SIGKILL to sub-thread.

> static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
> struct sigpending *signals)
> {
> @@ -710,6 +781,7 @@ static int send_signal(int sig, struct s
> copy_siginfo(&q->info, info);
> break;
> }
> + encode_sender_info(t, q);

We still send the signal if __sigqueue_alloc() fails. In that case, the
dequeued siginfo won't have SIGQUEUE_CINIT/KERN_SIGINFO_CINIT, not good.

> @@ -1158,6 +1232,13 @@ static int kill_something_info(int sig,
>
> read_lock(&tasklist_lock);
> for_each_process(p) {
> + /*
> + * System-wide signals apply only to the sender's
> + * pid namespace, unless issued from init_pid_ns.
> + */
> + if (!task_visible_in_pid_ns(p, my_ns))
> + continue;
> +
> if (p->pid > 1 && p->tgid != current->tgid) {

This "p->pid > 1" check should die.

> +static int deny_signal_to_container_init(struct task_struct *tsk, int sig)
> +{
> + /*
> + * If receiver is the container-init of sender and signal is SIGKILL
> + * reject it right-away. If signal is any other one, let the container
> + * init decide (in get_signal_to_deliver()) whether to handle it or
> + * ignore it.
> + */
> + if (is_container_init(tsk) && (sig == SIGKILL) && pid_ns_equal(tsk))
> + return -EPERM;
> +
> + return 0;
> +}
> +
> /*
> * Bad permissions for sending the signal
> */
> @@ -545,6 +584,10 @@ static int check_kill_permission(int sig
> && !capable(CAP_KILL))
> return error;
>
> + error = deny_signal_to_container_init(t, sig);
> + if (error)
> + return error;

Hm. Could you explain this change? Why do we need a special check for SIGKILL?


(What about ptrace_attach() btw? If it is possible to send a signal to init
from the "parent" namespace, perhaps it makes sense to allow ptracing as
well).

Oleg.
Re: [PATCH 11/15] Signal semantics [message #15351 is a reply to message #15346] Fri, 27 July 2007 13:38 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> Damn. I don't have time to read these patches today (will try tomorrow),

Oh, that's OK. I was about to send the set to Andrew only the next week.

This patch is the most strange one and is to be discussed a lot.

We try to do the following two things:
1. signals going from the namespace, that the target task doesn't
see must be seen as SI_KERNEL if siginfo is allocated;
2. signals to init of any namespace must be allowed to send from
one of the parent namespaces only. From child namespace, init
needs only those, that it's ready to handle (SIGCHLD).

As far as I understand Suka's approach (it's his patch, so I may
be not 100% correct - it's better to wait for his comments) he is
trying to carry the information about the signal up to the
get_signal_to_deliver().

As far as the first issue is concerned, the solution is obvious -
all the "calculations" can be done at the beginning of sending the
signal, but the second issue is a bit more complicated and I have
no good ideas of how to solve this :( yet.

Thanks,
Pavel

> but when I glanced at this patch yesterday I had some suspicions...
>
> On 07/26, Pavel Emelyanov wrote:
>> +++ linux-2.6.23-rc1-mm1-7/kernel/signal.c 2007-07-26
>> 16:36:37.000000000 +0400
>> @@ -323,6 +325,9 @@ static int collect_signal(int sig, struc
>> if (first) {
>> list_del_init(&first->list);
>> copy_siginfo(info, &first->info);
>> + if (first->flags & SIGQUEUE_CINIT)
>> + kinfo->flags |= KERN_SIGINFO_CINIT;
>> +
>>
>> [...snip...]
>>
>> @@ -1852,7 +1950,7 @@ relock:
>> * within that pid space. It can of course get signals from
>> * its parent pid space.
>> */
>> - if (current == task_child_reaper(current))
>> + if (kinfo.flags & KERN_SIGINFO_CINIT)
>> continue;
>
> I think the whole idea is broken, it assumes the sender put something into
> "struct sigqueue".

Yup. That's the problem. It seems to me that the only way to handle init's
signals is to check for permissions in the sending path.

> Suppose that /sbin/init has no handler for (say) SIGTERM, and we send this
> signal from the same namespace. send_signal() sets SIGQUEUE_CINIT, but it
> is lost because __group_complete_signal() silently "converts" sig_fatal()
> signals to SIGKILL using sigaddset().
>
>> +static void encode_sender_info(struct task_struct *t, struct sigqueue *q)
>> +{
>> + /*
>> + * If sender (i.e 'current') and receiver have the same active
>> + * pid namespace and the receiver is the container-init, set the
>> + * SIGQUEUE_CINIT flag. This tells the container-init that the
>> + * signal originated in its own namespace and so it can choose
>> + * to ignore the signal.
>> + *
>> + * If the receiver is the container-init of a pid namespace,
>> + * but the sender is from an ancestor pid namespace, the
>> + * container-init cannot ignore the signal. So clear the
>> + * SIGQUEUE_CINIT flag in this case.
>> + *
>> + * Also, if the sender does not have a pid_t in the receiver's
>> + * active pid namespace, set si_pid to 0 and pretend it originated
>> + * from the kernel.
>> + */
>> + if (pid_ns_equal(t)) {
>> + if (is_container_init(t)) {
>> + q->flags |= SIGQUEUE_CINIT;
>
> Ironically, this change carefully preserves the bug we already have :)
>
> This doesn't protect init from "bad" signal if we send it to sub-thread
> of init. Actually, this make the behaviour a bit worse compared to what
> we currently have. Currently, at least the main init's thread survives
> if we send SIGKILL to sub-thread.
>
>> static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
>> struct sigpending *signals)
>> {
>> @@ -710,6 +781,7 @@ static int send_signal(int sig, struct s
>> copy_siginfo(&q->info, info);
>> break;
>> }
>> + encode_sender_info(t, q);
>
> We still send the signal if __sigqueue_alloc() fails. In that case, the
> dequeued siginfo won't have SIGQUEUE_CINIT/KERN_SIGINFO_CINIT, not good.
>
>> @@ -1158,6 +1232,13 @@ static int kill_something_info(int sig,
>>
>> read_lock(&tasklist_lock);
>> for_each_process(p) {
>> + /*
>> + * System-wide signals apply only to the sender's
>> + * pid namespace, unless issued from init_pid_ns.
>> + */
>> + if (!task_visible_in_pid_ns(p, my_ns))
>> + continue;
>> +
>> if (p->pid > 1 && p->tgid != current->tgid) {
>
> This "p->pid > 1" check should die.
>
>> +static int deny_signal_to_container_init(struct task_struct *tsk, int sig)
>> +{
>> + /*
>> + * If receiver is the container-init of sender and signal is SIGKILL
>> + * reject it right-away. If signal is any other one, let the container
>> + * init decide (in get_signal_to_deliver()) whether to handle it or
>> + * ignore it.
>> + */
>> + if (is_container_init(tsk) && (sig == SIGKILL) && pid_ns_equal(tsk))
>> + return -EPERM;
>> +
>> + return 0;
>> +}
>> +
>> /*
>> * Bad permissions for sending the signal
>> */
>> @@ -545,6 +584,10 @@ static int check_kill_permission(int sig
>> && !capable(CAP_KILL))
>> return error;
>>
>> + error = deny_signal_to_container_init(t, sig);
>> + if (error)
>> + return error;
>
> Hm. Could you explain this change? Why do we need a special check for SIGKILL?
>
>
> (What about ptrace_attach() btw? If it is possible to send a signal to init
> from the "parent" namespace, perhaps it makes sense to allow ptracing as
> well).

ptracing of tasks fro different namespaces is not possible at all, since
strace utility determines the fork()-ed child pid from the parent's eax
register, which would contain the pid value as this parent sees his child.
But if the strace is in different namespace - it won't be able to find
this child with the pid value from parent's eax.

Maybe it's worth disabling cross-namespaces ptracing...

>
> Oleg.
>
>
Re: [PATCH 9/15] Move alloc_pid() after the namespace is cloned [message #15355 is a reply to message #15303] Fri, 27 July 2007 15:10 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/26, Pavel Emelyanov wrote:
>
> This is a fix for Sukadev's patch that moved the alloc_pid() call from
> do_fork() into copy_process().

... and this patch changes almost every line from Sukadev's patch.
Sorry gents, but isn't it better to ask Andrew to drop that patch
(which is quite useless by itself), and send a new one which incorporates
all necessary changes? Imho, it would be much easier to understand.

> @@ -1406,7 +1422,13 @@ long do_fork(unsigned long clone_flags,
> if (!IS_ERR(p)) {
> struct completion vfork;
>
> - nr = pid_nr(task_pid(p));
> + /*
> + * this is enough to call pid_nr_ns here, but this if
> + * improves optimisation of regular fork()
> + */
> + nr = (clone_flags & CLONE_NEWPID) ?
> + task_pid_nr_ns(p, current->nsproxy->pid_ns) :
> + task_pid_vnr(p);

Shouldn't we do the same for CLONE_PARENT_SETTID in copy_process() ?
Otherwise *parent_tidptr may have a wrong value which doesn't match
to what fork() returns.

Oleg.
Re: [PATCH 11/15] Signal semantics [message #15357 is a reply to message #15305] Fri, 27 July 2007 19:59 Go to previous messageGo to next message
serue is currently offline  serue
Messages: 750
Registered: February 2006
Senior Member
Quoting sukadev@us.ibm.com (sukadev@us.ibm.com):
> Pavel Emelianov [xemul@openvz.org] wrote:
> | Oleg Nesterov wrote:
> | >Damn. I don't have time to read these patches today (will try tomorrow),
> |
> | Oh, that's OK. I was about to send the set to Andrew only the next week.
> |
> | This patch is the most strange one and is to be discussed a lot.
> |
> | We try to do the following two things:
> | 1. signals going from the namespace, that the target task doesn't
> | see must be seen as SI_KERNEL if siginfo is allocated;
> | 2. signals to init of any namespace must be allowed to send from
> | one of the parent namespaces only. From child namespace, init
> | needs only those, that it's ready to handle (SIGCHLD).
>
> Yes.
>
> |
> | As far as I understand Suka's approach (it's his patch, so I may
> | be not 100% correct - it's better to wait for his comments) he is
> | trying to carry the information about the signal up to the
> | get_signal_to_deliver().
> |
> | As far as the first issue is concerned, the solution is obvious -
> | all the "calculations" can be done at the beginning of sending the
> | signal, but the second issue is a bit more complicated and I have
> | no good ideas of how to solve this :( yet.
>
> Even I am looking for a better approach.
>
> |
> | Thanks,
> | Pavel
> |
> | >but when I glanced at this patch yesterday I had some suspicions...
> | >
> | >On 07/26, Pavel Emelyanov wrote:
> | >>+++ linux-2.6.23-rc1-mm1-7/kernel/signal.c 2007-07-26
> | >>16:36:37.000000000 +0400
> | >>@@ -323,6 +325,9 @@ static int collect_signal(int sig, struc
> | >> if (first) {
> | >> list_del_init(&first->list);
> | >> copy_siginfo(info, &first->info);
> | >>+ if (first->flags & SIGQUEUE_CINIT)
> | >>+ kinfo->flags |= KERN_SIGINFO_CINIT;
> | >>+
> | >>
> | >>[...snip...]
> | >>
> | >>@@ -1852,7 +1950,7 @@ relock:
> | >> * within that pid space. It can of course get signals from
> | >> * its parent pid space.
> | >> */
> | >>- if (current == task_child_reaper(current))
> | >>+ if (kinfo.flags & KERN_SIGINFO_CINIT)
> | >> continue;
> | >
> | >I think the whole idea is broken, it assumes the sender put something into
> | >"struct sigqueue".
> |
> | Yup. That's the problem. It seems to me that the only way to handle init's
> | signals is to check for permissions in the sending path.
>
> We can check permissions in the sending path - and in fact we do check for
> SIGKILL case (deny_signal_to_container_init() below).
>
> But the receiver knows/decides whether or not the signal is wanted/not. No ?
>
> Are you saying we should check/special case all fatal signals ?
>
> |
> | >Suppose that /sbin/init has no handler for (say) SIGTERM, and we send this
> | >signal from the same namespace. send_signal() sets SIGQUEUE_CINIT, but it
> | >is lost because __group_complete_signal() silently "converts" sig_fatal()
> | >signals to SIGKILL using sigaddset().
>
> Yes, I should have called it out, but this patch currently assumes /sbin/init
> (or container-init) has a handler for the fatal signals like SIGTERM and has
> a check for SIGKILL (in deny_signal_to_container_init() - as Oleg noted below).
>
> Still looking for better ways to implement.
>
> | >
> | >>+static void encode_sender_info(struct task_struct *t, struct sigqueue *q)
> | >>+{
> | >>+ /*
> | >>+ * If sender (i.e 'current') and receiver have the same active
> | >>+ * pid namespace and the receiver is the container-init, set the
> | >>+ * SIGQUEUE_CINIT flag. This tells the container-init that the
> | >>+ * signal originated in its own namespace and so it can choose
> | >>+ * to ignore the signal.
> | >>+ *
> | >>+ * If the receiver is the container-init of a pid namespace,
> | >>+ * but the sender is from an ancestor pid namespace, the
> | >>+ * container-init cannot ignore the signal. So clear the
> | >>+ * SIGQUEUE_CINIT flag in this case.
> | >>+ *
> | >>+ * Also, if the sender does not have a pid_t in the receiver's
> | >>+ * active pid namespace, set si_pid to 0 and pretend it originated
> | >>+ * from the kernel.
> | >>+ */
> | >>+ if (pid_ns_equal(t)) {
> | >>+ if (is_container_init(t)) {
> | >>+ q->flags |= SIGQUEUE_CINIT;
> | >
> | >Ironically, this change carefully preserves the bug we already have :)
> | >
> | >This doesn't protect init from "bad" signal if we send it to sub-thread
> | >of init. Actually, this make the behaviour a bit worse compared to what
> | >we currently have. Currently, at least the main init's thread survives
> | >if we send SIGKILL to sub-thread.
>
> Do you mean "init's main thread" ? But doesn't SIGKILL to any thread kill
> the entire process ?
>
> | >
> | >>static int send_signal(int sig, struct siginfo *info, struct task_struct
> | >>*t,
> | >> struct sigpending *signals)
> | >>{
> | >>@@ -710,6 +781,7 @@ static int send_signal(int sig, struct s
> | >> copy_siginfo(&q->info, info);
> | >> break;
> | >> }
> | >>+ encode_sender_info(t, q);
> | >
> | >We still send the signal if __sigqueue_alloc() fails. In that case, the
> | >dequeued siginfo won't have SIGQUEUE_CINIT/KERN_SIGINFO_CINIT, not good.
>
> Yes.
>
> | >
> | >>@@ -1158,6 +1232,13 @@ static int kill_something_info(int sig,
> | >>
> | >> read_lock(&tasklist_lock);
> | >> for_each_process(p) {
> | >>+ /*
> | >>+ * System-wide signals apply only to the sender's
> | >>+ * pid namespace, unless issued from init_pid_ns.
> | >>+ */
> | >>+ if (!task_visible_in_pid_ns(p, my_ns))
> | >>+ continue;
> | >>+
> | >> if (p->pid > 1 && p->tgid != current->tgid) {
> | >
> | >This "p->pid > 1" check should die.
> | >
>
> Ok.
>
> | >>+static int deny_signal_to_container_init(struct task_struct *tsk, int
> | >>sig)
> | >>+{
> | >>+ /*
> | >>+ * If receiver is the container-init of sender and signal is SIGKILL
> | >>+ * reject it right-away. If signal is any other one, let the
> | >>container
> | >>+ * init decide (in get_signal_to_deliver()) whether to handle it or
> | >>+ * ignore it.
> | >>+ */
> | >>+ if (is_container_init(tsk) && (sig == SIGKILL) && pid_ns_equal(tsk))
> | >>+ return -EPERM;
> | >>+
> | >>+ return 0;
> | >>+}
> | >>+
> | >>/*
> | >> * Bad permissions for sending the signal
> | >> */
> | >>@@ -545,6 +584,10 @@ static int check_kill_permission(int sig
> | >> && !capable(CAP_KILL))
> | >> return error;
> | >>
> | >>+ error = deny_signal_to_container_init(t, sig);
> | >>+ if (error)
> | >>+ return error;
> | >
> | >Hm. Could you explain this change? Why do we need a special check for
> | >SIGKILL?
>
> As you pointed out above, SIGKILL goes through the __group_complete_signal()/
> sigaddset() path and bypasses/loses the KERN_SIGINFO_CINIT flag. Other
> sig_fatal() signals take this path too, but we assume for now, container-init
> has a handler.
>
>
> | >
> | >
> | >(What about ptrace_attach() btw? If it is possible to send a signal to init
> | > from the "parent" namespace, perhaps it makes sense to allow ptracing as
> | > well).
> |
> | ptracing of tasks fro different namespaces is not possible at all, since
> | strace utility determines the fork()-ed child pid from the parent's eax
> | register, which would contain the pid value as this parent sees his child.
> | But if the strace is in different namespace - it won't be able to find
> | this child with the pid value from parent's eax.
> |
> | Maybe it's worth disabling cross-namespaces ptracing...
>
> I think so too. Its probably not a serious limitation ?

Several people think we will implement 'namespace entering' through a
ptrace hack, where maybe the admin ptraces the init in a child pidns,
makes it fork, and makes the child execute what it wants (i.e. ps -ef).

You're talking about killing that functionality?

-serge
...

Re: [PATCH 11/15] Signal semantics [message #15358 is a reply to message #15357] Fri, 27 July 2007 20:23 Go to previous messageGo to next message
Sukadev Bhattiprolu is currently offline  Sukadev Bhattiprolu
Messages: 413
Registered: August 2006
Senior Member
Serge E. Hallyn [serue@us.ibm.com] wrote:
| Quoting sukadev@us.ibm.com (sukadev@us.ibm.com):
| > Pavel Emelianov [xemul@openvz.org] wrote:
| > | Oleg Nesterov wrote:
| > | >(What about ptrace_attach() btw? If it is possible to send a signal to init
| > | > from the "parent" namespace, perhaps it makes sense to allow ptracing as
| > | > well).
| > |
| > | ptracing of tasks fro different namespaces is not possible at all, since
| > | strace utility determines the fork()-ed child pid from the parent's eax
| > | register, which would contain the pid value as this parent sees his child.
| > | But if the strace is in different namespace - it won't be able to find
| > | this child with the pid value from parent's eax.
| > |
| > | Maybe it's worth disabling cross-namespaces ptracing...
| >
| > I think so too. Its probably not a serious limitation ?
|
| Several people think we will implement 'namespace entering' through a
| ptrace hack, where maybe the admin ptraces the init in a child pidns,
| makes it fork, and makes the child execute what it wants (i.e. ps -ef).
|
| You're talking about killing that functionality?

No. I was only thinking in terms of debugging container init and missed
the namespace entering part.

Pavel, I am not sure I understand your comment about being unable to
ptrace() a child ns.

BTW, I am able to gdb a process (incl container-init) from parent ns now.

|
| -serge
Re: [PATCH 5/15] Introduce struct upid [message #15374 is a reply to message #15299] Sun, 29 July 2007 09:50 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/26, Pavel Emelyanov wrote:
>
> --- linux-2.6.23-rc1-mm1.orig/include/linux/pid.h 2007-07-26 16:34:45.000000000 +0400
> +++ linux-2.6.23-rc1-mm1-7/include/linux/pid.h 2007-07-26 16:36:37.000000000 +0400
> @@ -40,15 +40,21 @@ enum pid_type
> * processes.
> */
>
> -struct pid
> -{
> - atomic_t count;
> +struct upid {
> /* Try to keep pid_chain in the same cacheline as nr for find_pid */
> int nr;
> + struct pid_namespace *ns;
> struct hlist_node pid_chain;
> +};
> +
> +struct pid
> +{
> + atomic_t count;
> /* lists of tasks that use this pid */
> struct hlist_head tasks[PIDTYPE_MAX];
> struct rcu_head rcu;
> + int level;
> + struct upid numbers[1];
> };

Well. Definitely, the kernel can't be compiled with this patch applied,
this seems to be against the rules...

So. The task has a single (PIDTYPE_MAX) pid no matter how many namespaces
can see it, and "struct pid" has an array of numbers for each namespace.

Still I can't understand why do we need upid->ns, can't we kill it?
Suppose we add "struct pid_namespace *parent_ns" to "struct pid_namespace",
init_pid_ns.parent_ns == NULL.

Now,

struct upid {
int nr;
struct hlist_node pid_chain;
};

struct pid
{
atomic_t count;
struct hlist_head tasks[PIDTYPE_MAX];
struct rcu_head rcu;
struct pid_namespace *active_ns;
struct upid numbers[0];
};

We populate pid->numbers in "reverse" order, so that pid->numbers[0] lives
in pid->active_ns.

Now, for example,

void free_pid(struct pid *pid)
{
struct pid_namespace *ns;
unsigned long flags;
int i;

spin_lock_irqsave(&pidmap_lock, flags);
for (i = 0, ns = pid->active_ns; ns; i++, ns = ns->parent_ns)
hlist_del_rcu(&pid->numbers[i].pid_chain);
spin_unlock_irqrestore(&pidmap_lock, flags);

for (i = 0, ns = pid->active_ns; ns; i++, ns = ns->parent_ns)
free_pidmap(ns, pid->numbers[i].nr);

call_rcu(&pid->rcu, delayed_put_pid);
}

Possible?

Oleg.
Re: [PATCH 6/15] Make alloc_pid(), free_pid() and put_pid() work with struct upid [message #15376 is a reply to message #15300] Sun, 29 July 2007 10:14 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/26, Pavel Emelyanov wrote:
>
> -struct pid *alloc_pid(void)
> +struct pid *alloc_pid(struct pid_namespace *ns)

Why? We have the only caller, copy_process(), ns == task_active_pid_ns()
always.

> {
> struct pid *pid;
> enum pid_type type;
> - int nr = -1;
> - struct pid_namespace *ns;
> + int i, nr;
> + struct pid_namespace *tmp;
>
> - ns = task_active_pid_ns(current);
> pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
> if (!pid)
> goto out;
>
> - nr = alloc_pidmap(ns);
> - if (nr < 0)
> - goto out_free;
> + tmp = ns;
> + for (i = ns->level; i >= 0; i--) {
> + nr = alloc_pidmap(tmp);
> + if (nr < 0)
> + goto out_free;
> +
> + pid->numbers[i].nr = nr;
> + pid->numbers[i].ns = tmp;
> + tmp = tmp->parent;

Hm... There is no ->parent in "struct pid_namespace", and this
patch doesn't add it.

> + if (ns != &init_pid_ns)
> + get_pid_ns(ns);

Q: put_pid() checks "ns != &init_pid_ns" as well, this is just
an optimization, yes? Perhaps we can move this check into
get_pid_ns/put_pid_ns.

We are doing get_pid_ns() only for the "top namespace"... I guess
this can work if pid_namespace does get_pid_ns() on its ->parent.
This patch looks incomplete.

Oleg.
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #15377 is a reply to message #15308] Sun, 29 July 2007 10:39 Go to previous messageGo to previous message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/26, Pavel Emelyanov wrote:
>
> @@ -895,6 +915,7 @@ fastcall NORET_TYPE void do_exit(long co
> {
> struct task_struct *tsk = current;
> int group_dead;
> + struct pid_namespace *pid_ns = tsk->nsproxy->pid_ns;
>
> profile_task_exit(tsk);
>
> @@ -905,9 +926,10 @@ fastcall NORET_TYPE void do_exit(long co
> if (unlikely(!tsk->pid))
> panic("Attempted to kill the idle task!");
> if (unlikely(tsk == task_child_reaper(tsk))) {
> - if (task_active_pid_ns(tsk) != &init_pid_ns)
> - task_active_pid_ns(tsk)->child_reaper =
> - init_pid_ns.child_reaper;
> + if (pid_ns != &init_pid_ns) {
> + zap_pid_ns_processes(pid_ns);
> + pid_ns->child_reaper = init_pid_ns.child_reaper;
> + }
> else
> panic("Attempted to kill init!");

No, no, this is wrong. Yes, the current code is buggy too, I'll send
the fix.

I think this code should be moved below under the "if (group_dead)",
and we should use tsk->group_leader.

> +void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> +{
> + int i;
> + int nr;
> + int nfree;
> + int options = WNOHANG|WEXITED|__WALL;
> +
> +repeat:
> + /*
> + * We know pid == 1 is terminating. Find remaining pid_ts
> + * in the namespace, signal them and then wait for them
> + * exit.
> + */
> + nr = next_pidmap(pid_ns, 1);
> + while (nr > 0) {
> + kill_proc_info(SIGKILL, SEND_SIG_PRIV, nr);
> + nr = next_pidmap(pid_ns, nr);
> + }
> +
> + nr = next_pidmap(pid_ns, 1);
> + while (nr > 0) {
> + do_wait(nr, options, NULL, NULL, NULL);

When the first child of init exits, it sends SIGCHLD. After that,
do_wait() will never sleep, so we are doing a busy-wait loop.
Not good, especially when we have a niced child, can livelock.

> + nr = next_pidmap(pid_ns, nr);
> + }
> +
> + nfree = 0;
> + for (i = 0; i < PIDMAP_ENTRIES; i++)
> + nfree += atomic_read(&pid_ns->pidmap[i].nr_free);
> +
> + /*
> + * If pidmap has entries for processes other than 0 and 1, retry.
> + */
> + if (nfree < (BITS_PER_PAGE * PIDMAP_ENTRIES - 2))
> + goto repeat;

This doesn't look right.

Suppose that some "struct pid" was pinned from the parent namespace.
In that case zap_pid_ns_processes() will burn CPU until put_pid(), bad.

I think we can rely on forget_original_child() and do something like
this:

zap_active_ns_processes(void)
{
// kill all tasks in our ns and below
kill(-1, SIGKILL);

do {
clear_thread_flag(TIF_SIGPENDING);
} while (wait(NULL) != -ECHLD);
}

Oleg.
Previous Topic: Re: [RFC, PATCH] handle the multi-threaded init's exit() properly
Next Topic: [PATCH 0/14] sysfs cleanups
Goto Forum:
  


Current Time: Wed Oct 16 03:32:11 GMT 2024

Total time taken to generate the page: 0.04997 seconds