OpenVZ Forum


Home » Mailing lists » Devel » [PATCH 0/7] containers (V7): Generic Process Containers
Re: [ckrm-tech] [PATCH 1/7] containers (V7): Generic container system abstracted from cpusets code [message #11464 is a reply to message #10182] Sat, 24 March 2007 14:58 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Mon, Feb 12, 2007 at 12:15:22AM -0800, menage@google.com wrote:
> +static int attach_task(struct container *cont, char *pidbuf, char **ppathbuf)
> +{
> + pid_t pid;
> + struct task_struct *tsk;
> + struct container *oldcont;
> + int retval;
> +
> + if (sscanf(pidbuf, "%d", &pid) != 1)
> + return -EIO;
> +
> + if (pid) {
> + read_lock(&tasklist_lock);
> +
> + tsk = find_task_by_pid(pid);
> + if (!tsk || tsk->flags & PF_EXITING) {

This is probably carrying over code from cpuset.c, but :

/me thinks that there is a ugly race here with 'tsk' exiting.
What happens if the tsk is marked PF_EXITING just after this check?
If that happens, then:

> + read_unlock(&tasklist_lock);
> + return -ESRCH;
> + }
> +
> + get_task_struct(tsk);
> + read_unlock(&tasklist_lock);
> +
> + if ((current->euid) && (current->euid != tsk->uid)
> + && (current->euid != tsk->suid)) {
> + put_task_struct(tsk);
> + return -EACCES;
> + }
> + } else {
> + tsk = current;
> + get_task_struct(tsk);
> + }
> +
> + retval = security_task_setscheduler(tsk, 0, NULL);
> + if (retval) {
> + put_task_struct(tsk);
> + return retval;
> + }
> +
> + mutex_lock(&callback_mutex);
> +
> + task_lock(tsk);
> + oldcont = tsk->container;
> + if (!oldcont) {
> + task_unlock(tsk);
> + mutex_unlock(&callback_mutex);
> + put_task_struct(tsk);
> + return -ESRCH;
> + }
> + atomic_inc(&cont->count);
> + rcu_assign_pointer(tsk->container, cont);

Above assignment A1 can race with below assignment A2 in container_exit() :

tsk->container = &top_container; /* the_top_container_hack - see above */

What happens if A1 follows after A2? I feel very uncomfortable abt it.

IMO, we need to use task_lock() in container_exit() to avoid this race.

(I think this race already exists in mainline cpuset.c?)

P.S : cpuset.c checks for PF_EXITING twice in attach_task(), while this
patch seems to be checking only once. Is that fine?


--
Regards,
vatsa
Re: [ckrm-tech] [PATCH 1/7] containers (V7): Generic container system abstracted from cpusets code [message #11468 is a reply to message #11464] Sat, 24 March 2007 19:25 Go to previous messageGo to next message
Paul Jackson is currently offline  Paul Jackson
Messages: 157
Registered: February 2006
Senior Member
> IMO, we need to use task_lock() in container_exit() to avoid this race.
>
> (I think this race already exists in mainline cpuset.c?)
>
> P.S : cpuset.c checks for PF_EXITING twice in attach_task(), while this
> patch seems to be checking only once. Is that fine?

I think the cpuset code is ok, because, as you note, it locks the task,
picks off the cpuset pointer, and then checks a second time that the
task still does not have PF_EXITING set:

In the kernel/cpuset.c code for attach_task():

task_lock(tsk);
oldcs = tsk->cpuset;
/*
* After getting 'oldcs' cpuset ptr, be sure still not exiting.
* If 'oldcs' might be the top_cpuset due to the_top_cpuset_hack
* then fail this attach_task(), to avoid breaking top_cpuset.count.
*/
if (tsk->flags & PF_EXITING) {
task_unlock(tsk);
mutex_unlock(&callback_mutex);
put_task_struct(tsk);
return -ESRCH;
}

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
Re: [ckrm-tech] [PATCH 1/7] containers (V7): Generic container system abstracted from cpusets code [message #11471 is a reply to message #11468] Sun, 25 March 2007 00:38 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Sat, Mar 24, 2007 at 12:25:59PM -0700, Paul Jackson wrote:
> > P.S : cpuset.c checks for PF_EXITING twice in attach_task(), while this
> > patch seems to be checking only once. Is that fine?
>
> I think the cpuset code is ok, because, as you note, it locks the task,
> picks off the cpuset pointer, and then checks a second time that the
> task still does not have PF_EXITING set:

Well afaics, PF_EXITING is set for the exiting task w/o taking any lock, which
makes this racy always.

> In the kernel/cpuset.c code for attach_task():
>
> task_lock(tsk);
> oldcs = tsk->cpuset;
> /*
> * After getting 'oldcs' cpuset ptr, be sure still not exiting.
> * If 'oldcs' might be the top_cpuset due to the_top_cpuset_hack
> * then fail this attach_task(), to avoid breaking top_cpuset.count.
> */
> if (tsk->flags & PF_EXITING) {

What if PF_EXITING is set after this check? If that happens then,

> task_unlock(tsk);
> mutex_unlock(&callback_mutex);
> put_task_struct(tsk);
> return -ESRCH;
> }

the following code becomes racy with cpuset_exit() ...

atomic_inc(&cs->count);
rcu_assign_pointer(tsk->cpuset, cs);
task_unlock(tsk);


--
Regards,
vatsa
Re: [ckrm-tech] [PATCH 1/7] containers (V7): Generic container system abstracted from cpusets code [message #11472 is a reply to message #11471] Sun, 25 March 2007 01:41 Go to previous messageGo to next message
Paul Jackson is currently offline  Paul Jackson
Messages: 157
Registered: February 2006
Senior Member
vatsa wrote:
> > if (tsk->flags & PF_EXITING) {
>
> What if PF_EXITING is set after this check? If that happens then,
>
> > task_unlock(tsk);
> > mutex_unlock(&callback_mutex);
> > put_task_struct(tsk);
> > return -ESRCH;
> > }
>
> the following code becomes racy with cpuset_exit() ...
>
> atomic_inc(&cs->count);
> rcu_assign_pointer(tsk->cpuset, cs);
> task_unlock(tsk);

eh ... so ... ?

I don't know of any sequence where that causes any problem.

Do you see one?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
Re: [ckrm-tech] [PATCH 1/7] containers (V7): Generic container system abstracted from cpusets code [message #11473 is a reply to message #11472] Sun, 25 March 2007 02:21 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Sat, Mar 24, 2007 at 06:41:28PM -0700, Paul Jackson wrote:
> > the following code becomes racy with cpuset_exit() ...
> >
> > atomic_inc(&cs->count);
> > rcu_assign_pointer(tsk->cpuset, cs);
> > task_unlock(tsk);
>
> eh ... so ... ?
>
> I don't know of any sequence where that causes any problem.
>
> Do you see one?

Let's say we had two cpusets CS1 amd CS2 (both different from top_cpuset).
CS1 has just one task T1 in it (CS1->count = 0) while CS2 has no tasks
in it (CS2->count = 0).

Now consider:

------------------------------------------------------------ --------
CPU0 (attach_task T1 to CS2) CPU1 (T1 is exiting)
------------------------------------------------------------ --------

task_lock(T1);

oldcs = tsk->cpuset;
[oldcs = CS1]

T1->flags & PF_EXITING? (No)

T1->flags = PF_EXITING;

atomic_inc(&CS2->count);

cpuset_exit()
cs = tsk->cpuset; (cs = CS1)

T1->cpuset = CS2;

T1->cpuset = &top_cpuset;

task_unlock(T1);


CS2 has one bogus count now (with no tasks in it), which may prevent it from
being removed/freed forever.


Not just this, continuing further we have more trouble:

------------------------------------------------------------ --------
CPU0 (attach_task T1 to CS2) CPU1 (T1 is exiting)
------------------------------------------------------------ --------

synchronize_rcu()
atomic_dec(&CS1->count);
[CS1->count = 0]

if atomic_dec_and_test(&oldcs->count))
[CS1->count = -1]



We now have CS1->count negative. Is that good? I am uncomfortable ..

We need a task_lock() in cpuset_exit to avoid this race.

--
Regards,
vatsa
Re: [ckrm-tech] [PATCH 1/7] containers (V7): Generic container system abstracted from cpusets code [message #11474 is a reply to message #11473] Sun, 25 March 2007 04:16 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Sun, Mar 25, 2007 at 07:58:16AM +0530, Srivatsa Vaddagiri wrote:
> Not just this, continuing further we have more trouble:
>
> ------------------------------------------------------------ --------
> CPU0 (attach_task T1 to CS2) CPU1 (T1 is exiting)
> ------------------------------------------------------------ --------
>
> synchronize_rcu()
> atomic_dec(&CS1->count);
> [CS1->count = 0]
>
> if atomic_dec_and_test(&oldcs->count))
> [CS1->count = -1]
>
>
>
> We now have CS1->count negative. Is that good? I am uncomfortable ..
>
> We need a task_lock() in cpuset_exit to avoid this race.

2nd race is tricky. We probably need to do this to avoid it:

task_lock(tsk);

/* Check if tsk->cpuset is still same. We may have raced with
* cpuset_exit changing tsk->cpuset again under our feet.
*/
if (tsk->cpuset == cs && atomic_dec_and_test(&oldcs->count)) {
task_unlock(tsk);
check_for_release(oldcs, ppathbuf);
goto done;
}

task_unlock(tsk);

done:
return 0;



--
Regards,
vatsa
Re: [ckrm-tech] [PATCH 1/7] containers (V7): Generic container system abstracted from cpusets code [message #11475 is a reply to message #11473] Sun, 25 March 2007 04:45 Go to previous messageGo to next message
Paul Jackson is currently offline  Paul Jackson
Messages: 157
Registered: February 2006
Senior Member
vatsa wrote:
> Now consider:

Nice work - thanks. Yes, both an extra cpuset count and a negative
cpuset count are bad news, opening the door to the usual catastrophes.

Would you like the honor of submitting the patch to add a task_lock
to cpuset_exit()? If you do, be sure to fix, or at least remove,
the cpuset_exit comment lines:

* We don't need to task_lock() this reference to tsk->cpuset,
* because tsk is already marked PF_EXITING, so attach_task() won't
* mess with it, or task is a failed fork, never visible to attach_task.

I guess that taking task_lock() in cpuset_exit() should not be a serious
performance issue. It's taking a spinlock that is in the current
exiting tasks task struct, so it should be a cache hot memory line and
a rarely contested lock.

And I guess I've not see this race in real life, as one side of it has
to execute quite a bit of code in the task exit path, from when it sets
PF_EXITING until it gets into the cpuset_exit() call, while the other side
does the three lines:

if (tsk->flags & PF_EXITING) ...
atomic_inc(&cs->count);
rcu_assign_pointer(tsk->cpuset, cs);

So, in real life, this would be a difficult race to trigger.

Thanks for finding this.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
Re: [ckrm-tech] [PATCH 1/7] containers (V7): Generic container system abstracted from cpusets code [message #11476 is a reply to message #11475] Sun, 25 March 2007 04:57 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Sat, Mar 24, 2007 at 09:45:50PM -0700, Paul Jackson wrote:
> Nice work - thanks. Yes, both an extra cpuset count and a negative
> cpuset count are bad news, opening the door to the usual catastrophes.
>
> Would you like the honor of submitting the patch to add a task_lock
> to cpuset_exit()? If you do, be sure to fix, or at least remove,
> the cpuset_exit comment lines:

I will try to send out a patch later today to fix this bug in mainline
cpuset code. I happened to notice this race with my rcfs patch and observed
same is true with cpuset/container code also.

> * We don't need to task_lock() this reference to tsk->cpuset,
> * because tsk is already marked PF_EXITING, so attach_task() won't
> * mess with it, or task is a failed fork, never visible to attach_task.

Sure, I had seen that.

> So, in real life, this would be a difficult race to trigger.

Agreed, but good to keep code clean isn't it? :)

> Thanks for finding this.

Wellcome!

--
Regards,
vatsa
Re: [ckrm-tech] [PATCH 1/7] containers (V7): Generic container system abstracted from cpusets code [message #11477 is a reply to message #11476] Sun, 25 March 2007 04:59 Go to previous messageGo to next message
Paul Jackson is currently offline  Paul Jackson
Messages: 157
Registered: February 2006
Senior Member
> I will try to send out a patch later today to fix

Thanks!

> Agreed, but good to keep code clean isn't it? :)

Definitely.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
Re: [ckrm-tech] [PATCH 1/7] containers (V7): Generic container system abstracted from cpusets code [message #11478 is a reply to message #11474] Sun, 25 March 2007 05:43 Go to previous messageGo to next message
Paul Jackson is currently offline  Paul Jackson
Messages: 157
Registered: February 2006
Senior Member
vatsa wrote:
> Not just this, continuing further we have more trouble:
>
> ------------------------------------------------------------ --------
> CPU0 (attach_task T1 to CS2) CPU1 (T1 is exiting)
> ------------------------------------------------------------ --------
>
> synchronize_rcu()
> atomic_dec(&CS1->count);
> [CS1->count = 0]
>
> if atomic_dec_and_test(&oldcs->count))
> [CS1->count = -1]
> ...
> 2nd race is tricky. We probably need to do this to avoid it:
>
> task_lock(tsk);
>
> /* Check if tsk->cpuset is still same. We may have raced with
> * cpuset_exit changing tsk->cpuset again under our feet.
> */
> if (tsk->cpuset == cs && atomic_dec_and_test(&oldcs->count)) {

I'm unsure here, but this 'tsk->cpuset == cs' test feels fragile to me.

How about a bit earlier in attach_task(), right at the point we overwrite the
victim tasks cpuset pointer, we decrement the count on the old cpuset, and if
it went to zero, remember that we'll need to release it, once we've dropped
some locks:

static int attach_task(struct cpuset *cs, char *pidbuf, char **ppathbuf)
{
...
struct cpuset *oldcs;
struct cpuset *oldcs_tobe_released;

...

task_lock(tsk);
oldcs = tsk->cpuset;
...
if (tsk->flags & PF_EXITING) {
...
}
atomic_inc(&cs->count);
rcu_assign_pointer(tsk->cpuset, cs);
oldcs_tobe_released = NULL;
if (atomic_dec_and_test(&oldcs->count))
oldcs_tobe_released = oldcs;
task_unlock(tsk);

...
put_task_struct(tsk);
synchronize_rcu();
if (oldcs_tobe_released)
check_for_release(oldcs_tobe_released, ppathbuf);
return 0;
}

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
Re: [ckrm-tech] [PATCH 1/7] containers (V7): Generic container system abstracted from cpusets code [message #11481 is a reply to message #11478] Sun, 25 March 2007 08:14 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Sat, Mar 24, 2007 at 10:43:48PM -0700, Paul Jackson wrote:
> I'm unsure here, but this 'tsk->cpuset == cs' test feels fragile to me.
>
> How about a bit earlier in attach_task(), right at the point we overwrite the
> victim tasks cpuset pointer, we decrement the count on the old cpuset, and if
> it went to zero, remember that we'll need to release it, once we've dropped
> some locks:

Looks neater. Will adopt this when I send the patch. Thanks.

--
Regards,
vatsa
Re: [ckrm-tech] [PATCH 7/7] containers (V7): Container interface to nsproxy subsystem [message #11552 is a reply to message #11456] Mon, 26 March 2007 21:55 Go to previous messageGo to next message
serue is currently offline  serue
Messages: 750
Registered: February 2006
Senior Member
Quoting Srivatsa Vaddagiri (vatsa@in.ibm.com):
> On Mon, Feb 12, 2007 at 12:15:28AM -0800, menage@google.com wrote:
> > +/*
> > + * Rules: you can only create a container if
> > + * 1. you are capable(CAP_SYS_ADMIN)
> > + * 2. the target container is a descendant of your own container
> > + */
> > +static int ns_create(struct container_subsys *ss, struct container *cont)
> > +{
> > + struct nscont *ns;
> > +
> > + if (!capable(CAP_SYS_ADMIN))
> > + return -EPERM;
>
> Does this check break existing namespace semantics in a subtle way?
> It now requires that unshare() of namespaces by any task requires
> CAP_SYS_ADMIN capabilities.
>
> clone(.., CLONE_NEWUTS, ..)->copy_namespaces()->ns_container_clone()->
> ->container_clone()-> .. -> container_create() -> ns_create()
>
> Earlier, one could unshare his uts namespace w/o CAP_SYS_ADMIN
> capabilities.

Not so, CAP_SYS_ADMIN is required. Depending on your tree, check
kernel/utsname.c or kernel/nsproxy.c.

Mind you, whether there is a reason to require CAP_SYS_ADMIN for utsname
I'm not sure. But if there is any way of finding privileged software
which would behave differently based on utsname information, then we
should.

> Now it is required. Is that fine? Don't know.
>
> I feel we can avoid this check totally and let the directory permissions
> take care of these checks.
>
> Serge, what do you think?

There is no way we can think about doing that until we determine that
every kernel will have the ns container subsystem compiled in and
mounted, right?

-serge
Re: [ckrm-tech] [PATCH 7/7] containers (V7): Container interface to nsproxy subsystem [message #11652 is a reply to message #10179] Sat, 31 March 2007 02:47 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Mon, Feb 12, 2007 at 12:15:28AM -0800, menage@google.com wrote:
> +int ns_container_clone(struct task_struct *tsk)
> +{
> + return container_clone(tsk, &ns_subsys);
> +}

This function is a no-op if ns hierarchy is not mounted at this point.
This would mean that we will miss out on some directories in ns
hierarchy if it happened to be mounted later. It would be nice to
recreate such missing directories upon mount. However I suspect it would
not be easy ..Maybe we need to scan the task list and (re-)invoke
ns_container_clone() for every new tsk->nsproxy we find in the list.
Alternately perhaps we could auto mount (kern_mount) ns hierarchy very early at
bootup? On the flip side that would require remount support so that additional
controllers (like cpuset, mem) can be bound to (non-empty) ns hierarchy after
bootup.

--
Regards,
vatsa
Re: [ckrm-tech] [PATCH 7/7] containers (V7): Container interface to nsproxy subsystem [message #11676 is a reply to message #11652] Mon, 02 April 2007 14:09 Go to previous message
serue is currently offline  serue
Messages: 750
Registered: February 2006
Senior Member
Quoting Srivatsa Vaddagiri (vatsa@in.ibm.com):
> On Mon, Feb 12, 2007 at 12:15:28AM -0800, menage@google.com wrote:
> > +int ns_container_clone(struct task_struct *tsk)
> > +{
> > + return container_clone(tsk, &ns_subsys);
> > +}
>
> This function is a no-op if ns hierarchy is not mounted at this point.
> This would mean that we will miss out on some directories in ns
> hierarchy if it happened to be mounted later. It would be nice to
> recreate such missing directories upon mount. However I suspect it would
> not be easy ..Maybe we need to scan the task list and (re-)invoke
> ns_container_clone() for every new tsk->nsproxy we find in the list.
> Alternately perhaps we could auto mount (kern_mount) ns hierarchy very early at
> bootup? On the flip side that would require remount support so that additional
> controllers (like cpuset, mem) can be bound to (non-empty) ns hierarchy after
> bootup.

Losing the directory isn't a big deal though. And both unsharing a
namespace (which causes a ns_container_clone) and mounting the hierarchy
are done by userspace, so if for some installation it is a big deal,
the init scripts on initrd can mount the hierarchy before doing any
unsharing.

Can you think of a reason why losing a few directories matters?

-serge
[PATCH 3/7] containers (V7): Add generic multi-subsystem API to containers [message #17449 is a reply to message #10176] Mon, 12 February 2007 08:15 Go to previous message
Paul Menage is currently offline  Paul Menage
Messages: 642
Registered: September 2006
Senior Member
This patch removes all cpuset-specific knowlege from the container
system, replacing it with a generic API that can be used by multiple
subsystems. Cpusets is adapted to be a container subsystem.

Signed-off-by: Paul Menage <menage@google.com>

---
 Documentation/containers.txt |  415 +++++++++--
 Documentation/cpusets.txt    |   20 
 include/linux/container.h    |  178 ++++
 include/linux/cpuset.h       |   16 
 include/linux/mempolicy.h    |   12 
 include/linux/sched.h        |    4 
 init/Kconfig                 |   12 
 kernel/container.c           | 1601 ++++++++++++++++++++++++++++++-------------
 kernel/cpuset.c              |  170 ++--
 mm/mempolicy.c               |    2 
 10 files changed, 1808 insertions(+), 622 deletions(-)

Index: container-2.6.20/include/linux/container.h
===================================================================
--- container-2.6.20.orig/include/linux/container.h
+++ container-2.6.20/include/linux/container.h
@@ -9,13 +9,12 @@
  */
 
 #include <linux/sched.h>
+#include <linux/kref.h>
 #include <linux/cpumask.h>
 #include <linux/nodemask.h>
 
 #ifdef CONFIG_CONTAINERS
 
-extern int number_of_containers;	/* How many containers are defined in system? */
-
 extern int container_init_early(void);
 extern int container_init(void);
 extern void container_init_smp(void);
@@ -30,13 +29,105 @@ extern void container_unlock(void);
 extern void container_manage_lock(void);
 extern void container_manage_unlock(void);
 
+struct containerfs_root;
+
+/* Per-subsystem/per-container state maintained by the system. */
+struct container_subsys_state {
+	/* The container that this subsystem is attached to. Useful
+	 * for subsystems that want to know about the container
+	 * hierarchy structure */
+	struct container *container;
+
+	/* State maintained by the container system to allow
+	 * subsystems to be "busy". Should be accessed via css_get()
+	 * and css_put() */
+	spinlock_t refcnt_lock;
+	atomic_t refcnt;
+};
+
+/* A container_group is a structure holding pointers to a set of
+ * containers. This saves space in the task struct object and speeds
+ * up fork()/exit(), since a single inc/dec can bump the reference
+ * count on the entire container set for a task. */
+
+struct container_group {
+
+	/* Reference count */
+	struct kref ref;
+
+	/* List running through all container groups */
+	struct list_head list;
+
+	/* Set of containers, one for each hierarchy. These are
+	 * immutable once the container group has been created */
+	struct container *container[CONFIG_MAX_CONTAINER_HIERARCHIES];
+
+	/* Set of subsystem states, one for each subsystem. NULL for
+	 * subsystems that aren't part of this hierarchy. These
+	 * pointers reduce the number of dereferences required to get
+	 * from a task to its state for a given container, but result
+	 * in increased space usage if tasks are in wildly different
+	 * groupings across different hierarchies. This array is
+	 * mostly immutable after creation - a newly registered
+	 * subsystem can result in a pointer in this array
+	 * transitioning from NULL to non-NULL */
+	struct container_subsys_state *subsys[CONFIG_MAX_CONTAINER_SUBSYS];
+};
+
+/*
+ * Call css_get() to hold a reference on the container; following a
+ * return of 0, this container subsystem state object is guaranteed
+ * not to be destroyed until css_put() is called on it.  A non-zero
+ * return code indicates that a reference could not be taken.
+ *
+ */
+
+static inline int css_get(struct container_subsys_state *css)
+{
+	int retval = 0;
+	unsigned long flags;
+	/* Synchronize with container_rmdir() */
+	spin_lock_irqsave(&css->refcnt_lock, flags);
+	if (atomic_read(&css->refcnt) >= 0) {
+		/* Container is still alive */
+		atomic_inc(&css->refcnt);
+	} else {
+		/* Container removal is in progress */
+		retval = -EINVAL;
+	}
+	spin_unlock_irqrestore(&css->refcnt_lock, flags);
+	return retval;
+}
+
+/*
+ * If you are holding current->alloc_lock then it's impossible for you
+ * to be moved out of your container, and hence it's impossible for
+ * your container to be destroyed. Therefore doing a simple
+ * atomic_inc() on a css is safe.
+ */
+
+static inline void css_get_current(struct container_subsys_state *css)
+{
+	atomic_inc(&css->refcnt);
+}
+
+/*
+ * css_put() should be called to release a reference taken by
+ * css_get() or css_get_current()
+ */
+
+static inline void css_put(struct container_subsys_state *css) {
+	atomic_dec(&css->refcnt);
+}
+
 struct container {
 	unsigned long flags;		/* "unsigned long" so bitops work */
 
 	/*
 	 * Count is atomic so can incr (fork) or decr (exit) without a lock.
 	 */
-	atomic_t count;			/* count tasks using this container */
+	atomic_t count;			/* count of container groups
+					 * using this container*/
 
 	/*
 	 * We link our 'sibling' struct into our parent's 'children'.
@@ -46,11 +137,15 @@ struct container {
 	struct list_head children;	/* my children */
 
 	struct container *parent;	/* my parent */
-	struct dentry *dentry;		/* container fs entry */
+	struct dentry *dentry;	  	/* container fs entry */
 
-#ifdef CONFIG_CPUSETS
-	struct cpuset *cpuset;
-#endif
+	/* Private pointers for each registered subsystem */
+	struct container_subsys_state *subsys[CONFIG_MAX_CONTAINER_SUBSYS];
+
+	int hierarchy;
+
+	struct containerfs_root *root;
+	struct container *top_container;
 };
 
 /* struct cftype:
@@ -67,8 +162,11 @@ struct container {
  */
 
 struct inode;
+#define MAX_CFTYPE_NAME 64
 struct cftype {
-	char *name;
+	/* By convention, the name should begin with the name of the
+	 * subsystem, followed by a period */
+	char name[MAX_CFTYPE_NAME];
 	int private;
 	int (*open) (struct inode *inode, struct file *file);
 	ssize_t (*read) (struct container *cont, struct cftype *cft,
@@ -80,10 +178,72 @@ struct cftype {
 	int (*release) (struct inode *inode, struct file *file);
 };
 
+/* Add a new file to the given container directory. Should only be
+ * called by subsystems from within a populate() method */
 int container_add_file(struct container *cont, const struct cftype *cft);
 
 int container_is_removed(const struct container *cont);
-void container_set_release_agent_path(const char *path);
+
+int container_path(const struct container *cont, char *buf, int buflen);
+
+int container_task_count(const struct container *cont);
+
+/* Return true if the container is a descendant of the current container */
+int container_is_descendant(const struct container *cont);
+
+/* Container subsystem type. See Documentation/containers.txt for details */
+
+struct container_subsys {
+	int (*create)(struct container_subsys *ss,
+		      struct container *cont);
+	void (*destroy)(struct container_subsys *ss, struct container *cont);
+	int (*can_attach)(struct container_subsys *ss,
+			  struct container *cont, struct task_struct *tsk);
+	void (*attach)(struct container_subsys *ss, struct container *cont,
+			struct container *old_cont, struct task_struct *tsk);
+	void (*post_attach)(struct container_subsys *ss,
+			    struct container *cont,
+			    struct container *old_cont,
+			    struct task_struct *tsk);
+	void (*fork)(struct container_subsys *ss, struct task_struct *task);
+	void (*exit)(struct container_subsys *ss, struct task_struct *task);
+	int (*populate)(struct container_subsys *ss,
+			struct container *cont);
+	void (*bind)(struct container_subsys *ss, struct container *root);
+	int subsys_id;
+	int active;
+
+#define MAX_CONTAINER_TYPE_NAMELEN 32
+	const char *name;
+
+	/* Protected by RCU */
+	int hierarchy;
+
+	struct list_head sibling;
+};
+
+int container_register_subsys(struct container_subsys *subsys);
+int container_clone(struct task_struct *tsk, struct container_subsys *ss);
+
+static inline struct container_subsys_state *container_subsys_state(
+	struct container *cont,
+	struct container_subsys *ss)
+{
+	return cont->subsys[ss->subsys_id];
+}
+
+static inline struct container* task_container(struct task_struct *task,
+					       struct container_subsys *ss)
+{
+	return rcu_dereference(task->containers->container[ss->hierarchy]);
+}
+
+static inline struct container_subsys_state *task_subsys_state(
+	struct task_struct *task,
+	struct container_subsys *ss)
+{
+	return rcu_dereference(task->containers->subsys[ss->subsys_id]);
+}
 
 int container_path(const struct container *cont, char *buf, int buflen);
 
Index: container-2.6.20/include/linux/cpuset.h
===================================================================
--- container-2.6.20.orig/include/linux/cpuset.h
+++ container-2.6.20/include/linux/cpuset.h
@@ -70,16 +70,7 @@ static inline int cpuset_do_slab_mem_spr
 
 extern void cpuset_track_online_nodes(void);
 
-extern int cpuset_can_attach_task(struct container *cont,
-				  struct task_struct *tsk);
-extern void cpuset_attach_task(struct container *cont,
-				struct task_struct *tsk);
-extern void cpuset_post_attach_task(struct container *cont,
-				    struct container *oldcont,
-				    struct task_struct *tsk);
-extern int cpuset_populate_dir(struct container *cont);
-extern int cpuset_create(struct container *cont);
-extern void cpuset_destroy(struct container *cont);
+extern int current_cpuset_is_being_rebound(void);
 
 #else /* !CONFIG_CPUSETS */
 
@@ -147,6 +138,11 @@ static inline int cpuset_do_slab_mem_spr
 
 static inline void cpuset_track_online_nodes(void) {}
 
+static inline int current_cpuset_is_being_rebound(void)
+{
+	return 0;
+}
+
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
Index: container-2.6.20/kernel/container.c
===================================================================
--- container-2.6.20.orig/kernel/container.c
+++ container-2.6.20/kernel/container.c
@@ -55,7 +55,6 @@
 #include <linux/time.h>
 #include <linux/backing-dev.h>
 #include <linux/sort.h>
-#include <linux/cpuset.h>
 
 #include <asm/uaccess.h>
 #include &l
...

Re: [PATCH 3/7] containers (V7): Add generic multi-subsystem API to containers [message #17450 is a reply to message #17449] Mon, 12 February 2007 18:31 Go to previous message
Paul Menage is currently offline  Paul Menage
Messages: 642
Registered: September 2006
Senior Member
On 2/12/07, Cedric Le Goater <clg@fr.ibm.com> wrote:
> >> +#include <asm/proto.h>
> >
> > I did have a problem with this include.  On s390 it didn't exist so I've
> > just been running without it (with no problems).  A quick 'find'
> > suggests it only exists on x86_64, so I'd expect failures on all other
> > arches.
>
> confirmed on x86 also.
>

Sorry, this turned out to be left over from when I was debugging an
initialization problem that required early_printk(). I should have
removed it.

You can safely remove it from the patch.

Paul
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 0/7] containers (V7): Generic Process Containers [message #17451 is a reply to message #10176] Mon, 12 February 2007 23:15 Go to previous message
Paul Menage is currently offline  Paul Menage
Messages: 642
Registered: September 2006
Senior Member
On 2/12/07, Sam Vilain <sam@vilain.net> wrote:
>
> I know I'm a bit out of touch, but AIUI the NSProxy *is* the container.
> We decided a long time ago that a container was basically just a set of
> namespaces, which includes all of the subsystems you mention.

You may have done that, but the CKRM/ResGroups independently decided a
long time ago that the fundamental unit was the resource class, and
the OpenVZ folks decided that the fundamental unit was the
BeanCounter, and the CPUSet folks decided that the fundamental unit
was the CPUSet, etc ... :-)

But there's a lot of common ground between these different approaches,
and potential for synergy, so the point of this patch set is to
provide a unification point for all of them, and a stepping stone for
other new resource controllers and process control modules.

Paul
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 0/7] containers (V7): Generic Process Containers [message #17452 is a reply to message #10176] Mon, 12 February 2007 23:18 Go to previous message
Paul Menage is currently offline  Paul Menage
Messages: 642
Registered: September 2006
Senior Member
On 2/12/07, Serge E. Hallyn <serue@us.ibm.com> wrote:
>
> Well it's an unfortunate conflict, but I don't see where we have any
> standing to make Paul change his terminology  :)

I have no huge problem with changing my terminology in the interest of
wider adoption. "Container" seems like an appropriate name for the
abstraction, and possibly more appropriate than for a virtual server,
but if it was decreed that the only thing stopping the container patch
being merged was that it should be called, say, "Process Sets", I
would happily s/container/pset/g across the entire patchset.

Paul
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 0/7] containers (V7): Generic Process Containers [message #17454 is a reply to message #10176] Mon, 12 February 2007 22:47 Go to previous message
serue is currently offline  serue
Messages: 750
Registered: February 2006
Senior Member
Quoting Sam Vilain (sam@vilain.net):
> menage@google.com wrote:
> > Generic Process Containers
> > --------------------------
> >
> > There have recently been various proposals floating around for
> > resource management/accounting and other task grouping subsystems in
> > the kernel, including ResGroups, User BeanCounters, NSProxy
> > containers, and others.  These all need the basic abstraction of being
> > able to group together multiple processes in an aggregate, in order to
> > track/limit the resources permitted to those processes, or control
> > other behaviour of the processes, and all implement this grouping in
> > different ways.
> >   
> 
> I know I'm a bit out of touch, but AIUI the NSProxy *is* the container. 

The nsproxy is an implementation detail, actually, and might go away at
any time.  But yes we had been talking for quite awhile about sets of
namespaces as containers, so that a vserver would be a container, as
would a lightweight checkpoint/restart job.

> We decided a long time ago that a container was basically just a set of
> namespaces, which includes all of the subsystems you mention.
> 
> This would suggesting re-write this patchset, part 2 as a "CPUSet

Well it's an unfortunate conflict, but I don't see where we have any
standing to make Paul change his terminology  :)

Lately I've been talking about "Paul's containers", our "namespaces",
and the "namespace container subsystem" in patch 7.

> namespace", part 4 as a "CPU scheduling namespace", parts 5 and 6 as
> "Resource Limits Namespace" (drop this "BeanCounter" brand), and of
> course part 7 falls away.

Part 7 does not fall away, it uses Paul's patchset to start to provide
a management interface for namespaces.

-serge
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 3/7] containers (V7): Add generic multi-subsystem API to containers [message #17455 is a reply to message #17449] Mon, 12 February 2007 15:56 Go to previous message
Cedric Le Goater is currently offline  Cedric Le Goater
Messages: 443
Registered: February 2006
Senior Member
>> +#include <asm/proto.h>
> 
> I did have a problem with this include.  On s390 it didn't exist so I've
> just been running without it (with no problems).  A quick 'find'
> suggests it only exists on x86_64, so I'd expect failures on all other
> arches.

confirmed on x86 also.

C.
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 0/7] containers (V7): Generic Process Containers [message #17457 is a reply to message #10176] Mon, 12 February 2007 22:38 Go to previous message
Sam Vilain is currently offline  Sam Vilain
Messages: 73
Registered: February 2006
Member
menage@google.com wrote:
> Generic Process Containers
> --------------------------
>
> There have recently been various proposals floating around for
> resource management/accounting and other task grouping subsystems in
> the kernel, including ResGroups, User BeanCounters, NSProxy
> containers, and others.  These all need the basic abstraction of being
> able to group together multiple processes in an aggregate, in order to
> track/limit the resources permitted to those processes, or control
> other behaviour of the processes, and all implement this grouping in
> different ways.
>   

I know I'm a bit out of touch, but AIUI the NSProxy *is* the container. 
We decided a long time ago that a container was basically just a set of
namespaces, which includes all of the subsystems you mention.

This would suggesting re-write this patchset, part 2 as a "CPUSet
namespace", part 4 as a "CPU scheduling namespace", parts 5 and 6 as
"Resource Limits Namespace" (drop this "BeanCounter" brand), and of
course part 7 falls away.

Sam.
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 0/7] containers (V7): Generic Process Containers [message #17458 is a reply to message #17451] Tue, 13 February 2007 00:30 Go to previous message
Sam Vilain is currently offline  Sam Vilain
Messages: 73
Registered: February 2006
Member
Paul Menage wrote:
>> I know I'm a bit out of touch, but AIUI the NSProxy *is* the container.
>> We decided a long time ago that a container was basically just a set of
>> namespaces, which includes all of the subsystems you mention.
>>     
> You may have done that, but the CKRM/ResGroups independently decided a
> long time ago that the fundamental unit was the resource class, and
> the OpenVZ folks decided that the fundamental unit was the
> BeanCounter, and the CPUSet folks decided that the fundamental unit
> was the CPUSet, etc ... :-)
>   

There was a consensus that emerged that resulted in the nsproxy being
included in the kernel.  That is why I used "we".  What was included
varies greatly from the patch I put forward, as I mentioned.

You missed the point of what I was trying to say.  Let me try again.

Originally I too thought that in order to begin on the path of
virtualisation merging we would just make a simple
container/vserver/whatever structure and hang everything off that.

I'll now say the same thing that was said before of my patch, that I
don't think that adding the containers structure inside the kernel adds
anything interesting or useful.  In fact, I'd go further to say that
very thing you think is a useful abstraction is locking yourself into
inflexibility.  This is what Eric recognised early on and eventually
brought me around to the idea of too.

So, we have the current implementation - individual subsystems are
virtualised, and it is the subsystems that you can see and work with. 
You can group together your virtualisation decisions and call that a
container, great.

Ask yourself this - what do you need the container structure for so
badly, that virtualising the individual resources does not provide for? 
You don't need it to copy the namespaces of another process ("enter")
and you don't need it for checkpoint/migration.  What does it mean to
make a new container?  You're making a "nothing" namespace for some
as-yet-unspecified grouping of tasks.  That's a great idea for a set of
tightly integrated userland utilities to simplify the presentation to
the admin, but I don't see why you need to enshrine this in the kernel. 
Certainly not for any of the other patches in your set as far as I can see.

> But there's a lot of common ground between these different approaches,
> and potential for synergy, so the point of this patch set is to
> provide a unification point for all of them, and a stepping stone for
> other new resource controllers and process control modules.
>   

That precisely echos my sentiment on my submissions of some 12 months ago.

Sam.
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 3/7] containers (V7): Add generic multi-subsystem API to containers [message #17460 is a reply to message #17449] Wed, 14 February 2007 08:49 Go to previous message
Balbir Singh is currently offline  Balbir Singh
Messages: 491
Registered: August 2006
Senior Member
menage@google.com wrote:
> This patch removes all cpuset-specific knowlege from the container
> system, replacing it with a generic API that can be used by multiple
> subsystems. Cpusets is adapted to be a container subsystem.
> 
> Signed-off-by: Paul Menage <menage@google.com>
> 

Hi, Paul,

This patch fails to apply for me

balbir@balbir-laptop:~/ocbalbir/images/kernels/containers/linux-2.6.20$ 
pushpatch
patching file include/linux/container.h
patching file include/linux/cpuset.h
patching file kernel/container.c
patch: **** malformed patch at line 640: @@ -202,15 +418,18 @@ static 
DEFINE_MUTEX(callback_mutex);

multiuser_container does not apply

Is anybody else seeing this problem?

-- 
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 0/7] containers (V7): Generic Process Containers [message #17472 is a reply to message #10176] Tue, 20 February 2007 17:55 Go to previous message
Paul Menage is currently offline  Paul Menage
Messages: 642
Registered: September 2006
Senior Member
On 2/20/07, Eric W. Biederman <ebiederm@xmission.com> wrote:
> "Paul Menage" <menage@google.com> writes:
>
> > On 2/12/07, Sam Vilain <sam@vilain.net> wrote:
> >>
> >> I know I'm a bit out of touch, but AIUI the NSProxy *is* the container.
> >> We decided a long time ago that a container was basically just a set of
> >> namespaces, which includes all of the subsystems you mention.
> >
> > You may have done that, but the CKRM/ResGroups independently decided a
> > long time ago that the fundamental unit was the resource class, and
> > the OpenVZ folks decided that the fundamental unit was the
> > BeanCounter, and the CPUSet folks decided that the fundamental unit
> > was the CPUSet, etc ... :-)
>
> Using the container name is bad and it led to this stupid argument.
>
> The fundamental unit of what we have merged into the kernel is the
> namespace.  The aggregate of all namespaces and everything is the
> container.

What are you defining here as "everything"? If you mean "all things
that could be applied to a segregated group of processes such as a
virtual server", then "container" seems like a good name for my
patches, since it allows you to aggregate namespaces, resource
control, other virtualization, etc.

Sam said "the NSProxy *is* the container". You appear to be planning
to have some namespaces, possibly not aggregated within the nsproxy
(pid namespace?) but are you planning to have some higher-level
"container" object that aggregates the nsproxy and the other
namespaces? If so, what is it? Does it track process membership, etc?
What's the userspace API? In what fundamental ways would it be
different from my generic containers patches with Serge Hallyn's
namespace subsystem. (If these questions are answered already by
designs or code, I'd be happy to be pointed at them).

I guess what it comes down to, is why is an aggregation of namespaces
suitable for the name "container", when an aggregation of namespaces
and other resource controllers isn't?

What do you think might be a better name for the generic process
groups that I'm pushing? As I said, I'm happy to do a simple
search/replace on my code to give a different name if that turned out
to be the gating factor to getting it merged. But I'd be inclined to
leave that decision up to Andrew/Linus.

> > But there's a lot of common ground between these different approaches,
> > and potential for synergy, so the point of this patch set is to
> > provide a unification point for all of them, and a stepping stone for
> > other new resource controllers and process control modules.
>
> For the case of namespaces I don't see how your code makes things
> better.  I do not see a real problem that you are solving.

I'm trying to solve the problem that lots of different folks
(including us) are trying to do things that aggregate multiple process
into some kind of constrained group, and are all trying to use
different and incompatible ways of grouping/tracking those processes.

I agree that namespaces fit slightly less well into this model than
some other subsystems like resource management. But by integrating
with it you'd get  automatic access to all the various different
resource controller work that's being done.

Paul
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 0/7] containers (V7): Generic Process Containers [message #17475 is a reply to message #17451] Tue, 20 February 2007 17:34 Go to previous message
ebiederm is currently offline  ebiederm
Messages: 1354
Registered: February 2006
Senior Member
"Paul Menage" <menage@google.com> writes:

> On 2/12/07, Sam Vilain <sam@vilain.net> wrote:
>>
>> I know I'm a bit out of touch, but AIUI the NSProxy *is* the container.
>> We decided a long time ago that a container was basically just a set of
>> namespaces, which includes all of the subsystems you mention.
>
> You may have done that, but the CKRM/ResGroups independently decided a
> long time ago that the fundamental unit was the resource class, and
> the OpenVZ folks decided that the fundamental unit was the
> BeanCounter, and the CPUSet folks decided that the fundamental unit
> was the CPUSet, etc ... :-)

Using the container name is bad and it led to this stupid argument.

The fundamental unit of what we have merged into the kernel is the
namespace.  The aggregate of all namespaces and everything is the
container.

Please, please pick a different name so people don't take one look
at your stuff and get confused like Sam did.

> But there's a lot of common ground between these different approaches,
> and potential for synergy, so the point of this patch set is to
> provide a unification point for all of them, and a stepping stone for
> other new resource controllers and process control modules.

For the case of namespaces I don't see how your code makes things
better.  I do not see a real problem that you are solving.

I do agree that a common interface to the code and a common set of
infrastructure may help.

Personally until the pid namespace (which includes within it process
groupings) is complete I'm not certain the problem will be concrete
enough to solve.  Hopefully we can have a working version of that
shortly.

Eric
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 0/7] containers (V7): Generic Process Containers [message #17481 is a reply to message #17472] Tue, 20 February 2007 19:29 Go to previous message
ebiederm is currently offline  ebiederm
Messages: 1354
Registered: February 2006
Senior Member
"Paul Menage" <menage@google.com> writes:

> What are you defining here as "everything"? If you mean "all things
> that could be applied to a segregated group of processes such as a
> virtual server", then "container" seems like a good name for my
> patches, since it allows you to aggregate namespaces, resource
> control, other virtualization, etc.

The aggregation happens at the task struct when you apply different
pieces to a task we don't need a structure to allow aggregation.

> Sam said "the NSProxy *is* the container". You appear to be planning
> to have some namespaces, possibly not aggregated within the nsproxy
> (pid namespace?) but are you planning to have some higher-level
> "container" object that aggregates the nsproxy and the other
> namespaces?

No. A reverse mapping is not needed and is not interesting.
As long as I can walk all processes and ask what namespace are
you in I don't care.

> If so, what is it? Does it track process membership, etc?
> What's the userspace API? In what fundamental ways would it be
> different from my generic containers patches with Serge Hallyn's
> namespace subsystem. (If these questions are answered already by
> designs or code, I'd be happy to be pointed at them).


> I guess what it comes down to, is why is an aggregation of namespaces
> suitable for the name "container", when an aggregation of namespaces
> and other resource controllers isn't?

To me at least the interesting part of your work is not the aggregation
portion.  But the infrastructure for building the different process
groups.

You seem to be calling your groups of processes lumped together for
one purpose or another a container.

> What do you think might be a better name for the generic process
> groups that I'm pushing? As I said, I'm happy to do a simple
> search/replace on my code to give a different name if that turned out
> to be the gating factor to getting it merged. But I'd be inclined to
> leave that decision up to Andrew/Linus.

As we have previously been using the term container is like
application.  The end product that the user sees built out of
individual kernel provided components, but not something that exists
in the kernel as such.

We need a term for the non-aggregated case for the individual process
groups we build this out of in your infrastructure.  Because you
clearly have more than one kind of process group a set of processes
can belong to.

>> > But there's a lot of common ground between these different approaches,
>> > and potential for synergy, so the point of this patch set is to
>> > provide a unification point for all of them, and a stepping stone for
>> > other new resource controllers and process control modules.
>>
>> For the case of namespaces I don't see how your code makes things
>> better.  I do not see a real problem that you are solving.
>
> I'm trying to solve the problem that lots of different folks
> (including us) are trying to do things that aggregate multiple process
> into some kind of constrained group, and are all trying to use
> different and incompatible ways of grouping/tracking those processes.
>
> I agree that namespaces fit slightly less well into this model than
> some other subsystems like resource management. But by integrating
> with it you'd get  automatic access to all the various different
> resource controller work that's being done.

I don't need to integrate with it to get access to the resource
controller work.  The resource controller work applies to a set of
processes.  I have a set of processes.  I just need to apply the
resource controllers to my set of processes.

At least when we don't talk about the names for these things it
is that simple.  Now global names I have issues with because almost
always the global names fall into their own category, but that is
a different story.

Now I have some half backed ideas that might be useful for providing
a better abstraction.  But it requires setting down and looking
at the problem in detail, and understanding what people are trying
to accomplish with these things they are building.  What subset of
process groups do you find interesting.

All that is necessary to have a group of processes do something
in an unnamed fashion is to hang a pointer off of the task_struct.
That's easy.  You are adding a lot more to that, and there is
enough abstraction I haven't been able to easily work my way through
the infrastructure to what is really going on.  Admittedly I haven't
looked closely yet.

Eric
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 0/7] containers (V7): Generic Process Containers [message #17484 is a reply to message #10176] Tue, 20 February 2007 22:19 Go to previous message
Paul Menage is currently offline  Paul Menage
Messages: 642
Registered: September 2006
Senior Member
On 2/20/07, Sam Vilain <sam@vilain.net> wrote:
>
> The term "segregated group of processes" is too vague.  Segregated for
> what?  What is the kernel supposed to do with this information?

The generic part of the kernel just keeps track of the fact that
they're segregated (and their children, etc).

It's the clients of this subsystem (virtual servers, resource
controllers) that can decide to give different per-process behaviour
based on the membership of various groups.

> Did you like the names I came up with in my original reply?
>
>  - CPUset namespace for CPU partitioning
>  - Resource namespaces:
>    - cpusched namespace for CPU
>    - ulimit namespace for memory
>    - quota namespace for disk space
>    - io namespace for disk activity
>    - etc

This is a strange abuse of the term "namespace".
http://en.wikipedia.org/wiki/Namespace_%28computer_science%29

For the virtual server work that you're doing, namespace is a good term:

pids name processes, hence a pid namespace lets you have multiple
distinct mappings from pids to processes
filenames name files, so a filename (or mount) namespace lets you have
multiple distinct mappings from filenames to files.

For resource QoS control, it doesn't really make sense to talk about
namespaces. We're not virtualizing resources to rename them for
different virtual servers, we're limiting the quality of access to the
resources.

But the semantics of the term "namespace" notwithstanding, you're
equating a virtual server namespace (pid, ipc, uts, mounts, etc) with
a resource controller (memory, I/O, etc) in terms of their place in a
hierarchy, which is something I agree with. All of these subsystems
can be considered to be units that can be associated with groups of
processes; the ultimate grouping of processes is something that we're
both ultimately referring to as a container.

>
> Maybe what's missing is a set of helper macros/functions that assist
> with writing new namespaces.

That's pretty much what my containers patch does - provides a generic
way to associate state (what you're referring to as a namespace) with
groups of processes and a consistent user-space API for manipulating
them.

Paul
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 0/7] containers (V7): Generic Process Containers [message #17485 is a reply to message #17481] Tue, 20 February 2007 22:47 Go to previous message
Paul Menage is currently offline  Paul Menage
Messages: 642
Registered: September 2006
Senior Member
On 2/20/07, Eric W. Biederman <ebiederm@xmission.com> wrote:
> > Sam said "the NSProxy *is* the container". You appear to be planning
> > to have some namespaces, possibly not aggregated within the nsproxy
> > (pid namespace?) but are you planning to have some higher-level
> > "container" object that aggregates the nsproxy and the other
> > namespaces?
>
> No. A reverse mapping is not needed and is not interesting.

... to you.

> As long as I can walk all processes and ask what namespace are
> you in I don't care.

How do you currently do that?

>
> To me at least the interesting part of your work is not the aggregation
> portion.  But the infrastructure for building the different process
> groups.

In that case you can easily use it to just assign one namespace type
to each tree of process groups. The aggregation is something that
other groups find useful, but isn't required for the user to actually
make use of.

>
> You seem to be calling your groups of processes lumped together for
> one purpose or another a container.

Correct.

>
> We need a term for the non-aggregated case for the individual process
> groups we build this out of in your infrastructure.  Because you
> clearly have more than one kind of process group a set of processes
> can belong to.

Yes. That's why my system supports multiple unrelated hierarchies of groups.

> > I agree that namespaces fit slightly less well into this model than
> > some other subsystems like resource management. But by integrating
> > with it you'd get  automatic access to all the various different
> > resource controller work that's being done.
>
> I don't need to integrate with it to get access to the resource
> controller work.

Right, you could certainly do the extra work of tying your virtual
servers together with resource controllers in userspace. But you'd
still need an API to allow those resource controllers to be associated
with groups of processes and configured with limits/guarantees, which
is one of the main aims of my containers patch.

> Now I have some half backed ideas that might be useful for providing
> a better abstraction.  But it requires setting down and looking
> at the problem in detail, and understanding what people are trying
> to accomplish with these things they are building.  What subset of
> process groups do you find interesting.

I'm primarily interested in getting something in the kernel that can
be used as a base for interesting subsystems that apply behavioural or
QoS changes to defined groups of processes. Resource controllers and
namespaces seem to be good examples of this, but I can think of useful
subsystems for monitoring, permission control, etc.

>
> All that is necessary to have a group of processes do something
> in an unnamed fashion is to hang a pointer off of the task_struct.
> That's easy.

Right, adding a pointer to task_struct is easy. Configuring how/when
to not directly inherit it from the parent, or to change it for a
running task, or configuring state associated with the thing that the
pointer is pointing to, naming that group, and determining which group
a given process is assocaited with, is something that's effectively
repeated boiler plate for each different subsystem, and which can be
accomplished more generically via an abstraction like my containers
patch.

> You are adding a lot more to that, and there is

The main thing that I'm adding on top of the operations mentioned in
the previous paragraph, which pretty much essentially have to be in
the kernel, is the ability to group multiple different subsystems
together so that they share the same process->container mappings. Yes,
that is something that could potentially be done from userspace
instead, and just provide an independent tree for each subsystem or
namespace. But there seems to be interest from other parties
(including me) in having kernel support for it.

Paul
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 0/7] containers (V7): Generic Process Containers [message #17486 is a reply to message #10176] Tue, 20 February 2007 23:28 Go to previous message
Paul Menage is currently offline  Paul Menage
Messages: 642
Registered: September 2006
Senior Member
On 2/20/07, Sam Vilain <sam@vilain.net> wrote:
>
> I don't necessarily agree with the 'heirarchy' bit.  It doesn't have to
> be so segregated.  But I think we already covered that in this thread.

OK, but it's much easier to use a hierarchical system as a flat system
(just don't create children) than a flat system as a hierarchical
system. And others do seem to want a hierarchy for their process
groupings.

>
> I agree with the comment on the abuse of the term "namespace", though
> consider that it has already been abused with the term IPC namespaces.

That doesn't seem like an abuse to me at all - you're controlling what
IPC object a given name (shm_id, sem_id or msg_id) refers to for any
given group of processes.

> We have for some time been using it to refer to groupable entities
> within the kernel that are associated with tasks, even if they don't
> involve named entities that clash within a particular domain.  But there
> is always an entity and a domain, and that is the key point I'm trying
> to make - the features you are putting forward are no different to the
> examples that we made specifically for the purpose of setting the
> standard for further features to follow.

They're very similar, I agree. An important difference is that things
like pid/mount namespaces are simply ways of controlling the
visibility of existing unix concepts, such as processes or
filesystems. You don't need additional configuration to make them
useful, as unix already has standard ways of manipulating them.

Things like resource controllers typically require additional
configuration to control how much resources each group of processes
can consume, etc. Also, it appears to be much more common to want to
move tasks between different resource controllers than to move them
between different namespaces. (And in order to configure and move, you
need to be able to name)

The configuration, naming and movement are the features that my
container patch provides on top of the features that nsproxy provides
for namespaces.

> anyway, feel free to flog this old dead horse and suggest different
> terms.  We've all had long enough to think about it since so maybe it's
> worth it, but with any new term it should be really darned clear that
> they're essentially the same thing as namespaces, or otherwise be really
> likable.

What you're calling a "namespace", I'm calling a "subsystem state".
Essentially they're the same thing. The important thing that generic
containers provide is a standard way to manipulate "subsystem states"
or "namespaces".

Paul
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 0/7] containers (V7): Generic Process Containers [message #17487 is a reply to message #10176] Tue, 20 February 2007 23:36 Go to previous message
Paul Menage is currently offline  Paul Menage
Messages: 642
Registered: September 2006
Senior Member
On 2/20/07, Sam Vilain <sam@vilain.net> wrote:
> Paul Menage wrote:
> >> No. A reverse mapping is not needed and is not interesting.
> >>
> > ... to you.
> >
>
> You're missing the point of Eric's next sentence.  If you can achieve
> everything you need to achieve and get all the information you are after
> without it, then it is uninteresting.

Yes, you can do it with an exhaustive trawl of /proc. That can be very
expensive on busy machines.

>
> >> As long as I can walk all processes and ask what namespace are
> >> you in I don't care.
> >>
> >
> > How do you currently do that?
> >
>
> Take a look at /proc/PID/mounts for example.

That doesn't tell you what mounts namespace a process is in - it tells
you what the process can *view* in the namespace.

>
> So make helpers.  Macros.  Anything, just don't introduce model
> limitations like the container structure, because we've already got the
> structure; the nsproxy.
>

As I mentioned in another email, nsproxy is fine for things that don't
need explicit configuration or reporting, or which already have
configuration methods (such as fork(), mount(), etc) available, and
which don't need to support movement of processes between different
"namespaces". If it was extended to support the
naming/config/movement, then it would be fine to use it as the
equivalent of a container.

I'd actually be interested in trying to combine my container object
and the nsproxy object into a single concept.

Paul
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 0/7] containers (V7): Generic Process Containers [message #17488 is a reply to message #17485] Tue, 20 February 2007 23:32 Go to previous message
serue is currently offline  serue
Messages: 750
Registered: February 2006
Senior Member
Quoting Paul Menage (menage@google.com):
> On 2/20/07, Eric W. Biederman <ebiederm@xmission.com> wrote:
> >All that is necessary to have a group of processes do something
> >in an unnamed fashion is to hang a pointer off of the task_struct.
> >That's easy.
> 
> Right, adding a pointer to task_struct is easy. Configuring how/when
> to not directly inherit it from the parent, or to change it for a
> running task, or configuring state associated with the thing that the
> pointer is pointing to, naming that group, and determining which group
> a given process is assocaited with, is something that's effectively
> repeated boiler plate for each different subsystem, and which can be
> accomplished more generically via an abstraction like my containers
> patch.

Eric,

what you gain with this patchset is, one very simple container subsystem
can tie a container to a cpu, another can limit it's RSS, and suddenly
you can

	mount -t container -o ns,rss,cpuwhatever ns /container

And each virtual server you create by unsharing can get automatic cpu
and rss controls.

That is worthwhile imo.

-serge
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 0/7] containers (V7): Generic Process Containers [message #17489 is a reply to message #10176] Tue, 20 February 2007 23:37 Go to previous message
serue is currently offline  serue
Messages: 750
Registered: February 2006
Senior Member
Quoting Sam Vilain (sam@vilain.net):
> Paul Menage wrote:

> We talked about naming a bit before, see
> http://lkml.org/lkml/2006/3/21/403 and possibly other threads.  So,
> anyway, feel free to flog this old dead horse and suggest different
> terms.  We've all had long enough to think about it since so maybe it's

Bah,

Paul, how about s/container/rug/, as in resource usage group?

At least you can be pretty sure that (a) rug won't conflict with any
other computer science therm, and (b) for that exact reason it should be
pretty memorable.

-serge
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 0/7] containers (V7): Generic Process Containers [message #17492 is a reply to message #17472] Tue, 20 February 2007 21:58 Go to previous message
Sam Vilain is currently offline  Sam Vilain
Messages: 73
Registered: February 2006
Member
Paul Menage wrote:
>> Using the container name is bad and it led to this stupid argument.
>>
>> The fundamental unit of what we have merged into the kernel is the
>> namespace.  The aggregate of all namespaces and everything is the
>> container.
>>     
> What are you defining here as "everything"? If you mean "all things
> that could be applied to a segregated group of processes such as a
> virtual server",

The term "segregated group of processes" is too vague.  Segregated for
what?  What is the kernel supposed to do with this information?

> I guess what it comes down to, is why is an aggregation of namespaces
> suitable for the name "container", when an aggregation of namespaces
> and other resource controllers isn't?
>   

This argument goes away if you just rename these resource groups to
resource namespaces.

> What do you think might be a better name for the generic process
> groups that I'm pushing? As I said, I'm happy to do a simple
> search/replace on my code to give a different name if that turned out
> to be the gating factor to getting it merged. But I'd be inclined to
> leave that decision up to Andrew/Linus.
>   

Did you like the names I came up with in my original reply?

 - CPUset namespace for CPU partitioning
 - Resource namespaces:
   - cpusched namespace for CPU
   - ulimit namespace for memory
   - quota namespace for disk space
   - io namespace for disk activity
   - etc

>> For the case of namespaces I don't see how your code makes things
>> better.  I do not see a real problem that you are solving.
>>     
> I'm trying to solve the problem that lots of different folks
> (including us) are trying to do things that aggregate multiple process
> into some kind of constrained group, and are all trying to use
> different and incompatible ways of grouping/tracking those processes.
>   

Maybe what's missing is a set of helper macros/functions that assist
with writing new namespaces.  Perhaps you can give some more examples
and we can consider these on a case by case basis.

Sam.
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 0/7] containers (V7): Generic Process Containers [message #17494 is a reply to message #17484] Tue, 20 February 2007 22:58 Go to previous message
Sam Vilain is currently offline  Sam Vilain
Messages: 73
Registered: February 2006
Member
Paul Menage wrote:
>> The term "segregated group of processes" is too vague.  Segregated for
>> what?  What is the kernel supposed to do with this information?
>>     
> The generic part of the kernel just keeps track of the fact that
> they're segregated (and their children, etc).
>
> It's the clients of this subsystem (virtual servers, resource
> controllers) that can decide to give different per-process behaviour
> based on the membership of various groups.
>   

So those clients can use helper functions to use their own namespaces. 
If they happen to group things in the same way they they'll all end up
using the same nsproxy.

>> Did you like the names I came up with in my original reply?
>>
>>  - CPUset namespace for CPU partitioning
>>  - Resource namespaces:
>>    - cpusched namespace for CPU
>>    - ulimit namespace for memory
>>    - quota namespace for disk space
>>    - io namespace for disk activity
>>    - etc
>>     
>
> This is a strange abuse of the term "namespace".
> http://en.wikipedia.org/wiki/Namespace_%28computer_science%29
>
> For the virtual server work that you're doing, namespace is a good term:
>
> pids name processes, hence a pid namespace lets you have multiple
> distinct mappings from pids to processes
> filenames name files, so a filename (or mount) namespace lets you have
> multiple distinct mappings from filenames to files.
>
> For resource QoS control, it doesn't really make sense to talk about
> namespaces. We're not virtualizing resources to rename them for
> different virtual servers, we're limiting the quality of access to the
> resources.
>
> But the semantics of the term "namespace" notwithstanding, you're
> equating a virtual server namespace (pid, ipc, uts, mounts, etc) with
> a resource controller (memory, I/O, etc) in terms of their place in a
> hierarchy, which is something I agree with. All of these subsystems
> can be considered to be units that can be associated with groups of
> processes; the ultimate grouping of processes is something that we're
> both ultimately referring to as a container.
>   

I don't necessarily agree with the 'heirarchy' bit.  It doesn't have to
be so segregated.  But I think we already covered that in this thread.

I agree with the comment on the abuse of the term "namespace", though
consider that it has already been abused with the term IPC namespaces. 
We have for some time been using it to refer to groupable entities
within the kernel that are associated with tasks, even if they don't
involve named entities that clash within a particular domain.  But there
is always an entity and a domain, and that is the key point I'm trying
to make - the features you are putting forward are no different to the
examples that we made specifically for the purpose of setting the
standard for further features to follow.

We talked about naming a bit before, see
http://lkml.org/lkml/2006/3/21/403 and possibly other threads.  So,
anyway, feel free to flog this old dead horse and suggest different
terms.  We've all had long enough to think about it since so maybe it's
worth it, but with any new term it should be really darned clear that
they're essentially the same thing as namespaces, or otherwise be really
likable.

Sam.
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [PATCH 0/7] containers (V7): Generic Process Containers [message #17495 is a reply to message #17485] Tue, 20 February 2007 23:08 Go to previous message
Sam Vilain is currently offline  Sam Vilain
Messages: 73
Registered: February 2006
Member
Paul Menage wrote:
>> No. A reverse mapping is not needed and is not interesting.
>>     
> ... to you.
>   

You're missing the point of Eric's next sentence.  If you can achieve
everything you need to achieve and get all the information you are after
without it, then it is uninteresting.

>> As long as I can walk all processes and ask what namespace are
>> you in I don't care.
>>     
>
> How do you currently do that?
>   

Take a look at /proc/PID/mounts for example.

>> All that is necessary to have a group of processes do something
>> in an unnamed fashion is to hang a pointer off of the task_struct.
>> That's easy.
>>     
> Right, adding a pointer to task_struct is easy. Configuring how/when
> to not directly inherit it from the parent, or to change it for a
> running task, or configuring state associated with the thing that the
> pointer is pointing to, naming that group, and determining which group
> a given process is assocaited with, is something that's effectively
> repeated boiler plate for each different subsystem, and which can be
> accomplished more generically via an abstraction like my containers
> patch.
>   

So make helpers.  Macros.  Anything, just don't introduce model
limitations like the container structure, because we've already got the
structure; the nsproxy.

Sam.
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [ckrm-tech] [PATCH 7/7] containers (V7): Container interface to nsproxy subsystem [message #18013 is a reply to message #11456] Sat, 24 March 2007 16:23 Go to previous message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Sat, Mar 24, 2007 at 10:35:37AM +0530, Srivatsa Vaddagiri wrote:
> > +static int ns_create(struct container_subsys *ss, struct container *cont)
> > +{
> > +	struct nscont *ns;
> > +
> > +	if (!capable(CAP_SYS_ADMIN))
> > +		return -EPERM;
> 
> Does this check break existing namespace semantics in a subtle way?
> It now requires that unshare() of namespaces by any task requires
> CAP_SYS_ADMIN capabilities.

I should clarify that I am referring to unshare thr' clone here (and not
thr' sys_unshare)

> clone(.., CLONE_NEWUTS, ..)->copy_namespaces()->ns_container_clone()->
> 	->container_clone()-> .. -> container_create() -> ns_create()
> 
> Earlier, one could unshare his uts namespace w/o CAP_SYS_ADMIN
> capabilities. Now it is required. Is that fine? Don't know.
> 
> I feel we can avoid this check totally and let the directory permissions
> take care of these checks.
> 
> Serge, what do you think?

-- 
Regards,
vatsa
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [ckrm-tech] [PATCH 7/7] containers (V7): Container interface to nsproxy subsystem [message #18030 is a reply to message #18013] Mon, 26 March 2007 21:57 Go to previous message
serue is currently offline  serue
Messages: 750
Registered: February 2006
Senior Member
Quoting Srivatsa Vaddagiri (vatsa@in.ibm.com):
> On Sat, Mar 24, 2007 at 10:35:37AM +0530, Srivatsa Vaddagiri wrote:
> > > +static int ns_create(struct container_subsys *ss, struct container *cont)
> > > +{
> > > +	struct nscont *ns;
> > > +
> > > +	if (!capable(CAP_SYS_ADMIN))
> > > +		return -EPERM;
> > 
> > Does this check break existing namespace semantics in a subtle way?
> > It now requires that unshare() of namespaces by any task requires
> > CAP_SYS_ADMIN capabilities.
> 
> I should clarify that I am referring to unshare thr' clone here (and not
> thr' sys_unshare)

That is still not true, see kernel/utsname:copy_utsname().

Now you might have run a userspace testcase in a kernel with
CONFIG_UTS_NS=n, which at the moment erroneously returns 0 rather than
-EINVAL when you clone(CLONE_NEWUTS).  But you didn't get a new uts
namespace, you were just lied to  :)

-serge
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [ckrm-tech] [PATCH 7/7] containers (V7): Container interface to nsproxy subsystem [message #18052 is a reply to message #18030] Wed, 28 March 2007 14:55 Go to previous message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Mon, Mar 26, 2007 at 04:57:55PM -0500, Serge E. Hallyn wrote:
> That is still not true, see kernel/utsname:copy_utsname().
> 
> Now you might have run a userspace testcase in a kernel with
> CONFIG_UTS_NS=n, which at the moment erroneously returns 0 rather than
> -EINVAL when you clone(CLONE_NEWUTS).  But you didn't get a new uts
> namespace, you were just lied to  :)

I think you are right here, in that CONFIG_UTS_NS was not turned on,
although I was thinking it was on.

However as a result of this experiment, I found this anomaly:

- On a kernel with CONFIG_UTS_NS=n, a test which does
  clone(CLONE_NEWUTS) works fine. clone() succeeds and the child
  starts running with no error.
- On the same kernel, if ns container hierarchy is mounted, then
  the test fails. clone() returns failure and child is never created.
  As soon as the ns container hierarchy is unmounted, the test works
  again.

I would have expected a consistent behavior here, irrespective of
whether ns hierarchy is mounted or not. Is this difference in behavior
acceptable? Returning -EINVAL in copy_utsname() when CONFIG_UTS_NS=n, as
you say above, would fix this anomaly.


-- 
Regards,
vatsa
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Previous Topic: [PATCH 2/3] powernow-k8: switch to *_on_cpu() functions
Next Topic: aufs on 64 bit nodes: warnings on compilation
Goto Forum:
  


Current Time: Sun Oct 26 17:03:23 GMT 2025

Total time taken to generate the page: 0.09230 seconds