Home » Mailing lists » Devel » [PATCH 00/10] Task Containers(V11): Introduction
[PATCH 00/10] Task Containers(V11): Introduction [message #15191] |
Fri, 20 July 2007 18:31 |
Paul Menage
Messages: 642 Registered: September 2006
|
Senior Member |
|
|
This is an update to the task containers patchset.
Changes since V10 (May 30th) include:
- Based on 2.6.22-rc6-mm1 (minus existing container patches, see below)
- Rolled in various fix/tidy patches contributed by akpm and others
- Reorganisation of the mount/unmount code to use sget(); the new
approach is modelled on the NFS superblock code. This fixes some
potential lock inversions pointed out by lockdep.
- Fix various lockdep warnings
- Changed the create() subsystem callback to return a pointer to the
new state object rather than updating the subsystem pointer in the
container directly.
- Changed container_add_file() to automatically prefix the subsystem
name (and a period) on to all container files unless the filesystem
is mounted with the "noprefix" option (intended for use by the
legacy cpuset filesystem emulation).
- Added a release_agent= mount option to allow the release agent path
to be specified at mount time.
- css_put() is now completely non-blocking
- css_get()/css_put() avoid taking/dropping reference counts on the
root state since this can't be freed anyway; this saves some atomic ops
API changes (for subsystem writers):
1) return your new css object from create() callback
2) remove the subsystem name prefix from your cftype structures
3) pass your subsystem pointer as an additional new parameter to
container_add_file() and container_add_files()
Still TODO:
- finalize the naming
- add a hash-table based lookup for css_group objects.
- use seq_file properly in container tasks files to avoid having to
allocate a big array for all the container's task pointers.
- add virtualization support to allow delegation to virtual servers
- fix a lockdep false-positive - container_mutex nests inside
inode->i_mutex, but there's a point in the mount code where we need to
lock a newly-created (and hence guaranteed unlocked) directory from
within container_mutex.
- more subsystems
Generic Process Containers
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
containers, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "containers" and assigning arbitrary state to those
groupings, in order to control the behaviour of the container as an
aggregate.
The intention is that the various resource management and
virtualization/container efforts can also become task container
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
The patch set is structured as follows:
1) Basic container framework - filesystem and tracking structures
2) Support for the "tasks" control file
3) Hooks for fork() and exit()
4) Support for the container_clone() operation
5) Add /proc reporting interface
6) Share container subsystem pointer arrays between tasks with the
same assignments
7) Support for a userspace "release agent", similar to the cpusets
release agent functionality
8) Make cpusets a container subsystem
9) Simple CPU Accounting example subsystem
10) Simple container debugging subsystem
It applies to 2.6.22-rc6-mm1, *minus* the following patches (available
from http://www.kernel.org/pub/linux/kernel/people/akpm/mm/broken -out-2007-06-27-03-28.tar.gz)
containersv10-basic-container-framework.patch
containersv10-basic-container-framework-fix.patch
containersv10-basic-container-framework-fix-2.patch
containersv10-basic-container-framework-fix-3.patch
containersv10-example-cpu-accounting-subsystem.patch
containersv10-example-cpu-accounting-subsystem-fix.patch
containersv10-add-tasks-file-interface.patch
containersv10-add-tasks-file-interface-fix.patch
containersv10-add-tasks-file-interface-fix-2.patch
containersv10-add-fork-exit-hooks.patch
containersv10-add-fork-exit-hooks-fix.patch
containersv10-add-container_clone-interface.patch
containersv10-add-container_clone-interface-fix.patch
containersv10-add-procfs-interface.patch
containersv10-add-procfs-interface-fix.patch
containersv10-make-cpusets-a-client-of-containers.patch
containersv10-make-cpusets-a-client-of-containers-whitespace .patch
containersv10-share-css_group-arrays-between-tasks-with-same -container-memberships.patch
containersv10-share-css_group-arrays-between-tasks-with-same -container-memberships-fix.patch
containersv10-share-css_group-arrays-between-tasks-with-same -container-memberships-cpuset-zero-malloc-fix-for-new-contai ners.patch
containersv10-simple-debug-info-subsystem.patch
containersv10-simple-debug-info-subsystem-fix.patch
containersv10-simple-debug-info-subsystem-fix-2.patch
containersv10-support-for-automatic-userspace-release-agents .patch
containersv10-support-for-automatic-userspace-release-agents -whitespace.patch
add-containerstats-v3.patch
add-containerstats-v3-fix.patch
update-getdelays-to-become-containerstats-aware.patch
containers-implement-subsys-post_clone.patch
containers-implement-namespace-tracking-subsystem-v3.patch
Signed-off-by: Paul Menage <menage@google.com>
--
|
|
|
[PATCH 02/10] Task Containers(V11): Add tasks file interface [message #15192 is a reply to message #15191] |
Fri, 20 July 2007 18:31 |
Paul Menage
Messages: 642 Registered: September 2006
|
Senior Member |
|
|
This patch adds the per-directory "tasks" file for containerfs mounts; this
allows the user to determine which tasks are members of a container by reading
a container's "tasks", and to move a task into a container by writing its pid
to its "tasks".
Signed-off-by: Paul Menage <menage@google.com>
---
include/linux/container.h | 10 +
kernel/container.c | 359 +++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 368 insertions(+), 1 deletion(-)
Index: container-2.6.22-rc6-mm1/include/linux/container.h
============================================================ =======
--- container-2.6.22-rc6-mm1.orig/include/linux/container.h
+++ container-2.6.22-rc6-mm1/include/linux/container.h
@@ -144,6 +144,16 @@ int container_is_removed(const struct co
int container_path(const struct container *cont, char *buf, int buflen);
+int __container_task_count(const struct container *cont);
+static inline int container_task_count(const struct container *cont)
+{
+ int task_count;
+ rcu_read_lock();
+ task_count = __container_task_count(cont);
+ rcu_read_unlock();
+ return task_count;
+}
+
/* Return true if the container is a descendant of the current container */
int container_is_descendant(const struct container *cont);
Index: container-2.6.22-rc6-mm1/kernel/container.c
============================================================ =======
--- container-2.6.22-rc6-mm1.orig/kernel/container.c
+++ container-2.6.22-rc6-mm1/kernel/container.c
@@ -40,7 +40,7 @@
#include <linux/magic.h>
#include <linux/spinlock.h>
#include <linux/string.h>
-
+#include <linux/sort.h>
#include <asm/atomic.h>
/* Generate an array of container subsystem pointers */
@@ -704,6 +704,127 @@ int container_path(const struct containe
return 0;
}
+/*
+ * Return the first subsystem attached to a container's hierarchy, and
+ * its subsystem id.
+ */
+
+static void get_first_subsys(const struct container *cont,
+ struct container_subsys_state **css, int *subsys_id)
+{
+ const struct containerfs_root *root = cont->root;
+ const struct container_subsys *test_ss;
+ BUG_ON(list_empty(&root->subsys_list));
+ test_ss = list_entry(root->subsys_list.next,
+ struct container_subsys, sibling);
+ if (css) {
+ *css = cont->subsys[test_ss->subsys_id];
+ BUG_ON(!*css);
+ }
+ if (subsys_id)
+ *subsys_id = test_ss->subsys_id;
+}
+
+/*
+ * Attach task 'tsk' to container 'cont'
+ *
+ * Call holding container_mutex. May take task_lock of
+ * the task 'pid' during call.
+ */
+static int attach_task(struct container *cont, struct task_struct *tsk)
+{
+ int retval = 0;
+ struct container_subsys *ss;
+ struct container *oldcont;
+ struct css_group *cg = &tsk->containers;
+ struct containerfs_root *root = cont->root;
+ int i;
+ int subsys_id;
+
+ get_first_subsys(cont, NULL, &subsys_id);
+
+ /* Nothing to do if the task is already in that container */
+ oldcont = task_container(tsk, subsys_id);
+ if (cont == oldcont)
+ return 0;
+
+ for_each_subsys(root, ss) {
+ if (ss->can_attach) {
+ retval = ss->can_attach(ss, cont, tsk);
+ if (retval) {
+ return retval;
+ }
+ }
+ }
+
+ task_lock(tsk);
+ if (tsk->flags & PF_EXITING) {
+ task_unlock(tsk);
+ return -ESRCH;
+ }
+ /* Update the css_group pointers for the subsystems in this
+ * hierarchy */
+ for (i = 0; i < CONTAINER_SUBSYS_COUNT; i++) {
+ if (root->subsys_bits & (1ull << i)) {
+ /* Subsystem is in this hierarchy. So we want
+ * the subsystem state from the new
+ * container. Transfer the refcount from the
+ * old to the new */
+ atomic_inc(&cont->count);
+ atomic_dec(&cg->subsys[i]->container->count);
+ rcu_assign_pointer(cg->subsys[i], cont->subsys[i]);
+ }
+ }
+ task_unlock(tsk);
+
+ for_each_subsys(root, ss) {
+ if (ss->attach) {
+ ss->attach(ss, cont, oldcont, tsk);
+ }
+ }
+
+ synchronize_rcu();
+ return 0;
+}
+
+/*
+ * Attach task with pid 'pid' to container 'cont'. Call with
+ * container_mutex, may take task_lock of task
+ */
+static int attach_task_by_pid(struct container *cont, char *pidbuf)
+{
+ pid_t pid;
+ struct task_struct *tsk;
+ int ret;
+
+ if (sscanf(pidbuf, "%d", &pid) != 1)
+ return -EIO;
+
+ if (pid) {
+ rcu_read_lock();
+ tsk = find_task_by_pid(pid);
+ if (!tsk || tsk->flags & PF_EXITING) {
+ rcu_read_unlock();
+ return -ESRCH;
+ }
+ get_task_struct(tsk);
+ rcu_read_unlock();
+
+ if ((current->euid) && (current->euid != tsk->uid)
+ && (current->euid != tsk->suid)) {
+ put_task_struct(tsk);
+ return -EACCES;
+ }
+ } else {
+ tsk = current;
+ get_task_struct(tsk);
+ }
+
+ ret = attach_task(cont, tsk);
+ put_task_struct(tsk);
+ return ret;
+}
+
/* The various types of files and directories in a container file system */
enum container_filetype {
@@ -712,6 +833,55 @@ enum container_filetype {
FILE_TASKLIST,
};
+static ssize_t container_common_file_write(struct container *cont,
+ struct cftype *cft,
+ struct file *file,
+ const char __user *userbuf,
+ size_t nbytes, loff_t *unused_ppos)
+{
+ enum container_filetype type = cft->private;
+ char *buffer;
+ int retval = 0;
+
+ if (nbytes >= PATH_MAX)
+ return -E2BIG;
+
+ /* +1 for nul-terminator */
+ buffer = kmalloc(nbytes + 1, GFP_KERNEL);
+ if (buffer == NULL)
+ return -ENOMEM;
+
+ if (copy_from_user(buffer, userbuf, nbytes)) {
+ retval = -EFAULT;
+ goto out1;
+ }
+ buffer[nbytes] = 0; /* nul-terminate */
+
+ mutex_lock(&container_mutex);
+
+ if (container_is_removed(cont)) {
+ retval = -ENODEV;
+ goto out2;
+ }
+
+ switch (type) {
+ case FILE_TASKLIST:
+ retval = attach_task_by_pid(cont, buffer);
+ break;
+ default:
+ retval = -EINVAL;
+ goto out2;
+ }
+
+ if (retval == 0)
+ retval = nbytes;
+out2:
+ mutex_unlock(&container_mutex);
+out1:
+ kfree(buffer);
+ return retval;
+}
+
static ssize_t container_file_write(struct file *file, const char __user *buf,
size_t nbytes, loff_t *ppos)
{
@@ -915,6 +1085,189 @@ int container_add_files(struct container
return 0;
}
+/* Count the number of tasks in a container. Could be made more
+ * time-efficient but less space-efficient with more linked lists
+ * running through each container and the css_group structures that
+ * referenced it. Must be called with tasklist_lock held for read or
+ * write or in an rcu critical section.
+ */
+int __container_task_count(const struct container *cont)
+{
+ int count = 0;
+ struct task_struct *g, *p;
+ struct container_subsys_state *css;
+ int subsys_id;
+
+ get_first_subsys(cont, &css, &subsys_id);
+ do_each_thread(g, p) {
+ if (task_subsys_state(p, subsys_id) == css)
+ count ++;
+ } while_each_thread(g, p);
+ return count;
+}
+
+/*
+ * Stuff for reading the 'tasks' file.
+ *
+ * Reading this file can return large amounts of data if a container has
+ * *lots* of attached tasks. So it may need several calls to read(),
+ * but we cannot guarantee that the information we produce is correct
+ * unless we produce it entirely atomically.
+ *
+ * Upon tasks file open(), a struct ctr_struct is allocated, that
+ * will have a pointer to an array (also allocated here). The struct
+ * ctr_struct * is stored in file->private_data. Its resources will
+ * be freed by release() when the file is closed. The array is used
+ * to sprintf the PIDs and then used by read().
+ */
+struct ctr_struct {
+ char *buf;
+ int bufsz;
+};
+
+/*
+ * Load into 'pidarray' up to 'npids' of the tasks using container
+ * 'cont'. Return actual number of pids loaded. No need to
+ * task_lock(p) when reading out p->container, since we're in an RCU
+ * read section, so the css_group can't go away, and is
+ * immutable after creation.
+ */
+static int pid_array_load(pid_t *pidarray, int npids, struct container *cont)
+{
+ int n = 0;
+ struct task_struct *g, *p;
+ struct container_subsys_state *css;
+ int subsys_id;
+
+ get_first_subsys(cont, &css, &subsys_id);
+ rcu_read_lock();
+ do_each_thread(g, p) {
+ if (task_subsys_state(p, subsys_id) == css) {
+ pidarray[n++] = pid_nr(task_pid(p));
+ if (unlikely(n == npids))
+ goto array_full;
+ }
+ } while_each_thread(g, p);
+
+array_full:
+ rcu_read_unlock();
+ return n;
+}
+
+static int cmppid(const void *a, const void *b)
+{
+ return *(pid_t *)a - *(pid_t *)b;
+}
+
+/*
+ * Convert array 'a' of 'npids' pid_t's to a string of newline separated
+ * decimal pids in 'buf'. Don't write more than 'sz' chars, but return
+ * count 'cnt' of how many chars would be written if buf were large enough.
+ */
+static int pid_array_to_buf(char *buf, int sz, pid_t *a, int npids)
+{
+ int cnt = 0;
+ int i;
+
+ for (i = 0; i < npids; i++)
+ cnt += snprintf(buf + cnt, max(sz - cnt, 0), "%d\n", a[i]);
+ return cnt;
+}
+
+/*
+ * Handle an open on 'tasks' file. Prepare a buffer listing the
+ * process id's of tasks currently attached to the container being opened.
+ *
+ * Does not require any specific container mutexes, and does not take any.
+ */
+static int container_tasks_open(struct inode *unused, struct file *file)
+{
+ struct container *cont = __d_cont(file->f_dentry->d_parent);
+ struct ctr_struct *ctr;
+ pid_t *pidarray;
+ int npids;
+ char c;
+
+ if (!(file->f_mode & FMODE_READ))
+ return 0;
+
+ ctr = kmalloc(sizeof(*ctr), GFP_KERNEL);
+ if (!ctr)
+ goto err0;
+
+ /*
+ * If container gets more users after we read count, we won't have
+ * enough space - tough. This race is indistinguishable to the
+ * caller from the case that the additional container users didn't
+ * show up until sometime later on.
+ */
+ npids = container_task_count(cont);
+ if (npids) {
+ pidarray = kmalloc(npids * sizeof(pid_t), GFP_KERNEL);
+ if (!pidarray)
+ goto err1;
+
+ npids = pid_array_load(pidarray, npids, cont);
+ sort(pidarray, npids, sizeof
...
|
|
|
[PATCH 04/10] Task Containers(V11): Add container_clone() interface. [message #15193 is a reply to message #15191] |
Fri, 20 July 2007 18:31 |
Paul Menage
Messages: 642 Registered: September 2006
|
Senior Member |
|
|
This patch adds support for container_clone(), a way to create new
containers intended to be used for systems such as namespace
unsharing. A new subsystem callback, post_clone(), is added to allow
subsystems to automatically configure cloned containers.
Signed-off-by: Paul Menage <menage@google.com>
---
Documentation/containers.txt | 7 ++
include/linux/container.h | 3
kernel/container.c | 135 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 145 insertions(+)
Index: container-2.6.22-rc6-mm1/include/linux/container.h
============================================================ =======
--- container-2.6.22-rc6-mm1.orig/include/linux/container.h
+++ container-2.6.22-rc6-mm1/include/linux/container.h
@@ -174,6 +174,7 @@ struct container_subsys {
void (*exit)(struct container_subsys *ss, struct task_struct *task);
int (*populate)(struct container_subsys *ss,
struct container *cont);
+ void (*post_clone)(struct container_subsys *ss, struct container *cont);
void (*bind)(struct container_subsys *ss, struct container *root);
int subsys_id;
int active;
@@ -213,6 +214,8 @@ static inline struct container* task_con
int container_path(const struct container *cont, char *buf, int buflen);
+int container_clone(struct task_struct *tsk, struct container_subsys *ss);
+
#else /* !CONFIG_CONTAINERS */
static inline int container_init_early(void) { return 0; }
Index: container-2.6.22-rc6-mm1/kernel/container.c
============================================================ =======
--- container-2.6.22-rc6-mm1.orig/kernel/container.c
+++ container-2.6.22-rc6-mm1/kernel/container.c
@@ -1675,3 +1675,138 @@ void container_exit(struct task_struct *
tsk->containers = init_task.containers;
task_unlock(tsk);
}
+
+/**
+ * container_clone - duplicate the current container in the hierarchy
+ * that the given subsystem is attached to, and move this task into
+ * the new child
+ */
+int container_clone(struct task_struct *tsk, struct container_subsys *subsys)
+{
+ struct dentry *dentry;
+ int ret = 0;
+ char nodename[MAX_CONTAINER_TYPE_NAMELEN];
+ struct container *parent, *child;
+ struct inode *inode;
+ struct css_group *cg;
+ struct containerfs_root *root;
+ struct container_subsys *ss;
+
+ /* We shouldn't be called by an unregistered subsystem */
+ BUG_ON(!subsys->active);
+
+ /* First figure out what hierarchy and container we're dealing
+ * with, and pin them so we can drop container_mutex */
+ mutex_lock(&container_mutex);
+ again:
+ root = subsys->root;
+ if (root == &rootnode) {
+ printk(KERN_INFO
+ "Not cloning container for unused subsystem %s\n",
+ subsys->name);
+ mutex_unlock(&container_mutex);
+ return 0;
+ }
+ cg = &tsk->containers;
+ parent = task_container(tsk, subsys->subsys_id);
+
+ snprintf(nodename, MAX_CONTAINER_TYPE_NAMELEN, "node_%d", tsk->pid);
+
+ /* Pin the hierarchy */
+ atomic_inc(&parent->root->sb->s_active);
+
+ mutex_unlock(&container_mutex);
+
+ /* Now do the VFS work to create a container */
+ inode = parent->dentry->d_inode;
+
+ /* Hold the parent directory mutex across this operation to
+ * stop anyone else deleting the new container */
+ mutex_lock(&inode->i_mutex);
+ dentry = container_get_dentry(parent->dentry, nodename);
+ if (IS_ERR(dentry)) {
+ printk(KERN_INFO
+ "Couldn't allocate dentry for %s: %ld\n", nodename,
+ PTR_ERR(dentry));
+ ret = PTR_ERR(dentry);
+ goto out_release;
+ }
+
+ /* Create the container directory, which also creates the container */
+ ret = vfs_mkdir(inode, dentry, S_IFDIR | 0755);
+ child = __d_cont(dentry);
+ dput(dentry);
+ if (ret) {
+ printk(KERN_INFO
+ "Failed to create container %s: %d\n", nodename,
+ ret);
+ goto out_release;
+ }
+
+ if (!child) {
+ printk(KERN_INFO
+ "Couldn't find new container %s\n", nodename);
+ ret = -ENOMEM;
+ goto out_release;
+ }
+
+ /* The container now exists. Retake container_mutex and check
+ * that we're still in the same state that we thought we
+ * were. */
+ mutex_lock(&container_mutex);
+ if ((root != subsys->root) ||
+ (parent != task_container(tsk, subsys->subsys_id))) {
+ /* Aargh, we raced ... */
+ mutex_unlock(&inode->i_mutex);
+
+ deactivate_super(parent->root->sb);
+ /* The container is still accessible in the VFS, but
+ * we're not going to try to rmdir() it at this
+ * point. */
+ printk(KERN_INFO
+ "Race in container_clone() - leaking container %s\n",
+ nodename);
+ goto again;
+ }
+
+ /* do any required auto-setup */
+ for_each_subsys(root, ss) {
+ if (ss->post_clone)
+ ss->post_clone(ss, child);
+ }
+
+ /* All seems fine. Finish by moving the task into the new container */
+ ret = attach_task(child, tsk);
+ mutex_unlock(&container_mutex);
+
+ out_release:
+ mutex_unlock(&inode->i_mutex);
+ deactivate_super(parent->root->sb);
+ return ret;
+}
+
+/*
+ * See if "cont" is a descendant of the current task's container in
+ * the appropriate hierarchy
+ *
+ * If we are sending in dummytop, then presumably we are creating
+ * the top container in the subsystem.
+ *
+ * Called only by the ns (nsproxy) container.
+ */
+int container_is_descendant(const struct container *cont)
+{
+ int ret;
+ struct container *target;
+ int subsys_id;
+
+ if (cont == dummytop)
+ return 1;
+
+ get_first_subsys(cont, NULL, &subsys_id);
+ target = task_container(current, subsys_id);
+ while (cont != target && cont!= cont->top_container)
+ cont = cont->parent;
+ ret = (cont == target);
+ return ret;
+}
Index: container-2.6.22-rc6-mm1/Documentation/containers.txt
============================================================ =======
--- container-2.6.22-rc6-mm1.orig/Documentation/containers.txt
+++ container-2.6.22-rc6-mm1/Documentation/containers.txt
@@ -504,6 +504,13 @@ include/linux/container.h for details).
method can return an error code, the error code is currently not
always handled well.
+void post_clone(struct container_subsys *ss, struct container *cont)
+
+Called at the end of container_clone() to do any paramater
+initialization which might be required before a task could attach. For
+example in cpusets, no task may attach before 'cpus' and 'mems' are set
+up.
+
void bind(struct container_subsys *ss, struct container *root)
LL=callback_mutex
--
|
|
|
[PATCH 06/10] Task Containers(V11): Shared container subsystem group arrays [message #15194 is a reply to message #15191] |
Fri, 20 July 2007 18:31 |
Paul Menage
Messages: 642 Registered: September 2006
|
Senior Member |
|
|
This patch replaces the struct css_group embedded in task_struct with a
pointer; all tasks that have the same set of memberships across all
hierarchies will share a css_group object, and will be linked via their
css_groups field to the "tasks" list_head in the css_group.
Assuming that many tasks share the same container assignments, this reduces
overall space usage and keeps the size of the task_struct down (three pointers
added to task_struct compared to a non-containers kernel, no matter how many
subsystems are registered).
Signed-off-by: Paul Menage <menage@google.com>
---
Documentation/containers.txt | 14
include/linux/container.h | 89 +++++-
include/linux/sched.h | 33 --
kernel/container.c | 606 +++++++++++++++++++++++++++++++++++++------
kernel/fork.c | 1
5 files changed, 620 insertions(+), 123 deletions(-)
Index: container-2.6.22-rc6-mm1/Documentation/containers.txt
============================================================ =======
--- container-2.6.22-rc6-mm1.orig/Documentation/containers.txt
+++ container-2.6.22-rc6-mm1/Documentation/containers.txt
@@ -176,7 +176,9 @@ Containers extends the kernel as follows
subsystem state is something that's expected to happen frequently
and in performance-critical code, whereas operations that require a
task's actual container assignments (in particular, moving between
- containers) are less common.
+ containers) are less common. A linked list runs through the cg_list
+ field of each task_struct using the css_group, anchored at
+ css_group->tasks.
- A container hierarchy filesystem can be mounted for browsing and
manipulation from user space.
@@ -252,6 +254,16 @@ linear search to locate an appropriate e
very efficient. A future version will use a hash table for better
performance.
+To allow access from a container to the css_groups (and hence tasks)
+that comprise it, a set of cg_container_link objects form a lattice;
+each cg_container_link is linked into a list of cg_container_links for
+a single container on its cont_link_list field, and a list of
+cg_container_links for a single css_group on its cg_link_list.
+
+Thus the set of tasks in a container can be listed by iterating over
+each css_group that references the container, and sub-iterating over
+each css_group's task set.
+
The use of a Linux virtual file system (vfs) to represent the
container hierarchy provides for a familiar permission and name space
for containers, with a minimum of additional kernel code.
Index: container-2.6.22-rc6-mm1/include/linux/container.h
============================================================ =======
--- container-2.6.22-rc6-mm1.orig/include/linux/container.h
+++ container-2.6.22-rc6-mm1/include/linux/container.h
@@ -27,10 +27,19 @@ extern void container_lock(void);
extern void container_unlock(void);
extern void container_fork(struct task_struct *p);
extern void container_fork_callbacks(struct task_struct *p);
+extern void container_post_fork(struct task_struct *p);
extern void container_exit(struct task_struct *p, int run_callbacks);
extern struct file_operations proc_container_operations;
+/* Define the enumeration of all container subsystems */
+#define SUBSYS(_x) _x ## _subsys_id,
+enum container_subsys_id {
+#include <linux/container_subsys.h>
+ CONTAINER_SUBSYS_COUNT
+};
+#undef SUBSYS
+
/* Per-subsystem/per-container state maintained by the system. */
struct container_subsys_state {
/* The container that this subsystem is attached to. Useful
@@ -97,6 +106,52 @@ struct container {
struct containerfs_root *root;
struct container *top_container;
+
+ /*
+ * List of cg_container_links pointing at css_groups with
+ * tasks in this container. Protected by css_group_lock
+ */
+ struct list_head css_groups;
+};
+
+/* A css_group is a structure holding pointers to a set of
+ * container_subsys_state objects. This saves space in the task struct
+ * object and speeds up fork()/exit(), since a single inc/dec and a
+ * list_add()/del() can bump the reference count on the entire
+ * container set for a task.
+ */
+
+struct css_group {
+
+ /* Reference count */
+ struct kref ref;
+
+ /*
+ * List running through all container groups. Protected by
+ * css_group_lock
+ */
+ struct list_head list;
+
+ /*
+ * List running through all tasks using this container
+ * group. Protected by css_group_lock
+ */
+ struct list_head tasks;
+
+ /*
+ * List of cg_container_link objects on link chains from
+ * containers referenced from this css_group. Protected by
+ * css_group_lock
+ */
+ struct list_head cg_links;
+
+ /*
+ * Set of subsystem states, one for each subsystem. This array
+ * is immutable after creation apart from the init_css_group
+ * during subsystem registration (at boot time).
+ */
+ struct container_subsys_state *subsys[CONTAINER_SUBSYS_COUNT];
+
};
/* struct cftype:
@@ -149,15 +204,7 @@ int container_is_removed(const struct co
int container_path(const struct container *cont, char *buf, int buflen);
-int __container_task_count(const struct container *cont);
-static inline int container_task_count(const struct container *cont)
-{
- int task_count;
- rcu_read_lock();
- task_count = __container_task_count(cont);
- rcu_read_unlock();
- return task_count;
-}
+int container_task_count(const struct container *cont);
/* Return true if the container is a descendant of the current container */
int container_is_descendant(const struct container *cont);
@@ -205,7 +252,7 @@ static inline struct container_subsys_st
static inline struct container_subsys_state *task_subsys_state(
struct task_struct *task, int subsys_id)
{
- return rcu_dereference(task->containers.subsys[subsys_id]);
+ return rcu_dereference(task->containers->subsys[subsys_id]);
}
static inline struct container* task_container(struct task_struct *task,
@@ -218,6 +265,27 @@ int container_path(const struct containe
int container_clone(struct task_struct *tsk, struct container_subsys *ss);
+/* A container_iter should be treated as an opaque object */
+struct container_iter {
+ struct list_head *cg_link;
+ struct list_head *task;
+};
+
+/* To iterate across the tasks in a container:
+ *
+ * 1) call container_iter_start to intialize an iterator
+ *
+ * 2) call container_iter_next() to retrieve member tasks until it
+ * returns NULL or until you want to end the iteration
+ *
+ * 3) call container_iter_end() to destroy the iterator.
+ */
+void container_iter_start(struct container *cont, struct container_iter *it);
+struct task_struct *container_iter_next(struct container *cont,
+ struct container_iter *it);
+void container_iter_end(struct container *cont, struct container_iter *it);
+
+
#else /* !CONFIG_CONTAINERS */
static inline int container_init_early(void) { return 0; }
@@ -225,6 +293,7 @@ static inline int container_init(void) {
static inline void container_init_smp(void) {}
static inline void container_fork(struct task_struct *p) {}
static inline void container_fork_callbacks(struct task_struct *p) {}
+static inline void container_post_fork(struct task_struct *p) {}
static inline void container_exit(struct task_struct *p, int callbacks) {}
static inline void container_lock(void) {}
Index: container-2.6.22-rc6-mm1/include/linux/sched.h
============================================================ =======
--- container-2.6.22-rc6-mm1.orig/include/linux/sched.h
+++ container-2.6.22-rc6-mm1/include/linux/sched.h
@@ -909,34 +909,6 @@ struct sched_entity {
unsigned long wait_runtime_overruns, wait_runtime_underruns;
};
-#ifdef CONFIG_CONTAINERS
-
-#define SUBSYS(_x) _x ## _subsys_id,
-enum container_subsys_id {
-#include <linux/container_subsys.h>
- CONTAINER_SUBSYS_COUNT
-};
-#undef SUBSYS
-
-/* A css_group is a structure holding pointers to a set of
- * container_subsys_state objects.
- */
-
-struct css_group {
-
- /* Set of subsystem states, one for each subsystem. NULL for
- * subsystems that aren't part of this hierarchy. These
- * pointers reduce the number of dereferences required to get
- * from a task to its state for a given container, but result
- * in increased space usage if tasks are in wildly different
- * groupings across different hierarchies. This array is
- * immutable after creation */
- struct container_subsys_state *subsys[CONTAINER_SUBSYS_COUNT];
-
-};
-
-#endif /* CONFIG_CONTAINERS */
-
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1174,7 +1146,10 @@ struct task_struct {
int cpuset_mem_spread_rotor;
#endif
#ifdef CONFIG_CONTAINERS
- struct css_group containers;
+ /* Container info protected by css_group_lock */
+ struct css_group *containers;
+ /* cg_list protected by css_group_lock and tsk->alloc_lock */
+ struct list_head cg_list;
#endif
struct robust_list_head __user *robust_list;
#ifdef CONFIG_COMPAT
Index: container-2.6.22-rc6-mm1/kernel/container.c
============================================================ =======
--- container-2.6.22-rc6-mm1.orig/kernel/container.c
+++ container-2.6.22-rc6-mm1/kernel/container.c
@@ -95,6 +95,7 @@ static struct containerfs_root rootnode;
/* The list of hierarchy roots */
static LIST_HEAD(roots);
+static int root_count;
/* dummytop is a shorthand for the dummy hierarchy's top container */
#define dummytop (&rootnode.top_container)
@@ -133,12 +134,49 @@ list_for_each_entry(_ss, &_root->subsys_
#define for_each_root(_root) \
list_for_each_entry(_root, &roots, root_list)
-/* Each task_struct has an embedded css_group, so the get/put
- * operation simply takes a reference count on all the containers
- * referenced by subsystems in this css_group. This can end up
- * multiple-counting some containers, but that's OK - the ref-count is
- * just a busy/not-busy indicator; ensuring that we only count each
- * container once would require taking a g
...
|
|
|
[PATCH 05/10] Task Containers(V11): Add procfs interface [message #15195 is a reply to message #15191] |
Fri, 20 July 2007 18:31 |
Paul Menage
Messages: 642 Registered: September 2006
|
Senior Member |
|
|
This patch adds:
/proc/containers - general system info
/proc/*/container - per-task container membership info
Signed-off-by: Paul Menage <menage@google.com>
---
fs/proc/base.c | 7 ++
include/linux/container.h | 2
kernel/container.c | 132 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 141 insertions(+)
Index: container-2.6.22-rc6-mm1/fs/proc/base.c
============================================================ =======
--- container-2.6.22-rc6-mm1.orig/fs/proc/base.c
+++ container-2.6.22-rc6-mm1/fs/proc/base.c
@@ -67,6 +67,7 @@
#include <linux/mount.h>
#include <linux/security.h>
#include <linux/ptrace.h>
+#include <linux/container.h>
#include <linux/cpuset.h>
#include <linux/audit.h>
#include <linux/poll.h>
@@ -2050,6 +2051,9 @@ static const struct pid_entry tgid_base_
#ifdef CONFIG_CPUSETS
REG("cpuset", S_IRUGO, cpuset),
#endif
+#ifdef CONFIG_CONTAINERS
+ REG("container", S_IRUGO, container),
+#endif
INF("oom_score", S_IRUGO, oom_score),
REG("oom_adj", S_IRUGO|S_IWUSR, oom_adjust),
#ifdef CONFIG_AUDITSYSCALL
@@ -2341,6 +2345,9 @@ static const struct pid_entry tid_base_s
#ifdef CONFIG_CPUSETS
REG("cpuset", S_IRUGO, cpuset),
#endif
+#ifdef CONFIG_CONTAINERS
+ REG("container", S_IRUGO, container),
+#endif
INF("oom_score", S_IRUGO, oom_score),
REG("oom_adj", S_IRUGO|S_IWUSR, oom_adjust),
#ifdef CONFIG_AUDITSYSCALL
Index: container-2.6.22-rc6-mm1/kernel/container.c
============================================================ =======
--- container-2.6.22-rc6-mm1.orig/kernel/container.c
+++ container-2.6.22-rc6-mm1/kernel/container.c
@@ -33,6 +33,7 @@
#include <linux/mutex.h>
#include <linux/mount.h>
#include <linux/pagemap.h>
+#include <linux/proc_fs.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/seq_file.h>
@@ -247,6 +248,7 @@ static int container_mkdir(struct inode
static int container_rmdir(struct inode *unused_dir, struct dentry *dentry);
static int container_populate_dir(struct container *cont);
static struct inode_operations container_dir_inode_operations;
+static struct file_operations proc_containerstats_operations;
static struct inode *container_new_inode(mode_t mode, struct super_block *sb)
{
@@ -1567,6 +1569,7 @@ int __init container_init(void)
{
int err;
int i;
+ struct proc_dir_entry *entry;
for (i = 0; i < CONTAINER_SUBSYS_COUNT; i++) {
struct container_subsys *ss = subsys[i];
@@ -1578,10 +1581,139 @@ int __init container_init(void)
if (err < 0)
goto out;
+ entry = create_proc_entry("containers", 0, NULL);
+ if (entry)
+ entry->proc_fops = &proc_containerstats_operations;
+
out:
return err;
}
+/*
+ * proc_container_show()
+ * - Print task's container paths into seq_file, one line for each hierarchy
+ * - Used for /proc/<pid>/container.
+ * - No need to task_lock(tsk) on this tsk->container reference, as it
+ * doesn't really matter if tsk->container changes after we read it,
+ * and we take container_mutex, keeping attach_task() from changing it
+ * anyway. No need to check that tsk->container != NULL, thanks to
+ * the_top_container_hack in container_exit(), which sets an exiting tasks
+ * container to top_container.
+ */
+
+/* TODO: Use a proper seq_file iterator */
+static int proc_container_show(struct seq_file *m, void *v)
+{
+ struct pid *pid;
+ struct task_struct *tsk;
+ char *buf;
+ int retval;
+ struct containerfs_root *root;
+
+ retval = -ENOMEM;
+ buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!buf)
+ goto out;
+
+ retval = -ESRCH;
+ pid = m->private;
+ tsk = get_pid_task(pid, PIDTYPE_PID);
+ if (!tsk)
+ goto out_free;
+
+ retval = 0;
+
+ mutex_lock(&container_mutex);
+
+ for_each_root(root) {
+ struct container_subsys *ss;
+ struct container *cont;
+ int subsys_id;
+ int count = 0;
+
+ /* Skip this hierarchy if it has no active subsystems */
+ if (!root->actual_subsys_bits)
+ continue;
+ for_each_subsys(root, ss)
+ seq_printf(m, "%s%s", count++ ? "," : "", ss->name);
+ seq_putc(m, ':');
+ get_first_subsys(&root->top_container, NULL, &subsys_id);
+ cont = task_container(tsk, subsys_id);
+ retval = container_path(cont, buf, PAGE_SIZE);
+ if (retval < 0)
+ goto out_unlock;
+ seq_puts(m, buf);
+ seq_putc(m, '\n');
+ }
+
+out_unlock:
+ mutex_unlock(&container_mutex);
+ put_task_struct(tsk);
+out_free:
+ kfree(buf);
+out:
+ return retval;
+}
+
+static int container_open(struct inode *inode, struct file *file)
+{
+ struct pid *pid = PROC_I(inode)->pid;
+ return single_open(file, proc_container_show, pid);
+}
+
+struct file_operations proc_container_operations = {
+ .open = container_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+/* Display information about each subsystem and each hierarchy */
+static int proc_containerstats_show(struct seq_file *m, void *v)
+{
+ int i;
+ struct containerfs_root *root;
+
+ mutex_lock(&container_mutex);
+ seq_puts(m, "Hierarchies:\n");
+ for_each_root(root) {
+ struct container_subsys *ss;
+ int first = 1;
+ seq_printf(m, "%p: bits=%lx containers=%d (", root,
+ root->subsys_bits, root->number_of_containers);
+ for_each_subsys(root, ss) {
+ seq_printf(m, "%s%s", first ? "" : ", ", ss->name);
+ first = false;
+ }
+ seq_putc(m, ')');
+ if (root->sb) {
+ seq_printf(m, " s_active=%d",
+ atomic_read(&root->sb->s_active));
+ }
+ seq_putc(m, '\n');
+ }
+ seq_puts(m, "Subsystems:\n");
+ for (i = 0; i < CONTAINER_SUBSYS_COUNT; i++) {
+ struct container_subsys *ss = subsys[i];
+ seq_printf(m, "%d: name=%s hierarchy=%p\n",
+ i, ss->name, ss->root);
+ }
+ mutex_unlock(&container_mutex);
+ return 0;
+}
+
+static int containerstats_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, proc_containerstats_show, 0);
+}
+
+static struct file_operations proc_containerstats_operations = {
+ .open = containerstats_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
/**
* container_fork - attach newly forked task to its parents container.
* @tsk: pointer to task_struct of forking parent process.
Index: container-2.6.22-rc6-mm1/include/linux/container.h
============================================================ =======
--- container-2.6.22-rc6-mm1.orig/include/linux/container.h
+++ container-2.6.22-rc6-mm1/include/linux/container.h
@@ -29,6 +29,8 @@ extern void container_fork(struct task_s
extern void container_fork_callbacks(struct task_struct *p);
extern void container_exit(struct task_struct *p, int run_callbacks);
+extern struct file_operations proc_container_operations;
+
/* Per-subsystem/per-container state maintained by the system. */
struct container_subsys_state {
/* The container that this subsystem is attached to. Useful
--
|
|
|
[PATCH 10/10] Task Containers(V11): Simple task container debug info subsystem [message #15196 is a reply to message #15191] |
Fri, 20 July 2007 18:32 |
Paul Menage
Messages: 642 Registered: September 2006
|
Senior Member |
|
|
This example subsystem exports debugging information as an aid to diagnosing
refcount leaks, etc, in the container framework.
Signed-off-by: Paul Menage <menage@google.com>
---
include/linux/container_subsys.h | 4 +
init/Kconfig | 10 ++++
kernel/Makefile | 1
kernel/container_debug.c | 97 +++++++++++++++++++++++++++++++++++++++
4 files changed, 112 insertions(+)
Index: container-2.6.22-rc6-mm1/include/linux/container_subsys.h
============================================================ =======
--- container-2.6.22-rc6-mm1.orig/include/linux/container_subsys .h
+++ container-2.6.22-rc6-mm1/include/linux/container_subsys.h
@@ -19,4 +19,8 @@ SUBSYS(cpuacct)
/* */
+#ifdef CONFIG_CONTAINER_DEBUG
+SUBSYS(debug)
+#endif
+
/* */
Index: container-2.6.22-rc6-mm1/init/Kconfig
============================================================ =======
--- container-2.6.22-rc6-mm1.orig/init/Kconfig
+++ container-2.6.22-rc6-mm1/init/Kconfig
@@ -303,6 +303,16 @@ config CONTAINERS
Say N if unsure.
+config CONTAINER_DEBUG
+ bool "Example debug container subsystem"
+ depends on CONTAINERS
+ help
+ This option enables a simple container subsystem that
+ exports useful debugging information about the containers
+ framework
+
+ Say N if unsure
+
config CPUSETS
bool "Cpuset support"
depends on SMP && CONTAINERS
Index: container-2.6.22-rc6-mm1/kernel/Makefile
============================================================ =======
--- container-2.6.22-rc6-mm1.orig/kernel/Makefile
+++ container-2.6.22-rc6-mm1/kernel/Makefile
@@ -38,6 +38,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CONTAINERS) += container.o
+obj-$(CONFIG_CONTAINER_DEBUG) += container_debug.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_CONTAINER_CPUACCT) += cpu_acct.o
obj-$(CONFIG_IKCONFIG) += configs.o
Index: container-2.6.22-rc6-mm1/kernel/container_debug.c
============================================================ =======
--- /dev/null
+++ container-2.6.22-rc6-mm1/kernel/container_debug.c
@@ -0,0 +1,97 @@
+/*
+ * kernel/ccontainer_debug.c - Example container subsystem that
+ * exposes debug info
+ *
+ * Copyright (C) Google Inc, 2007
+ *
+ * Developed by Paul Menage (menage@google.com)
+ *
+ */
+
+#include <linux/container.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/rcupdate.h>
+
+#include <asm/atomic.h>
+
+static struct container_subsys_state *debug_create(struct container_subsys *ss,
+ struct container *cont)
+{
+ struct container_subsys_state *css = kzalloc(sizeof(*css), GFP_KERNEL);
+
+ if (!css)
+ return ERR_PTR(-ENOMEM);
+
+ return css;
+}
+
+static void debug_destroy(struct container_subsys *ss, struct container *cont)
+{
+ kfree(cont->subsys[debug_subsys_id]);
+}
+
+static u64 container_refcount_read(struct container *cont, struct cftype *cft)
+{
+ return atomic_read(&cont->count);
+}
+
+static u64 taskcount_read(struct container *cont, struct cftype *cft)
+{
+ u64 count;
+
+ container_lock();
+ count = container_task_count(cont);
+ container_unlock();
+ return count;
+}
+
+static u64 current_css_group_read(struct container *cont, struct cftype *cft)
+{
+ return (u64)(long)current->containers;
+}
+
+static u64 current_css_group_refcount_read(struct container *cont,
+ struct cftype *cft)
+{
+ u64 count;
+
+ rcu_read_lock();
+ count = atomic_read(¤t->containers->ref.refcount);
+ rcu_read_unlock();
+ return count;
+}
+
+static struct cftype files[] = {
+ {
+ .name = "container_refcount",
+ .read_uint = container_refcount_read,
+ },
+ {
+ .name = "taskcount",
+ .read_uint = taskcount_read,
+ },
+
+ {
+ .name = "current_css_group",
+ .read_uint = current_css_group_read,
+ },
+
+ {
+ .name = "current_css_group_refcount",
+ .read_uint = current_css_group_refcount_read,
+ },
+};
+
+static int debug_populate(struct container_subsys *ss, struct container *cont)
+{
+ return container_add_files(cont, ss, files, ARRAY_SIZE(files));
+}
+
+struct container_subsys debug_subsys = {
+ .name = "debug",
+ .create = debug_create,
+ .destroy = debug_destroy,
+ .populate = debug_populate,
+ .subsys_id = debug_subsys_id,
+};
--
|
|
|
[PATCH 01/10] Task Containers(V11): Basic task container framework [message #15197 is a reply to message #15191] |
Fri, 20 July 2007 18:31 |
Paul Menage
Messages: 642 Registered: September 2006
|
Senior Member |
|
|
This patch adds the main task containers framework - the container
filesystem, and the basic structures for tracking membership and
associating subsystem state objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
---
Documentation/containers.txt | 526 +++++++++++++++++
include/linux/container.h | 214 ++++++
include/linux/container_subsys.h | 10
include/linux/magic.h | 1
include/linux/sched.h | 34 +
init/Kconfig | 8
init/main.c | 3
kernel/Makefile | 1
kernel/container.c | 1199 +++++++++++++++++++++++++++++++++++++++
9 files changed, 1995 insertions(+), 1 deletion(-)
Index: container-2.6.22-rc6-mm1/Documentation/containers.txt
============================================================ =======
--- /dev/null
+++ container-2.6.22-rc6-mm1/Documentation/containers.txt
@@ -0,0 +1,526 @@
+ CONTAINERS
+ -------
+
+Written by Paul Menage <menage@google.com> based on Documentation/cpusets.txt
+
+Original copyright statements from cpusets.txt:
+Portions Copyright (C) 2004 BULL SA.
+Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
+Modified by Paul Jackson <pj@sgi.com>
+Modified by Christoph Lameter <clameter@sgi.com>
+
+CONTENTS:
+=========
+
+1. Containers
+ 1.1 What are containers ?
+ 1.2 Why are containers needed ?
+ 1.3 How are containers implemented ?
+ 1.4 What does notify_on_release do ?
+ 1.5 How do I use containers ?
+2. Usage Examples and Syntax
+ 2.1 Basic Usage
+ 2.2 Attaching processes
+3. Kernel API
+ 3.1 Overview
+ 3.2 Synchronization
+ 3.3 Subsystem API
+4. Questions
+
+1. Containers
+==========
+
+1.1 What are containers ?
+----------------------
+
+Containers provide a mechanism for aggregating/partitioning sets of
+tasks, and all their future children, into hierarchical groups with
+specialized behaviour.
+
+Definitions:
+
+A *container* associates a set of tasks with a set of parameters for one
+or more subsystems.
+
+A *subsystem* is a module that makes use of the task grouping
+facilities provided by containers to treat groups of tasks in
+particular ways. A subsystem is typically a "resource controller" that
+schedules a resource or applies per-container limits, but it may be
+anything that wants to act on a group of processes, e.g. a
+virtualization subsystem.
+
+A *hierarchy* is a set of containers arranged in a tree, such that
+every task in the system is in exactly one of the containers in the
+hierarchy, and a set of subsystems; each subsystem has system-specific
+state attached to each container in the hierarchy. Each hierarchy has
+an instance of the container virtual filesystem associated with it.
+
+At any one time there may be multiple active hierachies of task
+containers. Each hierarchy is a partition of all tasks in the system.
+
+User level code may create and destroy containers by name in an
+instance of the container virtual file system, specify and query to
+which container a task is assigned, and list the task pids assigned to
+a container. Those creations and assignments only affect the hierarchy
+associated with that instance of the container file system.
+
+On their own, the only use for containers is for simple job
+tracking. The intention is that other subsystems hook into the generic
+container support to provide new attributes for containers, such as
+accounting/limiting the resources which processes in a container can
+access. For example, cpusets (see Documentation/cpusets.txt) allows
+you to associate a set of CPUs and a set of memory nodes with the
+tasks in each container.
+
+1.2 Why are containers needed ?
+----------------------------
+
+There are multiple efforts to provide process aggregations in the
+Linux kernel, mainly for resource tracking purposes. Such efforts
+include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server
+namespaces. These all require the basic notion of a
+grouping/partitioning of processes, with newly forked processes ending
+in the same group (container) as their parent process.
+
+The kernel container patch provides the minimum essential kernel
+mechanisms required to efficiently implement such groups. It has
+minimal impact on the system fast paths, and provides hooks for
+specific subsystems such as cpusets to provide additional behaviour as
+desired.
+
+Multiple hierarchy support is provided to allow for situations where
+the division of tasks into containers is distinctly different for
+different subsystems - having parallel hierarchies allows each
+hierarchy to be a natural division of tasks, without having to handle
+complex combinations of tasks that would be present if several
+unrelated subsystems needed to be forced into the same tree of
+containers.
+
+At one extreme, each resource controller or subsystem could be in a
+separate hierarchy; at the other extreme, all subsystems
+would be attached to the same hierarchy.
+
+As an example of a scenario (originally proposed by vatsa@in.ibm.com)
+that can benefit from multiple hierarchies, consider a large
+university server with various users - students, professors, system
+tasks etc. The resource planning for this server could be along the
+following lines:
+
+ CPU : Top cpuset
+ / \
+ CPUSet1 CPUSet2
+ | |
+ (Profs) (Students)
+
+ In addition (system tasks) are attached to topcpuset (so
+ that they can run anywhere) with a limit of 20%
+
+ Memory : Professors (50%), students (30%), system (20%)
+
+ Disk : Prof (50%), students (30%), system (20%)
+
+ Network : WWW browsing (20%), Network File System (60%), others (20%)
+ / \
+ Prof (15%) students (5%)
+
+Browsers like firefox/lynx go into the WWW network class, while (k)nfsd go
+into NFS network class.
+
+At the same time firefox/lynx will share an appropriate CPU/Memory class
+depending on who launched it (prof/student).
+
+With the ability to classify tasks differently for different resources
+(by putting those resource subsystems in different hierarchies) then
+the admin can easily set up a script which receives exec notifications
+and depending on who is launching the browser he can
+
+ # echo browser_pid > /mnt/<restype>/<userclass>/tasks
+
+With only a single hierarchy, he now would potentially have to create
+a separate container for every browser launched and associate it with
+approp network and other resource class. This may lead to
+proliferation of such containers.
+
+Also lets say that the administrator would like to give enhanced network
+access temporarily to a student's browser (since it is night and the user
+wants to do online gaming :) OR give one of the students simulation
+apps enhanced CPU power,
+
+With ability to write pids directly to resource classes, its just a
+matter of :
+
+ # echo pid > /mnt/network/<new_class>/tasks
+ (after some time)
+ # echo pid > /mnt/network/<orig_class>/tasks
+
+Without this ability, he would have to split the container into
+multiple separate ones and then associate the new containers with the
+new resource classes.
+
+
+
+1.3 How are containers implemented ?
+---------------------------------
+
+Containers extends the kernel as follows:
+
+ - Each task in the system has a reference-counted pointer to a
+ css_group.
+
+ - A css_group contains a set of reference-counted pointers to
+ container_subsys_state objects, one for each container subsystem
+ registered in the system. There is no direct link from a task to
+ the container of which it's a member in each hierarchy, but this
+ can be determined by following pointers through the
+ container_subsys_state objects. This is because accessing the
+ subsystem state is something that's expected to happen frequently
+ and in performance-critical code, whereas operations that require a
+ task's actual container assignments (in particular, moving between
+ containers) are less common.
+
+ - A container hierarchy filesystem can be mounted for browsing and
+ manipulation from user space.
+
+ - You can list all the tasks (by pid) attached to any container.
+
+The implementation of containers requires a few, simple hooks
+into the rest of the kernel, none in performance critical paths:
+
+ - in init/main.c, to initialize the root containers and initial
+ css_group at system boot.
+
+ - in fork and exit, to attach and detach a task from its css_group.
+
+In addition a new file system, of type "container" may be mounted, to
+enable browsing and modifying the containers presently known to the
+kernel. When mounting a container hierarchy, you may specify a
+comma-separated list of subsystems to mount as the filesystem mount
+options. By default, mounting the container filesystem attempts to
+mount a hierarchy containing all registered subsystems.
+
+If an active hierarchy with exactly the same set of subsystems already
+exists, it will be reused for the new mount. If no existing hierarchy
+matches, and any of the requested subsystems are in use in an existing
+hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy
+is activated, associated with the requested subsystems.
+
+It's not currently possible to bind a new subsystem to an active
+container hierarchy, or to unbind a subsystem from an active container
+hierarchy. This may be possible in future, but is fraught with nasty
+error-recovery issues.
+
+When a container filesystem is unmounted, if there are any
+subcontainers created below the top-level container, that hierarchy
+will remain active even though unmounted; if there are no
+subcontainers then the hierarchy will be deactivated.
+
+No new system calls are added for containe
...
|
|
|
[PATCH 07/10] Task Containers(V11): Automatic userspace notification of idle containers [message #15198 is a reply to message #15191] |
Fri, 20 July 2007 18:31 |
Paul Menage
Messages: 642 Registered: September 2006
|
Senior Member |
|
|
This patch adds the following files to the container filesystem:
notify_on_release - configures/reports whether the container subsystem should
attempt to run a release script when this container becomes unused
release_agent - configures/reports the release agent to be used for
this hierarchy (top level in each hierarchy only)
releasable - reports whether this container would have been auto-released if
notify_on_release was true and a release agent was configured (mainly useful
for debugging)
To avoid locking issues, invoking the userspace release agent is done via a
workqueue task; containers that need to have their release agents invoked by
the workqueue task are linked on to a list.
Signed-off-by: Paul Menage <menage@google.com>
---
include/linux/container.h | 11 -
kernel/container.c | 425 +++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 393 insertions(+), 43 deletions(-)
Index: container-2.6.22-rc6-mm1/include/linux/container.h
============================================================ =======
--- container-2.6.22-rc6-mm1.orig/include/linux/container.h
+++ container-2.6.22-rc6-mm1/include/linux/container.h
@@ -77,10 +77,11 @@ static inline void css_get(struct contai
* css_get()
*/
+extern void __css_put(struct container_subsys_state *css);
static inline void css_put(struct container_subsys_state *css)
{
if (!test_bit(CSS_ROOT, &css->flags))
- atomic_dec(&css->refcnt);
+ __css_put(css);
}
struct container {
@@ -112,6 +113,13 @@ struct container {
* tasks in this container. Protected by css_group_lock
*/
struct list_head css_groups;
+
+ /*
+ * Linked list running through all containers that can
+ * potentially be reaped by the release agent. Protected by
+ * release_list_lock
+ */
+ struct list_head release_list;
};
/* A css_group is a structure holding pointers to a set of
@@ -285,7 +293,6 @@ struct task_struct *container_iter_next(
struct container_iter *it);
void container_iter_end(struct container *cont, struct container_iter *it);
-
#else /* !CONFIG_CONTAINERS */
static inline int container_init_early(void) { return 0; }
Index: container-2.6.22-rc6-mm1/kernel/container.c
============================================================ =======
--- container-2.6.22-rc6-mm1.orig/kernel/container.c
+++ container-2.6.22-rc6-mm1/kernel/container.c
@@ -44,6 +44,8 @@
#include <linux/sort.h>
#include <asm/atomic.h>
+static DEFINE_MUTEX(container_mutex);
+
/* Generate an array of container subsystem pointers */
#define SUBSYS(_x) &_x ## _subsys,
@@ -82,6 +84,13 @@ struct containerfs_root {
/* Hierarchy-specific flags */
unsigned long flags;
+
+ /* The path to use for release notifications. No locking
+ * between setting and use - so if userspace updates this
+ * while subcontainers exist, you could miss a
+ * notification. We ensure that it's always a valid
+ * NUL-terminated string */
+ char release_agent_path[PATH_MAX];
};
@@ -109,7 +118,13 @@ static int need_forkexit_callback;
/* bits in struct container flags field */
enum {
+ /* Container is dead */
CONT_REMOVED,
+ /* Container has previously had a child container or a task,
+ * but no longer (only if CONT_NOTIFY_ON_RELEASE is set) */
+ CONT_RELEASABLE,
+ /* Container requires release notifications to userspace */
+ CONT_NOTIFY_ON_RELEASE,
};
/* convenient tests for these bits */
@@ -123,6 +138,19 @@ enum {
ROOT_NOPREFIX, /* mounted subsystems have no named prefix */
};
+inline int container_is_releasable(const struct container *cont)
+{
+ const int bits =
+ (1 << CONT_RELEASABLE) |
+ (1 << CONT_NOTIFY_ON_RELEASE);
+ return (cont->flags & bits) == bits;
+}
+
+inline int notify_on_release(const struct container *cont)
+{
+ return test_bit(CONT_NOTIFY_ON_RELEASE, &cont->flags);
+}
+
/*
* for_each_subsys() allows you to iterate on each subsystem attached to
* an active hierarchy
@@ -134,6 +162,14 @@ list_for_each_entry(_ss, &_root->subsys_
#define for_each_root(_root) \
list_for_each_entry(_root, &roots, root_list)
+/* the list of containers eligible for automatic release. Protected by
+ * release_list_lock */
+static LIST_HEAD(release_list);
+static DEFINE_SPINLOCK(release_list_lock);
+static void container_release_agent(struct work_struct *work);
+static DECLARE_WORK(release_agent_work, container_release_agent);
+static void check_for_release(struct container *cont);
+
/* Link structure for associating css_group objects with containers */
struct cg_container_link {
/*
@@ -188,11 +224,8 @@ static int use_task_css_group_links;
/*
* unlink a css_group from the list and free it
*/
-static void release_css_group(struct kref *k)
+static void unlink_css_group(struct css_group *cg)
{
- struct css_group *cg = container_of(k, struct css_group, ref);
- int i;
-
write_lock(&css_group_lock);
list_del(&cg->list);
css_group_count--;
@@ -205,11 +238,39 @@ static void release_css_group(struct kre
kfree(link);
}
write_unlock(&css_group_lock);
- for (i = 0; i < CONTAINER_SUBSYS_COUNT; i++)
- atomic_dec(&cg->subsys[i]->container->count);
+}
+
+static void __release_css_group(struct kref *k, int taskexit)
+{
+ int i;
+ struct css_group *cg = container_of(k, struct css_group, ref);
+
+ unlink_css_group(cg);
+
+ rcu_read_lock();
+ for (i = 0; i < CONTAINER_SUBSYS_COUNT; i++) {
+ struct container *cont = cg->subsys[i]->container;
+ if (atomic_dec_and_test(&cont->count) &&
+ notify_on_release(cont)) {
+ if (taskexit)
+ set_bit(CONT_RELEASABLE, &cont->flags);
+ check_for_release(cont);
+ }
+ }
+ rcu_read_unlock();
kfree(cg);
}
+static void release_css_group(struct kref *k)
+{
+ __release_css_group(k, 0);
+}
+
+static void release_css_group_taskexit(struct kref *k)
+{
+ __release_css_group(k, 1);
+}
+
/*
* refcounted get/put for css_group objects
*/
@@ -223,6 +284,11 @@ static inline void put_css_group(struct
kref_put(&cg->ref, release_css_group);
}
+static inline void put_css_group_taskexit(struct css_group *cg)
+{
+ kref_put(&cg->ref, release_css_group_taskexit);
+}
+
/*
* find_existing_css_group() is a helper for
* find_css_group(), and checks to see whether an existing
@@ -464,8 +530,6 @@ static struct css_group *find_css_group(
* update of a tasks container pointer by attach_task()
*/
-static DEFINE_MUTEX(container_mutex);
-
/**
* container_lock - lock out any changes to container structures
*
@@ -524,6 +588,13 @@ static void container_diput(struct dentr
if (S_ISDIR(inode->i_mode)) {
struct container *cont = dentry->d_fsdata;
BUG_ON(!(container_is_removed(cont)));
+ /* It's possible for external users to be holding css
+ * reference counts on a container; css_put() needs to
+ * be able to access the container after decrementing
+ * the reference count in order to know if it needs to
+ * queue the container to be handled by the release
+ * agent */
+ synchronize_rcu();
kfree(cont);
}
iput(inode);
@@ -668,6 +739,8 @@ static int container_show_options(struct
seq_printf(seq, ",%s", ss->name);
if (test_bit(ROOT_NOPREFIX, &root->flags))
seq_puts(seq, ",noprefix");
+ if (strlen(root->release_agent_path))
+ seq_printf(seq, ",release_agent=%s", root->release_agent_path);
mutex_unlock(&container_mutex);
return 0;
}
@@ -675,6 +748,7 @@ static int container_show_options(struct
struct container_sb_opts {
unsigned long subsys_bits;
unsigned long flags;
+ char *release_agent;
};
/* Convert a hierarchy specifier into a bitmask of subsystems and
@@ -686,6 +760,7 @@ static int parse_containerfs_options(cha
opts->subsys_bits = 0;
opts->flags = 0;
+ opts->release_agent = NULL;
while ((token = strsep(&o, ",")) != NULL) {
if (!*token)
@@ -694,6 +769,15 @@ static int parse_containerfs_options(cha
opts->subsys_bits = (1 << CONTAINER_SUBSYS_COUNT) - 1;
} else if (!strcmp(token, "noprefix")) {
set_bit(ROOT_NOPREFIX, &opts->flags);
+ } else if (!strncmp(token, "release_agent=", 14)) {
+ /* Specifying two release agents is forbidden */
+ if (opts->release_agent)
+ return -EINVAL;
+ opts->release_agent = kzalloc(PATH_MAX, GFP_KERNEL);
+ if (!opts->release_agent)
+ return -ENOMEM;
+ strncpy(opts->release_agent, token + 14, PATH_MAX - 1);
+ opts->release_agent[PATH_MAX - 1] = 0;
} else {
struct container_subsys *ss;
int i;
@@ -743,7 +827,11 @@ static int container_remount(struct supe
if (!ret)
container_populate_dir(cont);
+ if (opts.release_agent)
+ strcpy(root->release_agent_path, opts.release_agent);
out_unlock:
+ if (opts.release_agent)
+ kfree(opts.release_agent);
mutex_unlock(&container_mutex);
mutex_unlock(&cont->dentry->d_inode->i_mutex);
return ret;
@@ -767,6 +855,7 @@ static void init_container_root(struct c
INIT_LIST_HEAD(&cont->sibling);
INIT_LIST_HEAD(&cont->children);
INIT_LIST_HEAD(&cont->css_groups);
+ INIT_LIST_HEAD(&cont->release_list);
}
static int container_test_super(struct super_block *sb, void *data)
@@ -841,8 +930,11 @@ static int container_get_sb(struct file_
/* First find the desired set of subsystems */
ret = parse_containerfs_options(data, &opts);
- if (ret)
+ if (ret) {
+ if (opts.release_agent)
+ kfree(opts.release_agent);
return ret;
+ }
root = kzalloc(sizeof(*root), GFP_KERNEL);
if (!root)
@@ -851,6 +943,10 @@ static int container_get_sb(struct file_
init_container_root(root);
root->subsys_bits = opts.subsys_bits;
root->flags = opts.flags;
+ if (opts.release_agent) {
+ strcpy(root->release_agent_path, opts.release_agent);
+ kfree(opts.release_agent);
+ }
sb = sget(fs_type, container_test_super, container_set_super, root);
@@ -1125,7 +1221,7 @@ static int attach_task(struct container
ss->attach(ss, cont, oldcont, tsk);
}
}
-
+ set_bit(CONT_RELEASABLE, &oldcont->flags);
synchronize_rcu();
put_css_group(cg);
return 0;
@@ -1175,6 +1271,9 @@ enum container_filetype {
FILE_ROOT,
FILE_DIR,
FILE_TASKLIST,
+ FILE_NOTIFY_ON_RELEASE,
+ FILE_RELEASABLE,
+ FILE_RELEASE_AGENT,
};
static ssize_t container_common_file_write(struct container *cont,
@@ -1212,6 +1311,32 @@ static ssize_t container_common_file_wri
case FILE_TASKLIST:
retval = attach_task_by_pid(cont, buffer);
break;
+ case FILE_NOTIFY_ON_RELEASE:
+ clear_bit(CONT_RELEASABLE, &cont->flags);
+ if (simple_strtoul(buffer, NULL, 10) != 0)
+ set_bit(CONT_NOTIFY_ON_RELEASE, &cont->flags);
+ else
+ clear_bit(CONT_NOTIFY_ON_RELEASE, &cont->flags);
+ break;
+ case FILE_RELEASE_AGENT:
+ {
+ struct containerfs_root *root = cont->root;
+ /* Strip trailing newline */
+ if (nbytes && (buffer[nbytes-1] == '\n')) {
+ buffer[nbytes-1] = 0;
+ }
+ if (nbytes < sizeof(root->release_agent_path)) {
+ /* We never write anything other than '\0'
+ * into the last char of release_agent_path,
+ * so it always remains a NUL-terminated
+ * string */
+ strncpy(root->release_agent_path, buffer, nbytes);
+ root->release_agent_path[nbytes] = 0;
+ } else {
+ retval = -ENOSPC;
+ }
+ break;
+ }
default:
retval = -EINVAL;
goto out2;
@@ -1252,6 +1377,49 @@ static ssize_t container_read_uint(struc
return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
}
+static ssize_t container_common_file_read(struct container *cont,
+ struct cftype *cft,
+ struct file *file,
+ char __user *buf,
+ size_t nbytes, loff_t *ppos)
+{
+ enum container_filetype type = cft->private;
+ char *page;
+ ssize_t retval = 0;
+ char *s;
+
+ if (!(page = (char *)__get_free_page(GFP_KERNEL)))
+ return -ENOMEM;
+
+ s = page;
+
+ switch (type) {
+ case FILE_RELEASE_AGENT:
+ {
+ struct containerfs_root *root;
+ size_t n;
+ mutex_lock(&container_mutex);
+ root = cont->root;
+ n = strnlen(root->release_agent_path,
+ sizeof(root->release_agent_path));
+ n = min(n, (size_t) PAGE_SIZE);
+ strncpy(s, root->release_agent_path, n);
+ mutex_unlock(&container_mutex);
+ s += n;
+ break;
+ }
+ default:
+ retval = -EINVAL;
+ goto out;
+ }
+ *s++ = '\n';
+
+ retval = simple_read_from_buffer(buf, nbytes, ppos, page, s - page);
+out:
+ free_page((unsigned long)page);
+ return retval;
+}
+
static ssize_t container_file_read(struct file *file, char __user *buf,
size_t nbytes, loff_t *ppos)
{
@@ -1667,16 +1835,49 @@ static int container_tasks_release(struc
return 0;
}
+static u64 container_read_notify_on_release(struct container *cont,
+ struct cftype *cft)
+{
+ return notify_on_release(cont);
+}
+
+static u64 container_read_releasable(struct container *cont, struct cftype *cft)
+{
+ return test_bit(CONT_RELEASABLE, &cont->flags);
+}
+
/*
* for the common functions, 'private' gives the type of file
*/
-static struct cftype cft_tasks = {
- .name = "tasks",
- .open = container_tasks_open,
- .read = container_tasks_read,
+static struct cftype files[] = {
+ {
+ .name = "tasks",
+ .open = container_tasks_open,
+ .read = container_tasks_read,
+ .write = container_common_file_write,
+ .release = container_tasks_release,
+ .private = FILE_TASKLIST,
+ },
+
+ {
+ .name = "notify_on_release",
+ .read_uint = container_read_notify_on_release,
+ .write = container_common_file_write,
+ .private = FILE_NOTIFY_ON_RELEASE,
+ },
+
+ {
+ .name = "releasable",
+ .read_uint = container_read_releasable,
+ .private = FILE_RELEASABLE,
+ }
+};
+
+static struct cftype cft_release_agent = {
+ .name = "release_agent",
+ .read = container_common_file_read,
.write = container_common_file_write,
- .release = container_tasks_release,
- .private = FILE_TASKLIST,
+ .private = FILE_RELEASE_AGENT,
};
static int container_populate_dir(struct container *cont)
@@ -1687,10 +1888,15 @@ static int container_populate_dir(struct
/* First clear out any existing files */
container_clear_directory(cont->dentry);
- err = container_add_file(cont, NULL, &cft_tasks);
+ err = container_add_files(cont, NULL, files, ARRAY_SIZE(files));
if (err < 0)
return err;
+ if (cont == cont->top_container) {
+ if ((err = container_add_file(cont, NULL, &cft_release_agent)) < 0)
+ return err;
+ }
+
for_each_subsys(cont->root, ss) {
if (ss->populate && (err = ss->populate(ss, cont)) < 0)
return err;
@@ -1747,6 +1953,7 @@ static long container_create(struct cont
INIT_LIST_HEAD(&cont->sibling);
INIT_LIST_HEAD(&cont->children);
INIT_LIST_HEAD(&cont->css_groups);
+ INIT_LIST_HEAD(&cont->release_list);
cont->parent = parent;
cont->root = parent->root;
@@ -1808,6 +2015,38 @@ static int container_mkdir(struct inode
return container_create(c_parent, dentry, mode | S_IFDIR);
}
+static inline int container_has_css_refs(struct container *cont)
+{
+ /* Check the reference count on each subsystem. Since we
+ * already established that there are no tasks in the
+ * container, if the css refcount is also 0, then there should
+ * be no outstanding references, so the subsystem is safe to
+ * destroy. We scan across all subsystems rather than using
+ * the per-hierarchy linked list of mounted subsystems since
+ * we can be called via check_for_release() with no
+ * synchronization other than RCU, and the subsystem linked
+ * list isn't RCU-safe */
+ int i;
+ for (i = 0; i < CONTAINER_SUBSYS_COUNT; i++) {
+ struct container_subsys *ss = subsys[i];
+ struct container_subsys_state *css;
+ /* Skip subsystems not in this hierarchy */
+ if (ss->root != cont->root)
+ continue;
+ css = cont->subsys[ss->subsys_id];
+ /* When called from check_for_release() it's possible
+ * that by this point the container has been removed
+ * and the css deleted. But a false-positive doesn't
+ * matter, since it can only happen if the container
+ * has been deleted and hence no longer needs the
+ * release agent to be called anyway. */
+ if (css && atomic_read(&css->refcnt)) {
+ return 1;
+ }
+ }
+ return 0;
+}
+
static int container_rmdir(struct inode *unused_dir, struct dentry *dentry)
{
struct container *cont = dentry->d_fsdata;
@@ -1816,7 +2055,6 @@ static int container_rmdir(struct inode
struct container_subsys *ss;
struct super_block *sb;
struct containerfs_root *root;
- int css_busy = 0;
/* the vfs holds both inode->i_mutex already */
@@ -1834,20 +2072,7 @@ static int container_rmdir(struct inode
root = cont->root;
sb = root->sb;
- /* Check the reference count on each subsystem. Since we
- * already established that there are no tasks in the
- * container, if the css refcount is also 0, then there should
- * be no outstanding references, so the subsystem is safe to
- * destroy */
- for_each_subsys(root, ss) {
- struct container_subsys_state *css;
- css = cont->subsys[ss->subsys_id];
- if (atomic_read(&css->refcnt)) {
- css_busy = 1;
- break;
- }
- }
- if (css_busy) {
+ if (container_has_css_refs(cont)) {
mutex_unlock(&container_mutex);
return -EBUSY;
}
@@ -1857,7 +2082,11 @@ static int container_rmdir(struct inode
ss->destroy(ss, cont);
}
+ spin_lock(&release_list_lock);
set_bit(CONT_REMOVED, &cont->flags);
+ if (!list_empty(&cont->release_list))
+ list_del(&cont->release_list);
+ spin_unlock(&release_list_lock);
/* delete my sibling from parent->children */
list_del(&cont->sibling);
spin_lock(&cont->dentry->d_lock);
@@ -1869,6 +2098,9 @@ static int container_rmdir(struct inode
dput(d);
root->number_of_containers--;
+ set_bit(CONT_RELEASABLE, &parent->flags);
+ check_for_release(parent);
+
mutex_unlock(&container_mutex);
/* Drop the active superblock reference that we took when we
* created the container */
@@ -1906,15 +2138,15 @@ static void container_init_subsys(struct
/* If this subsystem requested that it be notified with fork
* events, we should send it one now for every process in the
* system */
- if (ss->fork) {
- struct task_struct *g, *p;
+ if (ss->fork) {
+ struct task_struct *g, *p;
- read_lock(&tasklist_lock);
- do_each_thread(g, p) {
- ss->fork(ss, p);
- } while_each_thread(g, p);
- read_unlock(&tasklist_lock);
- }
+ read_lock(&tasklist_lock);
+ do_each_thread(g, p) {
+ ss->fork(ss, p);
+ } while_each_thread(g, p);
+ read_unlock(&tasklist_lock);
+ }
need_forkexit_callback |= ss->fork || ss->exit;
@@ -2241,7 +2473,7 @@ void container_exit(struct task_struct *
tsk->containers = &init_css_group;
task_unlock(tsk);
if (cg)
- put_css_group(cg);
+ put_css_group_taskexit(cg);
}
/**
@@ -2352,7 +2584,10 @@ int container_clone(struct task_struct *
out_release:
mutex_unlock(&inode->i_mutex);
+
+ mutex_lock(&container_mutex);
put_css_group(cg);
+ mutex_unlock(&container_mutex);
deactivate_super(parent->root->sb);
return ret;
}
@@ -2382,3 +2617,111 @@ int container_is_descendant(const struct
ret = (cont == target);
return ret;
}
+
+static void check_for_release(struct container *cont)
+{
+ /* All of these checks rely on RCU to keep the container
+ * structure alive */
+ if (container_is_releasable(cont) && !atomic_read(&cont->count)
+ && list_empty(&cont->children) && !container_has_css_refs(cont)) {
+ /* Container is currently removeable. If it's not
+ * already queued for a userspace notification, queue
+ * it now */
+ int need_schedule_work = 0;
+ spin_lock(&release_list_lock);
+ if (!container_is_removed(cont) &&
+ list_empty(&cont->release_list)) {
+ list_add(&cont->release_list, &release_list);
+ need_schedule_work = 1;
+ }
+ spin_unlock(&release_list_lock);
+ if (need_schedule_work)
+ schedule_work(&release_agent_work);
+ }
+}
+
+void __css_put(struct container_subsys_state *css)
+{
+ struct container *cont = css->container;
+ rcu_read_lock();
+ if (atomic_dec_and_test(&css->refcnt) && notify_on_release(cont)) {
+ set_bit(CONT_RELEASABLE, &cont->flags);
+ check_for_release(cont);
+ }
+ rcu_read_unlock();
+}
+
+/*
+ * Notify userspace when a container is released, by running the
+ * configured release agent with the name of the container (path
+ * relative to the root of container file system) as the argument.
+ *
+ * Most likely, this user command will try to rmdir this container.
+ *
+ * This races with the possibility that some other task will be
+ * attached to this container before it is removed, or that some other
+ * user task will 'mkdir' a child container of this container. That's ok.
+ * The presumed 'rmdir' will fail quietly if this container is no longer
+ * unused, and this container will be reprieved from its death sentence,
+ * to continue to serve a useful existence. Next time it's released,
+ * we will get notified again, if it still has 'notify_on_release' set.
+ *
+ * The final arg to call_usermodehelper() is UMH_WAIT_EXEC, which
+ * means only wait until the task is successfully execve()'d. The
+ * separate release agent task is forked by call_usermodehelper(),
+ * then control in this thread returns here, without waiting for the
+ * release agent task. We don't bother to wait because the caller of
+ * this routine has no use for the exit status of the release agent
+ * task, so no sense holding our caller up for that.
+ *
+ */
+
+static void container_release_agent(struct work_struct *work)
+{
+ BUG_ON(work != &release_agent_work);
+ mutex_lock(&container_mutex);
+ spin_lock(&release_list_lock);
+ while (!list_empty(&release_list)) {
+ char *argv[3], *envp[3];
+ int i;
+ char *pathbuf;
+ struct container *cont = list_entry(release_list.next,
+ struct container,
+ release_list);
+ list_del_init(&cont->release_list);
+ spin_unlock(&release_list_lock);
+ pathbuf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!pathbuf) {
+ spin_lock(&release_list_lock);
+ continue;
+ }
+
+ if (container_path(cont, pathbuf, PAGE_SIZE) < 0) {
+ kfree(pathbuf);
+ spin_lock(&release_list_lock);
+ continue;
+ }
+
+ i = 0;
+ argv[i++] = cont->root->release_agent_path;
+ argv[i++] = (char *)pathbuf;
+ argv[i] = NULL;
+
+ i = 0;
+ /* minimal command environment */
+ envp[i++] = "HOME=/";
+ envp[i++] = "PATH=/sbin:/bin:/usr/sbin:/usr/bin";
+ envp[i] = NULL;
+
+ /* Drop the lock while we invoke the usermode helper,
+ * since the exec could involve hitting disk and hence
+ * be a slow process */
+ mutex_unlock(&container_mutex);
+ call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);
+ kfree(pathbuf);
+ mutex_lock(&container_mutex);
+ spin_lock(&release_list_lock);
+ }
+ spin_unlock(&release_list_lock);
+ mutex_unlock(&container_mutex);
+}
--
|
|
|
|
|
Re: [PATCH 07/10] Task Containers(V11): Automatic userspace notification of idle containers [message #15235 is a reply to message #15198] |
Mon, 23 July 2007 17:41 |
serue
Messages: 750 Registered: February 2006
|
Senior Member |
|
|
Quoting menage@google.com (menage@google.com):
> This patch adds the following files to the container filesystem:
>
> notify_on_release - configures/reports whether the container subsystem should
> attempt to run a release script when this container becomes unused
>
> release_agent - configures/reports the release agent to be used for
> this hierarchy (top level in each hierarchy only)
>
> releasable - reports whether this container would have been auto-released if
> notify_on_release was true and a release agent was configured (mainly useful
> for debugging)
>
> To avoid locking issues, invoking the userspace release agent is done via a
> workqueue task; containers that need to have their release agents invoked by
> the workqueue task are linked on to a list.
Hi Paul,
my tree may be a bit crufty, but I had to #include <linux/kmod.h> in
order for this to compile on s390.
thanks,
-serge
> Signed-off-by: Paul Menage <menage@google.com>
> ---
>
> include/linux/container.h | 11 -
> kernel/container.c | 425 +++++++++++++++++++++++++++++++++++++++++-----
> 2 files changed, 393 insertions(+), 43 deletions(-)
>
> Index: container-2.6.22-rc6-mm1/include/linux/container.h
> ============================================================ =======
> --- container-2.6.22-rc6-mm1.orig/include/linux/container.h
> +++ container-2.6.22-rc6-mm1/include/linux/container.h
> @@ -77,10 +77,11 @@ static inline void css_get(struct contai
> * css_get()
> */
>
> +extern void __css_put(struct container_subsys_state *css);
> static inline void css_put(struct container_subsys_state *css)
> {
> if (!test_bit(CSS_ROOT, &css->flags))
> - atomic_dec(&css->refcnt);
> + __css_put(css);
> }
>
> struct container {
> @@ -112,6 +113,13 @@ struct container {
> * tasks in this container. Protected by css_group_lock
> */
> struct list_head css_groups;
> +
> + /*
> + * Linked list running through all containers that can
> + * potentially be reaped by the release agent. Protected by
> + * release_list_lock
> + */
> + struct list_head release_list;
> };
>
> /* A css_group is a structure holding pointers to a set of
> @@ -285,7 +293,6 @@ struct task_struct *container_iter_next(
> struct container_iter *it);
> void container_iter_end(struct container *cont, struct container_iter *it);
>
> -
> #else /* !CONFIG_CONTAINERS */
>
> static inline int container_init_early(void) { return 0; }
> Index: container-2.6.22-rc6-mm1/kernel/container.c
> ============================================================ =======
> --- container-2.6.22-rc6-mm1.orig/kernel/container.c
> +++ container-2.6.22-rc6-mm1/kernel/container.c
> @@ -44,6 +44,8 @@
> #include <linux/sort.h>
> #include <asm/atomic.h>
>
> +static DEFINE_MUTEX(container_mutex);
> +
> /* Generate an array of container subsystem pointers */
> #define SUBSYS(_x) &_x ## _subsys,
>
> @@ -82,6 +84,13 @@ struct containerfs_root {
>
> /* Hierarchy-specific flags */
> unsigned long flags;
> +
> + /* The path to use for release notifications. No locking
> + * between setting and use - so if userspace updates this
> + * while subcontainers exist, you could miss a
> + * notification. We ensure that it's always a valid
> + * NUL-terminated string */
> + char release_agent_path[PATH_MAX];
> };
>
>
> @@ -109,7 +118,13 @@ static int need_forkexit_callback;
>
> /* bits in struct container flags field */
> enum {
> + /* Container is dead */
> CONT_REMOVED,
> + /* Container has previously had a child container or a task,
> + * but no longer (only if CONT_NOTIFY_ON_RELEASE is set) */
> + CONT_RELEASABLE,
> + /* Container requires release notifications to userspace */
> + CONT_NOTIFY_ON_RELEASE,
> };
>
> /* convenient tests for these bits */
> @@ -123,6 +138,19 @@ enum {
> ROOT_NOPREFIX, /* mounted subsystems have no named prefix */
> };
>
> +inline int container_is_releasable(const struct container *cont)
> +{
> + const int bits =
> + (1 << CONT_RELEASABLE) |
> + (1 << CONT_NOTIFY_ON_RELEASE);
> + return (cont->flags & bits) == bits;
> +}
> +
> +inline int notify_on_release(const struct container *cont)
> +{
> + return test_bit(CONT_NOTIFY_ON_RELEASE, &cont->flags);
> +}
> +
> /*
> * for_each_subsys() allows you to iterate on each subsystem attached to
> * an active hierarchy
> @@ -134,6 +162,14 @@ list_for_each_entry(_ss, &_root->subsys_
> #define for_each_root(_root) \
> list_for_each_entry(_root, &roots, root_list)
>
> +/* the list of containers eligible for automatic release. Protected by
> + * release_list_lock */
> +static LIST_HEAD(release_list);
> +static DEFINE_SPINLOCK(release_list_lock);
> +static void container_release_agent(struct work_struct *work);
> +static DECLARE_WORK(release_agent_work, container_release_agent);
> +static void check_for_release(struct container *cont);
> +
> /* Link structure for associating css_group objects with containers */
> struct cg_container_link {
> /*
> @@ -188,11 +224,8 @@ static int use_task_css_group_links;
> /*
> * unlink a css_group from the list and free it
> */
> -static void release_css_group(struct kref *k)
> +static void unlink_css_group(struct css_group *cg)
> {
> - struct css_group *cg = container_of(k, struct css_group, ref);
> - int i;
> -
> write_lock(&css_group_lock);
> list_del(&cg->list);
> css_group_count--;
> @@ -205,11 +238,39 @@ static void release_css_group(struct kre
> kfree(link);
> }
> write_unlock(&css_group_lock);
> - for (i = 0; i < CONTAINER_SUBSYS_COUNT; i++)
> - atomic_dec(&cg->subsys[i]->container->count);
> +}
> +
> +static void __release_css_group(struct kref *k, int taskexit)
> +{
> + int i;
> + struct css_group *cg = container_of(k, struct css_group, ref);
> +
> + unlink_css_group(cg);
> +
> + rcu_read_lock();
> + for (i = 0; i < CONTAINER_SUBSYS_COUNT; i++) {
> + struct container *cont = cg->subsys[i]->container;
> + if (atomic_dec_and_test(&cont->count) &&
> + notify_on_release(cont)) {
> + if (taskexit)
> + set_bit(CONT_RELEASABLE, &cont->flags);
> + check_for_release(cont);
> + }
> + }
> + rcu_read_unlock();
> kfree(cg);
> }
>
> +static void release_css_group(struct kref *k)
> +{
> + __release_css_group(k, 0);
> +}
> +
> +static void release_css_group_taskexit(struct kref *k)
> +{
> + __release_css_group(k, 1);
> +}
> +
> /*
> * refcounted get/put for css_group objects
> */
> @@ -223,6 +284,11 @@ static inline void put_css_group(struct
> kref_put(&cg->ref, release_css_group);
> }
>
> +static inline void put_css_group_taskexit(struct css_group *cg)
> +{
> + kref_put(&cg->ref, release_css_group_taskexit);
> +}
> +
> /*
> * find_existing_css_group() is a helper for
> * find_css_group(), and checks to see whether an existing
> @@ -464,8 +530,6 @@ static struct css_group *find_css_group(
> * update of a tasks container pointer by attach_task()
> */
>
> -static DEFINE_MUTEX(container_mutex);
> -
> /**
> * container_lock - lock out any changes to container structures
> *
> @@ -524,6 +588,13 @@ static void container_diput(struct dentr
> if (S_ISDIR(inode->i_mode)) {
> struct container *cont = dentry->d_fsdata;
> BUG_ON(!(container_is_removed(cont)));
> + /* It's possible for external users to be holding css
> + * reference counts on a container; css_put() needs to
> + * be able to access the container after decrementing
> + * the reference count in order to know if it needs to
> + * queue the container to be handled by the release
> + * agent */
> + synchronize_rcu();
> kfree(cont);
> }
> iput(inode);
> @@ -668,6 +739,8 @@ static int container_show_options(struct
> seq_printf(seq, ",%s", ss->name);
> if (test_bit(ROOT_NOPREFIX, &root->flags))
> seq_puts(seq, ",noprefix");
> + if (strlen(root->release_agent_path))
> + seq_printf(seq, ",release_agent=%s", root->release_agent_path);
> mutex_unlock(&container_mutex);
> return 0;
> }
> @@ -675,6 +748,7 @@ static int container_show_options(struct
> struct container_sb_opts {
> unsigned long subsys_bits;
> unsigned long flags;
> + char *release_agent;
> };
>
> /* Convert a hierarchy specifier into a bitmask of subsystems and
> @@ -686,6 +760,7 @@ static int parse_containerfs_options(cha
>
> opts->subsys_bits = 0;
> opts->flags = 0;
> + opts->release_agent = NULL;
>
> while ((token = strsep(&o, ",")) != NULL) {
> if (!*token)
> @@ -694,6 +769,15 @@ static int parse_containerfs_options(cha
> opts->subsys_bits = (1 << CONTAINER_SUBSYS_COUNT) - 1;
> } else if (!strcmp(token, "noprefix")) {
> set_bit(ROOT_NOPREFIX, &opts->flags);
> + } else if (!strncmp(token, "release_agent=", 14)) {
> + /* Specifying two release agents is forbidden */
> + if (opts->release_agent)
> + return -EINVAL;
> + opts->release_agent = kzalloc(PATH_MAX, GFP_KERNEL);
> + if (!opts->release_agent)
> + return -ENOMEM;
> + strnc
...
|
|
|
|
Re: [PATCH 01/10] Task Containers(V11): Basic task container framework [message #19473 is a reply to message #19460] |
Sun, 29 July 2007 17:03 |
Paul Menage
Messages: 642 Registered: September 2006
|
Senior Member |
|
|
On 7/26/07, YAMAMOTO Takashi <yamamoto@valinux.co.jp> wrote:
> > +Other fields in the container_subsys object include:
>
> > +- hierarchy: an index indicating which hierarchy, if any, this
> > + subsystem is currently attached to. If this is -1, then the
> > + subsystem is not attached to any hierarchy, and all tasks should be
> > + considered to be members of the subsystem's top_container. It should
> > + be initialized to -1.
>
> stale info?
Yes, I think so. I really need to do a proper pass over
Documentation/containers.txt one more time following a bunch of recent
changes.
>
> > +struct container {
>
> > + struct containerfs_root *root;
> > + struct container *top_container;
> > +};
>
> can cont->top_container be different from than &cont->root.top_container?
No, I guess it can't. And there are only a couple of places using
cont->top_container now, so I should probably remove it.
Thanks,
Paul
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
|
|
|
|
Goto Forum:
Current Time: Fri Sep 13 02:58:59 GMT 2024
Total time taken to generate the page: 0.05081 seconds
|