OpenVZ Forum


[PATCH 11/33] task containersv11 make cpusets a client of containers [message #20427] Mon, 17 September 2007 21:03
Paul Menage
Remove the filesystem support logic from the cpusets system and make cpusets
a cgroup subsystem.

The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
passed through to the cgroup filesystem with the appropriate options to
emulate the old cpuset filesystem behaviour.
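
As a rough illustration (not a hunk from this patch), the pass-through can be
done by registering a trivial "cpuset" filesystem type whose get_sb re-issues
the mount against the cgroup filesystem. The function and option names below
are assumptions based on the behaviour described above, using the 2.6.23-era
get_sb() interface:

    static int cpuset_get_sb(struct file_system_type *fs_type,
                             int flags, const char *dev_name,
                             void *data, struct vfsmount *mnt)
    {
            struct file_system_type *cgroup_fs = get_fs_type("cgroup");
            int ret = -ENODEV;

            if (cgroup_fs) {
                    /* Re-issue the mount as "cgroup", enabling only the
                     * cpuset subsystem so the old layout appears. */
                    char opts[] = "cpuset";

                    ret = cgroup_fs->get_sb(cgroup_fs, flags, dev_name,
                                            opts, mnt);
                    put_filesystem(cgroup_fs);
            }
            return ret;
    }

    static struct file_system_type cpuset_fs_type = {
            .name   = "cpuset",
            .get_sb = cpuset_get_sb,
    };

The patch additionally arranges for /sbin/cpuset_release_agent to be used as
the release agent, matching the equivalence spelled out in the documentation
change below.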

Signed-off-by: Paul Menage <menage@google.com>
---

 Documentation/cpusets.txt        |   91 +-
 fs/proc/base.c                   |    4 
 include/linux/cgroup_subsys.h    |    6 
 include/linux/cpuset.h           |   12 
 include/linux/mempolicy.h        |   12 
 include/linux/sched.h            |    3 
 init/Kconfig                     |    7 
 kernel/cpuset.c                  | 1192 +++++------------------------
 kernel/exit.c                    |    2 
 kernel/fork.c                    |    3 
 mm/mempolicy.c                   |    2 
 11 files changed, 278 insertions(+), 1056 deletions(-)

diff -puN Documentation/cpusets.txt~task-cgroupsv11-make-cpusets-a-client-of-cgroups Documentation/cpusets.txt
--- a/Documentation/cpusets.txt~task-cgroupsv11-make-cpusets-a-client-of-cgroups
+++ a/Documentation/cpusets.txt
@@ -7,6 +7,7 @@ Written by Simon.Derr@bull.net
 Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
 Modified by Paul Jackson <pj@sgi.com>
 Modified by Christoph Lameter <clameter@sgi.com>
+Modified by Paul Menage <menage@google.com>
 
 CONTENTS:
 =========
@@ -16,10 +17,9 @@ CONTENTS:
   1.2 Why are cpusets needed ?
   1.3 How are cpusets implemented ?
   1.4 What are exclusive cpusets ?
-  1.5 What does notify_on_release do ?
-  1.6 What is memory_pressure ?
-  1.7 What is memory spread ?
-  1.8 How do I use cpusets ?
+  1.5 What is memory_pressure ?
+  1.6 What is memory spread ?
+  1.7 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
@@ -44,18 +44,19 @@ hierarchy visible in a virtual file syst
 hooks, beyond what is already present, required to manage dynamic
 job placement on large systems.
 
-Each task has a pointer to a cpuset.  Multiple tasks may reference
-the same cpuset.  Requests by a task, using the sched_setaffinity(2)
-system call to include CPUs in its CPU affinity mask, and using the
-mbind(2) and set_mempolicy(2) system calls to include Memory Nodes
-in its memory policy, are both filtered through that tasks cpuset,
-filtering out any CPUs or Memory Nodes not in that cpuset.  The
-scheduler will not schedule a task on a CPU that is not allowed in
-its cpus_allowed vector, and the kernel page allocator will not
-allocate a page on a node that is not allowed in the requesting tasks
-mems_allowed vector.
+Cpusets use the generic cgroup subsystem described in
+Documentation/cgroup.txt.
 
-User level code may create and destroy cpusets by name in the cpuset
+Requests by a task, using the sched_setaffinity(2) system call to
+include CPUs in its CPU affinity mask, and using the mbind(2) and
+set_mempolicy(2) system calls to include Memory Nodes in its memory
+policy, are both filtered through that tasks cpuset, filtering out any
+CPUs or Memory Nodes not in that cpuset.  The scheduler will not
+schedule a task on a CPU that is not allowed in its cpus_allowed
+vector, and the kernel page allocator will not allocate a page on a
+node that is not allowed in the requesting tasks mems_allowed vector.
+
+User level code may create and destroy cpusets by name in the cgroup
 virtual file system, manage the attributes and permissions of these
 cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
 specify and query to which cpuset a task is assigned, and list the
@@ -115,7 +116,7 @@ Cpusets extends these two mechanisms as 
  - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
    kernel.
  - Each task in the system is attached to a cpuset, via a pointer
-   in the task structure to a reference counted cpuset structure.
+   in the task structure to a reference counted cgroup structure.
  - Calls to sched_setaffinity are filtered to just those CPUs
    allowed in that tasks cpuset.
  - Calls to mbind and set_mempolicy are filtered to just
@@ -145,15 +146,10 @@ into the rest of the kernel, none in per
  - in page_alloc.c, to restrict memory to allowed nodes.
  - in vmscan.c, to restrict page recovery to the current cpuset.
 
-In addition a new file system, of type "cpuset" may be mounted,
-typically at /dev/cpuset, to enable browsing and modifying the cpusets
-presently known to the kernel.  No new system calls are added for
-cpusets - all support for querying and modifying cpusets is via
-this cpuset file system.
-
-Each task under /proc has an added file named 'cpuset', displaying
-the cpuset name, as the path relative to the root of the cpuset file
-system.
+You should mount the "cgroup" filesystem type in order to enable
+browsing and modifying the cpusets presently known to the kernel.  No
+new system calls are added for cpusets - all support for querying and
+modifying cpusets is via this cpuset file system.
 
 The /proc/<pid>/status file for each task has two added lines,
 displaying the tasks cpus_allowed (on which CPUs it may be scheduled)
@@ -163,16 +159,15 @@ in the format seen in the following exam
   Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
   Mems_allowed:   ffffffff,ffffffff
 
-Each cpuset is represented by a directory in the cpuset file system
-containing the following files describing that cpuset:
+Each cpuset is represented by a directory in the cgroup file system
+containing (on top of the standard cgroup files) the following
+files describing that cpuset:
 
  - cpus: list of CPUs in that cpuset
  - mems: list of Memory Nodes in that cpuset
  - memory_migrate flag: if set, move pages to cpusets nodes
  - cpu_exclusive flag: is cpu placement exclusive?
  - mem_exclusive flag: is memory placement exclusive?
- - tasks: list of tasks (by pid) attached to that cpuset
- - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
  - memory_pressure: measure of how much paging pressure in cpuset
 
 In addition, the root cpuset only has the following file:
@@ -237,21 +232,7 @@ such as requests from interrupt handlers
 outside even a mem_exclusive cpuset.
 
 
-1.5 What does notify_on_release do ?
-------------------------------------
-
-If the notify_on_release flag is enabled (1) in a cpuset, then whenever
-the last task in the cpuset leaves (exits or attaches to some other
-cpuset) and the last child cpuset of that cpuset is removed, then
-the kernel runs the command /sbin/cpuset_release_agent, supplying the
-pathname (relative to the mount point of the cpuset file system) of the
-abandoned cpuset.  This enables automatic removal of abandoned cpusets.
-The default value of notify_on_release in the root cpuset at system
-boot is disabled (0).  The default value of other cpusets at creation
-is the current value of their parents notify_on_release setting.
-
-
-1.6 What is memory_pressure ?
+1.5 What is memory_pressure ?
 -----------------------------
 The memory_pressure of a cpuset provides a simple per-cpuset metric
 of the rate that the tasks in a cpuset are attempting to free up in
@@ -308,7 +289,7 @@ the tasks in the cpuset, in units of rec
 times 1000.
 
 
-1.7 What is memory spread ?
+1.6 What is memory spread ?
 ---------------------------
 There are two boolean flag files per cpuset that control where the
 kernel allocates pages for the file system buffers and related in
@@ -379,7 +360,7 @@ data set, the memory allocation across t
 can become very uneven.
 
 
-1.8 How do I use cpusets ?
+1.7 How do I use cpusets ?
 --------------------------
 
 In order to minimize the impact of cpusets on critical kernel
@@ -469,7 +450,7 @@ than stress the kernel.
 To start a new job that is to be contained within a cpuset, the steps are:
 
  1) mkdir /dev/cpuset
- 2) mount -t cpuset none /dev/cpuset
+ 2) mount -t cgroup -ocpuset cpuset /dev/cpuset
  3) Create the new cpuset by doing mkdir's and write's (or echo's) in
     the /dev/cpuset virtual file system.
  4) Start a task that will be the "founding father" of the new job.
@@ -481,7 +462,7 @@ For example, the following sequence of c
 named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
 and then start a subshell 'sh' in that cpuset:
 
-  mount -t cpuset none /dev/cpuset
+  mount -t cgroup -ocpuset cpuset /dev/cpuset
   cd /dev/cpuset
   mkdir Charlie
   cd Charlie
@@ -513,7 +494,7 @@ Creating, modifying, using the cpusets c
 virtual filesystem.
 
 To mount it, type:
-# mount -t cpuset none /dev/cpuset
+# mount -t cgroup -o cpuset cpuset /dev/cpuset
 
 Then under /dev/cpuset you can find a tree that corresponds to the
 tree of the cpusets in the system. For instance, /dev/cpuset
@@ -556,6 +537,18 @@ To remove a cpuset, just use rmdir:
 This will fail if the cpuset is in use (has cpusets inside, or has
 processes attached).
 
+Note that for legacy reasons, the "cpuset" filesystem exists as a
+wrapper around the cgroup filesystem.
+
+The command
+
+mount -t cpuset X /dev/cpuset
+
+is equivalent to
+
+mount -t cgroup -ocpuset X /dev/cpuset
+echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
+
 2.2 Adding/removing cpus
 ------------------------
 
diff -puN fs/proc/base.c~task-cgroupsv11-make-cpusets-a-client-of-cgroups fs/proc/base.c
--- a/fs/proc/base.c~task-cgroupsv11-make-cpusets-a-client-of-cgroups
+++ a/fs/proc/base.c
@@ -2049,7 +2049,7 @@ static const struct pid_entry tgid_base_
 #ifdef CONFIG_SCHEDSTATS
 	INF("schedstat",  S_IRUGO, pid_schedstat),
 #endif
-#ifdef CONFIG_CPUSETS
+#ifdef CONFIG_PROC_PID_CPUSET
 	REG("cpuset",     S_IRUGO, cpuset),
 #endif
 #ifdef CONFIG_CGROUPS
@@ -2341,7 +2341,7 @@ static const struct pid_entry tid_base_s
 #ifdef CONFIG_SCHEDSTATS
 	INF("schedstat", S_IRUGO, pid_schedstat),
 #endif
-#ifdef CONFIG_CPUSETS
+#ifdef CONFIG_PROC_PID_CPUSET
 	REG("cpuset",    S_IRUGO,
...

Re: [PATCH 11/33] task containersv11 make cpusets a client of containers [message #21280 is a reply to message #20427] Thu, 04 October 2007 09:53
Paul Jackson
Paul M,

This snippet from the memory allocation hot path worries me a bit.

Once per memory page allocation, we go through here, needing to peek inside
the current task's cpuset to see if it has changed (its 'mems_generation'
value doesn't match the last-seen value we have stashed in the task struct).

@@ -653,20 +379,19 @@ void cpuset_update_task_memory_state(voi
 	struct task_struct *tsk = current;
 	struct cpuset *cs;
 
-	if (tsk->cpuset == &top_cpuset) {
+	if (task_cs(tsk) == &top_cpuset) {
 		/* Don't need rcu for top_cpuset.  It's never freed. */
 		my_cpusets_mem_gen = top_cpuset.mems_generation;
 	} else {
 		rcu_read_lock();
-		cs = rcu_dereference(tsk->cpuset);
-		my_cpusets_mem_gen = cs->mems_generation;
+		my_cpusets_mem_gen = task_cs(current)->mems_generation;
 		rcu_read_unlock();
 	}
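
[Editor's sketch, not part of the quoted hunk: the generation value fetched
above is then compared with the copy cached in the task struct, roughly as
follows. The helper and lock names are reproduced from memory of the
surrounding cpuset code, so treat them as approximate.]

	if (my_cpusets_mem_gen != tsk->cpuset_mems_generation) {
		/* The cpuset's memory placement changed since this task
		 * last looked: refresh tsk->mems_allowed from the cpuset
		 * and remember the new generation. */
		mutex_lock(&callback_mutex);
		task_lock(tsk);
		cs = task_cs(tsk);
		guarantee_online_mems(cs, &tsk->mems_allowed);
		tsk->cpuset_mems_generation = cs->mems_generation;
		task_unlock(tsk);
		mutex_unlock(&callback_mutex);
	}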

With this new cgroup code, the task_cs() helper was added, and is invoked
-twice- above; it deals with the fact that what used to be a single pointer
in the task struct directly to the task's cpuset is now roughly
two more dereferences and an indexing away:

    static inline struct cpuset *task_cs(struct task_struct *task)
    {
	    return container_of(task_subsys_state(task, cpuset_subsys_id),
				struct cpuset, css);
    }

    static inline struct cgroup_subsys_state *task_subsys_state(
	    struct task_struct *task, int subsys_id)
    {
	    return rcu_dereference(task->cgroups->subsys[subsys_id]);
    }


At a minimum, could you change that last added line to use 'tsk'
instead of 'current'?   This should save one instruction, as 'tsk'
will likely already be in a register.

+		my_cpusets_mem_gen = task_cs(tsk)->mems_generation;

I guess the two, rather than one, invocations of task_cs() won't matter
much, as they are on the same address, so the second invocation will
hit cache lines just found on the first invocation.

I wonder if we can save any cache line hits on this, or if there is
some way to measure whether or not this has a noticeable performance
impact.

... Probably this is all lost in the noise of the other stuff that
gets coded in the memory allocation hot path.  It would be nice to
think that it actually matters however.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
Re: [PATCH 11/33] task containersv11 make cpusets a client of containers [message #21298 is a reply to message #21280] Thu, 04 October 2007 15:16
Paul Menage
On 10/4/07, Paul Jackson <pj@sgi.com> wrote:
> Paul M,
>
> This snippet from the memory allocation hot path worries me a bit.
>
> Once per memory page allocation, we go through here, needing to peek inside
> the current task's cpuset to see if it has changed (its 'mems_generation'
> value doesn't match the last-seen value we have stashed in the task struct).
>
> @@ -653,20 +379,19 @@ void cpuset_update_task_memory_state(voi
>         struct task_struct *tsk = current;
>         struct cpuset *cs;
>
> -       if (tsk->cpuset == &top_cpuset) {
> +       if (task_cs(tsk) == &top_cpuset) {
>                 /* Don't need rcu for top_cpuset.  It's never freed. */
>                 my_cpusets_mem_gen = top_cpuset.mems_generation;
>         } else {
>                 rcu_read_lock();
> -               cs = rcu_dereference(tsk->cpuset);
> -               my_cpusets_mem_gen = cs->mems_generation;
> +               my_cpusets_mem_gen = task_cs(current)->mems_generation;
>                 rcu_read_unlock();
>         }
>
> With this new cgroup code, the task_cs() helper was added, and is invoked
> -twice- above; it deals with the fact that what used to be a single pointer
> in the task struct directly to the task's cpuset is now roughly
> two more dereferences and an indexing away:

It's two constant-indexed dereferences *in total*, compared to a
single constant-indexed dereference in the pre-cgroup case.

The cpuset pointer is found at
task->cgroups->subsys[cpuset_subsys_id], where cpuset_subsys_id is a
compile-time constant.
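
Spelled out (an illustrative expansion, not code quoted from the patch), the
lookup that task_cs() performs is:

    struct css_set *cg = tsk->cgroups;                       /* load 1 */
    struct cgroup_subsys_state *css =
            rcu_dereference(cg->subsys[cpuset_subsys_id]);   /* load 2,
                                                                constant index */
    struct cpuset *cs = container_of(css, struct cpuset, css);
                                           /* pointer arithmetic, no load */
    my_cpusets_mem_gen = cs->mems_generation;  /* same final load as before */

Since css is embedded in struct cpuset, container_of() is a compile-time
constant subtraction, so reaching the cpuset costs two dependent loads where
the old code needed one.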

>
> At a minimum, could you change that last added line to use 'tsk'
> instead of 'current'?   This should save one instruction, as 'tsk'
> will likely already be in a register.

Sounds reasonable.

>
> I wonder if we can save any cache line hits on this, or if there is
> some way to measure whether or not this has a noticeable performance
> impact.

I didn't notice any performance hit on a pure allocate/free memory
benchmark relative to non-cgroup cpusets. (There was a small
performance hit relative to not using cpusets at all, but that was to
be expected).
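
(For reference, a minimal userspace allocate/touch/free loop of the kind
referred to here could look like the sketch below - an illustration of the
approach, not the actual test that was run. Time it with time(1), once with
the task attached to a cpuset and once without.)

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Map, touch and unmap anonymous memory repeatedly so that every
     * iteration forces fresh page allocations - the path that calls
     * cpuset_update_task_memory_state(). */
    int main(void)
    {
            const size_t len = 64 << 20;            /* 64 MiB per pass */
            int i;

            for (i = 0; i < 100; i++) {
                    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                    if (p == MAP_FAILED) {
                            perror("mmap");
                            return 1;
                    }
                    memset(p, 1, len);              /* fault every page in */
                    munmap(p, len);
            }
            puts("done");
            return 0;
    }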

Paul
Re: [PATCH 11/33] task containersv11 make cpusets a client of containers [message #21306 is a reply to message #21298] Thu, 04 October 2007 17:31
Paul Jackson
Paul M wrote:
> It's two constant-indexed dereferences *in total*, compared to a
> single constant-indexed dereference in the pre-cgroup case.

OK - the C expression is longer and I didn't realize how
little difference it made in the end (in the executing code).

Good - thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
Re: [PATCH 11/33] task containersv11 make cpusets a client of containers [message #21307 is a reply to message #21298] Thu, 04 October 2007 17:32
Paul Jackson
Paul M wrote:
> I didn't notice any performance hit on a pure allocate/free memory
> benchmark relative to non-cgroup cpusets.

Good.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401