OpenVZ Forum: Devel » [RFC] Control Groups Roadmap ideas

Re: [RFC] Control Groups Roadmap ideas [message #29456 is a reply to message #29404]

Sun, 13 April 2008 16:11

serue

Quoting Balbir Singh (balbir@linux.vnet.ibm.com):
> On Fri, Apr 11, 2008 at 8:18 PM, Serge E. Hallyn <serue@us.ibm.com> wrote:
> >
> > Quoting Paul Menage (menage@google.com):
> >  > This is a list of some of the sub-projects that I'm planning for
> >  > Control Groups, or that I know others are planning on or working on.
> >  > Any comments or suggestions are welcome.
> >  >
> >  >
> >  > 1) Stateless subsystems
> >  > -----
> >  >
> >  > This was motivated by the recent "freezer" subsystem proposal, which
> >  > included a facility for sending signals to all members of a cgroup.
> >  > This wasn't specifically freezer-related, and wasn't even something
> >  > that needed particular per-cgroup state - its only state is that set
> >  > of processes, which is already tracked by cgroups. So it could
> >  > theoretically be mounted on multiple hierarchies at once, and wouldn't
> >  > need an entry in the css_set array.
> >  >
> >  > This would require a few internal plumbing changes in cgroups, in particular:
> >  >
> >  > - hashing css_set objects based on their cgroups rather than their css pointers
> >  > - allowing stateless subsystems to be in multiple hierarchies
> >  > - changing the way hierarchy ids are calculated - simply ORing
> >  > together the subsystem bits would no longer work since that could
> >  > result in duplicates
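
To illustrate the hierarchy-id point quoted above, here is a minimal sketch in Python. The bit values and the `hierarchy_id` helper are assumptions made for illustration, not the kernel's actual code. It only shows why an id built by ORing subsystem bits stops being unique once one subsystem may appear in several hierarchies.

```python
# Hypothetical subsystem bits; names and values are illustrative only.
CPU, MEMORY, SIGNAL = 1 << 0, 1 << 1, 1 << 2


def hierarchy_id(subsystems):
    """Old-style id: OR together the bits of the bound subsystems."""
    hid = 0
    for bit in subsystems:
        hid |= bit
    return hid


# Stateful subsystems: each bit lives in at most one hierarchy,
# so the OR of the bits is unique per hierarchy.
ids_stateful = [hierarchy_id([CPU]), hierarchy_id([MEMORY])]

# A stateless subsystem (SIGNAL here) could be mounted more than once,
# so two distinct hierarchies can end up with the same id.
ids_stateless = [hierarchy_id([SIGNAL]), hierarchy_id([SIGNAL])]
```

In this model `ids_stateful` holds two distinct values while `ids_stateless` holds the same value twice, which is the duplicate case the list item describes.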
> >  >
> >  > 2) More flexible binding/unbinding/rebinding
> >  > -----
> >  >
> >  > Currently you can only add/remove subsystems to a hierarchy when it
> >  > has just a single (root) cgroup. This is a bit inflexible, so I'm
> >  > planning to support:
> >  >
> >  > - adding a subsystem to an existing hierarchy by automatically
> >  > creating a subsys state object for the new subsystem for each existing
> >  > cgroup in the hierarchy and doing the appropriate
> >  > can_attach()/attach_tasks() callbacks for all tasks in the system
> >  >
> >  > - removing a subsystem from an existing hierarchy by moving all tasks
> >  > to that subsystem's root cgroup and destroying the child subsystem
> >  > state objects
> >  >
> >  > - merging two existing hierarchies that have identical cgroup trees
> >  >
> >  > - (maybe) splitting one hierarchy into two separate hierarchies
> >  >
> >  > Whether all these operations should be forced through the mount()
> >  > system call, or whether they should be done via operations on cgroup
> >  > control files, is something I've not figured out yet.
> >
> >  I'm tempted to ask what the use case is for this (I assume you have one,
> >  you don't generally introduce features for no good reason), but it
> >  doesn't sound like this would have any performance effect on the general
> >  case, so it sounds good.
> >
> >  I'd stick with mount semantics.  Just
> >         mount -t cgroup -o remount,devices,cpu none /devwh
> >  should handle all cases, no?
> >
> >
> >
> >  > 3) Subsystem dependencies
> >  > -----
> >  >
> >  > This would be a fairly simple change, essentially allowing one
> >  > subsystem to require that it only be mounted on a hierarchy when some
> >  > other subsystem was also present. The implementation would probably be
> >  > a callback that allows a subsystem to confirm whether it's prepared to
> >  > be included in a proposed hierarchy containing a specified subsystem
> >  > bitmask; it would be able to prevent the hierarchy from being created
> >  > by giving an error return. An example of a use for this would be a
> >  > swap subsystem that is mostly independent of the memory controller,
> >  > but uses the page-ownership tracking of the memory controller to
> >  > determine which cgroup to charge swap pages to. Hence it would require
> >  > that it only be mounted on a hierarchy that also included a memory
> >  > controller. The memory controller would make no such requirement by
> >  > itself, so could be used on its own without the swap controller.
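
The dependency callback described above might look roughly like the following Python sketch. The names `REQUIRES`, `can_bind` and `check_hierarchy` are hypothetical, and the real interface would be a kernel callback in C. The sketch assumes each subsystem is asked in turn whether it accepts the proposed bitmask, and the first error return vetoes the hierarchy.

```python
import errno

# Hypothetical subsystem bits.
MEMORY, SWAP, CPU = 1 << 0, 1 << 1, 1 << 2

# Hypothetical dependency table: subsystem bit -> bits it requires.
REQUIRES = {SWAP: MEMORY}


def can_bind(subsys, proposed_mask):
    """Return 0 if `subsys` accepts the proposed hierarchy, else -EINVAL."""
    needed = REQUIRES.get(subsys, 0)
    if proposed_mask & needed != needed:
        return -errno.EINVAL
    return 0


def check_hierarchy(proposed_mask):
    """Ask every subsystem in the mask; the first error vetoes the mount."""
    for bit in (MEMORY, SWAP, CPU):
        if proposed_mask & bit:
            err = can_bind(bit, proposed_mask)
            if err:
                return err
    return 0
```

Under these assumptions `check_hierarchy(MEMORY)` and `check_hierarchy(MEMORY | SWAP)` succeed, while `check_hierarchy(SWAP | CPU)` returns an error because the memory controller is missing.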
> >  >
> >  >
> >  > 4) Subsystem Inheritance
> >  > ------
> >  >
> >  > This is an idea that I've been kicking around for a while trying to
> >  > figure out whether its usefulness is worth the in-kernel complexity,
> >  > versus doing it in userspace. It comes from the idea that although
> >  > cgroups supports multiple hierarchies so that different subsystems can
> >  > see different task groupings, one of the more common uses of this is
> >  > (I believe) to support a setup where say we have separate groups A, B
> >  > and C for one resource X, but for resource Y we want a group
> >  > consisting of A+B+C. E.g. we want individual CPU limits for A, B and
> >  > C, but for disk I/O we want them all to share a common limit. This can
> >  > be done from userspace by mounting two hierarchies, one for CPU and
> >  > one for disk I/O, and creating appropriate groupings, but it could
> >  > also be done in the kernel as follows:
> >  >
> >  > - each subsystem "foo" would have a "foo.inherit" file provided by
> >  > (and handled by) cgroups in each group directory
> >  >
> >  > - setting the foo.inherit flag (i.e. writing 1 to it) would cause
> >  > tasks in that cgroup to share the "foo" subsystem state with the
> >  > parent cgroup
> >  >
> >  > - from the subsystem's point of view, it would only need to worry
> >  > about its own foo_cgroup objects  and which task was associated with
> >  > each object; the subsystem wouldn't need to care about which tasks
> >  > were part of each cgroup, and which cgroups were sharing state; that
> >  > would all be taken care of by the cgroup framework
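
As a rough model of the "foo.inherit" idea quoted above, the Python sketch below resolves a task's subsystem state by walking up the tree while the inherit flag is set. The `Cgroup` class, the `effective_state` method and the string "states" are all assumptions for illustration; nothing here is taken from an actual patch.

```python
class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.inherit = {}  # subsystem name -> bool, models "foo.inherit"
        self.state = {}    # subsystem name -> this group's own state

    def effective_state(self, subsys):
        """Walk up while the inherit flag is set; return the owning state."""
        group = self
        while group.parent is not None and group.inherit.get(subsys, False):
            group = group.parent
        return group.state[subsys]


# Groups A, B and C under a common parent, as in the quoted example:
# individual CPU state, shared disk I/O state.
parent = Cgroup("parent")
parent.state = {"cpu": "cpu:parent", "io": "io:parent"}

groups = {}
for name in ("A", "B", "C"):
    g = Cgroup(name, parent)
    g.state = {"cpu": "cpu:" + name}
    g.inherit["io"] = True  # like writing 1 to io.inherit
    groups[name] = g
```

Here each of A, B and C resolves "cpu" to its own state but resolves "io" to the parent's, so a single hierarchy gives the A+B+C grouping for one resource only.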
> >  >
> >  > I've mentioned this a couple of times on the containers list as part
> >  > of other random discussions; at one point Serge Hallyn expressed some
> >  > interest but there's not been much noise about it either way. I
> >  > figured I'd include it on this list anyway to see what people think of
> >  > it.
> >
> >  I guess I'm hoping that if libcg goes well then a userspace daemon can
> >  do all we need.  Of course the use case I envision is having a container
> >  which is locked to some amount of ram, wherein the container admin wants
> >  to lock some daemon to a subset of that ram.  If the host admin lets the
> >  container admin edit a config file (or talk to a daemon through some
> >  sock designated for the container) that will only create a child of the
> >  container's cgroup, that's probably great.
> >
> 
> I thought of doing something like this in libcg (having a daemon and a
> client socket interface), but dropped the idea later. When all
> controllers support multi-levels well, the plan is to create a
> sub-directory in the cgroup hierarchy and give subtree ownership to
> the application administrator.
> 
> >  So I'm basically being quiet until I see whether libcg will suffice.
> >
> 
> If you do have any specific requirements, we can cater to them right
> now. Please do let us know. The biggest challenge right now is getting
> a stable API.

It sounds like what you're talking about should suffice - the container
can only write to its own subdirectory, and the control files therein
should not allow the container to escape the bounds set for it, only to
partition it.
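
To make "partition but not escape" concrete, here is a minimal Python sketch of one way the bound could hold: every charge against a child group is also checked against all of its ancestors. The `Group` class and its `charge` method are hypothetical and only model the accounting, assuming the controller supports multi-level hierarchies in this way.

```python
class Group:
    def __init__(self, limit, parent=None):
        self.limit = limit
        self.usage = 0
        self.parent = parent

    def charge(self, amount):
        """Charge this group and all ancestors, or fail without side effects."""
        # First pass: check every level so a failed charge changes nothing.
        node = self
        while node is not None:
            if node.usage + amount > node.limit:
                return False
            node = node.parent
        # Second pass: commit the charge at every level.
        node = self
        while node is not None:
            node.usage += amount
            node = node.parent
        return True


# The host admin locks the container to 512 units; the container admin
# may pick any limit for a daemon, even one larger than the container's.
container = Group(limit=512)
daemon = Group(limit=1024, parent=container)

first = daemon.charge(400)   # fits under both limits
second = daemon.charge(200)  # would push the container past 512
```

In this model the oversized child limit is harmless: the second charge fails at the container level, so the daemon can never use more than the container was given.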

The only thing that worries me is how subtle it may turn out to be to
properly set up a container this way.  I.e. you'll need to
	mount --bind /etc/cgroups/mycontainer /vps/container1/etc/cgroups
before the container is off and running, and then be able to prevent
the container from mounting the host's /etc any other way.

As in so many other cases it shouldn't be too difficult with SELinux.
Otherwise, I suppose one thing you could do is to put the host's
/etc/cgroup (or really the host's /) on partitionN, mount
/etc/cgroup/container from another partitionM, and use the device
whitelist (eventually, device namespaces) to allow the container to
mount partitionM but not partitionN.

So that's the one place where kernel support might be kind of seductive,
but I suspect it would just lead to either an unsafe, an inflexible, or
just a hokey "solution".  So let's stick with libcg for now.  A daemon
can always be written on top of it if people want, and if at some point
we see a real need for kernel support we can talk about it then.

Thanks, Balbir.

> >  > 5) "procs" control file
> >  > -----
> >  >
> >  > This would be the equivalent of the "tasks" file, but acting/reporting
> >  > on entire thread groups. Not sure exactly what the read semantics
> >  > should be if a sub-thread of a process is in the cgroup, but not its
> >  > thread group leader.
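
One possible read semantics for the "procs" file quoted above is sketched below in Python: report a thread group id whenever any of its threads is in the cgroup, whether or not the leader is. This is only one of the options the open question allows, and the `procs_view` helper and the sample ids are made up for illustration.

```python
def procs_view(tasks_in_cgroup):
    """Given (tid, tgid) pairs for tasks in a cgroup, return sorted unique tgids."""
    return sorted({tgid for _tid, tgid in tasks_in_cgroup})


# Thread group 100 is fully inside the cgroup; for thread group 200 only
# the sub-thread 205 is inside, and its leader (tid 200) is elsewhere.
tasks = [(100, 100), (101, 100), (205, 200)]
procs = procs_view(tasks)
```

With this choice `procs` lists both 100 and 200, so a reader cannot tell from the file alone that the leader of 200 lives in another cgroup, which is the ambiguity the paragraph raises.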
> >  >
> >  >
> >  > 6) Statistics / binary API
> >  > ----
> >  >
> >  > Balaji Rao is working on a generic way to gather per-subsystem
> >  > statistics; it would also be interesting to construct an extensible
> >  >

...


[Message index]

		[RFC] Control Groups Roadmap ideas By: Paul Menage on Tue, 08 April 2008 21:14
		Re: [RFC] Control Groups Roadmap ideas By: Li Zefan on Wed, 09 April 2008 02:28
		Re: [RFC] Control Groups Roadmap ideas By: Paul Menage on Thu, 10 April 2008 20:10
		Re: [RFC] Control Groups Roadmap ideas By: serue on Fri, 11 April 2008 14:48
		Re: [RFC] Control Groups Roadmap ideas By: Balbir Singh on Sat, 12 April 2008 05:10
		Re: [RFC] Control Groups Roadmap ideas By: serue on Sun, 13 April 2008 16:11
		Re: [RFC] Control Groups Roadmap ideas By: Balbir Singh on Mon, 14 April 2008 14:31
		Re: [RFC] Control Groups Roadmap ideas By: Paul Menage on Mon, 14 April 2008 05:24
		Re: [RFC] Control Groups Roadmap ideas By: serue on Mon, 14 April 2008 14:11
		Re: [RFC] Control Groups Roadmap ideas By: Paul Menage on Mon, 14 April 2008 15:03
