[PATCH 0/2] resource control file system - aka containers on top of nsproxy!

Re: [PATCH 1/2] rcfs core patch [message #17627 is a reply to message #17611]
Fri, 09 March 2007 00:38
Herbert Poetzl

On Thu, Mar 08, 2007 at 01:10:24AM -0800, Paul Menage wrote:
> On 3/7/07, Eric W. Biederman <ebiederm@xmission.com> wrote:
> >
> > Please next time this kind of patch is posted add a description of
> > what is happening and why.  I have yet to see people explain why
> > this is a good idea.  Why the current semantics were chosen.
> 
> OK. I thought that the descriptions in my last patch 0/7 and
> Documentation/containers.txt gave a reasonable amount of "why", but I
> can look at adding more details.
> 
> >
> > I have a question?  What does rcfs look like if we start with
> > the code that is in the kernel?  That is start with namespaces
> > and nsproxy and just build a filesystem to display/manipulate them?
> > With the code built so it will support adding resource controllers
> > when they are ready?
> 
> There's at least one resource controller that's already in the kernel - cpusets.
> 
> > We probably want to rename this struct task_proxy....
> > And then we can rename most of the users things like:
> > dup_task_proxy, clone_task_proxy, get_task_proxy, free_task_proxy,
> > put_task_proxy, exit_task_proxy, init_task_proxy....
> 
> That could be a good start.
> 
> >
> > This extra list of nsproxy's is unneeded and a performance problem the
> > way it is used.  In general we want to talk about the individual resource
> > controllers not the nsproxy.
> 
> There's one important reason why it's needed, and highlights one of
> the ways that "resource controllers" are different from the way that
> "namespaces" have currently been used.
> 
> Currently with a namespace, you can only unshare, either by
> sys_unshare() or clone() - you can't "reshare" a namespace with some
> other task. But resource controllers tend to have the concept a lot
> more of being able to move between resource classes. If you're going
> to have an ns_proxy/container_group object that gathers together a
> group of pointers to namespaces/subsystem-states, then either:
> 
> 1) you only allow a task to reshare *all* namespaces/subsystems with
>    another task, i.e. you can update current->task_proxy to point to
>    other->task_proxy. But that restricts flexibility of movement.
>    It would be impossible to have a process that could enter, say,
>    an existing process' network namespace without also entering its
>    pid/ipc/uts namespaces and all of its resource limits.
> 
> 2) you allow a task to selectively reshare namespaces/subsystems with
>    another task, i.e. you can update current->task_proxy to point to
>    a proxy that matches your existing task_proxy in some ways and the
>    task_proxy of your destination in others. In that case a trivial
>    implementation would be to allocate a new task_proxy and copy some
>    pointers from the old task_proxy and some from the new. But then
>    whenever a task moves between different groupings it acquires a
>    new unique task_proxy. So moving a bunch of tasks between two
>    groupings, they'd all end up with unique task_proxy objects with
>    identical contents.
this is exactly what Linux-VServer does right now, and I'm
still not convinced that the nsproxy really buys us anything
compared to a number of different pointers to various spaces
(located in the task struct)
> So it would be much more space efficient to be able to locate an
> existing task_proxy with an identical set of namespace/subsystem
> pointers in that event. The linked list approach that I put in my last
> containers patch was a simple way to do that, and Vatsa's reused it
> for his patches. My intention is to replace it with a more efficient
> lookup (maybe using a hash of the desired pointers?) in a future
> patch.
IMHO that is getting quite complicated and probably very
inefficient, especially if you think of hundreds of guests
with a dozen spaces each ... and still we do not know if
the nsproxy is a real benefit, either memory or performance
wise ...
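
The find-or-create lookup that Paul describes (reuse an existing task_proxy when one with an identical set of pointers already exists) could be sketched roughly as below. This is a userspace illustration only, not code from the patches: `MAX_SUBSYS`, `proxy_find_or_create()` and `proxy_move()` are hypothetical names, the list search is linear as in the "linked list approach", and locking and releasing the old reference are left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SUBSYS 4  /* hypothetical number of namespaces/subsystems */

struct task_proxy {
    void *state[MAX_SUBSYS];   /* one pointer per namespace/subsystem */
    int refcount;              /* tasks currently using this proxy */
    struct task_proxy *next;   /* simple linked list of all proxies */
};

static struct task_proxy *proxy_list;

/* Return an existing proxy with exactly these pointers, or allocate one. */
struct task_proxy *proxy_find_or_create(void *const wanted[MAX_SUBSYS])
{
    struct task_proxy *p;

    for (p = proxy_list; p; p = p->next) {
        if (memcmp(p->state, wanted, sizeof(p->state)) == 0) {
            p->refcount++;
            return p;
        }
    }
    p = calloc(1, sizeof(*p));
    if (!p)
        return NULL;
    memcpy(p->state, wanted, sizeof(p->state));
    p->refcount = 1;
    p->next = proxy_list;
    proxy_list = p;
    return p;
}

/* Move a task to new_state in one subsystem, keeping the other pointers. */
struct task_proxy *proxy_move(const struct task_proxy *cur, int subsys,
                              void *new_state)
{
    void *wanted[MAX_SUBSYS];

    assert(subsys >= 0 && subsys < MAX_SUBSYS);
    memcpy(wanted, cur->state, sizeof(wanted));
    wanted[subsys] = new_state;
    return proxy_find_or_create(wanted);
}
```

With this shape, a bunch of tasks moving between the same two groupings end up sharing one proxy instead of each acquiring a unique copy; the cost is the search on every move, which is the overhead being questioned in the reply above.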
> > > +     void *ctlr_data[CONFIG_MAX_RC_SUBSYS];
> >
> > I still don't understand why these pointers are so abstract,
> > and why we need an array lookup into them?
> >
> 
> For the same reason that we have:
> 
> - generic notifier chains rather than having a big pile of #ifdef'd
>   calls to the various notification sites
> 
> - linker sections to define initcalls and per-cpu variables, rather
>   than hard-coding all init calls into init/main.c and having a big
>   per-cpu structure (both of which would again be full of #ifdefs)
> 
> It makes the code much more readable, and makes patches much simpler
> and less likely to stomp on one another.
> 
> OK, so my current approaches have involved an approach like notifier
> chains, i.e. have a generic list/array, and do something to all the
> objects on that array.
I'd prefer to do accounting (and limits) in a very simple
and especially performant way, and the reason for doing
so is quite simple:
 nobody actually cares about a precise accounting and
 calculating shares or partitions of whatever resource,
 all that matters is that you have a way to prevent a
 potentially hostile environment from sucking up all your
 resources (or even a single one) resulting in a DoS
so the main purpose of a resource limit (or accounting)
is to get an idea how much a certain guest uses up, not
more and not less ...
> How about a radically different approach based around the
> initcall/percpu way (linker sections)? Something like:
> 
> - each namespace or subsystem defines itself in its own code, via a
> macro such as:
> 
> struct task_subsys {
>   const char *name;
>   ...
> };
> 
> #define DECLARE_TASKGROUP_SUBSYSTEM(ss) \
>     __attribute__((__section__(".data.tasksubsys"))) \
>     struct task_subsys *ss##_ptr = &ss
> 
> 
> It would be used like:
> 
> struct task_subsys uts_ns = {
>   .name = "uts",
>   .unshare = uts_unshare,
> };
> 
> DECLARE_TASKGROUP_SUBSYSTEM(uts_ns);
> 
> ...
> 
> struct task_subsys cpuset_ss = {
>   .name = "cpuset",
>   .create = cpuset_create,
>   .attach = cpuset_attach,
> };
> 
> DECLARE_TASKGROUP_SUBSYSTEM(cpuset_ss);
> 
> At boot time, the task_proxy init code would figure out from the size
> of the task_subsys section how many pointers had to be in the
> task_proxy object (maybe add a few spares for dynamically-loaded
> modules?). The offset of the subsystem pointer within the task_subsys
> data section would also be the offset of that subsystem's
> per-task-group state within the task_proxy object, which should allow
> accesses to be pretty efficient (with macros providing user-friendly
> access to the appropriate locations in the task_proxy)
> 
> The loops in container.c in my patch that iterate over the subsys
> array to perform callbacks, and the code in nsproxy.c that performs
> the same action for each namespace type, would be replaced with
> iterations over the task_subsys data section; possibly some
> pre-processing of the various linked-in subsystems could be done to
> remove unnecessary iterations. The generic code would handle things
> like reference counting.
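
The linker-section registration described above can be tried out in userspace. The sketch below assumes an ELF target and a GNU-compatible toolchain, where the linker defines `__start_<name>` and `__stop_<name>` symbols for any section whose name is a valid C identifier; for that reason the section is called `tasksubsys` here rather than `.data.tasksubsys`. `task_subsys_count()` and `task_subsys_index()` are hypothetical helpers, not part of any posted patch.

```c
#include <stddef.h>
#include <string.h>

struct task_subsys {
    const char *name;
};

/* Place a pointer to the subsystem into the ELF section "tasksubsys". */
#define DECLARE_TASKGROUP_SUBSYSTEM(ss) \
    static struct task_subsys *ss##_ptr \
    __attribute__((used, section("tasksubsys"))) = &ss

static struct task_subsys uts_ns = { .name = "uts" };
DECLARE_TASKGROUP_SUBSYSTEM(uts_ns);

static struct task_subsys cpuset_ss = { .name = "cpuset" };
DECLARE_TASKGROUP_SUBSYSTEM(cpuset_ss);

/* Section bounds, provided by the linker (GNU ld and compatible). */
extern struct task_subsys *__start_tasksubsys[];
extern struct task_subsys *__stop_tasksubsys[];

/* Number of registered subsystems, derived from the section size. */
size_t task_subsys_count(void)
{
    return (size_t)(__stop_tasksubsys - __start_tasksubsys);
}

/* Slot of a subsystem within the section, or -1 if it is not registered. */
int task_subsys_index(const char *name)
{
    size_t i, n = task_subsys_count();

    for (i = 0; i < n; i++)
        if (strcmp(__start_tasksubsys[i]->name, name) == 0)
            return (int)i;
    return -1;
}
```

The slot returned by `task_subsys_index()` plays the role of the "offset of the subsystem pointer within the task_subsys data section", which would double as the index of that subsystem's state in the task_proxy.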
> 
> The existing unshare()/clone() interface would be a way to create a
> child "container" (for want of a better term) that shared some
> subsystem pointers with its parent and had cloned versions of others
> (perhaps only for the namespace-like subsystems?); the filesystem
> interface would allow you to create new "containers" that weren't
> explicitly associated with processes, and to move processes between
> "containers". Also, the filesystem interface would allow you to bind
> multiple subsystems together to allow easier manipulation from
> userspace, in a similar way to my current containers patch.
> 
> So in summary, it takes the concepts that resource controllers and
> namespaces share (that of grouping tasks) and unifies them, while
> not forcing them to behave exactly the same way. I can envisage some
> other per-task pointers that are generally inherited by children
> being possibly moved into this in the same way, e.g. task->user and
> task->mempolicy, if we could come up with a solution that handles
> groupings with sufficiently different lifetimes.
> 
> Thoughts?
sounds quite complicated and fragile to me ...
but I guess I have to go through that one again
before I can give a final statement ...
best,
Herbert
> 
> Paul
> _______________________________________________
> Containers mailing list
> Containers@lists.osdl.org
> https://lists.osdl.org/mailman/listinfo/containers

Re: [PATCH 1/2] rcfs core patch [message #17628 is a reply to message #17616]
Fri, 09 March 2007 00:48
Herbert Poetzl

On Thu, Mar 08, 2007 at 03:43:47PM +0530, Srivatsa Vaddagiri wrote:
> On Wed, Mar 07, 2007 at 08:12:00PM -0700, Eric W. Biederman wrote:
> > The review is still largely happening at the why level but no
> > one is addressing that yet.  So please can we have a why.
> 
> Here's a brief summary of what's happening and why. If it's not clear,
> please get back to us with specific questions.
> 
> There have been various projects attempting to provide resource
> management support in Linux, including CKRM/Resource Groups and UBC.
let me note here, once again, that you forgot Linux-VServer
which does quite non-intrusive resource management ...
> Each had its own task-grouping mechanism. 
the basic 'context' (pid space) is the grouping mechanism
we use for resource management too
> Paul Menage observed [1] that cpusets in the kernel already has a
> grouping mechanism which was working well for cpusets. He went ahead
> and generalized the grouping code in cpusets so that it could be used
> for overall resource management purpose. 
> With his patches, it is possible to even create multiple hierarchies
> of groups (see [2] on why multiple hierarchies) as follows:
do we need or even want that? IMHO the hierarchical
concept CKRM was designed with, was also the reason
for it being slow, unusable and complicated
> mount -t container -o cpuset none /dev/cpuset	<- cpuset hierarchy
> mount -t container -o mem,cpu none /dev/mem	<- memory/cpu hierarchy
> mount -t container -o disk none /dev/disk	<- disk hierarchy
> 
> In each hierarchy, you can create task groups and manipulate the
> resource parameters of each group. You can also move tasks between
> groups at run-time (see [3] on why this is required). 
> Each hierarchy is also manipulated independent of the other.          
> Paul's patches also introduced a 'struct container' in the kernel,
> which serves these key purposes:
> 
> - Task-grouping
>   'struct container' represents a task-group created in each hierarchy.
>   So every directory created under /dev/cpuset or /dev/mem above will
>   have a corresponding 'struct container' inside the kernel. All tasks
>   pointing to the same 'struct container' are considered to be part of
>   a group
> 
>   The 'struct container' in turn has pointers to resource objects which
>   store actual resource parameters for that group. In above example,
>   'struct container' created under /dev/cpuset will have a pointer to
>   'struct cpuset' while 'struct container' created under /dev/disk will
>   have pointer to 'struct disk_quota_or_whatever'.
> 
> - Maintain hierarchical information
>   The 'struct container' also keeps track of hierarchical relationship
>   between groups.
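
As a rough picture of the two roles listed above (task-grouping and hierarchy tracking), a 'struct container' could look something like the following. The field names, `MAX_SUBSYS` and both helper functions are assumptions made for illustration; the real structure in Paul's patches carries more state than this.

```c
#include <stddef.h>

#define MAX_SUBSYS 8   /* hypothetical upper bound on resource controllers */

struct container {
    struct container *parent;        /* NULL for the root of a hierarchy */
    void *subsys_state[MAX_SUBSYS];  /* e.g. a 'struct cpuset' for this group */
};

/* Depth of a group below the root of its hierarchy (the root is 0). */
int container_depth(const struct container *c)
{
    int depth = 0;

    while (c->parent) {
        depth++;
        c = c->parent;
    }
    return depth;
}

/* Non-zero if 'c' is 'ancestor' itself or lies somewhere below it. */
int container_is_descendant(const struct container *c,
                            const struct container *ancestor)
{
    for (; c; c = c->parent)
        if (c == ancestor)
            return 1;
    return 0;
}
```

Every directory created under a mounted hierarchy would correspond to one such object, and tasks pointing at the same object form one group.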
> 
> The filesystem interface in the patches essentially serves these
> purposes:
> 
> 	- Provide an interface to manipulate task-groups. This includes
> 	  creating/deleting groups, listing tasks present in a group and 
> 	  moving tasks across groups
> 
> 	- Provides an interface to manipulate the resource objects
> 	  (limits etc) pointed to by 'struct container'.
> 
> As you know, the introduction of 'struct container' was objected
> to and was felt redundant as a means to group tasks. That's where I
> took a shot at converting over Paul Menage's patch to avoid the 'struct
> container' abstraction and instead work with 'struct nsproxy'.
which IMHO isn't a step in the right direction, as
you will need to handle different nsproxies within
the same 'resource container' (see previous email)
> In the rcfs patch, each directory (in /dev/cpuset or /dev/disk) is
> associated with a 'struct nsproxy' instead. The most important need
> of the filesystem interface is not to manipulate the nsproxy objects
> directly, but to manipulate the resource objects (nsproxy->ctlr_data[]
> in the patches) which store information like limit etc.
> 
> > I have a question?  What does rcfs look like if we start with
> > the code that is in the kernel?  That is start with namespaces
> > and nsproxy and just build a filesystem to display/manipulate them?
> > With the code built so it will support adding resource controllers
> > when they are ready?
> 
> If I am not mistaken, Serge did attempt something in that direction,
> only that it was based on Paul's container patches. rcfs can no doubt
> support the same feature.
> 
> > >  	struct ipc_namespace *ipc_ns;
> > >  	struct mnt_namespace *mnt_ns;
> > >  	struct pid_namespace *pid_ns;
> > > +#ifdef CONFIG_RCFS
> > > +	struct list_head list;
> > 
> > This extra list of nsproxy's is unneeded and a performance problem the
> > way it is used.  In general we want to talk about the individual resource
> > controllers not the nsproxy.
> 
> I think if you consider the multiple hierarchy picture, the need
> becomes obvious.
> 
> Let's say that you had these hierarchies: /dev/cpuset, /dev/mem, /dev/disk
> and the various resource classes (task-groups) under them as below:
> 
> 	/dev/cpuset/C1, /dev/cpuset/C1/C11, /dev/cpuset/C2
> 	/dev/mem/M1, /dev/mem/M2, /dev/mem/M3
> 	/dev/disk/D1, /dev/disk/D2, /dev/disk/D3
> 
> The nsproxy structure basically has pointers to the resource objects in
> each of these hierarchies. 
> 
> 	nsproxy { ..., C1, M1, D1} could be one nsproxy
> 	nsproxy { ..., C1, M2, D3} could be another nsproxy and so on
> 
> So you see, because of multiple hierarchies, we can have different
> combinations of resource classes.
> 
> When we support task movement across resource classes, we need to find a
> nsproxy which has the right combination of resource classes that the
> task's nsproxy can be hooked to.
no, not necessarily, we can simply create a new one
and give it the proper resource or whatever-spaces
> That's where we need the nsproxy list. Hope this makes it clear.
> 
> > > +	void *ctlr_data[CONFIG_MAX_RC_SUBSYS];
> > 
> > I still don't understand why these pointers are so abstract,
> > and why we need an array lookup into them?
> 
> we can avoid these abstract pointers and instead have a set of pointers
> like this:
> 
> 	struct nsproxy {
> 		...
> 		struct cpu_limit *cpu;	/* cpu control namespace */
> 		struct rss_limit *rss;	/* rss control namespace */
> 		struct cpuset *cs;	/* cpuset namespace */
> 
> 	}
> 
> But that will make some code (like searching for a right nsproxy when a
> task moves across classes/groups) very awkward.
> 
> > I'm still inclined to think this should be part of /proc, instead of a purely
> > separate fs.  But I might be missing something.
> 
> A separate filesystem would give us more flexibility, such as
> implementing the multi-hierarchy support described above.
why is the filesystem approach so favored for this
kind of manipulations?
IMHO it is one of the worst interfaces I can imagine
(to move tasks between spaces and/or assign resources)
but yes, I'm aware that filesystems are 'in' nowadays
best,
Herbert
> -- 
> Regards,
> vatsa
> 
> 
> References:
> 
> 1. http://lkml.org/lkml/2006/09/20/200 
> 2. http://lkml.org/lkml/2006/11/6/95
> 3. http://lkml.org/lkml/2006/09/5/178
> 

Re: [PATCH 0/2] resource control file system - aka containers on top of nsproxy! [message #17632 is a reply to message #17619]
Fri, 09 March 2007 01:16
Herbert Poetzl

On Thu, Mar 08, 2007 at 05:00:54PM +0530, Srivatsa Vaddagiri wrote:
> On Thu, Mar 08, 2007 at 01:50:01PM +1300, Sam Vilain wrote:
> > 7. resource namespaces
> 
> It should be. Imagine giving 20% bandwidth to a user X. X wants to
> divide this bandwidth further between multi-media (10%), kernel
> compilation (5%) and rest (5%). So,
sounds quite nice, but ...
> > Is the subservient namespace's resource usage counting against ours too?
> 
> Yes, the resource usage of children should be accounted when capping
> parent resource usage.
it will require doing the accounting many times
(and limit checks of course), which in itself
might be a way to DoS the kernel by creating
more and more resource groups
> 
> > Can we dynamically alter the subservient namespace's resource
> > allocations?
> 
> Should be possible yes. That lets user X completely manage his
> allocation among whatever sub-groups he creates.
what happens if the parent changes, how is
the resource change (if it was a reduction)
propagated to the children?
e.g. your guest has 1024 file handles, now
you reduce it to 512, but the guest had two
children, both with 256 file handles each ...
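
To make the cost being discussed concrete: if children's usage counts against the parent, every charge has to walk up the hierarchy and check a limit at each level. The sketch below is a generic illustration under that assumption; `struct res_group` and `res_charge()` are hypothetical names and do not come from rcfs, CKRM, UBC or Linux-VServer.

```c
#include <stddef.h>

struct res_group {
    struct res_group *parent;  /* NULL for the top-level group */
    long usage;                /* units currently charged, children included */
    long limit;                /* maximum units allowed for this subtree */
};

/*
 * Charge n units to g and to every ancestor of g.
 * Returns 0 on success; on failure returns -1 and leaves all counters
 * unchanged. The work done is proportional to the depth of g.
 */
int res_charge(struct res_group *g, long n)
{
    struct res_group *p, *q;

    for (p = g; p; p = p->parent) {
        if (p->usage + n > p->limit)
            goto undo;
        p->usage += n;
    }
    return 0;

undo:
    /* Roll back the levels that were already charged below p. */
    for (q = g; q != p; q = q->parent)
        q->usage -= n;
    return -1;
}
```

The sketch deliberately says nothing about what should happen when a parent's limit is lowered below what its children have already been granted; that policy question is exactly the one raised above.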
> > So let's bring this back to your patches. If they are providing
> > visibility of ns_proxy, then it should be called namesfs or some
> > such.
> 
> The patches should give visibility to both nsproxy objects (by showing
> what tasks share the same nsproxy objects and letting tasks move across
> nsproxy objects if allowed) and the resource control objects pointed to
> by nsproxy (struct cpuset, struct cpu_limit, struct rss_limit etc).
the nsproxy is not really relevant, as it
is some kind of strange indirection, which
does not necessarily depict the real relations,
regardless of whether you do the re-sharing of
those nsproxies or not .. let me know if you
need examples to verify that ...
best,
Herbert
> > It doesn't really matter if processes disappear from namespace
> > aggregates, because that's what's really happening anyway. The only
> > problem is that if you try to freeze a namespace that has visibility
> > of things at this level, you might not be able to reconstruct the
> > filesystem in the same way. This may or may not be considered a
> > problem, but open filehandles and directory handles etc surviving
> > a freeze/thaw is part of what we're trying to achieve. Then again,
> > perhaps some visibility is better than none for the time being.
> > 
> > If they are restricted entirely to resource control, then don't use
> > the nsproxy directly - use the structure or structures which hang
> > off the nsproxy (or even task_struct) related to resource control.
> 
> -- 
> Regards,
> vatsa

Re: [PATCH 1/2] rcfs core patch [message #17641 is a reply to message #17628]
Fri, 09 March 2007 09:23
dev

>>There have been various projects attempting to provide resource
>>management support in Linux, including CKRM/Resource Groups and UBC.
> 
> 
> let me note here, once again, that you forgot Linux-VServer
> which does quite non-intrusive resource management ...
Herbert, would you care to send patches rather than asking others to do
something that works for you?
Looks like your main argument is non-intrusive...
"working", "secure", "flexible" are not required by people any more? :/
>> Each had its own task-grouping mechanism. 
> 
> 
> the basic 'context' (pid space) is the grouping mechanism
> we use for resource management too
> 
> 
>>Paul Menage observed [1] that cpusets in the kernel already has a
>>grouping mechanism which was working well for cpusets. He went ahead
>>and generalized the grouping code in cpusets so that it could be used
>>for overall resource management purpose. 
> 
> 
>>With his patches, it is possible to even create multiple hierarchies
>>of groups (see [2] on why multiple hierarchies) as follows:
> 
> 
> do we need or even want that? IMHO the hierarchical
> concept CKRM was designed with, was also the reason
> for it being slow, unusable and complicated
1. cpusets are hierarchical already. So hierarchy is required.
2. As was discussed on the call, controllers which are flat
   can just prohibit creation of a hierarchy on the filesystem,
   i.e. allow only a depth of 1 and continue being fast.
>>mount -t container -o cpuset none /dev/cpuset	<- cpuset hierarchy
>>mount -t container -o mem,cpu none /dev/mem	<- memory/cpu hierarchy
>>mount -t container -o disk none /dev/disk	<- disk hierarchy
>>
>>In each hierarchy, you can create task groups and manipulate the
>>resource parameters of each group. You can also move tasks between
>>groups at run-time (see [3] on why this is required). 
> 
> 
>>Each hierarchy is also manipulated independent of the other.          
> 
> 
>>Paul's patches also introduced a 'struct container' in the kernel,
>>which serves these key purposes:
>>
>>- Task-grouping
>>  'struct container' represents a task-group created in each hierarchy.
>>  So every directory created under /dev/cpuset or /dev/mem above will
>>  have a corresponding 'struct container' inside the kernel. All tasks
>>  pointing to the same 'struct container' are considered to be part of
>>  a group
>>
>>  The 'struct container' in turn has pointers to resource objects which
>>  store actual resource parameters for that group. In above example,
>>  'struct container' created under /dev/cpuset will have a pointer to
>>  'struct cpuset' while 'struct container' created under /dev/disk will
>>  have pointer to 'struct disk_quota_or_whatever'.
>>
>>- Maintain hierarchical information
>>  The 'struct container' also keeps track of hierarchical relationship
>>  between groups.
>>
>>The filesystem interface in the patches essentially serves these
>>purposes:
>>
>>	- Provide an interface to manipulate task-groups. This includes
>>	  creating/deleting groups, listing tasks present in a group and 
>>	  moving tasks across groups
>>
>>	- Provides an interface to manipulate the resource objects
>>	  (limits etc) pointed to by 'struct container'.
>>
>>As you know, the introduction of 'struct container' was objected
>>to and was felt redundant as a means to group tasks. That's where I
>>took a shot at converting over Paul Menage's patch to avoid the 'struct
>>container' abstraction and instead work with 'struct nsproxy'.
> 
> 
> which IMHO isn't a step in the right direction, as
> you will need to handle different nsproxies within
> the same 'resource container' (see previous email)
tend to agree.
Looks like Paul's original patch was on the right track.
[...]
>>A separate filesystem would give us more flexibility, such as
>>implementing the multi-hierarchy support described above.
> 
> 
> why is the filesystem approach so favored for this
> kind of manipulations?
> 
> IMHO it is one of the worst interfaces I can imagine
> (to move tasks between spaces and/or assign resources)
> but yes, I'm aware that filesystems are 'in' nowadays
I also hate the filesystem approach being used everywhere nowadays.
But, looks like there are reasons still:
1. cpusets already use fs interface.
2. each controller can have a bit of specific information/controls exported easily.
Can you suggest any other extensible/flexible interface for these?
Thanks,
Kirill

Re: [PATCH 0/2] resource control file system - aka containers on top of nsproxy! [message #17643 is a reply to message #17556]
Fri, 09 March 2007 22:09
Paul Menage

On 3/9/07, Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote:
>
> 1. What is the fundamental unit over which resource-management is
> applied? Individual tasks or individual containers?
>
>         /me thinks latter.
Yes
> In which case, it makes sense to stick
>         resource control information in the container somewhere.
Yes, that's what all my patches have been doing.
> 2. Regarding space savings, if 100 tasks are in a container (I don't know
>    what is a typical number) -and- let's say that all tasks are to share
>    the same resource allocation (which seems to be natural), then having
>    a 'struct container_group *' pointer in each task_struct seems to be not
>    very efficient (simply because we don't need that task-level granularity of
>    managing resource allocation).
I think you should re-read my patches.
Previously, each task had N pointers, one for its container in each
potential hierarchy. The container_group concept means that each task
has 1 pointer, to a set of container pointers (one per hierarchy)
shared by all tasks that have exactly the same set of containers (in
the various different hierarchies).
It doesn't give task-level granularity of resource management (unless
you create a separate container for each task), it just gives a space
saving.
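
The space argument can be made concrete with a small sketch. It assumes `N_HIERARCHIES` potential hierarchies and reduces each task to just the pointers in question; `struct task_before`, `struct task_after` and the two helper functions are illustrative names only and are not taken from the patches.

```c
#include <stddef.h>

#define N_HIERARCHIES 4   /* hypothetical number of potential hierarchies */

struct container;         /* opaque for the purposes of this sketch */

/* Before: every task carries one container pointer per hierarchy. */
struct task_before {
    struct container *container[N_HIERARCHIES];
};

/* After: tasks with the same set of containers share one group object. */
struct container_group {
    struct container *container[N_HIERARCHIES];
    int refcount;
};

struct task_after {
    struct container_group *group;
};

/* Pointer storage needed for ntasks tasks with per-task pointers. */
size_t bytes_before(size_t ntasks)
{
    return ntasks * sizeof(struct task_before);
}

/* Pointer storage for ntasks tasks spread over ngroups distinct groups. */
size_t bytes_after(size_t ntasks, size_t ngroups)
{
    return ntasks * sizeof(struct task_after)
         + ngroups * sizeof(struct container_group);
}
```

For the case from the question (100 tasks that all share the same allocation), `bytes_after(100, 1)` is well below `bytes_before(100)`; the saving shrinks as the number of distinct container combinations approaches the number of tasks.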
>
> 3. This next leads me to think that 'tasks' file in each directory doesn't make
>    sense for containers. In fact it can lend itself to error situations (by
>    administrator/script mistake) when some tasks of a container are in one
>    resource class while others are in a different class.
>
>         Instead, from a containers pov, it may be useful to write
>         a 'container id' (if such a thing exists) into the tasks file
>         which will move all the tasks of the container into
>         the new resource class. This is the same requirement we
>         discussed long back of moving all threads of a process into new
>         resource class.
I think you need to give a more concrete example and use case of what
you're trying to propose here. I don't really see what advantage
you're getting.
Paul

Re: [PATCH 1/2] rcfs core patch [message #17644 is a reply to message #17641]
Fri, 09 March 2007 13:21
Herbert Poetzl

On Fri, Mar 09, 2007 at 12:23:55PM +0300, Kirill Korotaev wrote:
>>> There have been various projects attempting to provide
>>> resource management support in Linux, including 
>>> CKRM/Resource Groups and UBC.
>> 
>> let me note here, once again, that you forgot Linux-VServer
>> which does quite non-intrusive resource management ...
> Herbert, would you care to send patches rather than asking 
> others to do something that works for you?
sorry, I'm not in the lucky position that I get paid
for sending patches to LKML, so I have to think twice
before I invest time in coding up extra patches ...
i.e. you will have to live with my comments for now
> Looks like your main argument is non-intrusive...
> "working", "secure", "flexible" are not required by 
> people any more? :/
well, Linux-VServer is "working", "secure", "flexible"
_and_ non-intrusive ... it is quite natural that less
won't work for me ... and regarding patches, there
will be a 2.2 release soon, with all the patches ...
>>> Each had its own task-grouping mechanism. 
>> the basic 'context' (pid space) is the grouping mechanism
>> we use for resource management too
>>> Paul Menage observed [1] that cpusets in the kernel already has a
>>> grouping mechanism which was working well for cpusets. He went ahead
>>> and generalized the grouping code in cpusets so that it could be
>>> used for overall resource management purpose.
>>> With his patches, it is possible to even create multiple hierarchies
>>> of groups (see [2] on why multiple hierarchies) as follows:
>> do we need or even want that? IMHO the hierarchical
>> concept CKRM was designed with, was also the reason
>> for it being slow, unusable and complicated
> 1. cpusets are hierarchical already. So hierarchy is required.
> 2. As was discussed on the call, controllers which are flat
>    can just prohibit creation of a hierarchy on the filesystem,
>    i.e. allow only a depth of 1 and continue being fast.
> 
>>> mount -t container -o cpuset none /dev/cpuset <- cpuset hierarchy
>>> mount -t container -o mem,cpu none /dev/mem	<- memory/cpu hierarchy
>>> mount -t container -o disk none /dev/disk	<- disk hierarchy
>>> 
>>> In each hierarchy, you can create task groups and manipulate the
>>> resource parameters of each group. You can also move tasks between
>>> groups at run-time (see [3] on why this is required). 
>>> Each hierarchy is also manipulated independent of the other.          
>>> Paul's patches also introduced a 'struct container' in the kernel,
>>> which serves these key purposes:
>>> 
>>> - Task-grouping
>>>   'struct container' represents a task-group created in each hierarchy.
>>>   So every directory created under /dev/cpuset or /dev/mem above will
>>>   have a corresponding 'struct container' inside the kernel. All tasks
>>>   pointing to the same 'struct container' are considered to be part of
>>>   a group
>>> 
>>>   The 'struct container' in turn has pointers to resource objects which
>>>   store actual resource parameters for that group. In above example,
>>>   'struct container' created under /dev/cpuset will have a pointer to
>>>   'struct cpuset' while 'struct container' created under /dev/disk will
>>>   have pointer to 'struct disk_quota_or_whatever'.
>>> 
>>> - Maintain hierarchical information
>>>   The 'struct container' also keeps track of hierarchical relationship
>>>   between groups.
>>> 
>>> The filesystem interface in the patches essentially serves these
>>> purposes:
>>> 
>>> 	- Provide an interface to manipulate task-groups. This includes
>>> 	  creating/deleting groups, listing tasks present in a group and 
>>> 	  moving tasks across groups
>>> 
>>> 	- Provides an interface to manipulate the resource objects
>>> 	  (limits etc) pointed to by 'struct container'.
>>> 
>>> As you know, the introduction of 'struct container' was objected
>>> to and was felt redundant as a means to group tasks. That's where I
>>> took a shot at converting over Paul Menage's patch to avoid the 'struct
>>> container' abstraction and instead work with 'struct nsproxy'.
>> which IMHO isn't a step in the right direction, as
>> you will need to handle different nsproxies within
>> the same 'resource container' (see previous email)
> tend to agree.
> Looks like Paul's original patch was on the right track.
> [...]
>>> A separate filesystem would give us more flexibility, such as
>>> implementing the multi-hierarchy support described above.
>> why is the filesystem approach so favored for this
>> kind of manipulations?
>> IMHO it is one of the worst interfaces I can imagine
>> (to move tasks between spaces and/or assign resources)
>> but yes, I'm aware that filesystems are 'in' nowadays
> I also hate the filesystem approach being used everywhere nowadays.
> But, looks like there are reasons still:
> 1. cpusets already use fs interface.
> 2. each controller can have a bit of specific 
>    information/controls exported easily.
yes, but there are certain drawbacks too, like:
 - performance of filesystem interfaces is quite bad
 - you need to do a lot to make the fs consistent for
   e.g. find and friends (regarding links and filesize)
 - you have a quite hard time to do atomic operations
   (except for the ioctl interface, which nobody likes)
 - vfs/mnt namespaces complicate the access to this
   new filesystem once you start moving around (between
   the spaces)
> Can you suggest any other extensible/flexible interface for these?
well, as you know, all current solutions use a syscall
interface to do most of the work, in the OpenVZ/Virtuozzo
case several, unassigned syscalls are used, while 
FreeVPS and Linux-VServer use a registered and versioned
(multiplexed) system call, which works quite fine for
all known purposes ...
I'm quite happy with the extensibility and flexibility
the versioned syscall interface has, the only thing I'd
change if I would redesign that interface is, that I
would add another pointer argument to eliminate 32/64bit
issues completely (i.e. use 4 args instead of the 3)
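A minimal sketch of what a versioned, multiplexed call with four arguments could look like. The command-word layout and the names `ctl_call`, `CTL_CMD`, `struct limit_v1` and `ctx_limit` are hypothetical illustrations, not the actual Linux-VServer or FreeVPS interface; the point is only that the version travels in the command word and that fixed-width fields avoid 32/64bit layout differences.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command word: category (8 bits), command (8 bits), version (16 bits). */
#define CTL_CMD(cat, cmd, ver) \
	(((uint32_t)(cat) << 24) | ((uint32_t)(cmd) << 16) | ((uint32_t)(ver) & 0xffffu))
#define CTL_CATEGORY(c) (((c) >> 24) & 0xffu)
#define CTL_COMMAND(c)  (((c) >> 16) & 0xffu)
#define CTL_VERSION(c)  ((c) & 0xffffu)

enum { CAT_LIMIT = 1, CMD_SET_LIMIT = 1 };
enum { MAX_CONTEXTS = 4, MAX_RESOURCES = 4 };

/* Fixed-width fields only, so 32-bit and 64-bit callers share one layout. */
struct limit_v1 {
	uint32_t resource;
	uint32_t reserved;
	uint64_t hard;
};

uint64_t ctx_limit[MAX_CONTEXTS][MAX_RESOURCES];

/* One entry point, four arguments: command word, context id, two pointers. */
long ctl_call(uint32_t cmd, uint32_t id, const void *data, void *extra)
{
	(void)extra;	/* the second pointer is unused by this command */

	if (CTL_CATEGORY(cmd) != CAT_LIMIT || CTL_COMMAND(cmd) != CMD_SET_LIMIT)
		return -ENOSYS;

	switch (CTL_VERSION(cmd)) {
	case 1: {
		const struct limit_v1 *l = data;

		if (l == NULL || id >= MAX_CONTEXTS || l->resource >= MAX_RESOURCES)
			return -EINVAL;
		ctx_limit[id][l->resource] = l->hard;
		return 0;
	}
	default:
		return -ENOSYS;	/* unknown version: the caller can fall back */
	}
}
```

A caller would pass something like `CTL_CMD(CAT_LIMIT, CMD_SET_LIMIT, 1)` as the command; a later revision of the structure would get version 2 and its own `case`, leaving version 1 callers untouched.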
best,
Herbert
> Thanks,
> Kirill
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: [PATCH 1/2] rcfs core patch [message #17645 is a reply to message #17640] | Fri, 09 March 2007 13:29   |  
			| 
				
				
					|  Herbert Poetzl Messages: 239
 Registered: February 2006
 | Senior Member |  |  |  
	| On Fri, Mar 09, 2007 at 12:07:27PM +0300, Kirill Korotaev wrote:
>>  nobody actually cares about a precise accounting and
>>  calculating shares or partitions of whatever resource,
>>  all that matters is that you have a way to prevent a
>>  potential hostile environment from sucking up all your
>>  resources (or even a single one) resulting in a DoS
> This is not true. People care. Reasons:
>   - resource planning
>   - fairness
>   - guarantees
let me make that a little more clear ...
_nobody_ cares whether a shared memory page is
accounted as full page or as fraction of a page
(depending on the number of guests sharing it)
as long as the accounted amount is subtracted
correctly when the page is disposed 
so there _is_ a difference between _false_
accounting (which seems to be what you are referring
to in the next paragraph) and imprecise, but
consistent accounting (which is what I was 
talking about)
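A minimal sketch of "imprecise, but consistent" accounting, assuming a hypothetical `struct shared_page` that remembers what each guest was charged. Whichever charging policy is picked (full page or a fraction), the unmap path subtracts exactly the recorded amount, so the per-guest counters return to zero. All names and the unit size are illustrative assumptions.

```c
#include <assert.h>

enum { PAGE_UNITS = 4096, MAX_GUESTS = 4 };

struct guest_acct { long charged; };

struct shared_page {
	int  sharers;			/* guests currently mapping the page */
	long charge[MAX_GUESTS];	/* amount actually charged to each guest */
};

/* Policy 1: charge the full page to every sharer. */
long charge_full(const struct shared_page *p)
{
	(void)p;
	return PAGE_UNITS;
}

/* Policy 2: charge a fraction based on the sharer count at map time. */
long charge_fraction(const struct shared_page *p)
{
	return PAGE_UNITS / (p->sharers + 1);
}

void page_map(struct shared_page *p, struct guest_acct *g, int idx,
	      long (*policy)(const struct shared_page *))
{
	long amount = policy(p);

	p->charge[idx] = amount;	/* remember what was charged ... */
	p->sharers++;
	g[idx].charged += amount;
}

void page_unmap(struct shared_page *p, struct guest_acct *g, int idx)
{
	g[idx].charged -= p->charge[idx];	/* ... and subtract exactly that */
	p->charge[idx] = 0;
	p->sharers--;
}
```

With the fractional policy the first guest pays more than the second, which is imprecise, but neither counter drifts once both have unmapped the page.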
best,
Herbert
>   What you talk is about security only. Not the above issues.
>   So good precision is required. If there is no precision at all,
>   security sucks as well and can be exploited, e.g. for CPU
>   schedulers doing an accounting based on jiffies accounting in
>   scheduler_tick() it is easy to build an application consuming
>   90% of CPU, but ~0% from scheduler POV.
> Kirill
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH 0/2] resource control file system - aka containers on top of nsproxy! [message #17665 is a reply to message #17587] | Fri, 09 March 2007 16:34   |  
			| 
				
				
					|  Srivatsa Vaddagiri Messages: 241
 Registered: August 2006
 | Senior Member |  |  |  
	| On Wed, Mar 07, 2007 at 01:20:18PM -0800, Paul Menage wrote:
> On 3/7/07, Serge E. Hallyn <serue@us.ibm.com> wrote:
> >
> >All that being said, if it were going to save space without overly
> >complicating things I'm actually not opposed to using nsproxy, but it
> 
> If space-saving is the main issue, then the latest version of my
> containers patches uses just a single pointer in the task_struct, and
> all tasks in the same set of containers (across all hierarchies) will
> share a single container_group object, which holds the actual pointers
> to container state.
Paul,
	Some more thoughts, mostly coming from the point of view of
vservers/containers/"whatever the set of tasks sharing an nsproxy is
called".
1. What is the fundamental unit over which resource-management is
applied? Individual tasks or individual containers?
	/me thinks latter. In which case, it makes sense to stick 
	resource control information in the container somewhere.
	Just like when controlling a user's resource consumption, 
	'struct user_struct' may be a natural place to put these resource 
 	limits.
2. Regarding space savings, if 100 tasks are in a container (I don't know
   what is a typical number) -and- let's say that all tasks are to share
   the same resource allocation (which seems to be natural), then having
   a 'struct container_group *' pointer in each task_struct seems to be not
   very efficient (simply because we don't need that task-level granularity of
   managing resource allocation).
3. This next leads me to think that the 'tasks' file in each directory doesn't
   make sense for containers. In fact it can lend itself to error situations (by
   administrator/script mistake) when some tasks of a container are in one
   resource class while others are in a different class.
	Instead, from a containers pov, it may be useful to write
	a 'container id' (if such a thing exists) into the tasks file
	which will move all the tasks of the container into 
	the new resource class. This is the same requirement we
	discussed long back of moving all threads of a process into new 
	resource class.
-- 
Regards,
vatsa
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH 1/2] rcfs core patch [message #17669 is a reply to message #17627] | Fri, 09 March 2007 17:57   |  
			| 
				
				
					|  Srivatsa Vaddagiri Messages: 241
 Registered: August 2006
 | Senior Member |  |  |  
	| On Fri, Mar 09, 2007 at 01:38:19AM +0100, Herbert Poetzl wrote:
> > 2) you allow a task to selectively reshare namespaces/subsystems with
> >    another task, i.e. you can update current->task_proxy to point to
> >    a proxy that matches your existing task_proxy in some ways and the
> >    task_proxy of your destination in others. In that case a trivial
> >    implementation would be to allocate a new task_proxy and copy some
> >    pointers from the old task_proxy and some from the new. But then
> >    whenever a task moves between different groupings it acquires a
> >    new unique task_proxy. So moving a bunch of tasks between two
> >    groupings, they'd all end up with unique task_proxy objects with
> >    identical contents.
> 
> this is exactly what Linux-VServer does right now, and I'm
> still not convinced that the nsproxy really buys us anything
> compared to a number of different pointers to various spaces
> (located in the task struct)
Are you saying that the current scheme of storing pointers to different
spaces (uts_ns, ipc_ns etc) in nsproxy doesn't buy anything? 
Or are you referring to storage of pointers to resource (name)spaces 
in nsproxy doesn't buy anything?
In either case, doesn't it buy speed and storage space?
> I'd prefer to do accounting (and limits) in a very simple
> and especially performant way, and the reason for doing
> so is quite simple:
Can you elaborate on the relationship between data structures used to store 
those limits to the task_struct? Does task_struct store pointers to those 
objects directly?
-- 
Regards,
vatsa
	|  |  |  
	| 
		
			| Re: [PATCH 1/2] rcfs core patch [message #17670 is a reply to message #17628] | Fri, 09 March 2007 18:14   |  
			| 
				
				
					|  Srivatsa Vaddagiri Messages: 241
 Registered: August 2006
 | Senior Member |  |  |  
	| On Fri, Mar 09, 2007 at 01:48:16AM +0100, Herbert Poetzl wrote:
> > There have been various projects attempting to provide resource
> > management support in Linux, including CKRM/Resource Groups and UBC.
> 
> let me note here, once again, that you forgot Linux-VServer
> which does quite non-intrusive resource management ...
Sorry, not intentionally. Maybe it slipped because I haven't seen much res mgmt
related patches from Linux Vserver on lkml recently. Note that I -did- talk 
about VServer at one point in past (http://lkml.org/lkml/2006/06/15/112)!
> the basic 'context' (pid space) is the grouping mechanism
> we use for resource management too
so tasks sharing the same nsproxy->pid_ns is the fundamental unit of
resource management (as far as vserver/container goes)?
> > As you know, the introduction of 'struct container' was objected
> > to and was felt redundant as a means to group tasks. That's where I
> > took a shot at converting over Paul Menage's patch to avoid 'struct
> > container' abstraction and instead work with 'struct nsproxy'.
> 
> which IMHO isn't a step in the right direction, as
> you will need to handle different nsproxies within
> the same 'resource container' (see previous email)
Isn't that made simple because of the fact that we have pointers to
namespace objects (and not actual objects themselves) in nsproxy?
I mean, all that is required to manage multiple nsproxy's
is to have the pointer to the same resource object in all of them.
In system call terms, if someone does a unshare of uts namespace, he
will get into a new nsproxy object sure (which has a pointer to the new
uts namespace) but the new nsproxy object will still be pointing to the
old resource controlling objects.
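A minimal sketch of the point above, assuming a cut-down, hypothetical `struct ns_sketch` with one namespace pointer and one resource-object pointer (the real `struct nsproxy` has more members and different helpers). Unsharing the uts namespace produces a new proxy with a fresh uts object, while the resource pointer is simply copied and its reference count bumped.

```c
#include <assert.h>
#include <stdlib.h>

struct uts_sketch { int refs; };
struct cpu_limit_sketch { int refs; int percent; };

struct ns_sketch {
	int refs;
	struct uts_sketch *uts;		/* namespace object */
	struct cpu_limit_sketch *cpu;	/* resource control object */
};

/* Returns a new proxy with a private uts object; NULL on allocation failure. */
struct ns_sketch *unshare_uts(const struct ns_sketch *old)
{
	struct ns_sketch *copy = malloc(sizeof(*copy));
	struct uts_sketch *uts = malloc(sizeof(*uts));

	if (copy == NULL || uts == NULL) {
		free(copy);
		free(uts);
		return NULL;
	}
	uts->refs = 1;
	copy->refs = 1;
	copy->uts = uts;		/* new uts namespace */
	copy->cpu = old->cpu;		/* same resource object as before */
	copy->cpu->refs++;
	return copy;
}
```

Both proxies end up pointing at the same `cpu_limit_sketch`, so a limit change is visible through either of them.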
> > When we support task movement across resource classes, we need to find a
> > nsproxy which has the right combination of resource classes that the
> > task's nsproxy can be hooked to.
> 
> no, not necessarily, we can simply create a new one
> and give it the proper resource or whatever-spaces
That would be the simplest, agreeably. But not optimal in terms of
storage?
Pls note that task-movement can be not-so-infrequent (in other words,
frequent) in context of non-container workload management.
> why is the filesystem approach so favored for this
> kind of manipulations?
> 
> IMHO it is one of the worst interfaces I can imagine
> (to move tasks between spaces and/or assign resources)
> but yes, I'm aware that filesystems are 'in' nowadays
Ease of use maybe. Scripts can be more readily used with a fs-based
interface.
-- 
Regards,
vatsa
	|  |  |  
	| 
		
			| Re: [PATCH 0/2] resource control file system - aka containers on top of nsproxy! [message #17672 is a reply to message #17632] | Fri, 09 March 2007 18:41   |  
			| 
				
				
					|  Srivatsa Vaddagiri Messages: 241
 Registered: August 2006
 | Senior Member |  |  |  
	| On Fri, Mar 09, 2007 at 02:16:08AM +0100, Herbert Poetzl wrote:
> On Thu, Mar 08, 2007 at 05:00:54PM +0530, Srivatsa Vaddagiri wrote:
> > On Thu, Mar 08, 2007 at 01:50:01PM +1300, Sam Vilain wrote:
> > > 7. resource namespaces
> > 
> > It should be. Imagine giving 20% bandwidth to a user X. X wants to
> > divide this bandwidth further between multi-media (10%), kernel
> > compilation (5%) and rest (5%). So,
> 
> sounds quite nice, but ...
> 
> > > Is the subservient namespace's resource usage counting against ours too?
> > 
> > Yes, the resource usage of children should be accounted when capping
> > parent resource usage.
> 
> it will require to do accounting many times
> (and limit checks of course), which in itself
> might be a way to DoS the kernel by creating
> more and more resource groups
I was only pointing out the usefulness of the feature and not
necessarily saying it -should- be implemented! Of course I understand it
will make the controller complicated, and that's probably why none of the
resource controllers we are seeing posted on lkml support hierarchical
res mgmt.
> > > Can we dynamically alter the subservient namespace's resource
> > > allocations?
> > 
> > Should be possible yes. That lets user X completely manage his
> > allocation among whatever sub-groups he creates.
> 
> what happens if the parent changes, how is
> the resource change (if it was a reduction)
> propagated to the children?
>
> e.g. your guest has 1024 file handles, now
> you reduce it to 512, but the guest had two
> children, both with 256 file handles each ...
I believe CKRM handled this quite neatly (by defining child shares to be
relative to parent shares). 
In your example, 256+256 add up to 512 which is within the parent's new limit, 
so nothing happens :) You also picked an example of exhaustible/non-reclaimable 
resource, which makes it hard to define what should happen if parent's limit 
goes below 512. Either nothing happens or perhaps a task is killed,
don't know. In case of memory, I would say that some of child's pages may 
get kicked out and in case of cpu, child will start getting fewer
cycles.
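A minimal sketch of "child shares relative to parent shares", using a hypothetical `struct share_group`; this illustrates the idea only and is not CKRM's actual data structure or API. A child stores a percentage rather than an absolute number, so its effective limit follows the parent automatically.

```c
#include <assert.h>
#include <stddef.h>

struct share_group {
	const struct share_group *parent;
	long limit;	/* absolute limit, meaningful only for the root */
	int  share;	/* percent of the parent's effective limit */
};

/* Walk up to the root and scale by each share on the way back down. */
long effective_limit(const struct share_group *g)
{
	if (g->parent == NULL)
		return g->limit;
	return effective_limit(g->parent) * g->share / 100;
}
```

With a guest limited to 1024 file handles and two children at 25% each, every child gets 256; reducing the guest to 512 turns that into 128 per child without touching the children's settings.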
> > The patches should give visibility to both nsproxy objects (by showing
> > what tasks share the same nsproxy objects and letting tasks move across
> > nsproxy objects if allowed) and the resource control objects pointed to
> > by nsproxy (struct cpuset, struct cpu_limit, struct rss_limit etc).
> 
> the nsproxy is not really relevant, as it
> is some kind of strange indirection, which
> does not necessarily depict the real relations,
> regardless whether you do the re-sharing of
> those nsproxies or not .. 
So what are you recommending we do instead? My thought was whatever is
the fundamental unit to which resource management needs to be applied,
let's store resource parameters (or pointers to them) there (rather than 
duplicating the information in each task_struct).
-- 
Regards,
vatsa
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH 1/2] rcfs core patch [message #17674 is a reply to message #17670] | Sat, 10 March 2007 00:56   |  
			| 
				
				
					|  Herbert Poetzl Messages: 239
 Registered: February 2006
 | Senior Member |  |  |  
	| On Fri, Mar 09, 2007 at 11:44:22PM +0530, Srivatsa Vaddagiri wrote:
> On Fri, Mar 09, 2007 at 01:48:16AM +0100, Herbert Poetzl wrote:
> > > There have been various projects attempting to provide resource
> > > management support in Linux, including CKRM/Resource Groups and UBC.
> > let me note here, once again, that you forgot Linux-VServer
> > which does quite non-intrusive resource management ...
> Sorry, not intentionally. Maybe it slipped because I haven't
> seen much res mgmt related patches from Linux Vserver on 
> lkml recently.
mainly because I got the impression that we planned
to work on the various spaces first, and handle things
like resource management later .. but it seems that
resource management is now in focus, while the spaces
got somewhat delayed ...
> Note that I -did- talk about VServer at one point in past
> (http://lkml.org/lkml/2006/06/15/112)!
noted and appreciated (although this was about CPU
resources, which IMHO is a special resource like
the networking, as you are mostly interested in
'bandwidth' limitations there, not in resource
limits per se; and of course, it wasn't even cited
correctly, as it is Linux-VServer not vserver ...)
> > the basic 'context' (pid space) is the grouping mechanism
> > we use for resource management too
> so tasks sharing the same nsproxy->pid_ns is the fundamental
> unit of resource management (as far as vserver/container goes)?
we currently have a 'process' context, which holds
the administrative data (capabilities and flags) and
the resource accounting and limits, which basically
contains the pid namespace, so yes and no
it contains a reference to the 'main' nsproxy, which
is used to copy spaces from when you enter the guest
(or some set of spaces), and it defines the unit we
consider a process container
> > > As you know, the introduction of 'struct container' was objected
> > > to and was felt redundant as a means to group tasks. That's where
> > > I took a shot at converting over Paul Menage's patch to avoid
> > > 'struct container' abstraction and instead work with 'struct
> > > nsproxy'.
> > 
> > which IMHO isn't a step in the right direction, as
> > you will need to handle different nsproxies within
> > the same 'resource container' (see previous email)
> 
> Isn't that made simple because of the fact that we have pointers to
> namespace objects (and not actual objects themselves) in nsproxy?
> 
> I mean, all that is required to manage multiple nsproxy's
> is to have the pointer to the same resource object in all of them.
> 
> In system call terms, if someone does a unshare of uts namespace, 
> he will get into a new nsproxy object sure (which has a pointer to the
> new uts namespace) but the new nsproxy object will still be pointing
> to the old resource controlling objects.
yes, that is why I agreed, that the container (or
resource limit/accounting/controlling object) can
be seen as space too (and handled like that)
> > > When we support task movement across resource classes, we need to
> > > find a nsproxy which has the right combination of resource classes
> > > that the task's nsproxy can be hooked to.
> > 
> > no, not necessarily, we can simply create a new one
> > and give it the proper resource or whatever-spaces
> 
> That would be the simplest, agreeably. But not optimal in terms of
> storage?
> 
> Pls note that task-movement can be not-so-infrequent 
> (in other words, frequent) in context of non-container workload 
> management.
not only there, also with solutions like Linux-VServer
(it is quite common to enter guests or subsets of the
space mix assigned)
> > why is the filesystem approach so favored for this
> > kind of manipulations?
> > 
> > IMHO it is one of the worst interfaces I can imagine
> > (to move tasks between spaces and/or assign resources)
> > but yes, I'm aware that filesystems are 'in' nowadays
> 
> Ease of use maybe. Scripts can be more readily used with a fs-based
> interface.
correct, but what about security and/or atomicity?
i.e. how to assure that some action really was 
taken and/or how to wait for completion?
sure, all this _can_ be done, no doubt, but it
is much harder to do with a fs based interface than
with e.g. a syscall interface ...
> -- 
> Regards,
> vatsa
	|  |  |  
	| 
		
			| Re: [PATCH 1/2] rcfs core patch [message #17675 is a reply to message #17638] | Sat, 10 March 2007 01:00   |  
			| 
				
				
					|  Herbert Poetzl Messages: 239
 Registered: February 2006
 | Senior Member |  |  |  
	| On Fri, Mar 09, 2007 at 11:25:47AM -0800, Paul Jackson wrote:
> > Ease of use maybe. Scripts can be more readily used with a fs-based
> > interface.
> 
> And, as I might have already stated, file system API's are a natural
> fit for hierarchically shaped data, especially if the nodes in the
> hierarchy would benefit from file system like permission attributes.
personally, I'd prefer to avoid hierarchical
structures wherever possible, because they tend
to make processing and checks a lot more complicated
than necessary, and if we really want hierarchical
structures, it might be more than sufficient to
keep the hierarchy in userspace, and use a flat
representation inside the kernel ...
but hey, I'm all for running a hypervisor under
a hypervisor running inside a hypervisor :)
best,
Herbert
> -- 
>                   I won't rest till it's the best ...
>                   Programmer, Linux Scalability
>                   Paul Jackson <pj@sgi.com> 1.925.600.0401
	|  |  |  
	| 
		
			| Re: [PATCH 1/2] rcfs core patch [message #17676 is a reply to message #17669] | Sat, 10 March 2007 01:19   |  
			| 
				
				
					|  Herbert Poetzl Messages: 239
 Registered: February 2006
 | Senior Member |  |  |  
	| On Fri, Mar 09, 2007 at 11:27:07PM +0530, Srivatsa Vaddagiri wrote:
> On Fri, Mar 09, 2007 at 01:38:19AM +0100, Herbert Poetzl wrote:
> > > 2) you allow a task to selectively reshare namespaces/subsystems with
> > >    another task, i.e. you can update current->task_proxy to point to
> > >    a proxy that matches your existing task_proxy in some ways and the
> > >    task_proxy of your destination in others. In that case a trivial
> > >    implementation would be to allocate a new task_proxy and copy some
> > >    pointers from the old task_proxy and some from the new. But then
> > >    whenever a task moves between different groupings it acquires a
> > >    new unique task_proxy. So moving a bunch of tasks between two
> > >    groupings, they'd all end up with unique task_proxy objects with
> > >    identical contents.
> > this is exactly what Linux-VServer does right now, and I'm
> > still not convinced that the nsproxy really buys us anything
> > compared to a number of different pointers to various spaces
> > (located in the task struct)
> Are you saying that the current scheme of storing pointers to
> different spaces (uts_ns, ipc_ns etc) in nsproxy doesn't buy
> anything?
> Or are you referring to storage of pointers to resource 
> (name)spaces in nsproxy doesn't buy anything?
> In either case, doesn't it buy speed and storage space?
let's do a few examples here, just to illustrate the
advantages and disadvantages of nsproxy as separate
structure over nsproxy as part of the task_struct
1) typical setup, 100 guests as shell servers, 5
   tasks each when unused, 10 tasks when used, 10%
   used on average
   a) separate nsproxy, we need at least 100
      structs to handle that (saves some space)
      we might end up with ~500 nsproxies, if
      the shell clones a new namespace (so might
      not save that much space)
      we do a single inc/dec when the nsproxy
      is reused, but do the full N inc/dec when
      we have to copy an nsproxy (might save
      some refcounting)
      we need to do the indirection step, from
      task to nsproxy to space (and data)
   b) we have ~600 tasks with 600 times the
      nsproxy data (uses up some more space)
      we have to do the full N inc/dec when
      we create a new task (more refcounting)
      we do not need to do the indirection, we
      access spaces directly from the 'hot'
      task struct (makes hot paths quite fast)
   so basically we trade a little more space and
   overhead on task creation for having no 
   indirection to the data accessed quite often
   throughout the task's life (hopefully)
2) context migration: for whatever reason, we decide
   to migrate a task into a subset (space mix) of a
   context 1000 times
   a) separate nsproxy, we need to create a new one
      consisting of the 'new' mix, which will
      - allocate the nsproxy struct
      - inc refcounts to all copied spaces
      - inc refcount nsproxy and assign to task
      - dec refcount existing task nsproxy
      after task completion
      - dec nsproxy refcount
      - dec refcounts for all spaces      
      - free up nsproxy struct
   b) nsproxy data in task struct
      - inc/dec refcounts to changed spaces
      after task completion
      - dec refcounts to spaces
   so here we gain nothing with the nsproxy, unless
   the chosen subset is identical to the one already
   used, where we end up with a single refcount 
   instead of N 
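A minimal sketch that tallies the operations listed in example 2 for a single migration. The `struct op_cost` type and both functions are hypothetical bookkeeping helpers that only count steps as enumerated above; they do not model cache effects or the case where an identical nsproxy can be reused.

```c
#include <assert.h>

struct op_cost {
	int allocs;		/* structure allocations (each freed later) */
	int refcount_ops;	/* refcount inc/dec operations */
};

/* Case a): separate nsproxy, new space mix built for the migrated task. */
struct op_cost migrate_separate_nsproxy(int nspaces)
{
	struct op_cost c = { 0, 0 };

	c.allocs       += 1;		/* allocate the nsproxy struct */
	c.refcount_ops += nspaces;	/* inc refcounts of all copied spaces */
	c.refcount_ops += 1;		/* inc new nsproxy, assign to task */
	c.refcount_ops += 1;		/* dec the task's previous nsproxy */
	/* after task completion */
	c.refcount_ops += 1;		/* dec nsproxy refcount */
	c.refcount_ops += nspaces;	/* dec refcounts of all spaces */
	return c;
}

/* Case b): space pointers live directly in the task struct. */
struct op_cost migrate_embedded(int nspaces, int changed)
{
	struct op_cost c = { 0, 0 };

	c.refcount_ops += 2 * changed;	/* inc new + dec old per changed space */
	/* after task completion */
	c.refcount_ops += nspaces;	/* dec refcounts of all spaces */
	return c;
}
```

Assuming five spaces of which two change, the tally is 13 refcount operations plus one allocation for case a) against 9 refcount operations and no allocation for case b).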
> > I'd prefer to do accounting (and limits) in a very simple
> > and especially performant way, and the reason for doing
> > so is quite simple:
> Can you elaborate on the relationship between data structures
> used to store those limits to the task_struct?                                
sure it is one to many, i.e. each task points to
exactly one context struct, while a context can
consist of zero, one or many tasks (no back- 
pointers there)
> Does task_struct store pointers to those objects directly?
it contains a single pointer to the context struct, 
and that contains (as a substruct) the accounting
and limit information
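A minimal sketch of the relationship just described, with hypothetical names (`struct context`, `struct ctx_limits`, `struct ctx_task`, `ctx_charge`); the actual Linux-VServer structures differ in detail. Each task holds a single pointer to its context, and the accounting and limits are an embedded substruct of that context, so every task of a guest charges the same counters.

```c
#include <assert.h>
#include <stdbool.h>

enum { RES_FILES, RES_PROCS, RES_MAX };

struct ctx_limits {
	long cur[RES_MAX];	/* current usage */
	long max[RES_MAX];	/* hard limits */
};

struct context {
	int id;
	struct ctx_limits limits;	/* embedded, no extra indirection */
};

struct ctx_task {
	struct context *ctx;		/* one pointer per task, no back-pointers */
};

/* Charge 'amount' of resource 'res' to the task's context, if it still fits. */
bool ctx_charge(struct ctx_task *t, int res, long amount)
{
	struct ctx_limits *l = &t->ctx->limits;

	if (l->cur[res] + amount > l->max[res])
		return false;
	l->cur[res] += amount;
	return true;
}
```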
HTC,
Herbert
> -- 
> Regards,
> vatsa
	|  |  |  
	| 
		
			| Re: [PATCH 0/2] resource control file system - aka containers on top of nsproxy! [message #17677 is a reply to message #17672] | Sat, 10 March 2007 02:03   |  
			| 
				
				
					|  Herbert Poetzl Messages: 239
 Registered: February 2006
 | Senior Member |  |  |  
	| On Sat, Mar 10, 2007 at 12:11:05AM +0530, Srivatsa Vaddagiri wrote:
> On Fri, Mar 09, 2007 at 02:16:08AM +0100, Herbert Poetzl wrote:
> > On Thu, Mar 08, 2007 at 05:00:54PM +0530, Srivatsa Vaddagiri wrote:
> > > On Thu, Mar 08, 2007 at 01:50:01PM +1300, Sam Vilain wrote:
> > > > 7. resource namespaces
> > > 
> > > It should be. Imagine giving 20% bandwidth to a user X. X wants to
> > > divide this bandwidth further between multi-media (10%), kernel
> > > compilation (5%) and rest (5%). So,
> > 
> > sounds quite nice, but ...
> > 
> > > > Is the subservient namespace's resource usage counting against
> > > > ours too?
> > > 
> > > Yes, the resource usage of children should be accounted when capping
> > > parent resource usage.
> > 
> > it will require to do accounting many times
> > (and limit checks of course), which in itself
> > might be a way to DoS the kernel by creating
> > more and more resource groups
> 
> I was only pointing out the usefulness of the feature and not
> necessarily saying it -should- be implemented! Of course I understand it
> will make the controller complicated, and that's probably why none of the
> resource controllers we are seeing posted on lkml support hierarchical
> res mgmt.
> 
> > > > Can we dynamically alter the subservient namespace's resource
> > > > allocations?
> > > 
> > > Should be possible yes. That lets user X completely manage his
> > > allocation among whatever sub-groups he creates.
> > 
> > what happens if the parent changes, how is
> > the resource change (if it was a reduction)
> > propagated to the children?
> >
> > e.g. your guest has 1024 file handles, now
> > you reduce it to 512, but the guest had two
> > children, both with 256 file handles each ...
> 
> I believe CKRM handled this quite neatly (by defining child shares to be
> relative to parent shares). 
> 
> In your example, 256+256 add up to 512 which is within the parent's
> new limit, so nothing happens :) 
yes, but that might as well be fatal, because now the
children can easily DoS the parent by using up all the
file handles, where the 'original' setup (2 x 256)
left 512 file handles 'reserved' ...
of course, you could as well have adjusted that to
2 x 128 + 256 for the parent, but that is policy and
IMHO policy does not belong into the kernel, it should
be handled by userspace (maybe invoked by the kernel
in some kind of helper functionality or so)
> You also picked an example of exhaustible/non-reclaimable resource,
> which makes it hard to define what should happen if parent's limit
> goes below 512.
which was quite intentional, and brings us to another
issue when adjusting resource limits (not even in
a hierarchical way)
> Either nothing happens or perhaps a task is killed, don't know.
> In case of memory, I would say that some of child's pages may 
> get kicked out and in case of cpu, child will start getting fewer
> cycles.
btw, kicking out pages when rss limit is reached might
be the obvious choice (if we think Virtual Machine here)
but it might not be the best choice from the overall
performance PoV, which might be much better off by
keeping the page in memory (if there is enough memory
available) but penalizing the guest like the page was
actually kicked out (and needs to be fetched later on)
note: this is something we should think about when we
want to address specific limits like RSS, because IMHO
we should not optimize for the single guest case, but
for the big picture ...
> > > The patches should give visibility to both nsproxy objects (by
> > > showing what tasks share the same nsproxy objects and letting
> > > tasks move across nsproxy objects if allowed) and the resource
> > > control objects pointed to by nsproxy (struct cpuset, struct
> > > cpu_limit, struct rss_limit etc).
> > 
> > the nsproxy is not really relevant, as it
> > is some kind of strange indirection, which
> > does not necessarily depict the real relations,
> > regardless whether you do the re-sharing of
> > those nsproxies or not .. 
> 
> So what are you recommending we do instead? 
> My thought was whatever is the fundamental unit to which resource
> management needs to be applied, let's store resource parameters (or
> pointers to them) there (rather than duplicating the information in
> each task_struct).
we do not want to duplicate any information in the task
struct, but we might want to put some (or maybe all)
of the spaces back (as pointer reference) to the task
struct, just to avoid the nsproxy indirection
note that IMHO not all spaces make sense to be separated
e.g. while it is quite useful to have network and pid
space separated, others might be joined to form larger
consistent structures ...
for example, I could as well live with pid and resource
accounting/limits sharing one common struct/space ...
(doesn't mean that separate spaces are not nice :)
best,
Herbert
> -- 
> Regards,
> vatsa
	|  |  |  
	| 
		
			| Re: [PATCH 0/2] resource control file system - aka containers on top of nsproxy! [message #17678 is a reply to message #17643] | Sat, 10 March 2007 02:02   |  
			| 
				
				
					|  Srivatsa Vaddagiri Messages: 241
 Registered: August 2006
 | Senior Member |  |  |  
	| I think maybe I didn't communicate what I mean by a container here
(although I thought I did). I am referring to a container in a vserver
context (set of tasks which share the same namespace).
On Fri, Mar 09, 2007 at 02:09:35PM -0800, Paul Menage wrote:
> >2. Regarding space savings, if 100 tasks are in a container (I don't know
> >   what is a typical number) -and- let's say that all tasks are to share
> >   the same resource allocation (which seems to be natural), then having
> >   a 'struct container_group *' pointer in each task_struct seems to be not
> >   very efficient (simply because we don't need that task-level granularity
> >   of managing resource allocation).
> 
> I think you should re-read my patches.
> 
> Previously, each task had N pointers, one for its container in each
> potential hierarchy. The container_group concept means that each task
> has 1 pointer, to a set of container pointers (one per hierarchy)
> shared by all tasks that have exactly the same set of containers (in
> the various different hierarchies).
Ok, let me see if I can convey what I had in mind better:
	    uts_ns pid_ns ipc_ns
		\    |    /
		---------------
	       | nsproxy  	|
	        ----------------
                 /  |   \    \ <-- 'nsproxy' pointer
		T1  T2  T3 ...T1000
		|   |   |      | <-- 'containers' pointer (4/8 KB for 1000 tasks)
	       -------------------
	      | container_group	  |
	       ------------------	
		/
	     ----------
	    | container |
	     ----------
		|
	     ----------
	    | cpu_limit |
	     ---------- 
(T1, T2, T3 ..T1000) are part of a vserver, let's say, sharing the same
uts/pid/ipc_ns. Now where do we store the resource control information
for this unit/set-of-tasks in your patches?
	(tsk->containers->container[cpu_ctlr.hierarchy] + X)->cpu_limit 
(The X is to account for the fact that the container structure points to a
'struct container_subsys_state' embedded in some other structure. It's
usually zero if the structure is embedded at the top)
I understand that container_group also points directly to
'struct container_subsys_state', in which case, the above is optimized
to:
	(tsk->containers->subsys[cpu_ctlr.subsys_id] + X)->cpu_limit
Did I get that correct?
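A minimal user-space sketch of the lookup chain above, assuming simplified stand-ins for the structures in the container patches. The field layout, `MAX_SUBSYS`, `CPU_SUBSYS_ID` and `struct cpu_ctlr_state` are illustrative names, not the definitions from the patches. The "X" offset is what `container_of` computes:

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the structures under discussion. */
#define MAX_SUBSYS 4
#define CPU_SUBSYS_ID 0		/* assumed id of the cpu controller */

struct container_subsys_state {
	int refcnt;
};

struct container_group {
	/* one direct pointer per subsystem, shared by many tasks */
	struct container_subsys_state *subsys[MAX_SUBSYS];
};

struct task {
	struct container_group *containers;
};

/* Controller-private state that embeds the generic subsys state. */
struct cpu_ctlr_state {
	struct container_subsys_state css;	/* embedded at the top: X == 0 */
	int cpu_limit;				/* percent */
};

/* Recover the enclosing structure from a pointer to its member. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* (tsk->containers->subsys[id] + X)->cpu_limit, spelled out. */
int task_cpu_limit(const struct task *tsk)
{
	struct container_subsys_state *css =
		tsk->containers->subsys[CPU_SUBSYS_ID];

	return container_of(css, struct cpu_ctlr_state, css)->cpu_limit;
}
```

With the subsys state embedded first, `offsetof` is zero and the lookup is two dereferences from the task.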
Compare that to:
	     			   -----------
				  | cpu_limit |
	    uts_ns pid_ns ipc_ns   ----------
		\    |    /	    |
		------------------------
	       | 	nsproxy  	|
	        ------------------------
                 /  |   \	 |
		T1  T2  T3 .....T1000
We save on 4/8 KB (for 1000 tasks) by avoiding the 'containers' pointer
in each task_struct (just to get to the resource limit information).
So my observation was (again, note, primarily from a vserver context): given that
(T1, T2, T3 ..T1000) will all need to be managed as a unit (because they are
all sharing the same nsproxy pointer), having the '->containers' pointer
in -each- one of them to tell the unit's limit is not optimal. Instead, store
the limit in the proper unit structure (in this case nsproxy, or
whatever other vserver data structure (pid_ns?) better
represents the fundamental unit of res mgmt in vservers).
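A rough sketch of the alternative layout and of the 4/8 KB figure, assuming hypothetical structure names (`nsproxy_sketch`, `task_sketch`, `cpu_limit_info`); none of these are existing kernel definitions:

```c
#include <stddef.h>

struct cpu_limit_info {
	int percent;
};

/* Hypothetical nsproxy that carries the limit next to the namespaces. */
struct nsproxy_sketch {
	void *uts_ns, *pid_ns, *ipc_ns;
	struct cpu_limit_info *cpu_limit;	/* shared by every task in the unit */
};

struct task_sketch {
	struct nsproxy_sketch *nsproxy;		/* already needed for namespaces */
};

/* The limit is reached through the pointer the task has anyway. */
int task_cpu_limit_via_nsproxy(const struct task_sketch *tsk)
{
	return tsk->nsproxy->cpu_limit->percent;
}

/* Bytes spent on one extra per-task pointer across ntasks tasks. */
size_t extra_pointer_bytes(size_t ntasks, size_t pointer_size)
{
	return ntasks * pointer_size;
}
```

For 1000 tasks, `extra_pointer_bytes` gives 4000 bytes with 4-byte pointers and 8000 bytes with 8-byte pointers, which is the "4/8 KB" quoted above.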
(I will respond to remaining comments later ..too early in the morning now!)
-- 
Regards,
vatsa |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [PATCH 0/2] resource control file system - aka containers on top of nsproxy! [message #17679 is a reply to message #17678] | Sat, 10 March 2007 03:19   |  
			| 
				
				
					|  Srivatsa Vaddagiri Messages: 241
 Registered: August 2006
 | Senior Member |  |  |  
	| On Sat, Mar 10, 2007 at 07:32:20AM +0530, Srivatsa Vaddagiri wrote:
> Ok, let me see if I can convey what I had in mind better:
> 
> 	    uts_ns pid_ns ipc_ns
> 		\    |    /
> 		---------------
> 	       | nsproxy  	|
> 	        ----------------
>                  /  |   \    \ <-- 'nsproxy' pointer
> 		T1  T2  T3 ...T1000
> 		|   |   |      | <-- 'containers' pointer (4/8 KB for 1000 task)
> 	       -------------------
> 	      | container_group	  |
> 	       ------------------	
> 		/
> 	     ----------
> 	    | container |
> 	     ----------
> 		|
> 	     ----------
> 	    | cpu_limit |
> 	     ---------- 
[snip]
> We save on 4/8 KB (for 1000 tasks) by avoiding the 'containers' pointer
> in each task_struct (just to get to the resource limit information).
Having the 'containers' pointer in each task-struct is great from a
non-container res mgmt perspective. It lets you dynamically decide what
is the fundamental unit of res mgmt. 
It could be {T1, T5} tasks/threads of a process, or {T1, T3, T8, T10} tasks of 
a session (for limiting login time per session), or {T1, T2 ..T10, T18, T27} 
tasks of a user etc.
But from a vserver/container pov, this level of flexibility (at a -task- level) in
deciding the unit of res mgmt is IMHO not needed. The
vserver/container/namespace (tsk->nsproxy->some_ns) to which a task
belongs automatically defines that unit of res mgmt.
-- 
Regards,
vatsa |  
	|  |  |  
	| 
		
			| Re: [PATCH 0/2] resource control file system - aka containers on top of nsproxy! [message #17682 is a reply to message #17622] | Sat, 10 March 2007 08:52   |  
			| 
				
				
					|  Sam Vilain Messages: 73
 Registered: February 2006
 | Member |  |  |  
	| Paul Jackson wrote:
>> But "namespace" has well-established historical semantics too - a way
>> of changing the mappings of local * to global objects. This
>> accurately describes things like resource controllers, cpusets, resource
>> monitoring, etc.
>>     
>
> No!
>
> Cpusets don't rename or change the mapping of objects.
>
> I suspect you seriously misunderstand cpusets and are trying to cram them
> into a 'namespace' remapping role into which they don't fit.
>   
Look, you're absolutely right, I'm stretching the terms much too far.
A namespace implies some kind of domain, which is the namespace, and
entities within the domain, which are the names, and there is a (task,
domain) mapping. I was thinking that this implies all similar (task,
domain) mappings could be treated in the same way. But when you apply
this to something like cpusets, it gets a little abstract. Like the
entities are (task,cpu) pairs and the domains the set of cpus that a
process can run on.
Sam. |  
	|  |  |  
	| 
		
			| Re: [PATCH 1/2] rcfs core patch [message #17697 is a reply to message #17676] | Sun, 11 March 2007 16:36   |  
			| 
				
				
					|  serue Messages: 750
 Registered: February 2006
 | Senior Member |  |  |  
	| Quoting Herbert Poetzl (herbert@13thfloor.at):
> On Fri, Mar 09, 2007 at 11:27:07PM +0530, Srivatsa Vaddagiri wrote:
> > On Fri, Mar 09, 2007 at 01:38:19AM +0100, Herbert Poetzl wrote:
> > > > 2) you allow a task to selectively reshare namespaces/subsystems with
> > > >    another task, i.e. you can update current->task_proxy to point to
> > > >    a proxy that matches your existing task_proxy in some ways and the
> > > >    task_proxy of your destination in others. In that case a trivial
> > > >    implementation would be to allocate a new task_proxy and copy some
> > > >    pointers from the old task_proxy and some from the new. But then
> > > >    whenever a task moves between different groupings it acquires a
> > > >    new unique task_proxy. So moving a bunch of tasks between two
> > > >    groupings, they'd all end up with unique task_proxy objects with
> > > >    identical contents.
> 
> > > this is exactly what Linux-VServer does right now, and I'm
> > > still not convinced that the nsproxy really buys us anything
> > > compared to a number of different pointers to various spaces
> > > (located in the task struct)
> 
> > Are you saying that the current scheme of storing pointers to
> > different spaces (uts_ns, ipc_ns etc) in nsproxy doesn't buy
> > anything?
> 
> > Or are you referring to storage of pointers to resource 
> > (name)spaces in nsproxy doesn't buy anything?
> 
> > In either case, doesn't it buy speed and storage space?
> 
> let's do a few examples here, just to illustrate the
> advantages and disadvantages of nsproxy as separate
> structure over nsproxy as part of the task_struct
But you're forgetting the *common* case, which is hundreds or thousands
of tasks with just one nsproxy.  That's the case for which we have to
optimize.
When that case is no longer the common case, we can yank the nsproxy.
As I keep saying, it *is* just an optimization.
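The trade-off being argued here can be sketched in user space as follows. This is only an illustration under assumed names (`nsproxy_sketch`, `space`, `NR_SPACES`), not the kernel's fork path: sharing one proxy costs a single refcount operation per new task, while keeping the space pointers in the task costs one per space.

```c
#include <stddef.h>

#define NR_SPACES 3	/* e.g. uts, pid, ipc; illustrative only */

struct space {
	int refcnt;
};

struct nsproxy_sketch {
	int refcnt;
	struct space *spaces[NR_SPACES];
};

/* Shared proxy: the child pins the proxy once. Returns refcount ops done. */
int fork_shared_proxy(struct nsproxy_sketch *proxy)
{
	proxy->refcnt++;
	return 1;
}

/* Pointers kept in the task: the child pins every space separately. */
int fork_embedded_spaces(struct space *spaces[NR_SPACES])
{
	int ops = 0;

	for (size_t i = 0; i < NR_SPACES; i++) {
		spaces[i]->refcnt++;
		ops++;
	}
	return ops;
}
```

The sketch leaves out the other side of the argument: the extra dereference through the proxy on every access, and the full copy needed when a task changes only one space.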
-serge
> 1) typical setup, 100 guests as shell servers, 5
>    tasks each when unused, 10 tasks when used 10%
>    used in average
> 
>    a) separate nsproxy, we need at least 100
>       structs to handle that (saves some space)
> 
>       we might end up with ~500 nsproxies, if
>       the shell clones a new namespace (so might
>       not save that much space)
> 
>       we do a single inc/dec when the nsproxy
>       is reused, but do the full N inc/dec when
>       we have to copy an nsproxy (might save
>       some refcounting)
> 
>       we need to do the indirection step, from
>       task to nsproxy to space (and data)
> 
>    b) we have ~600 tasks with 600 times the
>       nsproxy data (uses up some more space)
> 
>       we have to do the full N inc/dec when
>       we create a new task (more refcounting)
> 
>       we do not need to do the indirection, we
>       access spaces directly from the 'hot'
>       task struct (makes hot paths quite fast)
> 
>    so basically we trade a little more space and
>    overhead on task creation for having no 
>    indirection to the data accessed quite often
>    throughout the tasks life (hopefully)
> 
> 2) context migration: for whatever reason, we decide
>    to migrate a task into a subset (space mix) of a
>    context 1000 times
> 
>    a) separate nsproxy, we need to create a new one
>       consisting of the 'new' mix, which will
> 
>       - allocate the nsproxy struct
>       - inc refcounts to all copied spaces
>       - inc refcount nsproxy and assign to task
>       - dec refcount existing task nsproxy
> 
>       after task completion
>       - dec nsproxy refcount
>       - dec refcounts for all spaces      
>       - free up nsproxy struct
> 
>    b) nsproxy data in task struct
> 
>       - inc/dec refcounts to changed spaces
> 
>       after task completion
>       - dec refcounts to spaces
> 
>    so here we gain nothing with the nsproxy, unless
>    the chosen subset is identical to the one already
>    used, where we end up with a single refcount 
>    instead of N 
> 
> > > I'd prefer to do accounting (and limits) in a very simple
> > > and especially performant way, and the reason for doing
> > > so is quite simple:
> 
> > Can you elaborate on the relationship between data structures
> > used to store those limits to the task_struct?                                
> 
> sure it is one to many, i.e. each task points to
> exactly one context struct, while a context can
> consist of zero, one or many tasks (no back- 
> pointers there)
> 
> > Does task_struct store pointers to those objects directly?
> 
> it contains a single pointer to the context struct, 
> and that contains (as a substruct) the accounting
> and limit information
> 
> HTC,
> Herbert |  
	|  |  |  
	| 
		
			| Re: [PATCH 1/2] rcfs core patch [message #17718 is a reply to message #17644] | Sun, 11 March 2007 17:09   |  
			| 
				
				
					|  dev Messages: 1693
 Registered: September 2005
 Location: Moscow
 | Senior Member |  
 |  |  
	| Herbert,
> sorry, I'm not in the lucky position that I get payed
> for sending patches to LKML, so I have to think twice
> before I invest time in coding up extra patches ...
> 
> i.e. you will have to live with my comments for now
looks like you have no better arguments than that...
>>Looks like your main argument is non-intrusive...
>>"working", "secure", "flexible" are not required to 
>>people any more? :/
> 
> 
> well, Linux-VServer is "working", "secure", "flexible"
> _and_ non-intrusive ... it is quite natural that less
> won't work for me ... and regarding patches, there
> will be a 2.2 release soon, with all the patches ...
ok. please check your dcache and slab accounting then
(analyzed according to patch-2.6.20.1-vs2.3.0.11.diff):
Both are full of races and problems. Some of them:
1. Slabs allocated from interrupt context are charged to the current context.
   So the charged values contain an arbitrary mess, since during interrupts
   the context can be arbitrary.
2. Due to (1) I guess you do not do any limiting of slabs.
   So there are a number of ways to consume a lot of kernel
   memory from inside a container, and
   the OOM killer will kill arbitrary tasks in case of memory shortage after that.
   Don't think it is secure... real DoS.
3. Dcache accounting simply doesn't work, since
   charges/uncharges are done on the current context (sic!!!), which is arbitrary.
   i.e. a lookup can be done in VE context, while the dcache shrink can be done
   from another context.
   So the whole problem with dcache DoS is not solved at all, it is just hard to trigger.
4. Dcache accounting is racy, since your checks look like:
   if (atomic_read(de->d_count))
      charge();
   which obviously races with other dput()'s/lookups.
5. The dcache accounting limit can be hit if someone does `find /` inside a container.
   After that it is impossible to open something new,
   since all the dentries for directories in the dcache will have d_count > 0
   (due to their children).
   It is a BUG.
6. Counters can be non-zero on container stop due to all of the above.
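To make point 4 concrete, here is a user-space sketch of the check-then-act pattern using C11 atomics. The names (`dentry_sketch`, `racy_should_charge`, and so on) are hypothetical, and the second variant is just one common way to close such a window, not a description of how either project does its accounting:

```c
#include <stdatomic.h>
#include <stdbool.h>

struct dentry_sketch {
	atomic_int d_count;
};

/*
 * Racy pattern: the count can change between this read and whatever the
 * caller does next, so two callers can both decide to charge (or neither).
 */
bool racy_should_charge(struct dentry_sketch *de)
{
	return atomic_load(&de->d_count) != 0;
}

/*
 * Decide from the value returned by the atomic update itself: only the
 * caller that moved the count from 0 to 1 is told to charge.
 */
bool get_and_should_charge(struct dentry_sketch *de)
{
	return atomic_fetch_add(&de->d_count, 1) == 0;
}

/* Only the caller that moved the count from 1 to 0 is told to uncharge. */
bool put_and_should_uncharge(struct dentry_sketch *de)
{
	return atomic_fetch_sub(&de->d_count, 1) == 1;
}
```

This only addresses the read/act window. It does nothing about point 3, where the charge and the uncharge happen in different contexts.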
There are more and more points which arise when such non-intrusive
accounting is concerned. I'm really surprised that you don't see them,
or behave as if you don't see them :/
And, please, believe me, I would not suggest such complicated patches
if everything were so easy and I could simply accept the vserver code.
> well, as you know, all current solutions use a syscall
> interface to do most of the work, in the OpenVZ/Virtuozzo
> case several, unassigned syscalls are used, while 
> FreeVPS and Linux-VServer use a registered and versioned
> (multiplexed) system call, which works quite fine for
> all known purposes ...
> 
> I'm quite happy with the extensibility and flexibility
> the versioned syscall interface has, the only thing I'd
> change if I would redesign that interface is, that I
> would add another pointer argument to eliminate 32/64bit
> issues completely (i.e. use 4 args instead of the 3)
Well, I would be happy with syscalls also.
But my guess is that cpuset guys who already use fs approach won't be happy :/
Maybe we can use both?
Thanks,
Kirill |  
	|  |  |  
	|  |  
	| 
		
			| Re: [ckrm-tech] [PATCH 0/2] resource control file system - aka containers on top of nsproxy! [message #17746 is a reply to message #17643] | Mon, 12 March 2007 15:07   |  
			| 
				
				
					|  Srivatsa Vaddagiri Messages: 241
 Registered: August 2006
 | Senior Member |  |  |  
	| On Fri, Mar 09, 2007 at 02:09:35PM -0800, Paul Menage wrote:
> > 3. This next leads me to think that 'tasks' file in each directory doesnt make
> >    sense for containers. In fact it can lend itself to error situations (by
> >    administrator/script mistake) when some tasks of a container are in one
> >    resource class while others are in a different class.
> >
> >         Instead, from a containers pov, it may be useful to write
> >         a 'container id' (if such a thing exists) into the tasks file
> >         which will move all the tasks of the container into
> >         the new resource class. This is the same requirement we
> >         discussed long back of moving all threads of a process into new
> >         resource class.
> 
> I think you need to give a more concrete example and use case of what
> you're trying to propose here. I don't really see what advantage
> you're getting.
Ok, this is what I had in mind:
	mount -t container -o ns none /dev/namespace
	mount -t container -o cpu none /dev/cpu
Let's say we have the namespaces/resource-groups created as follows:
	/dev/namespace
		    |-- prof
		    |	 |- tasks <- (T1, T2)
		    |    |- container_id <- 1 (doesnt exist today perhaps)
		    |
		    |-- student
		    |    |- tasks <- (T3, T4)
		    |    |- container_id <- 2 (doesnt exist today perhaps)
	/dev/cpu
	       |-- prof
	       |    |-- tasks
	       |    |-- cpu_limit (40%)
	       |
	       |-- student
	       |    |-- tasks
	       |    |-- cpu_limit (20%)
	       |
	       |
Is it possible to create the above structure in container patches? 
/me thinks so.
If so, then accidentally someone can do this:
	echo T1 > /dev/cpu/prof/tasks
	echo T2 > /dev/cpu/student/tasks
with the result that tasks of the same container are now in different
resource classes.
That's why, in the case of containers, I felt we shouldn't allow individual tasks
to be written to the tasks file.
Or rather, it may be nice to say :
	echo "cid 2" > /dev/cpu/prof/tasks 
and have all tasks belonging to container id 2 move to the new resource
group.
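A small user-space sketch of what the "cid" write would have to do, assuming each task records a container id and a resource class. The `task_sketch` structure and `move_container` function are hypothetical; no such interface exists in the posted patches:

```c
#include <stddef.h>

struct task_sketch {
	int container_id;	/* which container (vserver) the task is in */
	int cpu_class;		/* which resource class it is charged to */
};

/*
 * Move every task of container `cid` into `new_class` in one operation,
 * so tasks of one container never end up split across classes.
 * Returns the number of tasks moved.
 */
size_t move_container(struct task_sketch *tasks, size_t ntasks,
		      int cid, int new_class)
{
	size_t moved = 0;

	for (size_t i = 0; i < ntasks; i++) {
		if (tasks[i].container_id == cid) {
			tasks[i].cpu_class = new_class;
			moved++;
		}
	}
	return moved;
}
```

A real implementation would also have to keep tasks created during the move from landing in the old class, which this sketch ignores.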
-- 
Regards,
vatsa |  
	|  |  |  
	| 
		
			| Re: [PATCH 1/2] rcfs core patch [message #17765 is a reply to message #17718] | Mon, 12 March 2007 23:00   |  
			| 
				
				
					|  Herbert Poetzl Messages: 239
 Registered: February 2006
 | Senior Member |  |  |  
	| On Sun, Mar 11, 2007 at 08:09:29PM +0300, Kirill Korotaev wrote:
> Herbert,
> 
> > sorry, I'm not in the lucky position that I get payed
> > for sending patches to LKML, so I have to think twice
> > before I invest time in coding up extra patches ...
> > 
> > i.e. you will have to live with my comments for now
> looks like you have no better argurments then that...
pardon?
if you want to make that personal, please do it
offline ... I'm sick of (lkml) folks wasting
time on (political) bickering instead of trying
to improve the kernel ...
>>> Looks like your main argument is non-intrusive...
>>> "working", "secure", "flexible" are not required to 
>>> people any more? :/
>> well, Linux-VServer is "working", "secure", "flexible"
>> _and_ non-intrusive ... it is quite natural that less
>> won't work for me ... and regarding patches, there
>> will be a 2.2 release soon, with all the patches ...
> ok. please check your dcache and slab accounting then
> (analyzed according to patch-2.6.20.1-vs2.3.0.11.diff):
development branch, good choice for new features
and code which is currently tested ...
> Both are full of races and problems. Some of them:
> 1. Slabs allocated from interrupt context are charged to 
>    current context.
>    So charged values contain arbitrary mess, since during
>    interrupts context can be arbitrary.
> 2. Due to (1) I guess you do not make any limiting of slabs.
>    So there are number of ways how to consume a lot of kernel
>    memory from inside container and
>    OOM killer will kill arbitrary tasks in case of 
>    memory-shortage after that.
>    Don't think it is secure... real DoS.
> 3. Dcache accounting simply doesn't work, since
>    charges/uncharges are done on current context (sic!!!),
>    which is arbitrary. i.e. lookup can be done in VE context,
>    while dcache shrink can be done from another context.
>    So the whole problem with dcache DoS is not solved at 
>    all, it is just hard to trigger.
> 4. Dcache accounting is racy, since your checks look like:
>    if (atomic_read(de->d_count))
>       charge();
>    which obviously races with other dput()'s/lookups.
> 5. Dcache accounting can be hit if someone does `find /`
>    inside container.
>    After that it is impossible to open something new,
>    since all the dentries for directories in dcache will 
>    have d_count > 0 (due it's children).
>    It is a BUG.
> 6. Counters can be non-zero on container stop due to all
>    of the above.
looks like for the first time you are actually
looking at the code, or at least providing feedback
and/or suggestions for improvements (well, not many
of them, but hey, nobody is perfect :)
> There are more and more points which arise when such a 
> non-intrusive accounting is concerned. 
never claimed that Linux-VServer code is perfect,
(the Linux accounting isn't perfect either in many
ways) and Linux-VServer is constantly improving
(see my other email) ... but IIRC, we are _not_
discussing Linux-VServer code at all, we are talking
about a superior solution, which combines the best
of both worlds ...
> I'm really suprised, that you don't see them
> or try to behave as you don't see them :/
all I'm saying is that there is no point in achieving
perfect accounting and limits (and everything else)
when all you get is Xen performance and resource usage
> And, please, believe me, I would not suggest so much 
> complicated patches If everything was so easy and I 
> had no reasons simply to accept vserver code.
no, you are suggesting those patches, because that
is what your company came up with after being confronted
with the task (of creating OS-Level virtualization) and
the arising problems ... so it definitely _is_ a
solution to those problems, but not necessarily the
best and definitely not the only one :)
> > well, as you know, all current solutions use a syscall
> > interface to do most of the work, in the OpenVZ/Virtuozzo
> > case several, unassigned syscalls are used, while 
> > FreeVPS and Linux-VServer use a registered and versioned
> > (multiplexed) system call, which works quite fine for
> > all known purposes ...
> > 
> > I'm quite happy with the extensibility and flexibility
> > the versioned syscall interface has, the only thing I'd
> > change if I would redesign that interface is, that I
> > would add another pointer argument to eliminate 32/64bit
> > issues completely (i.e. use 4 args instead of the 3)
> Well, I would be happy with syscalls also.
> But my guess is that cpuset guys who already use fs 
> approach won't be happy :/
> Maybe we can use both?
I'm fine with either here, though my preference is
for syscalls (and we will probably keep the versioned
syscall commands for Linux-VServer anyway)
best,
Herbert
> Thanks,
> Kirill |  
	|  |  |  
	| 
		
			| Re: [PATCH 1/2] rcfs core patch [message #17766 is a reply to message #17697] | Mon, 12 March 2007 23:16   |  
			| 
				
				
					|  Herbert Poetzl Messages: 239
 Registered: February 2006
 | Senior Member |  |  |  
	| On Sun, Mar 11, 2007 at 11:36:04AM -0500, Serge E. Hallyn wrote:
> Quoting Herbert Poetzl (herbert@13thfloor.at):
> > On Fri, Mar 09, 2007 at 11:27:07PM +0530, Srivatsa Vaddagiri wrote:
> > > On Fri, Mar 09, 2007 at 01:38:19AM +0100, Herbert Poetzl wrote:
> > > > > 2) you allow a task to selectively reshare namespaces/subsystems with
> > > > >    another task, i.e. you can update current->task_proxy to point to
> > > > >    a proxy that matches your existing task_proxy in some ways and the
> > > > >    task_proxy of your destination in others. In that case a trivial
> > > > >    implementation would be to allocate a new task_proxy and copy some
> > > > >    pointers from the old task_proxy and some from the new. But then
> > > > >    whenever a task moves between different groupings it acquires a
> > > > >    new unique task_proxy. So moving a bunch of tasks between two
> > > > >    groupings, they'd all end up with unique task_proxy objects with
> > > > >    identical contents.
> > 
> > > > this is exactly what Linux-VServer does right now, and I'm
> > > > still not convinced that the nsproxy really buys us anything
> > > > compared to a number of different pointers to various spaces
> > > > (located in the task struct)
> > 
> > > Are you saying that the current scheme of storing pointers to
> > > different spaces (uts_ns, ipc_ns etc) in nsproxy doesn't buy
> > > anything?
> > 
> > > Or are you referring to storage of pointers to resource 
> > > (name)spaces in nsproxy doesn't buy anything?
> > 
> > > In either case, doesn't it buy speed and storage space?
> > 
> > let's do a few examples here, just to illustrate the
> > advantages and disadvantages of nsproxy as separate
> > structure over nsproxy as part of the task_struct
> 
> But you're forgetting the *common* case, which is hundreds or
> thousands of tasks with just one nsproxy. That's case for 
> which we have to optimize.
yes, I agree here, maybe we should do something
I suggested (and submitted a patch for some time
ago) and add some kind of accounting for the various
spaces (and the nsproxy) so that we can get a feeling
for how many of them there are and how many create/destroy
cycles really happen ...
those things will definitely be accounted in the
Linux-VServer devel versions, don't know about OVZ
> When that case is no longer the common case, we can yank the 
> nsproxy.  As I keep saying, it *is* just an optimization.
yes, fine with me, just wanted to paint a picture ...
best,
Herbert |  
	|  |  |  
	| 
		
			| Re: [PATCH 1/2] rcfs core patch [message #17789 is a reply to message #17765] | Tue, 13 March 2007 08:28   |  
			| 
				
				
					|  dev Messages: 1693
 Registered: September 2005
 Location: Moscow
 | Senior Member |  
 |  |  
	| >>>well, Linux-VServer is "working", "secure", "flexible"
>>>_and_ non-intrusive ... it is quite natural that less
>>>won't work for me ... and regarding patches, there
>>>will be a 2.2 release soon, with all the patches ...
> 
> 
>>ok. please check your dcache and slab accounting then
>>(analyzed according to patch-2.6.20.1-vs2.3.0.11.diff):
> 
> 
> development branch, good choice for new features
> and code which is currently tested ...
you know better than I do that the stable branch doesn't differ much,
especially in security (because it lacks these controls entirely).
BTW, killing an arbitrary task when the RSS limit is hit
doesn't look like an acceptable resource management approach, does it?
>>Both are full of races and problems. Some of them:
>>1. Slabs allocated from interrupt context are charged to 
>>   current context.
>>   So charged values contain arbitrary mess, since during
>>   interrupts context can be arbitrary.
> 
> 
>>2. Due to (1) I guess you do not make any limiting of slabs.
>>   So there are number of ways how to consume a lot of kernel
>>   memory from inside container and
>>   OOM killer will kill arbitrary tasks in case of 
>>   memory-shortage after that.
>>   Don't think it is secure... real DoS.
> 
> 
>>3. Dcache accounting simply doesn't work, since
>>   charges/uncharges are done on current context (sic!!!),
>>   which is arbitrary. i.e. lookup can be done in VE context,
>>   while dcache shrink can be done from another context.
>>   So the whole problem with dcache DoS is not solved at 
>>   all, it is just hard to trigger.
> 
> 
>>4. Dcache accounting is racy, since your checks look like:
>>   if (atomic_read(de->d_count))
>>      charge();
>>   which obviously races with other dput()'s/lookups.
> 
> 
>>5. Dcache accounting can be hit if someone does `find /`
>>   inside container.
>>   After that it is impossible to open something new,
>>   since all the dentries for directories in dcache will 
>>   have d_count > 0 (due it's children).
>>   It is a BUG.
> 
> 
>>6. Counters can be non-zero on container stop due to all
>>   of the above.
> 
> 
> looks like for the the first time you are actually
> looking at the code, or at least providing feedback
> and/or suggestions for improvements (well, not many
> of them, but hey, nobody is perfect :)
It's a pity, but it took me only 5 minutes of looking into the code,
so "not perfect" is the wrong phrase here, sorry.
>>There are more and more points which arise when such a 
>>non-intrusive accounting is concerned. 
> 
> 
> never claimed that Linux-VServer code is perfect,
> (the Linux accounting isn't perfect either in many
> ways) and Linux-VServer is constantly improving
> (see my other email) ... but IIRC, we are _not_
> discussing Linux-VServer code at all, we are talking
> about a superior solution, which combines the best
> of both worlds ...
Forget about Vserver and OpenVZ. It is not a war.
We are looking for something working, new and robust.
I'm just trying to show you that non-intrusive and pretty small
accounting/limiting code like that in Vserver
simply doesn't work. The problem of resource controls is much more complicated.
So non-intrusiveness is a very weird argument from you (and your only one).
>>I'm really suprised, that you don't see them
>>or try to behave as you don't see them :/
> 
> 
> all I'm saying is that there is no point in achieving
> perfect accounting and limits (and everything else)
> when all you get is Xen performance and resource usage
then please elaborate on what you mean by
perfect and non-perfect accounting and limits?
I would be happy to send a patch with "non-perfect"
accounting if it really works correctly and well and suits everyone's needs.
BTW, Xen overhead comes mostly from different things (not resource management):
the inability to share data effectively, emulation overhead, etc.
>>And, please, believe me, I would not suggest such
>>complicated patches if everything were so easy and I
>>had no reason simply to accept vserver code.
> 
> 
> no, you are suggesting those patches, because that
> is what your company came up with after being confronted
> with the task (of creating OS-Level virtualization) and
> the arising problems ... so it definitely _is_ a
> solution to those problems, but not necessarily the
> best and definitely not the only one :)
You judge so because you want to.
Have you had some time to compare UBC patches from OVZ
and those sent to LKML (container + RSS)?
You would notice very little in common.
Patches in LKML have non-OVZ interfaces, no shared pages accounting,
RSS accounting which is not used in OVZ at all.
So do you see any similarities except for stupid and simple
controls like numtask/numfile?
Thanks,
Kirill
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: [PATCH 1/2] rcfs core patch [message #17793 is a reply to message #17789] | Tue, 13 March 2007 13:55   |  
			| 
				
				
					|  Herbert Poetzl Messages: 239
 Registered: February 2006
 | Senior Member |  |  |  
	| On Tue, Mar 13, 2007 at 11:28:06AM +0300, Kirill Korotaev wrote:
> >>>well, Linux-VServer is "working", "secure", "flexible"
> >>>_and_ non-intrusive ... it is quite natural that less
> >>>won't work for me ... and regarding patches, there
> >>>will be a 2.2 release soon, with all the patches ...
first, fix your mail client to get the quoting right,
it is quite unreadable the way it is (not the first
time I've told you that)
> >>ok. please check your dcache and slab accounting then
> >>(analyzed according to patch-2.6.20.1-vs2.3.0.11.diff):
> > 
> > 
> > development branch, good choice for new features
> > and code which is currently tested ...
> you know better than I that the stable branch doesn't differ much,
> especially in security (because it lacks these controls entirely).
> 
> BTW, killing an arbitrary task when the RSS limit is hit
> doesn't look like an acceptable resource management approach, does it?
> 
> >>Both are full of races and problems. Some of them:
> >>1. Slabs allocated from interrupt context are charged to 
> >>   current context.
> >>   So charged values contain arbitrary mess, since during
> >>   interrupts context can be arbitrary.
> > 
> > 
> >>2. Due to (1) I guess you do not make any limiting of slabs.
> >>   So there are a number of ways to consume a lot of kernel
> >>   memory from inside a container, and the
> >>   OOM killer will kill arbitrary tasks in case of
> >>   memory shortage after that.
> >>   Don't think it is secure... real DoS.
> > 
> > 
> >>3. Dcache accounting simply doesn't work, since
> >>   charges/uncharges are done on current context (sic!!!),
> >>   which is arbitrary. i.e. lookup can be done in VE context,
> >>   while dcache shrink can be done from another context.
> >>   So the whole problem with dcache DoS is not solved at 
> >>   all, it is just hard to trigger.
> > 
> > 
> >>4. Dcache accounting is racy, since your checks look like:
> >>   if (atomic_read(de->d_count))
> >>      charge();
> >>   which obviously races with other dput()'s/lookups.
> > 
> > 
> >>5. Dcache accounting can be hit if someone does `find /`
> >>   inside container.
> >>   After that it is impossible to open something new,
> >>   since all the dentries for directories in dcache will 
> >>   have d_count > 0 (due to its children).
> >>   It is a BUG.
> > 
> > 
> >>6. Counters can be non-zero on container stop due to all
> >>   of the above.
> > 
> > 
> > looks like for the first time you are actually
> > looking at the code, or at least providing feedback
> > and/or suggestions for improvements (well, not many
> > of them, but hey, nobody is perfect :)
> It's a pity, but it took me only 5 minutes of looking into the code,
> so "not perfect" is the wrong phrase here, sorry.
see how readable and easily understandable the code is?
it takes me several hours to read OpenVZ code, and that's
not just me :)
> >>There are more and more points which arise when such a 
> >>non-intrusive accounting is concerned. 
> > 
> > 
> > never claimed that Linux-VServer code is perfect,
> > (the Linux accounting isn't perfect either in many
> > ways) and Linux-VServer is constantly improving
> > (see my other email) ... but IIRC, we are _not_
> > discussing Linux-VServer code at all, we are talking
> > about a superior solution, which combines the best
> > of both worlds ...
> Forget about Vserver and OpenVZ. It is not a war.
> We are looking for something working, new and robust.
you forgot efficient and performant here ...
> I'm just trying to show you that non-intrusive and pretty small
> accounting/limiting code like in Vserver simply doesn't work.
simply doesn't work? 
because you didn't try to make it work?
because you didn't succeed in making it work?
> The problem of resource controls is much more complicated.
> So non-intrusiveness is a very weird argument from you 
> (and the only one).
no comment, read my previous emails ...
> >>I'm really surprised that you don't see them
> >>or try to behave as if you don't see them :/
> > 
> > 
> > all I'm saying is that there is no point in achieving
> > perfect accounting and limits (and everything else)
> > when all you get is Xen performance and resource usage
> then please elaborate on what you mean by
> perfect and non-perfect accounting and limits?
as we are discussing RSS limits, there are actually
three different (existing) approaches we have talked
about:
 - the 'perfect RAM counter' approach
   each page is accounted exactly once, when used in
   a guest, regardless of how many times it is shared
   between different guest tasks
 - the 'RSS sum' approach
   each page is accounted for every task mapping it
   (will account shared pages inside a guest several
   times and doesn't reflect the actual RAM usage)
 - the 'first user owns' approach
   each page, when mapped the first time, gets accounted
   to the guest who mapped it, regardless of the fact
   that it might be shared with other guests later on
the first one is 'perfect' IMHO, while all three are
'consistent' if done properly, although they will show
quite different results and require different limit
settings ...
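The three counting schemes listed above can be compared with a small, self-contained C sketch. It is not taken from Linux-VServer, OpenVZ or any patch posted to the list: `struct mapping`, `MAX_GUESTS`, `MAX_PAGES` and the three `count_*` helpers are hypothetical names, and one "mapping" here simply means one task of a given guest mapping one physical page, listed in the order the mappings were created.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_GUESTS 4    /* illustrative limits, not kernel constants */
#define MAX_PAGES  64

/* Hypothetical model: one task belonging to `guest` maps page `page`. */
struct mapping {
    int guest;
    int page;
};

/* Reject entries that fall outside the toy model's fixed tables. */
static void check_mapping(const struct mapping *m)
{
    assert(m->guest >= 0 && m->guest < MAX_GUESTS);
    assert(m->page >= 0 && m->page < MAX_PAGES);
}

/* 'perfect RAM counter': a page counts once per guest that uses it. */
static void count_perfect(const struct mapping *m, size_t n, int out[MAX_GUESTS])
{
    unsigned char seen[MAX_GUESTS][MAX_PAGES];

    memset(seen, 0, sizeof(seen));
    memset(out, 0, MAX_GUESTS * sizeof(out[0]));
    for (size_t i = 0; i < n; i++) {
        check_mapping(&m[i]);
        if (!seen[m[i].guest][m[i].page]) {
            seen[m[i].guest][m[i].page] = 1;
            out[m[i].guest]++;
        }
    }
}

/* 'RSS sum': every mapping counts, so shared pages count repeatedly. */
static void count_rss_sum(const struct mapping *m, size_t n, int out[MAX_GUESTS])
{
    memset(out, 0, MAX_GUESTS * sizeof(out[0]));
    for (size_t i = 0; i < n; i++) {
        check_mapping(&m[i]);
        out[m[i].guest]++;
    }
}

/* 'first user owns': a page is charged only to the guest mapping it first. */
static void count_first_user(const struct mapping *m, size_t n, int out[MAX_GUESTS])
{
    int owner[MAX_PAGES];

    for (size_t p = 0; p < MAX_PAGES; p++)
        owner[p] = -1;              /* -1 means "not charged yet" */
    memset(out, 0, MAX_GUESTS * sizeof(out[0]));
    for (size_t i = 0; i < n; i++) {
        check_mapping(&m[i]);
        if (owner[m[i].page] < 0) {
            owner[m[i].page] = m[i].guest;
            out[m[i].guest]++;
        }
    }
}
```

As an illustration, take the (guest, page) sequence (0,0), (0,0), (0,1), (1,1), (1,2): guest 0 maps page 0 from two tasks and shares page 1 with guest 1. The perfect counter then reports 2 and 2 pages for guests 0 and 1, the RSS sum reports 3 and 2, and first-user-owns reports 2 and 1. Each scheme is consistent on its own, but the same workload would need a different limit under each.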
> I would be happy to send a patch with "non-perfect"
> accounting if it really works correctly and well and suits
> everyone's needs.
good, but what you are currently doing is providing 'your'
implementation with 'your' design and approach, which
_doesn't_ really suit _my_ needs ...
> BTW, Xen overhead comes mostly from different things
> (not resource management): the inability to share data
> effectively, emulation overhead, etc.
no comment ...
> >>And, please, believe me, I would not suggest such
> >>complicated patches if everything were so easy and I
> >>had no reason simply to accept vserver code.
> > 
> > 
> > no, you are suggesting those patches, because that
> > is what your company came up with after being confronted
> > with the task (of creating OS-Level virtualization) and
> > the arising problems ... so it definitely _is_ a
> > solution to those problems, but not necessarily the
> > best and definitely not the only one :)
> You judge so because you want to.
> Have you had some time to compare UBC patches from OVZ
> and those sent to LKML (container + RSS)?
> You would notice very little in common.
> Patches in LKML have non-OVZ interfaces, no shared pages accounting,
> RSS accounting which is not used in OVZ at all.
> So do you see any similarities except for stupid and simple
> controls like numtask/numfile?
yes, tons of locking, complicated indirections and
a lot of (partially hard to understand) code ...
best,
Herbert
> Thanks,
> Kirill
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [PATCH 1/2] rcfs core patch [message #17808 is a reply to message #17793] | Tue, 13 March 2007 14:11  |  
			| 
				
				
					|  Srivatsa Vaddagiri Messages: 241
 Registered: August 2006
 | Senior Member |  |  |  
	| On Tue, Mar 13, 2007 at 02:55:05PM +0100, Herbert Poetzl wrote:
> yes, tons of locking, complicated indirections and
> a lot of (partially hard to understand) code ...
Are you referring to these issues in Paul Menage's general container code
or in the RSS-control code posted by Pavel?
-- 
Regards,
vatsa
 |  
	|  |  | 
 
 