| [PATCH 1/4] Virtualization/containers: introduction
	|  |  
	|  |  
	| 
		
			| Re: The issues for agreeing on a virtualization/namespaces implementation. [message #1428 is a reply to message #1412] | Wed, 08 February 2006 14:40   |  
			| 
				
				
					|  Hubertus Franke Messages: 16
 Registered: February 2006
 | Junior Member |  |  |  
	| Eric W. Biederman wrote:
 > Hubertus Franke <frankeh@watson.ibm.com> writes:
 >
 >
 >>Eric W. Biederman wrote:
 >>
 
 >>>2) What is the syscall interface to create these namespaces?
 >>>   - Do we add clone flags?       (Plan 9 style)
 >>
 >>Like that approach .. flexible .. particular when one has well specified
 >>namespaces.
 >>
 >>
 >>>   - Do we add a syscall (similar to setsid) per namespace?
 >>>     (Traditional unix style)?
 >>
 >>Where does that approach end .. what's wrong with doing it at clone() time ?
 >>Mainly the naming issue. Just providing a flag does not give me name.
 >
 >
 > It really is a fairly even toss up.  The usual argument for doing it
 > this way is that you will get an endless stream of arguments added to
 > fork+exec otherwise.  Look at posix_spawn or the Windows version if
 > you want an example.  Bits to clone are skirting the edge of a slippery
 > slope.
 >
 
 So it seems the clone( flags ) is a reasonable approach to create new
 namespaces. Question is what is the initial state of each namespace?
 In pidspace we know we should be creating an empty pidmap !
 In network, someone suggested creating a loopback device
 In uts, create "localhost"
 Are there examples where we would rather inherit?  Filesystem?
 Can we iterate over the subsystems and state, for each one, what people
 think the right initial state is?
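
One way to make the per-subsystem question concrete is to write the proposed initial states down as a table. The following is a minimal userspace sketch, not kernel code: the type and function names (`ns_init`, `ns_policy`, `ns_policy_lookup`) are hypothetical, and the rows only restate the suggestions made in this thread (empty pidmap, loopback device, "localhost", inherited filesystem view).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout: a userspace model, not kernel code. */
enum ns_init {
    NS_INIT_FRESH,    /* new instance starts from a defined default state */
    NS_INIT_INHERIT   /* new instance starts as a copy of the parent's */
};

struct ns_policy {
    const char *subsystem;  /* which namespace the row describes */
    enum ns_init init;      /* fresh state or inherited copy */
    const char *initial;    /* what a newly created instance contains */
};

/* Rows restate the proposals from the discussion; they are not settled. */
static const struct ns_policy ns_policies[] = {
    { "pid",        NS_INIT_FRESH,   "empty pidmap" },
    { "network",    NS_INIT_FRESH,   "loopback device only" },
    { "uts",        NS_INIT_FRESH,   "nodename \"localhost\"" },
    { "filesystem", NS_INIT_INHERIT, "copy of the parent's mounts" },
};

#define NS_POLICY_COUNT (sizeof(ns_policies) / sizeof(ns_policies[0]))

/* Look up the proposed policy for a subsystem; NULL if none was proposed. */
static const struct ns_policy *ns_policy_lookup(const char *subsystem)
{
    for (size_t i = 0; i < NS_POLICY_COUNT; i++)
        if (strcmp(ns_policies[i].subsystem, subsystem) == 0)
            return &ns_policies[i];
    return NULL;
}
```

A subsystem without a row (for example IPC) returns NULL, which is exactly the kind of gap the question above asks people to fill in.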
 
 IMHO, there is only a need to refer to a namespace from the global context,
 since one will be moving into a new container, but getting out of one
 could be prohibitive (e.g. after migration).
 It therefore does not make sense to know the name of a namespace in
 a different container.
 
 The example you used below, using the pid, comes naturally, because
 that already limits visibility.
 
 I am still struggling with why we need new sys_calls.
 sys_calls already exist for changing certain system parameters (e.g. utsname )
 so to me it boils down to identifying a proper initial state when the
 namespace is created.
 
 >
 >>>3) How do we refer to namespaces and containers when we are not members?
 >>>   - Do we refer to them indirectly by processes or other objects that
 >>>     we can see and are members?
 >>>   - Do we assign some kind of unique id to the containers?
 >>
 >>In containers I simply created an explicit name, which of course collides with
 >>the clone() approach ..
 >>One possibility is to allow associating a name with a namespace.
 >>For instance
 >>int set_namespace_name( long flags, const char *name ) /* the ones we are using
 >>in clone */
 >>{
 >>	if (!flags)
 >>		set name of container associated with current.
 >>	else
 >>		set the name if only one container is associated with the
 >>		namespace(s) identified .. or some similar rule
 >>}
 >>
 >
 >
 > What I have done which seems easier than creating new names is to refer
 > to the process which has the namespace I want to manipulate.
 
 Is the idea then to allow only the container->init to manipulate namespaces,
 or is there a need to allow other privileged processes to perform namespace
 manipulation?
 Also, after thinking about it: why is there a need to have an external name
 for a namespace?
 
 >
 >
 >>>6) How do we do all of this efficiently without a noticeable impact on
 >>>   performance?
 >>>   - I have already heard concerns that I might be introducing cache
 >>>     line bounces and thus increasing tasklist_lock hold time.
 >>>     Which on big way systems can be a problem.
 >>
 >>Possible to split the lock up now.. one for each pidspace ?
 >
 >
 > At the moment it is worth thinking about.  If the problem isn't
 > so bad that people aren't actively working on it we don't have to
 > solve the problem for a little while, just be aware of it.
 >
 
 Agreed, we just need to be sure we can split it up. But you already keep
 a task list per pid-namespace, so there should be no problem IMHO.
 If so, let's do it now and take it off the table, if it's as simple as
 
 task_list_lock ::= pspace->task_list_lock
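
As a rough illustration of what moving the lock into the pid namespace would mean, here is a userspace sketch under stated assumptions: `struct pspace`, `struct task` and the helper names are hypothetical, and a C11 `atomic_flag` spin loop stands in for the kernel's lock. It only shows that two namespaces no longer contend on one global lock; it says nothing about the remaining users of tasklist_lock.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace model of "task_list_lock ::= pspace->task_list_lock". */
struct task {
    int pid;              /* pid as seen inside its namespace */
    struct task *next;    /* next task in the same namespace */
};

struct pspace {
    atomic_flag task_list_lock;  /* one lock per pid namespace */
    struct task *tasks;          /* tasks visible in this namespace */
    int nr_tasks;
};

static void pspace_init(struct pspace *ps)
{
    atomic_flag_clear(&ps->task_list_lock);
    ps->tasks = NULL;
    ps->nr_tasks = 0;
}

static void pspace_lock(struct pspace *ps)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&ps->task_list_lock,
                                             memory_order_acquire))
        ;
}

static void pspace_unlock(struct pspace *ps)
{
    atomic_flag_clear_explicit(&ps->task_list_lock, memory_order_release);
}

/* Only the lock of the affected namespace is taken. */
static void pspace_add_task(struct pspace *ps, struct task *t)
{
    pspace_lock(ps);
    t->next = ps->tasks;
    ps->tasks = t;
    ps->nr_tasks++;
    pspace_unlock(ps);
}
```

Note that the same pid value can appear in two namespaces without conflict, because each list is private to its `pspace`.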
 
 >
 >>>7) How do we allow a process inside a container to create containers
 >>>   for its children?
 >>>   - In general this is trivial but there are a few ugly issues
 >>>     here.
 >>
 >>Speaking of pids only here ...
 >>Does it matter? You just hang all those containers off init.
 >>Whatever hierarchy they form is external ...
 >
 >
 > In general it is simple.  For resource accounting, and for naming so
 > you can migrate a container with a nested container it is a question
 > you need to be slightly careful with.
 
 Absolutely, that's why it is useful to have an "external" idea of how
 containers are constructed of basic namespaces==subsystems.
 Then it "simply" becomes a policy. E.g. one cannot migrate a container
 that has shared subsystems.
 Resource accounting, I agree, might require active aggregation
 at request time.
 
 -- Hubertus
 |  
	|  |  |  
	|  |  
	|  |  
	|  |  
	|  |  
	| 
		
			| Re: The issues for agreeing on a virtualization/namespaces implementation. [message #1436 is a reply to message #1430] | Wed, 08 February 2006 15:57   |  
			| 
				
				
					|  Hubertus Franke Messages: 16
 Registered: February 2006
 | Junior Member |  |  |  
	| Kirill Korotaev wrote:
 >>> Eric W. Biederman wrote:
 >>> So it seems the clone( flags ) is a reasonable approach to create new
 >>> namespaces. Question is what is the initial state of each namespace?
 >>> In pidspace we know we should be creating an empty pidmap !
 >>> In network, someone suggested creating a loopback device
 >>> In uts, create "localhost"
 >>> Are there examples where we rather inherit ?  Filesystem ?
 >>
 >> Of course filesystem is already implemented, and does inherit a full
 >> copy.
 >
 >
 > why do we want to use clone()? Just because of its name and flags?
 > I think it is really strange to fork() to create network context. What
 > does process creation have to do with it?
 >
 > After all these clone()'s are called, some management actions from host
 > system are still required, to add these IPs/routings/etc.
 > So? Why mess it up? Why not create a separate clean interface for
 > container management?
 >
 > Kirill
 >
 
 We need an "init" per container, which represents the context of the
 system represented by the container.
 If that is the case, then why not create the container such that
 we specify which namespaces need to be new for a container at
 container creation time and initialize them to a well understood
 state that makes sense (e.g. copied namespace (FS, uts), fresh state (pid)).
 
 Then use the standard syscall to modify state (now "virtualized" through
 the task->xxx_namespace access ).
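
A minimal userspace sketch of that idea follows, assuming a hypothetical flag `HYP_NEWUTS` and hypothetical helpers `task_clone()` and `task_sethostname()`. The uts namespace is initialized as a copy of the parent's, as suggested above, and the existing "set hostname" operation simply resolves through the caller's `task->uts_ns` pointer.

```c
#include <stdlib.h>
#include <string.h>

#define HYP_NEWUTS 0x1UL  /* hypothetical flag, not a real clone() bit */

struct uts_ns {
    int refcount;         /* number of tasks sharing this namespace */
    char nodename[65];
};

struct task {
    struct uts_ns *uts_ns;  /* the "task->xxx_namespace" indirection */
};

/* Create a child task. With HYP_NEWUTS it gets a private copy of the
 * parent's uts state; otherwise it shares the parent's namespace. */
static int task_clone(const struct task *parent, struct task *child,
                      unsigned long flags)
{
    if (flags & HYP_NEWUTS) {
        struct uts_ns *ns = malloc(sizeof(*ns));
        if (!ns)
            return -1;
        *ns = *parent->uts_ns;  /* initial state: copy of the parent */
        ns->refcount = 1;
        child->uts_ns = ns;
    } else {
        parent->uts_ns->refcount++;
        child->uts_ns = parent->uts_ns;
    }
    return 0;
}

/* Same operation as before, now applied to the caller's own namespace. */
static int task_sethostname(struct task *t, const char *name)
{
    size_t len = strlen(name);

    if (len >= sizeof(t->uts_ns->nodename))
        return -1;
    memcpy(t->uts_ns->nodename, name, len + 1);
    return 0;
}
```

With this layout no new system call is needed to change the hostname inside a container; only the creation step has to decide between sharing and copying.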
 
 Do you see a need to change the namespace of a container after it
 has been created? I am not referring to the state of the namespace
 but to truly moving to a completely different namespace after the
 container has been created.
 
 Obviously you seem to have some other usage in mind, beyond what my
 limited vision can see. Can you share some of those examples, because
 that would help this discussion along ...
 
 Thanks a 10^6.
 
 -- Hubertus
 |  
	|  |  |  
	|  |  
	|  |  
	|  |  
	|  |  
	|  |  
	| 
		
			| Re: The issues for agreeing on a virtualization/namespaces implementation. [message #1447 is a reply to message #1436] | Wed, 08 February 2006 19:02   |  
			| 
				
				
					|  Herbert Poetzl Messages: 239
 Registered: February 2006
 | Senior Member |  |  |  
	| On Wed, Feb 08, 2006 at 10:57:24AM -0500, Hubertus Franke wrote:
 >Kirill Korotaev wrote:
 >>>>Eric W. Biederman wrote:
 >>>>So it seems the clone( flags ) is a reasonable approach to create new
 >>>>namespaces. Question is what is the initial state of each namespace?
 >>>>In pidspace we know we should be creating an empty pidmap !
 >>>>In network, someone suggested creating a loopback device
 >>>>In uts, create "localhost"
 >>>>Are there examples where we rather inherit ?  Filesystem ?
 >>>
 >>>Of course filesystem is already implemented, and does inherit a full
 >>>copy.
 
 I'll try to comment on both mails here because I think that
 clone() is basically a good interface, but will require
 some redesign and/or extension ...
 
 >> why do we want to use clone()?
 
 because it is a natural and existing interface for this purpose
 at least in Linux-VServer it would work (to some extent). why?

 because we already use tools like chbind and chcontext which
 do similar things as chroot, and chroot could, in theory, use
 clone() and rbind to do its job ...
 
 >> Just because of its name and flags?
 
 extending the flags seems natural to me, but the problem might
 actually be that there are not enough of them left
 
 >> I think it is really strange to fork() to create network context.
 
 if you look at it as network namespace and sharing existing
 spaces and/or creating new ones, then clone() and unshare()
 make pretty much sense there ...
 
 >> What does process creation have to do with it?
 
 it is a natural interface where you can decide whether to
 share a space or acquire a new one ... IMHO it would make
 sense to get trivial userspace tools to create those
 new spaces in one go, so that the user can use those
 'building blocks' to create new spaces whenever she needs
 
 >> After all these clone()'s are called, some management actions
 >> from host system are still required, to add these IPs/routings/etc.
 
 not necessarily, for example Linux-VServer uses some kind
 of 'privileged' mode, in which the initial guest process
 can modify various things (like in this case the networking)
 and set up whatever is required, then, shortly after, give
 up those privileges ...
 
 >> So? Why mess it up?
 >> Why not create a separate clean interface for container management?
 
 I'm not against a clean interface at all, but what would
 such a 'clean' interface look like?
 
 - a complicated sysfs interface where you write strange
 values into even stranger places?
 - 40 different syscalls to do stuff like adding or removing
 various parts from the spaces?
 - a new ioctl for processes?
 
 >> Kirill
 >
 > We need an "init" per container, which represents the context of the
 > system represented by the container.
 
 that's only partially true, for example Linux-VServer also
 allows for light-weight guests/containers which do not
 have a separate init process, just a 'fake' one, so we can
 save the resources consumed by a separate init process ...
 
 it turns out that this works perfectly fine, even without
 that fake init, if you teach a few tools like pstree that
 they should not blindly assume that there is a pid=1 :)
 
 > If that is the case, then why not create the container such that
 > we specify what namespaces need to be new for a container at the
 > container creation time and initialize them to a well understood
 > state that makes sense (e.g. copy namespace (FS, uts) , new fresh
 > state (pid) ).
 
 agreed, but now comes the interesting part, what does such
 a well understood state look like for all contexts?
 
 obviously the name space makes a complete copy, leading to
 various issues when you 'try' to get rid of the 'copied'
 data, like host filesystem mount points and such ...
 
 removing the mounts 'above' a certain chroot() path might
 seem like a good solution here, but actually it will cause
 issues when you want to maintain/access a guest/container
 from the host/parent
 
 leaving all mounts in place will require inheriting
 changes from the host/parent in all guests/containers, just
 to avoid having e.g. /mnt/cdrom mounted in process A, which
 does not even see it (as its accessible space starts
 somewhere else), and therefore being unable to eject the
 thing, although it is obviously unused
 
 > Then use the standard syscall to modify state (now "virtualized"
 > through the task->xxx_namespace access ).
 
 works as long as you have a handle for that, and actually
 you do not need one init per guest/container, you need
 one 'uniquely identified' task 'outside' the container
 
 which in the typical case already makes two of them, the
 'handle' task outside and the 'init' task inside ...
 
 > Do you see a need to change the namespace of a container after it
 > has been created? I am not referring to the state of the namespace
 > but to truly moving to a completely different namespace after the
 > container has been created.
 
 container itself, probably not, task, definitely yes ...
 i.e. you definitely want to move between all those spaces
 given that you are sufficiently privileged, which is a
 completely different can of worms ...
 
 > Obviously you seem to have some other usage in mind, beyond what my
 > limited vision can see. Can you share some of those examples, because
 > that would help this discussion along ...
 
 I guess one case where the separate container setup is
 desired is when you want to keep a container alive even
 after the processes have ceased to exist. for example to
 visit a 'stopped' guest/context just to do some emergency
 operations or install some files/packages/etc
 
 IMHO this can easily be solved by keeping the 'handle'
 process (or whatever handle may be used for the complete
 context consisting of all spaces) around, even if no
 process is using those spaces ...
 
 the context as 'collection' of certain namespaces is
 definitely something which will be required to allow to
 'move' into a specific container from the host/parent
 side, as the parent process obviously does not hold that
 information.
 
 best,
 Herbert
 
 > Thanks a 10^6.
 >
 > -- Hubertus
 >
 >
 >
 |  
	|  |  |  
	|  |  
	|  |  
	|  |  
	|  |  
	|  |  
	|  |  
	|  |  
	| 
		
			| Re: The issues for agreeing on a virtualization/namespaces implementation. [message #1464 is a reply to message #1449] | Wed, 08 February 2006 21:22   |  
			| 
				
				
					|  serue Messages: 750
 Registered: February 2006
 | Senior Member |  |  |  
	| Quoting Dave Hansen (haveblue@us.ibm.com):
 > On Wed, 2006-02-08 at 12:03 -0600, Serge E. Hallyn wrote:
 > > Now I believe Eric's code so far would make it so that you can only
 > > refer to a namespace from its *creating* context.  Still restrictive,
 > > but seems acceptable.
 >
 > The same goes for filesystem namespaces.  You can't see into random
 > namespaces, just the ones underneath your own.  Sounds really reasonable
 > to me.
 
 Hmmm?  I suspect I'm misreading what you're saying, but to be clear:
 
 Let's say I start a screen session.  In one of those shells, I clone,
 specify CLONE_NEWNS, and exec a shell.  Now I do a bunch of mounting.
 Other shells in the screen session won't see the results of those
 mounts, and if I ctrl-d, the shell which started the screen session
 can't either.  Each of these is in the "parent filesystem namespace".
 
 OTOH, shared subtrees specified in the parent shell could make it such
 that the parent ns, but not others, see the results.  Is that what
 you're referring to?
 
 thanks,
 -serge
 |  
	|  |  |  
	|  |  
	|  |  
	| 
		
			| Re: The issues for agreeing on a virtualization/namespaces implementation. [message #1471 is a reply to message #1396] | Thu, 09 February 2006 05:41   |  
			| 
				
				
					|  ebiederm Messages: 1354
 Registered: February 2006
 | Senior Member |  |  |  
	| Kyle Moffett <mrmacman_g4@mac.com> writes: 
 > On Feb 07, 2006, at 17:06, Eric W. Biederman wrote:
 >> I think I can boil the discussion down into some of the fundamental questions
 >> that we are facing.
 >>
 >> Currently everyone seems to agree that we need something like my namespace
 >> concept that isolates multiple resources.
 >>
 >> We need these for
 >> UIDS
 >> FILESYSTEM
 >
 > I have one suggestion for this (it also covers capabilities to a certain
 > extent).  Could we use the kernel credentials system to abstract away the
 > concept of a single UID/GID?  We currently have uid, euid, gid, egid, groups,
 > fsid.  I'm thinking that there would be virtualized UID tables to determine
 > ownership of processes/SHM/etc.
 >
 > Each process would have a (uid_container,uid) pair (or similar) as its "uid"
 > and likewise for gid.  Then the ability to send signals to any given
 > (uid_container,uid) or (gid_container,gid) pair would be given by keys in the
 > kernel keyring indexed by the "uid_container" part and containing the "uid"
 > part (or maybe just a pointer).
 >
 > Likewise the filesystem access could be virtualized by using uid and gid keys
 > in the kernel keyring indexed by vfsmount (Not superblock, so that it would be
 > possible to have different UID representations on different mounts/parts of the
 > same filesystem).
 >
 > I'm guessing that the performance implications of the above would not be quite
 > so nice, as it would put a lot of code in the fastpath, but I would guess that
 > it might be possible to use the existing fields for processes without any
 > virtualization needs.
 
 At least for signal sending it looks like it would be easier to just compare
 the pointers to struct user.  At least in that context it looks like it
 would be as cheap as what we are doing now.  I just don't know where
 to find a struct user for the euid, or is it the normal uid?
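
To illustrate why a pointer comparison could be as cheap as today's uid comparison, here is a small userspace sketch. It assumes that every (uid_container, uid) pair is represented by exactly one object; `struct user`, `user_get()` and `same_user()` are hypothetical stand-ins, not the kernel's actual structures or signal permission rules.

```c
#include <stddef.h>

#define MAX_USERS 16

/* Hypothetical model: one object per (uid_container, uid) pair. */
struct user {
    int uid_container;  /* which container the uid belongs to */
    int uid;            /* uid as seen inside that container */
    int in_use;
};

static struct user user_table[MAX_USERS];

/* Return the unique struct user for the pair, creating it on first use.
 * Returns NULL if the table is full. */
static struct user *user_get(int uid_container, int uid)
{
    struct user *free_slot = NULL;

    for (size_t i = 0; i < MAX_USERS; i++) {
        struct user *u = &user_table[i];

        if (u->in_use && u->uid_container == uid_container && u->uid == uid)
            return u;
        if (!u->in_use && !free_slot)
            free_slot = u;
    }
    if (free_slot) {
        free_slot->uid_container = uid_container;
        free_slot->uid = uid;
        free_slot->in_use = 1;
    }
    return free_slot;
}

/* Each pair maps to one object, so "same user" is a single pointer compare. */
static int same_user(const struct user *sender, const struct user *target)
{
    return sender == target;
}
```

Uid 1000 in container 1 and uid 1000 in container 2 yield different pointers, so the comparison fails without ever looking at the container id explicitly.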
 
 Eric
 |  
	|  |  |  
	| 
		
			| Re: The issues for agreeing on a virtualization/namespaces implementation. [message #1473 is a reply to message #1396] | Thu, 09 February 2006 04:45   |  
			| 
				
				
					|  Kyle Moffett Messages: 4
 Registered: February 2006
 | Junior Member |  |  |  
	| On Feb 07, 2006, at 17:06, Eric W. Biederman wrote:
 > I think I can boil the discussion down into some of the fundamental
 > questions that we are facing.
 >
 > Currently everyone seems to agree that we need something like my
 > namespace concept that isolates multiple resources.
 >
 > We need these for
 > UIDS
 > FILESYSTEM
 
 I have one suggestion for this (it also covers capabilities to a
 certain extent).  Could we use the kernel credentials system to
 abstract away the concept of a single UID/GID?  We currently have
 uid, euid, gid, egid, groups, fsid.  I'm thinking that there would be
 virtualized UID tables to determine ownership of processes/SHM/etc.
 
 Each process would have a (uid_container,uid) pair (or similar) as
 its "uid" and likewise for gid.  Then the ability to send signals to
 any given (uid_container,uid) or (gid_container,gid) pair would be
 given by keys in the kernel keyring indexed by the "uid_container"
 part and containing the "uid" part (or maybe just a pointer).
 
 Likewise the filesystem access could be virtualized by using uid and
 gid keys in the kernel keyring indexed by vfsmount (Not superblock,
 so that it would be possible to have different UID representations on
 different mounts/parts of the same filesystem).
 
 I'm guessing that the performance implications of the above would not
 be quite so nice, as it would put a lot of code in the fastpath, but
 I would guess that it might be possible to use the existing fields
 for processes without any virtualization needs.
 
 Cheers,
 Kyle Moffett
 
 --
 There is no way to make Linux robust with unreliable memory
 subsystems, sorry.  It would be like trying to make a human more
 robust with an unreliable O2 supply. Memory just has to work.
 -- Andi Kleen
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH 1/4] Virtualization/containers: introduction [message #1483 is a reply to message #1480] | Thu, 09 February 2006 17:47   |  
			| 
				
				
					|  Jeff Dike Messages: 4
 Registered: February 2006
 | Junior Member |  |  |  
	| On Thu, Feb 09, 2006 at 11:38:31AM -0500, Hubertus Franke wrote:
 > Jeff, interesting, but won't that pose some serious scalability issue?
 > Imagine 100s of containers/namespaces?
 
 In terms of memory?
 
 Running size on sched.o gives me this on x86_64:

    text    data     bss     dec     hex filename
   35685    6880   28800   71365   116c5 sched.o

 and on i386 (actually UML/i386):

    text    data     bss     dec     hex filename
   10010      36    2504   12550    3106 obj/kernel/sched.o
 
 I'm not sure why there's such a big difference, but 100 instances adds
 a meg or two (or three) to the kernel.  This overstates things a bit
 because there are things in sched.c which wouldn't be duplicated, like
 the system calls.
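
The estimate can be reproduced from the `size` figures above. The sketch below is plain arithmetic in C; the struct and function names are made up for illustration, and the second function rests on the assumption (not stated in the post) that only data and bss would be duplicated per instance while the code is shared.

```c
#include <stdio.h>

/* Sizes in bytes, copied from the `size` output quoted in this message. */
struct obj_size {
    const char *arch;
    long text, data, bss;
};

static const struct obj_size sched_x86_64 = { "x86_64",   35685, 6880, 28800 };
static const struct obj_size sched_i386   = { "UML/i386", 10010,   36,  2504 };

/* Upper bound: every instance duplicates text, data and bss. */
static long total_bytes(const struct obj_size *s, long instances)
{
    return (s->text + s->data + s->bss) * instances;
}

/* Assumption for comparison: text is shared, only data and bss are copied. */
static long data_bytes(const struct obj_size *s, long instances)
{
    return (s->data + s->bss) * instances;
}

static void print_estimate(const struct obj_size *s, long instances)
{
    printf("%-9s %ld instances: %ld bytes (all), %ld bytes (data+bss)\n",
           s->arch, instances,
           total_bytes(s, instances), data_bytes(s, instances));
}
```

For 100 instances the upper bound is 1,255,000 bytes on UML/i386 and 7,136,500 bytes on x86_64; counting only data and bss gives 254,000 and 3,568,000 bytes respectively.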
 
 How big a deal is that on a system which you plan to have 100s of
 containers on anyway?
 
 It's heavier than your namespaces, but does have the advantage that it
 imposes no cost when it's not being used.
 
 > The namespace is mainly there to identify which data needs to be private
 > when multiple instances of a subsystem are considered and
 > encapsulate that data in an object/datastructure !
 
 Sure, and that's a fine approach.  It's just not the only one.
 
 Jeff
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH 1/4] Virtualization/containers: introduction [message #1487 is a reply to message #1483] | Thu, 09 February 2006 22:09   |  
			| 
				
				
					|  Sam Vilain Messages: 73
 Registered: February 2006
 | Member |  |  |  
	| Jeff Dike wrote:
 > On Thu, Feb 09, 2006 at 11:38:31AM -0500, Hubertus Franke wrote:
 >>Jeff, interesting, but won't that pose some serious scalability issue?
 >>Imagine 100s of containers/namespaces?
 > In terms of memory?
 > Running size on sched.o gives me this on x86_64:
 >    text    data     bss     dec     hex filename
 >   35685    6880   28800   71365   116c5 sched.o
 >
 > and on i386 (actually UML/i386)
 >
 >    text    data     bss     dec     hex filename
 >   10010      36    2504   12550    3106 obj/kernel/sched.o
 >
 > I'm not sure why there's such a big difference, but 100 instances adds
 > a meg or two (or three) to the kernel.  This overstates things a bit
 > because there are things in sched.c which wouldn't be duplicated, like
 > the system calls.
 >
 > How big a deal is that on a system which you plan to have 100s of
 > containers on anyway?
 
 Quite a big deal.  You might have 2 GB of main memory, but your CPU is
 unlikely to have more than a megabyte of cache within close reach.  A meg
 or two of scheduler data and code means that your L1 and L2 caches will be
 cycling every scheduler round, which is OK if you have very short runqueues,
 but as you get more and more processes it will really start to hurt.
 
 Remember, systems today are memory bound, and the more you can do to
 reduce the amount of time the system sits around waiting for memory
 fetches, the better.
 
 Compare that to the Token Bucket Scheduler of Linux-VServer; a tiny
 struct for each process umbrella, that will generally fit in one or two
 cachelines, to which the scheduling support adds four ints and a
 spinlock.  With this it achieves fair CPU scheduling between vservers.
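
For readers unfamiliar with the technique, a generic token bucket can be sketched as follows. Only "four ints and a spinlock" comes from the description above; the field names, the refill rule and the helper functions are assumptions for illustration, a C11 `atomic_flag` stands in for the spinlock, and the actual Linux-VServer code may differ.

```c
#include <stdatomic.h>

/* Hypothetical layout: four ints plus a lock, one bucket per context. */
struct token_bucket {
    int fill_rate;     /* tokens added per interval */
    int interval;      /* ticks between refills */
    int tokens;        /* tokens currently available */
    int tokens_max;    /* bucket capacity */
    atomic_flag lock;  /* stands in for the spinlock */
};

static void tb_init(struct token_bucket *tb, int fill_rate, int interval,
                    int tokens_max)
{
    tb->fill_rate = fill_rate;
    tb->interval = interval;
    tb->tokens = 0;    /* start empty */
    tb->tokens_max = tokens_max;
    atomic_flag_clear(&tb->lock);
}

static void tb_lock(struct token_bucket *tb)
{
    while (atomic_flag_test_and_set_explicit(&tb->lock, memory_order_acquire))
        ;
}

static void tb_unlock(struct token_bucket *tb)
{
    atomic_flag_clear_explicit(&tb->lock, memory_order_release);
}

/* Credit tokens for `ticks` elapsed ticks; only whole intervals count. */
static void tb_refill(struct token_bucket *tb, int ticks)
{
    tb_lock(tb);
    tb->tokens += (ticks / tb->interval) * tb->fill_rate;
    if (tb->tokens > tb->tokens_max)
        tb->tokens = tb->tokens_max;
    tb_unlock(tb);
}

/* Charge one token for one tick of CPU; returns 0 if the bucket is empty. */
static int tb_consume(struct token_bucket *tb)
{
    int ok;

    tb_lock(tb);
    ok = tb->tokens > 0;
    if (ok)
        tb->tokens--;
    tb_unlock(tb);
    return ok;
}
```

A context whose bucket is empty would be held back until the next refill, which is how such a scheme bounds each context's share of CPU over time.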
 
 Sam.
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: The issues for agreeing on a virtualization/namespaces implementation. [message #1700 is a reply to message #1403] | Mon, 20 February 2006 12:08   |  
			| 
				
				
					|  dev Messages: 1693
 Registered: September 2005
 Location: Moscow
 | Senior Member |  
 |  |  
	| >> The questions seem to break down into:
 >> 1) Where do we put the references to the different namespaces?
 >>    - Do we put the references in a struct container that we reference
 >> from struct task_struct?
 >>    - Do we put the references directly in struct task_struct?
 >
 >
 > You "cache"   task_struct->container->hotsubsys   under
 > task_struct->hotsubsys.
 > We don't change containers other than at clone time, so no coherency
 > issue here !!!!
 > Which subsystems pointers to "cache", should be agreed by the experts,
 > but first approach should always not to cache and go through the container.
 agreed. I see not much reason to cache it and make tons of the same
 pointers in all the tasks. Only if needed.
 Also, in OpenVZ the container has many fields integrated inside, so there
 is no additional dereference, just task->container->subsys_field
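
The two access paths being compared can be put side by side in a short userspace sketch. All names here (`hot_net`, `task_attach()`, the two accessors) are hypothetical, and the sketch relies on the assumption stated in the quoted text that a task's container only changes at clone time, so a cached pointer never needs invalidating.

```c
#include <stddef.h>

struct net_ns { int id; };             /* stand-in for a "hot" subsystem */
struct uts_ns { char nodename[65]; };  /* stand-in for a "cold" subsystem */

struct container {
    struct net_ns *net;
    struct uts_ns *uts;
};

struct task {
    struct container *container;  /* default path: task->container->subsys */
    struct net_ns *hot_net;       /* optional cache of container->net */
};

/* Done once at clone time; the cache is filled here and never updated. */
static void task_attach(struct task *t, struct container *c)
{
    t->container = c;
    t->hot_net = c->net;
}

/* Two dereferences: task -> container -> subsystem. */
static struct net_ns *task_net_via_container(const struct task *t)
{
    return t->container->net;
}

/* One dereference, paid for with an extra pointer in every task. */
static struct net_ns *task_net_cached(const struct task *t)
{
    return t->hot_net;
}
```

The trade-off is visible in `struct task`: every cached subsystem adds one pointer per task, which is the cost being questioned in the reply above.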
 
 >> 2) What is the syscall interface to create these namespaces?
 >>    - Do we add clone flags?       (Plan 9 style)
 > Like that approach .. flexible .. particular when one has well specified
 > namespaces.
 mmm, how do you plan to pass additional flags to clone()?
 e.g. strong or weak isolation of pids?
 
 another question:
 how do you plan to meet the dependencies between namespaces?
 e.g. conntracks require netfilters to be initialized.
 network requires sysctls and proc to be initialized and so on.
 do you propose to track all this in clone()? huh...
 
 >>    - Do we add a syscall (similar to setsid) per namespace?
 >>      (Traditional unix style)?
 can be so...
 
 >>    - Do we in addition add syscalls to manipulate containers generically?
 >>
 >>    I don't think having a single system call to create a container and
 >> a new
 >>    instance of each namespace is reasonable as that does not give us a
 >>    path into the future when we create yet another namespace.
 >>
 > Agreed.
 why do you think so?
 this syscall will start handling the new namespace and that's all.
 this is no different from the many-syscalls approach.
 
 >> 4) How do we implement each of these namespaces?
 >>    Besides being maintainable are there other constraints?
 >>
 > Good question... at least with PID and FS two are there ..
 >>
 >> 6) How do we do all of this efficiently without a noticeable impact on
 >>    performance?
 >>    - I have already heard concerns that I might be introducing cache
 >>      line bounces and thus increasing tasklist_lock hold time.
 >>      Which on big way systems can be a problem.
 this is nothing compared to hierarchy operations.
 BTW, hierarchy also introduces complicated resource accounting,
 sometimes making it even impossible.
 
 Kirill
 |  
	|  |  |  
	| 
		
			| Re: The issues for agreeing on a virtualization/namespaces implementation. [message #1701 is a reply to message #1700] | Mon, 20 February 2006 12:40   |  
			| 
				
				
					|  Herbert Poetzl Messages: 239
 Registered: February 2006
 | Senior Member |  |  |  
	| On Mon, Feb 20, 2006 at 03:11:32PM +0300, Kirill Korotaev wrote:
 > >>The questions seem to break down into:
 > >>1) Where do we put the references to the different namespaces?
 > >>   - Do we put the references in a struct container that we reference
 > >>from struct task_struct?
 > >>   - Do we put the references directly in struct task_struct?
 > >
 > >
 > >You "cache"   task_struct->container->hotsubsys   under
 > >task_struct->hotsubsys.
 > >We don't change containers other than at clone time, so no coherency
 > >issue here !!!!
 > >Which subsystems pointers to "cache", should be agreed by the experts,
 > >but first approach should always not to cache and go through the container.
 > agreed. I see not much reason to cache it and make tons of the same
 > pointers in all the tasks. Only if needed.

 > Also, in OpenVZ the container has many fields integrated inside, so there
 > is no additional dereference, just task->container->subsys_field
 
 as does Linux-VServer currently, but do you have
 any proof that putting all the fields together in
 one big structure actually has any (dis)advantage
 over separate structures?
 
 > >>2) What is the syscall interface to create these namespaces?
 > >>   - Do we add clone flags?       (Plan 9 style)
 > >Like that approach .. flexible .. particular when one has well
 > >specified namespaces.
 > mmm, how do you plan to pass additional flags to clone()?
 > e.g. strong or weak isolation of pids?
 
 do you really have to pass them at clone() time?
 would shortly after be more than enough?
 what if you want to change those properties later?
 
 > another question:
 > how do you plan to meet the dependencies between namespaces?
 > e.g. conntracks require netfilters to be initialized.
 > network requires sysctls and proc to be initialized and so on.
 > do you propose to track all this in clone()? huh...
 
 this is missing isolation/virtualization, and I guess
 it has to be done to make those spaces useful ...
 
 > >>   - Do we add a syscall (similar to setsid) per namespace?
 > >>     (Traditional unix style)?
 > can be so...
 >
 > >>   - Do we in addition add syscalls to manipulate containers
 > >>   generically?
 > >>
 > >>   I don't think having a single system call to create a container
 > >>   and a new instance of each namespace is reasonable as that does
 > >>   not give us a path into the future when we create yet another
 > >>   namespace.
 > >Agreed.
 > why do you think so?
 
 > this syscall will start handling the new namespace and that's all.
 > this is no different from the many-syscalls approach.

 well, let's defer the 'how many syscalls' issue to
 a later time, when we know what we want to implement :)
 
 > >>4) How do we implement each of these namespaces?
 > >>   Besides being maintainable are there other constraints?
 > >>
 > >Good question... at least with PID and FS two are there ..
 > >>
 > >>6) How do we do all of this efficiently without a noticeable impact on
 > >>   performance?
 > >>   - I have already heard concerns that I might be introducing cache
 > >>     line bounces and thus increasing tasklist_lock hold time.
 > >>     Which on big way systems can be a problem.
 
 > this is nothing compared to hierarchy operations.
 > BTW, hierarchy also introduces complicated resource accounting,
 > sometimes making it even impossible.
 
 well, depends how you do it ...
 
 best,
 Herbert
 
 > Kirill
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: The issues for agreeing on a virtualization/namespaces implementation. [message #1706 is a reply to message #1704] | Mon, 20 February 2006 15:16   |  
			| 
				
				
					|  Herbert Poetzl Messages: 239
 Registered: February 2006
 | Senior Member |  |  |  
	| On Mon, Feb 20, 2006 at 05:26:13PM +0300, Kirill Korotaev wrote:
 >> as does Linux-VServer currently, but do you have
 >> any proof that putting all the fields together in
 >> one big structure actually has any (dis)advantage
 >> over separate structures?
 
 > have no proof and don't mind if there are many pointers.
 > Though this doesn't look helpful to me as well.
 
 well, my point is just that we don't know yet
 so we should not favor one over the other, just
 because somebody did it like that and it didn't
 hurt :)
 
 >>> mmm, how do you plan to pass additional flags to clone()?
 >>> e.g. strong or weak isolation of pids?
 
 >> do you really have to pass them at clone() time?
 >> would shortly after be more than enough?
 >> what if you want to change those properties later?
 
 > I don't think it is always suitable to do configuration later.
 > We had races in OpenVZ on VPS create/stop against exec/enter etc.
 > (even introduced flag is_running).
 > So I have some experience to believe it will be a painful place.
 
 well, Linux-VServer uses a state called 'setup'
 which allows to change all kinds of things before
 the guest can be entered, this state is changed
 as the last operation of the setup, which in turn
 drops all the capabilities and makes the guest
 visible to the outside ...
 
 works quite well and seems to be free of those
 races you mentioned ...
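
The 'setup' state described above amounts to a small state machine. The following single-threaded sketch uses hypothetical names (`struct guest`, `guest_configure()`, `guest_finish_setup()`, `guest_enter()`) and an invented capability mask; it only illustrates the ordering rule, not the real Linux-VServer implementation or its locking.

```c
#include <string.h>

enum guest_state { GUEST_SETUP, GUEST_RUNNING };

struct guest {
    enum guest_state state;
    unsigned int caps;   /* hypothetical capability mask held during setup */
    char nodename[65];
};

/* A new guest starts in setup, privileged and not yet visible. */
static void guest_create(struct guest *g)
{
    g->state = GUEST_SETUP;
    g->caps = ~0u;
    g->nodename[0] = '\0';
}

/* Configuration is accepted only while the guest is still in setup. */
static int guest_configure(struct guest *g, const char *nodename)
{
    if (g->state != GUEST_SETUP || strlen(nodename) >= sizeof(g->nodename))
        return -1;
    strcpy(g->nodename, nodename);
    return 0;
}

/* Last operation of setup: drop capabilities, make the guest visible. */
static void guest_finish_setup(struct guest *g)
{
    g->caps = 0;
    g->state = GUEST_RUNNING;
}

/* Entering is refused until setup has finished. */
static int guest_enter(const struct guest *g)
{
    return g->state == GUEST_RUNNING ? 0 : -1;
}
```

Because enter is refused during setup and configuration is refused afterwards, the two can never overlap in this model, which is the property claimed above.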
 
 >>> this syscall will start handling the new namespace and that's all.
 >>> this is no different from the many-syscalls approach.
 >> well, let's defer the 'how many syscalls' issue to
 >> a later time, when we know what we want to implement :)
 > agreed.
 
 btw, maybe it's just me, but would it be possible
 to do the email quoting like this:
 
 >>> Text
 
 instead of
 
 > >>Text
 
 TIA,
 Herbert
 
 > Kirill
 |  
	|  |  |  
	|  | 
 
 