OpenVZ Forum: Devel » Namespaces exhausted CLONE_XXX bits problem

Namespaces exhausted CLONE_XXX bits problem [message #26003]

Mon, 14 January 2008 13:45

Pavel Emelianov
Messages: 1149
Registered: September 2006

Senior Member

Hi, guys!

I started looking at PTYs/TTYs/Console to make the appropriate
namespace and suddenly remembered that we have already
exhausted all the CLONE_ bits in the 32-bit mask.

So, I recalled the discussions we had and saw the following 
proposals for how to tackle this problem (with their disadvantages):

1. make the clone2 system call with 64-bit mask
   - this is a new system call
2. re-use CLONE_STOPPED
   - this will give us only one bit
3. merge existing bits into one
   - we lose the ability to create them separately
4. implement a sys_unshare_ns system call with 64bit/arbitrary mask
   - this is a new system call
   - this will bring some asymmetry between namespaces
5. use sys_indirect
   - this one is not even in the -mm tree yet, and it's questionable
     whether it ever will be

I have one more suggestion:

6. re-use bits that don't make sense in sys_unshare (e.g.
   CLONE_STOPPED, CLONE_PARENT_SETTID, CLONE_VFORK, etc.)
   This will give us ~16 new bits, but it will not look very nice.

What do you think about all of this?

Thanks,
Pavel
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

Re: Namespaces exhausted CLONE_XXX bits problem [message #26014 is a reply to message #26003]

Mon, 14 January 2008 14:44

Cedric Le Goater
Messages: 443
Registered: February 2006

Senior Member

Hello Pavel !

Pavel Emelyanov wrote:
> Hi, guys!
> 
> I started looking at PTYs/TTYs/Console to make the appropriate
> namespace and suddenly remembered that we have already
> exhausted all the CLONE_ bits in the 32-bit mask.

yes, nearly. Only 1 bit will be left after the mq_namespace I'm going to send.

> So, I recalled the discussions we had and saw the following 
> proposals for how to tackle this problem (with their disadvantages):
> 
> 1. make the clone2 system call with 64-bit mask
>    - this is a new system call

sys_clone2 is used on ia64 ... so we would need another name.

clone_ns() would be nice but it's too specific to namespaces unless 
we agree that we need a new syscall specific to namespaces. 

clone_new or clone_large ? 

> 2. re-use CLONE_STOPPED
>    - this will give us only one bit

not enough.

> 3. merge existing bits into one
>    - we lose the ability to create them separately

it would be useful to have such a flag though, something like CLONE_ALLN,
because it's the one everyone is going to use.

what I was looking at in December is 1. and 3.: a new general-purpose
clone syscall with extended flags. The all-in-one flag is not an issue, but it
would be nice to keep the last clone flag for this purpose.
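A minimal sketch of how such an all-in-one bit could be expanded into the individual namespace bits, assuming glibc's <sched.h> exposes the CLONE_NEW* constants. CLONE_ALLNS_SKETCH, its bit value and expand_allns() are hypothetical, chosen only for illustration, and are not real kernel flags or helpers:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Union of the per-namespace bits that exist today. */
#define NS_MASK_SKETCH \
	((unsigned int)(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | \
			CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET))

/* Hypothetical all-in-one flag; the bit value is illustrative only. */
#define CLONE_ALLNS_SKETCH 0x80000000u

/* Replace the all-in-one bit with the individual namespace bits, leaving
 * every other requested flag untouched. */
static unsigned int expand_allns(unsigned int flags)
{
	if (flags & CLONE_ALLNS_SKETCH)
		flags = (flags & ~CLONE_ALLNS_SKETCH) | NS_MASK_SKETCH;
	return flags;
}
```

Because the individual bits stay available, this complements option 3 rather than replacing them, so the ability to create namespaces separately is not lost.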

Now, if we use 64 bits, we have a few issues/cleanups to solve. First, in
kernel land, the clone_flags are passed down to the security modules

	security_task_create()

so we'll have to change the kernel API. I don't remember anything else
blocking.

In user land, we need to choose a prototype that also supports 32-bit
arches, so it could be:

	long sys_clone_new(struct clone_new_args)

or

	long sys_clone_new(... unsigned long flags_high, unsigned long flags_low ...)

Second option might be an issue because clone already has 6 arguments.
right ?
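With the second prototype, a 64-bit mask would travel as two register-sized halves on 32-bit arches. A minimal sketch of the split and the reassembly, assuming the flags_high/flags_low convention above; the helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for passing a 64-bit clone mask as two 32-bit
 * register arguments, as in the flags_high/flags_low prototype. */
static inline uint32_t clone_flags_high(uint64_t flags)
{
	return (uint32_t)(flags >> 32);
}

static inline uint32_t clone_flags_low(uint64_t flags)
{
	return (uint32_t)flags;	/* truncation keeps the low 32 bits */
}

static inline uint64_t clone_flags_join(uint32_t high, uint32_t low)
{
	/* Widen before shifting, otherwise the high half would be lost. */
	return ((uint64_t)high << 32) | low;
}
```

The struct-based prototype avoids consuming two argument slots, at the cost of copying the structure in from user space.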

> 4. implement a sys_unshare_ns system call with 64bit/arbitrary mask
>    - this is a new system call

I think that a new clone deserves a new unshare.

>    - this will bring some asymmetry between namespaces

what do you mean ?

> 5. use sys_indirect
>    - this one is not even in the -mm tree yet, and it's questionable
>      whether it ever will be

I don't know much about that one.

C.

> I have one more suggestion:
> 
> 6. re-use bits that don't make sense in sys_unshare (e.g.
>    CLONE_STOPPED, CLONE_PARENT_SETTID, CLONE_VFORK, etc.)
>    This will give us ~16 new bits, but it will not look very nice.
> 
> What do you think about all of this?
> 
> Thanks,
> Pavel

Re: Namespaces exhausted CLONE_XXX bits problem [message #26016 is a reply to message #26014]

Mon, 14 January 2008 14:50

Pavel Emelianov
Messages: 1149
Registered: September 2006

Senior Member

Cedric Le Goater wrote:
> Hello Pavel !
> 
> Pavel Emelyanov wrote:
>> Hi, guys!
>>
>> I started looking at PTYs/TTYs/Console to make the appropriate
>> namespace and suddenly remembered that we have already
>> exhausted all the CLONE_ bits in the 32-bit mask.
> 
> yes, nearly. Only 1 bit will be left after the mq_namespace I'm going to send.

Yup. That's why I think that we should first solve this
issue and then send more namespaces.

>> So, I recalled the discussions we had and saw the following 
>> proposals for how to tackle this problem (with their disadvantages):
>>
>> 1. make the clone2 system call with 64-bit mask
>>    - this is a new system call
> 
> sys_clone2 is used on ia64 ... so we would need another name.
>  
> clone_ns() would be nice but it's too specific to namespaces unless 
> we agree that we need a new syscall specific to namespaces. 
> 
> clone_new or clone_large ? 

clone3 :) Just kidding. _If_ we implement new system calls, then I'd
prefer a clone_ns and unshare_ns pair.

>> 2. re-use CLONE_STOPPED
>>    - this will give us only one bit
> 
> not enough.

Yup :)

>> 3. merge existing bits into one
>>    - we lose the ability to create them separately
> 
> it would be useful to have such a flag though, something like CLONE_ALLN,
> because it's the one everyone is going to use.
> 
> what I was looking at in December is 1. and 3.: a new general-purpose
> clone syscall with extended flags. The all-in-one flag is not an issue, but it
> would be nice to keep the last clone flag for this purpose.
> 
> Now, if we use 64 bits, we have a few issues/cleanups to solve. First, in
> kernel land, the clone_flags are passed down to the security modules
>  
> 	security_task_create()
> 
> so we'll have to change the kernel API. I don't remember anything else
> blocking.
> 
> In user land, we need to choose a prototype that also supports 32-bit
> arches, so it could be:
> 
> 	long sys_clone_new(struct clone_new_args)
> 
> or
> 
> 	long sys_clone_new(... unsigned long flags_high, unsigned long flags_low ...)
> 
> Second option might be an issue because clone already has 6 arguments.
> right ?

Yes.

> 
>> 4. implement a sys_unshare_ns system call with 64bit/arbitrary mask
>>    - this is a new system call
> 
> I think that a new clone deserves a new unshare.
>  
>>    - this will bring some asymmetry between namespaces
> 
> what do you mean ?

I mean that some namespaces will be unshare-only, while others will be
clone-and-unshare.

>> 5. use sys_indirect
>>    - this one is not even in the -mm tree yet, and it's questionable
>>      whether it ever will be
> 
> I don't know much about that one.
> 
> C.

So you seem to prefer a "new system call" approach, right?

>> I have one more suggestion:
>>
>> 6. re-use bits that don't make sense in sys_unshare (e.g.
>>    CLONE_STOPPED, CLONE_PARENT_SETTID, CLONE_VFORK, etc.)
>>    This will give us ~16 new bits, but it will not look very nice.
>>
>> What do you think about all of this?
>>
>> Thanks,
>> Pavel

Re: Namespaces exhausted CLONE_XXX bits problem [message #26026 is a reply to message #26016]

Mon, 14 January 2008 15:20

Cedric Le Goater
Messages: 443
Registered: February 2006

Senior Member

>>> I started looking at PTYs/TTYs/Console to make the appropriate
>>> namespace and suddenly remembered that we have already
>>> exhausted all the CLONE_ bits in the 32-bit mask.
>> yes, nearly. Only 1 bit will be left after the mq_namespace I'm going to send.
> 
> Yup. That's why I think that we should first solve this
> issue and then send more namespaces.

OK. 

>>> So, I recalled the discussions we had and saw the following 
>>> proposals for how to tackle this problem (with their disadvantages):
>>>
>>> 1. make the clone2 system call with 64-bit mask
>>>    - this is a new system call
>> sys_clone2 is used on ia64 ... so we would need another name.
>>  
>> clone_ns() would be nice but it's too specific to namespaces unless 
>> we agree that we need a new syscall specific to namespaces. 
>>
>> clone_new or clone_large ? 
> 
> clone3 :) Just kidding. _If_ we implement new system calls, then I'd
> prefer a clone_ns and unshare_ns pair.

We will find a name.

>>> 2. re-use CLONE_STOPPED
>>>    - this will give us only one bit
>> not enough.
> 
> Yup :)
> 
>>> 3. merge existing bits into one
>>>    - we lose the ability to create them separately
>> it would be useful to have such a flag though, something like CLONE_ALLN,
>> because it's the one everyone is going to use.
>>
>> what I was looking at in December is 1. and 3.: a new general-purpose
>> clone syscall with extended flags. The all-in-one flag is not an issue, but it
>> would be nice to keep the last clone flag for this purpose.
>>
>> Now, if we use 64 bits, we have a few issues/cleanups to solve. First, in
>> kernel land, the clone_flags are passed down to the security modules
>>  
>> 	security_task_create()
>>
>> so we'll have to change the kernel API. I don't remember anything else
>> blocking.
>>
>> In user land, we need to choose a prototype that also supports 32-bit
>> arches, so it could be:
>>
>> 	long sys_clone_new(struct clone_new_args)
>>
>> or
>>
>> 	long sys_clone_new(... unsigned long flags_high, unsigned long flags_low ...)
>>
>> Second option might be an issue because clone already has 6 arguments.
>> right ?
> 
> Yes.
> 
>>> 4. implement a sys_unshare_ns system call with 64bit/arbitrary mask
>>>    - this is a new system call
>> I think that a new clone deserves a new unshare.
>>  
>>>    - this will bring some asymmetry between namespaces
>> what do you mean ?
> 
> I mean that some namespaces will be unshare-only, while others will be
> clone-and-unshare.

OK, but we already have that: the pid namespace, for instance.

>>> 5. use sys_indirect
>>>    - this one is not even in the -mm tree yet, and it's questionable
>>>      whether it ever will be
>> I don't know much about that one.
>>
>> C.
> 
> So you seem to prefer a "new system call" approach, right?

yes. 

to be more precise :

	long sys_clone_something(struct clone_something_args args) 

and 

	long sys_unshare_something(struct unshare_something_args args) 

The arg passing will be slower because of the copy_from_user(), but we will
still have the sys_clone syscall for the fast path.

C.

Re: Namespaces exhausted CLONE_XXX bits problem [message #26030 is a reply to message #26026]

Mon, 14 January 2008 16:32

serue
Messages: 750
Registered: February 2006

Senior Member

Quoting Cedric Le Goater (clg@fr.ibm.com):
> to be more precise :
> 
> 	long sys_clone_something(struct clone_something_args args) 
> 
> and 
> 
> 	long sys_unshare_something(struct unshare_something_args args) 
> 
> The arg passing will be slower because of the copy_from_user(), but we will
> still have the sys_clone syscall for the fast path.
> 
> C.

I'm fine with the direction you're going, but just as one more option,
we could follow more of the selinux/lsm approach of first requesting
clone/unshare options, then doing the actual clone/unshare.  So
something like

	sys_clone_request(extended_64bit_clone_flags)
	sys_clone(usual args)

or

	echo pid,mqueue,user,ipc,uts,net > /proc/self/clone_unshare
	clone()

-serge

Re: Namespaces exhausted CLONE_XXX bits problem [message #26032 is a reply to message #26030]

Mon, 14 January 2008 16:52

Pavel Emelianov
Messages: 1149
Registered: September 2006

Senior Member

Serge E. Hallyn wrote:
> Quoting Cedric Le Goater (clg@fr.ibm.com):
>> to be more precise :
>>
>> 	long sys_clone_something(struct clone_something_args args) 
>>
>> and 
>>
>> 	long sys_unshare_something(struct unshare_something_args args) 
>>
>> The arg passing will be slower because of the copy_from_user(), but we will
>> still have the sys_clone syscall for the fast path.
>>
>> C.
> 
> I'm fine with the direction you're going, but just as one more option,
> we could follow more of the selinux/lsm approach of first requesting
> clone/unshare options, then doing the actual clone/unshare.  So
> something like
> 
> 	sys_clone_request(extended_64bit_clone_flags)

What if we someday hit the 64-bit limit? :)

> 	sys_clone(usual args)
> 
> or
> 
> 	echo pid,mqueue,user,ipc,uts,net > /proc/self/clone_unshare
> 	clone()

Well, this is how sys_indirect() was intended to work. Nobody
liked it, so I'm afraid this will also not be accepted.

> -serge
> 

Thanks,
Pavel

Re: Namespaces exhausted CLONE_XXX bits problem [message #26036 is a reply to message #26032]

Mon, 14 January 2008 18:07

serue
Messages: 750
Registered: February 2006

Senior Member

Quoting Pavel Emelyanov (xemul@openvz.org):
> Serge E. Hallyn wrote:
> > Quoting Cedric Le Goater (clg@fr.ibm.com):
> >> to be more precise :
> >>
> >> 	long sys_clone_something(struct clone_something_args args) 
> >>
> >> and 
> >>
> >> 	long sys_unshare_something(struct unshare_something_args args) 
> >>
> >> The arg passing will be slower because of the copy_from_user(), but we will
> >> still have the sys_clone syscall for the fast path.
> >>
> >> C.
> > 
> > I'm fine with the direction you're going, but just as one more option,
> > we could follow more of the selinux/lsm approach of first requesting
> > clone/unshare options, then doing the actual clone/unshare.  So
> > something like
> > 
> > 	sys_clone_request(extended_64bit_clone_flags)
> 
> What if we someday hit the 64-bit limit? :)
> 
> > 	sys_clone(usual args)
> > 
> > or
> > 
> > 	echo pid,mqueue,user,ipc,uts,net > /proc/self/clone_unshare
> > 	clone()
> 
> Well, this is how sys_indirect() was intended to work. Nobody
> liked it, so I'm afraid this will also not be accepted.

I would have thought sys_indirect would be disliked because
it looks like an ioctl type multiplexor.  Whereas sys_clone_request()
or /proc/self/clone_unshare simply sets arguments in advance, the
way /proc/self/attr/current does.

-serge

Re: Namespaces exhausted CLONE_XXX bits problem [message #26039 is a reply to message #26036]

Mon, 14 January 2008 21:36

Oren Laadan
Messages: 71
Registered: August 2007

Member

Serge E. Hallyn wrote:
> Quoting Pavel Emelyanov (xemul@openvz.org):
>> Serge E. Hallyn wrote:
>>> Quoting Cedric Le Goater (clg@fr.ibm.com):
>>>> to be more precise :
>>>>
>>>> 	long sys_clone_something(struct clone_something_args args) 
>>>>
>>>> and 
>>>>
>>>> 	long sys_unshare_something(struct unshare_something_args args) 
>>>>
>>>> The arg passing will be slower because of the copy_from_user(), but we will
>>>> still have the sys_clone syscall for the fast path.
>>>>
>>>> C.
>>> I'm fine with the direction you're going, but just as one more option,
>>> we could follow more of the selinux/lsm approach of first requesting
>>> clone/unshare options, then doing the actual clone/unshare.  So
>>> something like
>>>
>>> 	sys_clone_request(extended_64bit_clone_flags)
>> What if we someday hit the 64-bit limit? :)
>>
>>> 	sys_clone(usual args)

One (security?) problem with a two-stage approach is that the operation
may not complete atomically, e.g. if two threads both make the first
call before either of them gets to the second one. We would have to
prevent that, or at least ensure by design that such races cannot occur.
(In contrast, with sys_indirect() everything is atomic.)
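One way to make such races impossible by design is to keep the pending request per thread and have the clone step consume (read and clear) it. A toy user-space model of that protocol; task_sketch, clone_request_sketch() and clone_consume_sketch() are hypothetical names, and the model ignores signal handlers and requests left pending across exec():

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for per-thread kernel state. */
struct task_sketch {
	uint64_t pending_clone_flags;	/* set by the request step */
};

/* Step 1: record the extended flags on the calling thread only, so a
 * request made by one thread can never leak into another thread's clone. */
static void clone_request_sketch(struct task_sketch *self, uint64_t flags)
{
	self->pending_clone_flags = flags;
}

/* Step 2: read and clear the pending request, so it applies to exactly one
 * subsequent clone by the same thread. Returns the effective mask. */
static uint64_t clone_consume_sketch(struct task_sketch *self,
				     uint32_t legacy_flags)
{
	uint64_t flags = self->pending_clone_flags | legacy_flags;

	self->pending_clone_flags = 0;
	return flags;
}
```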

Also, in a two-step approach, using /proc as opposed to a specialized
system call incurs higher overhead, which matters if ultra-fast clone()s
are a goal in themselves.

I second the concern about running out of 64 flag bits. In fact, the
flags problem is likely to apply outside our context as well, and soon
to the Linux kernel in general. Shouldn't we discuss it there too?

>>>
>>> or
>>>
>>> 	echo pid,mqueue,user,ipc,uts,net > /proc/self/clone_unshare
>>> 	clone()
>> Well, this is how sys_indirect() was intended to work. Nobody
>> liked it, so I'm afraid this will also not be accepted.
> 
> I would have thought sys_indirect would be disliked because
> it looks like an ioctl type multiplexor.  Whereas sys_clone_request()
> or /proc/self/clone_unshare simply sets arguments in advance, the
> way /proc/self/attr/current does.

I find the sys_indirect() approach very appealing, in particular
because it was designed with non-ioctl multiplexing and backward
compatibility in mind. Like any API it can be abused and misused,
but since it applies to actual system calls and not to obscure
ioctls, it is far less likely to become a victim (so to speak...).

While I prefer sys_indirect() (personally I find it elegant, but it
isn't clear that it will be merged), from a technical point of view
all three approaches (a new system call, sys_indirect(), or a 2-step
approach) seem plausible. The final solution, however, needs to be
coordinated with the rest of the kernel developers.

Oren.

> 
> -serge

Re: Namespaces exhausted CLONE_XXX bits problem [message #26044 is a reply to message #26039]

Mon, 14 January 2008 21:54

Dave Hansen
Messages: 240
Registered: October 2005

Senior Member

On Mon, 2008-01-14 at 16:36 -0500, Oren Laadan wrote:
> I second the concern about running out of 64 flag bits. In fact, the
> flags problem is likely to apply outside our context as well, and soon
> to the Linux kernel in general. Shouldn't we discuss it there too?

It would be pretty easy to make a new one expandable:

	sys_newclone(int len, unsigned long *flags_array)

Then you could give it a virtually unlimited number of "unsigned long"s
pointed to by "flags_array".

Plus, the old clone just becomes:

        long sys_oldclone(unsigned long flags)
        {
        	return do_newclone(1, &flags);
        }

We could validate the flags array address in sys_newclone(), then call
do_newclone().
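A minimal sketch of how such an expandable mask could be queried, assuming flags are numbered consecutively across the words of flags_array. clone_flag_test() is a hypothetical helper, and len is trusted here, whereas a kernel would validate it and copy the array in first:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Number of flag bits carried by one array word. */
#define FLAG_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: test flag number 'bit' in an array of 'len' words.
 * Bits past the supplied words read as clear, so a caller that passes a
 * single word can never request a newer flag by accident. */
static int clone_flag_test(size_t len, const unsigned long *flags_array,
			   size_t bit)
{
	size_t word = bit / FLAG_WORD_BITS;

	if (word >= len)
		return 0;
	return (int)((flags_array[word] >> (bit % FLAG_WORD_BITS)) & 1UL);
}
```

With len == 1 this degenerates to the old single-word mask, which is what lets the old clone forward to do_newclone() unchanged.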

-- Dave

Re: Namespaces exhausted CLONE_XXX bits problem [message #26065 is a reply to message #26030]

Tue, 15 January 2008 07:53

Cedric Le Goater
Messages: 443
Registered: February 2006

Senior Member

Serge E. Hallyn wrote:
> Quoting Cedric Le Goater (clg@fr.ibm.com):
>> to be more precise :
>>
>> 	long sys_clone_something(struct clone_something_args args) 
>>
>> and 
>>
>> 	long sys_unshare_something(struct unshare_something_args args) 
>>
>> The arg passing will be slower because of the copy_from_user(), but we will
>> still have the sys_clone syscall for the fast path.
>>
>> C.
> 
> I'm fine with the direction you're going, but just as one more option,
> we could follow more of the selinux/lsm approach of first requesting
> clone/unshare options, then doing the actual clone/unshare.  So
> something like
> 
> 	sys_clone_request(extended_64bit_clone_flags)
> 	sys_clone(usual args)
> 
> or
> 
> 	echo pid,mqueue,user,ipc,uts,net > /proc/self/clone_unshare
> 	clone()

For my information, why did selinux/lsm choose that 2-step approach?
What kind of issues are they trying to solve?

Thanks,

C. 

Re: Namespaces exhausted CLONE_XXX bits problem [message #26071 is a reply to message #26044]

Tue, 15 January 2008 08:25

Pavel Emelianov
Messages: 1149
Registered: September 2006

Senior Member

Dave Hansen wrote:
> On Mon, 2008-01-14 at 16:36 -0500, Oren Laadan wrote:
>> I second the concern about running out of 64 flag bits. In fact, the
>> flags problem is likely to apply outside our context as well, and soon
>> to the Linux kernel in general. Shouldn't we discuss it there too?
> 
> It would be pretty easy to make a new one expandable:
> 
> 	sys_newclone(int len, unsigned long *flags_array)
> 
> Then you could give it a virtually unlimited number of "unsigned long"s
> pointed to by "flags_array".
> 
> Plus, the old clone just becomes:
> 
>         long sys_oldclone(unsigned long flags)
>         {
>         	return do_newclone(1, &flags);
>         }
> 
> We could validate the flags array address in sys_newclone(), then call
> do_newclone().

Hmm. I have an idea of how to do this w/o a new system call. This might
look weird, but why not use up the last bit for a CLONE_NEWCLONE flag and,
in this case, treat the parent_tidptr/child_tidptr as a pointer to
an array of extra arguments/flags?
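A minimal user-space sketch of that dispatch, assuming a hypothetical CLONE_NEWCLONE_SKETCH bit and a hypothetical one-field argument block. A kernel would copy_from_user() the block rather than dereference the pointer, and a real block would also need some form of versioning:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag occupying the last free 32-bit clone bit. */
#define CLONE_NEWCLONE_SKETCH 0x80000000UL

/* Hypothetical extra-argument block reached through the tid pointer. */
struct clone_ext_sketch {
	unsigned long new_flags;
};

/* When CLONE_NEWCLONE_SKETCH is set, the tid pointer is reinterpreted as
 * a pointer to the extra-argument block; otherwise (or if it is NULL) no
 * extra flags are requested and the pointer keeps its usual meaning. */
static unsigned long clone_extra_flags(unsigned long clone_flags,
				       const void *tidptr)
{
	const struct clone_ext_sketch *ext;

	if (!(clone_flags & CLONE_NEWCLONE_SKETCH) || tidptr == NULL)
		return 0;
	ext = tidptr;
	return ext->new_flags;
}
```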

> -- Dave

Re: Namespaces exhausted CLONE_XXX bits problem [message #26072 is a reply to message #26071]

Tue, 15 January 2008 08:39

Cedric Le Goater
Messages: 443
Registered: February 2006

Senior Member

Pavel Emelyanov wrote:
> Dave Hansen wrote:
>> On Mon, 2008-01-14 at 16:36 -0500, Oren Laadan wrote:
>>> I second the concern about running out of 64 flag bits. In fact, the
>>> flags problem is likely to apply outside our context as well, and soon
>>> to the Linux kernel in general. Shouldn't we discuss it there too?
>> It would be pretty easy to make a new one expandable:
>>
>> 	sys_newclone(int len, unsigned long *flags_array)
>>
>> Then you could give it a virtually unlimited number of "unsigned long"s
>> pointed to by "flags_array".
>>
>> Plus, the old clone just becomes:
>>
>>         long sys_oldclone(unsigned long flags)
>>         {
>>         	return do_newclone(1, &flags);
>>         }
>>
>> We could validate the flags array address in sys_newclone(), then call
>> do_newclone().
> 
> Hmm. I have an idea of how to do this w/o a new system call. This might
> look weird, but why not use up the last bit for a CLONE_NEWCLONE flag and,
> in this case, treat the parent_tidptr/child_tidptr as a pointer to
> an array of extra arguments/flags?

It's a bit hacky but it looks like a good idea to me !

Shall we use parent_tidptr or child_tidptr to pass an extended array of
flags only? If we could also pass the pid of the task to be cloned, it would
be useful for c/r.

C.

Re: Namespaces exhausted CLONE_XXX bits problem [message #26073 is a reply to message #26072]

Tue, 15 January 2008 08:53

Pavel Emelianov
Messages: 1149
Registered: September 2006

Senior Member

Cedric Le Goater wrote:
> Pavel Emelyanov wrote:
>> Dave Hansen wrote:
>>> On Mon, 2008-01-14 at 16:36 -0500, Oren Laadan wrote:
>>>> I second the concern about running out of 64 flag bits. In fact, the
>>>> flags problem is likely to apply outside our context as well, and soon
>>>> to the Linux kernel in general. Shouldn't we discuss it there too?
>>> It would be pretty easy to make a new one expandable:
>>>
>>> 	sys_newclone(int len, unsigned long *flags_array)
>>>
>>> Then you could give it a virtually unlimited number of "unsigned long"s
>>> pointed to by "flags_array".
>>>
>>> Plus, the old clone just becomes:
>>>
>>>         long sys_oldclone(unsigned long flags)
>>>         {
>>>         	return do_newclone(1, &flags);
>>>         }
>>>
>>> We could validate the flags array address in sys_newclone(), then call
>>> do_newclone().
>> Hmm. I have an idea of how to do this w/o a new system call. This might
>> look weird, but why not use up the last bit for a CLONE_NEWCLONE flag and,
>> in this case, treat the parent_tidptr/child_tidptr as a pointer to
>> an array of extra arguments/flags?
> 
> It's a bit hacky but it looks like a good idea to me !
> 
> Shall we use parent_tidptr or child_tidptr to pass an extended array of
> flags only? If we could also pass the pid of the task to be cloned, it would
> be useful for c/r.

Yup. I think we can declare a

struct new_clone_arg {
	unsigned int size;
};

and consider the xx_tidptr to be a pointer to it. After this we
may send patches that add fields to this structure.

E.g. first

 struct new_clone_arg {
 	unsigned int size;
+	unsigned long new_flags;
 };

to add flags for cloning new namespaces. Later

 struct new_clone_arg {
 	unsigned int size;
 	unsigned long new_flags;
+	int desired_pid;
 };

and any code that needs to access an extra argument would need
to check that new_clone_arg->size is not less than the offset of
the field it needs to access plus that field's size. E.g. like this:

#define clone_arg_has(arg, member)	({			\
	struct new_clone_arg *__carg = arg;			\
	(__carg->size >= offsetof(struct new_clone_arg, member) + \
		sizeof(__carg->member)); })

...

if (!clone_arg_has(arg, desired_pid))
	return -EINVAL;

This would keep the API always compatible.
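A standard-C sketch of the same size check, using a plain macro instead of a GNU statement expression. The three-field layout mirrors the example above but is illustrative only, clone_arg_desired_pid() is a hypothetical consumer, and in the kernel the block would first be copied in from user space:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative layout: each field is appended in a later "version". */
struct new_clone_arg {
	unsigned int size;		/* filled in by the caller */
	unsigned long new_flags;	/* first addition */
	int desired_pid;		/* later addition */
};

/* True if the caller's block is large enough to contain 'member'. */
#define clone_arg_has(arg, member)					\
	((arg)->size >= offsetof(struct new_clone_arg, member) +	\
			sizeof((arg)->member))

/* Hypothetical consumer: the requested pid, or -1 if the caller's block
 * is too old to carry one. */
static int clone_arg_desired_pid(const struct new_clone_arg *arg)
{
	if (!clone_arg_has(arg, desired_pid))
		return -1;
	return arg->desired_pid;
}
```

A caller built against an older header passes a smaller size, so a newer kernel can tell which trailing fields are absent instead of reading past the end of the caller's block.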

> C.
> 

Thanks,
Pavel

Re: Namespaces exhausted CLONE_XXX bits problem [message #26075 is a reply to message #26071]

Tue, 15 January 2008 09:22

Dave Hansen
Messages: 240
Registered: October 2005

Senior Member

On Tue, 2008-01-15 at 11:25 +0300, Pavel Emelyanov wrote:
> Hmm. I have an idea of how to do this w/o a new system call. This might
> look weird, but why not use up the last bit for a CLONE_NEWCLONE flag and,
> in this case, treat the parent_tidptr/child_tidptr as a pointer to
> an array of extra arguments/flags?

I guess that does keep us from having to add an _actual_ system call.
Do we make the array something like

	array[] = { orig_tidptr, nr_flags, actual flags... };

?
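A sketch of parsing the layout suggested above, assuming the array and its total length are already in kernel memory (a real implementation would read nr_flags from user space and bound it before copying the rest). parse_clone_array() is a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parser for the layout
 *   array[0]   original tid pointer (stored as an integer)
 *   array[1]   number of flag words that follow
 *   array[2..] the flag words
 * Returns the number of flag words, or -1 if array_len is too short. */
static long parse_clone_array(const unsigned long *array, size_t array_len,
			      unsigned long *orig_tidptr,
			      const unsigned long **flag_words)
{
	if (array_len < 2 || array[1] > array_len - 2)
		return -1;
	*orig_tidptr = array[0];
	*flag_words = &array[2];
	return (long)array[1];
}
```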

-- Dave

Re: Namespaces exhausted CLONE_XXX bits problem [message #26076 is a reply to message #26075]

Tue, 15 January 2008 09:24

Pavel Emelianov
Messages: 1149
Registered: September 2006

Senior Member

Dave Hansen wrote:
> On Tue, 2008-01-15 at 11:25 +0300, Pavel Emelyanov wrote:
>> Hmm. I have an idea of how to do this w/o a new system call. This might
>> look weird, but why not use up the last bit for a CLONE_NEWCLONE flag and,
>> in this case, treat the parent_tidptr/child_tidptr as a pointer to
>> an array of extra arguments/flags?
> 
> I guess that does keep us from having to add an _actual_ system call.

Exactly!

> Do we make the array something like
> 
> 	array[] = { orig_tidptr, nr_flags, actual flags... };

Not exactly. I have already sent my view of this in 
another letter.

> ?
> 
> -- Dave

Re: Namespaces exhausted CLONE_XXX bits problem [message #26077 is a reply to message #26073]

Tue, 15 January 2008 09:40

Cedric Le Goater
Messages: 443
Registered: February 2006

Senior Member

Pavel Emelyanov wrote:
> Cedric Le Goater wrote:
>> Pavel Emelyanov wrote:
>>> Dave Hansen wrote:
>>>> On Mon, 2008-01-14 at 16:36 -0500, Oren Laadan wrote:
>>>>> I second the concern about running out of 64 flag bits. In fact, the
>>>>> flags problem is likely to apply outside our context as well, and soon
>>>>> to the Linux kernel in general. Shouldn't we discuss it there too?
>>>> It would be pretty easy to make a new one expandable:
>>>>
>>>> 	sys_newclone(int len, unsigned long *flags_array)
>>>>
>>>> Then you could give it a virtually unlimited number of "unsigned long"s
>>>> pointed to by "flags_array".
>>>>
>>>> Plus, the old clone just becomes:
>>>>
>>>>         long sys_oldclone(unsigned long flags)
>>>>         {
>>>>         	return do_newclone(1, &flags);
>>>>         }
>>>>
>>>> We could validate the flags array address in sys_newclone(), then call
>>>> do_newclone().
>>> Hmm. I have an idea of how to do this w/o a new system call. This might
>>> look weird, but why not use up the last bit for a CLONE_NEWCLONE flag and,
>>> in this case, treat the parent_tidptr/child_tidptr as a pointer to
>>> an array of extra arguments/flags?
>> It's a bit hacky but it looks like a good idea to me !
>>
>> Shall we use parent_tidptr or child_tidptr to pass an extended array of
>> flags only? If we could also pass the pid of the task to be cloned, it would
>> be useful for c/r.
> 
> Yup. I think we can declare a
> 
> struct new_clone_arg {
> 	unsigned int size;
> };
> 
> and consider the xx_tidptr to be a pointer to it. After this we
> may send patches that add fields to this structure.
> 
> E.g. first
> 
>  struct new_clone_arg {
>  	unsigned int size;
> +	unsigned long new_flags;
>  };
> 
> to add flags for cloning new namespaces. Later
> 
>  struct new_clone_arg {
>  	unsigned int size;
>  	unsigned long new_flags;
> +	int desired_pid;
>  };
> 
> and any code that needs to access an extra argument would need
> to check that new_clone_arg->size is not less than the offset of
> the field it needs to access plus that field's size. E.g. like this:
> 
> #define clone_arg_has(arg, member)	({			\
> 	struct new_clone_arg *__carg = arg;			\
> 	(__carg->size >= offsetof(struct new_clone_arg, member) + \
> 		sizeof(__carg->member)); })
> 
> ...
> 
> if (!clone_arg_has(arg, desired_pid))
> 	return -EINVAL;
> 
> This would keep the API always compatible.

Pavel, this is pretty neat. 

I think we need to work on a patch now and send it to Andrew and lkml@
to reach a larger audience.

It doesn't seem to be a really big patch, and I'm wondering how I could help.
We still have to prepare something for security_task_create().

Thanks !

C.
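Pavel's size-versioned `struct new_clone_arg` quoted above can be modeled in plain userspace C to see how old and new callers coexist. This is a sketch under stated assumptions, not a kernel patch: `struct clone_arg_v1`, `fetch_clone_arg()` and `clone_arg_pid()` are hypothetical names, and `memcpy()` stands in for `copy_from_user()`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Layout after both patches from the thread; field names follow the mail. */
struct new_clone_arg {
	unsigned int size;       /* set by the caller to sizeof() of its copy */
	unsigned long new_flags; /* added by the first patch */
	int desired_pid;         /* added by a later patch */
};

/* Hypothetical: what a caller built before desired_pid existed. */
struct clone_arg_v1 {
	unsigned int size;
	unsigned long new_flags;
};

/* Does the caller's structure extend far enough to cover @member? */
#define clone_arg_has(arg, member)					\
	((arg)->size >= offsetof(struct new_clone_arg, member) +	\
			sizeof((arg)->member))

/*
 * Hypothetical fetch step: read the caller's size header, then copy at most
 * sizeof(struct new_clone_arg) bytes and zero the rest. memcpy() stands in
 * for copy_from_user(); this model trusts the caller's size field.
 */
static int fetch_clone_arg(struct new_clone_arg *dst, const void *src)
{
	unsigned int size;

	memcpy(&size, src, sizeof(size));
	if (size < sizeof(size))
		return -EINVAL;
	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, size < sizeof(*dst) ? size : sizeof(*dst));
	return 0;
}

/* A consumer of the newest field, guarded as in the mail. */
static int clone_arg_pid(const struct new_clone_arg *arg)
{
	if (!clone_arg_has(arg, desired_pid))
		return -EINVAL;
	return arg->desired_pid;
}
```

An older binary that only knows `new_flags` keeps working, while code that wants `desired_pid` sees the short `size` and fails with `-EINVAL` instead of reading memory the caller never provided.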

Re: Namespaces exhausted CLONE_XXX bits problem [message #26078 is a reply to message #26077]

Tue, 15 January 2008 09:57

Pavel Emelianov
Messages: 1149
Registered: September 2006

Senior Member

Cedric Le Goater wrote:
> Pavel Emelyanov wrote:
>> Cedric Le Goater wrote:
>>> Pavel Emelyanov wrote:
>>>> Dave Hansen wrote:
>>>>> On Mon, 2008-01-14 at 16:36 -0500, Oren Laadan wrote:
>>>>>> I second the concern about running out of 64 bits of flags. In fact,
>>>>>> the problem with the flags is likely to apply outside our context and
>>>>>> to become general to the Linux kernel soon. Shouldn't we discuss it
>>>>>> there too?
>>>>> It would be pretty easy to make a new one expandable:
>>>>>
>>>>> 	sys_newclone(int len, unsigned long *flags_array)
>>>>>
>>>>> Then you could give it a virtually unlimited number of "unsigned long"s
>>>>> pointed to by "flags_array".
>>>>>
>>>>> Plus, the old clone just becomes:
>>>>>
>>>>>         long sys_oldclone(unsigned long flags)
>>>>>         {
>>>>>         	return do_newclone(1, &flags);
>>>>>         }
>>>>>
>>>>> We could validate the flags array address in sys_newclone(), then call
>>>>> do_newclone().
>>>> Hmm. I have an idea how to do this w/o a new system call. This might
>>>> look weird, but why not reserve the last bit for a CLONE_NEWCLONE flag
>>>> and, when it is set, treat parent_tidptr/child_tidptr as a pointer to
>>>> an array of extra arguments/flags?
>>> It's a bit hacky but it looks like a good idea to me !
>>>
>>> Shall we use parent_tidptr or child_tidptr to pass only an extended array
>>> of flags? If we could also pass the pid of the task to be cloned, it would
>>> be useful for c/r (checkpoint/restart).
>> Yup. I think we can declare a
>>
>> struct new_clone_arg {
>> 	unsigned int size;
>> };
>>
>> and treat xx_tidptr as a pointer to it. After this we
>> may send patches that add fields to this structure.
>>
>> E.g. first
>>
>>  struct new_clone_arg {
>>  	unsigned int size;
>> +	unsigned long new_flags;
>>  };
>>
>> to add flags for cloning new namespaces. Later
>>
>>  struct new_clone_arg {
>>  	unsigned int size;
>>  	unsigned long new_flags;
>> +	int desired_pid;
>>  };
>>
>> and any code that needs to access an extra argument would have
>> to check that new_clone_arg->size is at least the offset of the
>> field it needs plus that field's size, e.g. like this:
>>
>> #define clone_arg_has(arg, member)	({			\
>> 	struct new_clone_arg *__carg = (arg);			\
>> 	(__carg->size >= offsetof(struct new_clone_arg, member) + \
>> 		sizeof(__carg->member)); })
>>
>> ...
>>
>> if (!clone_arg_has(arg, desired_pid))
>> 	return -EINVAL;
>>
>> This would keep the API always compatible.
> 
> Pavel, this is pretty neat. 

Thanks, but what do we do with unshare()? Dropping namespace support
from unshare() is not an option, and unshare() takes only a flags
argument with no spare pointer to reuse, so we'll have to add a new
sys_unshare2 system call with a similar technique for argument passing.

> I think we need to work on a patch now and send it to Andrew and lkml@
> to reach a larger audience.

OK, I'll try to prepare one for clone() today. I hope it will
be ready to send tomorrow.

> It doesn't seem to be a really big patch, and I'm wondering how I could help.

I'll send it for pre-review before showing it to people ;)

> We still have to prepare something for security_task_create().
> 
> Thanks !
> 
> C.

Thanks,
Pavel
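Dave's expandable `sys_newclone(int len, unsigned long *flags_array)` idea, quoted above, can be sketched in userspace C with the old single-word clone reduced to a one-element array. Assumptions: `NR_CLONE_FLAG_WORDS`, `CLONE_BITS_PER_WORD` and `clone_flag_test()` are hypothetical, rejecting non-zero bits in words the kernel does not know is one possible design choice rather than something from the thread, and the `copy_from_user()` step that `sys_newclone()` would perform is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/* Hypothetical: number of flag words this kernel understands. */
#define NR_CLONE_FLAG_WORDS 2
#define CLONE_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Test flag @nr in an array of @len words; bits past the array read as 0. */
static int clone_flag_test(int len, const unsigned long *flags, unsigned int nr)
{
	unsigned int word = nr / CLONE_BITS_PER_WORD;

	if (word >= (unsigned int)len)
		return 0;
	return (int)((flags[word] >> (nr % CLONE_BITS_PER_WORD)) & 1UL);
}

/*
 * Shared path. In the kernel, sys_newclone() would first copy @flags from
 * user space; here it is already a local pointer.
 */
static long do_newclone(int len, const unsigned long *flags)
{
	int i;

	if (len < 1)
		return -EINVAL;
	/* Assumed policy: refuse bits in words this kernel does not know yet. */
	for (i = NR_CLONE_FLAG_WORDS; i < len; i++)
		if (flags[i])
			return -EINVAL;
	/* ...a real implementation would continue into copy_process()... */
	return 0;
}

/* The old single-word clone becomes a one-element array, as in the mail. */
static long sys_oldclone(unsigned long flags)
{
	return do_newclone(1, &flags);
}
```

Because bits beyond the caller's array read as clear, an old caller passing `len == 1` automatically requests none of the newer namespaces.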

Re: Namespaces exhausted CLONE_XXX bits problem [message #26094 is a reply to message #26065]

Tue, 15 January 2008 14:35

serue
Messages: 750
Registered: February 2006

Senior Member

Quoting Cedric Le Goater (clg@fr.ibm.com):
> Serge E. Hallyn wrote:
> > Quoting Cedric Le Goater (clg@fr.ibm.com):
> >> to be more precise:
> >>
> >> 	long sys_clone_something(struct clone_something_args __user *args)
> >>
> >> and
> >>
> >> 	long sys_unshare_something(struct unshare_something_args __user *args)
> >>
> >> Argument passing will be slower because of the copy_from_user(), but we
> >> will still have the sys_clone syscall for the fast path.
> >>
> >> C.
> > 
> > I'm fine with the direction you're going, but just as one more option,
> > we could follow more of the selinux/lsm approach of first requesting
> > clone/unshare options, then doing the actual clone/unshare.  So
> > something like
> > 
> > 	sys_clone_request(extended_64bit_clone_flags)
> > 	sys_clone(usual args)
> > 
> > or
> > 
> > 	echo pid,mqueue,user,ipc,uts,net > /proc/self/clone_unshare
> > 	clone()
> 
> Out of curiosity, why did selinux/lsm choose that two-step approach?
> What kind of issues were they trying to solve?

Well, an interface was needed to allow multiple LSMs to query and set
task information.  Using a syscall (which was attempted) required
ioctl-like subcommands, which were not accepted.

-serge
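Serge's two-step alternative quoted above (first request the extended namespace flags, then call clone()) can be modeled in userspace C. Everything here is hypothetical: the bit order in `ns_names[]`, `struct task_model`, and the helper names are illustrations, `/proc/self/clone_unshare` is only the interface floated in the mail, and treating a request as consumed by the next clone() is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bit order for the extended mask: bit i = ns_names[i]. */
static const char *const ns_names[] = {
	"pid", "mqueue", "user", "ipc", "uts", "net",
};
#define NR_NS_NAMES (sizeof(ns_names) / sizeof(ns_names[0]))

/* Models the per-task slot where a request waits for the next clone(). */
struct task_model {
	uint64_t pending_clone_flags;
};

/* Parse "pid,mqueue,..." (optional trailing newline) into a bit mask. */
static int parse_ns_list(const char *buf, uint64_t *mask)
{
	uint64_t m = 0;
	const char *p = buf;

	while (*p) {
		size_t len = strcspn(p, ",\n");
		size_t i;

		for (i = 0; i < NR_NS_NAMES; i++)
			if (strlen(ns_names[i]) == len &&
			    strncmp(p, ns_names[i], len) == 0)
				break;
		if (i == NR_NS_NAMES)
			return -EINVAL;	/* unknown or empty name */
		m |= UINT64_C(1) << i;
		p += len;
		if (*p)
			p++;		/* skip the ',' or '\n' */
	}
	*mask = m;
	return 0;
}

/* Step 1: models a write to the hypothetical /proc/self/clone_unshare. */
static int clone_request_write(struct task_model *t, const char *buf)
{
	uint64_t mask;
	int err = parse_ns_list(buf, &mask);

	if (err)
		return err;	/* a bad request leaves the old one in place */
	t->pending_clone_flags = mask;
	return 0;
}

/* Step 2: clone() picks up the request; assumed here to be one-shot. */
static uint64_t clone_take_request(struct task_model *t)
{
	uint64_t flags = t->pending_clone_flags;

	t->pending_clone_flags = 0;
	return flags;
}
```

The appeal of this shape is that clone() and unshare() keep their existing signatures; the cost is hidden per-task state between the two calls.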

Re: Namespaces exhausted CLONE_XXX bits problem [message #26101 is a reply to message #26076]

Tue, 15 January 2008 15:08

serue
Messages: 750
Registered: February 2006

Senior Member

Quoting Pavel Emelyanov (xemul@openvz.org):
> Dave Hansen wrote:
> > On Tue, 2008-01-15 at 11:25 +0300, Pavel Emelyanov wrote:
> >> Hmm. I have an idea how to do this w/o a new system call. This might
> >> look weird, but why not reserve the last bit for a CLONE_NEWCLONE flag
> >> and, when it is set, treat parent_tidptr/child_tidptr as a pointer to
> >> an array of extra arguments/flags?
> > 
> > I guess that does keep us from having to add an _actual_ system call.
> 
> Exactly!

I'll be honest: while it's a really neat idea, in terms of code actually
going into the tree I far, far prefer a real new syscall.

But it sounds like I'm the only one so I'll just mention it once and
then bite my tongue  :)

thanks,
-serge

Re: Namespaces exhausted CLONE_XXX bits problem [message #26105 is a reply to message #26101]

Tue, 15 January 2008 15:51

Cedric Le Goater
Messages: 443
Registered: February 2006

Senior Member

Serge E. Hallyn wrote:
> Quoting Pavel Emelyanov (xemul@openvz.org):
>> Dave Hansen wrote:
>>> On Tue, 2008-01-15 at 11:25 +0300, Pavel Emelyanov wrote:
>>>> Hmm. I have an idea how to do this w/o a new system call. This might
>>>> look weird, but why not reserve the last bit for a CLONE_NEWCLONE flag
>>>> and, when it is set, treat parent_tidptr/child_tidptr as a pointer to
>>>> an array of extra arguments/flags?
>>> I guess that does keep us from having to add an _actual_ system call.
>> Exactly!
> 
> I'll be honest: while it's a really neat idea, in terms of code actually
> going into the tree I far, far prefer a real new syscall.

Well, hijacking child_tidptr and adding a new syscall will probably look
the same internally, so if hijacking child_tidptr turns out not to be
acceptable, it won't take much work to plug the code into a new syscall.

> But it sounds like I'm the only one so I'll just mention it once and
> then bite my tongue  :)

Hold on. This patch has not been sent to lkml@ yet, but it's worth a try :)

C.
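Cedric's point that hijacking child_tidptr and a dedicated syscall "will probably look the same internally" can be illustrated with two entry points that share one fetch step and produce the same request. A sketch under assumptions: `CLONE_NEWCLONE_MODEL`, `struct clone_ext_args`, `struct clone_request` and all function names are hypothetical, the marker value is arbitrary, and `memcpy()` stands in for `copy_from_user()`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical marker bit modeled on the CLONE_NEWCLONE idea. */
#define CLONE_NEWCLONE_MODEL 0x80000000UL

/* Hypothetical extended-argument block passed by pointer. */
struct clone_ext_args {
	unsigned int size;
	unsigned long ext_flags;
};

/* What both entry points hand to the common clone code. */
struct clone_request {
	unsigned long flags;     /* classic flags, marker bit stripped */
	unsigned long ext_flags; /* extended flags, 0 if none */
	int *child_tid;          /* real tid pointer, NULL when hijacked */
};

/* Shared fetch; memcpy() stands in for copy_from_user(). */
static int fetch_ext_args(struct clone_ext_args *ext, const void *uptr)
{
	if (!uptr)
		return -EFAULT;
	memcpy(ext, uptr, sizeof(*ext)); /* this model assumes a full struct */
	return ext->size < sizeof(*ext) ? -EINVAL : 0;
}

/* Entry 1: clone() with child_tidptr hijacked when the marker bit is set. */
static int clone_entry(unsigned long flags, int *child_tidptr,
		       struct clone_request *req)
{
	struct clone_ext_args ext = { 0, 0 };

	req->child_tid = child_tidptr;
	if (flags & CLONE_NEWCLONE_MODEL) {
		int err = fetch_ext_args(&ext, child_tidptr);

		if (err)
			return err;
		req->child_tid = NULL; /* the pointer carried arguments, not a tid */
	}
	req->flags = flags & ~CLONE_NEWCLONE_MODEL;
	req->ext_flags = ext.ext_flags;
	return 0;
}

/* Entry 2: a dedicated syscall that takes the same block explicitly. */
static int clone2_entry(unsigned long flags, const struct clone_ext_args *uargs,
			struct clone_request *req)
{
	struct clone_ext_args ext;
	int err = fetch_ext_args(&ext, uargs);

	if (err)
		return err;
	req->flags = flags;
	req->ext_flags = ext.ext_flags;
	req->child_tid = NULL;
	return 0;
}
```

Everything after `fetch_ext_args()` is identical for both entry points, so switching from the hijacked pointer to a new syscall would mostly mean adding a thin wrapper.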
