OpenVZ Forum


[patch -mm 00/17] new namespaces and related syscalls
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16928 is a reply to message #16810] Mon, 11 December 2006 20:34
ebiederm
"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>
> Yeah, that occurred to me, but it doesn't seem like we can possibly make
> sufficient guarantees to the client to make this worthwhile.
>
> I'd love to be wrong about that, but if nothing else we can't prove to
> the client that they're running on an unhacked host.  So the host admin
> will always have to be trusted.

To some extent that is true.  Although all security models we have
currently fall down if you hack the kernel, or run your kernel
in a hacked virtual environment.  It would be nice if under normal
conditions you could mount an encrypted filesystem only in a container
and not have concerns of those files escaping.

Which would probably be a matter of having a separate uid_ns and not
allowing processes outside of your container to have any permissions in
that filesystem.

>> 2) When we only partially enter a namespace it is very easy for additional
>>    properties to enter that namespace.  For example we enter the pid
>>    namespace and the mount namespace, but keep our current working directory
>>    in the previous namespace.  Then a process in the restricted namespace
>>    can get out by cd into /proc/<?>/cwd.
>
> Yup, entering existing namespaces should be all-or-nothing.

A truly all-or-nothing enter has the problem that there is no external
input into the container, and a very controlled external input
to the existing container is what this is about.

>> If someone's permissions to various objects do not depend on the namespace
>> they are in, quite possibly this is a non-issue.  If we actually depend on
>> the isolation to keep things secure, enter is a setup for a first-rate escape.
>
> I don't believe the isolation can be effective between two namespaces
> where one is an ancestor of another.  It can be so long as one isn't
> the ancestor of another, but then we're not allowing either to enter
> the other namespace.  So it's not a problem.

Reasonable.  

> The bind_ns() proposed by Cedric is stricter, only allowing nsid 0 to
> switch namespaces.  So it may be overly restrictive, and does introduce
> a new global namespace, but it is safe.

I will look a little more.  There are a lot of patches out there that need
review.  What disturbs me a little is that with ptrace we have an existing
mechanism that can do everything we want enter or bind_ns to be able to do.

I actually have code that will let me fork a process in a new namespace today
without needing bind_ns.  What is more, I don't even have to be root
to use it.

I would very much prefer to see us optimizing our debugging and
control interfaces so they are efficient than see us implement
something completely new that is problem-domain specific.

Eric
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16930 is a reply to message #16923] Mon, 11 December 2006 20:03
serue
Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serue@us.ibm.com> writes:
> 
> > Quoting Serge E. Hallyn (serue@us.ibm.com):
> >> Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> > Herbert Poetzl <herbert@13thfloor.at> writes:
> >> > >> Beyond that yes it seems to make sense to let user space
> >> > >> maintain any mapping of containers to ids.
> >> > >
> >> > > I agree with that, but we need something to move
> >> > > around between the various spaces ...
> >> >
> >> > If you have CAP_SYS_PTRACE or you have a child process
> >> > in a container you can create another with ptrace.
> >> >
> >> > Now I don't mind optimizing that case, with something like
> >> > the proposed bind_ns syscall.  But we need to be darn certain
> >> > why it is safe, and does not change the security model that
> >> > we currently have.
> >>
> >> Sigh, and that's going to have to be a discussion per namespace.
> >
> > Well, assuming that we're using pids as identifiers, that means
> > we can only enter descendant namespaces, which means 'we' must
> > have created them.  So anything we could do by entering the ns,
> > we could have done by creating it as well, right?
> 
> It isn't strict descendants who we can see.  i.e.  init can create
> the thing, and we could have just logged into the network but init
> and us still share the same pid namespace.
> 
> But yes it would be we can only enter descendant namespaces, for
> some definition of enter.
> 
> There are two issues.
> 1) We may have a namespace we want to create and then remove the ability
>    for the sysadmin to fiddle with, so it can play with encrypted data or
>    something like that safely.  Not quite unix but it is certainly worth
>    considering.

Yeah, that occurred to me, but it doesn't seem like we can possibly make
sufficient guarantees to the client to make this worthwhile.

I'd love to be wrong about that, but if nothing else we can't prove to
the client that they're running on an unhacked host.  So the host admin
will always have to be trusted.

> 2) When we only partially enter a namespace it is very easy for additional
>    properties to enter that namespace.  For example we enter the pid
>    namespace and the mount namespace, but keep our current working directory
>    in the previous namespace.  Then a process in the restricted namespace
>    can get out by cd into /proc/<?>/cwd.

Yup, entering existing namespaces should be all-or-nothing.

> If someone's permissions to various objects do not depend on the namespace
> they are in, quite possibly this is a non-issue.  If we actually depend on
> the isolation to keep things secure, enter is a setup for a first-rate escape.

I don't believe the isolation can be effective between two namespaces
where one is an ancestor of another.  It can be so long as one isn't
the ancestor of another, but then we're not allowing either to enter
the other namespace.  So it's not a problem.

The bind_ns() proposed by Cedric is stricter, only allowing nsid 0 to
switch namespaces.  So it may be overly restrictive, and does introduce
a new global namespace, but it is safe.

-serge
_______________________________________________
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16954 is a reply to message #16810] Mon, 11 December 2006 22:53
Dave Hansen
On Mon, 2006-12-11 at 16:23 +0100, Cedric Le Goater wrote:
> > Even letting the concept of nsproxy escape to user space sounds wrong.
> > nsproxy is an internal space optimization.  It's not struct container
> > and I don't think we want it to become that.
> 
> i don't agree here. we need that, so does openvz, vserver, people working
> on resource management. 

I think what those projects need is _some_ way to group tasks.  I'm not
sure they actually need nsproxies.

Two tasks in the same container could very well have different
nsproxies.  The nsproxy defines how the pid namespace, and pid<->task
mappings happen for a given task.  The init process for a container is
special and might actually appear in more than one pid namespace, while
its children might only appear in one.  That means that this init
process's nsproxy can and should actually be different from its
children's.  This is despite the fact that they are in the same
container.

If we really need this 'container' grouping, it can easily be something
pointed to _by_ the nsproxy, but it shouldn't _be_ the nsproxy.
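
The distinction above can be sketched in plain C.  This is only a
user-space model under stated assumptions: every type and function name
below (pid_ns_model, container_model, nsproxy_model, task_model,
same_container) is hypothetical and not a kernel definition.  It shows
two tasks with different nsproxies that still share one container,
because the container is something the nsproxy points to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; these are not the kernel's definitions. */
struct pid_ns_model { int level; };
struct container_model { const char *name; };

struct nsproxy_model {
    struct pid_ns_model *pid_ns;        /* may differ between tasks of one container */
    struct container_model *container;  /* grouping object pointed to BY the nsproxy */
};

struct task_model { struct nsproxy_model *nsproxy; };

/* Tasks belong together when their nsproxies point to the same container,
 * even if the nsproxies themselves are different objects. */
static bool same_container(const struct task_model *a, const struct task_model *b)
{
    assert(a != NULL && b != NULL);
    return a->nsproxy->container == b->nsproxy->container;
}
```

In this model a container's init task and its children can carry
distinct nsproxy objects while same_container() still groups them.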

-- Dave

_______________________________________________
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16956 is a reply to message #16928] Mon, 11 December 2006 22:01
serue
Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serue@us.ibm.com> writes:
> 
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >
> > Yeah, that occurred to me, but it doesn't seem like we can possibly make
> > sufficient guarantees to the client to make this worthwhile.
> >
> > I'd love to be wrong about that, but if nothing else we can't prove to
> > the client that they're running on an unhacked host.  So the host admin
> > will always have to be trusted.
> 
> To some extent that is true.  Although all security models we have
> currently fall down if you hack the kernel, or run your kernel
> in a hacked virtual environment.  It would be nice if under normal
> conditions you could mount an encrypted filesystem only in a container
> and not have concerns of those files escaping.

Hmm, well perhaps I'm being overly pessimistic - IBM research did have a
demo based on TPM of remote attestation, which may be usable for
ensuring that you're connecting to a service on your virtual machine on
a certain (unhacked) kernel on particular hardware, in which case what
you're talking about may be possible - given a stringent initial
environment (i.e. not the 'gimme $20/month for a hosted partition in
arizona' environment).

Given that, perhaps having a virtual machine with access to encrypted
storage - safe from the host machine admins - may not be unattainable
after all.  And given that, it would be worth designing the ns_enter()
system call so that a parent cannot enter some child namespace.

> Which would probably be a matter of having a separate uid_ns and not
> allowing processes outside of your container to have any permissions in
> that filesystem.

Yup.  Or even just a separate uid_ns and an ecryptfs partition, so
that the host can back up the encrypted data incrementally (per file,
i.e. not just the whole dmcrypted loop file).

-serge
_______________________________________________
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16958 is a reply to message #16928] Mon, 11 December 2006 22:18
serue
Quoting Eric W. Biederman (ebiederm@xmission.com):
> I actually have code that will let me fork a process in a new namespace today
> without needing bind_ns.  What is more, I don't even have to be root
> to use it.

Can you elaborate?  The user namespace patches don't enforce ptrace
yet, so you could unshare as root, become uid 500, then as uid 500
in the original namespace ptrace the process in the new namespace.
Is that what you're doing?  If (when) ptrace enforces the uid namespace,
will that stop what you're doing?

-serge
_______________________________________________
Re: [patch -mm 10/17] nsproxy: add unshare_ns and bind_ns syscalls [message #16959 is a reply to message #16863] Mon, 11 December 2006 15:21
Cedric Le Goater
Eric W. Biederman wrote:
> clg@fr.ibm.com writes:
> 
>> From: Cedric Le Goater <clg@fr.ibm.com>
>>
>> The following patch defines 2 new syscalls specific to nsproxy and
>> namespaces :
>>
>> * unshare_ns :
>>
>> 	enables a process to unshare one or more namespaces. this
>>         duplicates the unshare syscall for the moment but we
>> 	expect to diverge when the number of namespaces increases
> 
> Are we out of clone flags yet?  If not this is premature.
> 
>> * bind_ns :
>> 	
>> 	allows a process to bind
>> 	1 - its nsproxy to some identifier
>> 	2 - to another nsproxy using an identifier or -pid
> 
> NAK
>
> Don't use global identifiers.  Use pids.  i.e. struct pid * for your
> identifiers.  Is there a reason pids are unsuitable?

(1) gives a little more freedom to the sysadmin managing its  
(2) uses pids. do you also nak it ? 

do you always have access to pid ? 

> I'm also worried about the security implications of switching namespaces
> on a process.   That is something that needs to be looked at very closely.

this is required by at least 3 products I know of.

> These two changes certainly don't belong in a single patch, and they
> could certainly use a bit more explanation.  syscalls are not something to
> add lightly, because they must be supported forever.

agree.

c.

_______________________________________________
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16960 is a reply to message #16864] Mon, 11 December 2006 15:23
Cedric Le Goater
Eric W. Biederman wrote:
> clg@fr.ibm.com writes:
> 
>> From: Cedric Le Goater <clg@fr.ibm.com>
>>
>> This patch adds a hashtable of nsproxy using the nsproxy as a key.
>> init_nsproxy is hashed at init with key 0. This is considered to be
>> the 'host' nsproxy.
> 
> NAK.  Which namespace do these ids live in?
> 
> It sounds like you are setting up to make the 'host' nsproxy special
> and have special rules.  

exactly. 

> That also sounds wrong.

sounds very nice to me and a few others.

> Even letting the concept of nsproxy escape to user space sounds wrong.
> nsproxy is an internal space optimization.  It's not struct container
> and I don't think we want it to become that.

i don't agree here. we need that, so does openvz, vserver, people working
on resource management.

C.
_______________________________________________
Re: [patch -mm 04/17] nsproxy: externalizes exit_task_namespaces [message #16961 is a reply to message #16869] Mon, 11 December 2006 15:26
Cedric Le Goater
Eric W. Biederman wrote:
> clg@fr.ibm.com writes:
> 
>> From: Cedric Le Goater <clg@fr.ibm.com>
>>
>> this is required to remove a header dependency in sched.h which breaks
>> next patches.
> 
> This just doesn't feel right.
> 
> Why with unshare working now are you needing to rework everything?

This is not everything. this is just 3 lines compared to millions.

The issue is not here anyway.

c.
_______________________________________________
Re: [patch -mm 09/17] nsproxy: add namespace flags [message #16962 is a reply to message #16865] Mon, 11 December 2006 15:27
Cedric Le Goater
Eric W. Biederman wrote:
> Cedric Le Goater <clg@fr.ibm.com> writes:
> 
>>>>  /*
>>>> + * namespaces flags
>>>> + */
>>>> +#define NS_MNT		0x00000001
>>>> +#define NS_UTS		0x00000002
>>>> +#define NS_IPC		0x00000004
>>>> +#define NS_PID		0x00000008
>>>> +#define NS_NET		0x00000010
>>>> +#define NS_USER		0x00000020
>>>> +#define NS_ALL		(NS_MNT|NS_UTS|NS_IPC|NS_PID|NS_NET|NS_USER)
>>> hmm, why _another_ set of flags to refer to the
>>> namespaces?
>> well, because namespaces are a new kind in the kernel
> 
> Gratuitous incompatibility.

?

>>> is the clone()/unshare() set of flags not sufficient
>>> for that?
>> because we are reaching the limits of the CLONE_ flags.
> 
> Not really.   There are at least 8 bits that clone cannot use
> but that unshare can.

please, could you list them ? 

>>> if so, shouldn't we switch (or even better change?
>>> the unshare() too) to a new set of syscalls?
>> unshare_ns() is a new syscall and we don't really need a
>> clone anyway. nop ?
> 
> Huh?  Clone should be the primary.   There are certain namespaces
> that are very hard to unshare, without creating a new process.

You just said above that clone had fewer available flags than
unshare ...

anyway, could you elaborate a bit more ? I have the opposite
feeling and you gave me that impression also a few months ago.

No problem for me, i just want a way to use this stuff without


>>> we should think twice before we create just another
>>> set of flags, and if we do so, please let us change
>>> them all, including certain clone flags (and add a
>>> single compatibility wrapper for the 'old' syscalls)
>> so you would keep the unshare as is but change the set
>> of flags it's using, making sure the old ones are still
>> compatible with the new ones.
>>
>> something like this :
>>
>> int sys_unshare(int unshare_flags)
>> {
>> 	int unshare_ns_flags;
>>
>> 	unshare_ns_flags = convert_flags(unshare_flags);
>>
>> 	return sys_unshare_ns(unshare_ns_flags);
>> }
>>
>> ?
> 
> If necessary.

ok good. will check it out.
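
A minimal sketch of what such a convert_flags() helper might look like,
written as ordinary user-space C.  The NS_* values are the ones quoted
from the patch above.  The OLD_CLONE_* names are made up for this
sketch; their values are assumed to match CLONE_NEWNS, CLONE_NEWUTS and
CLONE_NEWIPC in <linux/sched.h>.  Only these three namespaces are
mapped here, on the assumption that they are the ones with an existing
clone flag; NS_PID, NS_NET and NS_USER are left unmapped.

```c
#include <assert.h>

/* Flag values quoted from the proposed patch in this thread. */
#define NS_MNT   0x00000001
#define NS_UTS   0x00000002
#define NS_IPC   0x00000004
#define NS_PID   0x00000008
#define NS_NET   0x00000010
#define NS_USER  0x00000020
#define NS_ALL   (NS_MNT|NS_UTS|NS_IPC|NS_PID|NS_NET|NS_USER)

/* Assumed values of CLONE_NEWNS, CLONE_NEWUTS and CLONE_NEWIPC; renamed
 * (hypothetical names) so the sketch builds without kernel headers. */
#define OLD_CLONE_NEWNS   0x00020000UL
#define OLD_CLONE_NEWUTS  0x04000000UL
#define OLD_CLONE_NEWIPC  0x08000000UL

/* Translate the old unshare()/clone() namespace bits into NS_* bits.
 * Bits that do not name a namespace are ignored. */
static int convert_flags(unsigned long unshare_flags)
{
    int ns_flags = 0;

    if (unshare_flags & OLD_CLONE_NEWNS)
        ns_flags |= NS_MNT;
    if (unshare_flags & OLD_CLONE_NEWUTS)
        ns_flags |= NS_UTS;
    if (unshare_flags & OLD_CLONE_NEWIPC)
        ns_flags |= NS_IPC;

    /* The result must stay within the new flag set. */
    assert((ns_flags & ~NS_ALL) == 0);
    return ns_flags;
}
```

A compatibility wrapper would then pass the result on to the new
syscall, as in the sys_unshare() outline quoted above.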

C.
_______________________________________________
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16966 is a reply to message #16882] Mon, 11 December 2006 16:09
Cedric Le Goater
Herbert Poetzl wrote:
> On Fri, Dec 08, 2006 at 01:57:38PM -0700, Eric W. Biederman wrote:
>> "Serge E. Hallyn" <serue@us.ibm.com> writes:
>>
>>> Quoting Eric W. Biederman (ebiederm@xmission.com):
>>>> clg@fr.ibm.com writes:
>>>>
>>>>> From: Cedric Le Goater <clg@fr.ibm.com>
>>>>>
>>>>> This patch adds a hashtable of nsproxy using the nsproxy as a key. 
>>>>> init_nsproxy is hashed at init with key 0. This is considered to be 
>>>>> the 'host' nsproxy.
>>>> NAK.  Which namespace do these ids live in?
> 
> well, I gave a similar answer in another email,
> so I fully agree with the NAK here ...

hmm, it wasn't that clear to me. OK, let's dig :)

>>>> It sounds like you are setting up to make the 'host' nsproxy
>>>> special and have special rules. That also sounds wrong.
>>>>
>>>> Even letting the concept of nsproxy escape to user space sounds
>>>> wrong. nsproxy is an internal space optimization. It's not struct
>>>> container and I don't think we want it to become that.
>>>>
>>>> Eric
>>> So would you advocate referring to containers just by the pid of
>>> a process containing the nsproxy, and letting userspace maintain
>>> a mapping of id's to containers through container create/enter
>>> commands? Or is there some other way you were thinking of doing
>>> this?
> 
>> There are two possible ways.
>> 1) Just use a process using the namespace.
>>    This is easiest to implement.
> 
>> 2) Have a struct pid reference in the namespace itself, 
>>    and probably an extra pointer in struct pid to find it.
>>    This is the most stable, because fork/exit won't affect 
>>    which pid you need to use.
> 
> while I agree that nsproxy is definitely the wrong
> point to tie a 'context' to, as it can contain a
> mixture of spaces from inside and outside a context,
> and it would require to forbid doing things like
> clone() with the space flags, both inside and outside
> a 'container' to allow to use them for actual vps
> applications, I think that we have to have some kind
> of handle to tie specific sets of namespaces to

this is nsproxy ... 
 
> that 'can' be an nsproxy or something different, but
> I'm absolutely unhappy with tying it to a process,

hmm, what do you mean ? nsproxy survives the death of any 
process. It's not tied to any process in particular. One
process creates it with an unshare but that's all.

the ->nsproxy in task_struct is a way to find it.

> as I already mentioned several times, that lightweight
> 'containers' do not use/have an init process, and no
> single process might survive the entire life span of
> that 'container' ...

I think there is a misunderstanding here. a 'container' 
or 'nsproxy' or what ever is a set of namespaces which
are not tied to a process. 

you can do that today on 2.6.19 with utsname. 

>> Beyond that yes it seems to make sense to let user space 
>> maintain any mapping of containers to ids.
> 
> I agree with that, but we need something to move
> around between the various spaces ...

the bind_ns syscall lets the user specify the mapping. this 
is not done by the kernel. 

I had to introduce some rules, like giving more capabilities to
some processes, but that can be changed. For the
moment, they have to live in "init_nsproxy".

> for example, Linux-VServer ties the namespaces to
> the context structure (atm) which allows userspace
> to set and enter specific spaces of a guest context
> (I assume OpenVZ does similar)

What's the big difference with nsproxy ? 

C.
_______________________________________________
Re: [patch -mm 10/17] nsproxy: add unshare_ns and bind_ns syscalls [message #16967 is a reply to message #16883] Mon, 11 December 2006 17:05
Cedric Le Goater
Herbert Poetzl wrote:
> On Fri, Dec 08, 2006 at 12:26:49PM -0700, Eric W. Biederman wrote:
>> clg@fr.ibm.com writes:
>>
>>> From: Cedric Le Goater <clg@fr.ibm.com>
>>>
>>> The following patch defines 2 new syscalls specific to nsproxy and
>>> namespaces :
>>>
>>> * unshare_ns :
>>>
>>> 	enables a process to unshare one or more namespaces. this
>>>         duplicates the unshare syscall for the moment but we
>>> 	expect to diverge when the number of namespaces increases
>> Are we out of clone flags yet?  If not this is premature.
> 
> no, but a different nevertheless related question:
> does anybody, except for 'us' use the unshare() syscall?
> 
> because if not, then why not simply extend that one
> to 64bit and be done, we probably won't need a clone64()
> but if we find we do (at some point) adding that with
> the new flags would be trivial ...
> 
> OTOH, we could also just add an unshare64() too
>
> anyway, we _will_ run out of flags in the near future

yes. that's probably the way to go. I'll rework unshare_ns() into an
unshare64(). it will give some air to the 32-bit clone() and unshare() and
will let us use the flags beyond 32 bits for namespaces.

thanks,

C.
_______________________________________________
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16969 is a reply to message #16958] Tue, 12 December 2006 03:28
ebiederm
"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> I actually have code that will let me fork a process in a new namespace today
>> without needing bind_ns.  What is more, I don't even have to be root
>> to use it.
>
> Can you elaborate?  The user namespace patches don't enforce ptrace
> yet, so you could unshare as root, become uid 500, then as uid 500
> in the original namespace ptrace the process in the new namespace.
> Is that what you're doing?  If (when) ptrace enforces the uid namespace,
> will that stop what you're doing?

sys_ptrace is allowed in 2 situations.
- The user and group identities are the same.
- The calling process has CAP_SYS_PTRACE capability.

So currently if the uid namespace enforces the user and group checks
that will prevent the first case, and is very desirable.  But it won't
stop someone with CAP_SYS_PTRACE.  Which given the normal case seems
reasonable.

Getting to the point where you can't trace what a process is doing
would probably require some additional interprocess firewalling
from something like selinux.
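
The rule described above can be written down as a small user-space
model.  This is a sketch under assumptions, not the kernel's check: the
cred_model struct, its user_ns field and may_ptrace() are hypothetical
names, and the namespace comparison models the "uid namespace enforces
the user and group checks" case discussed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical credentials record, for illustration only. */
struct cred_model {
    unsigned int uid;
    unsigned int gid;
    int user_ns;          /* made-up identifier of the uid namespace */
    bool cap_sys_ptrace;  /* models holding CAP_SYS_PTRACE */
};

/* Allowed in two situations:
 *  - the tracer holds CAP_SYS_PTRACE, or
 *  - user and group identities match, which only counts when both
 *    sides live in the same uid namespace. */
static bool may_ptrace(const struct cred_model *tracer,
                       const struct cred_model *target)
{
    assert(tracer != NULL && target != NULL);

    if (tracer->cap_sys_ptrace)
        return true;

    return tracer->user_ns == target->user_ns &&
           tracer->uid == target->uid &&
           tracer->gid == target->gid;
}
```

With this model, uid 500 in the original namespace can no longer trace
uid 500 in a new namespace, while a CAP_SYS_PTRACE holder still can.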

Eric
_______________________________________________
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16975 is a reply to message #16810] Tue, 12 December 2006 07:52
ebiederm
Cedric Le Goater <clg@fr.ibm.com> writes:

> Serge E. Hallyn wrote:
>> Well, assuming that we're using pids as identifiers, that means
>
> we can't because a process could die while the namespace is still
> referenced by another subsystem. We need some kind of id.

Think of a session, think of a process group, heck, think of threads:
a pid is not tied to one task struct.  It is absolutely not a problem
for a namespace to do get_pid(...) when it is initialized and put_pid(...)
just before it is freed.

All of the mechanisms for using pids for something like this are already
in place.

What we don't have is a fast pid to namespace transfer.  But that is just
an extra pointer in struct pid.  Really that is a trivial patch.
Giving every namespace a pid pointer in struct pid takes a little more
space than I would like but it is not a big deal.
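
The lifetime argument can be illustrated with a tiny reference-counting
model in user-space C.  Everything here is a stand-in: pid_model,
ns_model and the *_model functions are hypothetical names that only
mimic the get_pid()/put_pid() pattern mentioned above.  The point is
that the pid object outlives the task that first owned it for as long
as the namespace holds its own reference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted struct pid. */
struct pid_model {
    int nr;     /* numeric pid */
    int count;  /* number of holders */
};

static struct pid_model *get_pid_model(struct pid_model *pid)
{
    if (pid)
        pid->count++;
    return pid;
}

static void put_pid_model(struct pid_model *pid)
{
    if (!pid)
        return;
    assert(pid->count > 0);
    pid->count--;  /* a real implementation would free the object at zero */
}

/* Hypothetical namespace that names itself by a pid reference. */
struct ns_model {
    struct pid_model *pid;
};

/* Take a reference when the namespace is initialized ... */
static void ns_model_init(struct ns_model *ns, struct pid_model *pid)
{
    ns->pid = get_pid_model(pid);
}

/* ... and drop it just before the namespace is freed. */
static void ns_model_free(struct ns_model *ns)
{
    put_pid_model(ns->pid);
    ns->pid = NULL;
}
```

If the creating task exits and drops its reference, the identifier
stays valid until ns_model_free() runs.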

Eric
_______________________________________________
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16977 is a reply to message #16810] Tue, 12 December 2006 08:37
ebiederm
Cedric Le Goater <clg@fr.ibm.com> writes:

> Dave Hansen wrote:
>> On Mon, 2006-12-11 at 16:23 +0100, Cedric Le Goater wrote:
>>>> Even letting the concept of nsproxy escape to user space sounds wrong.
>>>> nsproxy is an internal space optimization.  It's not struct container
>>>> and I don't think we want it to become that.
>>> i don't agree here. we need that, so does openvz, vserver, people working
>>> on resource management.
>> 
>> I think what those projects need is _some_ way to group tasks.  I'm not
>> sure they actually need nsproxies.
>
> not only tasks. ipc, fs, etc.

What is the important aspect that you need to group?  What concept
are you trying to convey?

How do you describe a container in which someone is using the
pam_namespace module?  So different tasks in the container have
a different mount namespace?

>> Two tasks in the same container could very well have different
>> nsproxies.  The nsproxy defines how the pid namespace, and pid<->task
>> mappings happen for a given task. 
>
> not only. there are other namespaces in nsproxy.

The point is that there is not a one to one mapping between containers
and nsproxies.  There are likely to be more nsproxies than containers.

>> The init process for a container is
>> special and might actually appear in more than one pid namespace, while
>> its children might only appear in one.  That means that this init
>> process's nsproxy can and should actually be different from its
>> children's.  This is despite the fact that they are in the same
>> container.
>> 
>> If we really need this 'container' grouping, it can easily be something
>> pointed to _by_ the nsproxy, but it shouldn't _be_ the nsproxy.
>
> ok so let's add a container object, containing a nsproxy and add 
> another indirection ...

Well that isn't what Dave suggested, and I don't think it will give
you what you want.

Eric
_______________________________________________
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16978 is a reply to message #16810] Tue, 12 December 2006 08:57
ebiederm
Kirill Korotaev <dev@sw.ru> writes:

>> 
>> I think what those projects need is _some_ way to group tasks.  I'm not
>> sure they actually need nsproxies.
>> 
>> Two tasks in the same container could very well have different
>> nsproxies.
> what is container then from your POV?

A nested instance of user space.  User space may unshare things
such as the mount namespace so it can give users the ability to
control their own mounts and the like.

>> The nsproxy defines how the pid namespace, and pid<->task
>> mappings happen for a given task.  The init process for a container is
>> special and might actually appear in more than one pid namespace, while
>> its children might only appear in one.  That means that this init
>> process's nsproxy can and should actually be different from its
>> children's.  This is despite the fact that they are in the same
>> container.
> nsproxy has references to all namespaces, not just pid namespace.
> Thus it is a container "view" effectively.
> If container is something different, then please define it.

nsproxy has exactly one instance of all namespaces.  A container
in the general case can hold other containers, and near containers
(like processes with separate mount namespaces).  As well as
processes.

So nsproxy currently captures the common case for containers but not
the general case.

>> If we really need this 'container' grouping, it can easily be something
>> pointed to _by_ the nsproxy, but it shouldn't _be_ the nsproxy.
> You can add another indirection if you really want it so much...
> But is it required?
> We created nsproxy which adds another level of indirection, but from a
> performance POV it is questionable. I can say that we had a nice experience,
> when adding a single dereference in TCP code resulted in ~0.5% performance
> degradation.

I totally agree with that, nsproxy is something we need to watch from
a performance point of view.  nsproxy is primarily a space
optimization to keep from bloating task struct, and possibly a fork
time optimization. At least at the point we added it no one could
measure overhead from using it.

That is one of the reasons I don't want nsproxy to become explicit and
be exported to user space.  So if it is a performance problem we can
change the implementation without affecting users.

Eric
_______________________________________________
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16981 is a reply to message #16954] Tue, 12 December 2006 07:09
Cedric Le Goater
Dave Hansen wrote:
> On Mon, 2006-12-11 at 16:23 +0100, Cedric Le Goater wrote:
>>> Even letting the concept of nsproxy escape to user space sounds wrong.
>>> nsproxy is an internal space optimization.  It's not struct container
>>> and I don't think we want it to become that.
>> i don't agree here. we need that, so does openvz, vserver, people working
>> on resource management.
> 
> I think what those projects need is _some_ way to group tasks.  I'm not
> sure they actually need nsproxies.

not only tasks. ipc, fs, etc.

> Two tasks in the same container could very well have different
> nsproxies.  The nsproxy defines how the pid namespace, and pid<->task
> mappings happen for a given task. 

not only. there are other namespaces in nsproxy.

> The init process for a container is
> special and might actually appear in more than one pid namespace, while
> its children might only appear in one.  That means that this init
> process's nsproxy can and should actually be different from its
> children's.  This is despite the fact that they are in the same
> container.
> 
> If we really need this 'container' grouping, it can easily be something
> pointed to _by_ the nsproxy, but it shouldn't _be_ the nsproxy.

ok so let's add a container object, containing a nsproxy and add 
another indirection ...

C.
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16982 is a reply to message #16921] Tue, 12 December 2006 07:11
Cedric Le Goater
Serge E. Hallyn wrote:
> Quoting Serge E. Hallyn (serue@us.ibm.com):
>> Quoting Eric W. Biederman (ebiederm@xmission.com):
>>> Herbert Poetzl <herbert@13thfloor.at> writes:
>>>>> Beyond that yes it seems to make sense to let user space
>>>>> maintain any mapping of containers to ids.
>>>> I agree with that, but we need something to move
>>>> around between the various spaces ...
>>> If you have CAP_SYS_PTRACE or you have a child process
>>> in a container you can create another with ptrace.
>>>
>>> Now I don't mind optimizing that case, with something like
>>> the proposed bind_ns syscall.  But we need to be darn certain
>>> why it is safe, and does not change the security model that
>>> we currently have.
>> Sigh, and that's going to have to be a discussion per namespace.
> 
> Well, assuming that we're using pids as identifiers, that means

we can't because a process could die while the namespace is still
referenced by another subsystem. We need some kind of id.

> we can only enter descendant namespaces, which means 'we' must
> have created them.  So anything we could do by entering the ns,
> we could have done by creating it as well, right?

Re: [patch -mm 08/17] nsproxy: add hashtable [message #16983 is a reply to message #16969] Tue, 12 December 2006 15:29
serue
Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serue@us.ibm.com> writes:
> 
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> I actually have code that will let me fork a process in a new namespace today
> >> without needing bind_ns.  What is more I don't even have to be root
> >> to use it.
> >
> > Can you elaborate?  The user namespace patches don't enforce ptrace
> > yet, so you could unshare as root, become uid 500, then as uid 500
> > in the original namespace ptrace the process in the new namespace.
> > Is that what you're doing?  If (when) ptrace enforces the uid namespace,
> > will that stop what you're doing?
> 
> sys_ptrace is allowed in 2 situations.
> - The user and group identities are the same.
> - The calling process has CAP_SYS_PTRACE capability.
> 
> So currently if the uid namespace enforces the user and group checks
> that will prevent the first case, and is very desirable.  But it won't
> stop someone with CAP_SYS_PTRACE.  Which given the normal case seems
> reasonable.

Yes, I was forgetting that inter-container ptrace is generally
inhibited by lack of a handle to processes in the other container.
So:

	. in checkpoint/restart usage, the normal CAP_SYS_PTRACE
	  semantics is fine
	. inside a vserver, the normal CAP_SYS_PTRACE is fine
	. in general, a process inside one vserver cannot reference
	  a process in another vserver, so we don't need to worry
	  about ptrace permissions at all
	. however, if we want to (as per emails yesterday) provide
	  some bit of enforcement of limits from parent namespaces
	  to child namespaces - where a pid is in fact available for
	  at least the init process (and, depending on our final
	  implementation, perhaps all processes) - then we need
	  something more.
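
The two sys_ptrace cases quoted above, together with the uid-namespace check being discussed, can be written down as a toy predicate. This is a sketch under assumptions, not the kernel's actual permission check, which looks at more than this. `Cred`, `may_ptrace`, the `user_ns` label and the `enforce_user_ns` switch are all hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cred:
    uid: int
    gid: int
    user_ns: str = "init"          # hypothetical uid namespace label
    cap_sys_ptrace: bool = False


def may_ptrace(tracer: Cred, target: Cred, enforce_user_ns: bool = True) -> bool:
    """Toy version of the two cases: capability, or matching identities."""
    if tracer.cap_sys_ptrace:
        return True
    same_ids = (tracer.uid, tracer.gid) == (target.uid, target.gid)
    if enforce_user_ns:
        # uid 500 in one uid namespace is not uid 500 in another.
        return same_ids and tracer.user_ns == target.user_ns
    return same_ids


outside = Cred(uid=500, gid=500, user_ns="init")
inside = Cred(uid=500, gid=500, user_ns="ns1")
admin = Cred(uid=0, gid=0, cap_sys_ptrace=True)

print(may_ptrace(outside, inside, enforce_user_ns=False))  # ids match, ns ignored
print(may_ptrace(outside, inside))                         # ns now part of identity
print(may_ptrace(admin, inside))                           # capability still wins
```

The last call is the case the list above ends on: a uid namespace check alone does not stop a holder of CAP_SYS_PTRACE, so limiting a parent would need an additional mechanism.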

As you say, selinux permissions would be one way to obtain this.

> Getting to the point where you can't trace what a process is doing
> would probably require some additional interprocess firewalling
> from something like selinux.

Yup.

thanks,
-serge
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16984 is a reply to message #16981] Tue, 12 December 2006 15:45
serue
Quoting Cedric Le Goater (clg@fr.ibm.com):
> Dave Hansen wrote:
> > On Mon, 2006-12-11 at 16:23 +0100, Cedric Le Goater wrote:
> >>> Even letting the concept of nsproxy escape to user space sounds wrong.
> >>> nsproxy is an internal space optimization.  It's not struct container
> >>> and I don't think we want it to become that.
> >> i don't agree here. we need that, so does openvz, vserver, people working
> >> on resource management.
> > 
> > I think what those projects need is _some_ way to group tasks.  I'm not
> > sure they actually need nsproxies.
> 
> not only tasks. ipc, fs, etc.
> 
> > Two tasks in the same container could very well have different
> > nsproxies.  The nsproxy defines how the pid namespace, and pid<->task
> > mappings happen for a given task. 
> 
> not only. there are other namespaces in nsproxy.

Right, and as Eric has pointed out, you may well want to use one id to
refer to several nsproxies - for instance if you are using unshare
to provide per-user private mount namespaces using pam_namespace.so
(that's mostly for LSPP systems right now, but I do this on my laptop
too).  All my accounts are in the same 'container', but have different
mount namespaces, hence different nsproxies.

> > The init process for a container is
> > special and might actually appear in more than one pid namespace, while
> > its children might only appear in one.  That means that this init
> > process's nsproxy can and should actually be different from its
> > children's.  This is despite the fact that they are in the same
> > container.
> > 
> > If we really need this 'container' grouping, it can easily be something
> > pointed to _by_ the nsproxy, but it shouldn't _be_ the nsproxy.
> 
> ok so let's add a container object, containing a nsproxy and add 
> another indirection ...

No thanks.
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16989 is a reply to message #16954] Tue, 12 December 2006 08:43
dev

>>>Even letting the concept of nsproxy escape to user space sounds wrong.
>>>nsproxy is an internal space optimization.  It's not struct container
>>>and I don't think we want it to become that.
>>
>>i don't agree here. we need that, so does openvz, vserver, people working
>>on resource management. 
> 
> 
> I think what those projects need is _some_ way to group tasks.  I'm not
> sure they actually need nsproxies.
> 
> Two tasks in the same container could very well have different
> nsproxies.
what is container then from your POV?

> The nsproxy defines how the pid namespace, and pid<->task
> mappings happen for a given task.  The init process for a container is
> special and might actually appear in more than one pid namespace, while
> its children might only appear in one.  That means that this init
> process's nsproxy can and should actually be different from its
> children's.  This is despite the fact that they are in the same
> container.
nsproxy has references to all namespaces, not just pid namespace.
Thus it is a container "view" effectively.
If container is something different, then please define it.

> If we really need this 'container' grouping, it can easily be something
> pointed to _by_ the nsproxy, but it shouldn't _be_ the nsproxy.
You can add another indirection if really want it so much...
But is it required?
We created nsproxy which adds another level of indirection, but from performance POV
it is questionable. I can say that we had a nice experience, when adding
a single dereference in TCP code resulted in ~0.5% performance degradation.

Thanks,
Kirill

Re: [patch -mm 08/17] nsproxy: add hashtable [message #16998 is a reply to message #16928] Tue, 12 December 2006 18:29
Cedric Le Goater
>>> If someone's permissions to various objects do not depend on the namespace
>>> they are in quite possibly this is a non-issue.  If we actually depend on
>>> the isolation to keep things secure enter is a setup for a first rate escape.
>> I don't believe the isolation can be effective between two namespaces
>> where one is an ancestor of another.  It can be so long as one isn't
>> the ancestor of another, but then we're not allowing either to enter
>> the other namespace.  So it's not a problem.
> 
> Reasonable.  
> 
>> The bind_ns() proposed by Cedric is stricter, only allowing nsid 0 to
>> switch namespaces.  So it may be overly restrictive, and does introduce
>> a new global namespace, but it is safe.
> 
> I will look a little more.  There are a lot of patches out there that need
> review.   What disturbs me a little is that with ptrace we have an existing
> mechanism that can do everything we want enter or bind_ns to be able to do.

Eric, you have this habit of flooding us with email whenever a patchset is
sent on this topic. It is a bad habit. Please take some time to look at it
before. There is work behind it and it tries to address some issues.  

This patchset has been sent on container@ as a proposal for -mm. I'll try
to make a summary of how we can improve next one to move forward. 

I still need to read all your emails :)

thanks,

C.
Re: [patch -mm 08/17] nsproxy: add hashtable [message #17000 is a reply to message #16956] Tue, 12 December 2006 23:22
Herbert Poetzl
On Mon, Dec 11, 2006 at 04:01:15PM -0600, Serge E. Hallyn wrote:
> Quoting Eric W. Biederman (ebiederm@xmission.com):
> > "Serge E. Hallyn" <serue@us.ibm.com> writes:
> > 
> > > Quoting Eric W. Biederman (ebiederm@xmission.com):
> > >
> > > Yeah, that occurred to me, but it doesn't seem like we can possibly make
> > > sufficient guarantees to the client to make this worthwhile.
> > >
> > > I'd love to be wrong about that, but if nothing else we can't prove to
> > > the client that they're running on an unhacked host.  So the host admin
> > > will always have to be trusted.
> > 
> > To some extent that is true.  Although all security models we have
> > currently fall down if you hack the kernel, or run your kernel
> > in a hacked virtual environment.  It would be nice if under normal
> > conditions you could mount an encrypted filesystem only in a container
> > and not have concerns of those files escaping.
> 
> Hmm, well perhaps I'm being overly pessimistic - IBM research did have a
> demo based on TPM of remote attestation, which may be usable for
> ensuring that you're connecting to a service on your virtual machine on
> a certain (unhacked) kernel on particular hardware, in which case what
> you're talking about may be possible - given a stringent initial
> environment (i.e. not the 'gimme $20/month for a hosted partition in
> arizona' environment).

interesting, how would you _ensure_ from inside
such an environment, that nobody tampered with
the kernel you are running on?

> Given that, perhaps having a virtual machine with access to encrypted
> storage - safe from the host machine admins - may not be unattainable
> after all.  And given that, it would be worth designing the ns_enter()
> system call so that a parent cannot enter some child namespace.

we currently call this Context Privacy, and it
is partially implemented, but of course, it
does only work if the kernel is known good

> > Which would probably be a matter of having a separate uid_ns and not
> > allowing process outside of your container to have any permissions in
> > that filesystem.
> 
> Yup.  Or even just a separate uid_ns and an ecryptfs partition, so
> that the host can back up the encrypted data incrementally (per file,
> i.e. not just the whole dmcrypted loop file).

it's simple to avoid access to certain 'tagged'
devices and/or filesystems, it's hard to handle
kernel modifications or even simple things like
reading the kernel memory ...

best,
Herbert

> -serge
Re: [patch -mm 08/17] nsproxy: add hashtable [message #17014 is a reply to message #16989] Wed, 13 December 2006 04:55
Herbert Poetzl
On Tue, Dec 12, 2006 at 11:43:38AM +0300, Kirill Korotaev wrote:
> >>>Even letting the concept of nsproxy escape to user space sounds wrong.
> >>>nsproxy is an internal space optimization.  It's not struct container
> >>>and I don't think we want it to become that.

> >>i don't agree here. we need that, so does openvz, vserver, people
> >>working on resource management.
> > 
> > 
> > I think what those projects need is _some_ way to group tasks. I'm
> > not sure they actually need nsproxies.
> > 
> > Two tasks in the same container could very well have different
> > nsproxies.

and typically, they will ...

> what is container then from your POV?

from my PoV, a container is something keeping
processes _inside_ which basically requires
the following elements:

 - isolation from other containers
 - virtualization of unique elements
 - limitation on resources
 - policy on all interfaces

the current spaces mostly address the isolation
and to some degree, the virtualization, which
is a good thing, but the container also requires
the resource limitation and the policy, to handle
interfaces to the outside (should not be new to
you, actually :)

so the container (may it be represented by a 
structure or not), may reference an nsproxy
(as we do in the 2.6.19 versions of Linux-VServer)
but an nsproxy is not the proper element to
define a container ..

we also want to be able to have sub spaces inside
a container, as long as they do not interfere or
overcome the limitations and policy
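
To make the distinction above concrete, here is a toy model in which a container references an nsproxy but is not one. `Container`, `max_tasks` and `allowed_devices` are hypothetical stand-ins for "limitation on resources" and "policy"; no such structure is claimed to exist in the kernel or in Linux-VServer.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    nsproxy: str                     # label standing in for a struct nsproxy


@dataclass
class Container:
    """Hypothetical grouping object: not an existing kernel structure."""
    name: str
    base_nsproxy: str                # the nsproxy the container references
    max_tasks: int                   # stand-in for resource limitation
    allowed_devices: frozenset = frozenset()   # stand-in for policy
    tasks: list = field(default_factory=list)

    def add(self, task: Task) -> None:
        # The limit belongs to the container, whatever nsproxy the task uses.
        if len(self.tasks) >= self.max_tasks:
            raise RuntimeError("container task limit reached")
        self.tasks.append(task)

    def nsproxies(self) -> set:
        return {t.nsproxy for t in self.tasks}


guest = Container("guest", base_nsproxy="nsp0", max_tasks=2)
guest.add(Task("init", "nsp0"))
guest.add(Task("login", "nsp1"))     # sub space: private mount namespace

try:
    guest.add(Task("extra", "nsp0"))
    limit_enforced = False
except RuntimeError:
    limit_enforced = True

print(sorted(guest.nsproxies()), limit_enforced)
```

Two tasks with different nsproxies still count against the same limit here, which is the sense in which an nsproxy is a view for some processes rather than the container itself.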

> > The nsproxy defines how the pid namespace, and pid<->task
> > mappings happen for a given task.  The init process for a container is
> > special and might actually appear in more than one pid namespace, while
> > its children might only appear in one.  That means that this init
> > process's nsproxy can and should actually be different from its
> > children's.  This is despite the fact that they are in the same
> > container.

> nsproxy has references to all namespaces, not just pid namespace.
> Thus it is a container "view" effectively.

it is a view into the world of one or more processes,
but not necessarily the view of all processes inside
a container :)

> If container is something different, then please define it.

see above ...

> > If we really need this 'container' grouping, it can easily be something
> > pointed to _by_ the nsproxy, but it shouldn't _be_ the nsproxy.

> You can add another indirection if really want it so much...
> But is it required?
> We created nsproxy which adds another level of indirection, but from
> performance POV it is questionable. 

I'm not very happy with the nsproxy abstraction,
as I think it would be better handled per task,
and I still have no real-world test results showing what
overhead the nsproxy indirection causes

> I can say that we had a nice experience, when adding a single
> dereference in TCP code resulted in ~0.5% performance degradation.

yes, that is what I fear is happening right now
with the nsproxy ... but I think we need to test
that, and if it makes sense, switch to task direct
spaces (as we had before), just more of them ...

best,
Herbert

> Thanks,
> Kirill
> 
Re: [patch -mm 08/17] nsproxy: add hashtable [message #17022 is a reply to message #16810] Wed, 13 December 2006 18:53
ebiederm
Cedric Le Goater <clg@fr.ibm.com> writes:

> Eric W. Biederman wrote:
>> 
>> What we don't have is a fast pid to namespace transfer. But that is just
>> an extra pointer in struct pid.  Really that is a trivial patch.
>> Giving every namespace a pid pointer in struct pid takes a little more
>> space than I would like but it is not a big deal.
>
> I'm not sure I understand how you want to do this. 
>
> Let me try : you would add a 'struct pid pid' field to all namespaces and
> assign that 'pid' field  with the struct pid of the task creating the
> namespace ? 

Yes a struct pid *pid field, that we did the proper reference counting
on.

As for which pid to assign, that is a little trickier.  The struct pid
of the task creating the namespace is the obvious choice and that will
always work for clone.  For unshare that would only work if we added
the restriction you can't unshare if someone already has used that pid
for that kind of namespace.
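
The idea can be modelled in a few lines of Python. This is a toy sketch, not kernel code: `get_pid` and `put_pid` borrow the names used in this thread, while the `Pid` and `Namespace` classes and the per-kind `namespaces` table (standing in for the extra pointer in struct pid) are assumptions made for illustration.

```python
class Pid:
    """Toy stand-in for struct pid with a reference count."""

    def __init__(self, nr: int) -> None:
        self.nr = nr
        self.count = 1               # reference held by the task itself
        self.namespaces = {}         # kind -> namespace identified by this pid


def get_pid(pid: Pid) -> Pid:
    pid.count += 1
    return pid


def put_pid(pid: Pid) -> None:
    pid.count -= 1


class Namespace:
    def __init__(self, kind: str, creator: Pid) -> None:
        if kind in creator.namespaces:
            # The restriction suggested for unshare: one namespace per kind per pid.
            raise ValueError(f"pid {creator.nr} already names a {kind} namespace")
        self.kind = kind
        self.pid = get_pid(creator)  # pinned for the namespace's lifetime
        creator.namespaces[kind] = self

    def destroy(self) -> None:
        del self.pid.namespaces[self.kind]
        put_pid(self.pid)


task_pid = Pid(1234)
mnt_ns = Namespace("mnt", task_pid)
uts_ns = Namespace("uts", task_pid)

try:
    Namespace("mnt", task_pid)       # a second unshare of the same kind
    second_mnt_allowed = True
except ValueError:
    second_mnt_allowed = False

count_while_alive = task_pid.count
mnt_ns.destroy()
print(count_while_alive, task_pid.count, second_mnt_allowed)
```

Because the namespace holds its own reference, the pid object outlives the task that created it for as long as the namespace exists, which is the point made about sessions and process groups.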

Eric
Re: [patch -mm 08/17] nsproxy: add hashtable [message #17025 is a reply to message #16975] Wed, 13 December 2006 14:44
Cedric Le Goater
Eric W. Biederman wrote:
> Cedric Le Goater <clg@fr.ibm.com> writes:
> 
>> Serge E. Hallyn wrote:
>>> Well, assuming that we're using pids as identifiers, that means
>> we can't because a process could die while the namespace is still
>> referenced by an other subsystem. We need some kind of id.
> 
> Think of a session, think of a process group, heck, think of threads:
> a pid is not tied to one task struct.  It is absolutely not a problem
> for a namespace to do get_pid(...) when it is initialized and put_pid(...)
> just before it is freed.
> 
> All of the mechanisms for using pids for something like this are already
> in place.
> 
> What we don't have is a fast pid to namespace transfer. But that is just
> an extra pointer in struct pid.  Really that is a trivial patch.
> Giving every namespace a pid pointer in struct pid takes a little more
> space than I would like but it is not a big deal.

I'm not sure I understand how you want to do this. 

Let me try : you would add a 'struct pid pid' field to all namespaces and
assign that 'pid' field  with the struct pid of the task creating the
namespace ? 

C.
Re: [patch -mm 08/17] nsproxy: add hashtable [message #17026 is a reply to message #16984] Wed, 13 December 2006 15:00
Cedric Le Goater
Serge E. Hallyn wrote:
> Quoting Cedric Le Goater (clg@fr.ibm.com):
>> Dave Hansen wrote:
>>> On Mon, 2006-12-11 at 16:23 +0100, Cedric Le Goater wrote:
>>>>> Even letting the concept of nsproxy escape to user space sounds wrong.
>>>>> nsproxy is an internal space optimization.  It's not struct container
>>>>> and I don't think we want it to become that.
>>>> i don't agree here. we need that, so does openvz, vserver, people working
>>>> on resource management.
>>> I think what those projects need is _some_ way to group tasks.  I'm not
>>> sure they actually need nsproxies.
>> not only tasks. ipc, fs, etc.
>>
>>> Two tasks in the same container could very well have different
>>> nsproxies.  The nsproxy defines how the pid namespace, and pid<->task
>>> mappings happen for a given task.
>> not only. there are other namespaces in nsproxy.
> 
> Right, and as Eric has pointed out, you may well want to use one id to
> refer to several nsproxies - for instance if you are using unshare
> to provide per-user private mount namespaces using pam_namespace.so
> (that's mostly for LSPP systems right now, but I do this on my laptop
> too).  All my accounts are in the same 'container', but have different
> mount namespaces, hence different nsproxies.

I think we have a definition issue here : what is a 'container' ? 


I don't see any issue with the above scenario. unsharing mount namespace
results in the creation of a new nsproxy which will require a new identifier
in order to find this new mount namespace. 

so yes, different mount namespaces, hence different nsproxies, hence 
different ids if you want to find that new  mount namespace.

>>> The init process for a container is
>>> special and might actually appear in more than one pid namespace, while
>>> its children might only appear in one.  That means that this init
>>> process's nsproxy can and should actually be different from its
>>> children's.  This is despite the fact that they are in the same
>>> container.
>>>
>>> If we really need this 'container' grouping, it can easily be something
>>> pointed to _by_ the nsproxy, but it shouldn't _be_ the nsproxy.
>> ok so let's add a container object, containing a nsproxy and add
>> another indirection ...
> 
> No thanks.

exactly.

C.

Re: [patch -mm 08/17] nsproxy: add hashtable [message #17027 is a reply to message #16977] Wed, 13 December 2006 15:02
Cedric Le Goater
Eric W. Biederman wrote:
> Cedric Le Goater <clg@fr.ibm.com> writes:
> 
>> Dave Hansen wrote:
>>> On Mon, 2006-12-11 at 16:23 +0100, Cedric Le Goater wrote:
>>>>> Even letting the concept of nsproxy escape to user space sounds wrong.
>>>>> nsproxy is an internal space optimization.  It's not struct container
>>>>> and I don't think we want it to become that.
>>>> i don't agree here. we need that, so does openvz, vserver, people working
>>>> on resource management.
>>> I think what those projects need is _some_ way to group tasks.  I'm not
>>> sure they actually need nsproxies.
>> not only tasks. ipc, fs, etc.
> 
> What is the important aspect that you need to group.  What concept
> are you trying to convey?
> 
> How do you describe a container in which someone is using the
> pam_namespace module?  So different tasks in the container have
> a different mount namespace?

let's define a container first. I'm not sure for what you are using that
term.
 
>>> Two tasks in the same container could very well have different
>>> nsproxies.  The nsproxy defines how the pid namespace, and pid<->task
>>> mappings happen for a given task.
>> not only. there are other namespaces in nsproxy.
> 
> The point is that there is not a one to one mapping between containers
> and nsproxies.  There are likely to be more nsproxies than containers.

again. please explain the difference that you see between a container 
and a nsproxy. I don't get it and i might be missing something important
doing this short cut.

C.
Re: [patch -mm 08/17] nsproxy: add hashtable [message #17028 is a reply to message #17014] Wed, 13 December 2006 15:17
Cedric Le Goater
Herbert Poetzl wrote:
> On Tue, Dec 12, 2006 at 11:43:38AM +0300, Kirill Korotaev wrote:
>>>>> Even letting the concept of nsproxy escape to user space sounds wrong.
>>>>> nsproxy is an internal space optimization.  It's not struct container
>>>>> and I don't think we want it to become that.
> 
>>>> i don't agree here. we need that, so does openvz, vserver, people
>>>> working on resource management.
>>>
>>> I think what those projects need is _some_ way to group tasks. I'm
>>> not sure they actually need nsproxies.
>>>
>>> Two tasks in the same container could very well have different
>>> nsproxies.
> 
> and typically, they will ...

that means we are missing a container object then, a vps, a vcontext, a 
vsomething. nop ?

>> what is container then from your POV?
> 
> from my PoV, a container is something keeping
> processes _inside_ which basically requires
> the following elements:
> 
>  - isolation from other containers
>  - virtualization of unique elements
>  - limitation on resources
>  - policy on all interfaces
> 
> the current spaces mostly address the isolation
> and to some degree, the virtualization, which
> is a good thing, but the container also requires
> the resource limitation and the policy, to handle
> interfaces to the outside (should not be new to
> you, actually :)
>
> so the container (may it be represented by a 
> structure or not), may reference an nsproxy
> (as we do in the 2.6.19 versions of Linux-VServer)
> but an nsproxy is not the proper element to
> define a container ..

agree. it's not complete.

should we address that by introducing a new object ? 
could that be done on per-product basis ? I mean like
in a driver model. 

> we also want to be able to have sub spaces inside
> a container, as long as they do not interfere or
> overcome the limitations and policy
> 
>>> The nsproxy defines how the pid namespace, and pid<->task
>>> mappings happen for a given task.  The init process for a container is
>>> special and might actually appear in more than one pid namespace, while
>>> its children might only appear in one.  That means that this init
>>> process's nsproxy can and should actually be different from its
>>> children's.  This is despite the fact that they are in the same
>>> container.
> 
>> nsproxy has references to all namespaces, not just pid namespace.
>> Thus it is a container "view" effectively.
> 
> it is a view into the world of one or more processes,
> but not necessarily the view of all processes inside
> a container :)
> 
>> If container is something different, then please define it.
> 
> see above ...
> 
>>> If we really need this 'container' grouping, it can easily be something
>>> pointed to _by_ the nsproxy, but it shouldn't _be_ the nsproxy.
> 
>> You can add another indirection if really want it so much...
>> But is it required?
>> We created nsproxy which adds another level of indirection, but from
>> performance POV it is questionable. 
> 
> I'm not very happy with the nsproxy abstraction,
> as I think it would be better handled per task,
> and I still have no real-world test results showing what
> overhead the nsproxy indirection causes
> 
>> I can say that we had a nice experience, when adding a single
>> dereference in TCP code resulted in ~0.5% performance degradation.
> 
> yes, that is what I fear is happening right now
> with the nsproxy ... but I think we need to test
> that, and if it makes sense, switch to task direct
> spaces (as we had before), just more of them ...

getting some figures would be nice and we might also be able
to improve the current nsproxy model.


C.
Re: [patch -mm 08/17] nsproxy: add hashtable [message #17059 is a reply to message #16810] Thu, 14 December 2006 21:08
ebiederm
Cedric Le Goater <clg@fr.ibm.com> writes:

>>> Let me try : you would add a 'struct pid pid' field to all namespaces and
>>> assign that 'pid' field  with the struct pid of the task creating the
>>> namespace ?
>> 
>> Yes a struct pid *pid field, that we did the proper reference counting
>> on.
>
> sure.
>  
>> As for which pid to assign, that is a little trickier.  The struct pid
>> of the task creating the namespace is the obvious choice and that will
>> always work for clone.  For unshare that would only work if we added
>> the restriction you can't unshare if someone already has used that pid
>> for that kind of namespace.
>
> clone will also be an issue if more than one namespace is unshared. Do 
> you use the same 'struct pid*' for each namespace ? hmm, it feels also 
> wrong.
>
> having an id field in the namespace and using a bind_ns like syscall
> to let the user assign whatever id he wants to, doesn't seem to be
> such a bad idea.

I think I would probably have suggested simply taking the next
available id in that case in practice.  Partly it depends on exactly
what we are trying to do with these.
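
The two options, a user-chosen id versus the next available one, can be compared in a small toy table. This is a sketch only and says nothing about the interface of the proposed bind_ns syscall; `NamespaceIds`, `bind` and the choice to leave id 0 for the initial namespace are assumptions made for illustration.

```python
from typing import Dict, Optional


class NamespaceIds:
    """Hypothetical id table; not the interface of the proposed bind_ns."""

    def __init__(self) -> None:
        self._by_id: Dict[int, str] = {}
        self._next = 1               # assumption: 0 is left for the initial namespace

    def bind(self, ns: str, wanted: Optional[int] = None) -> int:
        if wanted is None:
            # Kernel picks: skip ids that a user already claimed.
            while self._next in self._by_id:
                self._next += 1
            wanted = self._next
        elif wanted in self._by_id:
            raise ValueError(f"id {wanted} is already in use")
        self._by_id[wanted] = ns
        return wanted


ids = NamespaceIds()
first = ids.bind("mnt-a")                 # next available id
chosen = ids.bind("mnt-b", wanted=2)      # user-assigned id
third = ids.bind("mnt-c")                 # skips the id the user took

try:
    ids.bind("mnt-d", wanted=2)
    clash_rejected = False
except ValueError:
    clash_rejected = True

print(first, chosen, third, clash_rejected)
```

Mixing both modes is what forces the skip loop and the clash check; a scheme with only kernel-assigned ids would need neither.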

But I do agree getting those last little details right so some corner
case doesn't feel wrong is hard.  That is why I try to put off
this kind of thing until as much is known about how we are going
to use it as possible.  So we can make a good decision and solve
practical problems, and not theoretical ones.

Eric
Re: [patch -mm 08/17] nsproxy: add hashtable [message #17062 is a reply to message #17022] Thu, 14 December 2006 13:17
Cedric Le Goater
>> Let me try : you would add a 'struct pid pid' field to all namespaces and
>> assign that 'pid' field  with the struct pid of the task creating the
>> namespace ?
> 
> Yes a struct pid *pid field, that we did the proper reference counting
> on.

sure.
 
> As for which pid to assign, that is a little trickier.  The struct pid
> of the task creating the namespace is the obvious choice and that will
> always work for clone.  For unshare that would only work if we added
> the restriction you can't unshare if someone already has used that pid
> for that kind of namespace.

clone will also be an issue if more than one namespace is unshared. Do 
you use the same 'struct pid*' for each namespace ? hmm, it feels also 
wrong.

having an id field in the namespace and using a bind_ns like syscall
to let the user assign whatever id he wants to, doesn't seem to be
such a bad idea.

C. 
Re: [patch -mm 08/17] nsproxy: add hashtable [message #17100 is a reply to message #17000] Wed, 20 December 2006 06:12
serue
Quoting Herbert Poetzl (herbert@13thfloor.at):
> On Mon, Dec 11, 2006 at 04:01:15PM -0600, Serge E. Hallyn wrote:
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> > > "Serge E. Hallyn" <serue@us.ibm.com> writes:
> > >
> > > > Quoting Eric W. Biederman (ebiederm@xmission.com):
> > > >
> > > > Yeah, that occurred to me, but it doesn't seem like we can possibly make
> > > > sufficient guarantees to the client to make this worthwhile.
> > > >
> > > > I'd love to be wrong about that, but if nothing else we can't prove to
> > > > the client that they're running on an unhacked host.  So the host admin
> > > > will always have to be trusted.
> > >
> > > To some extent that is true.  Although all security models we have
> > > currently fall down if you hack the kernel, or run your kernel
> > > in a hacked virtual environment.  It would be nice if under normal
> > > conditions you could mount an encrypted filesystem only in a container
> > > and not have concerns of those files escaping.
> >
> > Hmm, well perhaps I'm being overly pessimistic - IBM research did have a
> > demo based on TPM of remote attestation, which may be usable for
> > ensuring that you're connecting to a service on your virtual machine on
> > a certain (unhacked) kernel on particular hardware, in which case what
> > you're talking about may be possible - given a stringent initial
> > environment (i.e. not the 'gimme $20/month for a hosted partition in
> > arizona' environment).
> 
> interesting, how would you _ensure_ from inside
> such an environment, that nobody tampered with
> the kernel you are running on?

Sorry, it took a while to find the best reference, but I guess this
would be it:

http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/459e9a2b1f668aee85256f330067589f

Another description is

http://domino.research.ibm.com/comm/research_people.nsf/pages/sailer.ima.html

I guess it was in 2004 that they did a demo of remote attestation at the
RSA conference, as described in the third-to-last paragraph in
http://cio.co.nz/cio.nsf/0/03943645293DB008CC256E47005D8EA2?OpenDocument
and in
http://domino.research.ibm.com/comm/pr.nsf/pages/news.20040218_linux.html

-serge