Re: [patch -mm 08/17] nsproxy: add hashtable [message #16928 is a reply to message #16810] |
Mon, 11 December 2006 20:34 |
ebiederm
"Serge E. Hallyn" <serue@us.ibm.com> writes:
> Quoting Eric W. Biederman (ebiederm@xmission.com):
>
> Yeah, that occurred to me, but it doesn't seem like we can possibly make
> sufficient guarantees to the client to make this worthwhile.
>
> I'd love to be wrong about that, but if nothing else we can't prove to
> the client that they're running on an unhacked host. So the host admin
> will always have to be trusted.
To some extent that is true. Although all security models we have
currently fall down if you hack the kernel, or run your kernel
in a hacked virtual environment. It would be nice if under normal
conditions you could mount an encrypted filesystem only in a container
and not have concerns of those files escaping.
Which would probably be a matter of having a separate uid_ns and not
allowing processes outside of your container to have any permissions in
that filesystem.
>> 2) When we only partially enter a namespace it is very easy for additional
>> properties to enter that namespace. For example we enter the pid
>> namespace and the mount namespace, but keep our current working directory
>> in the previous namespace. Then a process in the restricted namespace
>> can get out by cd into /proc/<?>/cwd.
>
> Yup, entering existing namespaces should be all-or-nothing.
A truly all-or-nothing approach has the problem that there is no external
input into the container, and a very controlled external input
to the existing container is what this is about.
>> If someone's permissions to various objects do not depend on the namespace
>> they are in, quite possibly this is a non-issue. If we actually depend on
>> the isolation to keep things secure, enter is a setup for a first-rate escape.
>
> I don't believe the isolation can be effective between two namespaces
> where one is an ancestor of another. It can be so long as one isn't
> the ancestor of another, but then we're not allowing either to enter
> the other namespace. So it's not a problem.
Reasonable.
> The bind_ns() proposed by Cedric is stricter, only allowing nsid 0 to
> switch namespaces. So it may be overly restrictive, and does introduce
> a new global namespace, but it is safe.
I will look a little more. There are a lot of patches out there that need
review. What disturbs me a little is that with ptrace we have an existing
mechanism that can do everything we want enter or bind_ns to be able to do.
I actually have code that will let me fork a process in a new namespace today
without needing bind_ns. What is more, I don't even have to be root
to use it.
I would very much prefer to see us optimizing our debugging and
control interfaces so they are efficient than see us implement
something completely new that is problem-domain specific.
Eric
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16930 is a reply to message #16923] |
Mon, 11 December 2006 20:03 |
serue
Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serue@us.ibm.com> writes:
>
> > Quoting Serge E. Hallyn (serue@us.ibm.com):
> >> Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> > Herbert Poetzl <herbert@13thfloor.at> writes:
> >> > >> Beyond that yes it seems to make sense to let user space
> >> > >> maintain any mapping of containers to ids.
> >> > >
> >> > > I agree with that, but we need something to move
> >> > > around between the various spaces ...
> >> >
> >> > If you have CAP_SYS_PTRACE or you have a child process
> >> > in a container you can create another with ptrace.
> >> >
> >> > Now I don't mind optimizing that case, with something like
> >> > the proposed bind_ns syscall. But we need to be darn certain
> >> > why it is safe, and does not change the security model that
> >> > we currently have.
> >>
> >> Sigh, and that's going to have to be a discussion per namespace.
> >
> > Well, assuming that we're using pids as identifiers, that means
> > we can only enter descendent namespaces, which means 'we' must
> > have created them. So anything we could do by entering the ns,
> > we could have done by creating it as well, right?
>
> It isn't strict descendents who we can see. i.e. init can create
> the thing, and we could have just logged into the network but init
> and us still share the same pid namespace.
>
> But yes it would be we can only enter descendent namespaces, for
> some definition of enter.
>
> There are two issues.
> 1) We may have a namespace we want to create and then remove the ability
> for the sysadmin to fiddle with, so it can play with encrypted data or
> something like that safely. Not quite unix but it is certainly worth
> considering.
Yeah, that occurred to me, but it doesn't seem like we can possibly make
sufficient guarantees to the client to make this worthwhile.
I'd love to be wrong about that, but if nothing else we can't prove to
the client that they're running on an unhacked host. So the host admin
will always have to be trusted.
> 2) When we only partially enter a namespace it is very easy for additional
> properties to enter that namespace. For example we enter the pid
> namespace and the mount namespace, but keep our current working directory
> in the previous namespace. Then a process in the restricted namespace
> can get out by cd into /proc/<?>/cwd.
Yup, entering existing namespaces should be all-or-nothing.
> If someone's permissions to various objects do not depend on the namespace
> they are in, quite possibly this is a non-issue. If we actually depend on
> the isolation to keep things secure, enter is a setup for a first-rate escape.
I don't believe the isolation can be effective between two namespaces
where one is an ancestor of another. It can be so long as one isn't
the ancestor of another, but then we're not allowing either to enter
the other namespace. So it's not a problem.
The bind_ns() proposed by Cedric is stricter, only allowing nsid 0 to
switch namespaces. So it may be overly restrictive, and does introduce
a new global namespace, but it is safe.
-serge
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16954 is a reply to message #16810] |
Mon, 11 December 2006 22:53 |
Dave Hansen
On Mon, 2006-12-11 at 16:23 +0100, Cedric Le Goater wrote:
> > Even letting the concept of nsproxy escape to user space sounds wrong.
> > nsproxy is an internal space optimization. It's not struct container
> > and I don't think we want it to become that.
>
> i don't agree here. we need that, so does openvz, vserver, people working
> on resource management.
I think what those projects need is _some_ way to group tasks. I'm not
sure they actually need nsproxies.
Two tasks in the same container could very well have different
nsproxies. The nsproxy defines how the pid namespace, and pid<->task
mappings happen for a given task. The init process for a container is
special and might actually appear in more than one pid namespace, while
its children might only appear in one. That means that this init
process's nsproxy can and should actually be different from its
children's. This is despite the fact that they are in the same
container.
If we really need this 'container' grouping, it can easily be something
pointed to _by_ the nsproxy, but it shouldn't _be_ the nsproxy.
-- Dave
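A minimal sketch of the "pointed to _by_ the nsproxy" idea, in plain user-space C. The names container_sketch, nsproxy_sketch and same_container() are made up for illustration, and the integer ids stand in for pointers to real namespace objects; none of this is taken from the posted patches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical grouping object; not part of the posted patches. */
struct container_sketch {
	const char *name;
};

/* Stand-in for struct nsproxy: ints replace namespace pointers. */
struct nsproxy_sketch {
	int pid_ns;
	int mnt_ns;
	struct container_sketch *container;	/* pointed to _by_ the nsproxy */
};

/* Two tasks share a container if their nsproxies agree on it,
 * even when the nsproxies themselves differ. */
static int same_container(const struct nsproxy_sketch *a,
			  const struct nsproxy_sketch *b)
{
	return a->container != NULL && a->container == b->container;
}

static void same_container_example(void)
{
	struct container_sketch guest = { "guest" };
	/* The container's init also lives in the parent pid namespace,
	 * so its nsproxy differs from that of its children. */
	struct nsproxy_sketch init_ns = { 0, 1, &guest };
	struct nsproxy_sketch child_ns = { 1, 1, &guest };

	assert(init_ns.pid_ns != child_ns.pid_ns);
	assert(same_container(&init_ns, &child_ns));
}
```

With this layout the grouping survives any number of distinct nsproxies, which is the case described above for the container's init process.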
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16956 is a reply to message #16928] |
Mon, 11 December 2006 22:01 |
serue
Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serue@us.ibm.com> writes:
>
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >
> > Yeah, that occurred to me, but it doesn't seem like we can possibly make
> > sufficient guarantees to the client to make this worthwhile.
> >
> > I'd love to be wrong about that, but if nothing else we can't prove to
> > the client that they're running on an unhacked host. So the host admin
> > will always have to be trusted.
>
> To some extent that is true. Although all security models we have
> currently fall down if you hack the kernel, or run your kernel
> in a hacked virtual environment. It would be nice if under normal
> conditions you could mount an encrypted filesystem only in a container
> and not have concerns of those files escaping.
Hmm, well perhaps I'm being overly pessimistic - IBM research did have a
demo based on TPM of remote attestation, which may be usable for
ensuring that you're connecting to a service on your virtual machine on
a certain (unhacked) kernel on particular hardware, in which case what
you're talking about may be possible - given a stringent initial
environment (i.e. not the 'gimme $20/month for a hosted partition in
arizona' environment).
Given that, perhaps having a virtual machine with access to encrypted
storage - safe from the host machine admins - may not be unattainable
after all. And given that, it would be worth designing the ns_enter()
system call so that a parent cannot enter some child namespace.
> Which would probably be a matter of having a separate uid_ns and not
> allowing process outside of your container to have any permissions in
> that filesystem.
Yup. Or even just a separate uid_ns and an ecryptfs partition, so
that the host can back up the encrypted data incrementally (per file,
i.e. not just the whole dmcrypted loop file).
-serge
Re: [patch -mm 09/17] nsproxy: add namespace flags [message #16962 is a reply to message #16865] |
Mon, 11 December 2006 15:27 |
Cedric Le Goater
Eric W. Biederman wrote:
> Cedric Le Goater <clg@fr.ibm.com> writes:
>
>>>> /*
>>>> + * namespaces flags
>>>> + */
>>>> +#define NS_MNT 0x00000001
>>>> +#define NS_UTS 0x00000002
>>>> +#define NS_IPC 0x00000004
>>>> +#define NS_PID 0x00000008
>>>> +#define NS_NET 0x00000010
>>>> +#define NS_USER 0x00000020
>>>> +#define NS_ALL (NS_MNT|NS_UTS|NS_IPC|NS_PID|NS_NET|NS_USER)
>>> hmm, why _another_ set of flags to refer to the
>>> namespaces?
>> well, because namespaces are a new kind in the kernel
>
> Gratuitous incompatibility.
?
>>> is the clone()/unshare() set of flags not sufficient
>>> for that?
>> because we are reaching the limits of the CLONE_ flags.
>
> Not really. There are at least 8 bits that clone cannot use
> but that unshare can.
please, could you list them ?
>>> if so, shouldn't we switch (or even better change?
>>> the unshare() too) to a new set of syscalls?
>> unshare_ns() is a new syscall and we don't really need a
>> clone anyway. nop ?
>
> Huh? Clone should be the primary. There are certain namespaces
> that are very hard to unshare without creating a new process.
You just said above that clone had less available flags than
unshare ...
anyway, could you elaborate a bit more ? I have the opposite
feeling and you gave me that impression also a few months ago.
No problem for me, i just want a way to use this stuff without
>>> we should think twice before we create just another
>>> set of flags, and if we do so, please let us change
>>> them all, including certain clone flags (and add a
>>> single compatibility wrapper for the 'old' syscalls)
>> so you would keep the unshare as is but change the set
>> of flags its using, making sure the old ones are still
>> compatible with the new ones.
>>
>> something like this :
>>
>> int sys_unshare(int unshare_flags)
>> {
>> 	int unshare_ns_flags;
>>
>> 	unshare_ns_flags = convert_flags(unshare_flags);
>>
>> 	return sys_unshare_ns(unshare_ns_flags);
>> }
>>
>> ?
>
> If necessary.
ok good. will check it out.
C.
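A rough user-space sketch of the flag translation discussed above, assuming convert_flags() does nothing more than map each namespace-related CLONE_ bit to the proposed NS_ bit. The NS_ values come from the quoted patch and the CLONE_NEWNS, CLONE_NEWUTS and CLONE_NEWIPC values are the ones in the kernel's sched.h (renamed here with an OLD_ prefix to avoid clashing with system headers). The body of convert_flags() is a guess; the thread only names it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Proposed namespace flags, as in the quoted patch. */
#define NS_MNT	0x00000001
#define NS_UTS	0x00000002
#define NS_IPC	0x00000004

/* Existing clone()/unshare() namespace flags. */
#define OLD_CLONE_NEWNS		0x00020000
#define OLD_CLONE_NEWUTS	0x04000000
#define OLD_CLONE_NEWIPC	0x08000000

/* Hypothetical helper: translate old unshare() flags to NS_ flags.
 * Bits with no namespace meaning are dropped in this sketch. */
static uint32_t convert_flags(uint32_t unshare_flags)
{
	static const struct {
		uint32_t old_flag;
		uint32_t ns_flag;
	} map[] = {
		{ OLD_CLONE_NEWNS,  NS_MNT },
		{ OLD_CLONE_NEWUTS, NS_UTS },
		{ OLD_CLONE_NEWIPC, NS_IPC },
	};
	uint32_t ns_flags = 0;
	size_t i;

	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++)
		if (unshare_flags & map[i].old_flag)
			ns_flags |= map[i].ns_flag;
	return ns_flags;
}

/* Usage: old-style flags in, NS_ flags out. */
static void convert_flags_example(void)
{
	assert(convert_flags(OLD_CLONE_NEWNS | OLD_CLONE_NEWUTS) ==
	       (NS_MNT | NS_UTS));
}
```

A table keeps the wrapper easy to extend if more namespaces get both an old and a new flag.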
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16966 is a reply to message #16882] |
Mon, 11 December 2006 16:09 |
Cedric Le Goater
Herbert Poetzl wrote:
> On Fri, Dec 08, 2006 at 01:57:38PM -0700, Eric W. Biederman wrote:
>> "Serge E. Hallyn" <serue@us.ibm.com> writes:
>>
>>> Quoting Eric W. Biederman (ebiederm@xmission.com):
>>>> clg@fr.ibm.com writes:
>>>>
>>>>> From: Cedric Le Goater <clg@fr.ibm.com>
>>>>>
>>>>> This patch adds a hashtable of nsproxy using the nsproxy as a key.
>>>>> init_nsproxy is hashed at init with key 0. This is considered to be
>>>>> the 'host' nsproxy.
>>>> NAK. Which namespace do these ids live in?
>
> well, I gave a similar answer in another email,
> so I fully agree with the NAK here ...
hmm, it wasn't that clear to me. OK, let's dig :)
>>>> It sounds like you are setting up to make the 'host' nsproxy
>>>> special and have special rules. That also sounds wrong.
>>>>
>>>> Even letting the concept of nsproxy escape to user space sounds
>>>> wrong. nsproxy is an internal space optimization. It's not struct
>>>> container and I don't think we want it to become that.
>>>>
>>>> Eric
>>> So would you advocate referring to containers just by the pid of
>>> a process containing the nsproxy, and letting userspace maintain
>>> a mapping of id's to containers through container create/enter
>>> commands? Or is there some other way you were thinking of doing
>>> this?
>
>> There are two possible ways.
>> 1) Just use a process using the namespace.
>> This is easiest to implement.
>
>> 2) Have a struct pid reference in the namespace itself,
>> and probably an extra pointer in struct pid to find it.
>> This is the most stable, because fork/exit won't affect
>> which pid you need to use.
>
> while I agree that nsproxy is definitely the wrong
> point to tie a 'context' to, as it can contain a
> mixture of spaces from inside and outside a context,
> and it would require forbidding things like
> clone() with the space flags, both inside and outside
> a 'container', to allow using them for actual vps
> applications, I think that we have to have some kind
> of handle to tie specific sets of namespaces to
this is nsproxy ...
> that 'can' be an nsproxy or something different, but
> I'm absolutely unhappy with tying it to a process,
hmm, what do you mean ? nsproxy survives the death of any
process. It's not tied to any process in particular. One
process creates it with an unshare but that's all.
the ->nsproxy in task_struct is a way to find it.
> as I already mentioned several times, that lightweight
> 'containers' do not use/have an init process, and no
> single process might survive the entire life span of
> that 'container' ...
I think there is a misunderstanding here. a 'container'
or 'nsproxy' or whatever is a set of namespaces which
are not tied to a process.
you can do that today on 2.6.19 with utsname.
>> Beyond that yes it seems to make sense to let user space
>> maintain any mapping of containers to ids.
>
> I agree with that, but we need something to move
> around between the various spaces ...
the bind_ns syscall lets the user specify the mapping. this
is not done by the kernel.
I had to introduce some rules, like giving more capabilities to
some processes, but that can be changed. For the
moment, they have to live in "init_nsproxy".
> for example, Linux-VServer ties the namespaces to
> the context structure (atm) which allows userspace
> to set and enter specific spaces of a guest context
> (I assume OpenVZ does similar)
What's the big difference with nsproxy ?
C.
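For readers following the hashtable part of the discussion, here is a small user-space sketch of an id-to-nsproxy table in which id 0 is reserved for the host, as the patch description says. It assumes plain integer ids chosen by the caller; the names nsproxy_sketch, ns_find(), ns_bind() and ns_hash_init() are invented for this example and are not the functions from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define NS_HASH_SIZE 16

/* Stand-in for struct nsproxy; only what the table needs. */
struct nsproxy_sketch {
	int id;				/* identifier chosen by the caller */
	struct nsproxy_sketch *next;	/* hash chain */
};

static struct nsproxy_sketch *ns_hash[NS_HASH_SIZE];
static struct nsproxy_sketch host_nsproxy;	/* plays the role of init_nsproxy */

static unsigned int ns_bucket(int id)
{
	return (unsigned int)id % NS_HASH_SIZE;
}

static struct nsproxy_sketch *ns_find(int id)
{
	struct nsproxy_sketch *p;

	for (p = ns_hash[ns_bucket(id)]; p != NULL; p = p->next)
		if (p->id == id)
			return p;
	return NULL;
}

/* Returns 0 on success, -1 if the id is already bound. */
static int ns_bind(struct nsproxy_sketch *ns, int id)
{
	if (ns_find(id) != NULL)
		return -1;
	ns->id = id;
	ns->next = ns_hash[ns_bucket(id)];
	ns_hash[ns_bucket(id)] = ns;
	return 0;
}

/* The host nsproxy is hashed at init with key 0. */
static void ns_hash_init(void)
{
	ns_bind(&host_nsproxy, 0);
}

static void ns_hash_example(void)
{
	static struct nsproxy_sketch guest;

	ns_hash_init();
	assert(ns_find(0) == &host_nsproxy);
	assert(ns_bind(&guest, 42) == 0);
	assert(ns_find(42) == &guest);
	assert(ns_bind(&guest, 0) == -1);	/* id 0 is taken by the host */
}
```

The sketch leaves out locking and reference counting, both of which a kernel implementation would need.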
Re: [patch -mm 10/17] nsproxy: add unshare_ns and bind_ns syscalls [message #16967 is a reply to message #16883] |
Mon, 11 December 2006 17:05 |
Cedric Le Goater
Herbert Poetzl wrote:
> On Fri, Dec 08, 2006 at 12:26:49PM -0700, Eric W. Biederman wrote:
>> clg@fr.ibm.com writes:
>>
>>> From: Cedric Le Goater <clg@fr.ibm.com>
>>>
>>> The following patch defines 2 new syscalls specific to nsproxy and
>>> namespaces :
>>>
>>> * unshare_ns :
>>>
>>> enables a process to unshare one or more namespaces. this
>>> duplicates the unshare syscall for the moment but we
>>> expect to diverge when the number of namespaces increases
>> Are we out of clone flags yet? If not this is premature.
>
> no, but a different nevertheless related question:
> does anybody, except for 'us' use the unshare() syscall?
>
> because if not, then why not simply extend that one
> to 64bit and be done, we probably won't need a clone64()
> but if we find we do (at some point) adding that with
> the new flags would be trivial ...
>
> OTOH, we could also just add an unshare64() too
>
> anyway, we _will_ run out of flags in the near future
yes. that's probably the way to go. I'll rework unshare_ns() into an
unshare64(). it will give some air to the 32-bit clone() and unshare() and
will let us use the flags above 32 bits for namespaces.
thanks,
C.
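One possible reading of the unshare64() idea, sketched in user-space C: the low 32 bits keep the existing clone()/unshare() flags and the new namespace flags are allocated from bit 32 upwards. The bit assignments and the NS64_ names are assumptions made for this example; the thread does not fix a layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: new namespace bits start at bit 32. */
#define NS64_PID	(UINT64_C(1) << 32)
#define NS64_NET	(UINT64_C(1) << 33)
#define NS64_USER	(UINT64_C(1) << 34)

/* Low word: flags that the 32-bit clone()/unshare() already understand. */
static uint32_t legacy_flags(uint64_t flags)
{
	return (uint32_t)(flags & UINT64_C(0xffffffff));
}

/* High word: flags that only a 64-bit call could carry. */
static uint32_t extended_flags(uint64_t flags)
{
	return (uint32_t)(flags >> 32);
}

/* The old 32-bit unshare() could become a wrapper that zero-extends. */
static uint64_t widen_legacy(uint32_t unshare_flags)
{
	return (uint64_t)unshare_flags;
}

static void unshare64_example(void)
{
	uint64_t flags = NS64_NET | UINT64_C(0x00020000);	/* plus CLONE_NEWNS */

	assert(legacy_flags(flags) == 0x00020000);
	assert(extended_flags(flags) == 2);
	assert(extended_flags(widen_legacy(0xffffffffu)) == 0);
}
```

Because old flags never reach the high word, existing callers keep their meaning unchanged under this layout.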
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16977 is a reply to message #16810] |
Tue, 12 December 2006 08:37 |
ebiederm
Cedric Le Goater <clg@fr.ibm.com> writes:
> Dave Hansen wrote:
>> On Mon, 2006-12-11 at 16:23 +0100, Cedric Le Goater wrote:
>>>> Even letting the concept of nsproxy escape to user space sounds wrong.
>>>> nsproxy is an internal space optimization. It's not struct container
>>>> and I don't think we want it to become that.
>>> i don't agree here. we need that, so does openvz, vserver, people working
>>> on resource management.
>>
>> I think what those projects need is _some_ way to group tasks. I'm not
>> sure they actually need nsproxies.
>
> not only tasks. ipc, fs, etc.
What is the important aspect that you need to group? What concept
are you trying to convey?
How do you describe a container in which someone is using the
pam_namespace module? So different tasks in the container have
a different mount namespace?
>> Two tasks in the same container could very well have different
>> nsproxies. The nsproxy defines how the pid namespace, and pid<->task
>> mappings happen for a given task.
>
> not only. there are other namespaces in nsproxy.
The point is that there is not a one to one mapping between containers
and nsproxies. There are likely to be more nsproxies than containers.
>> The init process for a container is
>> special and might actually appear in more than one pid namespace, while
>> its children might only appear in one. That means that this init
>> process's nsproxy can and should actually be different from its
>> children's. This is despite the fact that they are in the same
>> container.
>>
>> If we really need this 'container' grouping, it can easily be something
>> pointed to _by_ the nsproxy, but it shouldn't _be_ the nsproxy.
>
> ok so let's add a container object, containing a nsproxy and add
> another indirection ...
Well that isn't what Dave suggested, and I don't think it will give
you what you want.
Eric
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16978 is a reply to message #16810] |
Tue, 12 December 2006 08:57 |
ebiederm
Kirill Korotaev <dev@sw.ru> writes:
>>
>> I think what those projects need is _some_ way to group tasks. I'm not
>> sure they actually need nsproxies.
>>
>> Two tasks in the same container could very well have different
>> nsproxies.
> what is a container then, from your POV?
A nested instance of user space. User space may unshare things
such as the mount namespace so it can give users the ability to
control their own mounts and the like.
>> The nsproxy defines how the pid namespace, and pid<->task
>> mappings happen for a given task. The init process for a container is
>> special and might actually appear in more than one pid namespace, while
>> its children might only appear in one. That means that this init
>> process's nsproxy can and should actually be different from its
>> children's. This is despite the fact that they are in the same
>> container.
> nsproxy has references to all namespaces, not just pid namespace.
> Thus it is a container "view" effectively.
> If a container is something different, then please define it.
nsproxy has exactly one instance of each namespace. A container
in the general case can hold other containers and near-containers
(like processes with separate mount namespaces), as well as
processes.
So nsproxy currently captures the common case for containers but not
the general case.
>> If we really need this 'container' grouping, it can easily be something
>> pointed to _by_ the nsproxy, but it shouldn't _be_ the nsproxy.
> You can add another indirection if you really want it so much...
> But is it required?
> We created nsproxy which adds another level of indirection, but from a
> performance POV it is questionable. I can say that we had a nice experience,
> when adding a single dereference in TCP code resulted in ~0.5% performance
> degradation.
I totally agree with that, nsproxy is something we need to watch from
a performance point of view. nsproxy is primarily a space
optimization to keep from bloating task struct, and possibly a fork
time optimization. At least at the point we added it no one could
measure overhead from using it.
That is one of the reasons I don't want nsproxy to become explicit and
be exported to user space. That way, if it is a performance problem, we can
change the implementation without affecting users.
Eric
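The trade-off described here can be made concrete with a toy comparison in user-space C: one pointer per task plus an extra dereference, versus one pointer per namespace in every task. All struct and function names below are invented for illustration and only mimic the shape of task_struct and nsproxy.

```c
#include <assert.h>
#include <stddef.h>

struct ns_sketch {
	int id;
};

/* Shared set of namespaces, in the spirit of struct nsproxy. */
struct nsproxy_sketch {
	struct ns_sketch *uts_ns, *ipc_ns, *mnt_ns, *pid_ns;
};

/* Variant A: every task carries one pointer per namespace. */
struct task_embedded {
	struct ns_sketch *uts_ns, *ipc_ns, *mnt_ns, *pid_ns;
};

/* Variant B: every task carries a single pointer to a shared nsproxy. */
struct task_proxied {
	struct nsproxy_sketch *nsproxy;
};

static int uts_id_embedded(const struct task_embedded *t)
{
	return t->uts_ns->id;			/* one dereference */
}

static int uts_id_proxied(const struct task_proxied *t)
{
	return t->nsproxy->uts_ns->id;		/* one extra dereference */
}

static void nsproxy_tradeoff_example(void)
{
	struct ns_sketch uts = { 7 };
	struct nsproxy_sketch proxy = { &uts, NULL, NULL, NULL };
	struct task_embedded a = { &uts, NULL, NULL, NULL };
	struct task_proxied b = { &proxy };

	assert(sizeof(b) < sizeof(a));
	assert(uts_id_embedded(&a) == uts_id_proxied(&b));
}
```

Whether the extra dereference matters is a measurement question; the sketch only shows where the cost and the saving come from.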
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16981 is a reply to message #16954] |
Tue, 12 December 2006 07:09 |
Cedric Le Goater
Dave Hansen wrote:
> On Mon, 2006-12-11 at 16:23 +0100, Cedric Le Goater wrote:
>>> Even letting the concept of nsproxy escape to user space sounds wrong.
>>> nsproxy is an internal space optimization. It's not struct container
>>> and I don't think we want it to become that.
>> i don't agree here. we need that, so does openvz, vserver, people working
>> on resource management.
>
> I think what those projects need is _some_ way to group tasks. I'm not
> sure they actually need nsproxies.
not only tasks. ipc, fs, etc.
> Two tasks in the same container could very well have different
> nsproxies. The nsproxy defines how the pid namespace, and pid<->task
> mappings happen for a given task.
not only. there are other namespaces in nsproxy.
> The init process for a container is
> special and might actually appear in more than one pid namespace, while
> its children might only appear in one. That means that this init
> process's nsproxy can and should actually be different from its
> children's. This is despite the fact that they are in the same
> container.
>
> If we really need this 'container' grouping, it can easily be something
> pointed to _by_ the nsproxy, but it shouldn't _be_ the nsproxy.
ok so let's add a container object, containing a nsproxy and add
another indirection ...
C.
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16982 is a reply to message #16921] |
Tue, 12 December 2006 07:11 |
Cedric Le Goater
Serge E. Hallyn wrote:
> Quoting Serge E. Hallyn (serue@us.ibm.com):
>> Quoting Eric W. Biederman (ebiederm@xmission.com):
>>> Herbert Poetzl <herbert@13thfloor.at> writes:
>>>>> Beyond that yes it seems to make sense to let user space
>>>>> maintain any mapping of containers to ids.
>>>> I agree with that, but we need something to move
>>>> around between the various spaces ...
>>> If you have CAP_SYS_PTRACE or you have a child process
>>> in a container you can create another with ptrace.
>>>
>>> Now I don't mind optimizing that case, with something like
>>> the proposed bind_ns syscall. But we need to be darn certain
>>> why it is safe, and does not change the security model that
>>> we currently have.
>> Sigh, and that's going to have to be a discussion per namespace.
>
> Well, assuming that we're using pids as identifiers, that means
we can't because a process could die while the namespace is still
referenced by another subsystem. We need some kind of id.
> we can only enter descendent namespaces, which means 'we' must
> have created them. So anything we could do by entering the ns,
> we could have done by creating it as well, right?
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16983 is a reply to message #16969] |
Tue, 12 December 2006 15:29 |
serue
Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serue@us.ibm.com> writes:
>
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> I actually have code that will let me fork a process in a new namespace today
> >> with out needing bind_ns. What is more I don't even have to be root
> >> to use it.
> >
> > Can you elaborate? The user namespace patches don't enforce ptrace
> > yet, so you could unshare as root, become uid 500, then as uid 500
> > in the original namespace ptrace the process in the new namespace.
> > Is that what you're doing? If (when) ptrace enforces the uid namespace,
> > will that stop what you're doing?
>
> sys_ptrace is allowed in 2 situations.
> - The user and group identities are the same.
> - The calling process has CAP_SYS_PTRACE capability.
>
> So currently if the uid namespace enforces the user and group checks
> that will prevent the first case, and is very desirable. But it won't
> stop someone with CAP_SYS_PTRACE. Which given the normal case seems
> reasonable.
Yes, I was forgetting that inter-container ptrace is generally
inhibited by lack of a handle to processes in the other container.
So:
. in checkpoint/restart usage, the normal CAP_SYS_PTRACE
semantics is fine
. inside a vserver, the normal CAP_SYS_PTRACE is fine
. in general, a process inside one vserver cannot reference
a process in another vserver, so we don't need to worry
about ptrace permissions at all
. however, if we want to (as per emails yesterday) provide
some bit of enforcement of limits from parent namespaces
to child namespaces - where a pid is in fact available for
at least the init process (and, depending on our final
implementation, perhaps all processes) - then we need
something more.
As you say, selinux permissions would be one way to obtain this.
> Getting to the point where you can't trace what a process is doing
> would probably require some additional interprocess firewalling
> from something like selinux.
Yup.
thanks,
-serge
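The two ptrace cases quoted above, plus the uid namespace check being discussed, can be written down as a small user-space model. This is a simplification under stated assumptions: the real kernel check compares several uid and gid fields and also consults security modules, and the struct cred_sketch layout and the integer uid_ns id are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, flattened view of a task's credentials. */
struct cred_sketch {
	int uid_ns;		/* made-up id of the uid namespace */
	unsigned int uid;
	unsigned int gid;
	bool cap_sys_ptrace;
};

/* Model of the rule: CAP_SYS_PTRACE always passes; otherwise the user
 * and group identities must match, within the same uid namespace. */
static bool may_ptrace(const struct cred_sketch *tracer,
		       const struct cred_sketch *target)
{
	if (tracer->cap_sys_ptrace)
		return true;
	return tracer->uid_ns == target->uid_ns &&
	       tracer->uid == target->uid &&
	       tracer->gid == target->gid;
}

static void may_ptrace_example(void)
{
	struct cred_sketch host_user = { 0, 500, 500, false };
	struct cred_sketch guest_user = { 1, 500, 500, false };
	struct cred_sketch host_admin = { 0, 0, 0, true };

	assert(may_ptrace(&host_user, &host_user));
	assert(!may_ptrace(&host_user, &guest_user));	/* same uid, other namespace */
	assert(may_ptrace(&host_admin, &guest_user));	/* capability still wins */
}
```

The last assertion is the case that, per the discussion, would need something like selinux to restrict further.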
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16984 is a reply to message #16981] |
Tue, 12 December 2006 15:45 |
serue
Quoting Cedric Le Goater (clg@fr.ibm.com):
> Dave Hansen wrote:
> > On Mon, 2006-12-11 at 16:23 +0100, Cedric Le Goater wrote:
> >>> Even letting the concept of nsproxy escape to user space sounds wrong.
> >>> nsproxy is an internal space optimization. It's not struct container
> >>> and I don't think we want it to become that.
> >> i don't agree here. we need that, so does openvz, vserver, people working
> >> on resource management.
> >
> > I think what those projects need is _some_ way to group tasks. I'm not
> > sure they actually need nsproxies.
>
> not only tasks. ipc, fs, etc.
>
> > Two tasks in the same container could very well have different
> > nsproxies. The nsproxy defines how the pid namespace, and pid<->task
> > mappings happen for a given task.
>
> not only. there are other namespaces in nsproxy.
Right, and as Eric has pointed out, you may well want to use one id to
refer to several nsproxies - for instance if you are using unshare
to provide per-user private mount namespaces using pam_namespace.so
(that's mostly for LSPP systems right now, but I do this on my laptop
too). All my accounts are in the same 'container', but have different
mount namespaces, hence different nsproxies.
> > The init process for a container is
> > special and might actually appear in more than one pid namespace, while
> > its children might only appear in one. That means that this init
> > process's nsproxy can and should actually be different from its
> > children's. This is despite the fact that they are in the same
> > container.
> >
> > If we really need this 'container' grouping, it can easily be something
> > pointed to _by_ the nsproxy, but it shouldn't _be_ the nsproxy.
>
> ok so let's add a container object, containing a nsproxy and add
> another indirection ...
No thanks.
Re: [patch -mm 08/17] nsproxy: add hashtable [message #16989 is a reply to message #16954] |
Tue, 12 December 2006 08:43 |
dev
>>>Even letting the concept of nsproxy escape to user space sounds wrong.
>>>nsproxy is an internal space optimization. It's not struct container
>>>and I don't think we want it to become that.
>>
>>i don't agree here. we need that, so does openvz, vserver, people working
>>on resource management.
>
>
> I think what those projects need is _some_ way to group tasks. I'm not
> sure they actually need nsproxies.
>
> Two tasks in the same container could very well have different
> nsproxies.
what is a container then, from your POV?
> The nsproxy defines how the pid namespace, and pid<->task
> mappings happen for a given task. The init process for a container is
> special and might actually appear in more than one pid namespace, while
> its children might only appear in one. That means that this init
> process's nsproxy can and should actually be different from its
> children's. This is despite the fact that they are in the same
> container.
nsproxy has references to all namespaces, not just pid namespace.
Thus it is a container "view" effectively.
If a container is something different, then please define it.
> If we really need this 'container' grouping, it can easily be something
> pointed to _by_ the nsproxy, but it shouldn't _be_ the nsproxy.
You can add another indirection if you really want it so much...
But is it required?
We created nsproxy which adds another level of indirection, but from a performance POV
it is questionable. I can say that we had a nice experience, when adding
a single dereference in TCP code resulted in ~0.5% performance degradation.
Thanks,
Kirill
Re: [patch -mm 08/17] nsproxy: add hashtable [message #17000 is a reply to message #16956]
Tue, 12 December 2006 23:22
Herbert Poetzl
Messages: 239 Registered: February 2006
Senior Member
On Mon, Dec 11, 2006 at 04:01:15PM -0600, Serge E. Hallyn wrote:
> Quoting Eric W. Biederman (ebiederm@xmission.com):
> > "Serge E. Hallyn" <serue@us.ibm.com> writes:
> >
> > > Quoting Eric W. Biederman (ebiederm@xmission.com):
> > >
> > > Yeah, that occurred to me, but it doesn't seem like we can possibly make
> > > sufficient guarantees to the client to make this worthwhile.
> > >
> > > I'd love to be wrong about that, but if nothing else we can't prove to
> > > the client that they're running on an unhacked host. So the host admin
> > > will always have to be trusted.
> >
> > To some extent that is true. Although all security models we have
> > currently fall down if you hack the kernel, or run your kernel
> > in a hacked virtual environment. It would be nice if under normal
> > conditions you could mount an encrypted filesystem only in a container
> > and not have concerns of those files escaping.
>
> Hmm, well perhaps I'm being overly pessimistic - IBM research did have a
> demo based on TPM of remote attestation, which may be usable for
> ensuring that you're connecting to a service on your virtual machine on
> a certain (unhacked) kernel on particular hardware, in which case what
> you're talking about may be possible - given a stringent initial
> environment (i.e. not the 'gimme $20/month for a hosted partition in
> arizona' environment).
interesting, how would you _ensure_ from inside
such an environment, that nobody tampered with
the kernel you are running on?
> Given that, perhaps having a virtual machine with access to encrypted
> storage - safe from the host machine admins - may not be unattainable
> after all. And given that, it would be worth designing the ns_enter()
> system call so that a parent cannot enter some child namespace.
we currently call this Context Privacy, and it
is partially implemented, but of course, it
does only work if the kernel is known good
> > Which would probably be a matter of having a separate uid_ns and not
> > allowing process outside of your container to have any permissions in
> > that filesystem.
>
> Yup. Or even just a separate uid_ns and an ecryptfs partition, so
> that the host can back up the encrypted data incrementally (per file,
> i.e. not just the whole dmcrypted loop file).
it's simple to avoid access to certain 'tagged'
devices and/or filesystems, it's hard to handle
kernel modifications or even simple things like
reading the kernel memory ...
best,
Herbert
> -serge
Re: [patch -mm 08/17] nsproxy: add hashtable [message #17014 is a reply to message #16989]
Wed, 13 December 2006 04:55
Herbert Poetzl
Messages: 239 Registered: February 2006
Senior Member
On Tue, Dec 12, 2006 at 11:43:38AM +0300, Kirill Korotaev wrote:
> >>>Even letting the concept of nsproxy escape to user space sounds wrong.
> >>>nsproxy is an internal space optimization. It's not struct container
> >>>and I don't think we want it to become that.
> >>i don't agree here. we need that, so does openvz, vserver, people
> >>working on resource management.
> >
> >
> > I think what those projects need is _some_ way to group tasks. I'm
> > not sure they actually need nsproxies.
> >
> > Two tasks in the same container could very well have different
> > nsproxies.
and typically, they will ...
> what is container then from your POV?
from my PoV, a container is something keeping
processes _inside_ which basically requires
the following elements:
- isolation from other containers
- virtualization of unique elements
- limitation on resources
- policy on all interfaces
the current spaces mostly address the isolation
and to some degree, the virtualization, which
is a good thing, but the container also requires
the resource limitation and the policy, to handle
interfaces to the outside (should not be new to
you, actually :)
so the container (may it be represented by a
structure or not), may reference an nsproxy
(as we do in the 2.6.19 versions of Linux-VServer)
but an nsproxy is not the proper element to
define a container ..
we also want to be able to have sub spaces inside
a container, as long as they do not interfere or
overcome the limitations and policy
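A rough sketch of that separation, in user-space C: grouping, limits and policy sit in one object, while the namespace view stays per task. Everything here is hypothetical (the `struct container` layout, the `max_tasks` limit, the `policy_flags` field and `container_add_task()`); it is neither Linux-VServer code nor a proposed kernel interface.

```c
#include <assert.h>
#include <stddef.h>

struct nsproxy { int id; };          /* stand-in for the namespace bundle */

/* Hypothetical container: limits and policy live here, not in nsproxy. */
struct container {
    int id;
    unsigned long max_tasks;         /* example resource limit */
    unsigned long nr_tasks;
    unsigned int policy_flags;       /* example interface policy bits */
    struct nsproxy *default_ns;      /* a container may reference an nsproxy */
};

struct task {
    struct container *container;     /* grouping: which container owns it */
    struct nsproxy *nsproxy;         /* view: may differ between tasks */
};

/* Add a task to a container, enforcing the task limit. Returns 0 or -1. */
static int container_add_task(struct container *c, struct task *t,
                              struct nsproxy *view)
{
    if (c->nr_tasks >= c->max_tasks)
        return -1;                   /* the limit applies whatever the view */
    c->nr_tasks++;
    t->container = c;
    t->nsproxy = view ? view : c->default_ns;
    return 0;
}
```

With this layout a task in a sub space keeps its own nsproxy, but it still counts against the limits of the container it belongs to.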
> > The nsproxy defines how the pid namespace, and pid<->task
> > mappings happen for a given task. The init process for a container is
> > special and might actually appear in more than one pid namespace, while
> > its children might only appear in one. That means that this init
> > process's nsproxy can and should actually be different from its
> > children's. This is despite the fact that they are in the same
> > container.
> nsproxy has references to all namespaces, not just pid namespace.
> Thus it is a container "view" effectively.
it is a view into the world of one or more processes,
but not necessarily the view of all processes inside
a container :)
> If container is something different, then please define it.
see above ...
> > If we really need this 'container' grouping, it can easily be something
> > pointed to _by_ the nsproxy, but it shouldn't _be_ the nsproxy.
> You can add another indirection if you really want it so much...
> But is it required?
> We created nsproxy which adds another level of indirection, but from
> performance POV it is questionable.
I'm not very happy with the nsproxy abstraction,
as I think it would be better handled per task,
and I still have no real-world test results showing what
overhead the nsproxy indirection causes
> I can say that we had a nice experience, when adding a single
> dereference in TCP code resulted in ~0.5% performance degradation.
yes, that is what I fear is happening right now
with the nsproxy ... but I think we need to test
that, and if it makes sense, switch to task direct
spaces (as we had before), just more of them ...
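The two layouts being compared can be put side by side in a small user-space sketch. The type and function names are made up for illustration, and the code measures nothing: the real cost depends on cache behaviour and would have to come from the tests asked for above.

```c
#include <assert.h>
#include <string.h>

struct uts_ns { char nodename[65]; };

/* Layout A: namespaces reached through a shared nsproxy. */
struct nsproxy { struct uts_ns *uts_ns; };
struct task_indirect { struct nsproxy *nsproxy; };

/* Layout B: "task direct spaces", one pointer per namespace in the task. */
struct task_direct { struct uts_ns *uts_ns; };

static const char *nodename_indirect(const struct task_indirect *t)
{
    return t->nsproxy->uts_ns->nodename;   /* task -> nsproxy -> uts_ns */
}

static const char *nodename_direct(const struct task_direct *t)
{
    return t->uts_ns->nodename;            /* task -> uts_ns */
}
```

Layout A keeps the task structure small and lets many tasks share one bundle. Layout B saves a pointer hop on every lookup, at the price of one pointer per namespace in every task.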
best,
Herbert
> Thanks,
> Kirill
Re: [patch -mm 08/17] nsproxy: add hashtable [message #17026 is a reply to message #16984]
Wed, 13 December 2006 15:00
Cedric Le Goater
Messages: 443 Registered: February 2006
Senior Member
Serge E. Hallyn wrote:
> Quoting Cedric Le Goater (clg@fr.ibm.com):
>> Dave Hansen wrote:
>>> On Mon, 2006-12-11 at 16:23 +0100, Cedric Le Goater wrote:
>>>>> Even letting the concept of nsproxy escape to user space sounds wrong.
>>>>> nsproxy is an internal space optimization. It's not struct container
>>>>> and I don't think we want it to become that.
>>>> i don't agree here. we need that, so does openvz, vserver, people working
>>>> on resource management.
>>> I think what those projects need is _some_ way to group tasks. I'm not
>>> sure they actually need nsproxies.
>> not only tasks. ipc, fs, etc.
>>
>>> Two tasks in the same container could very well have different
>>> nsproxies. The nsproxy defines how the pid namespace, and pid<->task
>>> mappings happen for a given task.
>> not only. there are other namespaces in nsproxy.
>
> Right, and as Eric has pointed out, you may well want to use one id to
> refer to several nsproxies - for instance if you are using unshare
> to provide per-user private mount namespaces using pam_namespace.so
> (that's mostly for LSPP systems right now, but I do this on my laptop
> too). All my accounts are in the same 'container', but have different
> mount namespaces, hence different nsproxies.
I think we have a definition issue here: what is a 'container'?
I don't see any issue with the above scenario. unsharing the mount namespace
results in the creation of a new nsproxy which will require a new identifier
in order to find this new mount namespace.
so yes, different mount namespaces, hence different nsproxies, hence
different ids if you want to find that new mount namespace.
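That scenario can be modelled in a few lines of user-space C. This is only a sketch: the `id` field stands in for whatever identifier the hashtable patch would use, and `unshare_mnt()` is a hypothetical helper, not the kernel's unshare path.

```c
#include <assert.h>
#include <stdlib.h>

struct mnt_ns { int id; };
struct ipc_ns { int id; };

struct nsproxy {
    int id;                  /* hypothetical identifier used to look it up */
    struct mnt_ns *mnt_ns;
    struct ipc_ns *ipc_ns;
};

static int next_id = 1;

/*
 * Model of unsharing only the mount namespace: the caller gets a fresh
 * nsproxy (with a new id) that points at the new mount namespace but still
 * shares every other namespace with the old one. Returns NULL on failure.
 */
static struct nsproxy *unshare_mnt(const struct nsproxy *old,
                                   struct mnt_ns *new_mnt)
{
    struct nsproxy *ns = malloc(sizeof(*ns));

    if (!ns)
        return NULL;
    *ns = *old;              /* inherit all namespace pointers */
    ns->id = next_id++;      /* new nsproxy, hence new identifier */
    ns->mnt_ns = new_mnt;    /* only the mount namespace changes */
    return ns;
}
```

In this model one id maps to exactly one nsproxy, so grouping several nsproxies under a single id, as in the pam_namespace case above, would need a separate object or an extra field.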
>>> The init process for a container is
>>> special and might actually appear in more than one pid namespace, while
>>> its children might only appear in one. That means that this init
>>> process's nsproxy can and should actually be different from its
>>> children's. This is despite the fact that they are in the same
>>> container.
>>>
>>> If we really need this 'container' grouping, it can easily be something
>>> pointed to _by_ the nsproxy, but it shouldn't _be_ the nsproxy.
>> ok so let's add a container object, containing a nsproxy and add
>> another indirection ...
>
> No thanks.
exactly.
C.
Re: [patch -mm 08/17] nsproxy: add hashtable [message #17028 is a reply to message #17014]
Wed, 13 December 2006 15:17
Cedric Le Goater
Messages: 443 Registered: February 2006
Senior Member
Herbert Poetzl wrote:
> On Tue, Dec 12, 2006 at 11:43:38AM +0300, Kirill Korotaev wrote:
>>>>> Even letting the concept of nsproxy escape to user space sounds wrong.
>>>>> nsproxy is an internal space optimization. It's not struct container
>>>>> and I don't think we want it to become that.
>
>>>> i don't agree here. we need that, so does openvz, vserver, people
>>>> working on resource management.
>>>
>>> I think what those projects need is _some_ way to group tasks. I'm
>>> not sure they actually need nsproxies.
>>>
>>> Two tasks in the same container could very well have different
>>> nsproxies.
>
> and typically, they will ...
that means we are missing a container object then, a vps, a vcontext, a
vsomething. no?
>> what is container then from your POV?
>
> from my PoV, a container is something keeping
> processes _inside_ which basically requires
> the following elements:
>
> - isolation from other containers
> - virtualization of unique elements
> - limitation on resources
> - policy on all interfaces
>
> the current spaces mostly address the isolation
> and to some degree, the virtualization, which
> is a good thing, but the container also requires
> the resource limitation and the policy, to handle
> interfaces to the outside (should not be new to
> you, actually :)
>
> so the container (may it be represented by a
> structure or not), may reference an nsproxy
> (as we do in the 2.6.19 versions of Linux-VServer)
> but an nsproxy is not the proper element to
> define a container ..
agreed. it's not complete.
should we address that by introducing a new object?
could that be done on a per-product basis? I mean like
in a driver model.
> we also want to be able to have sub spaces inside
> a container, as long as they do not interfere or
> overcome the limitations and policy
>
>>> The nsproxy defines how the pid namespace, and pid<->task
>>> mappings happen for a given task. The init process for a container is
>>> special and might actually appear in more than one pid namespace, while
>>> its children might only appear in one. That means that this init
>>> process's nsproxy can and should actually be different from its
>>> children's. This is despite the fact that they are in the same
>>> container.
>
>> nsproxy has references to all namespaces, not just pid namespace.
>> Thus it is a container "view" effectively.
>
> it is a view into the world of one or more processes,
> but not necessarily the view of all processes inside
> a container :)
>
>> If container is something different, then please define it.
>
> see above ...
>
>>> If we really need this 'container' grouping, it can easily be something
>>> pointed to _by_ the nsproxy, but it shouldn't _be_ the nsproxy.
>
>> You can add another indirection if you really want it so much...
>> But is it required?
>> We created nsproxy which adds another level of indirection, but from
>> performance POV it is questionable.
>
> I'm not very happy with the nsproxy abstraction,
> as I think it would be better handled per task,
> and I still have no real-world test results showing what
> overhead the nsproxy indirection causes
>
>> I can say that we had a nice experience, when adding a single
>> dereference in TCP code resulted in ~0.5% performance degradation.
>
> yes, that is what I fear is happening right now
> with the nsproxy ... but I think we need to test
> that, and if it makes sense, switch to task direct
> spaces (as we had before), just more of them ...
getting some figures would be nice and we might also be able
to improve the current nsproxy model.
C.
Re: [patch -mm 08/17] nsproxy: add hashtable [message #17100 is a reply to message #17000]
Wed, 20 December 2006 06:12
serue
Messages: 750 Registered: February 2006
Senior Member
Quoting Herbert Poetzl (herbert@13thfloor.at):
> On Mon, Dec 11, 2006 at 04:01:15PM -0600, Serge E. Hallyn wrote:
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> > > "Serge E. Hallyn" <serue@us.ibm.com> writes:
> > >
> > > > Quoting Eric W. Biederman (ebiederm@xmission.com):
> > > >
> > > > Yeah, that occurred to me, but it doesn't seem like we can possibly make
> > > > sufficient guarantees to the client to make this worthwhile.
> > > >
> > > > I'd love to be wrong about that, but if nothing else we can't prove to
> > > > the client that they're running on an unhacked host. So the host admin
> > > > will always have to be trusted.
> > >
> > > To some extent that is true. Although all security models we have
> > > currently fall down if you hack the kernel, or run your kernel
> > > in a hacked virtual environment. It would be nice if under normal
> > > conditions you could mount an encrypted filesystem only in a container
> > > and not have concerns of those files escaping.
> >
> > Hmm, well perhaps I'm being overly pessimistic - IBM research did have a
> > demo based on TPM of remote attestation, which may be usable for
> > ensuring that you're connecting to a service on your virtual machine on
> > a certain (unhacked) kernel on particular hardware, in which case what
> > you're talking about may be possible - given a stringent initial
> > environment (i.e. not the 'gimme $20/month for a hosted partition in
> > arizona' environment).
>
> interesting, how would you _ensure_ from inside
> such an environment, that nobody tampered with
> the kernel you are running on?
Sorry, took a while to find the best reference, but I guess this would be
it:
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/459e9a2b1f668aee85256f330067589f
Another description is
http://domino.research.ibm.com/comm/research_people.nsf/pages/sailer.ima.html
I guess it was in 2004 that they did a demo of remote attestation at the
RSA conference, as described in the third to last paragraph in
http://cio.co.nz/cio.nsf/0/03943645293DB008CC256E47005D8EA2?OpenDocument
and in
http://domino.research.ibm.com/comm/pr.nsf/pages/news.20040218_linux.html
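As a rough sketch of the mechanism behind this kind of attestation (the notation is an assumption made here: $m_i$ is the hash of the $i$-th measured object, $\|$ is concatenation, and the register is taken to start at zero), each measurement is folded into a TPM platform configuration register with the "extend" operation:

```latex
\mathrm{PCR}_{i} = \mathrm{SHA1}\bigl(\mathrm{PCR}_{i-1} \,\|\, m_i\bigr),
\qquad \mathrm{PCR}_{0} = 0
```

A remote verifier replays the reported measurement list through the same recurrence and compares the result with the PCR value signed by the TPM. This only vouches for what was measured, so it does not by itself rule out later tampering with a running kernel.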
-serge