OpenVZ Forum


Home » Mailing lists » Devel » [PATCH 0/13] Pid namespaces (OpenVZ view)
Re: [PATCH 0/13] Pid namespaces (OpenVZ view) [message #13350 is a reply to message #13339] Fri, 25 May 2007 07:15 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Eric W. Biederman wrote:
> Pavel Emelianov <xemul@openvz.org> writes:
>
>> Eric W. Biederman wrote:
>>> Pavel Emelianov <xemul@openvz.org> writes:
>>>
>>>> That's how OpenVZ sees the pid namespaces.
>>>>
>>>> The main idea is that kernel keeps operating with tasks pid
>>>> as it did before, but each task obtains one more pid for each
>>>> pid type - the virtual pid. When putting the pid to user or
>>>> getting the pid from it kernel operates with the virtual ones.
>>> Just a quick reaction.
>>>
>>> - I would very much like to see a minimum of 3 levels of pids,
>> Why not 4? From my part, I would like to know, why such nesting
>> is important. We have plain IPC namespaces and nobody cares.
>> We will have isolated network namespaces, why pids are exception?
>
> 4+ is fine, and something we will probably care about someday.
> 3 seems to be the minimum necessary to get people thinking about
> adding more so we don't have arbitrary special cases, especially
> in the user interface. At 3 the things are simple enough we don't
> have to allocate additional data structures etc.
>
> If we don't need nesting we don't even need 2 levels, and we
> can remove the global pid. But we have had that conversation
> and especially for the current OpenVZ usage we need nesting.

We need nesting but 2 levels is more than enough. Yet again -
we have 2 level IPC namespace, 2 level network namespace etc.

Generic structures are not always needed. Say, why don't we
have N-level page tables in kernel? Why not make them generic?
What if some ia128 architecture will require 7-level tables!?

> Having more then two layers means we are prepared to use pid namespaces more
> generally. It really isn't that much harder.

It is not, but do we need to spend so much time on solving
not relevant problems?

>>> being supported. Otherwise it is easy to overlook some of the
>>> cases that are required to properly support nesting, which long
>>> terms seems important.
>>>
>>> - Semantically fork is easier then unshare. Unshare can mean
>> This is not. When you fork, the kid shares the session and the
>> group with its parent, but moving this pids to new ns is bad - the
>> parent will happen to be half-moved. Thus you need to break the
>> session and the group in fork(), but this is extra complexity.
>
> Nope. You will just need to have the child call setsid() if
> you don't want to share the session and the group.

Of course, but setsid() must be done *before* creating a new
namespace, Otherwise you will have a half-inserted into new
namespace task. This sounds awful.

> You can perfectly well share the sid and group with the parent,
> because internal to the kernel pids aren't numeric, they are struct
> pid pointers.
>
> There is the question of do you use foreign pid handling to display
> the session and the group, or do you allocate pids for the session
> and the group in the new pid namespace. At this point foreign pid
> handling looks sufficient.
>
>>> a lot of things, and it is easy to pick a meaning that has weird
>>> side effects. Your implementation has a serious problem in that you
>>> change the value of getpid() at runtime. Glibc does not know how to
>>> cope with the value of getpid() changing.
>> This pid changing happens only once per task lifetime.
>
> Unshare isn't once per task lifetime, unless you added some extra
> constraints.

It is once. You create a new namespace and that's all.

>> Though I haven't
>> seen any problems with glibc for many years running OpenVZ and I think,
>> that if glibc will want to cache this getpid() value we can teach it to
>> uncache this value in case someone called unshare() with CLONE_NEWPIDS.
>
> glibc very much caches the results of getpid().

Can you prove it? We have run OpenVZ for many years and with many
userspace configurations and we haven't seen the problems with
glibc ever.

> If you want to teach glibc not to cache getpid() fee free. The only
> way I know to get glibc to invalidates it's pid cache is to call fork.
>
> Eric
>
Re: [PATCH 2/13] Small preparations for namespaces [message #13392 is a reply to message #13345] Fri, 25 May 2007 13:01 Go to previous messageGo to next message
serue is currently offline  serue
Messages: 750
Registered: February 2006
Senior Member
Quoting Pavel Emelianov (xemul@openvz.org):
> Serge E. Hallyn wrote:
> > Quoting Pavel Emelianov (xemul@openvz.org):
> >> This includes #ifdefs in get/put_pid_ns and rewriting
> >> the child_reaper() function to the more logical view.
> >>
> >> This doesn't fit logically into any other patch so
> >> I decided to make it separate.
> >>
> >> Signed-off-by: Pavel Emelianov <xemul@openvz.org>
> >>
> >> ---
> >>
> >> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> >> index 169c6c2..7af7191 100644
> >> --- a/include/linux/pid_namespace.h
> >> +++ b/include/linux/pid_namespace.h
> >> @@ -26,7 +26,9 @@ extern struct pid_namespace init_pid_ns;
> >>
> >> static inline void get_pid_ns(struct pid_namespace *ns)
> >> {
> >> +#ifdef CONFIG_PID_NS
> >> kref_get(&ns->kref);
> >> +#endif
> >> }
> >>
> >> extern struct pid_namespace *copy_pid_ns(int flags, struct pid_namespace *ns);
> >> @@ -34,12 +36,15 @@ extern void free_pid_ns(struct kref *kre
> >>
> >> static inline void put_pid_ns(struct pid_namespace *ns)
> >> {
> >> +#ifdef CONFIG_PID_NS
> >> kref_put(&ns->kref, free_pid_ns);
> >> +#endif
> >> }
> >>
> >> static inline struct task_struct *child_reaper(struct task_struct *tsk)
> >> {
> >> - return init_pid_ns.child_reaper;
> >> + BUG_ON(tsk != current);
> >> + return tsk->nsproxy->pid_ns->child_reaper;
> >> }
> >>
> >> #endif /* _LINUX_PID_NS_H */
> >
> > This can't be bisect-safe, right? You can't just use
> > tsk->nsproxy->pid_ns, as you've pointed out yourself.
>
> I can :) See - I have a proving BUG_ON() here.

I didn't know BUG_ON()'s actually warded off bugs :)

You've tested this with the infamous NFS testcase?

I don't see *why* it would work for you, but if you claim it does, I
guess you'd know better than I :)

-serge
Re: [PATCH 1/13] Round up the API [message #13393 is a reply to message #13346] Fri, 25 May 2007 13:02 Go to previous messageGo to next message
serue is currently offline  serue
Messages: 750
Registered: February 2006
Senior Member
Quoting Pavel Emelianov (xemul@openvz.org):
> Serge E. Hallyn wrote:
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> "Serge E. Hallyn" <serue@us.ibm.com> writes:
> >>
> >>> Quoting Pavel Emelianov (xemul@openvz.org):
> >>>> The set of functions process_session, task_session, process_group
> >>>> and task_pgrp is confusing, as the names can be mixed with each other
> >>>> when looking at the code for a long time.
> >>>>
> >>>> The proposals are to
> >>>> * equip the functions that return the integer with _nr suffix to
> >>>> represent that fact,
> >>>> * and to make all functions work with task (not process) by making
> >>>> the common prefix of the same name.
> >>>>
> >>>> For monotony the routines signal_session() and set_signal_session()
> >>>> are replaced with task_session_nr() and set_task_session(), especially
> >>>> since they are only used with the explicit task->signal dereference.
> >>>>
> >>>> I've sent this before, but Andrew didn't include it, so I resend it
> >>>> as the part of this set.
> >>>>
> >>>> Signed-off-by: Pavel Emelianov <xemul@openvz.org>
> >>>> Acked-by: Serge E. Hallyn <serue@us.ibm.com>
> >>> Yup, I still like this patch.
> >> I'm borderline. Less error prone interfaces sound good, less
> >> duplication of information sounds good. Changing the names of
> >> historical function may be change for the sake of change and
> >> thus noise.
> >>
> >> However if we are going to go this far I think we need to remove
> >> the numeric pid cache from the task_struct.
> >
> > You mean tsk->pid?
> >
> > I agree, especially in Suka's version. Not sure it applies to Pavel's
> > version, though since the "real"/global pid is still stored only in
> > tsk->pid, right?
>
> No. All objects that have pid (task_struct, signal_struct and pid (struct))
> have two ids after this patch - virtual one and global one.

(Yes, so wouldn't removing task->pid be pretty detrimental?)

Could you outline how you would extend this to 3 levels? Would you just
add a 'vpid2' etc to the struct pid?

thanks,
-serge
Re: [PATCH 2/13] Small preparations for namespaces [message #13395 is a reply to message #13392] Fri, 25 May 2007 13:21 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Serge E. Hallyn wrote:
> Quoting Pavel Emelianov (xemul@openvz.org):
>> Serge E. Hallyn wrote:
>>> Quoting Pavel Emelianov (xemul@openvz.org):
>>>> This includes #ifdefs in get/put_pid_ns and rewriting
>>>> the child_reaper() function to the more logical view.
>>>>
>>>> This doesn't fit logically into any other patch so
>>>> I decided to make it separate.
>>>>
>>>> Signed-off-by: Pavel Emelianov <xemul@openvz.org>
>>>>
>>>> ---
>>>>
>>>> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
>>>> index 169c6c2..7af7191 100644
>>>> --- a/include/linux/pid_namespace.h
>>>> +++ b/include/linux/pid_namespace.h
>>>> @@ -26,7 +26,9 @@ extern struct pid_namespace init_pid_ns;
>>>>
>>>> static inline void get_pid_ns(struct pid_namespace *ns)
>>>> {
>>>> +#ifdef CONFIG_PID_NS
>>>> kref_get(&ns->kref);
>>>> +#endif
>>>> }
>>>>
>>>> extern struct pid_namespace *copy_pid_ns(int flags, struct pid_namespace *ns);
>>>> @@ -34,12 +36,15 @@ extern void free_pid_ns(struct kref *kre
>>>>
>>>> static inline void put_pid_ns(struct pid_namespace *ns)
>>>> {
>>>> +#ifdef CONFIG_PID_NS
>>>> kref_put(&ns->kref, free_pid_ns);
>>>> +#endif
>>>> }
>>>>
>>>> static inline struct task_struct *child_reaper(struct task_struct *tsk)
>>>> {
>>>> - return init_pid_ns.child_reaper;
>>>> + BUG_ON(tsk != current);
>>>> + return tsk->nsproxy->pid_ns->child_reaper;
>>>> }
>>>>
>>>> #endif /* _LINUX_PID_NS_H */
>>> This can't be bisect-safe, right? You can't just use
>>> tsk->nsproxy->pid_ns, as you've pointed out yourself.
>> I can :) See - I have a proving BUG_ON() here.
>
> I didn't know BUG_ON()'s actually warded off bugs :)

It does not, but it says to code reader that this call
expects something special. In this case - tsk is expected
to be current always. And it is.

> You've tested this with the infamous NFS testcase?

What testcase do you mean?

> I don't see *why* it would work for you, but if you claim it does, I
> guess you'd know better than I :)

I don't get you here. I've checked that the task passed to
child_reaper is current always. This BUG_ON prevents later
code from passing arbitrary task to it.

> -serge
>
Re: [PATCH 0/13] Pid namespaces (OpenVZ view) [message #13396 is a reply to message #13348] Fri, 25 May 2007 13:25 Go to previous messageGo to next message
serue is currently offline  serue
Messages: 750
Registered: February 2006
Senior Member
Quoting Pavel Emelianov (xemul@openvz.org):
> Serge E. Hallyn wrote:
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> "Serge E. Hallyn" <serue@us.ibm.com> writes:
> >>
> >>>> 3. Cleaner logic for namespace migration: with this approach
> >>>> one need to save the virtual pid and let global one change;
> >>>> with Suka's logic this is not clear how to migrate the level
> >>>> 2 namespace (concerning init to be level 0).
> >>> This is a very good point.
> >>>
> >>> How *would* we migrate the pids at the second level?
> >> As long as you don't try and restore pids into the initial pid namespace
> >> it isn't a problem. You just record the pid hierarchy and the pid
> >> for a task in that hierarchy. There really is nothing special going on
> >> that should make migration hard.
> >>
> >> Or did I miss something?
> >
> > Hmm, no, i guess you are right. I was thinking that getting the pid for
> > a process woudl be done purely from userspace, but I guess along with a
> > kernel helper to *set* pids, we could also have a kernel helper to get
> > all pids for all pid namespaces "above" that of the process doing the
> > checkpoint.
>
> So do you agree that if we migrate a VS we need to migrate the whole VS?

I started to respond, then realized you were probably asking something
different than I thought. My original response is below, but here is I
think the answer to your question, which is important because I think
your question might highlight a misunderstanding about the design of
Suka's code.

Let's say a vserver is started, and in there a pidns is started for a
checkpoint/restart job. So let's say we have PID 13 in the root
namespace starting PID 14 in a new namespace. So using (pid, pid_ns) as
the terminology, we havd (13,1) as the parent process, and (14,1)=(1,2)
as the init of the vserver. Let's ignore other tasks inthe vserver, and
just talk about (1402,2) as the init of the checkpoint restart job, so
it is (1402,2)=(1,3). And oh, yeah, (1402,2)=(1,3)=(2309,1).

Now when we want to migrate the vserver, a task in pid_ns 2 will look
for all tasks with pids in pidns 2. That will automatically include all
tasks in pid_ns 3. I think you thought I was asking how we would
include pid_ns 3, and are asking whether it woudl be ok to not migrate
pid_ns 3? (answer: it's irrelevant, all tasks in pid_ns 3 are also in
pid_ns 2 - and in pid_ns 1).

What I was actually asking was, in the same situation, how would the
task in pid_ns 2 doing the checkpoint get the pids in pid_ns 3. So it
sees the task as (1402,2), but needs to also store (1,3) and, on
restart, recreate a task with both those pids.

But I guess it will be pretty simple, and fall into place once we get
c/r semantics started.

thanks,
-serge

[ original response ]

I think that's the reasonable thing for people to do, but I don't think
we should force them to. I.e. there is no reason you shouldn't be able
to take one or two tasks out of a pidns and checkpoint them, and restart
them elsewhere. If it turns out they were talking to a third process
which wasn't checkpointed, well, too bad.

What you are more likely to need is a new clean set of namespaces to
restart in, but again I don't think we should enforce that. So whatever
mechanism we end up doing to implementing "clone_with_pid()", we should
handle -EBUSY correctly.

Anyway, why do you ask? (How does it follow from the conversation?)

I wasn't suggesting that it would be ok to only dump part of the pid
information, rather I was asking how we would do it correctly :)
Re: [PATCH 0/13] Pid namespaces (OpenVZ view) [message #13397 is a reply to message #13343] Fri, 25 May 2007 13:29 Go to previous messageGo to next message
serue is currently offline  serue
Messages: 750
Registered: February 2006
Senior Member
Quoting Pavel Emelianov (xemul@openvz.org):
> Serge E. Hallyn wrote:
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> Pavel Emelianov <xemul@openvz.org> writes:
> >>
> >>> That's how OpenVZ sees the pid namespaces.
> >>>
> >>> The main idea is that kernel keeps operating with tasks pid
> >>> as it did before, but each task obtains one more pid for each
> >>> pid type - the virtual pid. When putting the pid to user or
> >>> getting the pid from it kernel operates with the virtual ones.
> >> Just a quick reaction.
> >>
> >> - I would very much like to see a minimum of 3 levels of pids,
> >> being supported. Otherwise it is easy to overlook some of the
> >> cases that are required to properly support nesting, which long
> >> terms seems important.
> >
> > Pavel,
> >
> > If I wanted to start a virtual server and in there start some checkpoint
> > restart jobs, so I start a new pid namespace inside the c/r job, what
> > will happen?
>
> What will happen with this namespace on restore? What pids will
> you assign to it in the parent (but not that init) namespace?

No, no, my question is earlier. Maybe my use of the term
"checkpoint/restart job" is confusing, so let me call it a "batch job"
instead, with the understanding that it is started with the intent of
being safely checkpoint/restartable later on.

So in the original batch job, started in a vserver, what will the pids
look like in the checkpoint/restart job?

But I think I know the answer - you'll leave vpid == pid for these
tasks, and only set vpid differently when restarting a job, since that's
when you really care?

So the only situation where there might be a shortcoming is when
restarting a job in a vserver?

-serge

> a. arbitrary: that means that you don't care that subgroup
> of tasks in the VS namespace. Thus why don't move them
> into separate namespace
> b. try to hold them as they were: this way is likely to fail
> and can work w/o namespaces at all.
>
> So what's your answer?
>
> > a. second pidns unshare is refused
> > b. second pidns unshare is allowed, but c/r job is not visible
> > from the virtual server (but is from the global pidns)
> > c. second pidns unshare is allowed, and somehow the c/r job
> > is visible from the virtual server
> >
> > If (a), is this a short-term shortcoming for simplicity of prototype and
> > code review, or do you think it's actually the right thing t do long
> > term?
> >
> > thanks,
> > -serge
> >
> >> - Semantically fork is easier then unshare. Unshare can mean
> >> a lot of things, and it is easy to pick a meaning that has weird
> >> side effects. Your implementation has a serious problem in that you
> >> change the value of getpid() at runtime. Glibc does not know how to
> >> cope with the value of getpid() changing.
> >>
> >> Eric
> >
Re: [PATCH 0/13] Pid namespaces (OpenVZ view) [message #13400 is a reply to message #13396] Fri, 25 May 2007 13:53 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Serge E. Hallyn wrote:
> Quoting Pavel Emelianov (xemul@openvz.org):
>> Serge E. Hallyn wrote:
>>> Quoting Eric W. Biederman (ebiederm@xmission.com):
>>>> "Serge E. Hallyn" <serue@us.ibm.com> writes:
>>>>
>>>>>> 3. Cleaner logic for namespace migration: with this approach
>>>>>> one need to save the virtual pid and let global one change;
>>>>>> with Suka's logic this is not clear how to migrate the level
>>>>>> 2 namespace (concerning init to be level 0).
>>>>> This is a very good point.
>>>>>
>>>>> How *would* we migrate the pids at the second level?
>>>> As long as you don't try and restore pids into the initial pid namespace
>>>> it isn't a problem. You just record the pid hierarchy and the pid
>>>> for a task in that hierarchy. There really is nothing special going on
>>>> that should make migration hard.
>>>>
>>>> Or did I miss something?
>>> Hmm, no, i guess you are right. I was thinking that getting the pid for
>>> a process woudl be done purely from userspace, but I guess along with a
>>> kernel helper to *set* pids, we could also have a kernel helper to get
>>> all pids for all pid namespaces "above" that of the process doing the
>>> checkpoint.
>> So do you agree that if we migrate a VS we need to migrate the whole VS?
>
> I started to respond, then realized you were probably asking something
> different than I thought. My original response is below, but here is I
> think the answer to your question, which is important because I think
> your question might highlight a misunderstanding about the design of
> Suka's code.
>
> Let's say a vserver is started, and in there a pidns is started for a
> checkpoint/restart job. So let's say we have PID 13 in the root
> namespace starting PID 14 in a new namespace. So using (pid, pid_ns) as
> the terminology, we havd (13,1) as the parent process, and (14,1)=(1,2)
> as the init of the vserver. Let's ignore other tasks inthe vserver, and
> just talk about (1402,2) as the init of the checkpoint restart job, so
> it is (1402,2)=(1,3). And oh, yeah, (1402,2)=(1,3)=(2309,1).

Oh, this is heavy... Lets draw some diagrams.

You have a vserver with a namespace in it with a cpt job in it,
just like this:

[node. pids look like (N)]
`- [vserver. pids look like (N,V)]
`- [cpt job. pids look like (N,V,P)]

Is that OK?

We have task in "node" with pid (13) which spawns the task with
pid (14,1) into the "vserver", like this:

(13)
`- (14,1)

If so, then what the notion (14,1)=(1,2) mean?

As far as the "cpt job" is concerned we have smth like this:

(13)
`- (14,1)
`- (1402,2,1)

where (1402,2,1) is the root of the "cpt job", right?

> Now when we want to migrate the vserver, a task in pid_ns 2 will look
> for all tasks with pids in pidns 2. That will automatically include all
> tasks in pid_ns 3. I think you thought I was asking how we would
> include pid_ns 3, and are asking whether it woudl be ok to not migrate
> pid_ns 3? (answer: it's irrelevant, all tasks in pid_ns 3 are also in
> pid_ns 2 - and in pid_ns 1).
>
> What I was actually asking was, in the same situation, how would the
> task in pid_ns 2 doing the checkpoint get the pids in pid_ns 3. So it
> sees the task as (1402,2), but needs to also store (1,3) and, on
> restart, recreate a task with both those pids.
>
> But I guess it will be pretty simple, and fall into place once we get
> c/r semantics started.
>
> thanks,
> -serge
>
> [ original response ]
>
> I think that's the reasonable thing for people to do, but I don't think
> we should force them to. I.e. there is no reason you shouldn't be able
> to take one or two tasks out of a pidns and checkpoint them, and restart
> them elsewhere. If it turns out they were talking to a third process
> which wasn't checkpointed, well, too bad.
>
> What you are more likely to need is a new clean set of namespaces to
> restart in, but again I don't think we should enforce that. So whatever
> mechanism we end up doing to implementing "clone_with_pid()", we should
> handle -EBUSY correctly.
>
> Anyway, why do you ask? (How does it follow from the conversation?)
>
> I wasn't suggesting that it would be ok to only dump part of the pid
> information, rather I was asking how we would do it correctly :)
>
Re: [PATCH 2/13] Small preparations for namespaces [message #13401 is a reply to message #13395] Fri, 25 May 2007 13:55 Go to previous messageGo to next message
serue is currently offline  serue
Messages: 750
Registered: February 2006
Senior Member
Quoting Pavel Emelianov (xemul@openvz.org):
> Serge E. Hallyn wrote:
> > Quoting Pavel Emelianov (xemul@openvz.org):
> >> Serge E. Hallyn wrote:
> >>> Quoting Pavel Emelianov (xemul@openvz.org):
> >>>> This includes #ifdefs in get/put_pid_ns and rewriting
> >>>> the child_reaper() function to the more logical view.
> >>>>
> >>>> This doesn't fit logically into any other patch so
> >>>> I decided to make it separate.
> >>>>
> >>>> Signed-off-by: Pavel Emelianov <xemul@openvz.org>
> >>>>
> >>>> ---
> >>>>
> >>>> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> >>>> index 169c6c2..7af7191 100644
> >>>> --- a/include/linux/pid_namespace.h
> >>>> +++ b/include/linux/pid_namespace.h
> >>>> @@ -26,7 +26,9 @@ extern struct pid_namespace init_pid_ns;
> >>>>
> >>>> static inline void get_pid_ns(struct pid_namespace *ns)
> >>>> {
> >>>> +#ifdef CONFIG_PID_NS
> >>>> kref_get(&ns->kref);
> >>>> +#endif
> >>>> }
> >>>>
> >>>> extern struct pid_namespace *copy_pid_ns(int flags, struct pid_namespace *ns);
> >>>> @@ -34,12 +36,15 @@ extern void free_pid_ns(struct kref *kre
> >>>>
> >>>> static inline void put_pid_ns(struct pid_namespace *ns)
> >>>> {
> >>>> +#ifdef CONFIG_PID_NS
> >>>> kref_put(&ns->kref, free_pid_ns);
> >>>> +#endif
> >>>> }
> >>>>
> >>>> static inline struct task_struct *child_reaper(struct task_struct *tsk)
> >>>> {
> >>>> - return init_pid_ns.child_reaper;
> >>>> + BUG_ON(tsk != current);
> >>>> + return tsk->nsproxy->pid_ns->child_reaper;
> >>>> }
> >>>>
> >>>> #endif /* _LINUX_PID_NS_H */
> >>> This can't be bisect-safe, right? You can't just use
> >>> tsk->nsproxy->pid_ns, as you've pointed out yourself.
> >> I can :) See - I have a proving BUG_ON() here.
> >
> > I didn't know BUG_ON()'s actually warded off bugs :)
>
> It does not, but it says to code reader that this call
> expects something special. In this case - tsk is expected
> to be current always. And it is.

I don't think that's sufficient.

It's been awhile so I'm fuzzy on the details, but I think we only fixed
the race by always returning init_pid_ns instead of tsk->nsproxy_pid_ns,
and tsk being current is not safe.

> > You've tested this with the infamous NFS testcase?
>
> What testcase do you mean?

http://lkml.org/lkml/2007/1/17/65

> > I don't see *why* it would work for you, but if you claim it does, I
> > guess you'd know better than I :)
>
> I don't get you here. I've checked that the task passed to
> child_reaper is current always. This BUG_ON prevents later
> code from passing arbitrary task to it.

I don't think that's enough.

thanks,
-serge
Re: [PATCH 0/13] Pid namespaces (OpenVZ view) [message #13405 is a reply to message #13400] Fri, 25 May 2007 14:25 Go to previous messageGo to next message
serue is currently offline  serue
Messages: 750
Registered: February 2006
Senior Member
Quoting Pavel Emelianov (xemul@openvz.org):
> Serge E. Hallyn wrote:
> > Quoting Pavel Emelianov (xemul@openvz.org):
> >> Serge E. Hallyn wrote:
> >>> Quoting Eric W. Biederman (ebiederm@xmission.com):
> >>>> "Serge E. Hallyn" <serue@us.ibm.com> writes:
> >>>>
> >>>>>> 3. Cleaner logic for namespace migration: with this approach
> >>>>>> one need to save the virtual pid and let global one change;
> >>>>>> with Suka's logic this is not clear how to migrate the level
> >>>>>> 2 namespace (concerning init to be level 0).
> >>>>> This is a very good point.
> >>>>>
> >>>>> How *would* we migrate the pids at the second level?
> >>>> As long as you don't try and restore pids into the initial pid namespace
> >>>> it isn't a problem. You just record the pid hierarchy and the pid
> >>>> for a task in that hierarchy. There really is nothing special going on
> >>>> that should make migration hard.
> >>>>
> >>>> Or did I miss something?
> >>> Hmm, no, i guess you are right. I was thinking that getting the pid for
> >>> a process woudl be done purely from userspace, but I guess along with a
> >>> kernel helper to *set* pids, we could also have a kernel helper to get
> >>> all pids for all pid namespaces "above" that of the process doing the
> >>> checkpoint.
> >> So do you agree that if we migrate a VS we need to migrate the whole VS?
> >
> > I started to respond, then realized you were probably asking something
> > different than I thought. My original response is below, but here is I
> > think the answer to your question, which is important because I think
> > your question might highlight a misunderstanding about the design of
> > Suka's code.
> >
> > Let's say a vserver is started, and in there a pidns is started for a
> > checkpoint/restart job. So let's say we have PID 13 in the root
> > namespace starting PID 14 in a new namespace. So using (pid, pid_ns) as
> > the terminology, we havd (13,1) as the parent process, and (14,1)=(1,2)
> > as the init of the vserver. Let's ignore other tasks inthe vserver, and
> > just talk about (1402,2) as the init of the checkpoint restart job, so
> > it is (1402,2)=(1,3). And oh, yeah, (1402,2)=(1,3)=(2309,1).
>
> Oh, this is heavy... Lets draw some diagrams.
>
> You have a vserver with a namespace in it with a cpt job in it,
> just like this:
>
> [node. pids look like (N)]
> `- [vserver. pids look like (N,V)]
> `- [cpt job. pids look like (N,V,P)]
>
> Is that OK?

It's different from the notation I was using.

Let's stick to calling every process by a full "upid", i.e.
(pid, pid namespace #) because while it's longer it gives more
information.

> We have task in "node" with pid (13) which spawns the task with
> pid (14,1) into the "vserver", like this:
>
> (13)
> `- (14,1)
>
> If so, then what the notion (14,1)=(1,2) mean?

It means that (pid 14, pid_ns 1) = (pid 1, pid_ns 2). It describes one
task, which in pid namespace 1 is known by pid 14, and in pid namespace
2 is known by pid 1.

(I see the repetative low numbers were confusing...)

> As far as the "cpt job" is concerned we have smth like this:
>
> (13)
> `- (14,1)
> `- (1402,2,1)
>
> where (1402,2,1) is the root of the "cpt job", right?

Sure, and in my notation this would be

[(13,1)]
`- [(14,1)(1,2)]
`- [(2309,1)(1402,2)(1,3)]

Again each level is just one task, but known by several pids.

So coming back to the idea of checkpoint all of pid_ns=2, we would be
checkpointing both task [(14,1)(1,2)] and task [(2309,1)(1402,2)(1,3)].
And my question had been how would we access and store the fact that the
third task has pid (1,3), which we MUST store and reset, because that is
that task's active pid namespace, meaning it only knows itself as (1,3).

The task in pid namespace 2 which is doing the checkpointing generally
only knows the third task as (1402,2), so we need to provide a mechanism
for it to dump all pids in "higher" pid namespaces.

Note that, of course, pids in "lower" pid namespaces can be randomly
set. If we are restarting pid namespace 2 on a new system, it's
perfeclty ok for the pids to look like:

[(467,1)]
`- [(5597,1)(1,2)]
`- [(5598,1)(1402,2)(1,3)]

Heh, or even

[(14,1)(467,2)]
`- [(444,1)(5597,2)(1,3)]
`- [(445,1)(5598,2)(1402,3)(1,4)]

thanks,
-serge

> > Now when we want to migrate the vserver, a task in pid_ns 2 will look
> > for all tasks with pids in pidns 2. That will automatically include all
> > tasks in pid_ns 3. I think you thought I was asking how we would
> > include pid_ns 3, and are asking whether it woudl be ok to not migrate
> > pid_ns 3? (answer: it's irrelevant, all tasks in pid_ns 3 are also in
> > pid_ns 2 - and in pid_ns 1).
> >
> > What I was actually asking was, in the same situation, how would the
> > task in pid_ns 2 doing the checkpoint get the pids in pid_ns 3. So it
> > sees the task as (1402,2), but needs to also store (1,3) and, on
> > restart, recreate a task with both those pids.
> >
> > But I guess it will be pretty simple, and fall into place once we get
> > c/r semantics started.
> >
> > thanks,
> > -serge
> >
> > [ original response ]
> >
> > I think that's the reasonable thing for people to do, but I don't think
> > we should force them to. I.e. there is no reason you shouldn't be able
> > to take one or two tasks out of a pidns and checkpoint them, and restart
> > them elsewhere. If it turns out they were talking to a third process
> > which wasn't checkpointed, well, too bad.
> >
> > What you are more likely to need is a new clean set of namespaces to
> > restart in, but again I don't think we should enforce that. So whatever
> > mechanism we end up doing to implementing "clone_with_pid()", we should
> > handle -EBUSY correctly.
> >
> > Anyway, why do you ask? (How does it follow from the conversation?)
> >
> > I wasn't suggesting that it would be ok to only dump part of the pid
> > information, rather I was asking how we would do it correctly :)
> >
Re: [PATCH 0/13] Pid namespaces (OpenVZ view) [message #13409 is a reply to message #13347] Fri, 25 May 2007 14:55 Go to previous messageGo to next message
ebiederm is currently offline  ebiederm
Messages: 1354
Registered: February 2006
Senior Member
Pavel Emelianov <xemul@openvz.org> writes:
>
> *I* would like to know how you migrate a *part* of a virtual
> server? What happens with pids, IPC ids, network connections?
>
> There are many entities in VS that are not bound to task, but to
> VS and if you migrate only half of them you're risking in loosing
> the integrity of the VS. If you don't care it - why do you need
> namespaces at all?

Well there are other uses for namespaces like providing a context
for private mounts.

That said the concept for migration is not a partial VS. But
a nested VS.

Eric
Re: [PATCH 11/13] Changes to show virtual ids to user [message #13410 is a reply to message #13344] Fri, 25 May 2007 15:48 Go to previous messageGo to next message
ebiederm is currently offline  ebiederm
Messages: 1354
Registered: February 2006
Senior Member
Pavel Emelianov <xemul@openvz.org> writes:

> Eric W. Biederman wrote:
>> Pavel Emelianov <xemul@sw.ru> writes:
>>
>>> That's true. Sending of signal from parent ns to children
>>> is tricky question. It has many solutions, I wanted to
>>> discuss which one is better:
>>
>> With unix domain sockets and the like it is conceivable we get
>> a pid transfer from one namespace to another and both namespaces
>> are leaf namespaces. I don't remember we can get a leaf to leaf
>> transfer when sending signals.
>
> We should not allow any transfer from leaf NS to leaf NS.
> Should I explain why?

In a checkpointable context it is a bad thing, and we can prevent it
by carefully setting up all of the namespaces.

However it is a fundamental possibility that exists, and because we
can avoid it with careful setup. I don't see a reason to deny it
if something was either inadvertantly or explicitly causes it
to happen.

Do you have another reason for denying the transfer that I'm
not thinking of?


>>
>> The worst case I can see with pid == 0. Is that it would be a bug
>> that we can fix later. For other cases it would seem to be a user
>> space API thing that we get stuck with for all time.
>
> We cannot trust userspace application to expect some pid other than
> positive. All that we can is either use some always-absent pid or
> send the signal as SI_KERNEL.
>
> Our experience show that making decisions like above causes random
> applications failures that are hard (or even impossible) to debug.

Ok. So I guess I see what you are proposing is picking an arbitrary
pid, say pid == 2, and reserving that in all pid namespaces and using
it when we have a pid that does not map to a specific namespace. I'm
fine with that.

All I care about is that we have a solution, preferably simple,
to the non-mapped pid problem.

Eric
Re: [PATCH 0/13] Pid namespaces (OpenVZ view) [message #13485 is a reply to message #13405] Mon, 28 May 2007 07:48 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Serge E. Hallyn wrote:
> Quoting Pavel Emelianov (xemul@openvz.org):
>> Serge E. Hallyn wrote:
>>> Quoting Pavel Emelianov (xemul@openvz.org):
>>>> Serge E. Hallyn wrote:
>>>>> Quoting Eric W. Biederman (ebiederm@xmission.com):
>>>>>> "Serge E. Hallyn" <serue@us.ibm.com> writes:
>>>>>>
>>>>>>>> 3. Cleaner logic for namespace migration: with this approach
>>>>>>>> one need to save the virtual pid and let global one change;
>>>>>>>> with Suka's logic this is not clear how to migrate the level
>>>>>>>> 2 namespace (concerning init to be level 0).
>>>>>>> This is a very good point.
>>>>>>>
>>>>>>> How *would* we migrate the pids at the second level?
>>>>>> As long as you don't try and restore pids into the initial pid namespace
>>>>>> it isn't a problem. You just record the pid hierarchy and the pid
>>>>>> for a task in that hierarchy. There really is nothing special going on
>>>>>> that should make migration hard.
>>>>>>
>>>>>> Or did I miss something?
>>>>> Hmm, no, i guess you are right. I was thinking that getting the pid for
>>>>> a process woudl be done purely from userspace, but I guess along with a
>>>>> kernel helper to *set* pids, we could also have a kernel helper to get
>>>>> all pids for all pid namespaces "above" that of the process doing the
>>>>> checkpoint.
>>>> So do you agree that if we migrate a VS we need to migrate the whole VS?
>>> I started to respond, then realized you were probably asking something
>>> different than I thought. My original response is below, but here is I
>>> think the answer to your question, which is important because I think
>>> your question might highlight a misunderstanding about the design of
>>> Suka's code.
>>>
>>> Let's say a vserver is started, and in there a pidns is started for a
>>> checkpoint/restart job. So let's say we have PID 13 in the root
>>> namespace starting PID 14 in a new namespace. So using (pid, pid_ns) as
>>> the terminology, we havd (13,1) as the parent process, and (14,1)=(1,2)
>>> as the init of the vserver. Let's ignore other tasks inthe vserver, and
>>> just talk about (1402,2) as the init of the checkpoint restart job, so
>>> it is (1402,2)=(1,3). And oh, yeah, (1402,2)=(1,3)=(2309,1).
>> Oh, this is heavy... Lets draw some diagrams.
>>
>> You have a vserver with a namespace in it with a cpt job in it,
>> just like this:
>>
>> [node. pids look like (N)]
>> `- [vserver. pids look like (N,V)]
>> `- [cpt job. pids look like (N,V,P)]
>>
>> Is that OK?
>
> It's different from the notation I was using.
>
> Let's stick to calling every process by a full "upid", i.e.
> (pid, pid namespace #) because while it's longer it gives more
> information.
>
>> We have task in "node" with pid (13) which spawns the task with
>> pid (14,1) into the "vserver", like this:
>>
>> (13)
>> `- (14,1)
>>
>> If so, then what the notion (14,1)=(1,2) mean?
>
> It means that (pid 14, pid_ns 1) = (pid 1, pid_ns 2). It describes one
> task, which in pid namespace 1 is known by pid 14, and in pid namespace
> 2 is known by pid 1.
>
> (I see the repetative low numbers were confusing...)
>
>> As far as the "cpt job" is concerned we have smth like this:
>>
>> (13)
>> `- (14,1)
>> `- (1402,2,1)
>>
>> where (1402,2,1) is the root of the "cpt job", right?
>
> Sure, and in my notation this would be
>
> [(13,1)]
> `- [(14,1)(1,2)]
> `- [(2309,1)(1402,2)(1,3)]
>
> Again each level is just one task, but known by several pids.
>
> So coming back to the idea of checkpoint all of pid_ns=2, we would be
> checkpointing both task [(14,1)(1,2)] and task [(2309,1)(1402,2)(1,3)].
> And my question had been how would we access and store the fact that the
> third task has pid (1,3), which we MUST store and reset, because that is
> that task's active pid namespace, meaning it only knows itself as (1,3).
>
> The task in pid namespace 2 which is doing the checkpointing generally
> only knows the third task as (1402,2), so we need to provide a mechanism
> for it to dump all pids in "higher" pid namespaces.
>
> Note that, of course, pids in "lower" pid namespaces can be randomly
> set. If we are restarting pid namespace 2 on a new system, it's
> perfeclty ok for the pids to look like:
>
> [(467,1)]
> `- [(5597,1)(1,2)]
> `- [(5598,1)(1402,2)(1,3)]
>
> Heh, or even
>
> [(14,1)(467,2)]
> `- [(444,1)(5597,2)(1,3)]
> `- [(445,1)(5598,2)(1402,3)(1,4)]

Hmm. I see. So you don't care that the pids in the namespace #2 are still
the same. I can understand that politics for namespace #1, but for #2...

OK, if you need this let us go on with such model, but I'd like to see
the CONFIG_PID_NS_MULTILEVEL for this. Or at least CONFIG_PID_NS_FLAT for
my model as we do not need to sacrifice the performance to such generic
behavior.

Thanks,
Pavel.

>
> thanks,
> -serge
>
>>> Now when we want to migrate the vserver, a task in pid_ns 2 will look
>>> for all tasks with pids in pidns 2. That will automatically include all
>>> tasks in pid_ns 3. I think you thought I was asking how we would
>>> include pid_ns 3, and are asking whether it woudl be ok to not migrate
>>> pid_ns 3? (answer: it's irrelevant, all tasks in pid_ns 3 are also in
>>> pid_ns 2 - and in pid_ns 1).
>>>
>>> What I was actually asking was, in the same situation, how would the
>>> task in pid_ns 2 doing the checkpoint get the pids in pid_ns 3. So it
>>> sees the task as (1402,2), but needs to also store (1,3) and, on
>>> restart, recreate a task with both those pids.
>>>
>>> But I guess it will be pretty simple, and fall into place once we get
>>> c/r semantics started.
>>>
>>> thanks,
>>> -serge
>>>
>>> [ original response ]
>>>
>>> I think that's the reasonable thing for people to do, but I don't think
>>> we should force them to. I.e. there is no reason you shouldn't be able
>>> to take one or two tasks out of a pidns and checkpoint them, and restart
>>> them elsewhere. If it turns out they were talking to a third process
>>> which wasn't checkpointed, well, too bad.
>>>
>>> What you are more likely to need is a new clean set of namespaces to
>>> restart in, but again I don't think we should enforce that. So whatever
>>> mechanism we end up doing to implementing "clone_with_pid()", we should
>>> handle -EBUSY correctly.
>>>
>>> Anyway, why do you ask? (How does it follow from the conversation?)
>>>
>>> I wasn't suggesting that it would be ok to only dump part of the pid
>>> information, rather I was asking how we would do it correctly :)
>>>
>
Re: Instructions of how to make testing easy [message #13490 is a reply to message #13299] Thu, 24 May 2007 14:55 Go to previous messageGo to next message
Cedric Le Goater is currently offline  Cedric Le Goater
Messages: 443
Registered: February 2006
Senior Member
Hello Pavel !

I'm giving it a try.

For those using qemu, you'll need this patch:

http://lkml.org/lkml/2007/5/16/360

thanks for the patchset pavel.

C.
Re: Re: [PATCH 0/13] Pid namespaces (OpenVZ view) [message #13497 is a reply to message #13350] Mon, 28 May 2007 11:50 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Pavel Emelianov wrote:

>>> Though I haven't
>>>seen any problems with glibc for many years running OpenVZ and I think,
>>>that if glibc will want to cache this getpid() value we can teach it to
>>>uncache this value in case someone called unshare() with CLONE_NEWPIDS.
>>
>>glibc very much caches the results of getpid().
>
>
> Can you prove it? We have run OpenVZ for many years and with many
> userspace configurations and we haven't seen the problems with
> glibc ever.

Pavel, but we always do full namespace entering with fork()'s and such
actions. So we simply couldn't trigger getpid() caching.

Thanks,
Kirill
Re: [PATCH 0/13] Pid namespaces (OpenVZ view) [message #13506 is a reply to message #13485] Tue, 29 May 2007 04:30 Go to previous messageGo to next message
ebiederm is currently offline  ebiederm
Messages: 1354
Registered: February 2006
Senior Member
Pavel Emelianov <xemul@openvz.org> writes:

> Hmm. I see. So you don't care that the pids in the namespace #2 are still
> the same. I can understand that politics for namespace #1, but for #2...

I'm confused, I think the statement above is wrong.

If we just checkpoint/restart a leaf pid namespace we don't care about
the other pids, in other namespace.

If we checkpoint/restart a pid namespace with another pid namespace
nested inside it we need to preserve the pids in the pid namespace we
are checkpointing and in a nested pid namespaces.

Pids in namespaces that none of the process we are migrating cannot
see we do not care about. (i.e. the init pid namespace, and possibly
some of it's children)

> OK, if you need this let us go on with such model, but I'd like to see
> the CONFIG_PID_NS_MULTILEVEL for this. Or at least CONFIG_PID_NS_FLAT for
> my model as we do not need to sacrifice the performance to such generic
> behavior.

Where is the world would a performance sacrafice come in? If you
happen to be using a deeply nested pid namespace I can see a small
performance hit, there is fundamentally more to do. However if you
don't use a nested pid namespace there should not be more work todo
and it should be impossible to measure the over head.

Further 3 levels should be as simple to implement and as cheap as two
levels. Because we can continue to use static allocation.

Eric
Re: [PATCH 0/13] Pid namespaces (OpenVZ view) [message #13520 is a reply to message #13506] Tue, 29 May 2007 07:47 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Eric W. Biederman wrote:
> Pavel Emelianov <xemul@openvz.org> writes:
>
>> Hmm. I see. So you don't care that the pids in the namespace #2 are still
>> the same. I can understand that politics for namespace #1, but for #2...
>
> I'm confused, I think the statement above is wrong.
>
> If we just checkpoint/restart a leaf pid namespace we don't care about
> the other pids, in other namespace.
>
> If we checkpoint/restart a pid namespace with another pid namespace
> nested inside it we need to preserve the pids in the pid namespace we
> are checkpointing and in a nested pid namespaces.
>
> Pids in namespaces that none of the process we are migrating cannot
> see we do not care about. (i.e. the init pid namespace, and possibly
> some of it's children)
>
>> OK, if you need this let us go on with such model, but I'd like to see
>> the CONFIG_PID_NS_MULTILEVEL for this. Or at least CONFIG_PID_NS_FLAT for
>> my model as we do not need to sacrifice the performance to such generic
>> behavior.
>
> Where is the world would a performance sacrafice come in? If you

Easy! Consider the problem of getting a list of pids for proc. In case
of flat layout we just take a number from a known structure. In case of
nested pids we have to scan through the list of pid_elem-s or lookup
the hash or something similar.

The same stays true for wait() when we have to compare pids in the
eligible_child(), for setpgid(), terminal ioctls and so on and so forth.

Not to be unfounded I will measure booth cases with unixbench's spawn,
execl and shell tests and with "ps -xaf" and report the results. All will
be run in init namespace and in "level one" namespace. If the flat layout
wins (with noticeable difference) I would insist having two of them. Agree?

> happen to be using a deeply nested pid namespace I can see a small
> performance hit, there is fundamentally more to do. However if you
> don't use a nested pid namespace there should not be more work todo
> and it should be impossible to measure the over head.
>
> Further 3 levels should be as simple to implement and as cheap as two
> levels. Because we can continue to use static allocation.

Wait a bit. Do you mean that there's enough to have only 3 levels of
namespaces? I.e. to have a struct pid look like
struct pid {
int pid;
int pid1; /* for first level */
int pid2; /* for 2nd level */
...
}
?

> Eric
>
Re: [PATCH 0/13] Pid namespaces (OpenVZ view) [message #13532 is a reply to message #13520] Tue, 29 May 2007 12:36 Go to previous messageGo to next message
ebiederm is currently offline  ebiederm
Messages: 1354
Registered: February 2006
Senior Member
Pavel Emelianov <xemul@openvz.org> writes:

> Eric W. Biederman wrote:
>> Pavel Emelianov <xemul@openvz.org> writes:
>> Where is the world would a performance sacrafice come in? If you
>
> Easy! Consider the problem of getting a list of pids for proc. In case
> of flat layout we just take a number from a known structure. In case of
> nested pids we have to scan through the list of pid_elem-s or lookup
> the hash or something similar.

We walk through the pidmap. That should not change either way.

I'm actually not horribly fond of walking through the pidmap but
it was needed for correctness so we could have a stable token we could
return to user space for restarting readdir in /proc.

> The same stays true for wait() when we have to compare pids in the
> eligible_child(), for setpgid(), terminal ioctls and so on and so forth.

We should be comparing struct pid pointers not user space pid_t values.
With that being the case we should convert at the edge of user space
and all should be good.

>> happen to be using a deeply nested pid namespace I can see a small
>> performance hit, there is fundamentally more to do. However if you
>> don't use a nested pid namespace there should not be more work todo
>> and it should be impossible to measure the over head.
>>
>> Further 3 levels should be as simple to implement and as cheap as two
>> levels. Because we can continue to use static allocation.
>
> Wait a bit. Do you mean that there's enough to have only 3 levels of
> namespaces? I.e. to have a struct pid look like
> struct pid {
> int pid;
> int pid1; /* for first level */
> int pid2; /* for 2nd level */
> ...
> }
> ?

Initially yes. 3 levels should be enough. Ultimately we may want more
but that should be a small tweak at the implementation level. Nothing
outside of the pid functions should care.

Eric
Re: [PATCH 0/13] Pid namespaces (OpenVZ view) [message #13535 is a reply to message #13350] Tue, 29 May 2007 13:07 Go to previous messageGo to next message
ebiederm is currently offline  ebiederm
Messages: 1354
Registered: February 2006
Senior Member
Hmm. I seem to have forgotten to send this one.

Pavel Emelianov <xemul@openvz.org> writes:

> Eric W. Biederman wrote:

> Generic structures are not always needed. Say, why don't we
> have N-level page tables in kernel? Why not make them generic?
> What if some ia128 architecture will require 7-level tables!?

PID namespaces unlike the other namespaces are fundamentally nested.
Which is an unfortunate pain. But if you want to allow nesting
of containers of different types such as system containers and
application containers you need nested PID namespaces.

>> Having more then two layers means we are prepared to use pid namespaces more
>> generally. It really isn't that much harder.
>
> It is not, but do we need to spend so much time on solving
> not relevant problems?

It is relevant to some of us. Therefore it is a relevant problem.

>>>> - Semantically fork is easier then unshare. Unshare can mean
>>> This is not. When you fork, the kid shares the session and the
>>> group with its parent, but moving this pids to new ns is bad - the
>>> parent will happen to be half-moved. Thus you need to break the
>>> session and the group in fork(), but this is extra complexity.
>>
>> Nope. You will just need to have the child call setsid() if
>> you don't want to share the session and the group.
>
> Of course, but setsid() must be done *before* creating a new
> namespace, Otherwise you will have a half-inserted into new
> namespace task. This sounds awful.

We can experience weird interactions, but not really worse then the
sending a signal from outside the namespace. So we may want to
map the pids of the session and the pgrp into the new namespace but
functionally it's not really a big deal, and we can call setsid
after the fork.

>>>> a lot of things, and it is easy to pick a meaning that has weird
>>>> side effects. Your implementation has a serious problem in that you
>>>> change the value of getpid() at runtime. Glibc does not know how to
>>>> cope with the value of getpid() changing.
>>> This pid changing happens only once per task lifetime.
>>
>> Unshare isn't once per task lifetime, unless you added some extra
>> constraints.
>
> It is once. You create a new namespace and that's all.

What prevents you from calling unshare multiple times?

>>> Though I haven't
>>> seen any problems with glibc for many years running OpenVZ and I think,
>>> that if glibc will want to cache this getpid() value we can teach it to
>>> uncache this value in case someone called unshare() with CLONE_NEWPIDS.
>>
>> glibc very much caches the results of getpid().
>
> Can you prove it? We have run OpenVZ for many years and with many
> userspace configurations and we haven't seen the problems with
> glibc ever.

Yes. I did a migration prototype in user space. Migrated a process
to a new pid, but getpid returned the pid before migration. So I
investigated why, including reading the glibc code. glibc cache the
pid value. Once the value is cached only a fork invalidates the
cache.

From: nptl/sysdeps/unix/sysv/linux/getpid.c

pid_t
__getpid (void)
{
pid_t result = THREAD_GETMEM (THREAD_SELF, pid);
if (__builtin_expect (result <= 0, 0))
result = really_getpid (result);
return result;
}

THREAD_GETMEM is a memory read.
really_getpid is the syscall.

Eric
Re: [PATCH 11/13] Changes to show virtual ids to user [message #13605 is a reply to message #13410] Tue, 29 May 2007 12:32 Go to previous messageGo to next message
Cedric Le Goater is currently offline  Cedric Le Goater
Messages: 443
Registered: February 2006
Senior Member
Hello !

>>> The worst case I can see with pid == 0. Is that it would be a bug
>>> that we can fix later. For other cases it would seem to be a user
>>> space API thing that we get stuck with for all time.
>> We cannot trust userspace application to expect some pid other than
>> positive. All that we can is either use some always-absent pid or
>> send the signal as SI_KERNEL.
>>
>> Our experience show that making decisions like above causes random
<>> applications failures that are hard (or even impossible) to debug.

> Ok. So I guess I see what you are proposing is picking an arbitrary
> pid, say pid == 2, and reserving that in all pid namespaces and using
> it when we have a pid that does not map to a specific namespace. I'm
> fine with that.
>
> All I care about is that we have a solution, preferably simple,
> to the non-mapped pid problem.

Pavel, are you against using pid == 0 and setting si_code to SI_KERNEL ?

C.
Re: [PATCH 11/13] Changes to show virtual ids to user [message #13626 is a reply to message #13605] Thu, 31 May 2007 07:57 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Cedric Le Goater wrote:
> Hello !
>
>>>> The worst case I can see with pid == 0. Is that it would be a bug
>>>> that we can fix later. For other cases it would seem to be a user
>>>> space API thing that we get stuck with for all time.
>>> We cannot trust userspace application to expect some pid other than
>>> positive. All that we can is either use some always-absent pid or
>>> send the signal as SI_KERNEL.
>>>
>>> Our experience show that making decisions like above causes random
> <>> applications failures that are hard (or even impossible) to debug.
>
>> Ok. So I guess I see what you are proposing is picking an arbitrary
>> pid, say pid == 2, and reserving that in all pid namespaces and using
>> it when we have a pid that does not map to a specific namespace. I'm
>> fine with that.
>>
>> All I care about is that we have a solution, preferably simple,
>> to the non-mapped pid problem.
>
> Pavel, are you against using pid == 0 and setting si_code to SI_KERNEL ?

I think I am. A quick grep through the code revealed one place where
this can happen, so I believe application are (have to be) somehow
prepared to this.

> C.
>
Re: [PATCH 11/13] Changes to show virtual ids to user [message #13627 is a reply to message #13626] Thu, 31 May 2007 08:00 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Pavel Emelianov wrote:
> Cedric Le Goater wrote:
>> Hello !
>>
>>>>> The worst case I can see with pid == 0. Is that it would be a bug
>>>>> that we can fix later. For other cases it would seem to be a user
>>>>> space API thing that we get stuck with for all time.
>>>> We cannot trust userspace application to expect some pid other than
>>>> positive. All that we can is either use some always-absent pid or
>>>> send the signal as SI_KERNEL.
>>>>
>>>> Our experience show that making decisions like above causes random
>> <>> applications failures that are hard (or even impossible) to debug.
>>
>>> Ok. So I guess I see what you are proposing is picking an arbitrary
>>> pid, say pid == 2, and reserving that in all pid namespaces and using
>>> it when we have a pid that does not map to a specific namespace. I'm
>>> fine with that.
>>>
>>> All I care about is that we have a solution, preferably simple,
>>> to the non-mapped pid problem.
>> Pavel, are you against using pid == 0 and setting si_code to SI_KERNEL ?
>
> I think I am. A quick grep through the code revealed one place where

Sorry. I have misprinted. I meant "I think I am *NOT*". My bad :(

> this can happen, so I believe application are (have to be) somehow
> prepared to this.
>
>> C.
>>
>
>
Re: [PATCH 11/13] Changes to show virtual ids to user [message #13634 is a reply to message #13627] Thu, 31 May 2007 11:26 Go to previous messageGo to next message
ebiederm is currently offline  ebiederm
Messages: 1354
Registered: February 2006
Senior Member
Pavel Emelianov <xemul@openvz.org> writes:

> Pavel Emelianov wrote:
>> Cedric Le Goater wrote:
>>> Hello !
>>>
>>>>>> The worst case I can see with pid == 0. Is that it would be a bug
>>>>>> that we can fix later. For other cases it would seem to be a user
>>>>>> space API thing that we get stuck with for all time.
>>>>> We cannot trust userspace application to expect some pid other than
>>>>> positive. All that we can is either use some always-absent pid or
>>>>> send the signal as SI_KERNEL.
>>>>>
>>>>> Our experience show that making decisions like above causes random
>>> <>> applications failures that are hard (or even impossible) to debug.
>>>
>>>> Ok. So I guess I see what you are proposing is picking an arbitrary
>>>> pid, say pid == 2, and reserving that in all pid namespaces and using
>>>> it when we have a pid that does not map to a specific namespace. I'm
>>>> fine with that.
>>>>
>>>> All I care about is that we have a solution, preferably simple,
>>>> to the non-mapped pid problem.
>>> Pavel, are you against using pid == 0 and setting si_code to SI_KERNEL ?
>>
>> I think I am. A quick grep through the code revealed one place where
>
> Sorry. I have misprinted. I meant "I think I am *NOT*". My bad :(
>
>> this can happen, so I believe application are (have to be) somehow
>> prepared to this.

Where was this. I'd like to follow your complete line of thinking.

Eric
Re: [PATCH 11/13] Changes to show virtual ids to user [message #13635 is a reply to message #13634] Thu, 31 May 2007 11:46 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Eric W. Biederman wrote:
> Pavel Emelianov <xemul@openvz.org> writes:
>
>> Pavel Emelianov wrote:
>>> Cedric Le Goater wrote:
>>>> Hello !
>>>>
>>>>>>> The worst case I can see with pid == 0. Is that it would be a bug
>>>>>>> that we can fix later. For other cases it would seem to be a user
>>>>>>> space API thing that we get stuck with for all time.
>>>>>> We cannot trust userspace application to expect some pid other than
>>>>>> positive. All that we can is either use some always-absent pid or
>>>>>> send the signal as SI_KERNEL.
>>>>>>
>>>>>> Our experience show that making decisions like above causes random
>>>> <>> applications failures that are hard (or even impossible) to debug.
>>>>
>>>>> Ok. So I guess I see what you are proposing is picking an arbitrary
>>>>> pid, say pid == 2, and reserving that in all pid namespaces and using
>>>>> it when we have a pid that does not map to a specific namespace. I'm
>>>>> fine with that.
>>>>>
>>>>> All I care about is that we have a solution, preferably simple,
>>>>> to the non-mapped pid problem.
>>>> Pavel, are you against using pid == 0 and setting si_code to SI_KERNEL ?
>>> I think I am. A quick grep through the code revealed one place where
>> Sorry. I have misprinted. I meant "I think I am *NOT*". My bad :(
>>
>>> this can happen, so I believe application are (have to be) somehow
>>> prepared to this.
>
> Where was this. I'd like to follow your complete line of thinking.

The line concerning why I think that sending a signal from
SI_KERNEL is good solution?

> Eric
>
Re: [PATCH 11/13] Changes to show virtual ids to user [message #13636 is a reply to message #13635] Thu, 31 May 2007 13:41 Go to previous message
ebiederm is currently offline  ebiederm
Messages: 1354
Registered: February 2006
Senior Member
Pavel Emelianov <xemul@openvz.org> writes:

> Eric W. Biederman wrote:
>> Pavel Emelianov <xemul@openvz.org> writes:
>>
>>> Pavel Emelianov wrote:
>>>> Cedric Le Goater wrote:
>>>>> Hello !
>>>>>
>>>>>>>> The worst case I can see with pid == 0. Is that it would be a bug
>>>>>>>> that we can fix later. For other cases it would seem to be a user
>>>>>>>> space API thing that we get stuck with for all time.
>>>>>>> We cannot trust userspace application to expect some pid other than
>>>>>>> positive. All that we can is either use some always-absent pid or
>>>>>>> send the signal as SI_KERNEL.
>>>>>>>
>>>>>>> Our experience show that making decisions like above causes random
>>>>> <>> applications failures that are hard (or even impossible) to debug.
>>>>>
>>>>>> Ok. So I guess I see what you are proposing is picking an arbitrary
>>>>>> pid, say pid == 2, and reserving that in all pid namespaces and using
>>>>>> it when we have a pid that does not map to a specific namespace. I'm
>>>>>> fine with that.
>>>>>>
>>>>>> All I care about is that we have a solution, preferably simple,
>>>>>> to the non-mapped pid problem.
>>>>> Pavel, are you against using pid == 0 and setting si_code to SI_KERNEL ?
>>>> I think I am. A quick grep through the code revealed one place where
>>> Sorry. I have misprinted. I meant "I think I am *NOT*". My bad :(
>>>
>>>> this can happen, so I believe application are (have to be) somehow
>>>> prepared to this.
>>
>> Where was this. I'd like to follow your complete line of thinking.
>
> The line concerning why I think that sending a signal from
> SI_KERNEL is good solution?

Let me just restate everything to be certain we are not getting
confused.

The problem was what to do with signals from unmmaped pids.

You have just said pid == 0 with SI_KERNEL seems to work.

The kernel occasionally sends signal that way already.

The primary argument against this in my memory was that we
a user space application might treat the kernel case special
(more trust), so it might be a bad idea.

I believe what you just said was that user space has to be ready
to handle signals from pid == 0 with SI_KERNEL set. Therefore this
should just work. I don't think you have addressed the levels of
trust in user space issue or I might be confused.

Eric
Re: [PATCH 5/13] Expand the pid/task seeking functions set [message #18644 is a reply to message #13289] Thu, 24 May 2007 17:11 Go to previous message
Dave Hansen is currently offline  Dave Hansen
Messages: 240
Registered: October 2005
Senior Member
On Thu, 2007-05-24 at 16:50 +0400, Pavel Emelianov wrote:
> 
> +struct pid * fastcall __find_vpid(int nr, struct pid_namespace *ns)
> +{
> +#ifdef CONFIG_PID_NS
> +       struct hlist_node *elem;
> +       struct pid *pid;
> +#endif
> +
> +       if (ns == &init_pid_ns)
> +               return find_pid(nr);
> +
> +#ifdef CONFIG_PID_NS
> +       hlist_for_each_entry_rcu(pid, elem,
> +                       &vpid_hash[vpid_hashfn(nr, ns)], vpid_chain) {
> +               if (pid->vnr == nr && pid->ns == ns)
> +                       return pid;
> +       }
> +#endif
> +       return NULL;
> +} 

I am a bit worried that there are too many #ifdefs here.  Your patch
series adds ~20 of them, and they look to me to be mostly in .c files.
Section 2 in SubmittingPatches somewhat discourages this.

Do you have any plans for cleaning these up?

-- Dave

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [PATCH 0/13] Pid namespaces (OpenVZ view) [message #18659 is a reply to message #13327] Thu, 24 May 2007 16:41 Go to previous message
Daniel Lezcano is currently offline  Daniel Lezcano
Messages: 417
Registered: June 2006
Senior Member
Pavel Emelianov wrote:
> Eric W. Biederman wrote:
>   
>> Pavel Emelianov <xemul@openvz.org> writes:
>>
>>     
>>> That's how OpenVZ sees the pid namespaces.
>>>
>>> The main idea is that kernel keeps operating with tasks pid
>>> as it did before, but each task obtains one more pid for each
>>> pid type - the virtual pid. When putting the pid to user or
>>> getting the pid from it kernel operates with the virtual ones.
>>>       
>> Just a quick reaction. 
>>
>> - I would very much like to see a minimum of 3 levels of pids,
>>     
>
> Why not 4? From my part, I would like to know, why such nesting
> is important. We have plain IPC namespaces and nobody cares.
> We will have isolated network namespaces, why pids are exception?
>   
Pavel,

I am taking advantage to the opportunity to ask you if you plan to send 
a new network namespace patchset ?

  -- Daniel
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [PATCH 0/13] Pid namespaces (OpenVZ view) [message #18665 is a reply to message #13284] Fri, 25 May 2007 08:30 Go to previous message
Daniel Lezcano is currently offline  Daniel Lezcano
Messages: 417
Registered: June 2006
Senior Member
Pavel Emelianov wrote:
> Daniel Lezcano wrote:
>   
>> Pavel Emelianov wrote:
>>     
>>> Eric W. Biederman wrote:
>>>  
>>>       
>>>> Pavel Emelianov <xemul@openvz.org> writes:
>>>>
>>>>    
>>>>         
>>>>> That's how OpenVZ sees the pid namespaces.
>>>>>
>>>>> The main idea is that kernel keeps operating with tasks pid
>>>>> as it did before, but each task obtains one more pid for each
>>>>> pid type - the virtual pid. When putting the pid to user or
>>>>> getting the pid from it kernel operates with the virtual ones.
>>>>>       
>>>>>           
>>>> Just a quick reaction.
>>>> - I would very much like to see a minimum of 3 levels of pids,
>>>>     
>>>>         
>>> Why not 4? From my part, I would like to know, why such nesting
>>> is important. We have plain IPC namespaces and nobody cares.
>>> We will have isolated network namespaces, why pids are exception?
>>>   
>>>       
>> Pavel,
>>
>> I am taking advantage to the opportunity to ask you if you plan to send
>> a new network namespace patchset ?
>>     
>
> Unfortunately no :( Right now we're focusing on pids and
> resource management.
>   
Yep, a big part :)

Did you, OpenVZ guys, had time to look at Eric's patchset ?


_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [PATCH 5/13] Expand the pid/task seeking functions set [message #18668 is a reply to message #18644] Fri, 25 May 2007 07:08 Go to previous message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Dave Hansen wrote:
> On Thu, 2007-05-24 at 16:50 +0400, Pavel Emelianov wrote:
>> +struct pid * fastcall __find_vpid(int nr, struct pid_namespace *ns)
>> +{
>> +#ifdef CONFIG_PID_NS
>> +       struct hlist_node *elem;
>> +       struct pid *pid;
>> +#endif
>> +
>> +       if (ns == &init_pid_ns)
>> +               return find_pid(nr);
>> +
>> +#ifdef CONFIG_PID_NS
>> +       hlist_for_each_entry_rcu(pid, elem,
>> +                       &vpid_hash[vpid_hashfn(nr, ns)], vpid_chain) {
>> +               if (pid->vnr == nr && pid->ns == ns)
>> +                       return pid;
>> +       }
>> +#endif
>> +       return NULL;
>> +} 
> 
> I am a bit worried that there are too many #ifdefs here.  Your patch
> series adds ~20 of them, and they look to me to be mostly in .c files.
> Section 2 in SubmittingPatches somewhat discourages this.
> 
> Do you have any plans for cleaning these up?

Sure I have. But this approach makes review simpler - everyone
explicitly see what exact actions are taken in each place. In
the second iteration this will be make in a more elegant way
like making static inline stubs etc.

This set is a kind of RFC and proof-of-concept. I didn't intent
this to be merged to any tree as is. That's why a attached the
lats patch with strut in proc to observe the whole tree.

BTW, question to Sukadev - how did you test your patches? I do
know that ps utility doesn't work without full /proc tree and
I don's see similar hacks in your patchset.

> -- Dave
> 
> 

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [PATCH 0/13] Pid namespaces (OpenVZ view) [message #18669 is a reply to message #18659] Fri, 25 May 2007 07:26 Go to previous message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Daniel Lezcano wrote:
> Pavel Emelianov wrote:
>> Eric W. Biederman wrote:
>>  
>>> Pavel Emelianov <xemul@openvz.org> writes:
>>>
>>>    
>>>> That's how OpenVZ sees the pid namespaces.
>>>>
>>>> The main idea is that kernel keeps operating with tasks pid
>>>> as it did before, but each task obtains one more pid for each
>>>> pid type - the virtual pid. When putting the pid to user or
>>>> getting the pid from it kernel operates with the virtual ones.
>>>>       
>>> Just a quick reaction.
>>> - I would very much like to see a minimum of 3 levels of pids,
>>>     
>>
>> Why not 4? From my part, I would like to know, why such nesting
>> is important. We have plain IPC namespaces and nobody cares.
>> We will have isolated network namespaces, why pids are exception?
>>   
> Pavel,
> 
> I am taking advantage to the opportunity to ask you if you plan to send
> a new network namespace patchset ?

Unfortunately no :( Right now we're focusing on pids and
resource management.

>  -- Daniel
> 

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [PATCH 5/13] Expand the pid/task seeking functions set [message #18683 is a reply to message #18668] Fri, 25 May 2007 23:36 Go to previous message
Sukadev Bhattiprolu is currently offline  Sukadev Bhattiprolu
Messages: 413
Registered: August 2006
Senior Member
Pavel Emelianov [xemul@openvz.org] wrote:
| Dave Hansen wrote:
| > On Thu, 2007-05-24 at 16:50 +0400, Pavel Emelianov wrote:
| >> +struct pid * fastcall __find_vpid(int nr, struct pid_namespace *ns)
| >> +{
| >> +#ifdef CONFIG_PID_NS
| >> +       struct hlist_node *elem;
| >> +       struct pid *pid;
| >> +#endif
| >> +
| >> +       if (ns == &init_pid_ns)
| >> +               return find_pid(nr);
| >> +
| >> +#ifdef CONFIG_PID_NS
| >> +       hlist_for_each_entry_rcu(pid, elem,
| >> +                       &vpid_hash[vpid_hashfn(nr, ns)], vpid_chain) {
| >> +               if (pid->vnr == nr && pid->ns == ns)
| >> +                       return pid;
| >> +       }
| >> +#endif
| >> +       return NULL;
| >> +} 
| > 
| > I am a bit worried that there are too many #ifdefs here.  Your patch
| > series adds ~20 of them, and they look to me to be mostly in .c files.
| > Section 2 in SubmittingPatches somewhat discourages this.
| > 
| > Do you have any plans for cleaning these up?
| 
| Sure I have. But this approach makes review simpler - everyone
| explicitly see what exact actions are taken in each place. In
| the second iteration this will be make in a more elegant way
| like making static inline stubs etc.
| 
| This set is a kind of RFC and proof-of-concept. I didn't intent
| this to be merged to any tree as is. That's why a attached the
| lats patch with strut in proc to observe the whole tree.
| 
| BTW, question to Sukadev - how did you test your patches? I do
| know that ps utility doesn't work without full /proc tree and
| I don's see similar hacks in your patchset.

Patches (#13 and #14) in my patchset allow remounting /proc in
a child namespace. So the script (lxc-wrap.sh Patch-0) remounts
/proc when it enters the new namespace. "ps -e" in the namespace
only shows processes from that namespace.

"ps -e" in init pid ns shows processes in child namespaces with
numeric pids (pid_t) from init pid ns.


| 
| > -- Dave
| > 
| > 
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Previous Topic: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS
Next Topic: [RFC][PATCH 0/16] Enable cloning of pid namespace
Goto Forum:
  


Current Time: Sat Dec 02 14:32:37 GMT 2023

Total time taken to generate the page: 0.02308 seconds