OpenVZ Forum


Home » Mailing lists » Devel » [RFC][PATCH 0/15] Pid namespaces
Re: [PATCH 11/15] Signal semantics [message #15379 is a reply to message #15305] Sun, 29 July 2007 11:23 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/27, sukadev@us.ibm.com wrote:
>
> Pavel Emelianov [xemul@openvz.org] wrote:
> | Oleg Nesterov wrote:
> | >>
> | >>@@ -1852,7 +1950,7 @@ relock:
> | >> * within that pid space. It can of course get signals from
> | >> * its parent pid space.
> | >> */
> | >>- if (current == task_child_reaper(current))
> | >>+ if (kinfo.flags & KERN_SIGINFO_CINIT)
> | >> continue;
> | >
> | >I think the whole idea is broken, it assumes the sender put something into
> | >"struct sigqueue".
> |
> | Yup. That's the problem. It seems to me that the only way to handle init's
> | signals is to check for permissions in the sending path.
>
> We can check permissions in the sending path - and in fact we do check for
> SIGKILL case (deny_signal_to_container_init() below).
>
> But the receiver knows/decides whether or not the signal is wanted/not. No ?

I can't understand your question. Yes, this is what we are doing currently,
but this is broken by this patch.

> Are you saying we should check/special case all fatal signals ?
>
> |
> | >Suppose that /sbin/init has no handler for (say) SIGTERM, and we send this
> | >signal from the same namespace. send_signal() sets SIGQUEUE_CINIT, but it
> | >is lost because __group_complete_signal() silently "converts" sig_fatal()
> | >signals to SIGKILL using sigaddset().
>
> Yes, I should have called it out, but this patch currently assumes /sbin/init
> (or container-init) has a handler for the fatal signals like SIGTERM

Changelog says nothing about that. And in that case we don't need any complications
except a) deny_signal_to_container_init() (should be named deny_SIGKILL_to_container_init)
and b) "cross-namespace signals must have si_code == SI_KERNEL".

I don't know whether this limitation (/sbin/init must install the handler
for each fatal signal) acceptable or not.

However, we should also take care about sig_kernel_stop() signals, and please
note that it is not possible to install a handler for SIGSTOP.

> | >>+static void encode_sender_info(struct task_struct *t, struct sigqueue *q)
> | >>+{
> | >>+ if (pid_ns_equal(t)) {
> | >>+ if (is_container_init(t)) {
> | >>+ q->flags |= SIGQUEUE_CINIT;
> | >
> | >Ironically, this change carefully preserves the bug we already have :)
> | >
> | >This doesn't protect init from "bad" signal if we send it to sub-thread
> | >of init. Actually, this make the behaviour a bit worse compared to what
> | >we currently have. Currently, at least the main init's thread survives
> | >if we send SIGKILL to sub-thread.
>
> Do you mean "init's main thread" ?

Yes.

> But doesn't SIGKILL to any thread kill
> the entire process ?

It should, but it doesn't if it was sent to init's sub-thread, exactly
because of child_reaper() check in get_signal_to_deliver().

> | >>+ error = deny_signal_to_container_init(t, sig);
> | >>+ if (error)
> | >>+ return error;
> | >
> | >Hm. Could you explain this change? Why do we need a special check for
> | >SIGKILL?
>
> As you pointed out above, SIGKILL goes through the __group_complete_signal()/
> sigaddset() path and bypasses/loses the KERN_SIGINFO_CINIT flag. Other
> sig_fatal() signals take this path too, but we assume for now, container-init
> has a handler.

No, SIGKILL doesn't bypasses send_signal(). IOW, if other parts of this patch
were correct, we don't need this change. If init has a handler, we don't neeed
other parts.

> | >(What about ptrace_attach() btw? If it is possible to send a signal to init
> | > from the "parent" namespace, perhaps it makes sense to allow ptracing as
> | > well).
> |
> | ptracing of tasks fro different namespaces is not possible at all, since
> | strace utility determines the fork()-ed child pid from the parent's eax
> | register, which would contain the pid value as this parent sees his child.
> | But if the strace is in different namespace - it won't be able to find
> | this child with the pid value from parent's eax.
> |
> | Maybe it's worth disabling cross-namespaces ptracing...
>
> I think so too. Its probably not a serious limitation ?

My question was not clear, sorry. And I was confused because I had a false
impression that ptrace_attach() was already changed to use is_container_init().

Afaics, the cross-namespaces ptracing should work (modulo fork() problem
pointed out by Pavel), and probably it is useful.

But we should fix ptrace_attach(), it should not be possible to do PTRACE_ATTACH
to /sbin/init from the _same_ namespace.

Oleg.
Re: [PATCH 3/15] kern_siginfo helper [message #15381 is a reply to message #15297] Sun, 29 July 2007 11:40 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/26, Pavel Emelyanov wrote:
>
> TODO: This is more an exploratory patch and modifies only interfaces
> necessary to implement correct signal semantics in pid namespaces.
>
> If the approach is feasible, we could consistently use 'kern_siginfo'
> in other signal interfaces and possibly in 'struct sigqueue'.
>
> We could modify dequeue_signal() to directly work with 'kern_siginfo'
> and remove dequeue_signal_kern_info().

Well... I know, it is very easy to blame somebody else's patch, and probably
my taste is not good...

But honestly, I personally think this approach is a horror, and any alternative
is better :)

I'd rather change dequeue_signal() so that it takes "struct sigqueue *"
parameter instead of "siginfo_t *", or add a new "int *flags".

OK, this doesn't work anyway, we should do something different. Perhaps
just do all checks on sender's side.

It is a bit strange that this patch is 3/15, and the rest bits in 11/15,
not very convenient for the review.

Oleg.
Re: [PATCH 7/15] Helpers to obtain pid numbers [message #15382 is a reply to message #15301] Sun, 29 July 2007 12:08 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/26, Pavel Emelyanov wrote:
>
> --- linux-2.6.23-rc1-mm1.orig/include/linux/pid.h 2007-07-26 16:34:45.000000000 +0400
> +++ linux-2.6.23-rc1-mm1-7/include/linux/pid.h 2007-07-26 16:36:37.000000000 +0400
> @@ -83,12 +92,34 @@ extern void FASTCALL(detach_pid(struct t
>
> extern struct pid *alloc_pid(struct pid_namespace *ns);
> extern void FASTCALL(free_pid(struct pid *pid));
> +
> +/*
> + * the helpers to get the pid's id seen from different namespaces
> + *
> + * pid_nr() : global id, i.e. the id seen from the init namespace;

This looks a bit strange to me, but perhaps this is just matter of taste.
I think pid_nr(pid) should be pid_nr_ns(pid, current->nsproxy->pid_ns),
this is imho much closer to the current meaning. I won't persist though.

> +pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
> +{
> + pid_t nr = 0;
> + if (pid && ns->level <= pid->level)
> + nr = pid->numbers[ns->level].nr;
> + return nr;
> +}

I am not sure I understand the "ns->level <= pid->level" check. Isn't it
a bug to use a "wrong" namespace here? In that case BUG_ON() looks better.

If ns could be wrong, "ns->level <= pid->level" is not enough, we should
also check "pid->numbers[ns->level].ns == ns", no?

Oleg.
Re: [PATCH 8/15] Helpers to find the task by its numerical ids [message #15383 is a reply to message #15302] Sun, 29 July 2007 12:39 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/26, Pavel Emelyanov wrote:
>
> +#define find_pid(pid) find_pid_ns(pid, &init_pid_ns)

Again, I think find_pid() should use current's active ns, not
init_pid_ns. Just grep for find_pid/find_task_by_pid.

> --- linux-2.6.23-rc1-mm1.orig/kernel/pid.c 2007-07-26 16:34:45.000000000 +0400
> +++ linux-2.6.23-rc1-mm1-7/kernel/pid.c 2007-07-26 16:36:37.000000000 +0400
> @@ -204,19 +221,20 @@ static void delayed_put_pid(struct rcu_h
> goto out;
> }
>
> -struct pid * fastcall find_pid(int nr)
> +struct pid * fastcall find_pid_ns(int nr, struct pid_namespace *ns)
> {
> struct hlist_node *elem;
> - struct pid *pid;
> + struct upid *pnr;
> +
> + hlist_for_each_entry_rcu(pnr, elem,
> + &pid_hash[pid_hashfn(nr, ns)], pid_chain)
> + if (pnr->nr == nr && pnr->ns == ns)
^^^^^^^^^^^^^
Aha, that is why we need upid->ns.

I am a bit surprised we don't move the global pid_hash into the
"struct pid_namespace", this could speedup the search, and we
don't need upid->ns.

> -struct pid *find_ge_pid(int nr)
> +struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
> {
> struct pid *pid;
>
> do {
> - pid = find_pid(nr);
> + pid = find_pid_ns(nr, ns);
> if (pid)
> break;
> - nr = next_pidmap(task_active_pid_ns(current), nr);
> + nr = next_pidmap(ns, nr);
> } while (nr > 0);
>
> return pid;

This means we should fix the caller, next_tgid(), but this is done
in 15/15.

Oleg.
Re: [PATCH 15/15] Hooks over the code to show correct values to user [message #15384 is a reply to message #15309] Sun, 29 July 2007 14:29 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/26, Pavel Emelyanov wrote:
>
> int
> kill_proc(pid_t pid, int sig, int priv)
> {
> - return kill_proc_info(sig, __si_special(priv), pid);
> + int ret;
> +
> + rcu_read_lock();
> + ret = kill_pid_info(sig, __si_special(priv), find_pid(pid));
> + rcu_read_unlock();
> + return ret;
> }

I think this is wrong. kill_proc() should behave the same as kill_proc_info(),
so this change is not needed. With this patch they use different namespaces
to find the task, this is not consistent.

(sadly, this patch is huge, very difficult to review).

Oleg.
Re: [PATCH 10/15] Make each namespace has its own proc tree [message #15385 is a reply to message #15304] Sun, 29 July 2007 15:56 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/26, Pavel Emelyanov wrote:
>
> static int proc_get_sb(struct file_system_type *fs_type,
> int flags, const char *dev_name, void *data, struct vfsmount *mnt)
> {
> [...snip...]
>
> + if (!sb->s_root) {
> + sb->s_flags = flags;
> + err = proc_fill_super(sb);
> + if (err) {
> + up_write(&sb->s_umount);
> + deactivate_super(sb);
> + return err;
> + }
> +
> + ei = PROC_I(sb->s_root->d_inode);
> + if (!ei->pid) {
> + rcu_read_lock();
> + ei->pid = get_pid(find_pid_ns(1, ns));

Where do we "put" this pid?

> void release_task(struct task_struct * p)
> {
> + struct pid *pid;
> struct task_struct *leader;
> int zap_leader;
> repeat:
> @@ -160,6 +161,20 @@ repeat:
> write_lock_irq(&tasklist_lock);
> ptrace_unlink(p);
> BUG_ON(!list_empty(&p->ptrace_list) ||
> !list_empty(&p->ptrace_children));
> + /*
> + * we have to keep this pid till proc_flush_task() to make
> + * it possible to flush all dentries holding it. pid will
> + * be put ibidem
> + *
> + * however if the pid belogs to init namespace only, we can
> + * optimize this out
> + */
> + pid = task_pid(p);
> + if (pid->level != 0)
> + get_pid(pid);
> + else
> + pid = NULL;
> +
> __exit_signal(p);
>
> /*
> @@ -184,7 +199,8 @@ repeat:
> }
>
> write_unlock_irq(&tasklist_lock);
> - proc_flush_task(p, NULL);
> + proc_flush_task(p, pid);
> + put_pid(pid);

Oh, this is not nice...

Look, proc_flush_task() doesn't need pid, it can use active pid_ns
to iterate the ->parent chain and do proc_flush_task_mnt(), and it
can check "ns->child_reaper == current" for mntput(), yes?

So, can't we do instead

release_task()
{
struct pid_namespace *my_ns = ...; // get it from pid;

...

write_unlock_irq(&tasklist_lock);

// __exit_signal()->...->put_pid() drops the reference,
// but this ns should still be valid because /proc itself
// holds a reference to it, even if we are /sbin/init.
proc_flush_task(p, my_ns);
...
}

No?

Oleg.
Re: [PATCH 10/15] Make each namespace has its own proc tree [message #15387 is a reply to message #15385] Sun, 29 July 2007 17:02 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/29, Oleg Nesterov wrote:
>
> Look, proc_flush_task() doesn't need pid, it can use active pid_ns

Oops. Yes, proc_flush_task() doesn't use ->numbers[].nr, but it should?

proc_flush_task_mnt() uses task->pid to flush /proc/$pid, this looks wrong,
we should use ->number[ns->level].nr, no?

Oleg.
Re: [PATCH 5/15] Introduce struct upid [message #15398 is a reply to message #15374] Mon, 30 July 2007 05:58 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> On 07/26, Pavel Emelyanov wrote:
>> --- linux-2.6.23-rc1-mm1.orig/include/linux/pid.h 2007-07-26 16:34:45.000000000 +0400
>> +++ linux-2.6.23-rc1-mm1-7/include/linux/pid.h 2007-07-26 16:36:37.000000000 +0400
>> @@ -40,15 +40,21 @@ enum pid_type
>> * processes.
>> */
>>
>> -struct pid
>> -{
>> - atomic_t count;
>> +struct upid {
>> /* Try to keep pid_chain in the same cacheline as nr for find_pid */
>> int nr;
>> + struct pid_namespace *ns;
>> struct hlist_node pid_chain;
>> +};
>> +
>> +struct pid
>> +{
>> + atomic_t count;
>> /* lists of tasks that use this pid */
>> struct hlist_head tasks[PIDTYPE_MAX];
>> struct rcu_head rcu;
>> + int level;
>> + struct upid numbers[1];
>> };
>
> Well. Definitely, the kernel can't be compiled with this patch applied,
> this seems to be against the rules...

Yes. U forgot to mention, that this patchset is git-bisect-not-safe :)
I sent the safe split earlier, but it was harder to make and understand,
so I decided not to waste the time and sent a badly-split set just to get
comments about the approach. The ways a big patch is split wouldn't affect
the comments about the ideas, bugs, etc.

> So. The task has a single (PIDTYPE_MAX) pid no matter how many namespaces
> can see it, and "struct pid" has an array of numbers for each namespace.
>
> Still I can't understand why do we need upid->ns, can't we kill it?
> Suppose we add "struct pid_namespace *parent_ns" to "struct pid_namespace",
> init_pid_ns.parent_ns == NULL.

We already have it :)

> Now,
>
> struct upid {
> int nr;
> struct hlist_node pid_chain;
> };
>
> struct pid
> {
> atomic_t count;
> struct hlist_head tasks[PIDTYPE_MAX];
> struct rcu_head rcu;
> struct pid_namespace *active_ns;
> struct upid numbers[0];
> };
>
> We populate pid->numbers in "reverse" order, so that pid->numbers[0] lives
> in pid->active_ns.
>
> Now, for example,
>
> void free_pid(struct pid *pid)
> {
> struct pid_namespace *ns;
> unsigned long flags;
> int i;
>
> spin_lock_irqsave(&pidmap_lock, flags);
> for (i = 0, ns = pid->active_ns; ns; i++, ns = ns->parent_ns)
> hlist_del_rcu(&pid->numbers[i].pid_chain);
> spin_unlock_irqrestore(&pidmap_lock, flags);
>
> for (i = 0, ns = pid->active_ns; ns; i++, ns = ns->parent_ns)
> free_pidmap(ns, pid->numbers[i].nr);
>
> call_rcu(&pid->rcu, delayed_put_pid);
> }
>
> Possible?

Possible, but how will
struct pid *find_pid_nr_ns(int nr, struct pid_namespace *ns);
look then? The only way (I see) is to make

hlist_for_each_entry (upid, ...)
if (upid->nr == nr && upid->ns == ns)
return container_of(upid, struct pid, ...)

> Oleg.

Thanks,
Pavel
Re: [PATCH 6/15] Make alloc_pid(), free_pid() and put_pid() work with struct upid [message #15399 is a reply to message #15376] Mon, 30 July 2007 06:03 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> On 07/26, Pavel Emelyanov wrote:
>> -struct pid *alloc_pid(void)
>> +struct pid *alloc_pid(struct pid_namespace *ns)
>
> Why? We have the only caller, copy_process(), ns == task_active_pid_ns()
> always.

task_active_pid_ns() by newly created task, not the current! That's why
we need to pass something to alloc_pid() to find this new namespace.
Task or namespace itself - is the matter of choice - I selected the
most obvious argument :)

>> {
>> struct pid *pid;
>> enum pid_type type;
>> - int nr = -1;
>> - struct pid_namespace *ns;
>> + int i, nr;
>> + struct pid_namespace *tmp;
>>
>> - ns = task_active_pid_ns(current);
>> pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
>> if (!pid)
>> goto out;
>>
>> - nr = alloc_pidmap(ns);
>> - if (nr < 0)
>> - goto out_free;
>> + tmp = ns;
>> + for (i = ns->level; i >= 0; i--) {
>> + nr = alloc_pidmap(tmp);
>> + if (nr < 0)
>> + goto out_free;
>> +
>> + pid->numbers[i].nr = nr;
>> + pid->numbers[i].ns = tmp;
>> + tmp = tmp->parent;
>
> Hm... There is no ->parent in "struct pid_namespace", and this
> patch doesn't add it.

Parent is added in another patch - 12/15. I will split it better
when sending to Andrew - patches will be smaller and bisect-safe.

>> + if (ns != &init_pid_ns)
>> + get_pid_ns(ns);
>
> Q: put_pid() checks "ns != &init_pid_ns" as well, this is just
> an optimization, yes? Perhaps we can move this check into

It is :)

> get_pid_ns/put_pid_ns.

I think you're right.

> We are doing get_pid_ns() only for the "top namespace"... I guess
> this can work if pid_namespace does get_pid_ns() on its ->parent.
> This patch looks incomplete.

Yes. This set is not well split, sorry. I wanted to get comments about the
approach, bugs, etc (I have already mentioned this in another letter...)

> Oleg.
>
>

Thanks,
Pavel.
Re: [PATCH 3/15] kern_siginfo helper [message #15400 is a reply to message #15381] Mon, 30 July 2007 06:07 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> On 07/26, Pavel Emelyanov wrote:
>> TODO: This is more an exploratory patch and modifies only interfaces
>> necessary to implement correct signal semantics in pid namespaces.
>>
>> If the approach is feasible, we could consistently use 'kern_siginfo'
>> in other signal interfaces and possibly in 'struct sigqueue'.
>>
>> We could modify dequeue_signal() to directly work with 'kern_siginfo'
>> and remove dequeue_signal_kern_info().
>
> Well... I know, it is very easy to blame somebody else's patch, and probably
> my taste is not good...
>
> But honestly, I personally think this approach is a horror, and any alternative
> is better :)
>
> I'd rather change dequeue_signal() so that it takes "struct sigqueue *"
> parameter instead of "siginfo_t *", or add a new "int *flags".
>
> OK, this doesn't work anyway, we should do something different. Perhaps
> just do all checks on sender's side.

Yes. Signal handling in namespaces turned out to be the most complicated
part of the set. I start thinking to drop this part till we have the "core"
in -mm tree. Suka, what do you think?

> It is a bit strange that this patch is 3/15, and the rest bits in 11/15,
> not very convenient for the review.

Well, I thought that a split like
1. patches for kernel to prepare it for the set
2. the set itself
is the best to review. Maybe I was wrong, but how to make this then?
E.g. I have a MS_KERNOUNT patch, but its changes are used *much* later.

> Oleg.
>
>

Thanks,
Pavel
Re: [PATCH 7/15] Helpers to obtain pid numbers [message #15401 is a reply to message #15382] Mon, 30 July 2007 06:11 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> On 07/26, Pavel Emelyanov wrote:
>> --- linux-2.6.23-rc1-mm1.orig/include/linux/pid.h 2007-07-26 16:34:45.000000000 +0400
>> +++ linux-2.6.23-rc1-mm1-7/include/linux/pid.h 2007-07-26 16:36:37.000000000 +0400
>> @@ -83,12 +92,34 @@ extern void FASTCALL(detach_pid(struct t
>>
>> extern struct pid *alloc_pid(struct pid_namespace *ns);
>> extern void FASTCALL(free_pid(struct pid *pid));
>> +
>> +/*
>> + * the helpers to get the pid's id seen from different namespaces
>> + *
>> + * pid_nr() : global id, i.e. the id seen from the init namespace;
>
> This looks a bit strange to me, but perhaps this is just matter of taste.
> I think pid_nr(pid) should be pid_nr_ns(pid, current->nsproxy->pid_ns),
> this is imho much closer to the current meaning. I won't persist though.

pid_nr, find_task_by_pid and all other stuff, that existed in the kernel
before the set are intended to work with global ids only (just as it was
before the set).

>> +pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
>> +{
>> + pid_t nr = 0;
>> + if (pid && ns->level <= pid->level)
>> + nr = pid->numbers[ns->level].nr;
>> + return nr;
>> +}
>
> I am not sure I understand the "ns->level <= pid->level" check. Isn't it
> a bug to use a "wrong" namespace here? In that case BUG_ON() looks better.

Yes, that's a check for bad namespace passed.

> If ns could be wrong, "ns->level <= pid->level" is not enough, we should
> also check "pid->numbers[ns->level].ns == ns", no?

Yes, we should, and you're right in that we must have BUG_ON() here.

> Oleg.
>
>

Thanks,
Pavel
Re: [PATCH 8/15] Helpers to find the task by its numerical ids [message #15402 is a reply to message #15383] Mon, 30 July 2007 06:15 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> On 07/26, Pavel Emelyanov wrote:
>> +#define find_pid(pid) find_pid_ns(pid, &init_pid_ns)
>
> Again, I think find_pid() should use current's active ns, not
> init_pid_ns. Just grep for find_pid/find_task_by_pid.
>
>> --- linux-2.6.23-rc1-mm1.orig/kernel/pid.c 2007-07-26 16:34:45.000000000 +0400
>> +++ linux-2.6.23-rc1-mm1-7/kernel/pid.c 2007-07-26 16:36:37.000000000 +0400
>> @@ -204,19 +221,20 @@ static void delayed_put_pid(struct rcu_h
>> goto out;
>> }
>>
>> -struct pid * fastcall find_pid(int nr)
>> +struct pid * fastcall find_pid_ns(int nr, struct pid_namespace *ns)
>> {
>> struct hlist_node *elem;
>> - struct pid *pid;
>> + struct upid *pnr;
>> +
>> + hlist_for_each_entry_rcu(pnr, elem,
>> + &pid_hash[pid_hashfn(nr, ns)], pid_chain)
>> + if (pnr->nr == nr && pnr->ns == ns)
> ^^^^^^^^^^^^^
> Aha, that is why we need upid->ns.

That's it :)

> I am a bit surprised we don't move the global pid_hash into the
> "struct pid_namespace", this could speedup the search, and we
> don't need upid->ns.

Hm... Worth thinking about, but this hash itself is large enough and
its size depends on the node's number of pages, so we'll have

1. either to make per-namespace hash (much) smaller;
2. or to give (too) many memory for it.

>> -struct pid *find_ge_pid(int nr)
>> +struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
>> {
>> struct pid *pid;
>>
>> do {
>> - pid = find_pid(nr);
>> + pid = find_pid_ns(nr, ns);
>> if (pid)
>> break;
>> - nr = next_pidmap(task_active_pid_ns(current), nr);
>> + nr = next_pidmap(ns, nr);
>> } while (nr > 0);
>>
>> return pid;
>
> This means we should fix the caller, next_tgid(), but this is done
> in 15/15.

Sorry :)

> Oleg.
>
>

Thank,
Pavel
Re: [PATCH 9/15] Move alloc_pid() after the namespace is cloned [message #15403 is a reply to message #15355] Mon, 30 July 2007 06:17 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> On 07/26, Pavel Emelyanov wrote:
>> This is a fix for Sukadev's patch that moved the alloc_pid() call from
>> do_fork() into copy_process().
>
> ... and this patch changes almost every line from Sukadev's patch.

It does. My bad :( I have reviewed Suka's patch badly and was sure it
puts the alloc_pid() right where we need this.

> Sorry gents, but isn't it better to ask Andrew to drop that patch
> (which is quite useless by itself), and send a new one which incorporates
> all necessary changes? Imho, it would be much easier to understand.

Hm... Maybe it's better to ask him to fold these patches together?

>> @@ -1406,7 +1422,13 @@ long do_fork(unsigned long clone_flags,
>> if (!IS_ERR(p)) {
>> struct completion vfork;
>>
>> - nr = pid_nr(task_pid(p));
>> + /*
>> + * this is enough to call pid_nr_ns here, but this if
>> + * improves optimisation of regular fork()
>> + */
>> + nr = (clone_flags & CLONE_NEWPID) ?
>> + task_pid_nr_ns(p, current->nsproxy->pid_ns) :
>> + task_pid_vnr(p);
>
> Shouldn't we do the same for CLONE_PARENT_SETTID in copy_process() ?
> Otherwise *parent_tidptr may have a wrong value which doesn't match
> to what fork() returns.

Oops. We should. Thanks :)

> Oleg.

Thanks,
Pavel
Re: [PATCH 10/15] Make each namespace has its own proc tree [message #15404 is a reply to message #15385] Mon, 30 July 2007 06:43 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> On 07/26, Pavel Emelyanov wrote:
>> static int proc_get_sb(struct file_system_type *fs_type,
>> int flags, const char *dev_name, void *data, struct vfsmount *mnt)
>> {
>> [...snip...]
>>
>> + if (!sb->s_root) {
>> + sb->s_flags = flags;
>> + err = proc_fill_super(sb);
>> + if (err) {
>> + up_write(&sb->s_umount);
>> + deactivate_super(sb);
>> + return err;
>> + }
>> +
>> + ei = PROC_I(sb->s_root->d_inode);
>> + if (!ei->pid) {
>> + rcu_read_lock();
>> + ei->pid = get_pid(find_pid_ns(1, ns));
>
> Where do we "put" this pid?

In proc_delete_inode()

>> void release_task(struct task_struct * p)
>> {
>> + struct pid *pid;
>> struct task_struct *leader;
>> int zap_leader;
>> repeat:
>> @@ -160,6 +161,20 @@ repeat:
>> write_lock_irq(&tasklist_lock);
>> ptrace_unlink(p);
>> BUG_ON(!list_empty(&p->ptrace_list) ||
>> !list_empty(&p->ptrace_children));
>> + /*
>> + * we have to keep this pid till proc_flush_task() to make
>> + * it possible to flush all dentries holding it. pid will
>> + * be put ibidem
>> + *
>> + * however if the pid belogs to init namespace only, we can
>> + * optimize this out
>> + */
>> + pid = task_pid(p);
>> + if (pid->level != 0)
>> + get_pid(pid);
>> + else
>> + pid = NULL;
>> +
>> __exit_signal(p);
>>
>> /*
>> @@ -184,7 +199,8 @@ repeat:
>> }
>>
>> write_unlock_irq(&tasklist_lock);
>> - proc_flush_task(p, NULL);
>> + proc_flush_task(p, pid);
>> + put_pid(pid);
>
> Oh, this is not nice...
>
> Look, proc_flush_task() doesn't need pid, it can use active pid_ns

It cannot. By the time release_task() is called the task is already
exit_task_namespaces()-ed :(

> to iterate the ->parent chain and do proc_flush_task_mnt(), and it
> can check "ns->child_reaper == current" for mntput(), yes?
>
> So, can't we do instead
>
> release_task()
> {
> struct pid_namespace *my_ns = ...; // get it from pid;

So, that's what you mean. Well, that's sounds reasonable. Thanks.

> ...
>
> write_unlock_irq(&tasklist_lock);
>
> // __exit_signal()->...->put_pid() drops the reference,
> // but this ns should still be valid because /proc itself
> // holds a reference to it, even if we are /sbin/init.
> proc_flush_task(p, my_ns);
> ...
> }
>
> No?

Yes. Thanks!

> Oleg.

Pavel
Re: [PATCH 10/15] Make each namespace has its own proc tree [message #15405 is a reply to message #15387] Mon, 30 July 2007 06:45 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> On 07/29, Oleg Nesterov wrote:
>> Look, proc_flush_task() doesn't need pid, it can use active pid_ns
>
> Oops. Yes, proc_flush_task() doesn't use ->numbers[].nr, but it should?
>
> proc_flush_task_mnt() uses task->pid to flush /proc/$pid, this looks wrong,
> we should use ->number[ns->level].nr, no?

Indeed. Hmm, why then pids were released during my tests...
Looks like we do still need struct pid for proc_flush_task().

> Oleg.

Thanks,
Pavel
Re: [PATCH 15/15] Hooks over the code to show correct values to user [message #15406 is a reply to message #15384] Mon, 30 July 2007 06:49 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> On 07/26, Pavel Emelyanov wrote:
>> int
>> kill_proc(pid_t pid, int sig, int priv)
>> {
>> - return kill_proc_info(sig, __si_special(priv), pid);
>> + int ret;
>> +
>> + rcu_read_lock();
>> + ret = kill_pid_info(sig, __si_special(priv), find_pid(pid));
>> + rcu_read_unlock();
>> + return ret;
>> }
>
> I think this is wrong. kill_proc() should behave the same as kill_proc_info(),
> so this change is not needed. With this patch they use different namespaces
> to find the task, this is not consistent.

Actually, callers of this use tsk->pid (global pid) as an argument, so
find_vpid() might return wrong value.

> (sadly, this patch is huge, very difficult to review).

This is *very* huge, but all it does is just replace tsk->pid, find_task_by_pid,
pd_nr, etc with appropriate task_pid_nr(), pid_nr_ns() and find_task_by_vpid() etc.

> Oleg.

Thanks,
Pavel
Re: [PATCH 11/15] Signal semantics [message #15410 is a reply to message #15357] Mon, 30 July 2007 09:31 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
[snip]

>> | Maybe it's worth disabling cross-namespaces ptracing...
>>
>> I think so too. Its probably not a serious limitation ?
>
> Several people think we will implement 'namespace entering' through a
> ptrace hack, where maybe the admin ptraces the init in a child pidns,

Why not implement namespace entering w/o any hacks? :)

> makes it fork, and makes the child execute what it wants (i.e. ps -ef).
>
> You're talking about killing that functionality?

No. We're talking about disabling the things that are not supposed
to work at all.

> -serge
>
Re: [PATCH 11/15] Signal semantics [message #15411 is a reply to message #15358] Mon, 30 July 2007 09:34 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
sukadev@us.ibm.com wrote:
> Serge E. Hallyn [serue@us.ibm.com] wrote:
> | Quoting sukadev@us.ibm.com (sukadev@us.ibm.com):
> | > Pavel Emelianov [xemul@openvz.org] wrote:
> | > | Oleg Nesterov wrote:
> | > | >(What about ptrace_attach() btw? If it is possible to send a signal to init
> | > | > from the "parent" namespace, perhaps it makes sense to allow ptracing as
> | > | > well).
> | > |
> | > | ptracing of tasks fro different namespaces is not possible at all, since
> | > | strace utility determines the fork()-ed child pid from the parent's eax
> | > | register, which would contain the pid value as this parent sees his child.
> | > | But if the strace is in different namespace - it won't be able to find
> | > | this child with the pid value from parent's eax.
> | > |
> | > | Maybe it's worth disabling cross-namespaces ptracing...
> | >
> | > I think so too. Its probably not a serious limitation ?
> |
> | Several people think we will implement 'namespace entering' through a
> | ptrace hack, where maybe the admin ptraces the init in a child pidns,
> | makes it fork, and makes the child execute what it wants (i.e. ps -ef).
> |
> | You're talking about killing that functionality?
>
> No. I was only thinking in terms of debugging container init and missed
> the namespace entering part.
>
> Pavel, I am not sure I understand your comment about being unable to
> ptrace() a child ns.

Ok. We have a task with pid 100, which tries to clone the new namespace.
This task fork-s and we have a new task with a couple of pids (101, 1).
Then this "init" forks again and we have the third task with pids (102, 2).
The problem is that when the 3rd task appears the return value from fork(),
that is - the new task's pid as it is seen by it's parent (2nd task), will
go to eax register (for i386) and it will be 2! But the prtaces from the
initial namespace cannot address this task with pid 2.

> BTW, I am able to gdb a process (incl container-init) from parent ns now.

Debugging separate processes is possible, but when this task starts forking
with namespaces creation - this becomes impossible.

> |
> | -serge
>
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #15417 is a reply to message #15377] Mon, 30 July 2007 11:56 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> On 07/26, Pavel Emelyanov wrote:
>> @@ -895,6 +915,7 @@ fastcall NORET_TYPE void do_exit(long co
>> {
>> struct task_struct *tsk = current;
>> int group_dead;
>> + struct pid_namespace *pid_ns = tsk->nsproxy->pid_ns;
>>
>> profile_task_exit(tsk);
>>
>> @@ -905,9 +926,10 @@ fastcall NORET_TYPE void do_exit(long co
>> if (unlikely(!tsk->pid))
>> panic("Attempted to kill the idle task!");
>> if (unlikely(tsk == task_child_reaper(tsk))) {
>> - if (task_active_pid_ns(tsk) != &init_pid_ns)
>> - task_active_pid_ns(tsk)->child_reaper =
>> - init_pid_ns.child_reaper;
>> + if (pid_ns != &init_pid_ns) {
>> + zap_pid_ns_processes(pid_ns);
>> + pid_ns->child_reaper = init_pid_ns.child_reaper;
>> + }
>> else
>> panic("Attempted to kill init!");
>
> No, no, this is wrong. Yes, the current code is buggy too, I'll send
> the fix.
>
> I think this code should be moved below under the "if (group_dead)",
> and we should use tsk->group_leader.
>
>> +void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>> +{
>> + int i;
>> + int nr;
>> + int nfree;
>> + int options = WNOHANG|WEXITED|__WALL;
>> +
>> +repeat:
>> + /*
>> + * We know pid == 1 is terminating. Find remaining pid_ts
>> + * in the namespace, signal them and then wait for them
>> + * exit.
>> + */
>> + nr = next_pidmap(pid_ns, 1);
>> + while (nr > 0) {
>> + kill_proc_info(SIGKILL, SEND_SIG_PRIV, nr);
>> + nr = next_pidmap(pid_ns, nr);
>> + }
>> +
>> + nr = next_pidmap(pid_ns, 1);
>> + while (nr > 0) {
>> + do_wait(nr, options, NULL, NULL, NULL);
>
> When the first child of init exits, it sends SIGCHLD. After that,
> do_wait() will never sleep, so we are doing a busy-wait loop.
> Not good, especially when we have a niced child, can livelock.
>
>> + nr = next_pidmap(pid_ns, nr);
>> + }
>> +
>> + nfree = 0;
>> + for (i = 0; i < PIDMAP_ENTRIES; i++)
>> + nfree += atomic_read(&pid_ns->pidmap[i].nr_free);
>> +
>> + /*
>> + * If pidmap has entries for processes other than 0 and 1, retry.
>> + */
>> + if (nfree < (BITS_PER_PAGE * PIDMAP_ENTRIES - 2))
>> + goto repeat;
>
> This doesn't look right.
>
> Suppose that some "struct pid" was pinned from the parent namespace.
> In that case zap_pid_ns_processes() will burn CPU until put_pid(), bad.

Nope. struct pid can be pinned, but the pidmap "fingerprint" cannot.
So as soon as the release_task() is called the pidmap becomes free and
we can proceed.

However I agree with the "burn CPU" issue - wait must sleep if needed.

> I think we can rely on forget_original_child() and do something like
> this:
>
> zap_active_ns_processes(void)
> {
> // kill all tasks in our ns and below
> kill(-1, SIGKILL);

That would be too slow to walk through all the tasks in a node searching
for a couple of them we need. fing_ge_pid() looks better to me.

> do {
> clear_thread_flag(TIF_SIGPENDING);
> } while (wait(NULL) != -ECHLD);
> }
>
> Oleg.
>
>

Thanks,
Pavel
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #15419 is a reply to message #15417] Mon, 30 July 2007 15:44 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/30, Pavel Emelyanov wrote:
>
> Oleg Nesterov wrote:
> >>+
> >>+ nfree = 0;
> >>+ for (i = 0; i < PIDMAP_ENTRIES; i++)
> >>+ nfree += atomic_read(&pid_ns->pidmap[i].nr_free);
> >>+
> >>+ /*
> >>+ * If pidmap has entries for processes other than 0 and 1, retry.
> >>+ */
> >>+ if (nfree < (BITS_PER_PAGE * PIDMAP_ENTRIES - 2))
> >>+ goto repeat;
> >
> >This doesn't look right.
> >
> >Suppose that some "struct pid" was pinned from the parent namespace.
> >In that case zap_pid_ns_processes() will burn CPU until put_pid(), bad.
>
> Nope. struct pid can be pinned, but the pidmap "fingerprint" cannot.

Heh. It was specially designed this way, but I managed to forget.

You are right, thanks for correcting me.

> So as soon as the release_task() is called the pidmap becomes free and
> we can proceed.

Well, it doesn't matter, but strictly speaking this is not true.
release_task()->detach_pid(PIDTYPE_PID) doesn't necessary free pidmap,
it could be "used" by other tasks as PGID/SID.

> However I agree with the "burn CPU" issue - wait must sleep if needed.
>
> >I think we can rely on forget_original_child() and do something like
> >this:
> >
> > zap_active_ns_processes(void)
> > {
> > // kill all tasks in our ns and below
> > kill(-1, SIGKILL);
>
> That would be too slow to walk through all the tasks in a node searching
> for a couple of them we need. fing_ge_pid() looks better to me.

OK. I personally dislike the "retry" logic (and it is not safe if we want
wait() to actually sleep waiting for the child), but we can do the "kill"
loop under tasklist_lock.

In any case, I don't think we should use next_pid() for wait(), or check
pidmap[].nr_free,

> > do {
> > clear_thread_flag(TIF_SIGPENDING);
> > } while (wait(NULL) != -ECHLD);
> > }

just wait for any child until -ECHLD. After that we can return safely.

Oleg.
Re: [PATCH 9/15] Move alloc_pid() after the namespace is cloned [message #15422 is a reply to message #15403] Mon, 30 July 2007 23:43 Go to previous messageGo to next message
Sukadev Bhattiprolu is currently offline  Sukadev Bhattiprolu
Messages: 413
Registered: August 2006
Senior Member
Pavel Emelianov [xemul@openvz.org] wrote:
| Oleg Nesterov wrote:
| >On 07/26, Pavel Emelyanov wrote:
| >>This is a fix for Sukadev's patch that moved the alloc_pid() call from
| >>do_fork() into copy_process().
| >
| >... and this patch changes almost every line from Sukadev's patch.
|
| It does. My bad :( I have reviewed Suka's patch badly and was sure it
| puts the alloc_pid() right where we need this.

I should have reviewed Pavel's closely too. Sorry.

|
| >Sorry gents, but isn't it better to ask Andrew to drop that patch
| >(which is quite useless by itself), and send a new one which incorporates
| >all necessary changes? Imho, it would be much easier to understand.
|
| Hm... Maybe it's better to ask him to fold these patches together?

I think so, but even dropping my patch is fine with me.

|
| >>@@ -1406,7 +1422,13 @@ long do_fork(unsigned long clone_flags,
| >> if (!IS_ERR(p)) {
| >> struct completion vfork;
| >>
| >>- nr = pid_nr(task_pid(p));
| >>+ /*
| >>+ * this is enough to call pid_nr_ns here, but this if
| >>+ * improves optimisation of regular fork()
| >>+ */
| >>+ nr = (clone_flags & CLONE_NEWPID) ?
| >>+ task_pid_nr_ns(p, current->nsproxy->pid_ns) :
| >>+ task_pid_vnr(p);
| >
| >Shouldn't we do the same for CLONE_PARENT_SETTID in copy_process() ?
| >Otherwise *parent_tidptr may have a wrong value which doesn't match
| >to what fork() returns.
|
| Oops. We should. Thanks :)
|
| >Oleg.
|
| Thanks,
| Pavel
Re: [PATCH 3/15] kern_siginfo helper [message #15433 is a reply to message #15400] Tue, 31 July 2007 00:21 Go to previous messageGo to next message
Sukadev Bhattiprolu is currently offline  Sukadev Bhattiprolu
Messages: 413
Registered: August 2006
Senior Member
Pavel Emelianov [xemul@openvz.org] wrote:
| Oleg Nesterov wrote:
| >On 07/26, Pavel Emelyanov wrote:
| >>TODO: This is more an exploratory patch and modifies only
| >>interfaces
| >> necessary to implement correct signal semantics in pid namespaces.
| >>
| >> If the approach is feasible, we could consistently use 'kern_siginfo'
| >> in other signal interfaces and possibly in 'struct sigqueue'.
| >>
| >> We could modify dequeue_signal() to directly work with 'kern_siginfo'
| >> and remove dequeue_signal_kern_info().
| >
| >Well... I know, it is very easy to blame somebody else's patch, and
| >probably
| >my taste is not good...
| >
| >But honestly, I personally think this approach is a horror, and any
| >alternative
| >is better :)

Hmm. My reasoning was that since siginfo_t was directly "shared" with
user space, extending it even to add a flag was pain.

| >
| >I'd rather change dequeue_signal() so that it takes "struct sigqueue *"
| >parameter instead of "siginfo_t *", or add a new "int *flags".

My earlier version to Containers@ passed in "int *signal_cinit" couple
of levels down and used that in get_signal_to_deliver() but that
looked ugly :-) Passing in sigqueue to dequeue_signal() may be better,
but anyway...

| >
| >OK, this doesn't work anyway, we should do something different. Perhaps
| >just do all checks on sender's side.
|
| Yes. Signal handling in namespaces turned out to be the most complicated
| part of the set. I start thinking to drop this part till we have the "core"
| in -mm tree. Suka, what do you think?

Yes. Lets focus on the core for now and allow privileged user in a child-ns
to terminate the container-init. I will try the signal approach from sender
side also.

|
| >It is a bit strange that this patch is 3/15, and the rest bits in 11/15,
| >not very convenient for the review.
|
| Well, I thought that a split like
| 1. patches for kernel to prepare it for the set
| 2. the set itself
| is the best to review. Maybe I was wrong, but how to make this then?
| E.g. I have a MS_KERNOUNT patch, but its changes are used *much* later.
|
| >Oleg.
| >
| >
|
| Thanks,
| Pavel
Re: [PATCH 9/15] Move alloc_pid() after the namespace is cloned [message #15437 is a reply to message #15303] Tue, 31 July 2007 05:49 Go to previous messageGo to next message
Sukadev Bhattiprolu is currently offline  Sukadev Bhattiprolu
Messages: 413
Registered: August 2006
Senior Member
Pavel Emelianov [xemul@openvz.org] wrote:
| When we create new namespace we will need to allocate the struct pid,
| that will have one extra struct upid in array, comparing to the parent.
| Thus we need to know the new namespace (if any) in alloc_pid() to init
| this struct upid properly, so move the alloc_pid() call lower in
| copy_process().
|
| This is a fix for Sukadev's patch that moved the alloc_pid() call from
| do_fork() into copy_process().
|
| Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
|
| ---
|
| fork.c | 59 ++++++++++++++++++++++++++++++++++++++---------------------
| 1 files changed, 38 insertions(+), 21 deletions(-)
|
| diff -upr linux-2.6.23-rc1-mm1.orig/kernel/fork.c
| linux-2.6.23-rc1-mm1-7/kernel/fork.c
| --- linux-2.6.23-rc1-mm1.orig/kernel/fork.c 2007-07-26
| 16:34:45.000000000 +0400
| +++ linux-2.6.23-rc1-mm1-7/kernel/fork.c 2007-07-26
| 16:36:37.000000000 +0400
| @@ -1028,16 +1029,9 @@ static struct task_struct *copy_process(
| if (p->binfmt && !try_module_get(p->binfmt->module))
| goto bad_fork_cleanup_put_domain;
|
| - if (pid != &init_struct_pid) {
| - pid = alloc_pid();
| - if (!pid)
| - goto bad_fork_put_binfmt_module;
| - }
| -
| p->did_exec = 0;
| delayacct_tsk_init(p); /* Must remain after dup_task_struct() */
| copy_flags(clone_flags, p);
| - p->pid = pid_nr(pid);
| INIT_LIST_HEAD(&p->children);
| INIT_LIST_HEAD(&p->sibling);
| p->vfork_done = NULL;
| @@ -1112,10 +1106,6 @@ static struct task_struct *copy_process(
| p->blocked_on = NULL; /* not blocked yet */
| #endif
|
| - p->tgid = p->pid;
| - if (clone_flags & CLONE_THREAD)
| - p->tgid = current->tgid;
| -
| if ((retval = security_task_alloc(p)))
| goto bad_fork_cleanup_policy;
| if ((retval = audit_alloc(p)))
| @@ -1141,6 +1131,17 @@ static struct task_struct *copy_process(
| if (retval)
| goto bad_fork_cleanup_namespaces;
|
| + if (pid != &init_struct_pid) {
| + pid = alloc_pid(task_active_pid_ns(p));
| + if (!pid)
| + goto bad_fork_cleanup_namespaces;

We should set retval to -EAGAIN before this goto. Otherwise, when we
take this goto, retval would be 0 (bc copy_thread() succeeded) and we
crash in the caller.


| + }
| +
| + p->pid = pid_nr(pid);
| + p->tgid = p->pid;
| + if (clone_flags & CLONE_THREAD)
| + p->tgid = current->tgid;
| +
| p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr
| : NULL;
| /*
| * Clear TID on mm_release()?
| @@ -1237,7 +1242,7 @@ static struct task_struct *copy_process(
| spin_unlock(&current->sighand->siglock);
| write_unlock_irq(&tasklist_lock);
| retval = -ERESTARTNOINTR;
| - goto bad_fork_cleanup_namespaces;
| + goto bad_fork_free_pid;
| }
|
| if (clone_flags & CLONE_THREAD) {
| @@ -1266,11 +1271,22 @@ static struct task_struct *copy_process(
| __ptrace_link(p, current->parent);
|
| if (thread_group_leader(p)) {
| - p->signal->tty = current->signal->tty;
| - p->signal->pgrp = task_pgrp_nr(current);
| - set_task_session(p, task_session_nr(current));
| - attach_pid(p, PIDTYPE_PGID, task_pgrp(current));
| - attach_pid(p, PIDTYPE_SID, task_session(current));
| + if (clone_flags & CLONE_NEWPID) {
| + p->nsproxy->pid_ns->child_reaper = p;
| + p->signal->tty = NULL;
| + p->signal->pgrp = p->pid;
| + set_task_session(p, p->pid);
| + attach_pid(p, PIDTYPE_PGID, pid);
| + attach_pid(p, PIDTYPE_SID, pid);
| + } else {
| + p->signal->tty = current->signal->tty;
| + p->signal->pgrp = task_pgrp_nr(current);
| + set_task_session(p,
| task_session_nr(current));
| + attach_pid(p, PIDTYPE_PGID,
| + task_pgrp(current));
| + attach_pid(p, PIDTYPE_SID,
| + task_session(current));
| + }
|
| list_add_tail_rcu(&p->tasks, &init_task.tasks);
| __get_cpu_var(process_counts)++;
| @@ -1288,6 +1304,9 @@ static struct task_struct *copy_process(
| container_post_fork(p);
| return p;
|
| +bad_fork_free_pid:
| + if (pid != &init_struct_pid)
| + free_pid(pid);
| bad_fork_cleanup_namespaces:
| exit_task_namespaces(p);
| bad_fork_cleanup_keys:
| @@ -1322,9 +1341,6 @@ bad_fork_cleanup_container:
| #endif
| container_exit(p, container_callbacks_done);
| delayacct_tsk_free(p);
| - if (pid != &init_struct_pid)
| - free_pid(pid);
| -bad_fork_put_binfmt_module:
| if (p->binfmt)
| module_put(p->binfmt->module);
| bad_fork_cleanup_put_domain:
| @@ -1406,7 +1422,13 @@ long do_fork(unsigned long clone_flags,
| if (!IS_ERR(p)) {
| struct completion vfork;
|
| - nr = pid_nr(task_pid(p));
| + /*
| + * this is enough to call pid_nr_ns here, but this if
| + * improves optimisation of regular fork()
| + */
| + nr = (clone_flags & CLONE_NEWPID) ?
| + task_pid_nr_ns(p, current->nsproxy->pid_ns) :
| + task_pid_vnr(p);
|
| if (clone_flags & CLONE_VFORK) {
| p->vfork_done = &vfork;
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #15441 is a reply to message #15308] Tue, 31 July 2007 09:07 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/30, sukadev@us.ibm.com wrote:
>
> --- lx26-23-rc1-mm1.orig/kernel/exit.c 2007-07-26 20:08:16.000000000 -0700
> +++ lx26-23-rc1-mm1/kernel/exit.c 2007-07-30 23:10:30.000000000 -0700
> @@ -915,6 +915,7 @@ fastcall NORET_TYPE void do_exit(long co
> {
> struct task_struct *tsk = current;
> int group_dead;
> + struct pid_namespace *pid_ns = tsk->nsproxy->pid_ns;
>
> profile_task_exit(tsk);
>
> @@ -925,9 +926,10 @@ fastcall NORET_TYPE void do_exit(long co
> if (unlikely(!tsk->pid))
> panic("Attempted to kill the idle task!");
> if (unlikely(tsk == task_child_reaper(tsk))) {
> - if (task_active_pid_ns(tsk) != &init_pid_ns)
> - task_active_pid_ns(tsk)->child_reaper =
> - init_pid_ns.child_reaper;
> + if (pid_ns != &init_pid_ns) {
> + zap_pid_ns_processes(pid_ns);
> + pid_ns->child_reaper = init_pid_ns.child_reaper;
> + }
> else
> panic("Attempted to kill init!");
> }

Just to remind you, this is not right when init is multi-threaded,
we should do this only when the last thread exits.

> -static long do_wait(pid_t pid, int options, struct siginfo __user *infop,
> +long do_wait(pid_t pid, int options, struct siginfo __user *infop,
> int __user *stat_addr, struct rusage __user *ru)

Small nit, other in-kernel reapers use sys_wait4(), perhaps we can use
it too, in that case we don't need to export do_wait().

> +void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> +{
> + int nr;
> + int rc;
> + int options = WEXITED|__WALL;
> +
> + /*
> + * We know pid == 1 is terminating. Find remaining pid_ts
> + * in the namespace, signal them and then wait for them
> + * exit.
> + */
> + nr = next_pidmap(pid_ns, 1);
> + while (nr > 0) {
> + kill_proc_info(SIGKILL, SEND_SIG_PRIV, nr);
> + nr = next_pidmap(pid_ns, nr);
> + }

Without tasklist_lock held this is not reliable.

Oleg.
Re: [PATCH 15/15] Hooks over the code to show correct values to user [message #15442 is a reply to message #15406] Tue, 31 July 2007 10:04 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/30, Pavel Emelyanov wrote:
>
> Oleg Nesterov wrote:
> >On 07/26, Pavel Emelyanov wrote:
> >>int
> >>kill_proc(pid_t pid, int sig, int priv)
> >>{
> >>- return kill_proc_info(sig, __si_special(priv), pid);
> >>+ int ret;
> >>+
> >>+ rcu_read_lock();
> >>+ ret = kill_pid_info(sig, __si_special(priv), find_pid(pid));
> >>+ rcu_read_unlock();
> >>+ return ret;
> >>}
> >
> >I think this is wrong. kill_proc() should behave the same as
> >kill_proc_info(),
> >so this change is not needed. With this patch they use different namespaces
> >to find the task, this is not consistent.
>
> Actually, callers of this use tsk->pid (global pid) as an argument, so
> find_vpid() might return wrong value.

Yes I see. But still I don't agree on this issue.

kill_proc() is a simple wrapper on top of kill_proc_info(), not good
to break this. And with this patch they use different namespaces to
search the pid. Imho, not consistent.

Probably we can ignore this for now, but suppose we have some out-of-tree
driver which does kill_proc(pid_number), and the application from non-init
namespace does ioctl(SET_PID_NUMBER, getpid()).

And this is why btw I think find_pid/pid_nr should use active namespace, not
init_pid_ns. That driver can save "struct task_struct*" or "struct pid*".

OK, I understand it is a pain to "fix" the in-tree callers of kill_proc()
(say, we can introduce kill_pid_t() or something), so let's forget this.
In fact, we'd better remove kill_proc(), we should avoid using pid_t, the
callers should be converted to use struct pid.

Oleg.
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #15465 is a reply to message #15441] Wed, 01 August 2007 06:16 Go to previous messageGo to next message
Sukadev Bhattiprolu is currently offline  Sukadev Bhattiprolu
Messages: 413
Registered: August 2006
Senior Member
Oleg Nesterov [oleg@tv-sign.ru] wrote:
| On 07/30, sukadev@us.ibm.com wrote:
| >
| > --- lx26-23-rc1-mm1.orig/kernel/exit.c	2007-07-26 20:08:16.000000000 -0700
| > +++ lx26-23-rc1-mm1/kernel/exit.c	2007-07-30 23:10:30.000000000 -0700
| > @@ -915,6 +915,7 @@ fastcall NORET_TYPE void do_exit(long co
| >  {
| >  	struct task_struct *tsk = current;
| >  	int group_dead;
| > +	struct pid_namespace *pid_ns = tsk->nsproxy->pid_ns;
| >  
| >  	profile_task_exit(tsk);
| >  
| > @@ -925,9 +926,10 @@ fastcall NORET_TYPE void do_exit(long co
| >  	if (unlikely(!tsk->pid))
| >  		panic("Attempted to kill the idle task!");
| >  	if (unlikely(tsk == task_child_reaper(tsk))) {
| > -		if (task_active_pid_ns(tsk) != &init_pid_ns)
| > -			task_active_pid_ns(tsk)->child_reaper =
| > -					init_pid_ns.child_reaper;
| > +		if (pid_ns != &init_pid_ns) {
| > +			zap_pid_ns_processes(pid_ns);
| > +			pid_ns->child_reaper = init_pid_ns.child_reaper;
| > +		}
| >  		else
| >  			panic("Attempted to kill init!");
| >  	}
| 
| Just to remind you, this is not right when init is multi-threaded,
| we should do this only when the last thread exits.

Sorry, I needed to clarify somethings about the multi-threaded init. I
got the impresssion that you were sending a patch for the existing bug,
and meant to review/clarify in the context of the patch.

Anyways, re: requirements for multi-threaded init:

	Our current definition of is_container_init() and task_child_reaper()
	refer only to the main-thread of the container-init (since they check
	for pid_t == 1)

	If the main-thread is exiting and is the last thread in the group,
	we want terminate other processes in the pid ns (simple case).

	If the main thread is exiting, but is not the last thread in the
	group, should we let it exit and let the next thread in the group
	the reaper of the pid ns ?
	
	Then we would have the pid ns w/o a container-init (i.e reaper
	does not have a pid_t == 1, but probably does not matter).

	And, when this last thread is exiting, we want to terminate other
	processes in the ns right ?

| 
| > -static long do_wait(pid_t pid, int options, struct siginfo __user *infop,
| > +long do_wait(pid_t pid, int options, struct siginfo __user *infop,
| >  		    int __user *stat_addr, struct rusage __user *ru)
| 
| Small nit, other in-kernel reapers use sys_wait4(), perhaps we can use
| it too, in that case we don't need to export do_wait().

Ok.

| 
| > +void zap_pid_ns_processes(struct pid_namespace *pid_ns)
| > +{
| > +	int nr;
| > +	int rc;
| > +	int options = WEXITED|__WALL;
| > +
| > +	/*
| > +	 * We know pid == 1 is terminating. Find remaining pid_ts
| > +	 * in the namespace, signal them and then wait for them
| > +	 * exit.
| > +	 */
| > +	nr = next_pidmap(pid_ns, 1);
| > +	while (nr > 0) {
| > +		kill_proc_info(SIGKILL, SEND_SIG_PRIV, nr);
| > +		nr = next_pidmap(pid_ns, nr);
| > +	}
| 
| Without tasklist_lock held this is not reliable.

Ok. BTW, find_ge_pid() also walks the pidmap, but does not seem to hold
the tasklist_lock. Is that bc its only used in /proc ?

| 
| Oleg.
Re: [PATCH 11/15] Signal semantics [message #15478 is a reply to message #15410] Wed, 01 August 2007 16:13 Go to previous messageGo to next message
serue is currently offline  serue
Messages: 750
Registered: February 2006
Senior Member
Quoting Pavel Emelyanov (xemul@openvz.org):
> [snip]
> 
> >>| Maybe it's worth disabling cross-namespaces ptracing...
> >>
> >>I think so too. Its probably not a serious limitation ?
> >
> >Several people think we will implement 'namespace entering' through a
> >ptrace hack, where maybe the admin ptraces the init in a child pidns,
> 
> Why not implement namespace entering w/o any hacks? :)

I did, as a patch on top of the nsproxy container subsystem.  The
response was that that is a hack, and ptrace is cleaner  :)

So the current options for namespace entering would be:

	* using Cedric's bind_ns() functionality, which assigns an
	  integer global id to a namespace, and allows a process to
	  enter a namespace by that global id
	* using my nsproxy container subsystem patch, which lets
	  a process enter another namespace using
	  	echo pid > /container/some/cont/directory/tasks
	  and eventually might allow construction of custom
	  namespaces, i.e.
	  	mkdir /container/c1/c2
		ln -s /container/c1/c1/network /container/c1/c2/network
		echo $$ > /container/c1/c2/tasks
	* using ptrace to coerce a process in the target namespace
	  into forking and executing the desired program.

> >makes it fork, and makes the child execute what it wants (i.e. ps -ef).
> >
> >You're talking about killing that functionality?
> 
> No. We're talking about disabling the things that are not supposed 
> to work at all.

Uh, well in the abstract that sounds like a sound policy...

-serge
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #15479 is a reply to message #15465] Wed, 01 August 2007 16:00 Go to previous messageGo to next message
Dave Hansen is currently offline  Dave Hansen
Messages: 240
Registered: October 2005
Senior Member
On Tue, 2007-07-31 at 23:16 -0700, sukadev@us.ibm.com wrote:
> Oleg Nesterov [oleg@tv-sign.ru] wrote:
> | On 07/30, sukadev@us.ibm.com wrote:
> | >
> | > --- lx26-23-rc1-mm1.orig/kernel/exit.c	2007-07-26 20:08:16.000000000 -0700
> | > +++ lx26-23-rc1-mm1/kernel/exit.c	2007-07-30 23:10:30.000000000 -0700
> | > @@ -915,6 +915,7 @@ fastcall NORET_TYPE void do_exit(long co
> | >  {
> | >  	struct task_struct *tsk = current;
> | >  	int group_dead;
> | > +	struct pid_namespace *pid_ns = tsk->nsproxy->pid_ns;
> | >  
> | >  	profile_task_exit(tsk);
> | >  
> | > @@ -925,9 +926,10 @@ fastcall NORET_TYPE void do_exit(long co
> | >  	if (unlikely(!tsk->pid))
> | >  		panic("Attempted to kill the idle task!");
> | >  	if (unlikely(tsk == task_child_reaper(tsk))) {
> | > -		if (task_active_pid_ns(tsk) != &init_pid_ns)
> | > -			task_active_pid_ns(tsk)->child_reaper =
> | > -					init_pid_ns.child_reaper;
> | > +		if (pid_ns != &init_pid_ns) {
> | > +			zap_pid_ns_processes(pid_ns);
> | > +			pid_ns->child_reaper = init_pid_ns.child_reaper;
> | > +		}
> | >  		else
> | >  			panic("Attempted to kill init!");
> | >  	}
> | 
> | Just to remind you, this is not right when init is multi-threaded,
> | we should do this only when the last thread exits.
> 
> Sorry, I needed to clarify somethings about the multi-threaded init. I
> got the impresssion that you were sending a patch for the existing bug,
> and meant to review/clarify in the context of the patch.
> 
> Anyways, re: requirements for multi-threaded init:
> 
> 	Our current definition of is_container_init() and task_child_reaper()
> 	refer only to the main-thread of the container-init (since they check
> 	for pid_t == 1)

Remember, the "pid" is actually a tgid:
        
        asmlinkage long sys_getpid(void)
        {
                return current->tgid;
        }

So, there are multiple tasks with a "pid" == 1 with a multithreaded
init. 

> 	If the main-thread is exiting and is the last thread in the group,
> 	we want terminate other processes in the pid ns (simple case).
> 
> 	If the main thread is exiting, but is not the last thread in the
> 	group, should we let it exit and let the next thread in the group
> 	the reaper of the pid ns ?

Well, what happens with a multithreaded init today?
	
-- Dave
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #15482 is a reply to message #15465] Wed, 01 August 2007 19:48 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/31, sukadev@us.ibm.com wrote:
>
> Oleg Nesterov [oleg@tv-sign.ru] wrote:
> | >  
> | > @@ -925,9 +926,10 @@ fastcall NORET_TYPE void do_exit(long co
> | >  	if (unlikely(!tsk->pid))
> | >  		panic("Attempted to kill the idle task!");
> | >  	if (unlikely(tsk == task_child_reaper(tsk))) {
> | > -		if (task_active_pid_ns(tsk) != &init_pid_ns)
> | > -			task_active_pid_ns(tsk)->child_reaper =
> | > -					init_pid_ns.child_reaper;
> | > +		if (pid_ns != &init_pid_ns) {
> | > +			zap_pid_ns_processes(pid_ns);
> | > +			pid_ns->child_reaper = init_pid_ns.child_reaper;
> | > +		}
> | >  		else
> | >  			panic("Attempted to kill init!");
> | >  	}
> | 
> | Just to remind you, this is not right when init is multi-threaded,
> | we should do this only when the last thread exits.
> 
> Sorry, I needed to clarify somethings about the multi-threaded init. I
> got the impresssion that you were sending a patch for the existing bug,
> and meant to review/clarify in the context of the patch.

Ah, sorry, I forgot to send the patch to fix the bug in mainline.
Will try to do tomorrow, please feel free to do this if you wish.

> 	Our current definition of is_container_init() and task_child_reaper()
> 	refer only to the main-thread of the container-init (since they check
> 	for pid_t == 1)

Yes.

> 	If the main-thread is exiting and is the last thread in the group,
> 	we want terminate other processes in the pid ns (simple case).

Yes.

> 	If the main thread is exiting, but is not the last thread in the
> 	group, should we let it exit and let the next thread in the group
> 	the reaper of the pid ns ?

We can, but why? The main thread's task_struct can't go away until all
sub-threads exit. Its ->nsproxy will be NULL, but this doesn't matter.

> 	Then we would have the pid ns w/o a container-init (i.e reaper
> 	does not have a pid_t == 1, but probably does not matter).
> 
> 	And, when this last thread is exiting, we want to terminate other
> 	processes in the ns right ?

Yes, when this last thread is exiting, the entire process is exiting.

> | > +void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> | > +{
> | > +	int nr;
> | > +	int rc;
> | > +	int options = WEXITED|__WALL;
> | > +
> | > +	/*
> | > +	 * We know pid == 1 is terminating. Find remaining pid_ts
> | > +	 * in the namespace, signal them and then wait for them
> | > +	 * exit.
> | > +	 */
> | > +	nr = next_pidmap(pid_ns, 1);
> | > +	while (nr > 0) {
> | > +		kill_proc_info(SIGKILL, SEND_SIG_PRIV, nr);
> | > +		nr = next_pidmap(pid_ns, nr);
> | > +	}
> | 
> | Without tasklist_lock held this is not reliable.
> 
> Ok. BTW, find_ge_pid() also walks the pidmap, but does not seem to hold
> the tasklist_lock. Is that bc its only used in /proc ?

Yes, but this is something different. With or without tasklist_lock,
find_ge_pid()/next_tgid() is not "reliable" (note that alloc_pid() doesn't
take tasklist), but this doesn't matter for /proc.

We should take tasklist_lock to prevent the new process creation.
We can have the "false positives" (copy_process() in progress, PGID/SID
pids), but this is OK. Note that copy_process() checks signal_pending()
after write_lock_irq(&tasklist_lock), that is why it helps.

Oleg.
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #15483 is a reply to message #15479] Wed, 01 August 2007 19:51 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 08/01, Dave Hansen wrote:
>
> > 	If the main thread is exiting, but is not the last thread in the
> > 	group, should we let it exit and let the next thread in the group
> > 	the reaper of the pid ns ?
> 
> Well, what happens with a multithreaded init today?

As it was already discussed, the current code is buggy, and should be
fixed.

Oleg.
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #15498 is a reply to message #15479] Thu, 02 August 2007 07:37 Go to previous messageGo to next message
Sukadev Bhattiprolu is currently offline  Sukadev Bhattiprolu
Messages: 413
Registered: August 2006
Senior Member
Dave Hansen [haveblue@us.ibm.com] wrote:
| On Tue, 2007-07-31 at 23:16 -0700, sukadev@us.ibm.com wrote:
| > Oleg Nesterov [oleg@tv-sign.ru] wrote:
| > | On 07/30, sukadev@us.ibm.com wrote:
| > | >
| > | > --- lx26-23-rc1-mm1.orig/kernel/exit.c	2007-07-26 20:08:16.000000000 -0700
| > | > +++ lx26-23-rc1-mm1/kernel/exit.c	2007-07-30 23:10:30.000000000 -0700
| > | > @@ -915,6 +915,7 @@ fastcall NORET_TYPE void do_exit(long co
| > | >  {
| > | >  	struct task_struct *tsk = current;
| > | >  	int group_dead;
| > | > +	struct pid_namespace *pid_ns = tsk->nsproxy->pid_ns;
| > | >  
| > | >  	profile_task_exit(tsk);
| > | >  
| > | > @@ -925,9 +926,10 @@ fastcall NORET_TYPE void do_exit(long co
| > | >  	if (unlikely(!tsk->pid))
| > | >  		panic("Attempted to kill the idle task!");
| > | >  	if (unlikely(tsk == task_child_reaper(tsk))) {
| > | > -		if (task_active_pid_ns(tsk) != &init_pid_ns)
| > | > -			task_active_pid_ns(tsk)->child_reaper =
| > | > -					init_pid_ns.child_reaper;
| > | > +		if (pid_ns != &init_pid_ns) {
| > | > +			zap_pid_ns_processes(pid_ns);
| > | > +			pid_ns->child_reaper = init_pid_ns.child_reaper;
| > | > +		}
| > | >  		else
| > | >  			panic("Attempted to kill init!");
| > | >  	}
| > | 
| > | Just to remind you, this is not right when init is multi-threaded,
| > | we should do this only when the last thread exits.
| > 
| > Sorry, I needed to clarify somethings about the multi-threaded init. I
| > got the impresssion that you were sending a patch for the existing bug,
| > and meant to review/clarify in the context of the patch.
| > 
| > Anyways, re: requirements for multi-threaded init:
| > 
| > 	Our current definition of is_container_init() and task_child_reaper()
| > 	refer only to the main-thread of the container-init (since they check
| > 	for pid_t == 1)
| 
| Remember, the "pid" is actually a tgid:
|         
|         asmlinkage long sys_getpid(void)
|         {
|                 return current->tgid;
|         }
| 
| So, there are multiple tasks with a "pid" == 1 with a multithreaded
| init. 

Yes, and so am now wondering if is_container_init(), is_global_init()
and the "tsk == task_child_reaper(tsk)" checks be replaced with with
something that covers other threads in the reaper ?

| 
| > 	If the main-thread is exiting and is the last thread in the group,
| > 	we want terminate other processes in the pid ns (simple case).
| > 
| > 	If the main thread is exiting, but is not the last thread in the
| > 	group, should we let it exit and let the next thread in the group
| > 	the reaper of the pid ns ?
| 
| Well, what happens with a multithreaded init today?
| 	
| -- Dave
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #15499 is a reply to message #15482] Thu, 02 August 2007 07:29 Go to previous messageGo to next message
Sukadev Bhattiprolu is currently offline  Sukadev Bhattiprolu
Messages: 413
Registered: August 2006
Senior Member
Oleg Nesterov [oleg@tv-sign.ru] wrote:
| On 07/31, sukadev@us.ibm.com wrote:
| >
| > Oleg Nesterov [oleg@tv-sign.ru] wrote:
| > | >  
| > | > @@ -925,9 +926,10 @@ fastcall NORET_TYPE void do_exit(long co
| > | >  	if (unlikely(!tsk->pid))
| > | >  		panic("Attempted to kill the idle task!");
| > | >  	if (unlikely(tsk == task_child_reaper(tsk))) {
| > | > -		if (task_active_pid_ns(tsk) != &init_pid_ns)
| > | > -			task_active_pid_ns(tsk)->child_reaper =
| > | > -					init_pid_ns.child_reaper;
| > | > +		if (pid_ns != &init_pid_ns) {
| > | > +			zap_pid_ns_processes(pid_ns);
| > | > +			pid_ns->child_reaper = init_pid_ns.child_reaper;
| > | > +		}
| > | >  		else
| > | >  			panic("Attempted to kill init!");
| > | >  	}
| > | 
| > | Just to remind you, this is not right when init is multi-threaded,
| > | we should do this only when the last thread exits.
| > 
| > Sorry, I needed to clarify somethings about the multi-threaded init. I
| > got the impresssion that you were sending a patch for the existing bug,
| > and meant to review/clarify in the context of the patch.
| 
| Ah, sorry, I forgot to send the patch to fix the bug in mainline.
| Will try to do tomorrow, please feel free to do this if you wish.

I can do that, but am still a bit confused about this multi-threaded
init :-)

| 
| > 	Our current definition of is_container_init() and task_child_reaper()
| > 	refer only to the main-thread of the container-init (since they check
| > 	for pid_t == 1)
| 
| Yes.

This means that we cannot have a check like "tsk == task_child_reaper(tsk)"
to properly detect the child reaper process right ? 

Its basically a very dumb question - How do we detect a container_init()
in the multi-threaded case ?  Should we use "task->tgid == 1" ?

IOW to identify if the last thread of a child reaper is exiting, should we
check "task->tgid == 1" and the "group_dead" flag  in do_exit() ?

| 
| > 	If the main-thread is exiting and is the last thread in the group,
| > 	we want terminate other processes in the pid ns (simple case).
| 
| Yes.
| 
| > 	If the main thread is exiting, but is not the last thread in the
| > 	group, should we let it exit and let the next thread in the group
| > 	the reaper of the pid ns ?
| 
| We can, but why? The main thread's task_struct can't go away until all
| sub-threads exit. Its ->nsproxy will be NULL, but this doesn't matter.

After the main thread exits task_child_reaper() would still refer to
the main thread right ? So when one of the other processes in the
namespace calls forget_original_parent(), it would reparent the process
to the main thread - no ? The main thread still has a valid task_struct,
but it has exited and cannot adapt children...

BTW, are there any actual users of multi-threaded init ? Or is this
something that can be considered outside the "core" patchset and
addressed soon, but separately like the signalling-container-init issue ?

| 
| > 	Then we would have the pid ns w/o a container-init (i.e reaper
| > 	does not have a pid_t == 1, but probably does not matter).
| > 
| > 	And, when this last thread is exiting, we want to terminate other
| > 	processes in the ns right ?
| 
| Yes, when this last thread is exiting, the entire process is exiting.
| 
| > | > +void zap_pid_ns_processes(struct pid_namespace *pid_ns)
| > | > +{
| > | > +	int nr;
| > | > +	int rc;
| > | > +	int options = WEXITED|__WALL;
| > | > +
| > | > +	/*
| > | > +	 * We know pid == 1 is terminating. Find remaining pid_ts
| > | > +	 * in the namespace, signal them and then wait for them
| > | > +	 * exit.
| > | > +	 */
| > | > +	nr = next_pidmap(pid_ns, 1);
| > | > +	while (nr > 0) {
| > | > +		kill_proc_info(SIGKILL, SEND_SIG_PRIV, nr);
| > | > +		nr = next_pidmap(pid_ns, nr);
| > | > +	}
| > | 
| > | Without tasklist_lock held this is not reliable.
| > 
| > Ok. BTW, find_ge_pid() also walks the pidmap, but does not seem to hold
| > the tasklist_lock. Is that bc its only used in /proc ?
| 
| Yes, but this is something different. With or without tasklist_lock,
| find_ge_pid()/next_tgid() is not "reliable" (note that alloc_pid() doesn't
| take tasklist), but this doesn't matter for /proc.
| 
| We should take tasklist_lock to prevent the new process creation.
| We can have the "false positives" (copy_process() in progress, PGID/SID
| pids), but this is OK. Note that copy_process() checks signal_pending()
| after write_lock_irq(&tasklist_lock), that is why it helps.

Ok. Thx.

| 
| Oleg.
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #15505 is a reply to message #15499] Thu, 02 August 2007 15:39 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 08/02, sukadev@us.ibm.com wrote:
>
> Oleg Nesterov [oleg@tv-sign.ru] wrote:
> | On 07/31, sukadev@us.ibm.com wrote:
> | >
> | > Oleg Nesterov [oleg@tv-sign.ru] wrote:
> | > | >  
> | > | > @@ -925,9 +926,10 @@ fastcall NORET_TYPE void do_exit(long co
> | > | >  	if (unlikely(!tsk->pid))
> | > | >  		panic("Attempted to kill the idle task!");
> | > | >  	if (unlikely(tsk == task_child_reaper(tsk))) {
> | > | > -		if (task_active_pid_ns(tsk) != &init_pid_ns)
> | > | > -			task_active_pid_ns(tsk)->child_reaper =
> | > | > -					init_pid_ns.child_reaper;
> | > | > +		if (pid_ns != &init_pid_ns) {
> | > | > +			zap_pid_ns_processes(pid_ns);
> | > | > +			pid_ns->child_reaper = init_pid_ns.child_reaper;

OOPS. I didn't notice this before, but this is not right too (regardless
of multi-threaded init problems).

We should not "reset" ->child_reaper here, we may have exiting tasks
which will re-parent their ->children to global init.

No, we are still /sbin/init of this namespace even if we are exiting,
->child_reaper should point to us, at least until zap_pid_ns_processes()
completes.

> | > 	Our current definition of is_container_init() and task_child_reaper()
> | > 	refer only to the main-thread of the container-init (since they check
> | > 	for pid_t == 1)
> | 
> | Yes.
> 
> This means that we cannot have a check like "tsk == task_child_reaper(tsk)"
> to properly detect the child reaper process right ? 

Yes, we should use "tsk->group_leader == task_child_reaper(tsk)"

> Its basically a very dumb question - How do we detect a container_init()
> in the multi-threaded case ?

Good point. I think is_container_init(tsk) needs a fix:

	-	pid = task_pid(tsk);
	+	pid = task_pid(tsk->group_leader);

Or, perhaps better, change the callers to use tsk->group_leader.

> Should we use "task->tgid == 1" ?

No, no, "task->tgid == 1" means "global" init.

> IOW to identify if the last thread of a child reaper is exiting, should we
> check "task->tgid == 1" and the "group_dead" flag  in do_exit() ?

See above, but yes, as I said before I think we should do this under
the "if (group_dead)" check below.

> | > 	If the main thread is exiting, but is not the last thread in the
> | > 	group, should we let it exit and let the next thread in the group
> | > 	the reaper of the pid ns ?
> | 
> | We can, but why? The main thread's task_struct can't go away until all
> | sub-threads exit. Its ->nsproxy will be NULL, but this doesn't matter.
> 
> After the main thread exits task_child_reaper() would still refer to
> the main thread right ? So when one of the other processes in the
> namespace calls forget_original_parent(), it would reparent the process
> to the main thread - no ? The main thread still has a valid task_struct,
> but it has exited and cannot adapt children...

Yes it can't, and yes, this is somewhat against the rules.

But, afaics, this should work. Because do_wait() from the alive sub-thread
still can reap the child, note that do_wait() iterates over all sub-threads
->children lists. Please note also that do_notify_parent() uses group
signal, so it will wake up some alive sub-thread.

This is wrong for the "normal" process (because when the last thread exits
main_thread->children is lost), but this seems to be OK for the /sbin/init,
exactly because we are doing zap_pid_ns_processes().

Sukadev, may I ask you to add a fat comment about this in your patch?

> BTW, are there any actual users of multi-threaded init ? Or is this
> something that can be considered outside the "core" patchset and
> addressed soon, but separately like the signalling-container-init issue ?

Well, I don't know. Please also see the reply to Kirill's message...

Oleg.
Re: [PATCH 1/15] Move exit_task_namespaces() [message #15506 is a reply to message #15295] Thu, 02 August 2007 16:18 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 07/26, Pavel Emelyanov wrote:
>
> The reason to release namespaces after reparenting is that when task
> exits it may send a signal to its parent (SIGCHLD), but if the parent
> has already exited its namespaces there will be no way to decide what
> pid to dever to him - parent can be from different namespace.

I almost forgot about this one...

After reading the whole series, I can't understand the above explanation
any longer. The parent can't be from different namespace, either we have
another sub-thread, or we reparent the child to /sbin/init which should
be from the same namespace.

If we are /sbin/init, we kill all tasks from the same namespace before
doing exit_notify().

Could you clarify?

Oleg.
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #15507 is a reply to message #15308] Thu, 02 August 2007 17:29 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 08/02, sukadev@us.ibm.com wrote:
>
> Oleg Nesterov [oleg@tv-sign.ru] wrote:
> | > | > | > +		if (pid_ns != &init_pid_ns) {
> | > | > | > +			zap_pid_ns_processes(pid_ns);
> | > | > | > +			pid_ns->child_reaper = init_pid_ns.child_reaper;
> | 
> | OOPS. I didn't notice this before, but this is not right too (regardless
> | of multi-threaded init problems).
> | 
> | We should not "reset" ->child_reaper here, we may have exiting tasks
> | which will re-parent their ->children to global init.
> | 
> | No, we are still /sbin/init of this namespace even if we are exiting,
> | ->child_reaper should point to us, at least until zap_pid_ns_processes()
> | completes.
> 
> Yes, we are resetting the reaper _after_ zap_pid_ns_processes() completes
> right ? (all other processes in the namespace must have exited).

OOPS again :) Can't understand how I managed to misread this code.

This means that we should take care about multi-thread init exit,
otherwise the non-root user can crash the kernel.

>From reply to Kirill's message:

	> Still. A non-root user does clone(CLONE_PIDNS), then clone(CLONE_THREAD),
	> and sys_exit() from the main thread, then proceeds with fork()s. Now this
	> ns has the global init as a child reaper, and admin can't kill entire pid_ns
	> by killing its init. Worse, (see the reply to Sukadev' message), we should
	> not reset pid_ns->child_reaper before zap_pid_ns_processes(). In that case
	> ->child_reaper points to the freed task when the last thread exits, this
	> means the non-root user can crash the kernel.

Oleg.
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #15508 is a reply to message #15505] Thu, 02 August 2007 17:20 Go to previous messageGo to next message
Sukadev Bhattiprolu is currently offline  Sukadev Bhattiprolu
Messages: 413
Registered: August 2006
Senior Member
Oleg Nesterov [oleg@tv-sign.ru] wrote:
| On 08/02, sukadev@us.ibm.com wrote:
| >
| > Oleg Nesterov [oleg@tv-sign.ru] wrote:
| > | On 07/31, sukadev@us.ibm.com wrote:
| > | >
| > | > Oleg Nesterov [oleg@tv-sign.ru] wrote:
| > | > | >  
| > | > | > @@ -925,9 +926,10 @@ fastcall NORET_TYPE void do_exit(long co
| > | > | >  	if (unlikely(!tsk->pid))
| > | > | >  		panic("Attempted to kill the idle task!");
| > | > | >  	if (unlikely(tsk == task_child_reaper(tsk))) {
| > | > | > -		if (task_active_pid_ns(tsk) != &init_pid_ns)
| > | > | > -			task_active_pid_ns(tsk)->child_reaper =
| > | > | > -					init_pid_ns.child_reaper;
| > | > | > +		if (pid_ns != &init_pid_ns) {
| > | > | > +			zap_pid_ns_processes(pid_ns);
| > | > | > +			pid_ns->child_reaper = init_pid_ns.child_reaper;
| 
| OOPS. I didn't notice this before, but this is not right too (regardless
| of multi-threaded init problems).
| 
| We should not "reset" ->child_reaper here, we may have exiting tasks
| which will re-parent their ->children to global init.
| 
| No, we are still /sbin/init of this namespace even if we are exiting,
| ->child_reaper should point to us, at least until zap_pid_ns_processes()
| completes.

Yes, we are resetting the reaper _after_ zap_pid_ns_processes() completes
right ? (all other processes in the namespace must have exited).


| 
| > | > 	Our current definition of is_container_init() and task_child_reaper()
| > | > 	refer only to the main-thread of the container-init (since they check
| > | > 	for pid_t == 1)
| > | 
| > | Yes.
| > 
| > This means that we cannot have a check like "tsk == task_child_reaper(tsk)"
| > to properly detect the child reaper process right ? 
| 
| Yes, we should use "tsk->group_leader == task_child_reaper(tsk)"

Ok.

| 
| > Its basically a very dumb question - How do we detect a container_init()
| > in the multi-threaded case ?
| 
| Good point. I think is_container_init(tsk) needs a fix:
| 
| 	-	pid = task_pid(tsk);
| 	+	pid = task_pid(tsk->group_leader);
| 

Ok.

| Or, perhaps better, change the callers to use tsk->group_leader.

Ok.
| 
| > Should we use "task->tgid == 1" ?
| 
| No, no, "task->tgid == 1" means "global" init.

Grr. I got that mixed up bw my implm and Pavel's :-) task->pid and
task->tgid referred to "active-pid-ns pid" in mine.

| 
| > IOW to identify if the last thread of a child reaper is exiting, should we
| > check "task->tgid == 1" and the "group_dead" flag  in do_exit() ?
| 
| See above, but yes, as I said before I think we should do this under
| the "if (group_dead)" check below.
| 
| > | > 	If the main thread is exiting, but is not the last thread in the
| > | > 	group, should we let it exit and let the next thread in the group
| > | > 	the reaper of the pid ns ?
| > | 
| > | We can, but why? The main thread's task_struct can't go away until all
| > | sub-threads exit. Its ->nsproxy will be NULL, but this doesn't matter.
| > 
| > After the main thread exits task_child_reaper() would still refer to
| > the main thread right ? So when one of the other processes in the
| > namespace calls forget_original_parent(), it would reparent the process
| > to the main thread - no ? The main thread still has a valid task_struct,
| > but it has exited and cannot adapt children...
| 
| Yes it can't, and yes, this is somewhat against the rules.
| 
| But, afaics, this should work. Because do_wait() from the alive sub-thread
| still can reap the child, note that do_wait() iterates over all sub-threads
| ->children lists. Please note also that do_notify_parent() uses group
| signal, so it will wake up some alive sub-thread.
| 
| This is wrong for the "normal" process (because when the last thread exits
| main_thread->children is lost), but this seems to be OK for the /sbin/init,
| exactly because we are doing zap_pid_ns_processes().
| 
| Sukadev, may I ask you to add a fat comment about this in your patch?

Sure.

| 
| > BTW, are there any actual users of multi-threaded init ? Or is this
| > something that can be considered outside the "core" patchset and
| > addressed soon, but separately like the signalling-container-init issue ?
| 
| Well, I don't know. Please also see the reply to Kirill's message...
| 
| Oleg.
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #15510 is a reply to message #15507] Thu, 02 August 2007 18:36 Go to previous messageGo to next message
Sukadev Bhattiprolu is currently offline  Sukadev Bhattiprolu
Messages: 413
Registered: August 2006
Senior Member
Oleg Nesterov [oleg@tv-sign.ru] wrote:
| On 08/02, sukadev@us.ibm.com wrote:
| >
| > Oleg Nesterov [oleg@tv-sign.ru] wrote:
| > | > | > | > +		if (pid_ns != &init_pid_ns) {
| > | > | > | > +			zap_pid_ns_processes(pid_ns);
| > | > | > | > +			pid_ns->child_reaper = init_pid_ns.child_reaper;
| > | 
| > | OOPS. I didn't notice this before, but this is not right too (regardless
| > | of multi-threaded init problems).
| > | 
| > | We should not "reset" ->child_reaper here, we may have exiting tasks
| > | which will re-parent their ->children to global init.
| > | 
| > | No, we are still /sbin/init of this namespace even if we are exiting,
| > | ->child_reaper should point to us, at least until zap_pid_ns_processes()
| > | completes.
| > 
| > Yes, we are resetting the reaper _after_ zap_pid_ns_processes() completes
| > right ? (all other processes in the namespace must have exited).
| 
| OOPS again :) Can't understand how I managed to misread this code.
| 
| This means that we should take care about multi-thread init exit,
| otherwise the non-root user can crash the kernel.
| 
| >From reply to Kirill's message:
| 
| 	> Still. A non-root user does clone(CLONE_PIDNS), then clone(CLONE_THREAD),

Agree we should fix the crash. But we need CAP_SYS_ADMIN to clone
pid or other namespaces - this is enforced in copy_namespaces() and
unshare_nsproxy_namespaces()

Suka
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #15512 is a reply to message #15510] Thu, 02 August 2007 18:49 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 08/02, sukadev@us.ibm.com wrote:
>
> Oleg Nesterov [oleg@tv-sign.ru] wrote:
> | 
> | This means that we should take care about multi-thread init exit,
> | otherwise the non-root user can crash the kernel.
> | 
> | >From reply to Kirill's message:
> | 
> | 	> Still. A non-root user does clone(CLONE_PIDNS), then clone(CLONE_THREAD),
> 
> Agree we should fix the crash. But we need CAP_SYS_ADMIN to clone
> pid or other namespaces - this is enforced in copy_namespaces() and
> unshare_nsproxy_namespaces()

Hmm. sys_unshare(CLONE_PIDNS) doesn't (and shouldn't) work anyway, but
I don't see the CAP_SYS_ADMIN check in copy_process()->copy_namespaces()
path.

Perhaps I just missed it (sorry, I already cleared my mbox, so I can't
look at theses patches), but is it a good idea to require CAP_SYS_ADMIN?
I think it would be nice if a normal user can create containers, no?

Oleg.
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #15514 is a reply to message #15512] Thu, 02 August 2007 19:13 Go to previous messageGo to next message
serue is currently offline  serue
Messages: 750
Registered: February 2006
Senior Member
Quoting Oleg Nesterov (oleg@tv-sign.ru):
> On 08/02, sukadev@us.ibm.com wrote:
> >
> > Oleg Nesterov [oleg@tv-sign.ru] wrote:
> > | 
> > | This means that we should take care about multi-thread init exit,
> > | otherwise the non-root user can crash the kernel.
> > | 
> > | >From reply to Kirill's message:
> > | 
> > | 	> Still. A non-root user does clone(CLONE_PIDNS), then clone(CLONE_THREAD),
> > 
> > Agree we should fix the crash. But we need CAP_SYS_ADMIN to clone
> > pid or other namespaces - this is enforced in copy_namespaces() and
> > unshare_nsproxy_namespaces()
> 
> Hmm. sys_unshare(CLONE_PIDNS) doesn't (and shouldn't) work anyway, but
> I don't see the CAP_SYS_ADMIN check in copy_process()->copy_namespaces()
> path.
> 
> Perhaps I just missed it (sorry, I already cleared my mbox, so I can't
> look at theses patches), but is it a good idea to require CAP_SYS_ADMIN?
> I think it would be nice if a normal user can create containers, no?

For pid namespaces I can't think of any reason why CAP_SYS_ADMIN should
be needed, since you can't hide processes that way.  Same for uts
namespaces.

However for ipc and mount namespaces I'd want to think about it some
more.  Any case I can think of right now where they'd be unsafe, is
unsafe anyway, so maybe it's fine...

-serge
Re: [PATCH 1/15] Move exit_task_namespaces() [message #15569 is a reply to message #15506] Mon, 06 August 2007 08:00 Go to previous messageGo to previous message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> On 07/26, Pavel Emelyanov wrote:
>> The reason to release namespaces after reparenting is that when task
>> exits it may send a signal to its parent (SIGCHLD), but if the parent
>> has already exited its namespaces there will be no way to decide what
>> pid to dever to him - parent can be from different namespace.
> 
> I almost forgot about this one...
> 
> After reading the whole series, I can't understand the above explanation
> any longer. The parent can't be from different namespace, either we have
> another sub-thread, or we reparent the child to /sbin/init which should
> be from the same namespace.

If the child that is a new namespace's init is exiting its parent is from the
different namespace. Moreover, we will probably want to implement "entering"
the pid namespace, so  having tasks with parents from another namespace will
be OK.

> If we are /sbin/init, we kill all tasks from the same namespace before
> doing exit_notify().
> 
> Could you clarify?
> 
> Oleg.
> 
>
Previous Topic: Re: [RFC, PATCH] handle the multi-threaded init's exit() properly
Next Topic: [PATCH 0/14] sysfs cleanups
Goto Forum:
  


Current Time: Wed Oct 16 05:10:38 GMT 2024

Total time taken to generate the page: 0.05039 seconds