OpenVZ Forum


Home » Mailing lists » Devel » [RFC][PATCH 0/15] Pid namespaces
Re: [PATCH 1/15] Move exit_task_namespaces() [message #15574 is a reply to message #15569] Mon, 06 August 2007 09:54 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 08/06, Pavel Emelyanov wrote:
> Oleg Nesterov wrote:
> >On 07/26, Pavel Emelyanov wrote:
> >>The reason to release namespaces after reparenting is that when task
> >>exits it may send a signal to its parent (SIGCHLD), but if the parent
> >>has already exited its namespaces there will be no way to decide what
> >>pid to dever to him - parent can be from different namespace.
> >
> >I almost forgot about this one...
> >
> >After reading the whole series, I can't understand the above explanation
> >any longer. The parent can't be from different namespace, either we have
> >another sub-thread, or we reparent the child to /sbin/init which should
> >be from the same namespace.
> 
> If the child that is a new namespace's init is exiting its parent is from the
> different namespace.

In that case it doesn't have childs. The were SIGKILL'ed before exit_notify().

> Moreover, we will probably want to implement "entering"
> the pid namespace, so  having tasks with parents from another namespace will
> be OK.

Well. I saw this word "entering", but I don't know the meaning. Just curious,
could you explain?

And, if an exiting task has a child which is already from another namespace,
why can't we release our namespace before re-parenting? I guess I need to
know what "entering" means to understand this...

Oleg.
Re: [PATCH 1/15] Move exit_task_namespaces() [message #15575 is a reply to message #15574] Mon, 06 August 2007 09:58 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> On 08/06, Pavel Emelyanov wrote:
>> Oleg Nesterov wrote:
>>> On 07/26, Pavel Emelyanov wrote:
>>>> The reason to release namespaces after reparenting is that when task
>>>> exits it may send a signal to its parent (SIGCHLD), but if the parent
>>>> has already exited its namespaces there will be no way to decide what
>>>> pid to dever to him - parent can be from different namespace.
>>> I almost forgot about this one...
>>>
>>> After reading the whole series, I can't understand the above explanation
>>> any longer. The parent can't be from different namespace, either we have
>>> another sub-thread, or we reparent the child to /sbin/init which should
>>> be from the same namespace.
>> If the child that is a new namespace's init is exiting its parent is from the
>> different namespace.
> 
> In that case it doesn't have childs. The were SIGKILL'ed before exit_notify().

It does not, but it's parent - does :)

>> Moreover, we will probably want to implement "entering"
>> the pid namespace, so  having tasks with parents from another namespace will
>> be OK.
> 
> Well. I saw this word "entering", but I don't know the meaning. Just curious,
> could you explain?

"Entering" means "moving task to arbitrary namespace"

> And, if an exiting task has a child which is already from another namespace,
> why can't we release our namespace before re-parenting? I guess I need to
> know what "entering" means to understand this...

One of the desired actions was the following:
1. task X clones the new namespace with the child Y as this namespace's init;
2. task X waits for SIGCHILD to come informing that the namespace is dead.
In this scenario we need to set the Y's pid as it is seen from X's 
namespace in siginfo.

> Oleg.
> 
>
Re: [PATCH 1/15] Move exit_task_namespaces() [message #15576 is a reply to message #15575] Mon, 06 August 2007 10:38 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 08/06, Pavel Emelyanov wrote:
>
> Oleg Nesterov wrote:
> >On 08/06, Pavel Emelyanov wrote:
> >>Oleg Nesterov wrote:
> >>>On 07/26, Pavel Emelyanov wrote:
> >>>>The reason to release namespaces after reparenting is that when task
> >>>>exits it may send a signal to its parent (SIGCHLD), but if the parent
> >>>>has already exited its namespaces there will be no way to decide what
> >>>>pid to dever to him - parent can be from different namespace.
> >>>I almost forgot about this one...
> >>>
> >>>After reading the whole series, I can't understand the above explanation
> >>>any longer. The parent can't be from different namespace, either we have
> >>>another sub-thread, or we reparent the child to /sbin/init which should
> >>>be from the same namespace.
> >>If the child that is a new namespace's init is exiting its parent is from 
> >>the
> >>different namespace.
> >
> >In that case it doesn't have childs. The were SIGKILL'ed before 
> >exit_notify().
> 
> It does not, but it's parent - does :)

Yes. But in that case forget_original_parent() is no-op! This means that
it is not needed to move exit_task_namepsace() after.

> >>Moreover, we will probably want to implement "entering"
> >>the pid namespace, so  having tasks with parents from another namespace 
> >>will
> >>be OK.
> >
> >Well. I saw this word "entering", but I don't know the meaning. Just 
> >curious,
> >could you explain?
> 
> "Entering" means "moving task to arbitrary namespace"

Heh. Very much nontrivial, good luck :) At least we need to grow ->numbers[],
and if its pid was pinned...

> >And, if an exiting task has a child which is already from another 
> >namespace,
> >why can't we release our namespace before re-parenting? I guess I need to
> >know what "entering" means to understand this...
> 
> One of the desired actions was the following:
> 1. task X clones the new namespace with the child Y as this namespace's 
> init;
> 2. task X waits for SIGCHILD to come informing that the namespace is dead.
> In this scenario we need to set the Y's pid as it is seen from X's 
> namespace in siginfo.

Yes sure. But again, how this is connected to "we should do exit_task_namespace()
after forget_original_parent()" ?

I guess I missed something stupid and simple...

Oleg.
Re: [PATCH 1/15] Move exit_task_namespaces() [message #15578 is a reply to message #15576] Mon, 06 August 2007 11:21 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> On 08/06, Pavel Emelyanov wrote:
>> Oleg Nesterov wrote:
>>> On 08/06, Pavel Emelyanov wrote:
>>>> Oleg Nesterov wrote:
>>>>> On 07/26, Pavel Emelyanov wrote:
>>>>>> The reason to release namespaces after reparenting is that when task
>>>>>> exits it may send a signal to its parent (SIGCHLD), but if the parent
>>>>>> has already exited its namespaces there will be no way to decide what
>>>>>> pid to dever to him - parent can be from different namespace.
>>>>> I almost forgot about this one...
>>>>>
>>>>> After reading the whole series, I can't understand the above explanation
>>>>> any longer. The parent can't be from different namespace, either we have
>>>>> another sub-thread, or we reparent the child to /sbin/init which should
>>>>> be from the same namespace.
>>>> If the child that is a new namespace's init is exiting its parent is from 
>>>> the
>>>> different namespace.
>>> In that case it doesn't have childs. The were SIGKILL'ed before 
>>> exit_notify().
>> It does not, but it's parent - does :)
> 
> Yes. But in that case forget_original_parent() is no-op! This means that
> it is not needed to move exit_task_namepsace() after.
> 
>>>> Moreover, we will probably want to implement "entering"
>>>> the pid namespace, so  having tasks with parents from another namespace 
>>>> will
>>>> be OK.
>>> Well. I saw this word "entering", but I don't know the meaning. Just 
>>> curious,
>>> could you explain?
>> "Entering" means "moving task to arbitrary namespace"
> 
> Heh. Very much nontrivial, good luck :) At least we need to grow ->numbers[],
> and if its pid was pinned...
> 
>>> And, if an exiting task has a child which is already from another 
>>> namespace,
>>> why can't we release our namespace before re-parenting? I guess I need to
>>> know what "entering" means to understand this...
>> One of the desired actions was the following:
>> 1. task X clones the new namespace with the child Y as this namespace's 
>> init;
>> 2. task X waits for SIGCHILD to come informing that the namespace is dead.
>> In this scenario we need to set the Y's pid as it is seen from X's 
>> namespace in siginfo.
> 
> Yes sure. But again, how this is connected to "we should do exit_task_namespace()
> after forget_original_parent()" ?
> 
> I guess I missed something stupid and simple...

If task X is exiting and has already exit_task_namespaces()-ed task
Y will OOPs during its exit in determining parent's namespace. I agree 
that in that case this is not important what namespace X belongs to, 
but we need to handle the race with changing the nsproxy from not-NULL
to NULL. This is OK to make this under task_lock() but what to add 
extra locking for if we can avoid it?

> Oleg.
> 
>
Re: [PATCH 1/15] Move exit_task_namespaces() [message #15579 is a reply to message #15576] Mon, 06 August 2007 11:29 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> On 08/06, Pavel Emelyanov wrote:
>> Oleg Nesterov wrote:
>>> On 08/06, Pavel Emelyanov wrote:
>>>> Oleg Nesterov wrote:
>>>>> On 07/26, Pavel Emelyanov wrote:
>>>>>> The reason to release namespaces after reparenting is that when task
>>>>>> exits it may send a signal to its parent (SIGCHLD), but if the parent
>>>>>> has already exited its namespaces there will be no way to decide what
>>>>>> pid to dever to him - parent can be from different namespace.
>>>>> I almost forgot about this one...
>>>>>
>>>>> After reading the whole series, I can't understand the above explanation
>>>>> any longer. The parent can't be from different namespace, either we have
>>>>> another sub-thread, or we reparent the child to /sbin/init which should
>>>>> be from the same namespace.
>>>> If the child that is a new namespace's init is exiting its parent is from 
>>>> the
>>>> different namespace.
>>> In that case it doesn't have childs. The were SIGKILL'ed before 
>>> exit_notify().
>> It does not, but it's parent - does :)
> 
> Yes. But in that case forget_original_parent() is no-op! This means that
> it is not needed to move exit_task_namepsace() after.
> 
>>>> Moreover, we will probably want to implement "entering"
>>>> the pid namespace, so  having tasks with parents from another namespace 
>>>> will
>>>> be OK.
>>> Well. I saw this word "entering", but I don't know the meaning. Just 
>>> curious,
>>> could you explain?
>> "Entering" means "moving task to arbitrary namespace"
> 
> Heh. Very much nontrivial, good luck :) At least we need to grow ->numbers[],
> and if its pid was pinned...
> 
>>> And, if an exiting task has a child which is already from another 
>>> namespace,
>>> why can't we release our namespace before re-parenting? I guess I need to
>>> know what "entering" means to understand this...
>> One of the desired actions was the following:
>> 1. task X clones the new namespace with the child Y as this namespace's 
>> init;
>> 2. task X waits for SIGCHILD to come informing that the namespace is dead.
>> In this scenario we need to set the Y's pid as it is seen from X's 
>> namespace in siginfo.
> 
> Yes sure. But again, how this is connected to "we should do exit_task_namespace()
> after forget_original_parent()" ?
> 
> I guess I missed something stupid and simple...

In other words. Let task X live in init_pid_ns, task Y is his child and lives
int another namespace. task X and task Y both die. This will happen:

1. Task X call exit_task_namespaces() 
   and sets its nsproxy to NULL
                                        1. Task Y is going to notify the 
                                           parent (X) and dereferences its
                                           nsproxy -> OOPS
2. Task X reparents all its children

If we move the exit_task_namespace this will happen:

1. Task X reparents all its children
2. Task X call exit_task_namespaces() 
   and sets its nsproxy to NULL

In such case is tasy Y will dereference the parent's nsproxy it will not
OOPS because either its parent will be not X already, or X's nsproxy is
not yet released.

> Oleg.
> 
>
Re: [PATCH 1/15] Move exit_task_namespaces() [message #15585 is a reply to message #15579] Mon, 06 August 2007 12:48 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 08/06, Pavel Emelyanov wrote:
>
> Oleg Nesterov wrote:
> >On 08/06, Pavel Emelyanov wrote:
> >>Oleg Nesterov wrote:
> >>>On 08/06, Pavel Emelyanov wrote:
> >>>>Oleg Nesterov wrote:
> >>>>>On 07/26, Pavel Emelyanov wrote:
> >>>>>>The reason to release namespaces after reparenting is that when task
> >>>>>>exits it may send a signal to its parent (SIGCHLD), but if the parent
> >>>>>>has already exited its namespaces there will be no way to decide what
> >>>>>>pid to dever to him - parent can be from different namespace.
> >>>>>I almost forgot about this one...
> >>>>>
> >I guess I missed something stupid and simple...
> 
> In other words. Let task X live in init_pid_ns, task Y is his child and 
> lives
> int another namespace. task X and task Y both die. This will happen:
> 
> 1. Task X call exit_task_namespaces() 
>   and sets its nsproxy to NULL

Ah, got it, thanks. So the problem is not namespace itself (parent's or
child's), there are still valid (even if different but related).

We just can't get ->parent->nsproxy. I was greatly confused by the "parent
can be from different namespace" above. We have exactly same problem if
namespaces are not differ.

IOW, the problem is: we can't clear ->nsproxy (exit_task_namespaces) until
we get rid of ->children. This have nothing to do with different namespace.

Oleg.
Re: [PATCH 1/15] Move exit_task_namespaces() [message #15586 is a reply to message #15578] Mon, 06 August 2007 12:52 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 08/06, Pavel Emelyanov wrote:
>
> If task X is exiting and has already exit_task_namespaces()-ed task
> Y will OOPs during its exit in determining parent's namespace. I agree 
> that in that case this is not important what namespace X belongs to, 
> but we need to handle the race with changing the nsproxy from not-NULL
> to NULL. This is OK to make this under task_lock() but what to add 
> extra locking for if we can avoid it?

No, we can't take task_lock() under write_lock(tasklist).

Oleg.
Re: [PATCH 1/15] Move exit_task_namespaces() [message #15587 is a reply to message #15585] Mon, 06 August 2007 13:36 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> On 08/06, Pavel Emelyanov wrote:
>> Oleg Nesterov wrote:
>>> On 08/06, Pavel Emelyanov wrote:
>>>> Oleg Nesterov wrote:
>>>>> On 08/06, Pavel Emelyanov wrote:
>>>>>> Oleg Nesterov wrote:
>>>>>>> On 07/26, Pavel Emelyanov wrote:
>>>>>>>> The reason to release namespaces after reparenting is that when task
>>>>>>>> exits it may send a signal to its parent (SIGCHLD), but if the parent
>>>>>>>> has already exited its namespaces there will be no way to decide what
>>>>>>>> pid to dever to him - parent can be from different namespace.
>>>>>>> I almost forgot about this one...
>>>>>>>
>>> I guess I missed something stupid and simple...
>> In other words. Let task X live in init_pid_ns, task Y is his child and 
>> lives
>> int another namespace. task X and task Y both die. This will happen:
>>
>> 1. Task X call exit_task_namespaces() 
>>   and sets its nsproxy to NULL
> 
> Ah, got it, thanks. So the problem is not namespace itself (parent's or
> child's), there are still valid (even if different but related).
> 
> We just can't get ->parent->nsproxy. I was greatly confused by the "parent
> can be from different namespace" above. We have exactly same problem if
> namespaces are not differ.
> 
> IOW, the problem is: we can't clear ->nsproxy (exit_task_namespaces) until
> we get rid of ->children. This have nothing to do with different namespace.

No. If the parent is always in the same namespace we do not need to
get its nsproxy :) Problem is exactly in that the parent's namespace
is to be known.

> Oleg.
> 
>
Re: [PATCH 1/15] Move exit_task_namespaces() [message #15588 is a reply to message #15586] Mon, 06 August 2007 13:38 Go to previous messageGo to next message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
Oleg Nesterov wrote:
> On 08/06, Pavel Emelyanov wrote:
>> If task X is exiting and has already exit_task_namespaces()-ed task
>> Y will OOPs during its exit in determining parent's namespace. I agree 
>> that in that case this is not important what namespace X belongs to, 
>> but we need to handle the race with changing the nsproxy from not-NULL
>> to NULL. This is OK to make this under task_lock() but what to add 
>> extra locking for if we can avoid it?
> 
> No, we can't take task_lock() under write_lock(tasklist).

I meant that we could save the namspace earlier (w/o tasklist),
carry it up to the place we need and release it later. But all
this is too complex and it's easier to get rid of children and
then release the namespaces.

> Oleg.
> 
>
Re: [PATCH 1/15] Move exit_task_namespaces() [message #15589 is a reply to message #15587] Mon, 06 August 2007 13:55 Go to previous message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 08/06, Pavel Emelyanov wrote:
>
> Oleg Nesterov wrote:
> >On 08/06, Pavel Emelyanov wrote:
> >>Oleg Nesterov wrote:
> >>>On 08/06, Pavel Emelyanov wrote:
> >>>>Oleg Nesterov wrote:
> >>>>>On 08/06, Pavel Emelyanov wrote:
> >>>>>>Oleg Nesterov wrote:
> >>>>>>>On 07/26, Pavel Emelyanov wrote:
> >>>>>>>>The reason to release namespaces after reparenting is that when task
> >>>>>>>>exits it may send a signal to its parent (SIGCHLD), but if the 
> >>>>>>>>parent
> >>>>>>>>has already exited its namespaces there will be no way to decide 
> >>>>>>>>what
> >>>>>>>>pid to dever to him - parent can be from different namespace.
> >>>>>>>I almost forgot about this one...
> >>>>>>>
> >>>I guess I missed something stupid and simple...
> >>In other words. Let task X live in init_pid_ns, task Y is his child and 
> >>lives
> >>int another namespace. task X and task Y both die. This will happen:
> >>
> >>1. Task X call exit_task_namespaces() 
> >>  and sets its nsproxy to NULL
> >
> >Ah, got it, thanks. So the problem is not namespace itself (parent's or
> >child's), there are still valid (even if different but related).
> >
> >We just can't get ->parent->nsproxy. I was greatly confused by the "parent
> >can be from different namespace" above. We have exactly same problem if
> >namespaces are not differ.
> >
> >IOW, the problem is: we can't clear ->nsproxy (exit_task_namespaces) until
> >we get rid of ->children. This have nothing to do with different namespace.
> 
> No. If the parent is always in the same namespace we do not need to
> get its nsproxy :) Problem is exactly in that the parent's namespace
> is to be known.

Yes yes, I see. I meant: once do_notify_parent() was modified to use
parent->nsproxy to figure out correct pid_t, that problem has nothing
to do with namespaces, it is just parent->nsproxy access.

But this is not safe, btw? do_notify_parent() can get parent->nsproxy
which is under destruction (sys_unshare). Then we read its ->pid_ns,
but at this time "struct nsproxy" could be kmem_cache_free()'ed ?

Of course, this is just theoretical, irqs are disabled, and the window
is tiny.

Oleg.
Re: [PATCH 1/15] Move exit_task_namespaces() [message #19457 is a reply to message #15320] Thu, 26 July 2007 16:59 Go to previous message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Oleg Nesterov wrote:
> On 07/26, Pavel Emelyanov wrote:
> 
>>Make task release its namespaces after it has reparented all his
>>children to child_reaper, but before it notifies its parent about
>>its death.
>>
>>The reason to release namespaces after reparenting is that when task
>>exits it may send a signal to its parent (SIGCHLD), but if the parent
>>has already exited its namespaces there will be no way to decide what
>>pid to dever to him - parent can be from different namespace.
>>
>>The reason to release namespace before notifying the parent it that
>>when task sends a SIGCHLD to parent it can call wait() on this taks
>>and release it. But releasing the mnt namespace implies dropping
>>of all the mounts in the mnt namespace and NFS expects the task to
>>have valid sighand pointer.
>>
>>Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
>>
>>---
>>
>>exit.c |    5 ++++-
>>1 files changed, 4 insertions(+), 1 deletion(-)
>>
>>diff -upr linux-2.6.23-rc1-mm1.orig/kernel/exit.c 
>>linux-2.6.23-rc1-mm1-7/kernel/exit.c
>>--- linux-2.6.23-rc1-mm1.orig/kernel/exit.c	2007-07-26 
>>16:34:45.000000000 +0400
>>+++ linux-2.6.23-rc1-mm1-7/kernel/exit.c	2007-07-26 
>>16:36:37.000000000 +0400
>>@@ -788,6 +804,10 @@ static void exit_notify(struct task_stru
>>	BUG_ON(!list_empty(&tsk->children));
>>	BUG_ON(!list_empty(&tsk->ptrace_children));
>>
>>+	write_unlock_irq(&tasklist_lock);
>>+	exit_task_namespaces(tsk);
>>+	write_lock_irq(&tasklist_lock);
> 
> 
> No.
> 
> We "cleared" our ->children/->ptrace_children lists. Now suppose that
> another thread dies, and its forget_original_parent() choose us as a
> new reaper before we re-take tasklist.
> 
> I'll try to read other patches tomorrow, but I can't avoid a stupid
> question: can we have a CONFIG_ for that? This series adds a lot of
> complications.

the was a request from many people including Andrew that CONFIG_XXX
is bad approach.

> OK, I guess the answer is "it is very difficult t achieve", but can't
> help myself :)

First versions of Pavel's patches had CONFIG_XXX for this.

Thanks,
Kirill
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [RFC][PATCH 0/15] Pid namespaces [message #19458 is a reply to message #15329] Fri, 27 July 2007 06:08 Go to previous message
Sukadev Bhattiprolu is currently offline  Sukadev Bhattiprolu
Messages: 413
Registered: August 2006
Senior Member
sukadev@us.ibm.com [sukadev@us.ibm.com] wrote:
| Pavel,
| 
| We seem to have a memory leak. Its new in this patchset (i.e the
| following test ran fine on the 2.6.22-rc6-mm1 patchset).
| 
| To repro: run "pidns_exec ./mypid" in a tight-loop  - where mypid.c
| is:
| 
| #include <stdio.h>
| #include <unistd.h>
| 
| main()
| {
|         printf("Pid %d, Ppid %d, Pgid %d, Sid %d\n",
|                  getpid(), getppid(), getpgid(0), getsid(0));
| }
| 
| I ran into OOM in about 30 mins. I am still investigating.
| 
| BTW, can we include a simple test program like the pidns_exec in this
| patch-0 for whoever want to play with pidns ?
| 
| Suka
| 

I think the problem is in create_pid_namespace(). kref_init() sets the
refcount to 1 and then we do a get_pid_ns() which sets it to 2.

free_nsproxy() frees one of this references, but the other is never
freed.

This patch seems to fix the leak. The patch also creates a slab cache
for the pid_namespace

---

Create a slab-cache for 'struct pid_namespace' and fix a memory leak
due to an extra reference in create_pid_namespace().

---
 kernel/pid.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Index: lx26-23-rc1-mm1/kernel/pid.c
===================================================================
--- lx26-23-rc1-mm1.orig/kernel/pid.c	2007-07-26 20:08:16.000000000 -0700
+++ lx26-23-rc1-mm1/kernel/pid.c	2007-07-26 22:31:42.000000000 -0700
@@ -35,6 +35,7 @@
 static struct hlist_head *pid_hash;
 static int pidhash_shift;
 struct pid init_struct_pid = INIT_STRUCT_PID;
+static struct kmem_cache *pid_ns_cachep;
 
 int pid_max = PID_MAX_DEFAULT;
 
@@ -525,7 +526,7 @@ static struct pid_namespace *create_pid_
 	struct pid_namespace *ns;
 	int i;
 
-	ns = kmalloc(sizeof(struct pid_namespace), GFP_KERNEL);
+	ns = kmem_cache_alloc(pid_ns_cachep, GFP_KERNEL);
 	if (ns == NULL)
 		goto out;
 
@@ -544,7 +545,6 @@ static struct pid_namespace *create_pid_
 
 	set_bit(0, ns->pidmap[0].page);
 	atomic_set(&ns->pidmap[0].nr_free, BITS_PER_PAGE - 1);
-	get_pid_ns(ns);
 
 	for (i = 1; i < PIDMAP_ENTRIES; i++) {
 		ns->pidmap[i].page = 0;
@@ -556,7 +556,7 @@ static struct pid_namespace *create_pid_
 out_free_map:
 	kfree(ns->pidmap[0].page);
 out_free:
-	kfree(ns);
+	kmem_cache_free(pid_ns_cachep, ns);
 out:
 	return ERR_PTR(-ENOMEM);
 }
@@ -567,7 +567,7 @@ static void destroy_pid_namespace(struct
 
 	for (i = 0; i < PIDMAP_ENTRIES; i++)
 		kfree(ns->pidmap[i].page);
-	kfree(ns);
+	kmem_cache_free(pid_ns_cachep, ns);
 }
 
   struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old_ns)
@@ -687,4 +687,6 @@ void __init pidmap_init(void)
 	init_pid_ns.pid_cachep = create_pid_cachep(1);
 	if (init_pid_ns.pid_cachep == NULL)
 		panic("Can't create pid_1 cachep\n");
+
+	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
 }
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [RFC][PATCH 0/15] Pid namespaces [message #19459 is a reply to message #19458] Fri, 27 July 2007 06:47 Go to previous message
Pavel Emelianov is currently offline  Pavel Emelianov
Messages: 1149
Registered: September 2006
Senior Member
sukadev@us.ibm.com wrote:
> sukadev@us.ibm.com [sukadev@us.ibm.com] wrote:
> | Pavel,
> | 
> | We seem to have a memory leak. Its new in this patchset (i.e the
> | following test ran fine on the 2.6.22-rc6-mm1 patchset).
> | 
> | To repro: run "pidns_exec ./mypid" in a tight-loop  - where mypid.c
> | is:
> | 
> | #include <stdio.h>
> | #include <unistd.h>
> | 
> | main()
> | {
> |         printf("Pid %d, Ppid %d, Pgid %d, Sid %d\n",
> |                  getpid(), getppid(), getpgid(0), getsid(0));
> | }
> | 
> | I ran into OOM in about 30 mins. I am still investigating.
> | 
> | BTW, can we include a simple test program like the pidns_exec in this
> | patch-0 for whoever want to play with pidns ?
> | 
> | Suka
> | 
> 
> I think the problem is in create_pid_namespace(). kref_init() sets the
> refcount to 1 and then we do a get_pid_ns() which sets it to 2.
> 
> free_nsproxy() frees one of this references, but the other is never
> freed.

That's it! I've switched from my old fast (and bad) reference counting
scheme to new slow (and correct) one, but forgot to remove this "get".
This is the real problem. Thanks :)

> This patch seems to fix the leak. The patch also creates a slab cache
> for the pid_namespace

OK, I'll include this patch. Thanks.

> ---
> 
> Create a slab-cache for 'struct pid_namespace' and fix a memory leak
> due to an extra reference in create_pid_namespace().
> 
> ---
>  kernel/pid.c |   10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> Index: lx26-23-rc1-mm1/kernel/pid.c
> ===================================================================
> --- lx26-23-rc1-mm1.orig/kernel/pid.c	2007-07-26 20:08:16.000000000 -0700
> +++ lx26-23-rc1-mm1/kernel/pid.c	2007-07-26 22:31:42.000000000 -0700
> @@ -35,6 +35,7 @@
>  static struct hlist_head *pid_hash;
>  static int pidhash_shift;
>  struct pid init_struct_pid = INIT_STRUCT_PID;
> +static struct kmem_cache *pid_ns_cachep;
>  
>  int pid_max = PID_MAX_DEFAULT;
>  
> @@ -525,7 +526,7 @@ static struct pid_namespace *create_pid_
>  	struct pid_namespace *ns;
>  	int i;
>  
> -	ns = kmalloc(sizeof(struct pid_namespace), GFP_KERNEL);
> +	ns = kmem_cache_alloc(pid_ns_cachep, GFP_KERNEL);
>  	if (ns == NULL)
>  		goto out;
>  
> @@ -544,7 +545,6 @@ static struct pid_namespace *create_pid_
>  
>  	set_bit(0, ns->pidmap[0].page);
>  	atomic_set(&ns->pidmap[0].nr_free, BITS_PER_PAGE - 1);
> -	get_pid_ns(ns);
>  
>  	for (i = 1; i < PIDMAP_ENTRIES; i++) {
>  		ns->pidmap[i].page = 0;
> @@ -556,7 +556,7 @@ static struct pid_namespace *create_pid_
>  out_free_map:
>  	kfree(ns->pidmap[0].page);
>  out_free:
> -	kfree(ns);
> +	kmem_cache_free(pid_ns_cachep, ns);
>  out:
>  	return ERR_PTR(-ENOMEM);
>  }
> @@ -567,7 +567,7 @@ static void destroy_pid_namespace(struct
>  
>  	for (i = 0; i < PIDMAP_ENTRIES; i++)
>  		kfree(ns->pidmap[i].page);
> -	kfree(ns);
> +	kmem_cache_free(pid_ns_cachep, ns);
>  }
>  
>    struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old_ns)
> @@ -687,4 +687,6 @@ void __init pidmap_init(void)
>  	init_pid_ns.pid_cachep = create_pid_cachep(1);
>  	if (init_pid_ns.pid_cachep == NULL)
>  		panic("Can't create pid_1 cachep\n");
> +
> +	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
>  }
> 

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [PATCH 11/15] Signal semantics [message #19539 is a reply to message #15478] Thu, 02 August 2007 08:35 Go to previous message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Serge E. Hallyn wrote:
> Quoting Pavel Emelyanov (xemul@openvz.org):
> 
>>[snip]
>>
>>
>>>>| Maybe it's worth disabling cross-namespaces ptracing...
>>>>
>>>>I think so too. Its probably not a serious limitation ?
>>>
>>>Several people think we will implement 'namespace entering' through a
>>>ptrace hack, where maybe the admin ptraces the init in a child pidns,
>>
>>Why not implement namespace entering w/o any hacks? :)
> 
> 
> I did, as a patch on top of the nsproxy container subsystem.  The
> response was that that is a hack, and ptrace is cleaner  :)
> 
> So the current options for namespace entering would be:
> 
> 	* using Cedric's bind_ns() functionality, which assigns an
> 	  integer global id to a namespace, and allows a process to
> 	  enter a namespace by that global id

looks more or less good and what OVZ actually does.
So I would prefer this one.

> 	* using my nsproxy container subsystem patch, which lets
> 	  a process enter another namespace using
> 	  	echo pid > /container/some/cont/directory/tasks
> 	  and eventually might allow construction of custom
> 	  namespaces, i.e.
> 	  	mkdir /container/c1/c2
> 		ln -s /container/c1/c1/network /container/c1/c2/network
> 		echo $$ > /container/c1/c2/tasks

Sound ok and logical as well.

> 	* using ptrace to coerce a process in the target namespace
> 	  into forking and executing the desired program.

you'll need to change ptrace interface in this case imho...
doesn't sound ok at all... at least for me. So I agree with Pavel.


>>>makes it fork, and makes the child execute what it wants (i.e. ps -ef).
>>>
>>>You're talking about killing that functionality?
>>
>>No. We're talking about disabling the things that are not supposed 
>>to work at all.
> 
> 
> Uh, well in the abstract that sounds like a sound policy...

Pavel simply meant that no one plans to disable functionality in question.

Thanks,
Kirill

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #19540 is a reply to message #15483] Thu, 02 August 2007 08:37 Go to previous message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Oleg Nesterov wrote:
> On 08/01, Dave Hansen wrote:
> 
>>>	If the main thread is exiting, but is not the last thread in the
>>>	group, should we let it exit and let the next thread in the group
>>>	the reaper of the pid ns ?
>>
>>Well, what happens with a multithreaded init today?
> 
> 
> As it was already discussed, the current code is buggy, and should be
> fixed.

I'm not that sure it MUST be fixed. There are no multi-threaded init's anywhere.
Oleg, does it worth changing without reasons?

Thanks,
Kirill
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [PATCH 11/15] Signal semantics [message #19541 is a reply to message #19539] Thu, 02 August 2007 20:09 Go to previous message
serue is currently offline  serue
Messages: 750
Registered: February 2006
Senior Member
Quoting Kirill Korotaev (dev@sw.ru):
> Serge E. Hallyn wrote:
> > Quoting Pavel Emelyanov (xemul@openvz.org):
> > 
> >>[snip]
> >>
> >>
> >>>>| Maybe it's worth disabling cross-namespaces ptracing...
> >>>>
> >>>>I think so too. Its probably not a serious limitation ?
> >>>
> >>>Several people think we will implement 'namespace entering' through a
> >>>ptrace hack, where maybe the admin ptraces the init in a child pidns,
> >>
> >>Why not implement namespace entering w/o any hacks? :)
> > 
> > 
> > I did, as a patch on top of the nsproxy container subsystem.  The
> > response was that that is a hack, and ptrace is cleaner  :)
> > 
> > So the current options for namespace entering would be:
> > 
> > 	* using Cedric's bind_ns() functionality, which assigns an
> > 	  integer global id to a namespace, and allows a process to
> > 	  enter a namespace by that global id
> 
> looks more or less good and what OVZ actually does.
> So I would prefer this one.

I think this was Cedric's last post of it:

https://lists.linux-foundation.org/pipermail/containers/2007-June/005665.html

However I'm pretty sure Eric would be soundly against this.

> > 	* using my nsproxy container subsystem patch, which lets
> > 	  a process enter another namespace using
> > 	  	echo pid > /container/some/cont/directory/tasks
> > 	  and eventually might allow construction of custom
> > 	  namespaces, i.e.
> > 	  	mkdir /container/c1/c2
> > 		ln -s /container/c1/c1/network /container/c1/c2/network
> > 		echo $$ > /container/c1/c2/tasks
> 
> Sound ok and logical as well.

I last posted this here:
http://www.mail-archive.com/devel@openvz.org/msg00295.html
In the ensuing thread, the ptrace-based solution is also discussed.

> > 	* using ptrace to coerce a process in the target namespace
> > 	  into forking and executing the desired program.
> 
> you'll need to change ptrace interface in this case imho...
> doesn't sound ok at all... at least for me. So I agree with Pavel.

Well maybe the problem is that while I think of it as ptrace-based, it's
really only like ptrace in that it hijacks another process for a moment
to clone it, but it likely will end up a completely different system
call.

Eric, just curious, have you worked on this at all since february?

-serge
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #19543 is a reply to message #19540] Thu, 02 August 2007 16:08 Go to previous message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 08/02, Kirill Korotaev wrote:
>
> Oleg Nesterov wrote:
> > On 08/01, Dave Hansen wrote:
> > 
> >>>	If the main thread is exiting, but is not the last thread in the
> >>>	group, should we let it exit and let the next thread in the group
> >>>	the reaper of the pid ns ?
> >>
> >>Well, what happens with a multithreaded init today?
> > 
> > 
> > As it was already discussed, the current code is buggy, and should be
> > fixed.
> 
> I'm not that sure it MUST be fixed. There are no multi-threaded init's anywhere.
> Oleg, does it worth changing without reasons?

I don't know. But the kernel already tries to support multi-threaded init's.
Look at de_thread(), it could be simplified a bit (and we don't need tasklist
lock for zap_other_threads()) if we forbid them.

Still. A non-root user does clone(CLONE_PIDNS), then clone(CLONE_THREAD),
and sys_exit() from the main thread, then proceeds with fork()s. Now this
ns has the global init as a child reaper, and admin can't kill entire pid_ns
by killing its init. Worse, (see the reply to Sukadev' message), we should
not reset pid_ns->child_reaper before zap_pid_ns_processes(). In that case
->child_reaper points to the freed task when the last thread exits, this
means the non-root user can crash the kernel.

Or, some embedded system uses multi-threaded init, and the kernel panics
when the main thread exits.

Perhaps this is just a "quality of implementation" question. sys_exit()
from the main thread should be OK, why /sbin/init should be special?

That said, I personally do not think that multi-threaded init is terribly
useful.

Oleg.

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #19544 is a reply to message #19543] Thu, 02 August 2007 17:08 Go to previous message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 08/02, Oleg Nesterov wrote:
>
> On 08/02, Kirill Korotaev wrote:
> >
> > Oleg Nesterov wrote:
> > > 
> > > As it was already discussed, the current code is buggy, and should be
> > > fixed.
> > 
> > I'm not that sure it MUST be fixed. There are no multi-threaded init's anywhere.
> > Oleg, does it worth changing without reasons?
> 
> I don't know. But the kernel already tries to support multi-threaded init's.
> Look at de_thread(), it could be simplified a bit (and we don't need tasklist
> lock for zap_other_threads()) if we forbid them.
> 
> Still. A non-root user does clone(CLONE_PIDNS), then clone(CLONE_THREAD),
> and sys_exit() from the main thread, then proceeds with fork()s. Now this
> ns has the global init as a child reaper, and admin can't kill entire pid_ns
> by killing its init. Worse, (see the reply to Sukadev' message), we should
> not reset pid_ns->child_reaper before zap_pid_ns_processes(). In that case
> ->child_reaper points to the freed task when the last thread exits, this
> means the non-root user can crash the kernel.
> 
> Or, some embedded system uses multi-threaded init, and the kernel panics
> when the main thread exits.
> 
> Perhaps this is just a "quality of implementation" question. sys_exit()
> from the main thread should be OK, why /sbin/init should be special?
> 
> That said, I personally do not think that multi-threaded init is terribly
> useful.

So I think the patch below makes sense for now. Note that it removes the
games with pid_ns->child_reaper: this doesn't work currently, and this
has to be modified when we actually support pid namespaces anyway.

Oleg.

--- t/kernel/exit.c~MTINIT	2007-07-28 16:58:17.000000000 +0400
+++ t/kernel/exit.c	2007-08-02 20:59:59.000000000 +0400
@@ -895,6 +895,14 @@ static void check_stack_usage(void)
 static inline void check_stack_usage(void) {}
 #endif
 
+static inline void exit_child_reaper(struct task_struct *tsk)
+{
+	if (likely(tsk->group_leader != child_reaper(tsk)))
+		return;
+
+	panic("Attempted to kill init!");
+}
+
 fastcall NORET_TYPE void do_exit(long code)
 {
 	struct task_struct *tsk = current;
@@ -908,13 +916,6 @@ fastcall NORET_TYPE void do_exit(long co
 		panic("Aiee, killing interrupt handler!");
 	if (unlikely(!tsk->pid))
 		panic("Attempted to kill the idle task!");
-	if (unlikely(tsk == child_reaper(tsk))) {
-		if (tsk->nsproxy->pid_ns != &init_pid_ns)
-			tsk->nsproxy->pid_ns->child_reaper = init_pid_ns.child_reaper;
-		else
-			panic("Attempted to kill init!");
-	}
-
 
 	if (unlikely(current->ptrace & PT_TRACE_EXIT)) {
 		current->ptrace_message = code;
@@ -964,6 +965,7 @@ fastcall NORET_TYPE void do_exit(long co
 	}
 	group_dead = atomic_dec_and_test(&tsk->signal->live);
 	if (group_dead) {
+		exit_child_reaper(tsk);
 		hrtimer_cancel(&tsk->signal->real_timer);
 		exit_itimers(tsk->signal);
 	}

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #19545 is a reply to message #19544] Fri, 03 August 2007 06:22 Go to previous message
Sukadev Bhattiprolu is currently offline  Sukadev Bhattiprolu
Messages: 413
Registered: August 2006
Senior Member
Here is the modified patch that applies on top of Oleg's patch
to fix the error for threaded-init. Hope the "big-fat-comment" in
exit_child_reaper() (about child-reaper inheriting children even
after exiting) makes sense.

---

From: Pavel Emelyanov <xemul@openvz.org>
Subject: [PATCH 14/15] Destroy pid namespace on init's death

From: Sukadev Bhattiprolu <sukadev@us.ibm.com>

Terminate all processes in a namespace when the reaper of the namespace
is exiting. We do this by walking the pidmap of the namespace and sending
SIGKILL to all processes.

Changelog:
	[Oleg Nesterov]: In zap_pid_ns_processes() wait for any child
		rather than iterating over all pid_ts in the pidmap. Clear
		TIF_SIGPENDING flag for successive wait() calls.

	[Oleg Nesterov]: Ensure the logic works even with multi-threaded
		container-init process.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>

---

 include/linux/pid.h |    2 ++
 kernel/exit.c       |   27 ++++++++++++++++++++++++++-
 kernel/pid.c        |   41 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 69 insertions(+), 1 deletion(-)

Index: lx26-23-rc1-mm1/include/linux/pid.h
===================================================================
--- lx26-23-rc1-mm1.orig/include/linux/pid.h	2007-08-02 11:03:39.000000000 -0700
+++ lx26-23-rc1-mm1/include/linux/pid.h	2007-08-02 11:06:47.000000000 -0700
@@ -118,6 +118,8 @@ extern struct pid *find_ge_pid(int nr, s
 
 extern struct pid *alloc_pid(struct pid_namespace *ns);
 extern void FASTCALL(free_pid(struct pid *pid));
+extern void zap_pid_ns_processes(struct pid_namespace *pid_ns,
+		struct task_struct *tsk);
 
 /*
  * the helpers to get the pid's id seen from different namespaces
Index: lx26-23-rc1-mm1/kernel/exit.c
===================================================================
--- lx26-23-rc1-mm1.orig/kernel/exit.c	2007-08-02 11:06:36.000000000 -0700
+++ lx26-23-rc1-mm1/kernel/exit.c	2007-08-02 23:06:47.000000000 -0700
@@ -916,7 +916,32 @@ static inline void exit_child_reaper(str
 	if (likely(tsk->group_leader != task_child_reaper(tsk)))
 		return;
 
-	panic("Attempted to kill init!");
+	if (tsk->nsproxy->pid_ns == &init_pid_ns)
+		panic("Attempted to kill init!");
+
+	/*
+	 * @tsk is the last thread in the 'container-init' and is exiting.
+	 * Terminate all remaining processes in the namespace and reap them
+	 * before exiting @tsk.
+	 *
+	 * Note that @tsk (last thread of container-init) may not necessarily
+	 * be the child-reaper (i.e main thread of container-init) of the
+	 * namespace i.e the child_reaper may have already exited.
+ 	 *
+	 * Even after a child_reaper exits, we let it inherit orphaned children,
+	 * because, pid_ns->child_reaper remains valid as long as there is
+	 * at least one living sub-thread in the container init.
+
+	 * This living sub-thread of the container-init will be notified when
+	 * a child inherited by the 'child-reaper' exits (do_notify_parent()
+	 * uses __group_send_sig_info()). Further, when reaping child processes,
+	 * do_wait() iterates over children of all living sub threads.
+
+	 * i.e even though 'child_reaper' thread is listed as the parent of the
+	 * orphaned children, any living sub-thread in the  container-init can
+	 * receive notification of the child exiting and reap the child.
+	 */
+	zap_pid_ns_processes(tsk->nsproxy->pid_ns, tsk);
 }
 
 fastcall NORET_TYPE void do_exit(long code)
Index: lx26-23-rc1-mm1/kernel/pid.c
===================================================================
--- lx26-23-rc1-mm1.orig/kernel/pid.c	2007-08-02 11:03:39.000000000 -0700
+++ lx26-23-rc1-mm1/kernel/pid.c	2007-08-02 23:06:40.000000000 -0700
@@ -29,6 +29,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/init_task.h>
 #include <linux/proc_fs.h>
+#include <linux/syscalls.h>
 
 #define pid_hashfn(nr, ns)	\
 	hash_long((unsigned long)nr + (unsigned long)ns, pidhash_shift)
@@ -593,6 +594,46 @@ out:
 	return new_ns;
 }
 
+void zap_pid_ns_processes(struct pid_namespace *pid_ns,
+			struct task_struct *reaper)
+{
+	int nr;
+	int rc;
+	pid_t reaper_pid = pid_nr_ns(task_pid(reaper), pid_ns);
+
+	/*
+	 * The last thread in the container-init thread group is terminating.
+	 * Find remaining pid_ts in the namespace, signal and wait for them
+	 * to exit.
+	 *
+	 * Note:  This signals each threads in the namespace - even those that
+	 * 	  belong to the same thread group, To avoid this, we would have
+	 * 	  to walk the entire tasklist looking a processes in this
+	 * 	  namespace, but that could be unnecessarily expensive if the
+	 * 	  pid namespace has just a few processes. Or we need to
+	 * 	  maintain a tasklist for each pid namespace.
+	 *
+	 */
+	read_lock(&tasklist_lock);
+	nr = next_pidmap(pid_ns, 0);
+	while (nr > 0) {
+		if (reaper_pid != nr)
+			kill_proc_info(SIGKILL, SEND_SIG_PRIV, nr);
+		nr = next_pidmap(pid_ns, nr);
+	}
+	read_unlock(&tasklist_lock);
+
+	do {
+		clear_thread_flag(TIF_SIGPENDING);
+		rc = sys_wait4(-1, NULL, __WALL, NULL);
+	} while (rc != -ECHILD);
+
+
+	/* Don't need a child reaper for this pid namespace anymore */
+	pid_ns->child_reaper = NULL;
+	return;
+}
+
 static void do_free_pid_ns(struct work_struct *w)
 {
 	struct pid_namespace *ns, *parent;
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #19551 is a reply to message #15308] Fri, 03 August 2007 21:36 Go to previous message
Sukadev Bhattiprolu is currently offline  Sukadev Bhattiprolu
Messages: 413
Registered: August 2006
Senior Member
Oleg Nesterov [oleg@tv-sign.ru] wrote:
| This is harmless (and note that it is possible that current was actually killed
| with SIGKILL from the parent namespace), but the code imho looks confusing.
| 
| I'd suggest to make zap_pid_ns_processes(void), and start the loop from nr == 1.
| Or zap_pid_ns_processes(struct pid_namespace *pid_ns).
| 
| Oleg.

Agree. Here is the modified patch.
---

From: Pavel Emelyanov <xemul@openvz.org>
Subject: [PATCH 14/15] Destroy pid namespace on init's death

From: Sukadev Bhattiprolu <sukadev@us.ibm.com>

Terminate all processes in a namespace when the reaper of the namespace
is exiting. We do this by walking the pidmap of the namespace and sending
SIGKILL to all processes.

Changelog:
	[Oleg Nesterov]: In zap_pid_ns_processes() wait for any child
		rather than iterating over all pid_ts in the pidmap. Clear
		TIF_SIGPENDING flag for successive wait() calls.

	[Oleg Nesterov]: Ensure the logic works even with multi-threaded
		container-init process.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>

---

 include/linux/pid.h |    1 +
 kernel/exit.c       |   27 ++++++++++++++++++++++++++-
 kernel/pid.c        |   38 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 65 insertions(+), 1 deletion(-)

Index: lx26-23-rc1-mm1/include/linux/pid.h
===================================================================
--- lx26-23-rc1-mm1.orig/include/linux/pid.h	2007-08-02 11:03:39.000000000 -0700
+++ lx26-23-rc1-mm1/include/linux/pid.h	2007-08-03 13:29:40.000000000 -0700
@@ -118,6 +118,7 @@ extern struct pid *find_ge_pid(int nr, s
 
 extern struct pid *alloc_pid(struct pid_namespace *ns);
 extern void FASTCALL(free_pid(struct pid *pid));
+extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
 
 /*
  * the helpers to get the pid's id seen from different namespaces
Index: lx26-23-rc1-mm1/kernel/exit.c
===================================================================
--- lx26-23-rc1-mm1.orig/kernel/exit.c	2007-08-02 11:06:36.000000000 -0700
+++ lx26-23-rc1-mm1/kernel/exit.c	2007-08-03 13:56:37.000000000 -0700
@@ -916,7 +916,32 @@ static inline void exit_child_reaper(str
 	if (likely(tsk->group_leader != task_child_reaper(tsk)))
 		return;
 
-	panic("Attempted to kill init!");
+	if (tsk->nsproxy->pid_ns == &init_pid_ns)
+		panic("Attempted to kill init!");
+
+	/*
+	 * @tsk is the last thread in the 'container-init' and is exiting.
+	 * Terminate all remaining processes in the namespace and reap them
+	 * before exiting @tsk.
+	 *
+	 * Note that @tsk (last thread of container-init) may not necessarily
+	 * be the child-reaper (i.e main thread of container-init) of the
+	 * namespace i.e the child_reaper may have already exited.
+ 	 *
+	 * Even after a child_reaper exits, we let it inherit orphaned children,
+	 * because, pid_ns->child_reaper remains valid as long as there is
+	 * at least one living sub-thread in the container init.
+
+	 * This living sub-thread of the container-init will be notified when
+	 * a child inherited by the 'child-reaper' exits (do_notify_parent()
+	 * uses __group_send_sig_info()). Further, when reaping child processes,
+	 * do_wait() iterates over children of all living sub threads.
+
+	 * i.e even though 'child_reaper' thread is listed as the parent of the
+	 * orphaned children, any living sub-thread in the container-init can
+	 * perform the role of the child_reaper.
+	 */
+	zap_pid_ns_processes(tsk->nsproxy->pid_ns);
 }
 
 fastcall NORET_TYPE void do_exit(long code)
Index: lx26-23-rc1-mm1/kernel/pid.c
===================================================================
--- lx26-23-rc1-mm1.orig/kernel/pid.c	2007-08-02 11:03:39.000000000 -0700
+++ lx26-23-rc1-mm1/kernel/pid.c	2007-08-03 13:56:12.000000000 -0700
@@ -29,6 +29,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/init_task.h>
 #include <linux/proc_fs.h>
+#include <linux/syscalls.h>
 
 #define pid_hashfn(nr, ns)	\
 	hash_long((unsigned long)nr + (unsigned long)ns, pidhash_shift)
@@ -593,6 +594,43 @@ out:
 	return new_ns;
 }
 
+void zap_pid_ns_processes(struct pid_namespace *pid_ns)
+{
+	int nr;
+	int rc;
+
+	/*
+	 * The last thread in the container-init thread group is terminating.
+	 * Find remaining pid_ts in the namespace, signal and wait for them
+	 * to exit.
+	 *
+	 * Note:  This signals each threads in the namespace - even those that
+	 * 	  belong to the same thread group, To avoid this, we would have
+	 * 	  to walk the entire tasklist looking a processes in this
+	 * 	  namespace, but that could be unnecessarily expensive if the
+	 * 	  pid namespace has just a few processes. Or we need to
+	 * 	  maintain a tasklist for each pid namespace.
+	 *
+	 */
+	read_lock(&tasklist_lock);
+	nr = next_pidmap(pid_ns, 1);
+	while (nr > 0) {
+		kill_proc_info(SIGKILL, SEND_SIG_PRIV, nr);
+		nr = next_pidmap(pid_ns, nr);
+	}
+	read_unlock(&tasklist_lock);
+
+	do {
+		clear_thread_flag(TIF_SIGPENDING);
+		rc = sys_wait4(-1, NULL, __WALL, NULL);
+	} while (rc != -ECHILD);
+
+
+	/* Child reaper for the pid namespace is going away */
+	pid_ns->child_reaper = NULL;
+	return;
+}
+
 static void do_free_pid_ns(struct work_struct *w)
 {
 	struct pid_namespace *ns, *parent;
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [PATCH 14/15] Destroy pid namespace on init's death [message #19552 is a reply to message #19545] Fri, 03 August 2007 10:55 Go to previous message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 08/02, sukadev@us.ibm.com wrote:
> 
> --- lx26-23-rc1-mm1.orig/kernel/exit.c	2007-08-02 11:06:36.000000000 -0700
> +++ lx26-23-rc1-mm1/kernel/exit.c	2007-08-02 23:06:47.000000000 -0700
> @@ -916,7 +916,32 @@ static inline void exit_child_reaper(str
>  	if (likely(tsk->group_leader != task_child_reaper(tsk)))
>  		return;
>  
> -	panic("Attempted to kill init!");
> +	if (tsk->nsproxy->pid_ns == &init_pid_ns)
> +		panic("Attempted to kill init!");
> +
> +	/*
> +	 * @tsk is the last thread in the 'container-init' and is exiting.
> +	 * Terminate all remaining processes in the namespace and reap them
> +	 * before exiting @tsk.
> +	 *
> +	 * Note that @tsk (last thread of container-init) may not necessarily
> +	 * be the child-reaper (i.e main thread of container-init) of the
> +	 * namespace i.e the child_reaper may have already exited.
> + 	 *
> +	 * Even after a child_reaper exits, we let it inherit orphaned children,
> +	 * because, pid_ns->child_reaper remains valid as long as there is
> +	 * at least one living sub-thread in the container init.
> +
> +	 * This living sub-thread of the container-init will be notified when
> +	 * a child inherited by the 'child-reaper' exits (do_notify_parent()
> +	 * uses __group_send_sig_info()). Further, when reaping child processes,
> +	 * do_wait() iterates over children of all living sub threads.
> +
> +	 * i.e even though 'child_reaper' thread is listed as the parent of the
> +	 * orphaned children, any living sub-thread in the  container-init can
> +	 * receive notification of the child exiting and reap the child.
> +	 */

Great, thanks.

> +	zap_pid_ns_processes(tsk->nsproxy->pid_ns, tsk);
>  }
>
> +void zap_pid_ns_processes(struct pid_namespace *pid_ns,
> +			struct task_struct *reaper)
> +{
> +	int nr;
> +	int rc;
> +	pid_t reaper_pid = pid_nr_ns(task_pid(reaper), pid_ns);

I personally dislike these paramaters. reaper == current, and it is /sbin/init
always. Not because zap_pid_ns_processes() is always called with reaper == current,
but because zap_pid_ns_processes() can't work otherwise: we are using do_wait()
and assume that forget_original_parent() will re-parent threads to use.

And we use it just to figure out reaper_pid, it is used to avoid sending
SIGKILL to us,

> +	read_lock(&tasklist_lock);
> +	nr = next_pidmap(pid_ns, 0);
> +	while (nr > 0) {
> +		if (reaper_pid != nr)
> +			kill_proc_info(SIGKILL, SEND_SIG_PRIV, nr);
> +		nr = next_pidmap(pid_ns, nr);
> +	}
> +	read_unlock(&tasklist_lock);

But this doesn't work if we are not ->group_leader (iow, when reaper_pid != 1).
Because in that case we are doing kill_proc_info(SIGKILL, SEND_SIG_PRIV, 1),
which sends the signal to entire thread group, and thus to us (because we are
the last alive thread).

This is harmless (and note that it is possible that current was actually killed
with SIGKILL from the parent namespace), but the code imho looks confusing.

I'd suggest to make zap_pid_ns_processes(void), and start the loop from nr == 1.
Or zap_pid_ns_processes(struct pid_namespace *pid_ns).

Oleg.

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Previous Topic: Re: [RFC, PATCH] handle the multi-threaded init's exit() properly
Next Topic: [PATCH 0/14] sysfs cleanups
Goto Forum:
  


Current Time: Thu Jan 02 19:16:53 GMT 2025

Total time taken to generate the page: 0.09000 seconds