| Home » Mailing lists » Devel » Remaining straight forward kthread API conversions... Goto Forum:
	|  |  
	|  |  
	| 
		
			| Re: [PATCH] powerpc pseries eeh: Convert to kthread API [message #18502 is a reply to message #18295] | Tue, 24 April 2007 05:43   |  
			| 
				
				
					|  ebiederm Messages: 1354
 Registered: February 2006
 | Senior Member |  |  |  
	| Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:
>> Further in general it doesn't make sense to grab a module reference
>> and call that sufficient because we would like to request that the
>> module exits.
>
> Which is, btw, I think a total misdesign of our module stuff, but heh, I
> remember that lead to some flamewars back then...
>
> Like anything else, modules should have separated the entrypoints for
>
>  - Initiating a removal request
>  - Releasing the module
>
> The former is use did "rmmod", can unregister things from subsystems,
> etc... (and can file if the driver decides to refuse removal requests
> when it's busy doing things or whatever policy that module wants to
> implement).
>
> The later is called when all references to the modules have been
> dropped, it's a bit like the kref "release" (and could be implemented as
> one).
>
> If we had done that (simple) thing back then, module refcounting would
> have been much less of a problem... I remember some reasons why that was
> veto'ed but I didn't and still don't agree.
The basic point is because a thread can terminate sooner if we have an
explicit request to stop, we need that in the design.
Because we need to find the threads to request that they stop we need to
have some way to track them.
Since we need to have some way to track them having an explicit data
structure that the callers manage seems to make sense.
Eric
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	|  |  
	|  |  
	|  |  
	| 
		
			| Re: [PATCH] powerpc pseries eeh: Convert to kthread API [message #18506 is a reply to message #18505] | Tue, 24 April 2007 17:24   |  
			| 
				
				
					|  linas Messages: 3
 Registered: April 2007
 | Junior Member |  |  |  
	| On Tue, Apr 24, 2007 at 11:38:53AM +1000, Benjamin Herrenschmidt wrote:
> > The only reason for using threads here is to get the error recovery
> > out of an interrupt context (where errors may be detected), and then,
> > an hour later, decrement a counter (which is how we limit these to 
> > 6 per hour). Thread reaping is "trivial", the thread just exits
> > after an hour.
> 
> In addition, it should be a thread and not done from within keventd
> because :
> 
>  - It can take a long time (well, relatively but still too long for a
> work queue)
Uhh, 15 or 20 seconds even. That's a long time by any kernel standard.
>  - The driver callbacks might need to use keventd or do flush_workqueue
> to synchronize with their own workqueues when doing an internal
> recovery.
> 
> > Since these are events rare, I've no particular concern about
> > performance or resource consumption. The current code seems 
> > to work just fine. :-)
> 
> I think moving to kthread's is cleaner (just a wrapper around kernel
> threads that simplify dealing with reaping them out mostly) and I agree
> with Christoph that it would be nice to be able to "fire off" kthreads
> from interrupt context.. in many cases, we abuse work queues for things
> that should really done from kthreads instead (basically anything that
> takes more than a couple hundred microsecs or so).
It would be nice to have threads that can be "fired off" from an
interrupt context.  That would simplify the EEH code slightly 
(removing a few dozen lines of code that do this bounce).
I presume that various device drivers might find this useful as well.
--linas
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: [PATCH] powerpc pseries eeh: Convert to kthread API [message #18507 is a reply to message #18468] | Tue, 24 April 2007 17:35   |  
			| 
				
				
					|  linas Messages: 3
 Registered: April 2007
 | Junior Member |  |  |  
	| On Thu, Apr 19, 2007 at 01:58:45AM -0600, Eric W. Biederman wrote:
> From: Eric W. Biederman <ebiederm@xmission.com>
> 
> This patch modifies the startup of eehd to use kthread_run
> not a combination of kernel_thread and daemonize.  Making
> the code slightly simpler and more maintainable.
For the patch that touched arch/powerpc/platforms/pseries/eeh_event.c,
I ran a variety of tests, and couldn't see/find/evoke any adverse
effects, so ..
Acked-by: Linas Vepstas <linas@austin.ibm.com>
> The second question is whether this is the right implementation.
> kthread_create already works by using a workqueue to create the thread
> and then waits for it.  If we really want to support creating threads
> asynchronously on demand we should have a proper API in kthread.c for
> this instead of spreading workqueues.
Yes, exactly; all I really want is to start a thread from an
interrupt context, and pass a structure to it.  This is pretty much
all that arch/powerpc/platforms/pseries/eeh_event.c is trying to do,
and little else.
--linas
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: [PATCH] powerpc pseries eeh: Convert to kthread API [message #18508 is a reply to message #18496] | Tue, 24 April 2007 02:42   |  
			| 
				
				
					|  Benjamin Herrenschmid Messages: 10
 Registered: February 2006
 | Junior Member |  |  |  
	| On Mon, 2007-04-23 at 20:08 -0600, Eric W. Biederman wrote:
> Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:
> 
> >> The only reason for using threads here is to get the error recovery
> >> out of an interrupt context (where errors may be detected), and then,
> >> an hour later, decrement a counter (which is how we limit these to 
> >> 6 per hour). Thread reaping is "trivial", the thread just exits
> >> after an hour.
> >
> > In addition, it should be a thread and not done from within keventd
> > because :
> >
> >  - It can take a long time (well, relatively but still too long for a
> > work queue)
> >
> >  - The driver callbacks might need to use keventd or do flush_workqueue
> > to synchronize with their own workqueues when doing an internal
> > recovery.
> >
> >> Since these are events rare, I've no particular concern about
> >> performance or resource consumption. The current code seems 
> >> to work just fine. :-)
> >
> > I think moving to kthread's is cleaner (just a wrapper around kernel
> > threads that simplify dealing with reaping them out mostly) and I agree
> > with Christoph that it would be nice to be able to "fire off" kthreads
> > from interrupt context.. in many cases, we abuse work queues for things
> > that should really done from kthreads instead (basically anything that
> > takes more than a couple hundred microsecs or so).
> 
> On that note does anyone have a problem is we manage the irq spawning
> safe kthreads the same way that we manage the work queue entries.
> 
> i.e. by a structure allocated by the caller?
Not sure... I can see places where I might want to spawn an arbitrary
number of these without having to preallocate structures... and if I
allocate on the fly, then I need a way to free that structure when the
kthread is reaped which I don't think we have currently, do we ? (In
fact, I could use that for other things too now that I'm thinking of
it ... I might have a go at providing optional kthread destructors).
Ben.
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	|  |  
	|  |  
	| 
		
			| Re: Getting the new RxRPC patches upstream [message #18511 is a reply to message #18498] | Tue, 24 April 2007 13:37   |  
			| 
				
				
					|  David Howells Messages: 44
 Registered: October 2006
 | Member |  |  |  
	| Oleg Nesterov <oleg@tv-sign.ru> wrote:
> > > We only care when del_timer() returns true. In that case, if the timer
> > > function still runs (possible for single-threaded wqs), it has already
> > > passed __queue_work().
> > 
> > Why do you assume that?
Sorry, I should have been more clear.  I meant the assumption that we only
care about a true return from del_timer().
> If del_timer() returns true, the timer was pending. This means it was
> started by work->func() (note that __run_timers() clears timer_pending()
> before calling timer->function). This in turn means that
> delayed_work_timer_fn() has already called __queue_work(dwork), otherwise
> work->func() has no chance to run.
But if del_timer() returns 0, then there may be a problem.  We can't tell the
difference between the following two cases:
 (1) The timer hadn't been started.
 (2) The timer had been started, has expired and is no longer pending, but
     another CPU is running its handler routine.
try_to_del_timer_sync() _does_, however, distinguish between these cases: the
first is the 0 return, the second is the -1 return, and the case where it
dequeued the timer is the 1 return.
BTW, can a timer handler be preempted?  I assume not...  But it can be delayed
by interrupt processing.
David
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: [PATCH] powerpc pseries eeh: Convert to kthread API [message #18512 is a reply to message #18499] | Tue, 24 April 2007 05:00   |  
			| 
				
				
					|  Benjamin Herrenschmid Messages: 10
 Registered: February 2006
 | Junior Member |  |  |  
	| > Further in general it doesn't make sense to grab a module reference
> and call that sufficient because we would like to request that the
> module exits.
Which is, btw, I think a total misdesign of our module stuff, but heh, I
remember that lead to some flamewars back then...
Like anything else, modules should have separated the entrypoints for
 - Initiating a removal request
 - Releasing the module
The former is use did "rmmod", can unregister things from subsystems,
etc... (and can file if the driver decides to refuse removal requests
when it's busy doing things or whatever policy that module wants to
implement).
The later is called when all references to the modules have been
dropped, it's a bit like the kref "release" (and could be implemented as
one).
If we had done that (simple) thing back then, module refcounting would
have been much less of a problem... I remember some reasons why that was
veto'ed but I didn't and still don't agree.
Ben.
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	|  |  
	|  |  
	|  |  
	| 
		
			| Re: Getting the new RxRPC patches upstream [message #18516 is a reply to message #18371] | Tue, 24 April 2007 15:51   |  
			| 
				
				
					|  David Howells Messages: 44
 Registered: October 2006
 | Member |  |  |  
	| Oleg Nesterov <oleg@tv-sign.ru> wrote:
> Let's look at (2). cancel_delayed_work() (on top of del_timer()) returns 0,
> and this is correct, we failed to cancel the timer, and we don't know whether
> work->func() finished, or not.
Yes.
> The current code uses del_timer_sync(). It will also return 0. However, it
> will spin waiting for timer->function() to complete. So we are just wasting
> CPU.
That's my objection to using cancel_delayed_work() as it stands, although in
most cases it's a relatively minor waste of time.  However, if the timer
expiry routine gets interrupted then it may not be so minor...  So, yes, I'm
in full agreement with you there.
> I guess I misunderstood you. Perhaps, you propose a new helper which use
> try_to_del_timer_sync(), yes? Unless I missed something, this doesn't help.
> Because the return value == -1 should be treated as 0. We failed to stop
> the timer, and we can't free dwork.
Consider how I'm using try_to_cancel_delayed_work(): I use this when I want to
queue a delayed work item with a particular timeout (usually for immediate
processing), but the work item may already be pending.
If try_to_cancel_delayed_work() returns 0 or 1 (not pending or pending but
dequeued) then I can go ahead and just schedule the work item (I'll be holding
a lock to prevent anyone else from interfering).
However, if try_to_cancel_delayed_work() returns -1 then there's no usually no
point attempting to schedule the work item because I know the timer expiry
handler is doing that or going to do that.
The code looks like this in pretty much all cases:
	if (try_to_cancel_delayed_work(&afs_server_reaper) >= 0)
		schedule_delayed_work(&afs_server_reaper, 0);
And so could well be packaged into a convenience routine and placed in
workqueue.[ch].  However, this would still concern Dave Miller as my patches
would still be altering non-net stuff or depending on non-net patches he
doesn't have in his GIT tree.
Using cancel_delayer_work() instead would be acceptable, functionally, as that
just waits till the -1 return case no longer holds true, and so always returns
0 or 1.
In RxRPC, this is only used to cancel a pair global delayed work items in the
rmmod path, and so the inefficiency of cancel_delayed_work() is something I
can live with, though it's something I'd want to reduce in the longer term.
In AFS, this is not only used in object destruction paths, but is also used to
cancel the callback timer and initiate synchronisation processing with
immediate effect.
David
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: Getting the new RxRPC patches upstream [message #18517 is a reply to message #18371] | Tue, 24 April 2007 16:58   |  
			| 
				
				
					|  David Howells Messages: 44
 Registered: October 2006
 | Member |  |  |  
	| Oleg Nesterov <oleg@tv-sign.ru> wrote:
> > > The current code uses del_timer_sync(). It will also return 0. However,
> > > it will spin waiting for timer->function() to complete. So we are just
> > > wasting CPU.
> > 
> > That's my objection to using cancel_delayed_work() as it stands, although in
> > most cases it's a relatively minor waste of time.  However, if the timer
> > expiry routine gets interrupted then it may not be so minor...  So, yes, I'm
> > in full agreement with you there.
> 
> Great. I'll send the s/del_timer_sync/del_timer/ patch.
I didn't say I necessarily agreed that this was a good idea.  I just meant that
I agree that it will waste CPU.  You must still audit all uses of
cancel_delayed_work().
> Aha, now I see what you mean. However. Why the code above is better then
> 
> 	cancel_delayed_work(&afs_server_reaper);
> 	schedule_delayed_work(&afs_server_reaper, 0);
> 
> ? (I assume we already changed cancel_delayed_work() to use del_timer).
Because calling schedule_delayed_work() is a waste of CPU if the timer expiry
handler is currently running at this time as *that* is going to also schedule
the delayed work item.
> If delayed_work_timer_fn() is not running - both variants (let's denote them
> as 1 and 2) do the same.
Yes, but that's not the point.
> Now suppose that delayed_work_timer_fn() is running.
> 
> 	1: lock_timer_base(), return -1, skip schedule_delayed_work().
>
> 	2: check timer_pending(), return 0, call schedule_delayed_work(),
> 	   return immediately because test_and_set_bit(WORK_STRUCT_PENDING)
> 	   fails.
I don't see what you're illustrating here.  Are these meant to be two steps in
a single process?  Or are they two alternate steps?
> So I still don't think try_to_del_timer_sync() can help in this particular
> case.
It permits us to avoid the test_and_set_bit() under some circumstances.
> To some extent, try_to_cancel_delayed_work is
> 
> 	int try_to_cancel_delayed_work(dwork)
> 	{
> 		ret = cancel_delayed_work(dwork);
> 		if (!ret && work_pending(&dwork->work))
> 			ret = -1;
> 		return ret;
> 	}
> 
> iow, work_pending() looks like a more "precise" indication that work->func()
> is going to run soon.
Ah, but the timer routine may try to set the work item pending flag *after* the
work_pending() check you have here.  Furthermore, it would be better to avoid
the work_pending() check entirely because that check involves interacting with
atomic ops done on other CPUs.  try_to_del_timer_sync() returning -1 tells us
without a shadow of a doubt that the work item is either scheduled now or will
be scheduled very shortly, thus allowing us to avoid having to do it ourself.
David
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: Getting the new RxRPC patches upstream [message #18518 is a reply to message #18511] | Tue, 24 April 2007 14:22   |  
			| 
				
				
					|  Oleg Nesterov Messages: 143
 Registered: August 2006
 | Senior Member |  |  |  
	| On 04/24, David Howells wrote:
>
> Oleg Nesterov <oleg@tv-sign.ru> wrote:
> 
> > > > We only care when del_timer() returns true. In that case, if the timer
> > > > function still runs (possible for single-threaded wqs), it has already
> > > > passed __queue_work().
> > > 
> > > Why do you assume that?
> 
> Sorry, I should have been more clear.  I meant the assumption that we only
> care about a true return from del_timer().
> 
> > If del_timer() returns true, the timer was pending. This means it was
> > started by work->func() (note that __run_timers() clears timer_pending()
> > before calling timer->function). This in turn means that
> > delayed_work_timer_fn() has already called __queue_work(dwork), otherwise
> > work->func() has no chance to run.
> 
> But if del_timer() returns 0, then there may be a problem.  We can't tell the
> difference between the following two cases:
> 
>  (1) The timer hadn't been started.
> 
>  (2) The timer had been started, has expired and is no longer pending, but
>      another CPU is running its handler routine.
> 
> try_to_del_timer_sync() _does_, however, distinguish between these cases: the
> first is the 0 return, the second is the -1 return, and the case where it
> dequeued the timer is the 1 return.
Of course, del_timer() and del_timer_sync() are different. What I meant the
latter buys nothing for cancel_delayed_work() (which in fact could be named
try_to_cancel_delayed_work()).
Let's look at (2). cancel_delayed_work() (on top of del_timer()) returns 0,
and this is correct, we failed to cancel the timer, and we don't know whether
work->func() finished, or not.
The current code uses del_timer_sync(). It will also return 0. However, it will
spin waiting for timer->function() to complete. So we are just wasting CPU.
I guess I misunderstood you. Perhaps, you propose a new helper which use
try_to_del_timer_sync(), yes? Unless I missed something, this doesn't help.
Because the return value == -1 should be treated as 0. We failed to stop
the timer, and we can't free dwork.
IOW, currently we should do:
	if (!cancel_delayed_work(dwork))
		cancel_work_sync(dwork));
The same if we use del_timer(). If we use try_to_del_timer_sync(),
	if (cancel_delayed_work(dwork) <= 0)
		cancel_work_sync(dwork));
(of course, dwork shouldn't re-arm itself).
Could you clarify if I misunderstood you again?
> BTW, can a timer handler be preempted?  I assume not...  But it can be delayed
> by interrupt processing.
No, it can't be preempted, it runs in softirq context.
Oleg.
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: Getting the new RxRPC patches upstream [message #18520 is a reply to message #18371] | Tue, 24 April 2007 18:22   |  
			| 
				
				
					|  David Howells Messages: 44
 Registered: October 2006
 | Member |  |  |  
	| Oleg Nesterov <oleg@tv-sign.ru> wrote:
> Sure, I'll grep for cancel_delayed_work(). But unless I missed something,
> this change should be completely transparent for all users. Otherwise, it
> is buggy.
I guess you will have to make sure that cancel_delayed_work() is always
followed by a flush of the workqueue, otherwise you might get this situation:
	CPU 0				CPU 1
	===============================	=======================
					<timer expires>
	cancel_delayed_work(x) == 0	-->delayed_work_timer_fn(x)
	kfree(x);			-->do_IRQ()
	y = kmalloc(); // reuses x
					<--do_IRQ()
					__queue_work(x)
	--- OOPS ---
That's my main concern.  If you are certain that can't happen, then fair
enough.
Note that although you can call cancel_delayed_work() from within a work item
handler, you can't then follow it up with a flush as it's very likely to
deadlock.
> > Because calling schedule_delayed_work() is a waste of CPU if the timer
> > expiry handler is currently running at this time as *that* is going to
> > also schedule the delayed work item.
> 
> Yes. But otoh, try_to_del_timer_sync() is a waste of CPU compared to
> del_timer(), when the timer is not pending.
I suppose that's true.  As previously stated, my main objection to del_timer()
is the fact that it doesn't tell you if the timer expiry function is still
running.
Can you show me a patch illustrating exactly how you want to change
cancel_delayed_work()?  I can't remember whether you've done so already, but
if you have, I can't find it.  Is it basically this?:
 static inline int cancel_delayed_work(struct delayed_work *work)
 {
 	int ret;
-	ret = del_timer_sync(&work->timer);
+	ret = del_timer(&work->timer);
 	if (ret)
 		work_release(&work->work);
 	return ret;
 }
I was thinking this situation might be a problem:
	CPU 0				CPU 1
	===============================	=======================
					<timer expires>
	cancel_delayed_work(x) == 0	-->delayed_work_timer_fn(x)
	schedule_delayed_work(x,0)	-->do_IRQ()
	<keventd scheduled>
	x->work()
					<--do_IRQ()
					__queue_work(x)
But it won't, will it?
> > Ah, but the timer routine may try to set the work item pending flag
> > *after* the work_pending() check you have here.
> 
> No, delayed_work_timer_fn() doesn't set the _PENDING flag.
Good point.  I don't think that's a problem because cancel_delayed_work()
won't clear the pending flag if it didn't remove a timer.
> First, this is very unlikely event, delayed_work_timer_fn() is very fast
> unless interrupted.
Yeah, I guess so.
Okay, you've convinced me, I think - provided you consider the case I
outlinded at the top of this email.
If you give me a patch to alter cancel_delayed_work(), I'll substitute it for
mine and use that that instead.  Dave Miller will just have to live with that
patch being there:-)
David
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: Getting the new RxRPC patches upstream [message #18521 is a reply to message #18516] | Tue, 24 April 2007 16:40   |  
			| 
				
				
					|  Oleg Nesterov Messages: 143
 Registered: August 2006
 | Senior Member |  |  |  
	| On 04/24, David Howells wrote:
>
> Oleg Nesterov <oleg@tv-sign.ru> wrote:
> 
> > The current code uses del_timer_sync(). It will also return 0. However, it
> > will spin waiting for timer->function() to complete. So we are just wasting
> > CPU.
> 
> That's my objection to using cancel_delayed_work() as it stands, although in
> most cases it's a relatively minor waste of time.  However, if the timer
> expiry routine gets interrupted then it may not be so minor...  So, yes, I'm
> in full agreement with you there.
Great. I'll send the s/del_timer_sync/del_timer/ patch.
> > I guess I misunderstood you. Perhaps, you propose a new helper which use
> > try_to_del_timer_sync(), yes? Unless I missed something, this doesn't help.
> > Because the return value == -1 should be treated as 0. We failed to stop
> > the timer, and we can't free dwork.
> 
> Consider how I'm using try_to_cancel_delayed_work(): I use this when I want to
> queue a delayed work item with a particular timeout (usually for immediate
> processing), but the work item may already be pending.
> 
> If try_to_cancel_delayed_work() returns 0 or 1 (not pending or pending but
> dequeued) then I can go ahead and just schedule the work item (I'll be holding
> a lock to prevent anyone else from interfering).
> 
> However, if try_to_cancel_delayed_work() returns -1 then there's no usually no
> point attempting to schedule the work item because I know the timer expiry
> handler is doing that or going to do that.
> 
> 
> The code looks like this in pretty much all cases:
> 
> 	if (try_to_cancel_delayed_work(&afs_server_reaper) >= 0)
> 		schedule_delayed_work(&afs_server_reaper, 0);
Aha, now I see what you mean. However. Why the code above is better then
	cancel_delayed_work(&afs_server_reaper);
	schedule_delayed_work(&afs_server_reaper, 0);
? (I assume we already changed cancel_delayed_work() to use del_timer).
If delayed_work_timer_fn() is not running - both variants (let's denote them
as 1 and 2) do the same.
Now suppose that delayed_work_timer_fn() is running.
	1: lock_timer_base(), return -1, skip schedule_delayed_work().
	2: check timer_pending(), return 0, call schedule_delayed_work(),
	   return immediately because test_and_set_bit(WORK_STRUCT_PENDING)
	   fails.
So I still don't think try_to_del_timer_sync() can help in this particular
case.
To some extent, try_to_cancel_delayed_work is
	int try_to_cancel_delayed_work(dwork)
	{
		ret = cancel_delayed_work(dwork);
		if (!ret && work_pending(&dwork->work))
			ret = -1;
		return ret;
	}
iow, work_pending() looks like a more "precise" indication that work->func()
is going to run soon.
Oleg.
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: Getting the new RxRPC patches upstream [message #18522 is a reply to message #18517] | Tue, 24 April 2007 17:33   |  
			| 
				
				
					|  Oleg Nesterov Messages: 143
 Registered: August 2006
 | Senior Member |  |  |  
	| On 04/24, David Howells wrote:
>
> Oleg Nesterov <oleg@tv-sign.ru> wrote:
> 
> > Great. I'll send the s/del_timer_sync/del_timer/ patch.
> 
> I didn't say I necessarily agreed that this was a good idea.  I just meant that
> I agree that it will waste CPU.  You must still audit all uses of
> cancel_delayed_work().
Sure, I'll grep for cancel_delayed_work(). But unless I missed something,
this change should be completely transparent for all users. Otherwise, it
is buggy.
> > Aha, now I see what you mean. However. Why the code above is better then
> > 
> > 	cancel_delayed_work(&afs_server_reaper);
> > 	schedule_delayed_work(&afs_server_reaper, 0);
> > 
> > ? (I assume we already changed cancel_delayed_work() to use del_timer).
> 
> Because calling schedule_delayed_work() is a waste of CPU if the timer expiry
> handler is currently running at this time as *that* is going to also schedule
> the delayed work item.
Yes. But otoh, try_to_del_timer_sync() is a waste of CPU compared to del_timer(),
when the timer is not pending.
> > 	1: lock_timer_base(), return -1, skip schedule_delayed_work().
> >
> > 	2: check timer_pending(), return 0, call schedule_delayed_work(),
> > 	   return immediately because test_and_set_bit(WORK_STRUCT_PENDING)
> > 	   fails.
> 
> I don't see what you're illustrating here.  Are these meant to be two steps in
> a single process?  Or are they two alternate steps?
two alternate steps.
1 means
	if (try_to_cancel_delayed_work())
		schedule_delayed_work();
2 means
	cancel_delayed_work();
	schedule_delayed_work();
> > So I still don't think try_to_del_timer_sync() can help in this particular
> > case.
> 
> It permits us to avoid the test_and_set_bit() under some circumstances.
Yes. But lock_timer_base() is more costly.
> > To some extent, try_to_cancel_delayed_work is
> > 
> > 	int try_to_cancel_delayed_work(dwork)
> > 	{
> > 		ret = cancel_delayed_work(dwork);
> > 		if (!ret && work_pending(&dwork->work))
> > 			ret = -1;
> > 		return ret;
> > 	}
> > 
> > iow, work_pending() looks like a more "precise" indication that work->func()
> > is going to run soon.
> 
> Ah, but the timer routine may try to set the work item pending flag *after* the
> work_pending() check you have here.
No, delayed_work_timer_fn() doesn't set the _PENDING flag.
>                                      Furthermore, it would be better to avoid
> the work_pending() check entirely because that check involves interacting with
> atomic ops done on other CPUs.
Sure, the implementation of try_to_cancel_delayed_work() above is just for
illustration. I don't think we need try_to_cancel_delayed_work() at all.
>                                try_to_del_timer_sync() returning -1 tells us
> without a shadow of a doubt that the work item is either scheduled now or will
> be scheduled very shortly, thus allowing us to avoid having to do it ourself.
First, this is very unlikely event, delayed_work_timer_fn() is very fast unless
interrupted.
_PENDING flag won't be cleared until this work is executed by run_workqueue().
In generak, work_pending() after del_timer() is imho better way to avoid the
unneeded schedule_delayed_work().
But again, I can't undertand the win for that particular case.
Oleg.
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: Getting the new RxRPC patches upstream [message #18523 is a reply to message #18520] | Tue, 24 April 2007 19:34   |  
			| 
				
				
					|  Oleg Nesterov Messages: 143
 Registered: August 2006
 | Senior Member |  |  |  
	| On 04/24, David Howells wrote:
>
> Oleg Nesterov <oleg@tv-sign.ru> wrote:
> 
> > Sure, I'll grep for cancel_delayed_work(). But unless I missed something,
> > this change should be completely transparent for all users. Otherwise, it
> > is buggy.
> 
> I guess you will have to make sure that cancel_delayed_work() is always
> followed by a flush of the workqueue, otherwise you might get this situation:
> 
> 	CPU 0				CPU 1
> 	===============================	=======================
> 					<timer expires>
> 	cancel_delayed_work(x) == 0	-->delayed_work_timer_fn(x)
> 	kfree(x);			-->do_IRQ()
> 	y = kmalloc(); // reuses x
> 					<--do_IRQ()
> 					__queue_work(x)
> 	--- OOPS ---
> 
> That's my main concern.  If you are certain that can't happen, then fair
> enough.
Yes sure. Note that this is documented:
	/*
	 * Kill off a pending schedule_delayed_work().  Note that the work callback
	 * function may still be running on return from cancel_delayed_work().  Run
	 * flush_workqueue() or cancel_work_sync() to wait on it.
	 */
This comment is not very precise though. If the work doesn't re-arm itself,
we need cancel_work_sync() only if cancel_delayed_work() returns 0.
So there is no difference with the proposed change. Except, return value == 0
means:
	currently (del_timer_sync): callback may still be running or scheduled
	with del_timer: may still be running, or scheduled, or will be scheduled
	right now.
However, this is the same from the caller POV.
> Can you show me a patch illustrating exactly how you want to change
> cancel_delayed_work()?  I can't remember whether you've done so already, but
> if you have, I can't find it.  Is it basically this?:
> 
>  static inline int cancel_delayed_work(struct delayed_work *work)
>  {
>  	int ret;
> 
> -	ret = del_timer_sync(&work->timer);
> +	ret = del_timer(&work->timer);
>  	if (ret)
>  		work_release(&work->work);
>  	return ret;
>  }
Yes, exactly. The patch is trivial, but I need some time to write the
understandable changelog...
> I was thinking this situation might be a problem:
> 
> 	CPU 0				CPU 1
> 	===============================	=======================
> 					<timer expires>
> 	cancel_delayed_work(x) == 0	-->delayed_work_timer_fn(x)
> 	schedule_delayed_work(x,0)	-->do_IRQ()
> 	<keventd scheduled>
> 	x->work()
> 					<--do_IRQ()
> 					__queue_work(x)
> 
> But it won't, will it?
Yes, I think this should be OK. schedule_delayed_work() will notice
_PENDING and abort, so the last "x->work()" doesn't happen.
What can happen is
					<timer expires>
	cancel_delayed_work(x) == 0
					-->delayed_work_timer_fn(x)
					__queue_work(x)
					<keventd scheduled>
					x->work()
	schedule_delayed_work(x,0)
	<the work is scheduled again>
, so we can have an "unneeded schedule", but this is very unlikely.
Oleg.
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	|  |  
	|  |  
	|  |  
	|  |  
	| 
		
			| Re: [PATCH] ia64 sn xpc: Convert to use kthread API. [message #18547 is a reply to message #18331] | Thu, 26 April 2007 20:00   |  
			| 
				
				
					|  Dean Nelson Messages: 6
 Registered: April 2007
 | Junior Member |  |  |  
	| On Thu, Apr 19, 2007 at 01:58:44AM -0600, Eric W. Biederman wrote:
> From: Eric W. Biederman <ebiederm@xmission.com>
> 
> This patch starts the xpc kernel threads using kthread_run
> not a combination of kernel_thread and daemonize.  Resuling
> in slightly simpler and more maintainable code.
> 
> Cc: Jes Sorensen <jes@sgi.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  arch/ia64/sn/kernel/xpc_main.c |   31 +++++++++++++------------------
>  1 files changed, 13 insertions(+), 18 deletions(-)
Acked-by: Dean Nelson <dcn@sgi.com>
Andrew, I've tested Eric's patch in conjunction with a fix from Christoph
Lameter, which you've already got (it cleaned up a couple of compiler errors),
and the following patch that I'd like added (it cleans up a couple of
compiler warning errors and makes a few cosmetic changes).
Thanks,
Dean
Signed-off-by: Dean Nelson <dcn@sgi.com>
Index: mm-tree/arch/ia64/sn/kernel/xpc_main.c
===================================================================
--- mm-tree.orig/arch/ia64/sn/kernel/xpc_main.c	2007-04-25 14:04:51.701213426 -0500
+++ mm-tree/arch/ia64/sn/kernel/xpc_main.c	2007-04-26 06:29:02.447330438 -0500
@@ -568,7 +568,6 @@
 
 	task = kthread_run(xpc_activating, (void *) ((u64) partid),
 			  "xpc%02d", partid);
-
 	if (unlikely(IS_ERR(task))) {
 		spin_lock_irqsave(&part->act_lock, irq_flags);
 		part->act_state = XPC_P_INACTIVE;
@@ -808,7 +807,6 @@
 			int ignore_disconnecting)
 {
 	unsigned long irq_flags;
-	pid_t pid;
 	u64 args = XPC_PACK_ARGS(ch->partid, ch->number);
 	struct xpc_partition *part = &xpc_partitions[ch->partid];
 	struct task_struct *task;
@@ -840,7 +838,7 @@
 		(void) xpc_part_ref(part);
 		xpc_msgqueue_ref(ch);
 
-		task = kthread_run(xpc_daemonize_kthread, args,
+		task = kthread_run(xpc_daemonize_kthread, (void *) args,
 				   "xpc%02dc%d", ch->partid, ch->number);
 		if (IS_ERR(task)) {
 			/* the fork failed */
@@ -1381,7 +1379,8 @@
 	 * activate based on info provided by SAL. This new thread is short
 	 * lived and will exit once discovery is complete.
 	 */
-	task = kthread_run(xpc_initiate_discovery, NULL, XPC_DISCOVERY_THREAD_NAME);
+	task = kthread_run(xpc_initiate_discovery, NULL,
+			   XPC_DISCOVERY_THREAD_NAME);
 	if (IS_ERR(task)) {
 		dev_err(xpc_part, "failed while forking discovery thread\n");
 
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: [PATCH] ia64 sn xpc: Convert to use kthread API. [message #18556 is a reply to message #18294] | Fri, 27 April 2007 18:34   |  
			| 
				
				
					|  ebiederm Messages: 1354
 Registered: February 2006
 | Senior Member |  |  |  
	| Dean Nelson <dcn@sgi.com> writes:
>
> Christoph is correct in that XPC has a single thread that exists throughout
> its lifetime, another set of threads that exist for the time that active
> contact with other XPCs running on other SGI system partitions exists, and
> finally there is a pool of threads that exist on an as needed basis once
> a channel connection has been established between two partitions.
>
> In principle I approve of the kthread API and its use as opposed to what
> XPC currently does (calls kernel_thread(), daemonize(), wait_for_completion(),
> and complete()). So Christoph's patch that changes the single long-lived
> thread to use kthread_stop() and kthread_should_stop() is appreciated.
>
> But the fact that another thread, started at the xpc_init() time, that does
> discovery of other SGI system partitions wasn't converted points out a
> weekness in either my thinking or the kthread API. This discovery thread
> does its job and then exits. Should XPC be rmmod'd while the discovery
> thread is still running we would need to do a kthread_stop() against it. 
> But kthread_stop() isn't set up to deal with a task that has already exited.
> And if what once was the task structure of this exited task has been
> reallocated to another new task, we'd end up stopping it should it be
> operating under the kthread API, or possibly waiting a very long time
> for it to exit if it is not.
Patches are currently under development to allow kthreads to exit
before kthread_stop is called.  The big thing is that once we allow
kernel threads that exited by themselves to be reaped by kthread_stop
we have some significant work to do.
> I'm also a little uneasy that kthread_stop() has an "only one thread can
> stop another thread at a time" design. It's a potential bottleneck on
> very large systems where threads are blocked and unable to respond to a
> kthread_should_stop() for some period of time.
There are already patches out there to fix this issue.
> XPC is in need of threads that can block indefinitely, which is why XPC
> is in the business of maintaining a pool of threads. Currently there is
> no such capability (that I know of) that is provided by linux. Workqueues
> can't block indefinitely.
I'm not certain I understand this requirement.  Do you mean block indefinitely
unless requested to stop?
> And for performance reasons these threads need to be able to be created
> quickly. These threads are involved in delivering messages to XPC's users
> (like XPNET) and we had latency issues that led us to use kernel_thread()
> directly instead of the kthread API. Additionally, XPC may need to have
> hundreds of these threads active at any given time.
Ugh.  Can you tell me a little more about the latency issues?
Is having a non-halting kthread_create enough to fix this?
So you don't have to context switch several times to get the
thread running?
Or do you need more severe latency reductions?
The more severe fix would require some significant changes to copy_process
and every architecture would need to be touched to fix up copy_thread.
It is possible, it is a lot of work, and the reward is far from obvious.
> I think it would be great if the kthread API (or underlying implementation)
> could be changed to handle these issues. I'd love for XPC to not have to
> maintain this sort of thing itself.
Currently daemonize is a serious maintenance problem.
Using daemonize and kernel_thread to create kernel threads is a blocker
in implementing the pid namespace because of their use of pid_t.
So I am motivated to get this fixed.
Eric
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: [PATCH] ia64 sn xpc: Convert to use kthread API. [message #18557 is a reply to message #18485] | Fri, 27 April 2007 17:41   |  
			| 
				
				
					|  Dean Nelson Messages: 6
 Registered: April 2007
 | Junior Member |  |  |  
	| On Thu, Apr 19, 2007 at 04:51:03PM -0700, Andrew Morton wrote:
> Another driver which should be fully converted to the kthread API:
> kthread_stop() and kthread_should_stop().
> 
> And according to my logs, this driver was added to the tree more than
> a year _after_ the kthread interface was made available.
> 
> This isn't good.
On Sun, Apr 22, 2007 at 09:36:47PM +0100, Christoph Hellwig wrote:
> On Thu, Apr 19, 2007 at 01:58:44AM -0600, Eric W. Biederman wrote:
> > From: Eric W. Biederman <ebiederm@xmission.com>
> >
> > This patch starts the xpc kernel threads using kthread_run
> > not a combination of kernel_thread and daemonize.  Resuling
> > in slightly simpler and more maintainable code.
>
> This driver is a really twisted maze.  It has a lot of threads,
> some of them running through the whole lifetime of the driver,
> some short-lived and some in a sort of a pool.
>
> The patch below fixes up the long-lived thread as well as fixing
> gazillions of leaks in the init routine by switching to proper
> goto-based unwinding.
I see that the setting of 'xpc_rsvd_page->vars_pa = 0;' in xpc_init() is
considered a leak by Christoph (hch), but it really is not. If you look at
xpc_rsvd_page_init() where it is set up, you might see that the reserved page
is something XPC gets from SAL who created it at system boot time. If XPC is
rmmod'd and insmod'd again, it will be handed the same page of memory by SAL.
So there is no memory leak here.
As for the other suspected leaks mentioned I'm not sure what they could be.
It may be the fact that XPC continues coming up when presented with error
returns from register_reboot_notifier() and register_die_notifier(). There
is no leak in this, just simply a loss of an early notification to other
SGI system partitions that this partition is going down. A fact they will
sooner or later discover on their own. A notice of this degraded
functionality is written to the console. And the likelyhood that these
functions should ever return an error is very, very small (currently,
neither of them has an error return).
>From my experience the goto-based unwinding of error returns is not necessarily
a superior approach. I spent a month tracking down a difficult to reproduce
problem that ended up being an error return jumping to the wrong label in a
goto-based unwind. Problems can arise with either approach. I've also seen
the compiler generate less code for the non-goto approach. I'm not a compiler
person so I can't explain this, nor can I say that it's always the case, but
at one time when I did a bake off between the two approaches the non-goto
approach generated less code.
Having said this I have no problem with switching to a goto-based unwinding
of errors if that is what the community prefers. I personally find it more
readable than the non-goto approach.
> Note that thread pools are something we have in a few places,
> and might be worth handling in the core kthread infrastructure,
> as tearing down pools will get a bit complicated using the
> kthread APIs.
Christoph is correct in that XPC has a single thread that exists throughout
its lifetime, another set of threads that exist for the time that active
contact with other XPCs running on other SGI system partitions exists, and
finally there is a pool of threads that exist on an as needed basis once
a channel connection has been established between two partitions.
In principle I approve of the kthread API and its use as opposed to what
XPC currently does (calls kernel_thread(), daemonize(), wait_for_completion(),
and complete()). So Christoph's patch that changes the single long-lived
thread to use kthread_stop() and kthread_should_stop() is appreciated.
But the fact that another thread, started at the xpc_init() time, that does
discovery of other SGI system partitions wasn't converted points out a
weekness in either my thinking or the kthread API. This discovery thread
does its job and then exits. Should XPC be rmmod'd while the discovery
thread is still running we would need to do a kthread_stop() against it. 
But kthread_stop() isn't set up to deal with a task that has already exited.
And if what once was the task structure of this exited task has been
reallocated to another new task, we'd end up stopping it should it be
operating under the kthread API, or possibly waiting a very long time
for it to exit if it is not.
I'm also a little uneasy that kthread_stop() has an "only one thread can
stop another thread at a time" design. It's a potential bottleneck on
very large systems where threads are blocked and unable to respond to a
kthread_should_stop() for some period of time.
XPC is in need of threads that can block indefinitely, which is why XPC
is in the business of maintaining a pool of threads. Currently there is
no such capability (that I know of) that is provided by linux. Workqueues
can't block indefinitely.
And for performance reasons these threads need to be able to be created
quickly. These threads are involved in delivering messages to XPC's users
(like XPNET) and we had latency issues that led us to use kernel_thread()
directly instead of the kthread API. Additionally, XPC may need to have
hundreds of these threads active at any given time.
I think it would be great if the kthread API (or underlying implementation)
could be changed to handle these issues. I'd love for XPC to not have to
maintain this sort of thing itself.
Dean
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: [PATCH] ia64 sn xpc: Convert to use kthread API. [message #18569 is a reply to message #18294] | Fri, 27 April 2007 20:33   |  
			| 
				
				
					|  ebiederm Messages: 1354
 Registered: February 2006
 | Senior Member |  |  |  
	| Dean Nelson <dcn@sgi.com> writes:
> On Fri, Apr 27, 2007 at 12:34:02PM -0600, Eric W. Biederman wrote:
>> Dean Nelson <dcn@sgi.com> writes:
>> >
>> > Christoph is correct in that XPC has a single thread that exists throughout
>> > its lifetime, another set of threads that exist for the time that active
>> > contact with other XPCs running on other SGI system partitions exists, and
>> > finally there is a pool of threads that exist on an as needed basis once
>> > a channel connection has been established between two partitions.
>> >
>> > In principle I approve of the kthread API and its use as opposed to what
>> > XPC currently does (calls kernel_thread(), daemonize(),
> wait_for_completion(),
>> > and complete()). So Christoph's patch that changes the single long-lived
>> > thread to use kthread_stop() and kthread_should_stop() is appreciated.
>> >
>> > But the fact that another thread, started at the xpc_init() time, that does
>> > discovery of other SGI system partitions wasn't converted points out a
>> > weekness in either my thinking or the kthread API. This discovery thread
>> > does its job and then exits. Should XPC be rmmod'd while the discovery
>> > thread is still running we would need to do a kthread_stop() against it. 
>> > But kthread_stop() isn't set up to deal with a task that has already exited.
>> > And if what once was the task structure of this exited task has been
>> > reallocated to another new task, we'd end up stopping it should it be
>> > operating under the kthread API, or possibly waiting a very long time
>> > for it to exit if it is not.
>> 
>> Patches are currently under development to allow kthreads to exit
>> before kthread_stop is called.  The big thing is that once we allow
>> kernel threads that exited by themselves to be reaped by kthread_stop
>> we have some significant work to do.
>> 
>> > I'm also a little uneasy that kthread_stop() has an "only one thread can
>> > stop another thread at a time" design. It's a potential bottleneck on
>> > very large systems where threads are blocked and unable to respond to a
>> > kthread_should_stop() for some period of time.
>> 
>> There are already patches out there to fix this issue.
>> 
>> > XPC is in need of threads that can block indefinitely, which is why XPC
>> > is in the business of maintaining a pool of threads. Currently there is
>> > no such capability (that I know of) that is provided by linux. Workqueues
>> > can't block indefinitely.
>> 
>> I'm not certain I understand this requirement.  Do you mean block indefinitely
>> unless requested to stop?
>
> These threads can block waiting for a hardware DMA engine, which has a 28
> second timeout setpoint.
Ok.  So this is an interruptible sleep?
Do you have any problems being woken up out of that interruptible sleep
by kthread_stop?
I am in the process of modifying kthread_stop to wake up thread in an
interruptible sleep and set signal_pending, so they will break out.
>> > And for performance reasons these threads need to be able to be created
>> > quickly. These threads are involved in delivering messages to XPC's users
>> > (like XPNET) and we had latency issues that led us to use kernel_thread()
>> > directly instead of the kthread API. Additionally, XPC may need to have
>> > hundreds of these threads active at any given time.
>> 
>> Ugh.  Can you tell me a little more about the latency issues?
>
> After placing a message in a local message queue, one SGI system partition
> will interrupt another to retrieve the message. We need to minimize the
> time from entering XPC's interrupt handler to the time that the message
> can be DMA transferred and delivered to the consumer (like XPNET) to
> whom it was sent.
>
>> Is having a non-halting kthread_create enough to fix this?
>> So you don't have to context switch several times to get the
>> thread running?
>> 
>> Or do you need more severe latency reductions?
>> 
>> The more severe fix would require some significant changes to copy_process
>> and every architecture would need to be touched to fix up copy_thread.
>> It is possible, it is a lot of work, and the reward is far from obvious.
>
> I think a non-halting kthread_create() should be sufficient. It is in
> effect what XPC has now in calling kernel_thread() directly.
A little different but pretty close.
We call kthread_create() it prepares everything and places it on
a queue and wakes up kthreadd.
kthreadd then wakes up and forks the thread.
After the thread has finishing setting up it will call complete on
a completion so kthread_create can continue on it's merry way
but it should not need to go to sleep waiting for someone to
call kthread_bind.
But if you can live with what I have just described that will
be easy to code up.
It is a little slower then kernel_thread but hopefully not much.
> Taking it one step further, if you added the notion of a thread pool,
> where upon exit, a thread isn't destroyed but rather is queued ready to
> handle the next kthread_create_quick() request.
That might happen.  So far I am avoiding the notion of a thread pool for
as long as I can.  There is some sense in it, especially in generalizing
the svc thread pool code from nfs.  But if I don't have to go there I would
prefer it.
>> > I think it would be great if the kthread API (or underlying implementation)
>> > could be changed to handle these issues. I'd love for XPC to not have to
>> > maintain this sort of thing itself.
>> 
>> Currently daemonize is a serious maintenance problem.
>> 
>> Using daemonize and kernel_thread to create kernel threads is a blocker
>> in implementing the pid namespace because of their use of pid_t.
>> 
>> So I am motivated to get this fixed.
>
> This would also address the problems we see with huge pid spaces for
> kernel threads on our largest machines.  In the example from last week,
> we had 10 threads each on 4096 cpus.  If we reworked work_queues to use
> the kthread_create_nonblocking() thread pool, we could probably collapse
> the need for having all of those per-task, per-cpu work queues.
Patches have already been sent (and I don't think found problems with)
that make kthreadd pid == 2, and they also modify daemonize to reparent
to kthreadd instead of init.
Eric
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: [PATCH] ia64 sn xpc: Convert to use kthread API. [message #18570 is a reply to message #18556] | Fri, 27 April 2007 20:12   |  
			| 
				
				
					|  Dean Nelson Messages: 6
 Registered: April 2007
 | Junior Member |  |  |  
	| On Fri, Apr 27, 2007 at 12:34:02PM -0600, Eric W. Biederman wrote:
> Dean Nelson <dcn@sgi.com> writes:
> >
> > Christoph is correct in that XPC has a single thread that exists throughout
> > its lifetime, another set of threads that exist for the time that active
> > contact with other XPCs running on other SGI system partitions exists, and
> > finally there is a pool of threads that exist on an as needed basis once
> > a channel connection has been established between two partitions.
> >
> > In principle I approve of the kthread API and its use as opposed to what
> > XPC currently does (calls kernel_thread(), daemonize(), wait_for_completion(),
> > and complete()). So Christoph's patch that changes the single long-lived
> > thread to use kthread_stop() and kthread_should_stop() is appreciated.
> >
> > But the fact that another thread, started at the xpc_init() time, that does
> > discovery of other SGI system partitions wasn't converted points out a
> > weekness in either my thinking or the kthread API. This discovery thread
> > does its job and then exits. Should XPC be rmmod'd while the discovery
> > thread is still running we would need to do a kthread_stop() against it. 
> > But kthread_stop() isn't set up to deal with a task that has already exited.
> > And if what once was the task structure of this exited task has been
> > reallocated to another new task, we'd end up stopping it should it be
> > operating under the kthread API, or possibly waiting a very long time
> > for it to exit if it is not.
> 
> Patches are currently under development to allow kthreads to exit
> before kthread_stop is called.  The big thing is that once we allow
> kernel threads that exited by themselves to be reaped by kthread_stop
> we have some significant work to do.
> 
> > I'm also a little uneasy that kthread_stop() has an "only one thread can
> > stop another thread at a time" design. It's a potential bottleneck on
> > very large systems where threads are blocked and unable to respond to a
> > kthread_should_stop() for some period of time.
> 
> There are already patches out there to fix this issue.
> 
> > XPC is in need of threads that can block indefinitely, which is why XPC
> > is in the business of maintaining a pool of threads. Currently there is
> > no such capability (that I know of) that is provided by linux. Workqueues
> > can't block indefinitely.
> 
> I'm not certain I understand this requirement.  Do you mean block indefinitely
> unless requested to stop?
These threads can block waiting for a hardware DMA engine, which has a 28
second timeout setpoint.
> > And for performance reasons these threads need to be able to be created
> > quickly. These threads are involved in delivering messages to XPC's users
> > (like XPNET) and we had latency issues that led us to use kernel_thread()
> > directly instead of the kthread API. Additionally, XPC may need to have
> > hundreds of these threads active at any given time.
> 
> Ugh.  Can you tell me a little more about the latency issues?
After placing a message in a local message queue, one SGI system partition
will interrupt another to retrieve the message. We need to minimize the
time from entering XPC's interrupt handler to the time that the message
can be DMA transferred and delivered to the consumer (like XPNET) to
whom it was sent.
> Is having a non-halting kthread_create enough to fix this?
> So you don't have to context switch several times to get the
> thread running?
> 
> Or do you need more severe latency reductions?
> 
> The more severe fix would require some significant changes to copy_process
> and every architecture would need to be touched to fix up copy_thread.
> It is possible, it is a lot of work, and the reward is far from obvious.
I think a non-halting kthread_create() should be sufficient. It is in
effect what XPC has now in calling kernel_thread() directly.
Taking it one step further, if you added the notion of a thread pool,
where upon exit, a thread isn't destroyed but rather is queued ready to
handle the next kthread_create_quick() request.
> > I think it would be great if the kthread API (or underlying implementation)
> > could be changed to handle these issues. I'd love for XPC to not have to
> > maintain this sort of thing itself.
> 
> Currently daemonize is a serious maintenance problem.
> 
> Using daemonize and kernel_thread to create kernel threads is a blocker
> in implementing the pid namespace because of their use of pid_t.
> 
> So I am motivated to get this fixed.
This would also address the problems we see with huge pid spaces for
kernel threads on our largest machines.  In the example from last week,
we had 10 threads each on 4096 cpus.  If we reworked work_queues to use
the kthread_create_nonblocking() thread pool, we could probably collapse
the need for having all of those per-task, per-cpu work queues.
Dean
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: [PATCH] cpci_hotplug: Convert to use the kthread API [message #18571 is a reply to message #18463] | Fri, 27 April 2007 22:07   |  
			| 
				
				
					|  Scott Murray Messages: 4
 Registered: April 2007
 | Junior Member |  |  |  
	| On Sun, 22 Apr 2007, Christoph Hellwig wrote:
> On Thu, Apr 19, 2007 at 12:55:29AM -0600, Eric W. Biederman wrote:
> > From: Eric W. Biederman <ebiederm@xmission.com> - unquoted
> > 
> > kthread_run replaces the kernel_thread and  daemonize calls
> > during thread startup.
> > 
> > Calls to signal_pending were also removed as it is currently
> > impossible for the cpci_hotplug thread to receive signals.
> 
> This drivers thread are a bit of a miss, although a lot better than
> most other pci hotplug drivers :)
> 
> Below is more complete conversion to the kthread infrastructure +
> wake_up_process to wake the thread.  Note that we had to keep
> a thread_finished variable because the existing one had dual use.
Sorry, it took me a few days to get to testing this out.  It looks good,    
but I had to make a couple of tweaks to avoid a hang when rmmod'ing a
board driver.  The board drivers do:
cpci_hp_stop()
cpci_hp_unregister_controller(controller)
to shutdown, and the check in cpci_hp_unregister_controller if the thread
is running wasn't working due to a bit too much code being excised.  The
result was kthread_stop being called twice, which hangs.  I've indicated 
my changes to avoid this inline below.
Scott
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Scott Murray <scottm@somanetworks.com>
> Index: linux-2.6/drivers/pci/hotplug/cpci_hotplug_core.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/hotplug/cpci_hotplug_core.c	2007-04-22 12:54:17.000000000 +0200
> +++ linux-2.6/drivers/pci/hotplug/cpci_hotplug_core.c	2007-04-22 13:01:42.000000000 +0200
> @@ -35,6 +35,7 @@
>  #include <linux/smp_lock.h>
>  #include <asm/atomic.h>
>  #include <linux/delay.h>
> +#include <linux/kthread.h>
>  #include "cpci_hotplug.h"
>  
>  #define DRIVER_AUTHOR	"Scott Murray <scottm@somanetworks.com>"
> @@ -59,9 +60,8 @@ static int slots;
>  static atomic_t extracting;
>  int cpci_debug;
>  static struct cpci_hp_controller *controller;
> -static struct semaphore event_semaphore;	/* mutex for process loop (up if something to process) */
> -static struct semaphore thread_exit;		/* guard ensure thread has exited before calling it quits */
> -static int thread_finished = 1;
> +static struct task_struct *cpci_thread;
> +static int thread_finished;
>  
>  static int enable_slot(struct hotplug_slot *slot);
>  static int disable_slot(struct hotplug_slot *slot);
> @@ -357,9 +357,7 @@ cpci_hp_intr(int irq, void *data)
>  	controller->ops->disable_irq();
>  
>  	/* Trigger processing by the event thread */
> -	dbg("Signal event_semaphore");
> -	up(&event_semaphore);
> -	dbg("exited cpci_hp_intr");
> +	wake_up_process(cpci_thread);
>  	return IRQ_HANDLED;
>  }
>  
> @@ -521,17 +519,12 @@ event_thread(void *data)
>  {
>  	int rc;
>  
> -	lock_kernel();
> -	daemonize("cpci_hp_eventd");
> -	unlock_kernel();
> -
>  	dbg("%s - event thread started", __FUNCTION__);
>  	while (1) {
>  		dbg("event thread sleeping");
> -		down_interruptible(&event_semaphore);
> -		dbg("event thread woken, thread_finished = %d",
> -		    thread_finished);
> -		if (thread_finished || signal_pending(current))
> +		set_current_state(TASK_INTERRUPTIBLE);
> +		schedule();
> +		if (kthread_should_stop())
>  			break;
>  		do {
>  			rc = check_slots();
> @@ -541,18 +534,17 @@ event_thread(void *data)
>  			} else if (rc < 0) {
>  				dbg("%s - error checking slots", __FUNCTION__);
>  				thread_finished = 1;
> -				break;
> +				goto out;
>  			}
> -		} while (atomic_read(&extracting) && !thread_finished);
> -		if (thread_finished)
> +		} while (atomic_read(&extracting) && !kthread_should_stop());
> +		if (kthread_should_stop())
>  			break;
>  
>  		/* Re-enable ENUM# interrupt */
>  		dbg("%s - re-enabling irq", __FUNCTION__);
>  		controller->ops->enable_irq();
>  	}
> -	dbg("%s - event thread signals exit", __FUNCTION__);
> -	up(&thread_exit);
> + out:
>  	return 0;
>  }
>  
> @@ -562,12 +554,8 @@ poll_thread(void *data)
>  {
>  	int rc;
>  
> -	lock_kernel();
> -	daemonize("cpci_hp_polld");
> -	unlock_kernel();
> -
>  	while (1) {
> -		if (thread_finished || signal_pending(current))
> +		if (kthread_should_stop() || signal_pending(current))
>  			break;
>  		if (controller->ops->query_enum()) {
>  			do {
> @@ -578,48 +566,34 @@ poll_thread(void *data)
>  				} else if (rc < 0) {
>  					dbg("%s - error checking slots", __FUNCTION__);
>  					thread_finished = 1;
> -					break;
> +					goto out;
>  				}
> -			} while (atomic_read(&extracting) && !thread_finished);
> +			} while (atomic_read(&extracting) && !kthread_should_stop());
>  		}
>  		msleep(100);
>  	}
> -	dbg("poll thread signals exit");
> -	up(&thread_exit);
> + out:
>  	return 0;
>  }
>  
>  static int
>  cpci_start_thread(void)
>  {
> -	int pid;
> -
> -	/* initialize our semaphores */
> -	init_MUTEX_LOCKED(&event_semaphore);
> -	init_MUTEX_LOCKED(&thread_exit);
> -	thread_finished = 0;
> -
>  	if (controller->irq)
> -		pid = kernel_thread(event_thread, NULL, 0);
> +		cpci_thread = kthread_run(event_thread, NULL, "cpci_hp_eventd");
>  	else
> -		pid = kernel_thread(poll_thread, NULL, 0);
> -	if (pid < 0) {
> +		cpci_thread = kthread_run(poll_thread, NULL, "cpci_hp_polld");
> +	if (IS_ERR(cpci_thread)) {
>  		err("Can't start up our thread");
> -		return -1;
> +		return PTR_ERR(cpci_thread);
>  	}
> -	dbg("Our thread pid = %d", pid);
There still needs to be a:
        thread_finished = 0;
here, so that things work if a board driver is insmod'ed again after a 
rmmod (i.e. cpci_hp_start after a cpci_hp_stop).
>  	return 0;
>  }
>  
>  static void
>  cpci_stop_thread(void)
>  {
> -	thread_finished = 1;
> -	dbg("thread finish command given");
> -	if (controller->irq)
> -		up(&event_semaphore);
> -	dbg("wait for thread to exit");
> -	down(&thread_exit);
> +	kthread_stop(cpci_thread);
As well, there still needs to be a:
        thread_finished = 1;
                         
here to make the check in cpci_hp_unregister_controller work.
>  }
>  
>  int
 
-- 
Scott Murray
SOMA Networks, Inc.
Toronto, Ontario
e-mail: scottm@somanetworks.com
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: [PATCH] ia64 sn xpc: Convert to use kthread API. [message #18572 is a reply to message #18569] | Mon, 30 April 2007 15:22   |  
			| 
				
				
					|  Dean Nelson Messages: 6
 Registered: April 2007
 | Junior Member |  |  |  
	| On Fri, Apr 27, 2007 at 02:33:32PM -0600, Eric W. Biederman wrote:
> Dean Nelson <dcn@sgi.com> writes:
> 
> > On Fri, Apr 27, 2007 at 12:34:02PM -0600, Eric W. Biederman wrote:
> >> Dean Nelson <dcn@sgi.com> writes:
> >> >
> >> > XPC is in need of threads that can block indefinitely, which is why XPC
> >> > is in the business of maintaining a pool of threads. Currently there is
> >> > no such capability (that I know of) that is provided by linux. Workqueues
> >> > can't block indefinitely.
> >> 
> >> I'm not certain I understand this requirement.  Do you mean block indefinitely
> >> unless requested to stop?
> >
> > These threads can block waiting for a hardware DMA engine, which has a 28
> > second timeout setpoint.
> 
> Ok.  So this is an interruptible sleep?
No, the hardware DMA engine's software interface, doesn't sleep
nor relinquish the CPU. But there are other spots where we do sleep
interruptibly.
> Do you have any problems being woken up out of that interruptible sleep
> by kthread_stop?
> 
> I am in the process of modifying kthread_stop to wake up thread in an
> interruptible sleep and set signal_pending, so they will break out.
No, this is fine, just avoid designing the kthread stop mechanism
to require a thread being requested to stop to actually stop in some
finite amount of time, and that by it not stopping other kthread stop
requests are held off. Allow the thread to take as much time as it needs
to respond to the kthread stop request.
> >> > And for performance reasons these threads need to be able to be created
> >> > quickly. These threads are involved in delivering messages to XPC's users
> >> > (like XPNET) and we had latency issues that led us to use kernel_thread()
> >> > directly instead of the kthread API. Additionally, XPC may need to have
> >> > hundreds of these threads active at any given time.
> >> 
> >> Ugh.  Can you tell me a little more about the latency issues?
> >
> > After placing a message in a local message queue, one SGI system partition
> > will interrupt another to retrieve the message. We need to minimize the
> > time from entering XPC's interrupt handler to the time that the message
> > can be DMA transferred and delivered to the consumer (like XPNET) to
> > whom it was sent.
> >
> >> Is having a non-halting kthread_create enough to fix this?
> >> So you don't have to context switch several times to get the
> >> thread running?
> >> 
> >> Or do you need more severe latency reductions?
> >> 
> >> The more severe fix would require some significant changes to copy_process
> >> and every architecture would need to be touched to fix up copy_thread.
> >> It is possible, it is a lot of work, and the reward is far from obvious.
> >
> > I think a non-halting kthread_create() should be sufficient. It is in
> > effect what XPC has now in calling kernel_thread() directly.
> 
> A little different but pretty close.
> 
> We call kthread_create() it prepares everything and places it on
> a queue and wakes up kthreadd.
> 
> kthreadd then wakes up and forks the thread.
> 
> After the thread has finishing setting up it will call complete on
> a completion so kthread_create can continue on it's merry way
> but it should not need to go to sleep waiting for someone to
> call kthread_bind.
> 
> But if you can live with what I have just described that will
> be easy to code up.
> 
> It is a little slower then kernel_thread but hopefully not much.
I was aware of this behavior of kthread_create(), which I consider
'halting' in that the thread doing the kthread_create() blocks waiting
for kthreadd to get scheduled, call kernel_thread(), and then call
complete(). By your mentioning a 'non-halting' kthread_create() I
thought you were planning to create a new flavor of kthread_create()
that called kernel_thread() directly and reparented the child thread to
kthreadd. My mistake.
So there will be more overhead (time-wise) for XPC in calling
kthread_run() as opposed to it formerly calling kernel_thread() directly.
Thus requiring XPC to utilize a pool of kthread_create()'d threads.
> > Taking it one step further, if you added the notion of a thread pool,
> > where upon exit, a thread isn't destroyed but rather is queued ready to
> > handle the next kthread_create_quick() request.
> 
> That might happen.  So far I am avoiding the notion of a thread pool for
> as long as I can.  There is some sense in it, especially in generalizing
> the svc thread pool code from nfs.  But if I don't have to go there I would
> prefer it.
This means that XPC will have to retain its thread pool, but I can
understand you not wanting to go there.
Thanks,
Dean
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH] md: Remove broken SIGKILL support [message #18574 is a reply to message #18315] | Tue, 01 May 2007 06:13   |  
			| 
				
				
					|  ebiederm Messages: 1354
 Registered: February 2006
 | Senior Member |  |  |  
	| Neil Brown <neilb@suse.de> writes:
> On Thursday April 19, ebiederm@xmission.com wrote:
>> From: Eric W. Biederman <ebiederm@xmission.com>
>> 
>> Currently md_thread calls allow_signal so it can receive a
>> SIGKILL but then does nothing with it except flush the
>> sigkill so that it not can use an interruptible sleep.
>> 
>> This whole dance is silly so remove the unnecessary
>> and broken signal handling logic.
>
> (sorry of the delay in replying)
>
> You missed some related code which should help you see that it is -
> maybe - not completely 'silly' (though I confess it might be slightly
> broken).
> In md_check_recovery:
>
> 	if (signal_pending(current)) {
> 		if (mddev->pers->sync_request) {
> 			printk(KERN_INFO "md: %s in immediate safe mode\n",
> 			       mdname(mddev));
> 			mddev->safemode = 2;
> 		}
> 		flush_signals(current);
> 	}
Thanks.
> The idea is that alt-sysrq-K will send SIGKILL to all processes
> including the md support threads, which will cause them to enter
> "immediate safe mode" so that the metadata will be marked clean
> immediately at every opportunity.  That way you can use alt-sysrq:
>   sync,unmount,kill,reboot
> and be fairly sure that you md array will be shut down cleanly.
>
> I'd be just as happy to link this into Unmount (aka
> do_emergency_remount), but that doesn't seem at all straight forward,
> and in any case should be done before the current code is ripped out.
>
> While we do have a reboot_notifier which tries to stop all arrays,
> I've never been comfortable with that.  A reboot really should just
> reboot... 
>
> What I would REALLY like is for the block device to know whether it is
> open read-only or read-write.  Then I could mark it clean when it
> becomes read-only as would happen when do_emergency_remount remounts
> it read-only.
>
> I might see how hard that would be...
My goal to get signals to kernel threads out of the user space interface
especially for non-privileged processes, so everything that we do with
kernel threads can just be an unimportant implementation detail to user
space.
Eric
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: [PATCH] md: Remove broken SIGKILL support [message #18575 is a reply to message #18352] | Tue, 01 May 2007 00:47   |  
			| 
				
				
					|  Neil Brown Messages: 6
 Registered: October 2006
 | Junior Member |  |  |  
	| On Thursday April 19, ebiederm@xmission.com wrote:
> From: Eric W. Biederman <ebiederm@xmission.com>
> 
> Currently md_thread calls allow_signal so it can receive a
> SIGKILL but then does nothing with it except flush the
> sigkill so that it not can use an interruptible sleep.
> 
> This whole dance is silly so remove the unnecessary
> and broken signal handling logic.
(sorry of the delay in replying)
You missed some related code which should help you see that it is -
maybe - not completely 'silly' (though I confess it might be slightly
broken).
In md_check_recovery:
	if (signal_pending(current)) {
		if (mddev->pers->sync_request) {
			printk(KERN_INFO "md: %s in immediate safe mode\n",
			       mdname(mddev));
			mddev->safemode = 2;
		}
		flush_signals(current);
	}
The idea is that alt-sysrq-K will send SIGKILL to all processes
including the md support threads, which will cause them to enter
"immediate safe mode" so that the metadata will be marked clean
immediately at every opportunity.  That way you can use alt-sysrq:
  sync,unmount,kill,reboot
and be fairly sure that you md array will be shut down cleanly.
I'd be just as happy to link this into Unmount (aka
do_emergency_remount), but that doesn't seem at all straight forward,
and in any case should be done before the current code is ripped out.
While we do have a reboot_notifier which tries to stop all arrays,
I've never been comfortable with that.  A reboot really should just
reboot... 
What I would REALLY like is for the block device to know whether it is
open read-only or read-write.  Then I could mark it clean when it
becomes read-only as would happen when do_emergency_remount remounts
it read-only.
I might see how hard that would be...
NeilBrown
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: [PATCH] ia64 sn xpc: Convert to use kthread API. [message #18576 is a reply to message #18294] | Wed, 02 May 2007 15:44   |  
			| 
				
				
					|  ebiederm Messages: 1354
 Registered: February 2006
 | Senior Member |  |  |  
	| Dean Nelson <dcn@sgi.com> writes:
> On Mon, Apr 30, 2007 at 10:22:30AM -0500, Dean Nelson wrote:
>> On Fri, Apr 27, 2007 at 02:33:32PM -0600, Eric W. Biederman wrote:
>> > Dean Nelson <dcn@sgi.com> writes:
>> > 
>> > > Taking it one step further, if you added the notion of a thread pool,
>> > > where upon exit, a thread isn't destroyed but rather is queued ready to
>> > > handle the next kthread_create_quick() request.
>> > 
>> > That might happen.  So far I am avoiding the notion of a thread pool for
>> > as long as I can.  There is some sense in it, especially in generalizing
>> > the svc thread pool code from nfs.  But if I don't have to go there I would
>> > prefer it.
>> 
>> This means that XPC will have to retain its thread pool, but I can
>> understand you not wanting to go there.
>
> On Thu, Apr 26, 2007 at 01:11:15PM -0600, Eric W. Biederman wrote:
>>
>> Ok.    Because of the module unloading issue, and because we don't have
>> a lot of these threads running around, the current plan is to fix
>> thread_create and kthread_stop so that they must always be paired,
>> and so that kthread_stop will work correctly if the task has already
>> exited.
>>
>> Basically that just involves calling get_task_struct in kthread_create
>> and put_task_struct in kthread_stop.
>
> Okay, so I need to expand upon Christoph Hellwig's patch so that all
> the kthread_create()'d threads are kthread_stop()'d.
>
> This is easy to do for the XPC thread that exists for the lifetime of XPC,
> as well as for the threads created to manage the SGI system partitions.
>
> XPC has the one discovery thread that is created when XPC is first started
> and exits as soon as it has finished discovering all existing SGI system
> partitions. With your forthcoming change to kthread_stop() that will allow
> it to be called after the thread has exited, doing this one is also easy.
> Note that the kthread_stop() for this discovery thread won't occur until
> XPC is rmmod'd. This means that its task_struct will not be freed for
> possibly a very long time (i.e., weeks). Is that a problem?
As long as there is only one, not really.  It would be good if we could
get rid of it though.
The practical problem is the race with rmmod, in particular if someone
calls rmmod while this thread is still running.
If I get clever I think this is likely solvable with something like.
kthread_maybe_stop(struct task_struct **loc)
{
        struct task_struct *tsk;
        tsk = xchg(loc, NULL);
        if (tsk)
        	kthread_stop(tsk);
}
kthread_stop_self(struct task_struct **loc, int exit_code)
{
        struct task_struct *tsk;
        
        tsk = xchg(loc, NULL);
        if (tsk)
        	put_task_struct(tsk);
	do_exit(tsk);                
}
I'm not quite convinced that is a common enough paradigm to implement
that.
> But then we come to XPC's pool of threads that deliver channel messages
> to the appropriate consumer (like XPNET) and can block indefinitely. As
> mentioned earlier there could be hundreds if not thousands of these
> (our systems keep getting bigger). So now requiring a kthread_stop()
> for each one of these becomes more of a problem, as it is a lot of
> task_struct pointers to maintain.
>
> Currently, XPC maintains these threads via a
> wait_event_interruptible_exclusive() queue so that it can wakeup as many
> or as few as needed at any given moment by calling wake_up_nr(). When XPC
> is rmmod'd, a flag is set which causes them to exit and wake_up_all()
> is called.  Therefore XPC dosen't need to remember their pids or
> task_struct pointers.
>
> So what would you suggest we do for this pool of threads?
Good question.
The whole concept of something that feels like a core part of the
platform code being modular I'm still looking at strange.
> Is there any way to have a version of kthread_create() that doesn't
> require a matching kthread_stop()? Or add a kthread_not_stopping()
> that does the put_task_struct() call, so as to eliminate the need for
> calling kthread_stop()?
Yes.  I was thinking calling it kthread_orphan or something like that.
We can't make anything like that the default, because of the modular
remove problem, but it's not to hard.
> Or should we reconsider the kthread pool approach
> (and get XPC out of the thread management business altogether)? Robin
> Holt is putting together a proposal for how one could do a kthread pool,
> it should provide a bit more justification for going down that road.
Eric
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: [PATCH] ia64 sn xpc: Convert to use kthread API. [message #18577 is a reply to message #18572] | Wed, 02 May 2007 15:16   |  
			| 
				
				
					|  Dean Nelson Messages: 6
 Registered: April 2007
 | Junior Member |  |  |  
	| On Mon, Apr 30, 2007 at 10:22:30AM -0500, Dean Nelson wrote:
> On Fri, Apr 27, 2007 at 02:33:32PM -0600, Eric W. Biederman wrote:
> > Dean Nelson <dcn@sgi.com> writes:
> > 
> > > Taking it one step further, if you added the notion of a thread pool,
> > > where upon exit, a thread isn't destroyed but rather is queued ready to
> > > handle the next kthread_create_quick() request.
> > 
> > That might happen.  So far I am avoiding the notion of a thread pool for
> > as long as I can.  There is some sense in it, especially in generalizing
> > the svc thread pool code from nfs.  But if I don't have to go there I would
> > prefer it.
> 
> This means that XPC will have to retain its thread pool, but I can
> understand you not wanting to go there.
On Thu, Apr 26, 2007 at 01:11:15PM -0600, Eric W. Biederman wrote:
>
> Ok.    Because of the module unloading issue, and because we don't have
> a lot of these threads running around, the current plan is to fix
> thread_create and kthread_stop so that they must always be paired,
> and so that kthread_stop will work correctly if the task has already
> exited.
>
> Basically that just involves calling get_task_struct in kthread_create
> and put_task_struct in kthread_stop.
Okay, so I need to expand upon Christoph Hellwig's patch so that all
the kthread_create()'d threads are kthread_stop()'d.
This is easy to do for the XPC thread that exists for the lifetime of XPC,
as well as for the threads created to manage the SGI system partitions.
XPC has the one discovery thread that is created when XPC is first started
and exits as soon as it has finished discovering all existing SGI system
partitions. With your forthcoming change to kthread_stop() that will allow
it to be called after the thread has exited, doing this one is also easy.
Note that the kthread_stop() for this discovery thread won't occur until
XPC is rmmod'd. This means that its task_struct will not be freed for
possibly a very long time (i.e., weeks). Is that a problem?
But then we come to XPC's pool of threads that deliver channel messages
to the appropriate consumer (like XPNET) and can block indefinitely. As
mentioned earlier there could be hundreds if not thousands of these
(our systems keep getting bigger). So now requiring a kthread_stop()
for each one of these becomes more of a problem, as it is a lot of
task_struct pointers to maintain.
Currently, XPC maintains these threads via a
wait_event_interruptible_exclusive() queue so that it can wakeup as many
or as few as needed at any given moment by calling wake_up_nr(). When XPC
is rmmod'd, a flag is set which causes them to exit and wake_up_all()
is called.  Therefore XPC dosen't need to remember their pids or
task_struct pointers.
So what would you suggest we do for this pool of threads?
Is there any way to have a version of kthread_create() that doesn't
require a matching kthread_stop()? Or add a kthread_not_stopping()
that does the put_task_struct() call, so as to eliminate the need for
calling kthread_stop()?  Or should we reconsider the kthread pool approach
(and get XPC out of the thread management business altogether)? Robin
Holt is putting together a proposal for how one could do a kthread pool,
it should provide a bit more justification for going down that road.
Thanks,
Dean
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	|  | 
 
 
 Current Time: Sat Oct 25 11:26:34 GMT 2025 
 Total time taken to generate the page: 0.08914 seconds |