[PATCH] BC: resource beancounters (v2)

Re: [PATCH 2/6] BC: beancounters core (API) [message #5621 is a reply to message #5612]
Thu, 24 August 2006 15:00
Andrew Morton (Senior Member; Messages: 127; Registered: December 2005)

		On Thu, 24 Aug 2006 16:06:11 +0400 
Kirill Korotaev <dev@sw.ru> wrote: 
 
> >>+#define bc_charge_locked(bc, r, v, s)			(0) 
> >>+#define bc_charge(bc, r, v)				(0) 
> > 
> >akpm:/home/akpm> cat t.c 
> >void foo(void) 
> >{ 
> >	(0); 
> >} 
> >akpm:/home/akpm> gcc -c -Wall t.c 
> >t.c: In function 'foo': 
> >t.c:4: warning: statement with no effect 
>  
> these functions return value should always be checked (!). 
 
We have __must_check for that. 
 
> i.e. it is never called like: 
>   ub_charge(bc, r, v); 
 
Also... 
 
	if (bc_charge(tpyo, undefined_variable, syntax_error)) 
 
will happily compile if !CONFIG_BEANCOUNTER. 
 
Turning these stubs into static inline __must_check functions fixes all this.
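Concretely, the fix Andrew describes looks something like the sketch below. The function names and arguments follow the macros quoted above; the parameter names and types are assumptions, and `__must_check` is expanded by hand so the snippet compiles outside the kernel tree (in-tree it comes from `<linux/compiler.h>`):

```c
#include <stddef.h>

/* Outside the kernel tree, spell out what <linux/compiler.h>'s
 * __must_check expands to with GCC. */
#define __must_check __attribute__((warn_unused_result))

struct beancounter;	/* opaque here; the real definition is in the patch */

/* !CONFIG_BEANCOUNTER stubs: unlike the bare (0) macros, these
 * type-check their arguments and warn when the result is ignored. */
static inline __must_check int bc_charge_locked(struct beancounter *bc,
						int resource,
						unsigned long val,
						int strict)
{
	(void)bc; (void)resource; (void)val; (void)strict;
	return 0;	/* charging always succeeds when BC is compiled out */
}

static inline __must_check int bc_charge(struct beancounter *bc,
					 int resource, unsigned long val)
{
	(void)bc; (void)resource; (void)val;
	return 0;
}
```

With this shape, `bc_charge(tpyo, undefined_variable, syntax_error)` no longer compiles with beancounters disabled, and a bare `bc_charge(bc, r, v);` statement draws a warn_unused_result warning instead of silently doing nothing.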

Re: [PATCH 2/6] BC: beancounters core (API) [message #5626 is a reply to message #5544]
Thu, 24 August 2006 17:09
Oleg Nesterov (Senior Member; Messages: 143; Registered: August 2006)

		On 08/23, Kirill Korotaev wrote: 
> 
> +struct beancounter *beancounter_findcreate(uid_t uid, int mask) 
> +{ 
> +	struct beancounter *new_bc, *bc; 
> +	unsigned long flags; 
> +	struct hlist_head *slot; 
> +	struct hlist_node *pos; 
> + 
> +	slot = &bc_hash[bc_hash_fun(uid)]; 
> +	new_bc = NULL; 
> + 
> +retry: 
> +	spin_lock_irqsave(&bc_hash_lock, flags); 
> +	hlist_for_each_entry (bc, pos, slot, hash) 
> +		if (bc->bc_id == uid) 
> +			break; 
> + 
> +	if (pos != NULL) { 
> +		get_beancounter(bc); 
> +		spin_unlock_irqrestore(&bc_hash_lock, flags); 
> + 
> +		if (new_bc != NULL) 
> +			kmem_cache_free(bc_cachep, new_bc); 
> +		return bc; 
> +	} 
> + 
> +	if (!(mask & BC_ALLOC)) 
> +		goto out_unlock; 
 
Very minor nit: it is not clear why we are doing this check under 
bc_hash_lock. I'd suggest to do 
 
	if (!(mask & BC_ALLOC)) 
		goto out; 
 
after unlock(bc_hash_lock) and kill out_unlock label. 
 
> +	if (new_bc != NULL) 
> +		goto out_install; 
> + 
> +	spin_unlock_irqrestore(&bc_hash_lock, flags); 
> + 
> +	new_bc = kmem_cache_alloc(bc_cachep, 
> +			mask & BC_ALLOC_ATOMIC ? GFP_ATOMIC : GFP_KERNEL); 
> +	if (new_bc == NULL) 
> +		goto out; 
> + 
> +	memcpy(new_bc, &default_beancounter, sizeof(*new_bc)); 
 
Maybe it is just me, but I need a couple of seconds to parse this 'memcpy'. 
How about 
 
	*new_bc = default_beancounter; 
 
? 
 
Oleg.
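For illustration, Oleg's suggested form can be sketched standalone as below. The struct layout and field names here are invented for the example, not the real beancounter definition:

```c
/* Toy stand-in for the real struct beancounter; this layout is
 * purely illustrative. */
struct beancounter {
	unsigned int bc_id;
	long bc_resource[4];
};

static const struct beancounter default_beancounter = {
	.bc_id = 0,
	.bc_resource = { 100, 200, 300, 400 },
};

/* Structure assignment copies every member, just like the memcpy()
 * in the patch, but states the intent directly and lets the
 * compiler type-check both sides. */
static void bc_init_from_default(struct beancounter *new_bc)
{
	*new_bc = default_beancounter;
	/* was: memcpy(new_bc, &default_beancounter, sizeof(*new_bc)); */
}
```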

Re: [PATCH 2/6] BC: beancounters core (API) [message #5649 is a reply to message #5621]
Fri, 25 August 2006 10:51
dev (Senior Member; Messages: 1693; Registered: September 2005; Location: Moscow)

		Andrew Morton wrote: 
> On Thu, 24 Aug 2006 16:06:11 +0400 
> Kirill Korotaev <dev@sw.ru> wrote: 
>  
>  
>>>>+#define bc_charge_locked(bc, r, v, s)			(0) 
>>>>+#define bc_charge(bc, r, v)				(0) 
>>> 
>>>akpm:/home/akpm> cat t.c 
>>>void foo(void) 
>>>{ 
>>>	(0); 
>>>} 
>>>akpm:/home/akpm> gcc -c -Wall t.c 
>>>t.c: In function 'foo': 
>>>t.c:4: warning: statement with no effect 
>> 
>>these functions return value should always be checked (!). 
>  
>  
> We have __must_check for that. 
>  
>  
>>i.e. it is never called like: 
>>  ub_charge(bc, r, v); 
>  
>  
> Also... 
>  
> 	if (bc_charge(tpyo, undefined_variable, syntax_error)) 
>  
> will happily compile if !CONFIG_BEANCOUNTER. 
>  
> Turning these stubs into static inline __must_check functions fixes all this. 
 
ok. will replace all empty stubs with inlines (with __must_check where appropriate) 
 
Thanks, 
Kirill

Re: [PATCH 1/6] BC: kconfig [message #5652 is a reply to message #5582]
Fri, 25 August 2006 11:27
dev (Senior Member; Messages: 1693; Registered: September 2005; Location: Moscow)

		Matt Helsley wrote: 
> On Wed, 2006-08-23 at 15:04 -0700, Dave Hansen wrote: 
>  
>>On Wed, 2006-08-23 at 15:01 +0400, Kirill Korotaev wrote: 
>> 
>>>--- ./arch/sparc64/Kconfig.arkcfg	2006-07-17 17:01:11.000000000 +0400 
>>>+++ ./arch/sparc64/Kconfig	2006-08-10 17:56:36.000000000 +0400 
>>>@@ -432,3 +432,5 @@ source "security/Kconfig" 
>>> source "crypto/Kconfig" 
>>>  
>>> source "lib/Kconfig" 
>>>+ 
>>>+source "kernel/bc/Kconfig" 
>> 
>>... 
>> 
>>>--- ./arch/sparc64/Kconfig.arkcfg	2006-07-17 17:01:11.000000000 +0400 
>>>+++ ./arch/sparc64/Kconfig	2006-08-10 17:56:36.000000000 +0400 
>>>@@ -432,3 +432,5 @@ source "security/Kconfig" 
>>> source "crypto/Kconfig" 
>>>  
>>> source "lib/Kconfig" 
>>>+ 
>>>+source "kernel/bc/Kconfig" 
>> 
>>Is it just me, or do these patches look a little funky?  Looks like it 
>>is trying to patch the same thing into the same file, twice.  Also, the 
>>patches look to be -p0 instead of -p1.   
>  
>  
> They do appear to be -p0 
It is -p1. The patches are generated with gendiff, and the ./ prefix in the names is for -p1. 
 
> 	They aren't adding the same thing twice to the same file. This patch 
> makes different arches source the same Kconfig. 
>  
> 	I seem to recall Chandra suggested that instead of doing it this way it 
> would be more appropriate to add the source line to init/Kconfig because 
> it's more central and arch-independent. I tend to agree. 
agreed. init/Kconfig looks like a good place for including 
kernel/bc/Kconfig 
 
Kirill
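The centralized hook Kirill agrees to would be a single line in init/Kconfig. A hedged sketch follows; the `source` line is what the discussion describes, while the option name and help text for kernel/bc/Kconfig are assumptions, not the actual patch:

```kconfig
# init/Kconfig: pull in the beancounter options once, centrally,
# instead of adding a "source" line to every arch/*/Kconfig:
source "kernel/bc/Kconfig"

# kernel/bc/Kconfig would then carry the option itself,
# e.g. (hypothetical sketch):
config BEANCOUNTERS
	bool "Resource beancounters"
	default n
	help
	  Per-container accounting and limiting of kernel resources.
```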

Re: [PATCH 1/6] BC: kconfig [message #5653 is a reply to message #5579]
Fri, 25 August 2006 11:31
dev (Senior Member; Messages: 1693; Registered: September 2005; Location: Moscow)

		Dave Hansen wrote: 
> On Wed, 2006-08-23 at 15:01 +0400, Kirill Korotaev wrote: 
>  
>>--- ./arch/sparc64/Kconfig.arkcfg	2006-07-17 17:01:11.000000000 +0400 
>>+++ ./arch/sparc64/Kconfig	2006-08-10 17:56:36.000000000 +0400 
>>@@ -432,3 +432,5 @@ source "security/Kconfig" 
>> source "crypto/Kconfig" 
>>  
>> source "lib/Kconfig" 
>>+ 
>>+source "kernel/bc/Kconfig" 
>  
> ... 
>  
>>--- ./arch/sparc64/Kconfig.arkcfg	2006-07-17 17:01:11.000000000 +0400 
>>+++ ./arch/sparc64/Kconfig	2006-08-10 17:56:36.000000000 +0400 
>>@@ -432,3 +432,5 @@ source "security/Kconfig" 
>> source "crypto/Kconfig" 
>>  
>> source "lib/Kconfig" 
>>+ 
>>+source "kernel/bc/Kconfig" 
>  
>  
> Is it just me, or do these patches look a little funky?  Looks like it 
> is trying to patch the same thing into the same file, twice.  Also, the 
> patches look to be -p0 instead of -p1.   
>  
> I'm having a few problems applying them. 
Oh, it's my fault. I pasted text twice :/ 
 
Kirill

Re: [PATCH] BC: resource beancounters (v2) [message #5654 is a reply to message #5570]
Fri, 25 August 2006 11:47
dev (Senior Member; Messages: 1693; Registered: September 2005; Location: Moscow)

		Andrew Morton wrote: 
>>As the first step we want to propose for discussion 
>>the most complicated parts of resource management: 
>>kernel memory and virtual memory. 
>  
>  
> The patches look reasonable to me - mergeable after updating them for 
> today's batch of review commentlets. 
sure. will do updates as long as there are reasonable comments. 
 
> I have two high-level problems though. 
>  
> a) I don't yet have a sense of whether this implementation 
>    is appropriate/sufficient for the various other 
>    applications which people are working on. 
>  
>    If the general shape is OK and we think this 
>    implementation can be grown into one which everyone can 
>    use then fine. 
>  
> And... 
>  
>  
>>The patch set to be sent provides core for BC and 
>>management of kernel memory only. Virtual memory 
>>management will be sent in a couple of days. 
>  
>  
> We need to go over this work before we can commit to the BC 
> core.  Last time I looked at the VM accounting patch it 
> seemed rather unpleasing from a maintainability POV. 
hmmm... in which regard? 
 
> And, if I understand it correctly, the only response to a job 
> going over its VM limits is to kill it, rather than trimming 
> it.  Which sounds like a big problem? 
No, UBC virtual memory management refusals occur at mmap() time. 
Andrey Savochkin wrote already a brief summary on vm resource management: 
 
------------- cut ---------------- 
The task of limiting a container to 4.5GB of memory bottles down to the 
question: what to do when the container starts to use more than assigned 
4.5GB of memory? 
 
At this moment there are only 3 viable alternatives. 
 
A) Have separate memory management for each container, 
   with separate buddy allocator, lru lists, page replacement mechanism. 
   That implies a considerable overhead, and the main challenge there 
   is sharing of pages between these separate memory managers. 
 
B) Return errors on extension of mappings, but not on page faults, where 
   memory is actually consumed. 
   In this case it makes sense to take into account not only the size of used 
   memory, but the size of created mappings as well. 
   This is approximately what "privvmpages" accounting/limiting provides in 
   UBC. 
 
C) Rely on OOM killer. 
   This is a fall-back method in UBC, for the case "privvmpages" limits 
   still leave the possibility to overload the system. 
 
It would be nice, indeed, to invent something new. 
The ideal mechanism would 
 - slow down the container over-using memory, to signal the user that 
   he is over his limits, 
 - at the same time this slowdown shouldn't lead to the increase of memory 
   usage: for example, a simple slowdown of apache web server would lead 
   to the growth of the number of serving children and consumption of more 
   memory while showing worse performance, 
 - and, at the same time, it shouldn't penalize the rest of the system from 
   the performance point of view... 
May be this can be achieved via carefully tuned swapout mechanism together 
with disk bandwidth management capable of tracking asynchronous write 
requests, may be something else is required. 
It's really a big challenge. 
 
Meanwhile, I guess we can only make small steps in improving Linux resource 
management features for this moment. 
------------- cut ---------------- 
 
Thanks, 
Kirill

Re: [PATCH] BC: resource beancounters (v2) [message #5655 is a reply to message #5654]
Fri, 25 August 2006 14:30
Andrew Morton (Senior Member; Messages: 127; Registered: December 2005)

		On Fri, 25 Aug 2006 15:49:15 +0400 
Kirill Korotaev <dev@sw.ru> wrote: 
 
> > We need to go over this work before we can commit to the BC 
> > core.  Last time I looked at the VM accounting patch it 
> > seemed rather unpleasing from a maintainability POV. 
> hmmm... in which regard? 
 
Little changes all over the MM code which might get accidentally broken. 
 
> > And, if I understand it correctly, the only response to a job 
> > going over its VM limits is to kill it, rather than trimming 
> > it.  Which sounds like a big problem? 
> No, UBC virtual memory management refusals occur at mmap() time. 
 
That's worse, isn't it?  Firstly it rules out big sparse mappings and secondly 
 
	mmap_and_use(80% of container size) 
	fork_and_immediately_exec(/bin/true) 
 
will fail at the fork? 
 
 
> Andrey Savochkin wrote already a brief summary on vm resource management: 
>  
> ------------- cut ---------------- 
> The task of limiting a container to 4.5GB of memory bottles down to the 
> question: what to do when the container starts to use more than assigned 
> 4.5GB of memory? 
>  
> At this moment there are only 3 viable alternatives. 
>  
> A) Have separate memory management for each container, 
>    with separate buddy allocator, lru lists, page replacement mechanism. 
>    That implies a considerable overhead, and the main challenge there 
>    is sharing of pages between these separate memory managers. 
>  
> B) Return errors on extension of mappings, but not on page faults, where 
>    memory is actually consumed. 
>    In this case it makes sense to take into account not only the size of used 
>    memory, but the size of created mappings as well. 
>    This is approximately what "privvmpages" accounting/limiting provides in 
>    UBC. 
>  
> C) Rely on OOM killer. 
>    This is a fall-back method in UBC, for the case "privvmpages" limits 
>    still leave the possibility to overload the system. 
>  
 
D) Virtual scan of mm's in the over-limit container 
 
E) Modify existing physical scanner to be able to skip pages which 
   belong to not-over-limit containers. 
 
F) Something else ;)

Re: BC: resource beancounters (v2) [message #5661 is a reply to message #5655]
Fri, 25 August 2006 16:30
Andrey Savochkin (Member; Messages: 47; Registered: December 2005)

		On Fri, Aug 25, 2006 at 07:30:03AM -0700, Andrew Morton wrote: 
> On Fri, 25 Aug 2006 15:49:15 +0400 
> Kirill Korotaev <dev@sw.ru> wrote: 
>  
> > Andrey Savochkin wrote already a brief summary on vm resource management: 
> >  
> > ------------- cut ---------------- 
> > The task of limiting a container to 4.5GB of memory bottles down to the 
> > question: what to do when the container starts to use more than assigned 
> > 4.5GB of memory? 
> >  
> > At this moment there are only 3 viable alternatives. 
> >  
> > A) Have separate memory management for each container, 
> >    with separate buddy allocator, lru lists, page replacement mechanism. 
> >    That implies a considerable overhead, and the main challenge there 
> >    is sharing of pages between these separate memory managers. 
> >  
> > B) Return errors on extension of mappings, but not on page faults, where 
> >    memory is actually consumed. 
> >    In this case it makes sense to take into account not only the size of used 
> >    memory, but the size of created mappings as well. 
> >    This is approximately what "privvmpages" accounting/limiting provides in 
> >    UBC. 
> >  
> > C) Rely on OOM killer. 
> >    This is a fall-back method in UBC, for the case "privvmpages" limits 
> >    still leave the possibility to overload the system. 
> >  
>  
> D) Virtual scan of mm's in the over-limit container 
>  
> E) Modify existing physical scanner to be able to skip pages which 
>    belong to not-over-limit containers. 
 
I've actually tried (E), but it didn't work as I wished. 
 
It didn't handle well shared pages. 
Then, in my experiments such modified scanner was unable to regulate 
quality-of-service.  When I ran 2 over-the-limit containers, they worked 
equally slow regardless of their limits and work set size. 
That is, I didn't observe a smooth transition "under limit, maximum 
performance" to "slightly over limit, a bit reduced performance" to 
"significantly over limit, poor performance".  Neither did I see any fairness 
in how containers got penalized for exceeding their limits. 
 
My explanation of what I observed is that 
 - since filesystem caches play a huge role in performance, page scanner will 
   be very limited in controlling container's performance if caches 
   stay shared between containers, 
 - in the absence of decent disk I/O manager, stalls due to swapin/swapout 
   are more influenced by disk subsystem than by page scanner policy. 
So in fact modified page scanner provides control over memory usage only as 
"stay under limits or die", and doesn't show many advantages over (B) or (C). 
At the same time, skipping pages visibly penalizes "good citizens", not only 
in disk bandwidth but in CPU overhead as well. 
 
So I settled for (A)-(C) for now. 
But it certainly would be interesting to hear if someone else makes such 
experiments. 
 
Best regards 
 
Andrey

Re: BC: resource beancounters (v2) [message #5663 is a reply to message #5661]
Fri, 25 August 2006 17:50
Andrew Morton (Senior Member; Messages: 127; Registered: December 2005)

		On Fri, 25 Aug 2006 20:30:26 +0400 
Andrey Savochkin <saw@sw.ru> wrote: 
 
> On Fri, Aug 25, 2006 at 07:30:03AM -0700, Andrew Morton wrote: 
> >  
> > D) Virtual scan of mm's in the over-limit container 
> >  
> > E) Modify existing physical scanner to be able to skip pages which 
> >    belong to not-over-limit containers. 
>  
> I've actually tried (E), but it didn't work as I wished. 
>  
> It didn't handle well shared pages. 
> Then, in my experiments such modified scanner was unable to regulate 
> quality-of-service.  When I ran 2 over-the-limit containers, they worked 
> equally slow regardless of their limits and work set size. 
> That is, I didn't observe a smooth transition "under limit, maximum 
> performance" to "slightly over limit, a bit reduced performance" to 
> "significantly over limit, poor performance".  Neither did I see any fairness 
> in how containers got penalized for exceeding their limits. 
>  
> My explanation of what I observed is that 
>  - since filesystem caches play a huge role in performance, page scanner will 
>    be very limited in controlling container's performance if caches 
>    stay shared between containers, 
>  - in the absence of decent disk I/O manager, stalls due to swapin/swapout 
>    are more influenced by disk subsystem than by page scanner policy. 
> So in fact modified page scanner provides control over memory usage only as 
> "stay under limits or die", and doesn't show many advantages over (B) or (C). 
> At the same time, skipping pages visibly penalizes "good citizens", not only 
> in disk bandwidth but in CPU overhead as well. 
>  
> So I settled for (A)-(C) for now. 
> But it certainly would be interesting to hear if someone else makes such 
> experiments. 
>  
 
Makes sense.  If one is looking for good machine partitioning then a shared 
disk is obviously a great contention point.  To address that we'd need to 
be able to say "container A swaps to /dev/sda1 and container B swaps to 
/dev/sdb1".  But the swap system at present can't do that.

Re: BC: resource beancounters (v2) [message #5666 is a reply to message #5661]
Fri, 25 August 2006 19:00
Chandra Seetharaman (Member; Messages: 88; Registered: August 2006)

		Have you seen/tried the memory controller in CKRM/Resource Groups ? 
http://sourceforge.net/projects/ckrm 
 
It maintains per-resource-group LRU lists and also maintains a list of 
over-guarantee groups (with ordering based on where they are in their 
guarantee-limit scale). So, when a reclaim needs to happen, pages are 
first freed from a group that is way over its limit, and then the next 
one and so on. 
 
A few things that it does that are not good: 
 - doesn't account shared pages accurately 
 - moves all pages from a task when the task moves to a different group 
 - totally new reclamation path 
 
regards, 
 
chandra 
On Fri, 2006-08-25 at 20:30 +0400, Andrey Savochkin wrote: 
> On Fri, Aug 25, 2006 at 07:30:03AM -0700, Andrew Morton wrote: 
> > On Fri, 25 Aug 2006 15:49:15 +0400 
> > Kirill Korotaev <dev@sw.ru> wrote: 
> >  
> > > Andrey Savochkin wrote already a brief summary on vm resource management: 
> > >  
> > > ------------- cut ---------------- 
> > > The task of limiting a container to 4.5GB of memory bottles down to the 
> > > question: what to do when the container starts to use more than assigned 
> > > 4.5GB of memory? 
> > >  
> > > At this moment there are only 3 viable alternatives. 
> > >  
> > > A) Have separate memory management for each container, 
> > >    with separate buddy allocator, lru lists, page replacement mechanism. 
> > >    That implies a considerable overhead, and the main challenge there 
> > >    is sharing of pages between these separate memory managers. 
> > >  
> > > B) Return errors on extension of mappings, but not on page faults, where 
> > >    memory is actually consumed. 
> > >    In this case it makes sense to take into account not only the size of used 
> > >    memory, but the size of created mappings as well. 
> > >    This is approximately what "privvmpages" accounting/limiting provides in 
> > >    UBC. 
> > >  
> > > C) Rely on OOM killer. 
> > >    This is a fall-back method in UBC, for the case "privvmpages" limits 
> > >    still leave the possibility to overload the system. 
> > >  
> >  
> > D) Virtual scan of mm's in the over-limit container 
> >  
> > E) Modify existing physical scanner to be able to skip pages which 
> >    belong to not-over-limit containers. 
>  
> I've actually tried (E), but it didn't work as I wished. 
>  
> It didn't handle well shared pages. 
> Then, in my experiments such modified scanner was unable to regulate 
> quality-of-service.  When I ran 2 over-the-limit containers, they worked 
> equally slow regardless of their limits and work set size. 
> That is, I didn't observe a smooth transition "under limit, maximum 
> performance" to "slightly over limit, a bit reduced performance" to 
> "significantly over limit, poor performance".  Neither did I see any fairness 
> in how containers got penalized for exceeding their limits. 
>  
> My explanation of what I observed is that 
>  - since filesystem caches play a huge role in performance, page scanner will 
>    be very limited in controlling container's performance if caches 
>    stay shared between containers, 
>  - in the absence of decent disk I/O manager, stalls due to swapin/swapout 
>    are more influenced by disk subsystem than by page scanner policy. 
> So in fact modified page scanner provides control over memory usage only as 
> "stay under limits or die", and doesn't show many advantages over (B) or (C). 
> At the same time, skipping pages visibly penalizes "good citizens", not only 
> in disk bandwidth but in CPU overhead as well. 
>  
> So I settled for (A)-(C) for now. 
> But it certainly would be interesting to hear if someone else makes such 
> experiments. 
>  
> Best regards 
>  
> Andrey 
--  
 
 ---------------------------------------------------------------------- 
    Chandra Seetharaman               | Be careful what you choose.... 
              - sekharan@us.ibm.com   |      .......you may get it. 
 ---------------------------------------------------------------------- 

Re: BC: resource beancounters (v2) [message #5676 is a reply to message #5661]
Sat, 26 August 2006 02:15
Rohit Seth (Senior Member; Messages: 101; Registered: August 2006)

		On Fri, 2006-08-25 at 20:30 +0400, Andrey Savochkin wrote: 
> On Fri, Aug 25, 2006 at 07:30:03AM -0700, Andrew Morton wrote: 
> > On Fri, 25 Aug 2006 15:49:15 +0400 
> > Kirill Korotaev <dev@sw.ru> wrote: 
> >  
> > > Andrey Savochkin wrote already a brief summary on vm resource management: 
> > >  
> > > ------------- cut ---------------- 
> > > The task of limiting a container to 4.5GB of memory bottles down to the 
> > > question: what to do when the container starts to use more than assigned 
> > > 4.5GB of memory? 
> > >  
> > > At this moment there are only 3 viable alternatives. 
> > >  
> > > A) Have separate memory management for each container, 
> > >    with separate buddy allocator, lru lists, page replacement mechanism. 
> > >    That implies a considerable overhead, and the main challenge there 
> > >    is sharing of pages between these separate memory managers. 
> > >  
 
Yes, sharing of pages across different containers/managers will be a 
problem.  Why not just disallow that scenario (that is what fake nodes 
proposal would also end up doing). 
 
> > > B) Return errors on extension of mappings, but not on page faults, where 
> > >    memory is actually consumed. 
> > >    In this case it makes sense to take into account not only the size of used 
> > >    memory, but the size of created mappings as well. 
> > >    This is approximately what "privvmpages" accounting/limiting provides in 
> > >    UBC. 
> > > 
 
Keeping a tab on all the virtual mappings in a container must also be 
troublesome.  And IMO is not the right way to go...this is even a 
stricter version of overcommit_memory...right? 
>   
> > > C) Rely on OOM killer. 
> > >    This is a fall-back method in UBC, for the case "privvmpages" limits 
> > >    still leave the possibility to overload the system. 
> > >  
> >  
> > D) Virtual scan of mm's in the over-limit container 
> >  
 
This seems like an interesting choice, if we can quickly inactivate 
some pages belonging to tasks in the over-the-limit container. 
 
> > E) Modify existing physical scanner to be able to skip pages which 
> >    belong to not-over-limit containers. 
>  
> I've actually tried (E), but it didn't work as I wished. 
>  
> It didn't handle well shared pages. 
> Then, in my experiments such modified scanner was unable to regulate 
> quality-of-service.  When I ran 2 over-the-limit containers, they worked 
> equally slow regardless of their limits and work set size. 
> That is, I didn't observe a smooth transition "under limit, maximum 
> performance" to "slightly over limit, a bit reduced performance" to 
> "significantly over limit, poor performance".  Neither did I see any fairness 
> in how containers got penalized for exceeding their limits. 
>  
 
That sure is an interesting observation though I think it really depends 
on if you are doing the same amount of work when counts have just gone 
above the limits to the point where they are way over the limit. 
 
> My explanation of what I observed is that 
>  - since filesystem caches play a huge role in performance, page scanner will 
>    be very limited in controlling container's performance if caches 
>    stay shared between containers, 
 
Yeah, if a page is shared between containers then you can end up doing 
nothing useful.  And that is where containers dedicated to individual 
filesystem could be useful. 
 
>  - in the absence of decent disk I/O manager, stalls due to swapin/swapout 
>    are more influenced by disk subsystem than by page scanner policy. 
> So in fact modified page scanner provides control over memory usage only as 
> "stay under limits or die", and doesn't show many advantages over (B) or (C). 
> At the same time, skipping pages visibly penalizes "good citizens", not only 
> in disk bandwidth but in CPU overhead as well. 
>  
 
Sure, CPU, disk and other variables will kick in when you start 
swapping.  But then apps are expected to suffer when over the limit. 
The drawback is that apps that have not hit the limit will also suffer, 
but then that is where extra controllers like CPU will kick in. 
 
Maybe, we have a flag for each container indicating whether the tasks 
belonging to that container should be killed immediately or they are 
okay to run with lower performance as far as they can. 
 
-rohit

Re: [PATCH] BC: resource beancounters (v2) [message #5678 is a reply to message #5657]
Sat, 26 August 2006 03:55
Nick Piggin (Member; Messages: 35; Registered: March 2006)

		Alan Cox wrote: 
> Ar Sad, 2006-08-26 am 01:14 +1000, ysgrifennodd Nick Piggin: 
>  
>>I still think doing simple accounting per-page would be a better way to 
>>go than trying to pin down all "user allocatable" kernel allocations. 
>>And would require all of about 2 hooks in the page allocator. And would 
>>track *actual* RAM allocated by that container. 
>  
>  
> You have a variety of kernel objects you want to worry about and they 
> have very differing properties. 
>  
> Some are basically shared resources - page cache, dentries, inodes, etc 
> and can be balanced pretty well by the kernel (ok the dentries are a bit 
> of a problem right now). Others are very specific "owned" resources - 
> like file handles, sockets and vmas. 
 
That's true (OTOH I'd argue it would still be very useful for things 
like pagecache, so one container can't start a couple of 'dd' loops 
and turn everyone else to crap). And while the sharing may not be 
exactly captured, statistically things should balance over time. 
 
So I'm not arguing about _also_ accounting resources that are limited 
in other ways (than just the RAM they consume). 
 
But as a DoS protection measure on RAM usage, trying to account all 
kernel allocations that are user triggerable just sounds hard to 
maintain, holey, ugly, invasive (and not perfect either -- in fact it 
still isn't clear to me that it is any better than my proposal). 
 
>  
> Tracking actual RAM use by container/user/.. isn't actually that 
> interesting. It's also inconveniently sub page granularity. 
 
If it isn't interesting, then I don't think we want it (at least, until 
someone does get an interest in it). 
 
>  
> Its a whole seperate question whether you want a separate bean counter 
> limit for sockets, file handles, vmas etc. 
 
Yeah that's fair enough. We obviously want to avoid exposing limits on 
things that it doesn't make sense to limit, or that is a kernel 
implementation detail as much as possible. 
 
eg. so I would be happy to limit virtual address, less happy to limit 
vmas alone (unless that is in the context of accounting their RAM usage 
or their implied vaddr charge). 
 
--  
SUSE Labs, Novell Inc. 

Re: Re: BC: resource beancounters (v2) [message #5709 is a reply to message #5707]
Mon, 28 August 2006 17:40

		Rohit Seth wrote: 
> On Sat, 2006-08-26 at 17:37 +0100, Alan Cox wrote: 
>    
>> Ar Gwe, 2006-08-25 am 19:15 -0700, ysgrifennodd Rohit Seth: 
>>      
>>> Yes, sharing of pages across different containers/managers will be a 
>>> problem.  Why not just disallow that scenario (that is what fake nodes 
>>> proposal would also end up doing). 
>>>        
>> Because it destroys the entire point of using containers instead of 
>> something like Xen - which is sharing. Also at the point I am using 
>> beancounters per user I don't want glibc per use, libX11 per use glib 
>> per use gtk per user etc.. 
>> 
>> 
>>      
> 
> I'm not saying per-user glibc etc.  That will indeed be useless and bring 
> it to the virtualization world.  Just like the fake-node proposal, one should 
> be allowed to use pages that are already in (for example) the page cache, so 
> that you don't end up duplicating all the shared stuff.  But as far as charging 
> is concerned, charge it to the container that either got the page into the 
> page cache OR, if FS-based semantics exist, charge it to the container where 
> the file belongs.  What I was suggesting is to not charge a page to 
> different counters. 
>    
 
Consider the following simple scenario: there are 50 containers  
(numbered, say, 1 to 50) all sharing a single installation of Fedora  
Core 5. They all run sshd, apache, syslogd, crond and some other stuff  
like that. This is actually quite a real scenario. 
 
In the world you propose, the container that was unlucky enough to start 
first (probably the one with ID 1 or 50) will be charged for all the 
memory, and all the others will get most of their memory for free. In 
such a world per-container memory accounting or limiting is just not 
possible.
		
		
		
Re: Re: BC: resource beancounters (v2) [message #5721 is a reply to message #5709]
Mon, 28 August 2006 22:28
Rohit Seth
	
		On Mon, 2006-08-28 at 21:41 +0400, Kir Kolyshkin wrote: 
> Rohit Seth wrote: 
> > 
> > I'm not saying per-user glibc etc.  That will indeed be useless and bring 
> > it to the virtualization world.  Just like the fake-node proposal, one should 
> > be allowed to use pages that are already in (for example) the page cache, so 
> > that you don't end up duplicating all the shared stuff.  But as far as charging 
> > is concerned, charge it to the container that either got the page into the 
> > page cache OR, if FS-based semantics exist, charge it to the container where 
> > the file belongs.  What I was suggesting is to not charge a page to 
> > different counters. 
> >    
>  
> Consider the following simple scenario: there are 50 containers  
> (numbered, say, 1 to 50) all sharing a single installation of Fedora  
> Core 5. They all run sshd, apache, syslogd, crond and some other stuff  
> like that. This is actually quite a real scenario. 
>  
> In the world you propose, the container that was unlucky enough to start 
> first (probably the one with ID 1 or 50) will be charged for all the 
> memory, and all the others will get most of their memory for free. In 
> such a world per-container memory accounting or limiting is just not 
> possible. 
 
If you only have task-based accounting then yes, the first container 
using a page will be charged.  And when it hits its limit, it will 
inactivate some of its pages.  If some other container then uses one of 
the pages that got inactivated, that next container will be charged for 
the page. 
 
Though if we have file/directory based accounting then shared pages 
belonging to /usr/lib or /usr/bin can go to a common container. 
 
-rohit
		
		
		
Re: [PATCH 6/6] BC: kernel memory accounting (marks) [message #5731 is a reply to message #5573]
Tue, 29 August 2006 09:52
dev
		Dave Hansen wrote: 
> I'm still a bit concerned about if we actually need the 'struct page' 
> pointer.  I've gone through all of the users, and I'm not sure that I 
> see any that _require_ having a pointer in 'struct page'.  I think it 
> will take some rework, especially with the pagetables, but it should be 
> quite doable. 
Don't worry: 
1. we will introduce a separate patch moving this pointer 
   into a mirroring array; 
2. this pointer is still required for _user_ page tracking, 
   which is why I don't follow your suggestion right now... 
 
> vmalloc: 
> 	Store in vm_struct 
> fd_set_bits: 
> poll_get: 
> mount hashtable: 
> 	Don't need alignment.  use the slab? 
> pagetables: 
> 	either store in an extra field of 'struct page', or use the 
> 	mm's.  mm should always be available when alloc/freeing a 
> 	pagetable page 
>  
> Did I miss any? 
flocks, pipe buffers, task_struct, sighand, signal, vmas, 
posix timers, uid_cache, shmem dirs. 
 
Thanks, 
Kirill
		
		
		
Re: [PATCH] BC: resource beancounters (v2) [message #5747 is a reply to message #5655]
Tue, 29 August 2006 15:33
dev
		Andrew Morton wrote: 
> On Fri, 25 Aug 2006 15:49:15 +0400 
> Kirill Korotaev <dev@sw.ru> wrote: 
>  
>  
>>>We need to go over this work before we can commit to the BC 
>>>core.  Last time I looked at the VM accounting patch it 
>>>seemed rather unpleasing from a maintainability POV. 
>> 
>>hmmm... in which regard? 
>  
>  
> Little changes all over the MM code which might get accidentally broken. 
>  
>  
>>>And, if I understand it correctly, the only response to a job 
>>>going over its VM limits is to kill it, rather than trimming 
>>>it.  Which sounds like a big problem? 
>> 
>>No, UBC virtual memory management refusals occur on mmap()'s. 
>  
>  
> That's worse, isn't it?  Firstly it rules out big sparse mappings and secondly 
1) if mappings are private then yes, you cannot mmap too much. This is logical, 
since all of these mappings are potentially allocatable and there is no way to 
control them later except for SIGKILL; 
2) if mappings are shared file mappings (shmem is handled in a similar way) then 
   you can mmap as much as you want, since these pages can be reclaimed. 
 
> 	mmap_and_use(80% of container size) 
> 	fork_and_immediately_exec(/bin/true) 
>  
> will fail at the fork? 
Yes, it will fail on fork() or exec() in the case of too many private (case 1) 
mappings. Failing on fork() or exec() is much friendlier than SIGKILL, don't 
you think? 
 
The private-mappings parameter limited by UBC is a kind of upper estimate of 
container RSS. In our experience such an estimate is ~5-20% higher than the 
real physical memory used (with real-life applications). 
 
>>Andrey Savochkin wrote already a brief summary on vm resource management: 
>> 
>>------------- cut ---------------- 
>>The task of limiting a container to 4.5GB of memory boils down to the 
>>question: what to do when the container starts to use more than assigned 
>>4.5GB of memory? 
>> 
>>At this moment there are only 3 viable alternatives. 
>> 
>>A) Have separate memory management for each container, 
>>   with separate buddy allocator, lru lists, page replacement mechanism. 
>>   That implies a considerable overhead, and the main challenge there 
>>   is sharing of pages between these separate memory managers. 
>> 
>>B) Return errors on extension of mappings, but not on page faults, where 
>>   memory is actually consumed. 
>>   In this case it makes sense to take into account not only the size of used 
>>   memory, but the size of created mappings as well. 
>>   This is approximately what "privvmpages" accounting/limiting provides in 
>>   UBC. 
>> 
>>C) Rely on OOM killer. 
>>   This is a fall-back method in UBC, for the case "privvmpages" limits 
>>   still leave the possibility to overload the system. 
>> 
>  
>  
> D) Virtual scan of mm's in the over-limit container 
>  
> E) Modify existing physical scanner to be able to skip pages which 
>    belong to not-over-limit containers. 
>  
> F) Something else ;) 
We fully agree that other possible algorithms can and should exist. 
My point is only that any of them would need accounting anyway 
(which is the biggest part of beancounters). 
Throttling, modified scanners etc. can be implemented as separate 
BC parameters. Thus, an administrator will be able to select 
which policy should be applied to a container that is near its limit. 
 
So the patches I'm sending introduce step-by-step accounting of all 
the resources and simple limits on them. A more comprehensive limiting 
policy will be built on top of this later. 
 
BTW, UBC page beancounters make it possible to distinguish pages used by 
only one container from pages which are shared. So the scanner can try to 
reclaim container-private pages first, thus not influencing other containers. 
 
Thanks, 
Kirill
		
		
		
Re: Re: BC: resource beancounters (v2) [message #5756 is a reply to message #5755]
Tue, 29 August 2006 19:15
Rohit Seth
		On Tue, 2006-08-29 at 20:06 +0100, Alan Cox wrote: 
> On Tue, 2006-08-29 at 10:30 -0700, Rohit Seth wrote: 
> > On Tue, 2006-08-29 at 11:15 +0100, Alan Cox wrote: 
> > > On Mon, 2006-08-28 at 15:28 -0700, Rohit Seth wrote: 
> > > > Though if we have file/directory based accounting then shared pages 
> > > > belonging to /usr/lib or /usr/bin can go to a common container. 
> > >  
> > > So that one user can map all the spare libraries and config files and 
> > > DoS the system by preventing people from accessing the libraries they do 
> > > need ? 
> > >  
> >  
> > Well, there is a risk whenever there is sharing across containers. The 
> > point though is, give the choice to sysadmin to configure the platform 
> > the way it is appropriate. 
>  
> In other words your suggestion doesn't actually work for the real world 
> cases like web serving. 
>  
 
Containers are not going to solve all the problems, particularly 
scenarios where a machine is a web server and an odd user can log on 
to the same machine and (w/o any ulimits) claim all the memory present 
in the system. 
 
Though it is quite possible to implement a combination of the two (task- 
and fs-based) policies in containers, with the sysadmin setting a 
preference for each container.  [This is probably another reason for 
having a per-page container pointer.] 
 
-rohit
		
		
		
Re: [PATCH] BC: resource beancounters (v2) [message #5763 is a reply to message #5747]
Tue, 29 August 2006 17:08
Balbir Singh
		Kirill Korotaev wrote: 
>>> ------------- cut ---------------- 
>>> The task of limiting a container to 4.5GB of memory boils down to the 
>>> question: what to do when the container starts to use more than assigned 
>>> 4.5GB of memory? 
>>> 
>>> At this moment there are only 3 viable alternatives. 
>>> 
>>> A) Have separate memory management for each container, 
>>>   with separate buddy allocator, lru lists, page replacement mechanism. 
>>>   That implies a considerable overhead, and the main challenge there 
>>>   is sharing of pages between these separate memory managers. 
>>> 
>>> B) Return errors on extension of mappings, but not on page faults, where 
>>>   memory is actually consumed. 
>>>   In this case it makes sense to take into account not only the size  
>>> of used 
>>>   memory, but the size of created mappings as well. 
>>>   This is approximately what "privvmpages" accounting/limiting  
>>> provides in 
>>>   UBC. 
>>> 
>>> C) Rely on OOM killer. 
>>>   This is a fall-back method in UBC, for the case "privvmpages" limits 
>>>   still leave the possibility to overload the system. 
>>> 
>> 
>> 
>> D) Virtual scan of mm's in the over-limit container 
>> 
>> E) Modify existing physical scanner to be able to skip pages which 
>>    belong to not-over-limit containers. 
>> 
>> F) Something else ;) 
> We fully agree that other possible algorithms can and should exist. 
> My point is only that any of them would need accounting anyway 
> (which is the biggest part of beancounters). 
> Throttling, modified scanners etc. can be implemented as separate 
> BC parameters. Thus, an administrator will be able to select 
> which policy should be applied to a container that is near its limit. 
>  
> So the patches I'm sending introduce step-by-step accounting of all 
> the resources and simple limits on them. A more comprehensive limiting 
> policy will be built on top of this later. 
>  
 
One of the issues I see is that bean counters are not very flexible. Tasks 
cannot change bean counters dynamically after fork()/exec() - or can they? 
 
 
> BTW, UBC page beancounters make it possible to distinguish pages used by 
> only one container from pages which are shared. So the scanner can try to 
> reclaim container-private pages first, thus not influencing other containers. 
>  
 
But can you select the specific container for which we intend to scan pages? 
 
> Thanks, 
> Kirill 
>  
 
--  
	Thanks, 
	Balbir Singh, 
	Linux Technology Center, 
	IBM Software Labs
		
		
		