| Home » Mailing lists » Devel » [PATCH v3 00/28] kmem limitation for memcg Goto Forum:
	| 
		
			| Re: [PATCH v3 15/28] slub: always get the cache from its page in kfree [message #46576 is a reply to message #46566] | Tue, 29 May 2012 15:59   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 05/29/2012 06:42 PM, Christoph Lameter wrote: > On Fri, 25 May 2012, Glauber Costa wrote:
 >
 >> struct page already have this information. If we start chaining
 >> caches, this information will always be more trustworthy than
 >> whatever is passed into the function
 >
 > Yes but the lookup of the page struct also costs some cycles. SLAB in
 > !NUMA mode and SLOB avoid these lookups and can improve their freeing
 > speed because of that.
 
 But for our case, I don't really see a way around. What I can do, is
 wrap it further, so when we're not using it, code goes exactly the same
 way as before, instead of always calculating the page. Would it be better?
 
 >> diff --git a/mm/slub.c b/mm/slub.c
 >> index 0eb9e72..640872f 100644
 >> --- a/mm/slub.c
 >> +++ b/mm/slub.c
 >> @@ -2598,10 +2598,14 @@ redo:
 >>   void kmem_cache_free(struct kmem_cache *s, void *x)
 >>   {
 >>   	struct page *page;
 >> +	bool slab_match;
 >>
 >>   	page = virt_to_head_page(x);
 >>
 >> -	slab_free(s, page, x, _RET_IP_);
 >> +	slab_match = (page->slab == s) | slab_is_parent(page->slab, s);
 >> +	VM_BUG_ON(!slab_match);
 >
 > Why add a slab_match bool if you do not really need it?
 
 style. I find aux variables a very human readable way to deal with the
 80-col limitation.
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH v3 18/28] slub: charge allocation to a memcg [message #46578 is a reply to message #46568] | Tue, 29 May 2012 16:06   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 05/29/2012 06:51 PM, Christoph Lameter wrote: > On Fri, 25 May 2012, Glauber Costa wrote:
 >
 >> This patch charges allocation of a slab object to a particular
 >> memcg.
 >
 > I am wondering why you need all the other patches. The simplest approach
 > would just to hook into page allocation and freeing from the slab
 > allocators as done here and charge to the currently active cgroup. This
 > avoids all the duplication of slab caches and per node as well as per cpu
 > structures. A certain degree of fuzziness cannot be avoided given that
 > objects are cached and may be served to multiple cgroups. If that can be
 > tolerated then the rest would be just like this patch which could be made
 > more simple and non intrusive.
 >
 Just hooking into the page allocation only works for caches with very
 big objects. For all the others, we need to relay the process to the
 correct cache.
 
 Some objects may be shared, yes, but in reality most won't.
 
 Let me give you an example:
 
 We track task_struct here. So as a nice side effect of this, a fork bomb
 will be killed because it will not be able to allocate any further.
 
 But if we're accounting only at page allocation time, it is quite
 possible to come up with a pattern while I always let other cgroups
 pay the price for the page, but I will be the one filling it.
 
 Having an eventual dentry, for instance, shared among caches, is okay.
 But the main use case is for process in different cgroups dealing with
 totally different parts of the filesystem.
 
 So we can't really afford to charge to the process touching the nth
 object where n is the number of objects per page. We need to relay it to
 the right one.
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH v3 19/28] slab: per-memcg accounting of slab caches [message #46580 is a reply to message #46579] | Tue, 29 May 2012 16:13   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 05/29/2012 08:07 PM, Glauber Costa wrote: > On 05/29/2012 06:52 PM, Christoph Lameter wrote:
 >> On Fri, 25 May 2012, Glauber Costa wrote:
 >>
 >>> > This patch charges allocation of a slab object to a particular
 >>> > memcg.
 >> Ok so a requirement is to support tracking of individual slab
 >> objects to cgroups? That is going to be quite expensive since it will
 >> touch the hotpaths.
 >>
 >
 > No, we track pages. But all the objects in the page belong to the same
 > cgroup.
 >
 
 Also, please note the following:
 
 The code that relays us to the right cache, is wrapped inside a static
 branch. Whoever is not using more than the root cgroup, will not suffer
 a single bit.
 
 If you are, but your process is in the right cgroup, you will
 unfortunately pay function call penalty(*), but the code will make and
 effort to detect that as early as possible and resume.
 
 
 (*) Not even then if you fall in the following categories, that are
 resolved inline:
 
 +       if (!current->mm)
 +               return cachep;
 +       if (in_interrupt())
 +               return cachep;
 +       if (gfp & __GFP_NOFAIL)
 +               return cachep;
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 12/28] slab: pass memcg parameter to kmem_cache_create [message #46581 is a reply to message #46573] | Tue, 29 May 2012 16:33   |  
			| 
				
				
					|  Christoph Lameter Messages: 123
 Registered: September 2006
 | Senior Member |  |  |  
	| On Tue, 29 May 2012, Glauber Costa wrote: 
 > > Ok this only duplicates the kmalloc arrays. Why not the others?
 >
 > It does duplicate the others.
 >
 > First it does a while look on the kmalloc caches, then a list_for_each_entry
 > in the rest. You probably missed it.
 
 There is no need to separately duplicate the kmalloc_caches. Those are
 included on the cache_chain.
 
 > > > @@ -2543,7 +2564,12 @@ kmem_cache_create (const char *name, size_t size,
 > > > size_t align,
 > > >   	cachep->ctor = ctor;
 > > >   	cachep->name = name;
 > > >
 > > > +	if (g_cpucache_up>= FULL)
 > > > +		mem_cgroup_register_cache(memcg, cachep);
 > >
 > > What happens if a cgroup was active during creation of slab xxy but
 > > then a process running in a different cgroup uses that slab to allocate
 > > memory? Is it charged to the first cgroup?
 >
 > I don't see this situation ever happening. kmem_cache_create, when called
 > directly, will always create a global cache. It doesn't matter which cgroups
 > are or aren't active at this time or any other. We create copies per-cgroup,
 > but we create it lazily, when someone will touch it.
 
 How do you detect that someone is touching it?
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 12/28] slab: pass memcg parameter to kmem_cache_create [message #46582 is a reply to message #46581] | Tue, 29 May 2012 16:36   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 05/29/2012 08:33 PM, Christoph Lameter wrote: > On Tue, 29 May 2012, Glauber Costa wrote:
 >
 >>> Ok this only duplicates the kmalloc arrays. Why not the others?
 >>
 >> It does duplicate the others.
 >>
 >> First it does a while look on the kmalloc caches, then a list_for_each_entry
 >> in the rest. You probably missed it.
 >
 > There is no need to separately duplicate the kmalloc_caches. Those are
 > included on the cache_chain.
 >
 >>>> @@ -2543,7 +2564,12 @@ kmem_cache_create (const char *name, size_t size,
 >>>> size_t align,
 >>>>    	cachep->ctor = ctor;
 >>>>    	cachep->name = name;
 >>>>
 >>>> +	if (g_cpucache_up>= FULL)
 >>>> +		mem_cgroup_register_cache(memcg, cachep);
 >>>
 >>> What happens if a cgroup was active during creation of slab xxy but
 >>> then a process running in a different cgroup uses that slab to allocate
 >>> memory? Is it charged to the first cgroup?
 >>
 >> I don't see this situation ever happening. kmem_cache_create, when called
 >> directly, will always create a global cache. It doesn't matter which cgroups
 >> are or aren't active at this time or any other. We create copies per-cgroup,
 >> but we create it lazily, when someone will touch it.
 >
 > How do you detect that someone is touching it?
 
 kmem_alloc_cache will create mem_cgroup_get_kmem_cache.
 (protected by static_branches, so won't happen if you don't have at
 least non-root memcg using it)
 
 * Then it detects which memcg the calling process belongs to,
 * if it is the root memcg, go back to the allocation as quickly as we
 can
 * otherwise, in the creation process, you will notice that each cache
 has an index. memcg will store pointers to the copies and find them by
 the index.
 
 From this point on, all the code of the caches is reused (except for
 accounting the page)
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 13/28] slub: create duplicate cache [message #46583 is a reply to message #46574] | Tue, 29 May 2012 16:05   |  
			| 
				
				
					|  Christoph Lameter Messages: 123
 Registered: September 2006
 | Senior Member |  |  |  
	| On Tue, 29 May 2012, Glauber Costa wrote: 
 > Accounting pages seems just crazy to me. If new allocators come in the future,
 > organizing the pages in a different way, instead of patching it here and
 > there, we need to totally rewrite this.
 
 Quite to the contrary. We could either pass a THIS_IS_A_SLAB page flag to
 the page allocator call or have a special call that does the accounting
 and then calls the page allocator. The code could be completely in
 cgroups. There would be no changes to the allocators aside from setting
 the flag or calling the alternate page allocator functions.
 
 > > Why do you need to increase the refcount? You made a full copy right?
 >
 > Yes, but I don't want this copy to go away while we have other caches around.
 
 You copied all metadata so what is there that you would still need should
 the other copy go away?
 
 > So, in the memcg internals, I used a different reference counter, to avoid
 > messing with this one. I could use that, and leave the original refcnt alone.
 > Would you prefer this?
 
 The refcounter is really not the issue.
 
 I am a bit worried about the various duplicate features here and there.
 The approach is not tightened down yet.
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH v3 12/28] slab: pass memcg parameter to kmem_cache_create [message #46585 is a reply to message #46584] | Tue, 29 May 2012 16:59   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 05/29/2012 08:52 PM, Christoph Lameter wrote: > Well kmem_cache_alloc cache is the performance critical hotpath.
 >
 > If you are already there and doing all of that then would it not be better
 > to simply count the objects allocated and freed per cgroup? Directly
 > increment and decrement counters in a cgroup? You do not really need to
 > duplicate the kmem_cache structure and do not need to modify allocators if
 > you are willing to take that kind of a performance hit. Put a wrapper
 > around kmem_cache_alloc/free and count things.
 
 Well, I see it as the difference between being a big slower, and a lot
 slower.
 
 Accounting in memcg is hard, specially because it is potentially
 hierarchical, (meaning you need to nest downwards until your parents).
 
 I never discussed that this is, unfortunately, a hotpath. However, I did
 try to minimize the impact as much as I could.
 
 Not to mention that the current scheme is bound to improvement as
 cgroups improve. One of the things being discussed is to having all
 cgroups always in the same hierarchy. If that ever happens, we can have
 the information about the current cgroup stored in a very accessible
 way, so to make this even faster.
 
 This felt like the best way I could do with the current infrastructure,
 (and again, I did make it free for people not limiting kmem), and is
 way, way cheaper than doing accounting here.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 13/28] slub: create duplicate cache [message #46586 is a reply to message #46583] | Tue, 29 May 2012 17:05   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 05/29/2012 08:05 PM, Christoph Lameter wrote: > On Tue, 29 May 2012, Glauber Costa wrote:
 >
 >> Accounting pages seems just crazy to me. If new allocators come in the future,
 >> organizing the pages in a different way, instead of patching it here and
 >> there, we need to totally rewrite this.
 >
 > Quite to the contrary. We could either pass a THIS_IS_A_SLAB page flag to
 > the page allocator call or have a special call that does the accounting
 > and then calls the page allocator. The code could be completely in
 > cgroups. There would be no changes to the allocators aside from setting
 > the flag or calling the alternate page allocator functions.
 
 Again: for the page allocation itself, we could do it. (maybe we still
 can, keeping the rest of the approach, so as to simplify that particular
 piece of code, and reduce the churn inside the cache - I am all for it)
 
 But that only solves the page allocation part. I would still need to
 make sure the caller is directed to the right page.
 
 >>> Why do you need to increase the refcount? You made a full copy right?
 >>
 >> Yes, but I don't want this copy to go away while we have other caches around.
 >
 > You copied all metadata so what is there that you would still need should
 > the other copy go away?
 
 Well, consistency being one, because it sounded weird to have the parent
 cache being deleted while the kids are around. You wouldn't be able to
 reach them, for once.
 
 But one can argue that if you deleted the cache, why would you want to
 ever reach it?
 
 I can try to remove the refcount.
 
 >
 >> So, in the memcg internals, I used a different reference counter, to avoid
 >> messing with this one. I could use that, and leave the original refcnt alone.
 >> Would you prefer this?
 >
 > The refcounter is really not the issue.
 >
 > I am a bit worried about the various duplicate features here and there.
 > The approach is not tightened down yet.
 
 Well, I still think that duplication of the structure is better - a lot
 less intrusive - than messing with the cache internals.
 
 I will try to at least have the page accounting done in a consistent
 way. How about that?
 |  
	|  |  |  
	|  |  
	|  |  
	|  |  
	| 
		
			| Re: [PATCH v3 13/28] slub: create duplicate cache [message #46590 is a reply to message #46589] | Tue, 29 May 2012 19:40   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 05/29/2012 11:26 PM, Christoph Lameter wrote: > On Tue, 29 May 2012, Glauber Costa wrote:
 >
 >> But we really need a page to be filled with objects from the same cgroup, and
 >> the non-shared objects to be accounted to the right place.
 >
 > No other subsystem has such a requirement. Even the NUMA nodes are mostly
 > suggestions and can be ignored by the allocators to use memory from other
 > pages.
 
 Of course it does. Memcg itself has such a requirement. The collective
 set of processes needs to have the pages it uses accounted to it, and
 never go over limit.
 
 >> Otherwise, I don't think we can meet even the lighter of isolation guarantees.
 >
 > The approach works just fine with NUMA and cpusets. Isolation is mostly
 > done on the per node boundaries and you already have per node statistics.
 
 I don't know about cpusets in details, but at least with NUMA, this is
 not an apple-to-apple comparison. a NUMA node is not meant to contain
 you. A container is, and that is why it is called a container.
 
 NUMA just means what is the *best* node to put my memory.
 Now, if you actually say, through you syscalls "this is the node it
 should live in", then you have a constraint, that to the best of my
 knowledge is respected.
 
 Now isolation here, is done in the container boundary. (cgroups, to be
 generic).
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 13/28] slub: create duplicate cache [message #46591 is a reply to message #46590] | Tue, 29 May 2012 19:55   |  
			| 
				
				
					|  Christoph Lameter Messages: 123
 Registered: September 2006
 | Senior Member |  |  |  
	| On Tue, 29 May 2012, Glauber Costa wrote: 
 > I don't know about cpusets in details, but at least with NUMA, this is not an
 > apple-to-apple comparison. a NUMA node is not meant to contain you. A
 > container is, and that is why it is called a container.
 
 Cpusets contains sets of nodes. A cpusets "contains" nodes. These sets are
 associated with applications.
 
 > NUMA just means what is the *best* node to put my memory.
 > Now, if you actually say, through you syscalls "this is the node it should
 > live in", then you have a constraint, that to the best of my knowledge is
 > respected.
 
 Eith cpusets it means that memory needs to come from an assigned set of
 nodes.
 
 > Now isolation here, is done in the container boundary. (cgroups, to be
 > generic).
 
 Yes and with cpusets it is done at the cpuset boundary. Very much the
 same.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 13/28] slub: create duplicate cache [message #46592 is a reply to message #46591] | Tue, 29 May 2012 20:08   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 05/29/2012 11:55 PM, Christoph Lameter wrote: >> NUMA just means what is the*best*  node to put my memory.
 >> >  Now, if you actually say, through you syscalls "this is the node it should
 >> >  live in", then you have a constraint, that to the best of my knowledge is
 >> >  respected.
 > Eith cpusets it means that memory needs to come from an assigned set of
 > nodes.
 >
 >> >  Now isolation here, is done in the container boundary. (cgroups, to be
 >> >  generic).
 > Yes and with cpusets it is done at the cpuset boundary. Very much the
 > same.
 
 Well, I'd have to dive in the code a bit more, but that the impression
 that the documentation gives me, by saying:
 
 "Cpusets constrain the CPU and Memory placement of tasks to only
 the resources within a task's current cpuset."
 
 is that you can't allocate from a node outside that set. Is this correct?
 
 So extrapolating this to memcg, the situation is as follows:
 
 * You can't use more memory than what you are assigned to.
 * In order to do that, you need to account the memory you are using
 * and to account the memory you are using, all objects in the page
 must belong to you.
 
 Please note the following:
 
 Having two cgroups touching the same object is something. It tells
 something about the relationship between them. This is shared memory.
 
 Now having two cgroups putting objects in the same page, *does not mean
 _anything_*. It just mean that one had the luck to allocate just after
 the other.
 
 With a predictable enough workload, this is a recipe for working around
 the very protection we need to establish: one can DoS a physical box
 full of containers, by always allocating in someone else's pages, and
 pinning kernel memory down. Never releasing it, so the shrinkers are
 useless.
 
 So I still believe that if a page is allocated to a cgroup, all the
 objects in there belong to it  - unless of course the sharing actually
 means something - and identifying this is just too complicated.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 13/28] slub: create duplicate cache [message #46593 is a reply to message #46592] | Tue, 29 May 2012 20:21   |  
			| 
				
				
					|  Christoph Lameter Messages: 123
 Registered: September 2006
 | Senior Member |  |  |  
	| On Wed, 30 May 2012, Glauber Costa wrote: 
 > Well, I'd have to dive in the code a bit more, but that the impression that
 > the documentation gives me, by saying:
 >
 > "Cpusets constrain the CPU and Memory placement of tasks to only
 > the resources within a task's current cpuset."
 >
 > is that you can't allocate from a node outside that set. Is this correct?
 
 Basically yes but there are exceptions (like slab queues etc). Look at the
 hardwall stuff too that allows more exceptions for kernel allocations to
 use memory from other nodes.
 
 > So extrapolating this to memcg, the situation is as follows:
 >
 > * You can't use more memory than what you are assigned to.
 > * In order to do that, you need to account the memory you are using
 > * and to account the memory you are using, all objects in the page
 >   must belong to you.
 
 Cpusets work at the page boundary and they do not have the requirement you
 are mentioning of all objects in the page having to belong to a certain
 cpusets. Let that go and things become much easier.
 
 > With a predictable enough workload, this is a recipe for working around the
 > very protection we need to establish: one can DoS a physical box full of
 > containers, by always allocating in someone else's pages, and pinning kernel
 > memory down. Never releasing it, so the shrinkers are useless.
 
 Sure you can construct hyperthetical cases like that. But then that is
 true already of other container like logic in the kernel already.
 
 > So I still believe that if a page is allocated to a cgroup, all the objects in
 > there belong to it  - unless of course the sharing actually means something -
 > and identifying this is just too complicated.
 
 We have never worked container like logic like that in the kernel due to
 the complicated logic you would have to put in. The requirement that all
 objects in a page come from the same container is not necessary. If you
 drop this notion then things become very easy and the patches will become
 simple.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 13/28] slub: create duplicate cache [message #46594 is a reply to message #46593] | Tue, 29 May 2012 20:25   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 05/30/2012 12:21 AM, Christoph Lameter wrote: > On Wed, 30 May 2012, Glauber Costa wrote:
 >
 >> Well, I'd have to dive in the code a bit more, but that the impression that
 >> the documentation gives me, by saying:
 >>
 >> "Cpusets constrain the CPU and Memory placement of tasks to only
 >> the resources within a task's current cpuset."
 >>
 >> is that you can't allocate from a node outside that set. Is this correct?
 >
 > Basically yes but there are exceptions (like slab queues etc). Look at the
 > hardwall stuff too that allows more exceptions for kernel allocations to
 > use memory from other nodes.
 >
 >> So extrapolating this to memcg, the situation is as follows:
 >>
 >> * You can't use more memory than what you are assigned to.
 >> * In order to do that, you need to account the memory you are using
 >> * and to account the memory you are using, all objects in the page
 >>    must belong to you.
 >
 > Cpusets work at the page boundary and they do not have the requirement you
 > are mentioning of all objects in the page having to belong to a certain
 > cpusets. Let that go and things become much easier.
 >
 >> With a predictable enough workload, this is a recipe for working around the
 >> very protection we need to establish: one can DoS a physical box full of
 >> containers, by always allocating in someone else's pages, and pinning kernel
 >> memory down. Never releasing it, so the shrinkers are useless.
 >
 > Sure you can construct hyperthetical cases like that. But then that is
 > true already of other container like logic in the kernel already.
 >
 >> So I still believe that if a page is allocated to a cgroup, all the objects in
 >> there belong to it  - unless of course the sharing actually means something -
 >> and identifying this is just too complicated.
 >
 > We have never worked container like logic like that in the kernel due to
 > the complicated logic you would have to put in. The requirement that all
 > objects in a page come from the same container is not necessary. If you
 > drop this notion then things become very easy and the patches will become
 > simple.
 
 I promise to look at that in more detail and get back to it. In the
 meantime, I think it would be enlightening to hear from other parties as
 well, specially the ones also directly interested in using the technology.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 13/28] slub: create duplicate cache [message #46595 is a reply to message #46593] | Tue, 29 May 2012 20:57   |  
			| 
				
				
					|  Suleiman Souhlal Messages: 64
 Registered: February 2012
 | Member |  |  |  
	| Hi Christoph, 
 On Tue, May 29, 2012 at 1:21 PM, Christoph Lameter <cl@linux.com> wrote:
 > On Wed, 30 May 2012, Glauber Costa wrote:
 >
 >> Well, I'd have to dive in the code a bit more, but that the impression that
 >> the documentation gives me, by saying:
 >>
 >> "Cpusets constrain the CPU and Memory placement of tasks to only
 >> the resources within a task's current cpuset."
 >>
 >> is that you can't allocate from a node outside that set. Is this correct?
 >
 > Basically yes but there are exceptions (like slab queues etc). Look at the
 > hardwall stuff too that allows more exceptions for kernel allocations to
 > use memory from other nodes.
 >
 >> So extrapolating this to memcg, the situation is as follows:
 >>
 >> * You can't use more memory than what you are assigned to.
 >> * In order to do that, you need to account the memory you are using
 >> * and to account the memory you are using, all objects in the page
 >>   must belong to you.
 >
 > Cpusets work at the page boundary and they do not have the requirement you
 > are mentioning of all objects in the page having to belong to a certain
 > cpusets. Let that go and things become much easier.
 >
 >> With a predictable enough workload, this is a recipe for working around the
 >> very protection we need to establish: one can DoS a physical box full of
 >> containers, by always allocating in someone else's pages, and pinning kernel
 >> memory down. Never releasing it, so the shrinkers are useless.
 >
 > Sure you can construct hyperthetical cases like that. But then that is
 > true already of other container like logic in the kernel already.
 >
 >> So I still believe that if a page is allocated to a cgroup, all the objects in
 >> there belong to it  - unless of course the sharing actually means something -
 >> and identifying this is just too complicated.
 >
 > We have never worked container like logic like that in the kernel due to
 > the complicated logic you would have to put in. The requirement that all
 > objects in a page come from the same container is not necessary. If you
 > drop this notion then things become very easy and the patches will become
 > simple.
 
 Back when we (Google) started using cpusets for memory isolation (fake
 NUMA), we found that there was a significant isolation breakage coming
 from slab pages belonging to one cpuset being used by other cpusets,
 which caused us problems: It was very easy for one job to cause slab
 growth in another container, which would cause it to OOM, despite
 being well-behaved.
 
 Because of this, we had to add logic to prevent that from happening
 (by making sure we only allocate objects from pages coming from our
 allowed nodes).
 
 Now that we're switching to doing containers with memcg, I think this
 is a hard requirement, for us. :-(
 
 -- Suleiman
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 13/28] slub: create duplicate cache [message #46596 is a reply to message #46594] | Wed, 30 May 2012 01:29   |  
			| 
				
				
					|  Tejun Heo Messages: 184
 Registered: November 2006
 | Senior Member |  |  |  
	| Hello, Christoph, Glauber. 
 On Wed, May 30, 2012 at 12:25:36AM +0400, Glauber Costa wrote:
 > >We have never worked container like logic like that in the kernel due to
 > >the complicated logic you would have to put in. The requirement that all
 > >objects in a page come from the same container is not necessary. If you
 > >drop this notion then things become very easy and the patches will become
 > >simple.
 >
 > I promise to look at that in more detail and get back to it. In the
 > meantime, I think it would be enlightening to hear from other
 > parties as well, specially the ones also directly interested in
 > using the technology.
 
 I don't think I'm too interested in using the technology ;) and
 haven't read the code (just glanced through the descriptions and
 discussions), but, in general, I think the approach of duplicating
 memory allocator per-memcg is a sane approach.  It isn't the most
 efficient one with the possibility of wasting considerable amount of
 caching area per memcg but it is something which can mostly stay out
 of the way if done right and that's how I want cgroup implementations
 to be.
 
 The two goals for cgroup controllers that I think are important are
 proper (no, not crazy perfect but good enough) isolation and an
 implementation which doesn't impact !cg path in an intrusive manner -
 if someone who doesn't care about cgroup but knows and wants to work
 on the subsystem should be able to mostly ignore cgroup support.  If
 that means overhead for cgroup users, so be it.
 
 Without looking at the actual code, my rainbow-farting unicorn here
 would be having a common slXb interface layer which handles
 interfacing with memory allocator users and cgroup and let slXb
 implement the same backend interface which doesn't care / know about
 cgroup at all (other than using the correct allocation context, that
 is).  Glauber, would something like that be possible?
 
 Thanks.
 
 --
 tejun
 |  
	|  |  |  
	| 
		
			| Re:  Re: [PATCH v3 13/28] slub: create duplicate cache [message #46597 is a reply to message #46596] | Wed, 30 May 2012 07:28   |  
			| 
				
				
					|  James Bottomley Messages: 17
 Registered: May 2006
 | Junior Member |  |  |  
	| On Wed, 2012-05-30 at 10:29 +0900, Tejun Heo wrote: > Hello, Christoph, Glauber.
 >
 > On Wed, May 30, 2012 at 12:25:36AM +0400, Glauber Costa wrote:
 > > >We have never worked container like logic like that in the kernel due to
 > > >the complicated logic you would have to put in. The requirement that all
 > > >objects in a page come from the same container is not necessary. If you
 > > >drop this notion then things become very easy and the patches will become
 > > >simple.
 > >
 > > I promise to look at that in more detail and get back to it. In the
 > > meantime, I think it would be enlightening to hear from other
 > > parties as well, specially the ones also directly interested in
 > > using the technology.
 >
 > I don't think I'm too interested in using the technology ;) and
 > haven't read the code (just glanced through the descriptions and
 > discussions), but, in general, I think the approach of duplicating
 > memory allocator per-memcg is a sane approach.  It isn't the most
 > efficient one with the possibility of wasting considerable amount of
 > caching area per memcg but it is something which can mostly stay out
 > of the way if done right and that's how I want cgroup implementations
 > to be.
 
 Exactly: we at parallels initially disliked the cgroup multipled by slab
 approach (Our beancounters do count objects) because we feared memory
 wastage and density is very important to us (which tends to mean
 efficient use of memory) however, when we ran through the calculations
 in Prague, you can show that we have ~200 slabs and if each wastes half
 a page, thats ~4MB memory lost per container.  Since most virtual
 environments are of the order nowadays of 0.5GB, we feel it's an
 annoying but acceptable price to pay.
 
 James
 
 > The two goals for cgroup controllers that I think are important are
 > proper (no, not crazy perfect but good enough) isolation and an
 > implementation which doesn't impact !cg path in an intrusive manner -
 > if someone who doesn't care about cgroup but knows and wants to work
 > on the subsystem should be able to mostly ignore cgroup support.  If
 > that means overhead for cgroup users, so be it.
 >
 > Without looking at the actual code, my rainbow-farting unicorn here
 > would be having a common slXb interface layer which handles
 > interfacing with memory allocator users and cgroup and let slXb
 > implement the same backend interface which doesn't care / know about
 > cgroup at all (other than using the correct allocation context, that
 > is).  Glauber, would something like that be possible?
 >
 > Thanks.
 >
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 13/28] slub: create duplicate cache [message #46598 is a reply to message #46596] | Wed, 30 May 2012 07:54   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 05/30/2012 05:29 AM, Tejun Heo wrote: > The two goals for cgroup controllers that I think are important are
 > proper (no, not crazy perfect but good enough) isolation and an
 > implementation which doesn't impact !cg path in an intrusive manner -
 > if someone who doesn't care about cgroup but knows and wants to work
 > on the subsystem should be able to mostly ignore cgroup support.  If
 > that means overhead for cgroup users, so be it.
 
 Well, my code in the slab is totally wrapped in static branches. They
 only come active when the first group is *limited* (not even created:
 you can have a thousand memcg, if none of them are kmem limited, nothing
 will happen).
 
 After that, the cost paid is to find out at which cgroup the process is
 at. I believe that if we had a faster way for this (like for instance:
 if we had a single hierarchy, the scheduler could put this in a percpu
 variable after context switch - or any other method), then the cost of
 it could be really low, even when this is enabled.
 
 > Without looking at the actual code, my rainbow-farting unicorn here
 > would be having a common slXb interface layer which handles
 > interfacing with memory allocator users and cgroup and let slXb
 > implement the same backend interface which doesn't care / know about
 > cgroup at all (other than using the correct allocation context, that
 > is).  Glauber, would something like that be possible?
 
 It is a matter of degree. There *is* a lot of stuff in common code, and
 I tried to do it as much as I could. Christoph gave me a nice idea about
 hinting to the page allocator that this page is a slab page before the
 allocation, and then we could account from the page allocator directly -
 without touching the cache code at all. This could be done, but some
 stuff would still be there, basically because of differences in how the
 allocator behaves. I think long term Christoph's effort to produce
 common code among them will help a lot, if they stabilize their behavior
 in certain areas.
 
 I will rework this series to try work more towards this goal, but at
 least for now I'll keep duplicating the caches. I still don't believe
 that a loose accounting to the extent Christoph proposed will achieve
 what we need this to achieve.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 13/28] slub: create duplicate cache [message #46599 is a reply to message #46598] | Wed, 30 May 2012 08:02   |  
			| 
				
				
					|  Tejun Heo Messages: 184
 Registered: November 2006
 | Senior Member |  |  |  
	| Hello, Glauber. 
 On Wed, May 30, 2012 at 4:54 PM, Glauber Costa <glommer@parallels.com> wrote:
 > On 05/30/2012 05:29 AM, Tejun Heo wrote:
 >>
 >> The two goals for cgroup controllers that I think are important are
 >> proper (no, not crazy perfect but good enough) isolation and an
 >> implementation which doesn't impact !cg path in an intrusive manner -
 >> if someone who doesn't care about cgroup but knows and wants to work
 >> on the subsystem should be able to mostly ignore cgroup support.  If
 >> that means overhead for cgroup users, so be it.
 >
 >
 > Well, my code in the slab is totally wrapped in static branches. They only
 > come active when the first group is *limited* (not even created: you can
 > have a thousand memcg, if none of them are kmem limited, nothing will
 > happen).
 
 Great, but I'm not sure why you're trying to emphasize that while my
 point was about memory overhead and that it's OK to have some
 overheads for cg users. :)
 
 > After that, the cost paid is to find out at which cgroup the process is at.
 > I believe that if we had a faster way for this (like for instance: if we had
 > a single hierarchy, the scheduler could put this in a percpu variable after
 > context switch - or any other method), then the cost of it could be really
 > low, even when this is enabled.
 
 Someday, hopefully.
 
 > I will rework this series to try work more towards this goal, but at least
 > for now I'll keep duplicating the caches. I still don't believe that a loose
 > accounting to the extent Christoph proposed will achieve what we need this
 > to achieve.
 
 Yeah, I prefer your per-cg cache approach but do hope that it stays as
 far from actual allocator code as possible. Christoph, would it be
 acceptable if the cg logic is better separated?
 
 Thanks.
 
 --
 tejun
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 12/28] slab: pass memcg parameter to kmem_cache_create [message #46619 is a reply to message #46584] | Wed, 30 May 2012 11:01   |  
			| 
				
				
					|  Frederic Weisbecker Messages: 25
 Registered: April 2012
 | Junior Member |  |  |  
	| On Tue, May 29, 2012 at 11:52:55AM -0500, Christoph Lameter wrote: > On Tue, 29 May 2012, Glauber Costa wrote:
 >
 > > > How do you detect that someone is touching it?
 > >
 > > kmem_alloc_cache will create mem_cgroup_get_kmem_cache.
 > > (protected by static_branches, so won't happen if you don't have at least
 > > non-root memcg using it)
 > >
 > > * Then it detects which memcg the calling process belongs to,
 > > * if it is the root memcg, go back to the allocation as quickly as we
 > >   can
 > > * otherwise, in the creation process, you will notice that each cache
 > >   has an index. memcg will store pointers to the copies and find them by
 > >   the index.
 > >
 > > From this point on, all the code of the caches is reused (except for
 > > accounting the page)
 >
 > Well kmem_cache_alloc cache is the performance critical hotpath.
 >
 > If you are already there and doing all of that then would it not be better
 > to simply count the objects allocated and freed per cgroup? Directly
 > increment and decrement counters in a cgroup? You do not really need to
 > duplicate the kmem_cache structure and do not need to modify allocators if
 > you are willing to take that kind of a performance hit. Put a wrapper
 > around kmem_cache_alloc/free and count things.
 
 I believe one of the issues is also that a task can migrate to another cgroup
 anytime. But an object that has been charged to a cgroup must be later uncharged
 to the same, unless you move the charge as you move the task. But then it means
 you need to keep track of the allocations per task, and you also need to be able
 to do that reverse mapping (object -> allocating task) because your object can
 be allocated by task A but later freed by task B. Then when you do the uncharge
 it must happen to the cgroup of A, not the one of B.
 
 That all would be much more complicated and performance sensitive than what this
 patchset does. Dealing with duplicate caches for accounting seem to me a good tradeoff
 between allocation performance hot path and maintaining cgroups semantics.
 |  
	|  |  |  
	|  |  
	|  |  
	|  |  
	| 
		
			| Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46634 is a reply to message #46633] | Wed, 30 May 2012 12:38   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 05/30/2012 04:34 PM, Frederic Weisbecker wrote: > On Fri, May 25, 2012 at 05:03:36PM +0400, Glauber Costa wrote:
 >> +bool __mem_cgroup_new_kmem_page(struct page *page, gfp_t gfp)
 >> +{
 >> +	struct mem_cgroup *memcg;
 >> +	struct page_cgroup *pc;
 >> +	bool ret = true;
 >> +	size_t size;
 >> +	struct task_struct *p;
 >> +
 >> +	if (!current->mm || in_interrupt())
 >> +		return true;
 >> +
 >> +	rcu_read_lock();
 >> +	p = rcu_dereference(current->mm->owner);
 >> +	memcg = mem_cgroup_from_task(p);
 >
 > So this takes the memcg of the group owner rather than the
 > task? I understand why we want this for user memory, but for
 > kernel?
 
 That was already discussed when this first came up in my last submission
 If I recall correctly, Kame pointed out that this would be needed for
 proper OOM-scoring and killing.
 
 Now of course we won't oom kernel threads or anything like that. But
 since this is also accounted towards memcg, it should at least be
 consistent with each memcg it accounts to.
 
 We can't account kmem for the thread's memcg, and mem to the process'.
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46640 is a reply to message #46639] | Wed, 30 May 2012 13:06   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 05/30/2012 05:04 PM, Frederic Weisbecker wrote: > Do you think it's possible that this memcg can be destroyed (like ss->destroy())
 > concurrently?
 >
 > Probably not because there is a synchronize_rcu() in cgroup_diput() so as long
 > as we are in rcu_read_lock() we are fine.
 >
 > OTOH current->mm->owner can exit() right after we fetched its memcg and thus the css_set
 > can be freed concurrently? And then the cgroup itself after we call rcu_read_unlock()
 > due to cgroup_diput().
 > And yet we are doing the mem_cgroup_get() below unconditionally assuming it's
 > always fine to get a reference to it.
 >
 > May be I'm missing something?
 When a cache is created, we grab a reference to the memcg. So after the
 cache is created, no.
 
 When destroy is called, we flush the create queue, so if the cache is
 not created yet, it will just disappear.
 
 I think the only problem that might happen is in the following scenario:
 
 * cache gets created, but ref count is not yet taken
 * memcg disappears
 * we try to inc refcount for a non-existent memcg, and crash.
 
 This would be trivially solvable by grabing the reference earlier.
 But even then, I need to audit this further to make sure it is really an
 issue.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46641 is a reply to message #46634] | Wed, 30 May 2012 13:11   |  
			| 
				
				
					|  Frederic Weisbecker Messages: 25
 Registered: April 2012
 | Junior Member |  |  |  
	| On Wed, May 30, 2012 at 04:38:39PM +0400, Glauber Costa wrote: > On 05/30/2012 04:34 PM, Frederic Weisbecker wrote:
 > >On Fri, May 25, 2012 at 05:03:36PM +0400, Glauber Costa wrote:
 > >>+bool __mem_cgroup_new_kmem_page(struct page *page, gfp_t gfp)
 > >>+{
 > >>+	struct mem_cgroup *memcg;
 > >>+	struct page_cgroup *pc;
 > >>+	bool ret = true;
 > >>+	size_t size;
 > >>+	struct task_struct *p;
 > >>+
 > >>+	if (!current->mm || in_interrupt())
 > >>+		return true;
 > >>+
 > >>+	rcu_read_lock();
 > >>+	p = rcu_dereference(current->mm->owner);
 > >>+	memcg = mem_cgroup_from_task(p);
 > >
 > >So this takes the memcg of the group owner rather than the
 > >task? I understand why we want this for user memory, but for
 > >kernel?
 >
 > That was already discussed when this first came up in my last submission
 > If I recall correctly, Kame pointed out that this would be needed
 > for proper OOM-scoring and killing.
 
 Can we have at least a comment in the code that explain the reasons of taking the
 owner rather than the task? It's not going to be very obvious to future reviewers.
 
 > Now of course we won't oom kernel threads or anything like that.
 
 Seems we are not even accounting them anyway.
 
 > But since this is also accounted towards memcg, it should at least be
 > consistent with each memcg it accounts to.
 >
 > We can't account kmem for the thread's memcg, and mem to the process'.
 
 Don't know. This goes a bit against cgroups semantics which group at the task
 level and not process. But I personally don't mind much, as long as it's
 documented.
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46645 is a reply to message #46640] | Wed, 30 May 2012 13:37   |  
			| 
				
				
					|  Frederic Weisbecker Messages: 25
 Registered: April 2012
 | Junior Member |  |  |  
	| On Wed, May 30, 2012 at 05:06:22PM +0400, Glauber Costa wrote: > On 05/30/2012 05:04 PM, Frederic Weisbecker wrote:
 > >Do you think it's possible that this memcg can be destroyed (like ss->destroy())
 > >concurrently?
 > >
 > >Probably not because there is a synchronize_rcu() in cgroup_diput() so as long
 > >as we are in rcu_read_lock() we are fine.
 > >
 > >OTOH current->mm->owner can exit() right after we fetched its memcg and thus the css_set
 > >can be freed concurrently? And then the cgroup itself after we call rcu_read_unlock()
 > >due to cgroup_diput().
 > >And yet we are doing the mem_cgroup_get() below unconditionally assuming it's
 > >always fine to get a reference to it.
 > >
 > >May be I'm missing something?
 > When a cache is created, we grab a reference to the memcg. So after
 > the cache is created, no.
 >
 > When destroy is called, we flush the create queue, so if the cache
 > is not created yet, it will just disappear.
 >
 > I think the only problem that might happen is in the following scenario:
 >
 > * cache gets created, but ref count is not yet taken
 > * memcg disappears
 > * we try to inc refcount for a non-existent memcg, and crash.
 >
 > This would be trivially solvable by grabing the reference earlier.
 > But even then, I need to audit this further to make sure it is
 > really an issue.
 
 Right. __mem_cgroup_get_kmem_cache() fetches the memcg of the owner
 and calls memcg_create_cache_enqueue() which does css_tryget(&memcg->css).
 After this tryget I think you're fine. And in-between you're safe against
 css_set removal due to rcu_read_lock().
 
 I'm less clear with __mem_cgroup_new_kmem_page() though...
 |  
	|  |  |  
	|  |  
	|  |  
	| 
		
			| Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46648 is a reply to message #46647] | Wed, 30 May 2012 13:55   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 05/30/2012 05:53 PM, Frederic Weisbecker wrote: > On Wed, May 30, 2012 at 05:37:57PM +0400, Glauber Costa wrote:
 >> On 05/30/2012 05:37 PM, Frederic Weisbecker wrote:
 >>> Right. __mem_cgroup_get_kmem_cache() fetches the memcg of the owner
 >>> and calls memcg_create_cache_enqueue() which does css_tryget(&memcg->css).
 >>> After this tryget I think you're fine. And in-between you're safe against
 >>> css_set removal due to rcu_read_lock().
 >>>
 >>> I'm less clear with __mem_cgroup_new_kmem_page() though...
 >>
 >> That one does not get memcg->css but it does call mem_cgroup_get(),
 >> that does prevent against the memcg structure being freed, which I
 >> believe to be good enough.
 >
 > What if the owner calls cgroup_exit() between mem_cgroup_from_task()
 > and mem_cgroup_get()? The css_set which contains the memcg gets freed.
 > Also the reference on the memcg doesn't even prevent the css_set to
 > be removed, does it?
 It doesn't, but we don't really care. The css can go away, if the memcg
 structure stays. The caches will outlive the memcg anyway, since it is
 possible that you delete it, with some caches still holding objects that
 are not freed (they will be marked as dead).
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46649 is a reply to message #46648] | Wed, 30 May 2012 15:33   |  
			| 
				
				
					|  Frederic Weisbecker Messages: 25
 Registered: April 2012
 | Junior Member |  |  |  
	| On Wed, May 30, 2012 at 05:55:38PM +0400, Glauber Costa wrote: > On 05/30/2012 05:53 PM, Frederic Weisbecker wrote:
 > >On Wed, May 30, 2012 at 05:37:57PM +0400, Glauber Costa wrote:
 > >>On 05/30/2012 05:37 PM, Frederic Weisbecker wrote:
 > >>>Right. __mem_cgroup_get_kmem_cache() fetches the memcg of the owner
 > >>>and calls memcg_create_cache_enqueue() which does css_tryget(&memcg->css).
 > >>>After this tryget I think you're fine. And in-between you're safe against
 > >>>css_set removal due to rcu_read_lock().
 > >>>
 > >>>I'm less clear with __mem_cgroup_new_kmem_page() though...
 > >>
 > >>That one does not get memcg->css but it does call mem_cgroup_get(),
 > >>that does prevent against the memcg structure being freed, which I
 > >>believe to be good enough.
 > >
 > >What if the owner calls cgroup_exit() between mem_cgroup_from_task()
 > >and mem_cgroup_get()? The css_set which contains the memcg gets freed.
 > >Also the reference on the memcg doesn't even prevent the css_set to
 > >be removed, does it?
 > It doesn't, but we don't really care. The css can go away, if the
 > memcg structure stays.
 
 Ah right, the memcg itself is only freed at destroy time.
 
 > The caches will outlive the memcg anyway,
 > since it is possible that you delete it, with some caches still
 > holding objects that
 > are not freed (they will be marked as dead).
 
 I guess I need to look at how the destroy path is handled in your patchset
 then. Or how you ensure that __mem_cgroup_new_kmem_page() can't race against
 destroy.
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46653 is a reply to message #46649] | Wed, 30 May 2012 16:16   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 05/30/2012 07:33 PM, Frederic Weisbecker wrote: > On Wed, May 30, 2012 at 05:55:38PM +0400, Glauber Costa wrote:
 >> On 05/30/2012 05:53 PM, Frederic Weisbecker wrote:
 >>> On Wed, May 30, 2012 at 05:37:57PM +0400, Glauber Costa wrote:
 >>>> On 05/30/2012 05:37 PM, Frederic Weisbecker wrote:
 >>>>> Right. __mem_cgroup_get_kmem_cache() fetches the memcg of the owner
 >>>>> and calls memcg_create_cache_enqueue() which does css_tryget(&memcg->css).
 >>>>> After this tryget I think you're fine. And in-between you're safe against
 >>>>> css_set removal due to rcu_read_lock().
 >>>>>
 >>>>> I'm less clear with __mem_cgroup_new_kmem_page() though...
 >>>>
 >>>> That one does not get memcg->css but it does call mem_cgroup_get(),
 >>>> that does prevent against the memcg structure being freed, which I
 >>>> believe to be good enough.
 >>>
 >>> What if the owner calls cgroup_exit() between mem_cgroup_from_task()
 >>> and mem_cgroup_get()? The css_set which contains the memcg gets freed.
 >>> Also the reference on the memcg doesn't even prevent the css_set to
 >>> be removed, does it?
 >> It doesn't, but we don't really care. The css can go away, if the
 >> memcg structure stays.
 >
 > Ah right, the memcg itself is only freed at destroy time.
 >
 >> The caches will outlive the memcg anyway,
 >> since it is possible that you delete it, with some caches still
 >> holding objects that
 >> are not freed (they will be marked as dead).
 >
 > I guess I need to look at how the destroy path is handled in your patchset
 > then. Or how you ensure that __mem_cgroup_new_kmem_page() can't race against
 > destroy.
 Appreciate that, thanks.
 |  
	|  |  | 
 
 
 Current Time: Sat Oct 25 15:33:57 GMT 2025 
 Total time taken to generate the page: 0.08401 seconds |