OpenVZ Forum


Home » Mailing lists » Devel » [PATCH v3 00/28] kmem limitation for memcg
Re: [PATCH v3 15/28] slub: always get the cache from its page in kfree [message #46576 is a reply to message #46566] Tue, 29 May 2012 15:59 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/29/2012 06:42 PM, Christoph Lameter wrote:
> On Fri, 25 May 2012, Glauber Costa wrote:
>
>> struct page already have this information. If we start chaining
>> caches, this information will always be more trustworthy than
>> whatever is passed into the function
>
> Yes but the lookup of the page struct also costs some cycles. SLAB in
> !NUMA mode and SLOB avoid these lookups and can improve their freeing
> speed because of that.

But for our case, I don't really see a way around. What I can do, is
wrap it further, so when we're not using it, code goes exactly the same
way as before, instead of always calculating the page. Would it be better?

>> diff --git a/mm/slub.c b/mm/slub.c
>> index 0eb9e72..640872f 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -2598,10 +2598,14 @@ redo:
>> void kmem_cache_free(struct kmem_cache *s, void *x)
>> {
>> struct page *page;
>> + bool slab_match;
>>
>> page = virt_to_head_page(x);
>>
>> - slab_free(s, page, x, _RET_IP_);
>> + slab_match = (page->slab == s) | slab_is_parent(page->slab, s);
>> + VM_BUG_ON(!slab_match);
>
> Why add a slab_match bool if you do not really need it?

style. I find aux variables a very human readable way to deal with the
80-col limitation.
Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46577 is a reply to message #46567] Tue, 29 May 2012 16:00 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/29/2012 06:47 PM, Christoph Lameter wrote:
> On Fri, 25 May 2012, Glauber Costa wrote:
>
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -696,7 +696,7 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>> then swapaccount=0 does the trick).
>> config CGROUP_MEM_RES_CTLR_KMEM
>> bool "Memory Resource Controller Kernel Memory accounting (EXPERIMENTAL)"
>> - depends on CGROUP_MEM_RES_CTLR&& EXPERIMENTAL
>> + depends on CGROUP_MEM_RES_CTLR&& EXPERIMENTAL&& !SLOB
>> default n
>
> Ok so SLOB is not supported at all.
>
Yes, at least I see no reason to.
Re: [PATCH v3 18/28] slub: charge allocation to a memcg [message #46578 is a reply to message #46568] Tue, 29 May 2012 16:06 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/29/2012 06:51 PM, Christoph Lameter wrote:
> On Fri, 25 May 2012, Glauber Costa wrote:
>
>> This patch charges allocation of a slab object to a particular
>> memcg.
>
> I am wondering why you need all the other patches. The simplest approach
> would just to hook into page allocation and freeing from the slab
> allocators as done here and charge to the currently active cgroup. This
> avoids all the duplication of slab caches and per node as well as per cpu
> structures. A certain degree of fuzziness cannot be avoided given that
> objects are cached and may be served to multiple cgroups. If that can be
> tolerated then the rest would be just like this patch which could be made
> more simple and non intrusive.
>
Just hooking into the page allocation only works for caches with very
big objects. For all the others, we need to relay the process to the
correct cache.

Some objects may be shared, yes, but in reality most won't.

Let me give you an example:

We track task_struct here. So as a nice side effect of this, a fork bomb
will be killed because it will not be able to allocate any further.

But if we're accounting only at page allocation time, it is quite
possible to come up with a pattern while I always let other cgroups
pay the price for the page, but I will be the one filling it.

Having an eventual dentry, for instance, shared among caches, is okay.
But the main use case is for process in different cgroups dealing with
totally different parts of the filesystem.

So we can't really afford to charge to the process touching the nth
object where n is the number of objects per page. We need to relay it to
the right one.
Re: [PATCH v3 19/28] slab: per-memcg accounting of slab caches [message #46579 is a reply to message #46569] Tue, 29 May 2012 16:07 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/29/2012 06:52 PM, Christoph Lameter wrote:
> On Fri, 25 May 2012, Glauber Costa wrote:
>
>> > This patch charges allocation of a slab object to a particular
>> > memcg.
> Ok so a requirement is to support tracking of individual slab
> objects to cgroups? That is going to be quite expensive since it will
> touch the hotpaths.
>

No, we track pages. But all the objects in the page belong to the same
cgroup.
Re: [PATCH v3 19/28] slab: per-memcg accounting of slab caches [message #46580 is a reply to message #46579] Tue, 29 May 2012 16:13 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/29/2012 08:07 PM, Glauber Costa wrote:
> On 05/29/2012 06:52 PM, Christoph Lameter wrote:
>> On Fri, 25 May 2012, Glauber Costa wrote:
>>
>>> > This patch charges allocation of a slab object to a particular
>>> > memcg.
>> Ok so a requirement is to support tracking of individual slab
>> objects to cgroups? That is going to be quite expensive since it will
>> touch the hotpaths.
>>
>
> No, we track pages. But all the objects in the page belong to the same
> cgroup.
>

Also, please note the following:

The code that relays us to the right cache, is wrapped inside a static
branch. Whoever is not using more than the root cgroup, will not suffer
a single bit.

If you are, but your process is in the right cgroup, you will
unfortunately pay function call penalty(*), but the code will make and
effort to detect that as early as possible and resume.


(*) Not even then if you fall in the following categories, that are
resolved inline:

+ if (!current->mm)
+ return cachep;
+ if (in_interrupt())
+ return cachep;
+ if (gfp & __GFP_NOFAIL)
+ return cachep;
Re: [PATCH v3 12/28] slab: pass memcg parameter to kmem_cache_create [message #46581 is a reply to message #46573] Tue, 29 May 2012 16:33 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Tue, 29 May 2012, Glauber Costa wrote:

> > Ok this only duplicates the kmalloc arrays. Why not the others?
>
> It does duplicate the others.
>
> First it does a while look on the kmalloc caches, then a list_for_each_entry
> in the rest. You probably missed it.

There is no need to separately duplicate the kmalloc_caches. Those are
included on the cache_chain.

> > > @@ -2543,7 +2564,12 @@ kmem_cache_create (const char *name, size_t size,
> > > size_t align,
> > > cachep->ctor = ctor;
> > > cachep->name = name;
> > >
> > > + if (g_cpucache_up>= FULL)
> > > + mem_cgroup_register_cache(memcg, cachep);
> >
> > What happens if a cgroup was active during creation of slab xxy but
> > then a process running in a different cgroup uses that slab to allocate
> > memory? Is it charged to the first cgroup?
>
> I don't see this situation ever happening. kmem_cache_create, when called
> directly, will always create a global cache. It doesn't matter which cgroups
> are or aren't active at this time or any other. We create copies per-cgroup,
> but we create it lazily, when someone will touch it.

How do you detect that someone is touching it?
Re: [PATCH v3 12/28] slab: pass memcg parameter to kmem_cache_create [message #46582 is a reply to message #46581] Tue, 29 May 2012 16:36 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/29/2012 08:33 PM, Christoph Lameter wrote:
> On Tue, 29 May 2012, Glauber Costa wrote:
>
>>> Ok this only duplicates the kmalloc arrays. Why not the others?
>>
>> It does duplicate the others.
>>
>> First it does a while look on the kmalloc caches, then a list_for_each_entry
>> in the rest. You probably missed it.
>
> There is no need to separately duplicate the kmalloc_caches. Those are
> included on the cache_chain.
>
>>>> @@ -2543,7 +2564,12 @@ kmem_cache_create (const char *name, size_t size,
>>>> size_t align,
>>>> cachep->ctor = ctor;
>>>> cachep->name = name;
>>>>
>>>> + if (g_cpucache_up>= FULL)
>>>> + mem_cgroup_register_cache(memcg, cachep);
>>>
>>> What happens if a cgroup was active during creation of slab xxy but
>>> then a process running in a different cgroup uses that slab to allocate
>>> memory? Is it charged to the first cgroup?
>>
>> I don't see this situation ever happening. kmem_cache_create, when called
>> directly, will always create a global cache. It doesn't matter which cgroups
>> are or aren't active at this time or any other. We create copies per-cgroup,
>> but we create it lazily, when someone will touch it.
>
> How do you detect that someone is touching it?

kmem_alloc_cache will create mem_cgroup_get_kmem_cache.
(protected by static_branches, so won't happen if you don't have at
least non-root memcg using it)

* Then it detects which memcg the calling process belongs to,
* if it is the root memcg, go back to the allocation as quickly as we
can
* otherwise, in the creation process, you will notice that each cache
has an index. memcg will store pointers to the copies and find them by
the index.

From this point on, all the code of the caches is reused (except for
accounting the page)
Re: [PATCH v3 13/28] slub: create duplicate cache [message #46583 is a reply to message #46574] Tue, 29 May 2012 16:05 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Tue, 29 May 2012, Glauber Costa wrote:

> Accounting pages seems just crazy to me. If new allocators come in the future,
> organizing the pages in a different way, instead of patching it here and
> there, we need to totally rewrite this.

Quite to the contrary. We could either pass a THIS_IS_A_SLAB page flag to
the page allocator call or have a special call that does the accounting
and then calls the page allocator. The code could be completely in
cgroups. There would be no changes to the allocators aside from setting
the flag or calling the alternate page allocator functions.

> > Why do you need to increase the refcount? You made a full copy right?
>
> Yes, but I don't want this copy to go away while we have other caches around.

You copied all metadata so what is there that you would still need should
the other copy go away?

> So, in the memcg internals, I used a different reference counter, to avoid
> messing with this one. I could use that, and leave the original refcnt alone.
> Would you prefer this?

The refcounter is really not the issue.

I am a bit worried about the various duplicate features here and there.
The approach is not tightened down yet.
Re: [PATCH v3 12/28] slab: pass memcg parameter to kmem_cache_create [message #46584 is a reply to message #46582] Tue, 29 May 2012 16:52 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Tue, 29 May 2012, Glauber Costa wrote:

> > How do you detect that someone is touching it?
>
> kmem_alloc_cache will create mem_cgroup_get_kmem_cache.
> (protected by static_branches, so won't happen if you don't have at least
> non-root memcg using it)
>
> * Then it detects which memcg the calling process belongs to,
> * if it is the root memcg, go back to the allocation as quickly as we
> can
> * otherwise, in the creation process, you will notice that each cache
> has an index. memcg will store pointers to the copies and find them by
> the index.
>
> From this point on, all the code of the caches is reused (except for
> accounting the page)

Well kmem_cache_alloc cache is the performance critical hotpath.

If you are already there and doing all of that then would it not be better
to simply count the objects allocated and freed per cgroup? Directly
increment and decrement counters in a cgroup? You do not really need to
duplicate the kmem_cache structure and do not need to modify allocators if
you are willing to take that kind of a performance hit. Put a wrapper
around kmem_cache_alloc/free and count things.
Re: [PATCH v3 12/28] slab: pass memcg parameter to kmem_cache_create [message #46585 is a reply to message #46584] Tue, 29 May 2012 16:59 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/29/2012 08:52 PM, Christoph Lameter wrote:
> Well kmem_cache_alloc cache is the performance critical hotpath.
>
> If you are already there and doing all of that then would it not be better
> to simply count the objects allocated and freed per cgroup? Directly
> increment and decrement counters in a cgroup? You do not really need to
> duplicate the kmem_cache structure and do not need to modify allocators if
> you are willing to take that kind of a performance hit. Put a wrapper
> around kmem_cache_alloc/free and count things.

Well, I see it as the difference between being a big slower, and a lot
slower.

Accounting in memcg is hard, specially because it is potentially
hierarchical, (meaning you need to nest downwards until your parents).

I never discussed that this is, unfortunately, a hotpath. However, I did
try to minimize the impact as much as I could.

Not to mention that the current scheme is bound to improvement as
cgroups improve. One of the things being discussed is to having all
cgroups always in the same hierarchy. If that ever happens, we can have
the information about the current cgroup stored in a very accessible
way, so to make this even faster.

This felt like the best way I could do with the current infrastructure,
(and again, I did make it free for people not limiting kmem), and is
way, way cheaper than doing accounting here.
Re: [PATCH v3 13/28] slub: create duplicate cache [message #46586 is a reply to message #46583] Tue, 29 May 2012 17:05 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/29/2012 08:05 PM, Christoph Lameter wrote:
> On Tue, 29 May 2012, Glauber Costa wrote:
>
>> Accounting pages seems just crazy to me. If new allocators come in the future,
>> organizing the pages in a different way, instead of patching it here and
>> there, we need to totally rewrite this.
>
> Quite to the contrary. We could either pass a THIS_IS_A_SLAB page flag to
> the page allocator call or have a special call that does the accounting
> and then calls the page allocator. The code could be completely in
> cgroups. There would be no changes to the allocators aside from setting
> the flag or calling the alternate page allocator functions.

Again: for the page allocation itself, we could do it. (maybe we still
can, keeping the rest of the approach, so as to simplify that particular
piece of code, and reduce the churn inside the cache - I am all for it)

But that only solves the page allocation part. I would still need to
make sure the caller is directed to the right page.

>>> Why do you need to increase the refcount? You made a full copy right?
>>
>> Yes, but I don't want this copy to go away while we have other caches around.
>
> You copied all metadata so what is there that you would still need should
> the other copy go away?

Well, consistency being one, because it sounded weird to have the parent
cache being deleted while the kids are around. You wouldn't be able to
reach them, for once.

But one can argue that if you deleted the cache, why would you want to
ever reach it?

I can try to remove the refcount.

>
>> So, in the memcg internals, I used a different reference counter, to avoid
>> messing with this one. I could use that, and leave the original refcnt alone.
>> Would you prefer this?
>
> The refcounter is really not the issue.
>
> I am a bit worried about the various duplicate features here and there.
> The approach is not tightened down yet.

Well, I still think that duplication of the structure is better - a lot
less intrusive - than messing with the cache internals.

I will try to at least have the page accounting done in a consistent
way. How about that?
Re: [PATCH v3 13/28] slub: create duplicate cache [message #46587 is a reply to message #46586] Tue, 29 May 2012 17:25 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Tue, 29 May 2012, Glauber Costa wrote:

> I will try to at least have the page accounting done in a consistent way. How
> about that?

Ok. What do you mean by "consistent"? Since objects and pages can be used
in a shared way and since accounting in many areas of the kernel is
intentional fuzzy to avoid performance issues there is not that much of a
requirement for accuracy. We have problems even just defining what the
exact set of pages belonging to a task/process is.
Re: [PATCH v3 13/28] slub: create duplicate cache [message #46588 is a reply to message #46587] Tue, 29 May 2012 17:27 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/29/2012 09:25 PM, Christoph Lameter wrote:
>> I will try to at least have the page accounting done in a consistent way. How
>> > about that?
> Ok. What do you mean by "consistent"? Since objects and pages can be used
> in a shared way and since accounting in many areas of the kernel is
> intentional fuzzy to avoid performance issues there is not that much of a
> requirement for accuracy. We have problems even just defining what the
> exact set of pages belonging to a task/process is.
>
I mean consistent between allocators, done in a shared place.
Note that what we do here *is* still fuzzy to some degree. We can have a
level of sharing, and always account to the first one to touch, we don't
go worrying about this because that would be hell. We also don't
carry a process' objects around when we send it to a different cgroup.
Those are all already simplications for performance and simplicity sake.

But we really need a page to be filled with objects from the same
cgroup, and the non-shared objects to be accounted to the right place.

Otherwise, I don't think we can meet even the lighter of isolation
guarantees.
Re: [PATCH v3 13/28] slub: create duplicate cache [message #46589 is a reply to message #46588] Tue, 29 May 2012 19:26 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Tue, 29 May 2012, Glauber Costa wrote:

> But we really need a page to be filled with objects from the same cgroup, and
> the non-shared objects to be accounted to the right place.

No other subsystem has such a requirement. Even the NUMA nodes are mostly
suggestions and can be ignored by the allocators to use memory from other
pages.

> Otherwise, I don't think we can meet even the lighter of isolation guarantees.

The approach works just fine with NUMA and cpusets. Isolation is mostly
done on the per node boundaries and you already have per node statistics.
Re: [PATCH v3 13/28] slub: create duplicate cache [message #46590 is a reply to message #46589] Tue, 29 May 2012 19:40 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/29/2012 11:26 PM, Christoph Lameter wrote:
> On Tue, 29 May 2012, Glauber Costa wrote:
>
>> But we really need a page to be filled with objects from the same cgroup, and
>> the non-shared objects to be accounted to the right place.
>
> No other subsystem has such a requirement. Even the NUMA nodes are mostly
> suggestions and can be ignored by the allocators to use memory from other
> pages.

Of course it does. Memcg itself has such a requirement. The collective
set of processes needs to have the pages it uses accounted to it, and
never go over limit.

>> Otherwise, I don't think we can meet even the lighter of isolation guarantees.
>
> The approach works just fine with NUMA and cpusets. Isolation is mostly
> done on the per node boundaries and you already have per node statistics.

I don't know about cpusets in details, but at least with NUMA, this is
not an apple-to-apple comparison. a NUMA node is not meant to contain
you. A container is, and that is why it is called a container.

NUMA just means what is the *best* node to put my memory.
Now, if you actually say, through you syscalls "this is the node it
should live in", then you have a constraint, that to the best of my
knowledge is respected.

Now isolation here, is done in the container boundary. (cgroups, to be
generic).
Re: [PATCH v3 13/28] slub: create duplicate cache [message #46591 is a reply to message #46590] Tue, 29 May 2012 19:55 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Tue, 29 May 2012, Glauber Costa wrote:

> I don't know about cpusets in details, but at least with NUMA, this is not an
> apple-to-apple comparison. a NUMA node is not meant to contain you. A
> container is, and that is why it is called a container.

Cpusets contains sets of nodes. A cpusets "contains" nodes. These sets are
associated with applications.

> NUMA just means what is the *best* node to put my memory.
> Now, if you actually say, through you syscalls "this is the node it should
> live in", then you have a constraint, that to the best of my knowledge is
> respected.

Eith cpusets it means that memory needs to come from an assigned set of
nodes.

> Now isolation here, is done in the container boundary. (cgroups, to be
> generic).

Yes and with cpusets it is done at the cpuset boundary. Very much the
same.
Re: [PATCH v3 13/28] slub: create duplicate cache [message #46592 is a reply to message #46591] Tue, 29 May 2012 20:08 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/29/2012 11:55 PM, Christoph Lameter wrote:
>> NUMA just means what is the*best* node to put my memory.
>> > Now, if you actually say, through you syscalls "this is the node it should
>> > live in", then you have a constraint, that to the best of my knowledge is
>> > respected.
> Eith cpusets it means that memory needs to come from an assigned set of
> nodes.
>
>> > Now isolation here, is done in the container boundary. (cgroups, to be
>> > generic).
> Yes and with cpusets it is done at the cpuset boundary. Very much the
> same.

Well, I'd have to dive in the code a bit more, but that the impression
that the documentation gives me, by saying:

"Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a task's current cpuset."

is that you can't allocate from a node outside that set. Is this correct?

So extrapolating this to memcg, the situation is as follows:

* You can't use more memory than what you are assigned to.
* In order to do that, you need to account the memory you are using
* and to account the memory you are using, all objects in the page
must belong to you.

Please note the following:

Having two cgroups touching the same object is something. It tells
something about the relationship between them. This is shared memory.

Now having two cgroups putting objects in the same page, *does not mean
_anything_*. It just mean that one had the luck to allocate just after
the other.

With a predictable enough workload, this is a recipe for working around
the very protection we need to establish: one can DoS a physical box
full of containers, by always allocating in someone else's pages, and
pinning kernel memory down. Never releasing it, so the shrinkers are
useless.

So I still believe that if a page is allocated to a cgroup, all the
objects in there belong to it - unless of course the sharing actually
means something - and identifying this is just too complicated.
Re: [PATCH v3 13/28] slub: create duplicate cache [message #46593 is a reply to message #46592] Tue, 29 May 2012 20:21 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Wed, 30 May 2012, Glauber Costa wrote:

> Well, I'd have to dive in the code a bit more, but that the impression that
> the documentation gives me, by saying:
>
> "Cpusets constrain the CPU and Memory placement of tasks to only
> the resources within a task's current cpuset."
>
> is that you can't allocate from a node outside that set. Is this correct?

Basically yes but there are exceptions (like slab queues etc). Look at the
hardwall stuff too that allows more exceptions for kernel allocations to
use memory from other nodes.

> So extrapolating this to memcg, the situation is as follows:
>
> * You can't use more memory than what you are assigned to.
> * In order to do that, you need to account the memory you are using
> * and to account the memory you are using, all objects in the page
> must belong to you.

Cpusets work at the page boundary and they do not have the requirement you
are mentioning of all objects in the page having to belong to a certain
cpusets. Let that go and things become much easier.

> With a predictable enough workload, this is a recipe for working around the
> very protection we need to establish: one can DoS a physical box full of
> containers, by always allocating in someone else's pages, and pinning kernel
> memory down. Never releasing it, so the shrinkers are useless.

Sure you can construct hyperthetical cases like that. But then that is
true already of other container like logic in the kernel already.

> So I still believe that if a page is allocated to a cgroup, all the objects in
> there belong to it - unless of course the sharing actually means something -
> and identifying this is just too complicated.

We have never worked container like logic like that in the kernel due to
the complicated logic you would have to put in. The requirement that all
objects in a page come from the same container is not necessary. If you
drop this notion then things become very easy and the patches will become
simple.
Re: [PATCH v3 13/28] slub: create duplicate cache [message #46594 is a reply to message #46593] Tue, 29 May 2012 20:25 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/30/2012 12:21 AM, Christoph Lameter wrote:
> On Wed, 30 May 2012, Glauber Costa wrote:
>
>> Well, I'd have to dive in the code a bit more, but that the impression that
>> the documentation gives me, by saying:
>>
>> "Cpusets constrain the CPU and Memory placement of tasks to only
>> the resources within a task's current cpuset."
>>
>> is that you can't allocate from a node outside that set. Is this correct?
>
> Basically yes but there are exceptions (like slab queues etc). Look at the
> hardwall stuff too that allows more exceptions for kernel allocations to
> use memory from other nodes.
>
>> So extrapolating this to memcg, the situation is as follows:
>>
>> * You can't use more memory than what you are assigned to.
>> * In order to do that, you need to account the memory you are using
>> * and to account the memory you are using, all objects in the page
>> must belong to you.
>
> Cpusets work at the page boundary and they do not have the requirement you
> are mentioning of all objects in the page having to belong to a certain
> cpusets. Let that go and things become much easier.
>
>> With a predictable enough workload, this is a recipe for working around the
>> very protection we need to establish: one can DoS a physical box full of
>> containers, by always allocating in someone else's pages, and pinning kernel
>> memory down. Never releasing it, so the shrinkers are useless.
>
> Sure you can construct hyperthetical cases like that. But then that is
> true already of other container like logic in the kernel already.
>
>> So I still believe that if a page is allocated to a cgroup, all the objects in
>> there belong to it - unless of course the sharing actually means something -
>> and identifying this is just too complicated.
>
> We have never worked container like logic like that in the kernel due to
> the complicated logic you would have to put in. The requirement that all
> objects in a page come from the same container is not necessary. If you
> drop this notion then things become very easy and the patches will become
> simple.

I promise to look at that in more detail and get back to it. In the
meantime, I think it would be enlightening to hear from other parties as
well, specially the ones also directly interested in using the technology.
Re: [PATCH v3 13/28] slub: create duplicate cache [message #46595 is a reply to message #46593] Tue, 29 May 2012 20:57 Go to previous messageGo to next message
Suleiman Souhlal is currently offline  Suleiman Souhlal
Messages: 64
Registered: February 2012
Member
Hi Christoph,

On Tue, May 29, 2012 at 1:21 PM, Christoph Lameter <cl@linux.com> wrote:
> On Wed, 30 May 2012, Glauber Costa wrote:
>
>> Well, I'd have to dive in the code a bit more, but that the impression that
>> the documentation gives me, by saying:
>>
>> "Cpusets constrain the CPU and Memory placement of tasks to only
>> the resources within a task's current cpuset."
>>
>> is that you can't allocate from a node outside that set. Is this correct?
>
> Basically yes but there are exceptions (like slab queues etc). Look at the
> hardwall stuff too that allows more exceptions for kernel allocations to
> use memory from other nodes.
>
>> So extrapolating this to memcg, the situation is as follows:
>>
>> * You can't use more memory than what you are assigned to.
>> * In order to do that, you need to account the memory you are using
>> * and to account the memory you are using, all objects in the page
>>   must belong to you.
>
> Cpusets work at the page boundary and they do not have the requirement you
> are mentioning of all objects in the page having to belong to a certain
> cpusets. Let that go and things become much easier.
>
>> With a predictable enough workload, this is a recipe for working around the
>> very protection we need to establish: one can DoS a physical box full of
>> containers, by always allocating in someone else's pages, and pinning kernel
>> memory down. Never releasing it, so the shrinkers are useless.
>
> Sure you can construct hyperthetical cases like that. But then that is
> true already of other container like logic in the kernel already.
>
>> So I still believe that if a page is allocated to a cgroup, all the objects in
>> there belong to it  - unless of course the sharing actually means something -
>> and identifying this is just too complicated.
>
> We have never worked container like logic like that in the kernel due to
> the complicated logic you would have to put in. The requirement that all
> objects in a page come from the same container is not necessary. If you
> drop this notion then things become very easy and the patches will become
> simple.

Back when we (Google) started using cpusets for memory isolation (fake
NUMA), we found that there was a significant isolation breakage coming
from slab pages belonging to one cpuset being used by other cpusets,
which caused us problems: It was very easy for one job to cause slab
growth in another container, which would cause it to OOM, despite
being well-behaved.

Because of this, we had to add logic to prevent that from happening
(by making sure we only allocate objects from pages coming from our
allowed nodes).

Now that we're switching to doing containers with memcg, I think this
is a hard requirement, for us. :-(

-- Suleiman
Re: [PATCH v3 13/28] slub: create duplicate cache [message #46596 is a reply to message #46594] Wed, 30 May 2012 01:29 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
Hello, Christoph, Glauber.

On Wed, May 30, 2012 at 12:25:36AM +0400, Glauber Costa wrote:
> >We have never worked container like logic like that in the kernel due to
> >the complicated logic you would have to put in. The requirement that all
> >objects in a page come from the same container is not necessary. If you
> >drop this notion then things become very easy and the patches will become
> >simple.
>
> I promise to look at that in more detail and get back to it. In the
> meantime, I think it would be enlightening to hear from other
> parties as well, specially the ones also directly interested in
> using the technology.

I don't think I'm too interested in using the technology ;) and
haven't read the code (just glanced through the descriptions and
discussions), but, in general, I think the approach of duplicating
memory allocator per-memcg is a sane approach. It isn't the most
efficient one with the possibility of wasting considerable amount of
caching area per memcg but it is something which can mostly stay out
of the way if done right and that's how I want cgroup implementations
to be.

The two goals for cgroup controllers that I think are important are
proper (no, not crazy perfect but good enough) isolation and an
implementation which doesn't impact !cg path in an intrusive manner -
if someone who doesn't care about cgroup but knows and wants to work
on the subsystem should be able to mostly ignore cgroup support. If
that means overhead for cgroup users, so be it.

Without looking at the actual code, my rainbow-farting unicorn here
would be having a common slXb interface layer which handles
interfacing with memory allocator users and cgroup and let slXb
implement the same backend interface which doesn't care / know about
cgroup at all (other than using the correct allocation context, that
is). Glauber, would something like that be possible?

Thanks.

--
tejun
Re: Re: [PATCH v3 13/28] slub: create duplicate cache [message #46597 is a reply to message #46596] Wed, 30 May 2012 07:28 Go to previous messageGo to next message
James Bottomley is currently offline  James Bottomley
Messages: 17
Registered: May 2006
Junior Member
On Wed, 2012-05-30 at 10:29 +0900, Tejun Heo wrote:
> Hello, Christoph, Glauber.
>
> On Wed, May 30, 2012 at 12:25:36AM +0400, Glauber Costa wrote:
> > >We have never worked container like logic like that in the kernel due to
> > >the complicated logic you would have to put in. The requirement that all
> > >objects in a page come from the same container is not necessary. If you
> > >drop this notion then things become very easy and the patches will become
> > >simple.
> >
> > I promise to look at that in more detail and get back to it. In the
> > meantime, I think it would be enlightening to hear from other
> > parties as well, specially the ones also directly interested in
> > using the technology.
>
> I don't think I'm too interested in using the technology ;) and
> haven't read the code (just glanced through the descriptions and
> discussions), but, in general, I think the approach of duplicating
> memory allocator per-memcg is a sane approach. It isn't the most
> efficient one with the possibility of wasting considerable amount of
> caching area per memcg but it is something which can mostly stay out
> of the way if done right and that's how I want cgroup implementations
> to be.

Exactly: we at parallels initially disliked the cgroup multipled by slab
approach (Our beancounters do count objects) because we feared memory
wastage and density is very important to us (which tends to mean
efficient use of memory) however, when we ran through the calculations
in Prague, you can show that we have ~200 slabs and if each wastes half
a page, thats ~4MB memory lost per container. Since most virtual
environments are of the order nowadays of 0.5GB, we feel it's an
annoying but acceptable price to pay.

James

> The two goals for cgroup controllers that I think are important are
> proper (no, not crazy perfect but good enough) isolation and an
> implementation which doesn't impact !cg path in an intrusive manner -
> if someone who doesn't care about cgroup but knows and wants to work
> on the subsystem should be able to mostly ignore cgroup support. If
> that means overhead for cgroup users, so be it.
>
> Without looking at the actual code, my rainbow-farting unicorn here
> would be having a common slXb interface layer which handles
> interfacing with memory allocator users and cgroup and let slXb
> implement the same backend interface which doesn't care / know about
> cgroup at all (other than using the correct allocation context, that
> is). Glauber, would something like that be possible?
>
> Thanks.
>
Re: [PATCH v3 13/28] slub: create duplicate cache [message #46598 is a reply to message #46596] Wed, 30 May 2012 07:54 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/30/2012 05:29 AM, Tejun Heo wrote:
> The two goals for cgroup controllers that I think are important are
> proper (no, not crazy perfect but good enough) isolation and an
> implementation which doesn't impact !cg path in an intrusive manner -
> if someone who doesn't care about cgroup but knows and wants to work
> on the subsystem should be able to mostly ignore cgroup support. If
> that means overhead for cgroup users, so be it.

Well, my code in the slab is totally wrapped in static branches. They
only come active when the first group is *limited* (not even created:
you can have a thousand memcg, if none of them are kmem limited, nothing
will happen).

After that, the cost paid is to find out at which cgroup the process is
at. I believe that if we had a faster way for this (like for instance:
if we had a single hierarchy, the scheduler could put this in a percpu
variable after context switch - or any other method), then the cost of
it could be really low, even when this is enabled.

> Without looking at the actual code, my rainbow-farting unicorn here
> would be having a common slXb interface layer which handles
> interfacing with memory allocator users and cgroup and let slXb
> implement the same backend interface which doesn't care / know about
> cgroup at all (other than using the correct allocation context, that
> is). Glauber, would something like that be possible?

It is a matter of degree. There *is* a lot of stuff in common code, and
I tried to do it as much as I could. Christoph gave me a nice idea about
hinting to the page allocator that this page is a slab page before the
allocation, and then we could account from the page allocator directly -
without touching the cache code at all. This could be done, but some
stuff would still be there, basically because of differences in how the
allocator behaves. I think long term Christoph's effort to produce
common code among them will help a lot, if they stabilize their behavior
in certain areas.

I will rework this series to try work more towards this goal, but at
least for now I'll keep duplicating the caches. I still don't believe
that a loose accounting to the extent Christoph proposed will achieve
what we need this to achieve.
Re: [PATCH v3 13/28] slub: create duplicate cache [message #46599 is a reply to message #46598] Wed, 30 May 2012 08:02 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
Hello, Glauber.

On Wed, May 30, 2012 at 4:54 PM, Glauber Costa <glommer@parallels.com> wrote:
> On 05/30/2012 05:29 AM, Tejun Heo wrote:
>>
>> The two goals for cgroup controllers that I think are important are
>> proper (no, not crazy perfect but good enough) isolation and an
>> implementation which doesn't impact !cg path in an intrusive manner -
>> if someone who doesn't care about cgroup but knows and wants to work
>> on the subsystem should be able to mostly ignore cgroup support.  If
>> that means overhead for cgroup users, so be it.
>
>
> Well, my code in the slab is totally wrapped in static branches. They only
> come active when the first group is *limited* (not even created: you can
> have a thousand memcg, if none of them are kmem limited, nothing will
> happen).

Great, but I'm not sure why you're trying to emphasize that while my
point was about memory overhead and that it's OK to have some
overheads for cg users. :)

> After that, the cost paid is to find out at which cgroup the process is at.
> I believe that if we had a faster way for this (like for instance: if we had
> a single hierarchy, the scheduler could put this in a percpu variable after
> context switch - or any other method), then the cost of it could be really
> low, even when this is enabled.

Someday, hopefully.

> I will rework this series to try work more towards this goal, but at least
> for now I'll keep duplicating the caches. I still don't believe that a loose
> accounting to the extent Christoph proposed will achieve what we need this
> to achieve.

Yeah, I prefer your per-cg cache approach but do hope that it stays as
far from actual allocator code as possible. Christoph, would it be
acceptable if the cg logic is better separated?

Thanks.

--
tejun
Re: [PATCH v3 12/28] slab: pass memcg parameter to kmem_cache_create [message #46619 is a reply to message #46584] Wed, 30 May 2012 11:01 Go to previous messageGo to next message
Frederic Weisbecker is currently offline  Frederic Weisbecker
Messages: 25
Registered: April 2012
Junior Member
On Tue, May 29, 2012 at 11:52:55AM -0500, Christoph Lameter wrote:
> On Tue, 29 May 2012, Glauber Costa wrote:
>
> > > How do you detect that someone is touching it?
> >
> > kmem_alloc_cache will create mem_cgroup_get_kmem_cache.
> > (protected by static_branches, so won't happen if you don't have at least
> > non-root memcg using it)
> >
> > * Then it detects which memcg the calling process belongs to,
> > * if it is the root memcg, go back to the allocation as quickly as we
> > can
> > * otherwise, in the creation process, you will notice that each cache
> > has an index. memcg will store pointers to the copies and find them by
> > the index.
> >
> > From this point on, all the code of the caches is reused (except for
> > accounting the page)
>
> Well kmem_cache_alloc cache is the performance critical hotpath.
>
> If you are already there and doing all of that then would it not be better
> to simply count the objects allocated and freed per cgroup? Directly
> increment and decrement counters in a cgroup? You do not really need to
> duplicate the kmem_cache structure and do not need to modify allocators if
> you are willing to take that kind of a performance hit. Put a wrapper
> around kmem_cache_alloc/free and count things.

I believe one of the issues is also that a task can migrate to another cgroup
anytime. But an object that has been charged to a cgroup must be later uncharged
to the same, unless you move the charge as you move the task. But then it means
you need to keep track of the allocations per task, and you also need to be able
to do that reverse mapping (object -> allocating task) because your object can
be allocated by task A but later freed by task B. Then when you do the uncharge
it must happen to the cgroup of A, not the one of B.

That all would be much more complicated and performance sensitive than what this
patchset does. Dealing with duplicate caches for accounting seem to me a good tradeoff
between allocation performance hot path and maintaining cgroups semantics.
Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46629 is a reply to message #46529] Wed, 30 May 2012 12:17 Go to previous messageGo to next message
Frederic Weisbecker is currently offline  Frederic Weisbecker
Messages: 25
Registered: April 2012
Junior Member
On Fri, May 25, 2012 at 05:03:36PM +0400, Glauber Costa wrote:
> #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
> +static __always_inline struct kmem_cache *
> +mem_cgroup_get_kmem_cache(struct kmem_cache *cachep, gfp_t gfp)
> +{
> + if (!mem_cgroup_kmem_on)
> + return cachep;
> + if (!current->mm)
> + return cachep;
> + if (in_interrupt())
> + return cachep;

Does that mean interrupts are kept out of accounting?
Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46632 is a reply to message #46629] Wed, 30 May 2012 12:26 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/30/2012 04:17 PM, Frederic Weisbecker wrote:
> On Fri, May 25, 2012 at 05:03:36PM +0400, Glauber Costa wrote:
>> #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
>> +static __always_inline struct kmem_cache *
>> +mem_cgroup_get_kmem_cache(struct kmem_cache *cachep, gfp_t gfp)
>> +{
>> + if (!mem_cgroup_kmem_on)
>> + return cachep;
>> + if (!current->mm)
>> + return cachep;
>> + if (in_interrupt())
>> + return cachep;
>
> Does that mean interrupts are kept out of accounting?

Well, since interrupts have no process context, if you are in an
interrupt I can't think of any sane thing to do than relay it to the
root memcg. That's what happen when I return cachep: I return the
original parent cache, and we allocate from that.
Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46633 is a reply to message #46529] Wed, 30 May 2012 12:34 Go to previous messageGo to next message
Frederic Weisbecker is currently offline  Frederic Weisbecker
Messages: 25
Registered: April 2012
Junior Member
On Fri, May 25, 2012 at 05:03:36PM +0400, Glauber Costa wrote:
> +bool __mem_cgroup_new_kmem_page(struct page *page, gfp_t gfp)
> +{
> + struct mem_cgroup *memcg;
> + struct page_cgroup *pc;
> + bool ret = true;
> + size_t size;
> + struct task_struct *p;
> +
> + if (!current->mm || in_interrupt())
> + return true;
> +
> + rcu_read_lock();
> + p = rcu_dereference(current->mm->owner);
> + memcg = mem_cgroup_from_task(p);

So this takes the memcg of the group owner rather than the
task? I understand why we want this for user memory, but for
kernel?
Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46634 is a reply to message #46633] Wed, 30 May 2012 12:38 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/30/2012 04:34 PM, Frederic Weisbecker wrote:
> On Fri, May 25, 2012 at 05:03:36PM +0400, Glauber Costa wrote:
>> +bool __mem_cgroup_new_kmem_page(struct page *page, gfp_t gfp)
>> +{
>> + struct mem_cgroup *memcg;
>> + struct page_cgroup *pc;
>> + bool ret = true;
>> + size_t size;
>> + struct task_struct *p;
>> +
>> + if (!current->mm || in_interrupt())
>> + return true;
>> +
>> + rcu_read_lock();
>> + p = rcu_dereference(current->mm->owner);
>> + memcg = mem_cgroup_from_task(p);
>
> So this takes the memcg of the group owner rather than the
> task? I understand why we want this for user memory, but for
> kernel?

That was already discussed when this first came up in my last submission
If I recall correctly, Kame pointed out that this would be needed for
proper OOM-scoring and killing.

Now of course we won't oom kernel threads or anything like that. But
since this is also accounted towards memcg, it should at least be
consistent with each memcg it accounts to.

We can't account kmem for the thread's memcg, and mem to the process'.
Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46639 is a reply to message #46529] Wed, 30 May 2012 13:04 Go to previous messageGo to next message
Frederic Weisbecker is currently offline  Frederic Weisbecker
Messages: 25
Registered: April 2012
Junior Member
On Fri, May 25, 2012 at 05:03:36PM +0400, Glauber Costa wrote:
> +struct kmem_cache *__mem_cgroup_get_kmem_cache(struct kmem_cache *cachep,
> + gfp_t gfp)
> +{
> + struct mem_cgroup *memcg;
> + int idx;
> + struct task_struct *p;
> +
> + gfp |= cachep->allocflags;
> +
> + if (cachep->memcg_params.memcg)
> + return cachep;
> +
> + idx = cachep->memcg_params.id;
> + VM_BUG_ON(idx == -1);
> +
> + p = rcu_dereference(current->mm->owner);
> + memcg = mem_cgroup_from_task(p);
> +
> + if (!mem_cgroup_kmem_enabled(memcg))
> + return cachep;
> +
> + if (memcg->slabs[idx] == NULL) {
> + memcg_create_cache_enqueue(memcg, cachep);
> + return cachep;
> + }
> +
> + return memcg->slabs[idx];
> +}
> +EXPORT_SYMBOL(__mem_cgroup_get_kmem_cache);
> +
> +bool __mem_cgroup_new_kmem_page(struct page *page, gfp_t gfp)
> +{
> + struct mem_cgroup *memcg;
> + struct page_cgroup *pc;
> + bool ret = true;
> + size_t size;
> + struct task_struct *p;
> +
> + if (!current->mm || in_interrupt())
> + return true;
> +
> + rcu_read_lock();
> + p = rcu_dereference(current->mm->owner);
> + memcg = mem_cgroup_from_task(p);
> +
> + if (!mem_cgroup_kmem_enabled(memcg))
> + goto out;

Do you think it's possible that this memcg can be destroyed (like ss->destroy())
concurrently?

Probably not because there is a synchronize_rcu() in cgroup_diput() so as long
as we are in rcu_read_lock() we are fine.

OTOH current->mm->owner can exit() right after we fetched its memcg and thus the css_set
can be freed concurrently? And then the cgroup itself after we call rcu_read_unlock()
due to cgroup_diput().
And yet we are doing the mem_cgroup_get() below unconditionally assuming it's
always fine to get a reference to it.

May be I'm missing something?


> + mem_cgroup_get(memcg);
> +
> + size = (1 << compound_order(page)) << PAGE_SHIFT;
> +
> + ret = memcg_charge_kmem(memcg, gfp, size) == 0;
> + if (!ret) {
> + mem_cgroup_put(memcg);
> + goto out;
> + }
> +
> + pc = lookup_page_cgroup(page);
> + lock_page_cgroup(pc);
> + pc->mem_cgroup = memcg;
> + SetPageCgroupUsed(pc);
> + unlock_page_cgroup(pc);
> +
> +out:
> + rcu_read_unlock();
> + return ret;
> +}
> +EXPORT_SYMBOL(__mem_cgroup_new_kmem_page);
Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46640 is a reply to message #46639] Wed, 30 May 2012 13:06 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/30/2012 05:04 PM, Frederic Weisbecker wrote:
> Do you think it's possible that this memcg can be destroyed (like ss->destroy())
> concurrently?
>
> Probably not because there is a synchronize_rcu() in cgroup_diput() so as long
> as we are in rcu_read_lock() we are fine.
>
> OTOH current->mm->owner can exit() right after we fetched its memcg and thus the css_set
> can be freed concurrently? And then the cgroup itself after we call rcu_read_unlock()
> due to cgroup_diput().
> And yet we are doing the mem_cgroup_get() below unconditionally assuming it's
> always fine to get a reference to it.
>
> May be I'm missing something?
When a cache is created, we grab a reference to the memcg. So after the
cache is created, no.

When destroy is called, we flush the create queue, so if the cache is
not created yet, it will just disappear.

I think the only problem that might happen is in the following scenario:

* cache gets created, but ref count is not yet taken
* memcg disappears
* we try to inc refcount for a non-existent memcg, and crash.

This would be trivially solvable by grabing the reference earlier.
But even then, I need to audit this further to make sure it is really an
issue.
Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46641 is a reply to message #46634] Wed, 30 May 2012 13:11 Go to previous messageGo to next message
Frederic Weisbecker is currently offline  Frederic Weisbecker
Messages: 25
Registered: April 2012
Junior Member
On Wed, May 30, 2012 at 04:38:39PM +0400, Glauber Costa wrote:
> On 05/30/2012 04:34 PM, Frederic Weisbecker wrote:
> >On Fri, May 25, 2012 at 05:03:36PM +0400, Glauber Costa wrote:
> >>+bool __mem_cgroup_new_kmem_page(struct page *page, gfp_t gfp)
> >>+{
> >>+ struct mem_cgroup *memcg;
> >>+ struct page_cgroup *pc;
> >>+ bool ret = true;
> >>+ size_t size;
> >>+ struct task_struct *p;
> >>+
> >>+ if (!current->mm || in_interrupt())
> >>+ return true;
> >>+
> >>+ rcu_read_lock();
> >>+ p = rcu_dereference(current->mm->owner);
> >>+ memcg = mem_cgroup_from_task(p);
> >
> >So this takes the memcg of the group owner rather than the
> >task? I understand why we want this for user memory, but for
> >kernel?
>
> That was already discussed when this first came up in my last submission
> If I recall correctly, Kame pointed out that this would be needed
> for proper OOM-scoring and killing.

Can we have at least a comment in the code that explain the reasons of taking the
owner rather than the task? It's not going to be very obvious to future reviewers.

> Now of course we won't oom kernel threads or anything like that.

Seems we are not even accounting them anyway.

> But since this is also accounted towards memcg, it should at least be
> consistent with each memcg it accounts to.
>
> We can't account kmem for the thread's memcg, and mem to the process'.

Don't know. This goes a bit against cgroups semantics which group at the task
level and not process. But I personally don't mind much, as long as it's
documented.
Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46642 is a reply to message #46641] Wed, 30 May 2012 13:09 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/30/2012 05:11 PM, Frederic Weisbecker wrote:
> Can we have at least a comment in the code that explain the reasons of taking the
> owner rather than the task? It's not going to be very obvious to future reviewers.

Yes.
Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46645 is a reply to message #46640] Wed, 30 May 2012 13:37 Go to previous messageGo to next message
Frederic Weisbecker is currently offline  Frederic Weisbecker
Messages: 25
Registered: April 2012
Junior Member
On Wed, May 30, 2012 at 05:06:22PM +0400, Glauber Costa wrote:
> On 05/30/2012 05:04 PM, Frederic Weisbecker wrote:
> >Do you think it's possible that this memcg can be destroyed (like ss->destroy())
> >concurrently?
> >
> >Probably not because there is a synchronize_rcu() in cgroup_diput() so as long
> >as we are in rcu_read_lock() we are fine.
> >
> >OTOH current->mm->owner can exit() right after we fetched its memcg and thus the css_set
> >can be freed concurrently? And then the cgroup itself after we call rcu_read_unlock()
> >due to cgroup_diput().
> >And yet we are doing the mem_cgroup_get() below unconditionally assuming it's
> >always fine to get a reference to it.
> >
> >May be I'm missing something?
> When a cache is created, we grab a reference to the memcg. So after
> the cache is created, no.
>
> When destroy is called, we flush the create queue, so if the cache
> is not created yet, it will just disappear.
>
> I think the only problem that might happen is in the following scenario:
>
> * cache gets created, but ref count is not yet taken
> * memcg disappears
> * we try to inc refcount for a non-existent memcg, and crash.
>
> This would be trivially solvable by grabing the reference earlier.
> But even then, I need to audit this further to make sure it is
> really an issue.

Right. __mem_cgroup_get_kmem_cache() fetches the memcg of the owner
and calls memcg_create_cache_enqueue() which does css_tryget(&memcg->css).
After this tryget I think you're fine. And in-between you're safe against
css_set removal due to rcu_read_lock().

I'm less clear with __mem_cgroup_new_kmem_page() though...
Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46646 is a reply to message #46645] Wed, 30 May 2012 13:37 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/30/2012 05:37 PM, Frederic Weisbecker wrote:
> Right. __mem_cgroup_get_kmem_cache() fetches the memcg of the owner
> and calls memcg_create_cache_enqueue() which does css_tryget(&memcg->css).
> After this tryget I think you're fine. And in-between you're safe against
> css_set removal due to rcu_read_lock().
>
> I'm less clear with __mem_cgroup_new_kmem_page() though...

That one does not get memcg->css but it does call mem_cgroup_get(), that
does prevent against the memcg structure being freed, which I believe to
be good enough.
Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46647 is a reply to message #46646] Wed, 30 May 2012 13:53 Go to previous messageGo to next message
Frederic Weisbecker is currently offline  Frederic Weisbecker
Messages: 25
Registered: April 2012
Junior Member
On Wed, May 30, 2012 at 05:37:57PM +0400, Glauber Costa wrote:
> On 05/30/2012 05:37 PM, Frederic Weisbecker wrote:
> >Right. __mem_cgroup_get_kmem_cache() fetches the memcg of the owner
> >and calls memcg_create_cache_enqueue() which does css_tryget(&memcg->css).
> >After this tryget I think you're fine. And in-between you're safe against
> >css_set removal due to rcu_read_lock().
> >
> >I'm less clear with __mem_cgroup_new_kmem_page() though...
>
> That one does not get memcg->css but it does call mem_cgroup_get(),
> that does prevent against the memcg structure being freed, which I
> believe to be good enough.

What if the owner calls cgroup_exit() between mem_cgroup_from_task()
and mem_cgroup_get()? The css_set which contains the memcg gets freed.
Also the reference on the memcg doesn't even prevent the css_set to
be removed, does it?
Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46648 is a reply to message #46647] Wed, 30 May 2012 13:55 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/30/2012 05:53 PM, Frederic Weisbecker wrote:
> On Wed, May 30, 2012 at 05:37:57PM +0400, Glauber Costa wrote:
>> On 05/30/2012 05:37 PM, Frederic Weisbecker wrote:
>>> Right. __mem_cgroup_get_kmem_cache() fetches the memcg of the owner
>>> and calls memcg_create_cache_enqueue() which does css_tryget(&memcg->css).
>>> After this tryget I think you're fine. And in-between you're safe against
>>> css_set removal due to rcu_read_lock().
>>>
>>> I'm less clear with __mem_cgroup_new_kmem_page() though...
>>
>> That one does not get memcg->css but it does call mem_cgroup_get(),
>> that does prevent against the memcg structure being freed, which I
>> believe to be good enough.
>
> What if the owner calls cgroup_exit() between mem_cgroup_from_task()
> and mem_cgroup_get()? The css_set which contains the memcg gets freed.
> Also the reference on the memcg doesn't even prevent the css_set to
> be removed, does it?
It doesn't, but we don't really care. The css can go away, if the memcg
structure stays. The caches will outlive the memcg anyway, since it is
possible that you delete it, with some caches still holding objects that
are not freed (they will be marked as dead).
Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46649 is a reply to message #46648] Wed, 30 May 2012 15:33 Go to previous messageGo to next message
Frederic Weisbecker is currently offline  Frederic Weisbecker
Messages: 25
Registered: April 2012
Junior Member
On Wed, May 30, 2012 at 05:55:38PM +0400, Glauber Costa wrote:
> On 05/30/2012 05:53 PM, Frederic Weisbecker wrote:
> >On Wed, May 30, 2012 at 05:37:57PM +0400, Glauber Costa wrote:
> >>On 05/30/2012 05:37 PM, Frederic Weisbecker wrote:
> >>>Right. __mem_cgroup_get_kmem_cache() fetches the memcg of the owner
> >>>and calls memcg_create_cache_enqueue() which does css_tryget(&memcg->css).
> >>>After this tryget I think you're fine. And in-between you're safe against
> >>>css_set removal due to rcu_read_lock().
> >>>
> >>>I'm less clear with __mem_cgroup_new_kmem_page() though...
> >>
> >>That one does not get memcg->css but it does call mem_cgroup_get(),
> >>that does prevent against the memcg structure being freed, which I
> >>believe to be good enough.
> >
> >What if the owner calls cgroup_exit() between mem_cgroup_from_task()
> >and mem_cgroup_get()? The css_set which contains the memcg gets freed.
> >Also the reference on the memcg doesn't even prevent the css_set to
> >be removed, does it?
> It doesn't, but we don't really care. The css can go away, if the
> memcg structure stays.

Ah right, the memcg itself is only freed at destroy time.

> The caches will outlive the memcg anyway,
> since it is possible that you delete it, with some caches still
> holding objects that
> are not freed (they will be marked as dead).

I guess I need to look at how the destroy path is handled in your patchset
then. Or how you ensure that __mem_cgroup_new_kmem_page() can't race against
destroy.
Re: [PATCH v3 13/28] slub: create duplicate cache [message #46650 is a reply to message #46599] Wed, 30 May 2012 15:37 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Wed, 30 May 2012, Tejun Heo wrote:

> Yeah, I prefer your per-cg cache approach but do hope that it stays as
> far from actual allocator code as possible. Christoph, would it be
> acceptable if the cg logic is better separated?

Certainly anything that would allow this to be separated out would be
appreciated. I do not anticipate to ever run cgroup in my environment and
that is due to the additional latency created in key OS paths. Memory we
have enough. The increased cache footprint is killing performance.
Re: [PATCH v3 16/28] memcg: kmem controller charge/uncharge infrastructure [message #46653 is a reply to message #46649] Wed, 30 May 2012 16:16 Go to previous messageGo to previous message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 05/30/2012 07:33 PM, Frederic Weisbecker wrote:
> On Wed, May 30, 2012 at 05:55:38PM +0400, Glauber Costa wrote:
>> On 05/30/2012 05:53 PM, Frederic Weisbecker wrote:
>>> On Wed, May 30, 2012 at 05:37:57PM +0400, Glauber Costa wrote:
>>>> On 05/30/2012 05:37 PM, Frederic Weisbecker wrote:
>>>>> Right. __mem_cgroup_get_kmem_cache() fetches the memcg of the owner
>>>>> and calls memcg_create_cache_enqueue() which does css_tryget(&memcg->css).
>>>>> After this tryget I think you're fine. And in-between you're safe against
>>>>> css_set removal due to rcu_read_lock().
>>>>>
>>>>> I'm less clear with __mem_cgroup_new_kmem_page() though...
>>>>
>>>> That one does not get memcg->css but it does call mem_cgroup_get(),
>>>> that does prevent against the memcg structure being freed, which I
>>>> believe to be good enough.
>>>
>>> What if the owner calls cgroup_exit() between mem_cgroup_from_task()
>>> and mem_cgroup_get()? The css_set which contains the memcg gets freed.
>>> Also the reference on the memcg doesn't even prevent the css_set to
>>> be removed, does it?
>> It doesn't, but we don't really care. The css can go away, if the
>> memcg structure stays.
>
> Ah right, the memcg itself is only freed at destroy time.
>
>> The caches will outlive the memcg anyway,
>> since it is possible that you delete it, with some caches still
>> holding objects that
>> are not freed (they will be marked as dead).
>
> I guess I need to look at how the destroy path is handled in your patchset
> then. Or how you ensure that __mem_cgroup_new_kmem_page() can't race against
> destroy.
Appreciate that, thanks.
Previous Topic: [PATCH] NFS: hard-code init_net for NFS callback transports
Next Topic: [PATCH v4 0/4] per cgroup cpu statistics
Goto Forum:
  


Current Time: Sat Oct 25 20:30:31 GMT 2025

Total time taken to generate the page: 0.09325 seconds