| Home » Mailing lists » Devel » [PATCH v2 00/11] Request for Inclusion: kmem controller for memcg. Goto Forum:
	|  |  
	|  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47498 is a reply to message #47392] | Tue, 14 August 2012 16:21   |  
			| 
				
				
					|  Michal Hocko Messages: 109
 Registered: December 2011
 | Senior Member |  |  |  
	| On Thu 09-08-12 17:01:12, Glauber Costa wrote: > This patch adds the basic infrastructure for the accounting of the slab
 > caches. To control that, the following files are created:
 >
 >  * memory.kmem.usage_in_bytes
 >  * memory.kmem.limit_in_bytes
 >  * memory.kmem.failcnt
 >  * memory.kmem.max_usage_in_bytes
 >
 > They have the same meaning of their user memory counterparts. They
 > reflect the state of the "kmem" res_counter.
 >
 > The code is not enabled until a limit is set. This can be tested by the
 > flag "kmem_accounted". This means that after the patch is applied, no
 > behavioral changes exists for whoever is still using memcg to control
 > their memory usage.
 >
 > We always account to both user and kernel resource_counters. This
 > effectively means that an independent kernel limit is in place when the
 > limit is set to a lower value than the user memory. A equal or higher
 > value means that the user limit will always hit first, meaning that kmem
 > is effectively unlimited.
 
 Well, it contributes to the user limit so it is not unlimited. It just
 falls under a different limit and it tends to contribute less. This can
 be quite confusing.  I am still not sure whether we should mix the two
 things together. If somebody wants to limit the kernel memory he has to
 touch the other limit anyway.  Do you have a strong reason to mix the
 user and kernel counters?
 My impression was that kernel allocation should simply fail while user
 allocations might reclaim as well. Why should we reclaim just because of
 the kernel allocation (which is unreclaimable from hard limit reclaim
 point of view)?
 I also think that the whole thing would get much simpler if those two
 are split. Anyway if this is really a must then this should be
 documented here.
 
 One nit bellow.
 
 > People who want to track kernel memory but not limit it, can set this
 > limit to a very high number (like RESOURCE_MAX - 1page - that no one
 > will ever hit, or equal to the user memory)
 >
 > Signed-off-by: Glauber Costa <glommer@parallels.com>
 > CC: Michal Hocko <mhocko@suse.cz>
 > CC: Johannes Weiner <hannes@cmpxchg.org>
 > Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 > ---
 >  mm/memcontrol.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 >  1 file changed, 68 insertions(+), 1 deletion(-)
 >
 > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 > index b0e29f4..54e93de 100644
 > --- a/mm/memcontrol.c
 > +++ b/mm/memcontrol.c
 [...]
 > @@ -4046,8 +4059,23 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 >  			break;
 >  		if (type == _MEM)
 >  			ret = mem_cgroup_resize_limit(memcg, val);
 > -		else
 > +		else if (type == _MEMSWAP)
 >  			ret = mem_cgroup_resize_memsw_limit(memcg, val);
 > +		else if (type == _KMEM) {
 > +			ret = res_counter_set_limit(&memcg->kmem, val);
 > +			if (ret)
 > +				break;
 > +			/*
 > +			 * Once enabled, can't be disabled. We could in theory
 > +			 * disable it if we haven't yet created any caches, or
 > +			 * if we can shrink them all to death.
 > +			 *
 > +			 * But it is not worth the trouble
 > +			 */
 > +			if (!memcg->kmem_accounted && val != RESOURCE_MAX)
 > +				memcg->kmem_accounted = true;
 > +		} else
 > +			return -EINVAL;
 >  		break;
 
 This doesn't check for the hierachy so kmem_accounted might not be in
 sync with it's parents. mem_cgroup_create (below) needs to copy
 kmem_accounted down from the parent and the above needs to check if this
 is a similar dance like mem_cgroup_oom_control_write.
 
 [...]
 
 > @@ -5033,6 +5098,7 @@ mem_cgroup_create(struct cgroup *cont)
 >  	if (parent && parent->use_hierarchy) {
 >  		res_counter_init(&memcg->res, &parent->res);
 >  		res_counter_init(&memcg->memsw, &parent->memsw);
 > +		res_counter_init(&memcg->kmem, &parent->kmem);
 >  		/*
 >  		 * We increment refcnt of the parent to ensure that we can
 >  		 * safely access it on res_counter_charge/uncharge.
 > @@ -5043,6 +5109,7 @@ mem_cgroup_create(struct cgroup *cont)
 >  	} else {
 >  		res_counter_init(&memcg->res, NULL);
 >  		res_counter_init(&memcg->memsw, NULL);
 > +		res_counter_init(&memcg->kmem, NULL);
 >  	}
 >  	memcg->last_scanned_node = MAX_NUMNODES;
 >  	INIT_LIST_HEAD(&memcg->oom_notify);
 > --
 > 1.7.11.2
 >
 > --
 > To unsubscribe from this list: send the line "unsubscribe cgroups" in
 > the body of a message to majordomo@vger.kernel.org
 > More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 --
 Michal Hocko
 SUSE Labs
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 06/11] memcg: kmem controller infrastructure [message #47499 is a reply to message #47398] | Tue, 14 August 2012 17:25   |  
			| 
				
				
					|  Michal Hocko Messages: 109
 Registered: December 2011
 | Senior Member |  |  |  
	| On Thu 09-08-12 17:01:14, Glauber Costa wrote: > This patch introduces infrastructure for tracking kernel memory pages to
 > a given memcg. This will happen whenever the caller includes the flag
 > __GFP_KMEMCG flag, and the task belong to a memcg other than the root.
 >
 > In memcontrol.h those functions are wrapped in inline accessors.  The
 > idea is to later on, patch those with static branches, so we don't incur
 > any overhead when no mem cgroups with limited kmem are being used.
 >
 > [ v2: improved comments and standardized function names ]
 >
 > Signed-off-by: Glauber Costa <glommer@parallels.com>
 > CC: Christoph Lameter <cl@linux.com>
 > CC: Pekka Enberg <penberg@cs.helsinki.fi>
 > CC: Michal Hocko <mhocko@suse.cz>
 > CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 > CC: Johannes Weiner <hannes@cmpxchg.org>
 > ---
 >  include/linux/memcontrol.h |  79 +++++++++++++++++++
 >  mm/memcontrol.c            | 185 +++++++++++++++++++++++++++++++++++++++++++++
 >  2 files changed, 264 insertions(+)
 >
 > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
 > index 8d9489f..75b247e 100644
 > --- a/include/linux/memcontrol.h
 > +++ b/include/linux/memcontrol.h
 [...]
 > +/**
 > + * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
 > + * @gfp: the gfp allocation flags.
 > + * @handle: a pointer to the memcg this was charged against.
 > + * @order: allocation order.
 > + *
 > + * returns true if the memcg where the current task belongs can hold this
 > + * allocation.
 > + *
 > + * We return true automatically if this allocation is not to be accounted to
 > + * any memcg.
 > + */
 > +static __always_inline bool
 > +memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
 > +{
 > +	if (!memcg_kmem_on)
 > +		return true;
 > +	if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
 
 OK, I see the point behind __GFP_NOFAIL but it would deserve a comment
 or a mention in the changelog.
 
 [...]
 > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 > index 54e93de..e9824c1 100644
 > --- a/mm/memcontrol.c
 > +++ b/mm/memcontrol.c
 [...]
 > +EXPORT_SYMBOL(__memcg_kmem_new_page);
 
 Why is this exported?
 
 > +
 > +void __memcg_kmem_commit_page(struct page *page, void *handle, int order)
 > +{
 > +	struct page_cgroup *pc;
 > +	struct mem_cgroup *memcg = handle;
 > +
 > +	if (!memcg)
 > +		return;
 > +
 > +	WARN_ON(mem_cgroup_is_root(memcg));
 > +	/* The page allocation must have failed. Revert */
 > +	if (!page) {
 > +		size_t size = PAGE_SIZE << order;
 > +
 > +		memcg_uncharge_kmem(memcg, size);
 > +		mem_cgroup_put(memcg);
 > +		return;
 > +	}
 > +
 > +	pc = lookup_page_cgroup(page);
 > +	lock_page_cgroup(pc);
 > +	pc->mem_cgroup = memcg;
 > +	SetPageCgroupUsed(pc);
 
 Don't we need a write barrier before assigning memcg? Same as
 __mem_cgroup_commit_charge. This tests the Used bit always from within
 lock_page_cgroup so it should be safe but I am not 100% sure about the
 rest of the code.
 
 [...]
 > +EXPORT_SYMBOL(__memcg_kmem_free_page);
 
 Why is the symbol exported?
 
 >  #endif /* CONFIG_MEMCG_KMEM */
 >
 >  #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 > @@ -5759,3 +5878,69 @@ static int __init enable_swap_account(char *s)
 >  __setup("swapaccount=", enable_swap_account);
 >
 >  #endif
 > +
 > +#ifdef CONFIG_MEMCG_KMEM
 > +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
 > +{
 > +	struct res_counter *fail_res;
 > +	struct mem_cgroup *_memcg;
 > +	int ret;
 > +	bool may_oom;
 > +	bool nofail = false;
 > +
 > +	may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
 > +	    !(gfp & __GFP_NORETRY);
 
 This deserves a comment.
 
 > +
 > +	ret = 0;
 > +
 > +	if (!memcg)
 > +		return ret;
 > +
 > +	_memcg = memcg;
 > +	ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
 > +	    &_memcg, may_oom);
 
 This is really dangerous because atomic allocation which seem to be
 possible could result in deadlocks because of the reclaim. Also, as I
 have mentioned in the other email in this thread. Why should we reclaim
 just because of kernel allocation when we are not reclaiming any of it
 because shrink_slab is ignored in the memcg reclaim.
 
 > +
 > +	if (ret == -EINTR)  {
 > +		nofail = true;
 > +		/*
 > +		 * __mem_cgroup_try_charge() chosed to bypass to root due to
 > +		 * OOM kill or fatal signal.  Since our only options are to
 > +		 * either fail the allocation or charge it to this cgroup, do
 > +		 * it as a temporary condition. But we can't fail. From a
 > +		 * kmem/slab perspective, the cache has already been selected,
 > +		 * by mem_cgroup_get_kmem_cache(), so it is too late to change
 > +		 * our minds
 > +		 */
 > +		res_counter_charge_nofail(&memcg->res, delta, &fail_res);
 > +		if (do_swap_account)
 > +			res_counter_charge_nofail(&memcg->memsw, delta,
 > +						  &fail_res);
 
 Hmmm, this is kind of ugly but I guess unvoidable with the current
 implementation. Oh well...
 
 > +		ret = 0;
 > +	} else if (ret == -ENOMEM)
 > +		return ret;
 > +
 > +	if (nofail)
 > +		res_counter_charge_nofail(&memcg->kmem, delta, &fail_res);
 > +	else
 > +		ret = res_counter_charge(&memcg->kmem, delta, &fail_res);
 > +
 > +	if (ret) {
 > +		res_counter_uncharge(&memcg->res, delta);
 > +		if (do_swap_account)
 > +			res_counter_uncharge(&memcg->memsw, delta);
 > +	}
 > +
 > +	return ret;
 > +}
 > +
 [...]
 
 --
 Michal Hocko
 SUSE Labs
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 06/11] memcg: kmem controller infrastructure [message #47500 is a reply to message #47471] | Tue, 14 August 2012 18:58   |  
			| 
				
				
					|  Greg Thelen Messages: 18
 Registered: February 2011
 | Junior Member |  |  |  
	| On Mon, Aug 13 2012, Glauber Costa wrote: 
 >>> > +	WARN_ON(mem_cgroup_is_root(memcg));
 >>> > +	size = (1 << order) << PAGE_SHIFT;
 >>> > +	memcg_uncharge_kmem(memcg, size);
 >>> > +	mem_cgroup_put(memcg);
 >> Why do we need ref-counting here ? kmem res_counter cannot work as
 >> reference ?
 > This is of course the pair of the mem_cgroup_get() you commented on
 > earlier. If we need one, we need the other. If we don't need one, we
 > don't need the other =)
 >
 > The guarantee we're trying to give here is that the memcg structure will
 > stay around while there are dangling charges to kmem, that we decided
 > not to move (remember: moving it for the stack is simple, for the slab
 > is very complicated and ill-defined, and I believe it is better to treat
 > all kmem equally here)
 
 By keeping memcg structures hanging around until the last referring kmem
 page is uncharged do such zombie memcg each consume a css_id and thus
 put pressure on the 64k css_id space?  I imagine in pathological cases
 this would prevent creation of new cgroups until these zombies are
 dereferenced.
 
 Is there any way to see how much kmem such zombie memcg are consuming?
 I think we could find these with
 for_each_mem_cgroup_tree(root_mem_cgroup).  Basically, I'm wanting to
 know where kernel memory has been allocated.  For live memcg, an admin
 can cat memory.kmem.usage_in_bytes.  But for zombie memcg, I'm not sure
 how to get this info.  It looks like the root_mem_cgroup
 memory.kmem.usage_in_bytes is not hierarchically charged.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg [message #47501 is a reply to message #47394] | Wed, 15 August 2012 09:08   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/14/2012 07:16 PM, Mel Gorman wrote: > On Thu, Aug 09, 2012 at 05:01:15PM +0400, Glauber Costa wrote:
 >> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
 >> page allocator will call the corresponding memcg functions to validate
 >> the allocation. Tasks in the root memcg can always proceed.
 >>
 >> To avoid adding markers to the page - and a kmem flag that would
 >> necessarily follow, as much as doing page_cgroup lookups for no reason,
 >
 > As you already guessed, doing a page_cgroup in the page allocator free
 > path would be a no-go.
 
 Specifically yes, but in general, you will be able to observe that I am
 taking all the possible measures to make sure existing paths are
 disturbed as little as possible.
 
 Thanks for your review here
 
 >>
 >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
 >> index b956cec..da341dc 100644
 >> --- a/mm/page_alloc.c
 >> +++ b/mm/page_alloc.c
 >> @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 >>  	struct page *page = NULL;
 >>  	int migratetype = allocflags_to_migratetype(gfp_mask);
 >>  	unsigned int cpuset_mems_cookie;
 >> +	void *handle = NULL;
 >>
 >>  	gfp_mask &= gfp_allowed_mask;
 >>
 >> @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 >>  		return NULL;
 >>
 >>  	/*
 >> +	 * Will only have any effect when __GFP_KMEMCG is set.
 >> +	 * This is verified in the (always inline) callee
 >> +	 */
 >> +	if (!memcg_kmem_new_page(gfp_mask, &handle, order))
 >
 > memcg_kmem_new_page takes a void * parameter already but here you are
 > passing in a void **. This probably happens to work because you do this
 >
 > struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
 >
 > but that appears to defeat the purpose of having an opaque type as a
 > "handle". You have to treat it different then passing it into the commit
 > function because it expects a void *. The motivation for an opaque type
 > is completely unclear to me and how it is managed with a mix of void *
 > and void ** is very confusing.
 
 
 okay.
 
 The opaque exists because I am doing speculative charging. I believe it
 to be a better and less complicated approach then letting a page appear
 and then charging it. Besides being consistent with the rest of memcg,
 it won't create unnecessary disturbance in the page allocator
 when the allocation is to fail.
 
 Now, tasks can move between memcgs, so we can't rely on grabbing it from
 current in commit_page, so we pass it around as a handle. Also, even if
 the task could not move, we already got it once from the task, and that
 is not for free. Better save it.
 
 Aside from the handle needed, the cost is more or less the same compared
 to doing it in one pass. All we do by using speculative charging is to
 split the cost in two, and doing it from two places.
 We'd have to charge + update page_cgroup anyway.
 
 As for the type, do you think using struct mem_cgroup would be less
 confusing?
 
 > On a similar note I spotted #define memcg_kmem_on 1 . That is also
 > different just for the sake of it. The convension is to do something
 > like this
 >
 > /* This helps us to avoid #ifdef CONFIG_NUMA */
 > #ifdef CONFIG_NUMA
 > #define NUMA_BUILD 1
 > #else
 > #define NUMA_BUILD 0
 > #endif
 
 
 For simple defines, yes. But a later patch will turn this into a static
 branch test. memcg_kmem_on will be always 0 when compile-disabled, but
 when enable will expand to static_branch(&...).
 
 
 > memcg_kmem_on was difficult to guess based on its name. I thought initially
 > that it would only be active if a memcg existed or at least something like
 > mem_cgroup_disabled() but it's actually enabled if CONFIG_MEMCG_KMEM is set.
 
 For now. And I thought that adding the static branch in this patch would
 only confuse matters. The placeholder is there, but it is later patched
 to the final thing.
 
 With that explained, if you want me to change it to something else, I
 can do it. Should I ?
 
 > I also find it *very* strange to have a function named as if it is an
 > allocation-style function when it in fact it's looking up a mem_cgroup
 > and charging it (and uncharging it in the error path if necessary). If
 > it was called memcg_kmem_newpage_charge I might have found it a little
 > better.
 
 I don't feel strongly about names in general. I can change it.
 Will update to memcg_kmem_newpage_charge() and memcg_kmem_page_uncharge().
 
 > This whole operation also looks very expensive (cgroup lookups, RCU locks
 > taken etc) but I guess you're willing to take that cost in the same of
 > isolating containers from each other. However, I strongly suggest that
 > this overhead is measured in advance. It should not stop the series being
 > merged as such but it should be understood because if the cost is high
 > then this feature will be avoided like the plague. I am skeptical that
 > distributions would enable this by default, at least not without support
 > for cgroup_disable=kmem
 
 Enabling this feature will bring you nothing, therefore, no (or little)
 overhead. Nothing of this will be patched in until the first memcg gets
 kmem limited. The mere fact of moving tasks to memcgs won't trigger any
 of this.
 
 I haven't measured this series in particular, but I did measure the slab
 series (which builds ontop of this). I found the per-allocation cost to
 be in the order of 2-3 % for tasks living in limited memcgs, and
 hard to observe when living in the root memcg (compared of course to the
 case of a task running on root memcg without those patches)
 
 I also believe the folks from google also measured this. They may be
 able to spit out numbers grabbed from a system bigger than mine =p
 
 > As this thing is called from within the allocator, it's not clear why
 > __memcg_kmem_new_page is exported. I can't imagine why a module would call
 > it directly although maybe you cover that somewhere else in the series.
 
 Okay, more people commented on this, so let me clarify: They shouldn't
 be. They were initially exported when this was about the slab only,
 because they could be called from inlined functions from the allocators.
 Now that the charge/uncharge was moved to the page allocator - which
 already allowed me the big benefit of separating this in two pieces,
 none of this needs to be exported.
 
 Sorry for not noticing this myself, but thanks for the eyes =)
 
 > From the point of view of a hook, that is acceptable but just barely. I have
 > slammed other hooks because it was possible for a subsystem to override them
 > meaning the runtime cost could be anything. I did not spot a similar issue
 > here but if I missed it, it's still unacceptable. At least here the cost
 > is sortof predictable and only affects memcg because of the __GFP_KMEMCG
 > check in memcg_kmem_new_page.
 
 Yes, that is the idea. And I don't think anyone should override those,
 so I don't see them as hooks in this sense.
 
 >> +		return NULL;
 >> +
 >> +	/*
 >>  	 * Check the zones suitable for the gfp_mask contain at least one
 >>  	 * valid zone. It's possible to have an empty zonelist as a result
 >>  	 * of GFP_THISNODE and a memoryless node
 >> @@ -2583,6 +2591,8 @@ out:
 >>  	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 >>  		goto retry_cpuset;
 >>
 >> +	memcg_kmem_commit_page(page, handle, order);
 >> +
 >
 > As a side note, I'm not keen on how you shortcut these functions. They
 > are all function calls because memcg_kmem_commit_page() will always call
 > __memcg_kmem_commit_page() to check the handle once it's compiled in.
 > The handle==NULL check should have happened in the inline function to save
 > a few cycles.
 >
 
 It is already happening on my updated series after a comment from Kame
 pointed this out.
 
 > This also has the feel that the call of memcg_kmem_commit_page belongs in
 > prep_new_page() but I recognise that requires passing the opaque handler
 > around which would be very ugly.
 
 Indeed, and that is the reason why I kept everything local.
 
 
 >>  	return page;
 
 >
 > memcg_kmem_new_page makes the following check
 >
 > +       if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
 > +               return true;
 >
 > so if the allocation had __GFP_NOFAIL, it does not get charged but can
 > still be freed. I didn't check if this is really the case but it looks
 > very suspicious.
 
 No, it can't be freed (uncharged), because in that case, we won't fill
 in the memcg information in page cgroup.
 
 >
 > Again, this is a fairly heavy operation.
 
 
 Mel, once I address all the issues you pointed out here, do you think
 this would be in an acceptable state for merging? Do you still have any
 fundamental opposition to this?
 
 thanks again
...
 
 
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 06/11] memcg: kmem controller infrastructure [message #47502 is a reply to message #47500] | Wed, 15 August 2012 09:18   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/14/2012 10:58 PM, Greg Thelen wrote: > On Mon, Aug 13 2012, Glauber Costa wrote:
 >
 >>>>> +	WARN_ON(mem_cgroup_is_root(memcg));
 >>>>> +	size = (1 << order) << PAGE_SHIFT;
 >>>>> +	memcg_uncharge_kmem(memcg, size);
 >>>>> +	mem_cgroup_put(memcg);
 >>> Why do we need ref-counting here ? kmem res_counter cannot work as
 >>> reference ?
 >> This is of course the pair of the mem_cgroup_get() you commented on
 >> earlier. If we need one, we need the other. If we don't need one, we
 >> don't need the other =)
 >>
 >> The guarantee we're trying to give here is that the memcg structure will
 >> stay around while there are dangling charges to kmem, that we decided
 >> not to move (remember: moving it for the stack is simple, for the slab
 >> is very complicated and ill-defined, and I believe it is better to treat
 >> all kmem equally here)
 >
 > By keeping memcg structures hanging around until the last referring kmem
 > page is uncharged do such zombie memcg each consume a css_id and thus
 > put pressure on the 64k css_id space?  I imagine in pathological cases
 > this would prevent creation of new cgroups until these zombies are
 > dereferenced.
 
 Yes, but although this patch makes it more likely, it doesn't introduce
 that. If the tasks, for instance, grab a reference to the cgroup dentry
 in the filesystem (like their CWD, etc), they will also keep the cgroup
 around.
 
 
 > Is there any way to see how much kmem such zombie memcg are consuming?
 > I think we could find these with
 > for_each_mem_cgroup_tree(root_mem_cgroup).
 
 Yes, just need an interface for that. But I think it is something that
 can be addressed orthogonaly to this work, in a separate patch, not as
 some fundamental limitation.
 
 >  Basically, I'm wanting to
 > know where kernel memory has been allocated.  For live memcg, an admin
 > can cat memory.kmem.usage_in_bytes.  But for zombie memcg, I'm not sure
 > how to get this info.  It looks like the root_mem_cgroup
 > memory.kmem.usage_in_bytes is not hierarchically charged.
 >
 
 Not sure what you mean by not being hierarchically charged. It should
 be, when use_hierarchy = 1. As a matter of fact, I just tested it, and I
 do see kmem being charged all the way to the root cgroup when hierarchy
 is used. (we just can't limit it there)
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg [message #47503 is a reply to message #47394] | Wed, 15 August 2012 09:24   |  
			| 
				
				
					|  Michal Hocko Messages: 109
 Registered: December 2011
 | Senior Member |  |  |  
	| On Thu 09-08-12 17:01:15, Glauber Costa wrote: [...]
 > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
 > index b956cec..da341dc 100644
 > --- a/mm/page_alloc.c
 > +++ b/mm/page_alloc.c
 > @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 >  	struct page *page = NULL;
 >  	int migratetype = allocflags_to_migratetype(gfp_mask);
 >  	unsigned int cpuset_mems_cookie;
 > +	void *handle = NULL;
 >
 >  	gfp_mask &= gfp_allowed_mask;
 >
 > @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 >  		return NULL;
 >
 >  	/*
 > +	 * Will only have any effect when __GFP_KMEMCG is set.
 > +	 * This is verified in the (always inline) callee
 > +	 */
 > +	if (!memcg_kmem_new_page(gfp_mask, &handle, order))
 > +		return NULL;
 
 When the previous patch introduced this function I thought the handle
 obfuscantion is to prevent from spreading struct mem_cgroup inside the
 page allocator but memcg_kmem_commit_page uses the type directly. So why
 that obfuscation? Even handle as a name sounds unnecessarily confusing.
 I would go with struct mem_cgroup **memcgp or even return the pointer on
 success or NULL otherwise.
 
 [...]
 > +EXPORT_SYMBOL(__free_accounted_pages);
 
 Why exported?
 
 Btw. this is called from call_rcu context but it itself calls call_rcu
 down the chain in mem_cgroup_put. Is it safe?
 
 [...]
 > +EXPORT_SYMBOL(free_accounted_pages);
 
 here again
 --
 Michal Hocko
 SUSE Labs
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg [message #47504 is a reply to message #47468] | Mon, 13 August 2012 08:57   |  
			| 
				
				
					|  Mel Gorman Messages: 8
 Registered: August 2012
 | Junior Member |  |  |  
	| On Mon, Aug 13, 2012 at 12:03:38PM +0400, Glauber Costa wrote: > On 08/10/2012 09:33 PM, Kamezawa Hiroyuki wrote:
 > > (2012/08/09 22:01), Glauber Costa wrote:
 > >> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
 > >> page allocator will call the corresponding memcg functions to validate
 > >> the allocation. Tasks in the root memcg can always proceed.
 > >>
 > >> To avoid adding markers to the page - and a kmem flag that would
 > >> necessarily follow, as much as doing page_cgroup lookups for no reason,
 > >> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
 > >> for telling the page allocator that this is such an allocation at
 > >> free_pages() time. This is done by the invocation of
 > >> __free_accounted_pages() and free_accounted_pages().
 > >>
 > >> Signed-off-by: Glauber Costa <glommer@parallels.com>
 > >> CC: Christoph Lameter <cl@linux.com>
 > >> CC: Pekka Enberg <penberg@cs.helsinki.fi>
 > >> CC: Michal Hocko <mhocko@suse.cz>
 > >> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 > >> CC: Johannes Weiner <hannes@cmpxchg.org>
 > >> CC: Suleiman Souhlal <suleiman@google.com>
 > >
 > > Ah, ok. free_accounted_page() seems good.
 > >
 > > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 > >
 > > I myself is okay with this. But...
 > >
 > > Because you add a new hook to alloc_pages(), please get Ack from Mel
 > > before requesting merge.
 > >
 > > Thanks,
 > > -Kame
 >
 > Absolutely.
 >
 > Mel, would you mind taking a look at this series and commenting on this?
 >
 
 It'll take me a few days but I'll get around to it.
 
 --
 Mel Gorman
 SUSE Labs
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg [message #47506 is a reply to message #47394] | Tue, 14 August 2012 15:16   |  
			| 
				
				
					|  Mel Gorman Messages: 8
 Registered: August 2012
 | Junior Member |  |  |  
	| On Thu, Aug 09, 2012 at 05:01:15PM +0400, Glauber Costa wrote: > When a process tries to allocate a page with the __GFP_KMEMCG flag, the
 > page allocator will call the corresponding memcg functions to validate
 > the allocation. Tasks in the root memcg can always proceed.
 >
 > To avoid adding markers to the page - and a kmem flag that would
 > necessarily follow, as much as doing page_cgroup lookups for no reason,
 
 As you already guessed, doing a page_cgroup in the page allocator free
 path would be a no-go.
 
 This is my first time glancing at the series and I'm only paying close
 attention to this patch so pardon me if my observations have been made
 already.
 
 > whoever is marking its allocations with __GFP_KMEMCG flag is responsible
 > for telling the page allocator that this is such an allocation at
 > free_pages() time. This is done by the invocation of
 > __free_accounted_pages() and free_accounted_pages().
 >
 > Signed-off-by: Glauber Costa <glommer@parallels.com>
 > CC: Christoph Lameter <cl@linux.com>
 > CC: Pekka Enberg <penberg@cs.helsinki.fi>
 > CC: Michal Hocko <mhocko@suse.cz>
 > CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 > CC: Johannes Weiner <hannes@cmpxchg.org>
 > CC: Suleiman Souhlal <suleiman@google.com>
 > ---
 >  include/linux/gfp.h |  3 +++
 >  mm/page_alloc.c     | 38 ++++++++++++++++++++++++++++++++++++++
 >  2 files changed, 41 insertions(+)
 >
 > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
 > index d8eae4d..029570f 100644
 > --- a/include/linux/gfp.h
 > +++ b/include/linux/gfp.h
 > @@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int order);
 >  extern void free_hot_cold_page(struct page *page, int cold);
 >  extern void free_hot_cold_page_list(struct list_head *list, int cold);
 >
 > +extern void __free_accounted_pages(struct page *page, unsigned int order);
 > +extern void free_accounted_pages(unsigned long addr, unsigned int order);
 > +
 >  #define __free_page(page) __free_pages((page), 0)
 >  #define free_page(addr) free_pages((addr), 0)
 >
 > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
 > index b956cec..da341dc 100644
 > --- a/mm/page_alloc.c
 > +++ b/mm/page_alloc.c
 > @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 >  	struct page *page = NULL;
 >  	int migratetype = allocflags_to_migratetype(gfp_mask);
 >  	unsigned int cpuset_mems_cookie;
 > +	void *handle = NULL;
 >
 >  	gfp_mask &= gfp_allowed_mask;
 >
 > @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 >  		return NULL;
 >
 >  	/*
 > +	 * Will only have any effect when __GFP_KMEMCG is set.
 > +	 * This is verified in the (always inline) callee
 > +	 */
 > +	if (!memcg_kmem_new_page(gfp_mask, &handle, order))
 
 memcg_kmem_new_page takes a void * parameter already but here you are
 passing in a void **. This probably happens to work because you do this
 
 struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
 
 but that appears to defeat the purpose of having an opaque type as a
 "handle". You have to treat it different then passing it into the commit
 function because it expects a void *. The motivation for an opaque type
 is completely unclear to me and how it is managed with a mix of void *
 and void ** is very confusing.
 
 On a similar note I spotted #define memcg_kmem_on 1 . That is also
 different just for the sake of it. The convension is to do something
 like this
 
 /* This helps us to avoid #ifdef CONFIG_NUMA */
 #ifdef CONFIG_NUMA
 #define NUMA_BUILD 1
 #else
 #define NUMA_BUILD 0
 #endif
 
 memcg_kmem_on was difficult to guess based on its name. I thought initially
 that it would only be active if a memcg existed or at least something like
 mem_cgroup_disabled() but it's actually enabled if CONFIG_MEMCG_KMEM is set.
 
 I also find it *very* strange to have a function named as if it is an
 allocation-style function when it in fact it's looking up a mem_cgroup
 and charging it (and uncharging it in the error path if necessary). If
 it was called memcg_kmem_newpage_charge I might have found it a little
 better.  While I believe you have to take care to avoid confusion with
 mem_cgroup_newpage_charge, it would be preferable if the APIs were similar.
 memcg is hard enough as it is to understand without having different APIs.
 
 This whole operation also looks very expensive (cgroup lookups, RCU locks
 taken etc) but I guess you're willing to take that cost in the same of
 isolating containers from each other. However, I strongly suggest that
 this overhead is measured in advance. It should not stop the series being
 merged as such but it should be understood because if the cost is high
 then this feature will be avoided like the plague. I am skeptical that
 distributions would enable this by default, at least not without support
 for cgroup_disable=kmem
 
 As this thing is called from within the allocator, it's not clear why
 __memcg_kmem_new_page is exported. I can't imagine why a module would call
 it directly although maybe you cover that somewhere else in the series.
 
 >From the point of view of a hook, that is acceptable but just barely. I have
 slammed other hooks because it was possible for a subsystem to override them
 meaning the runtime cost could be anything. I did not spot a similar issue
 here but if I missed it, it's still unacceptable. At least here the cost
 is sortof predictable and only affects memcg because of the __GFP_KMEMCG
 check in memcg_kmem_new_page.
 
 > +		return NULL;
 > +
 > +	/*
 >  	 * Check the zones suitable for the gfp_mask contain at least one
 >  	 * valid zone. It's possible to have an empty zonelist as a result
 >  	 * of GFP_THISNODE and a memoryless node
 > @@ -2583,6 +2591,8 @@ out:
 >  	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 >  		goto retry_cpuset;
 >
 > +	memcg_kmem_commit_page(page, handle, order);
 > +
 
 As a side note, I'm not keen on how you shortcut these functions. They
 are all function calls because memcg_kmem_commit_page() will always call
 __memcg_kmem_commit_page() to check the handle once it's compiled in.
 The handle==NULL check should have happened in the inline function to save
 a few cycles.
 
 This also has the feel that the call of memcg_kmem_commit_page belongs in
 prep_new_page() but I recognise that requires passing the opaque handler
 around which would be very ugly.
 
 >  	return page;
 >  }
 >  EXPORT_SYMBOL(__alloc_pages_nodemask);
 > @@ -2635,6 +2645,34 @@ void free_pages(unsigned long addr, unsigned int order)
 >
 >  EXPORT_SYMBOL(free_pages);
 >
 > +/*
 > + * __free_accounted_pages and free_accounted_pages will free pages allocated
 > + * with __GFP_KMEMCG.
 > + *
 > + * Those pages are accounted to a particular memcg, embedded in the
 > + * corresponding page_cgroup. To avoid adding a hit in the allocator to search
 > + * for that information only to find out that it is NULL for users who have no
 > + * interest in that whatsoever, we provide these functions.
 > + *
 > + * The caller knows better which flags it relies on.
 > + */
 > +void __free_accounted_pages(struct page *page, unsigned int order)
 > +{
 > +	memcg_kmem_free_page(page, order);
 > +	__free_pages(page, order);
 > +}
 > +EXPORT_SYMBOL(__free_accounted_pages);
 
 memcg_kmem_new_page makes the following check
 
 +       if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
 +               return true;
 
 so if the allocation had __GFP_NOFAIL, it does not get charged but can
 still be freed. I didn't check if this is really the case but it looks
 very suspicious.
 
 Again, this is a fairly heavy operation.
 
 > +
 > +void free_accounted_pages(unsigned long addr, unsigned int order)
 > +{
 > +	if (addr != 0) {
 > +		VM_BUG_ON(!virt_addr_valid((void *)addr));
 > +		memcg_kmem_free_page(virt_to_page((void *)addr), order);
 > +		__free_pages(virt_to_page((void *)addr), order);
 > +	}
 > +}
 > +EXPORT_SYMBOL(free_accounted_pages);
 > +
 >  static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
 >  {
 >  	if (addr) {
 
 --
 Mel Gorman
 SUSE Labs
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47507 is a reply to message #47498] | Wed, 15 August 2012 09:33   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| >> We always account to both user and kernel resource_counters. This >> effectively means that an independent kernel limit is in place when the
 >> limit is set to a lower value than the user memory. A equal or higher
 >> value means that the user limit will always hit first, meaning that kmem
 >> is effectively unlimited.
 >
 > Well, it contributes to the user limit so it is not unlimited. It just
 > falls under a different limit and it tends to contribute less.
 
 You are right, but this is just wording. I will update it, but what I
 really mean here is that an independent limit is no imposed on kmem.
 
 > This can
 > be quite confusing.  I am still not sure whether we should mix the two
 > things together. If somebody wants to limit the kernel memory he has to
 > touch the other limit anyway.  Do you have a strong reason to mix the
 > user and kernel counters?
 
 This is funny, because the first opposition I found to this work was
 "Why would anyone want to limit it separately?" =p
 
 It seems that a quite common use case is to have a container with a
 unified view of "memory" that it can use the way he likes, be it with
 kernel memory, or user memory. I believe those people would be happy to
 just silently account kernel memory to user memory, or at the most have
 a switch to enable it.
 
 What gets clear from this back and forth, is that there are people
 interested in both use cases.
 
 > My impression was that kernel allocation should simply fail while user
 > allocations might reclaim as well. Why should we reclaim just because of
 > the kernel allocation (which is unreclaimable from hard limit reclaim
 > point of view)?
 
 That is not what the kernel does, in general. We assume that if he wants
 that memory and we can serve it, we should. Also, not all kernel memory
 is unreclaimable. We can shrink the slabs, for instance. Ying Han
 claims she has patches for that already...
 
 > I also think that the whole thing would get much simpler if those two
 > are split. Anyway if this is really a must then this should be
 > documented here.
 
 Well, documentation can't hurt.
 
 >
 > This doesn't check for the hierachy so kmem_accounted might not be in
 > sync with it's parents. mem_cgroup_create (below) needs to copy
 > kmem_accounted down from the parent and the above needs to check if this
 > is a similar dance like mem_cgroup_oom_control_write.
 >
 
 I don't see why we have to.
 
 I believe in a A/B/C hierarchy, C should be perfectly able to set a
 different limit than its parents. Note that this is not a boolean.
 
 Also, right now, C can become completely unlimited (by not setting a
 limited) and this is, indeed, not the desired behavior.
 
 A later patch will change kmem_accounted to a bitfield, and we'll use
 one of the bits to signal that we should account kmem because our parent
 is limited.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 06/11] memcg: kmem controller infrastructure [message #47508 is a reply to message #47499] | Wed, 15 August 2012 09:42   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| >> + * memcg_kmem_new_page: verify if a new kmem allocation is allowed. >> + * @gfp: the gfp allocation flags.
 >> + * @handle: a pointer to the memcg this was charged against.
 >> + * @order: allocation order.
 >> + *
 >> + * returns true if the memcg where the current task belongs can hold this
 >> + * allocation.
 >> + *
 >> + * We return true automatically if this allocation is not to be accounted to
 >> + * any memcg.
 >> + */
 >> +static __always_inline bool
 >> +memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
 >> +{
 >> +	if (!memcg_kmem_on)
 >> +		return true;
 >> +	if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
 >
 > OK, I see the point behind __GFP_NOFAIL but it would deserve a comment
 > or a mention in the changelog.
 
 documentation can't hurt!
 
 Just added.
 
 > [...]
 >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 >> index 54e93de..e9824c1 100644
 >> --- a/mm/memcontrol.c
 >> +++ b/mm/memcontrol.c
 > [...]
 >> +EXPORT_SYMBOL(__memcg_kmem_new_page);
 >
 > Why is this exported?
 >
 
 It shouldn't be. Removed.
 
 >> +
 >> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order)
 >> +{
 >> +	struct page_cgroup *pc;
 >> +	struct mem_cgroup *memcg = handle;
 >> +
 >> +	if (!memcg)
 >> +		return;
 >> +
 >> +	WARN_ON(mem_cgroup_is_root(memcg));
 >> +	/* The page allocation must have failed. Revert */
 >> +	if (!page) {
 >> +		size_t size = PAGE_SIZE << order;
 >> +
 >> +		memcg_uncharge_kmem(memcg, size);
 >> +		mem_cgroup_put(memcg);
 >> +		return;
 >> +	}
 >> +
 >> +	pc = lookup_page_cgroup(page);
 >> +	lock_page_cgroup(pc);
 >> +	pc->mem_cgroup = memcg;
 >> +	SetPageCgroupUsed(pc);
 >
 > Don't we need a write barrier before assigning memcg? Same as
 > __mem_cgroup_commit_charge. This tests the Used bit always from within
 > lock_page_cgroup so it should be safe but I am not 100% sure about the
 > rest of the code.
 >
 Well, I don't see the reason, precisely because we'll always grab it
 from within the locked region. That should ensure all the necessary
 serialization.
 
 >> +#ifdef CONFIG_MEMCG_KMEM
 >> +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
 >> +{
 >> +	struct res_counter *fail_res;
 >> +	struct mem_cgroup *_memcg;
 >> +	int ret;
 >> +	bool may_oom;
 >> +	bool nofail = false;
 >> +
 >> +	may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
 >> +	    !(gfp & __GFP_NORETRY);
 >
 > This deserves a comment.
 >
 can't hurt!! =)
 
 >> +
 >> +	ret = 0;
 >> +
 >> +	if (!memcg)
 >> +		return ret;
 >> +
 >> +	_memcg = memcg;
 >> +	ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
 >> +	    &_memcg, may_oom);
 >
 > This is really dangerous because atomic allocation which seem to be
 > possible could result in deadlocks because of the reclaim.
 
 Can you elaborate on how this would happen?
 
 > Also, as I
 > have mentioned in the other email in this thread. Why should we reclaim
 > just because of kernel allocation when we are not reclaiming any of it
 > because shrink_slab is ignored in the memcg reclaim.
 
 
 Don't get too distracted by the fact that shrink_slab is ignored. It is
 temporary, and while this being ignored now leads to suboptimal
 behavior, it will 1st, only affect its users, and 2nd, not be disastrous.
 
 I see it this as more or less on pair with the soft limit reclaim
 problem we had. It is not ideal, but it already provided functionality
 
 >> +
 >> +	if (ret == -EINTR)  {
 >> +		nofail = true;
 >> +		/*
 >> +		 * __mem_cgroup_try_charge() chosed to bypass to root due to
 >> +		 * OOM kill or fatal signal.  Since our only options are to
 >> +		 * either fail the allocation or charge it to this cgroup, do
 >> +		 * it as a temporary condition. But we can't fail. From a
 >> +		 * kmem/slab perspective, the cache has already been selected,
 >> +		 * by mem_cgroup_get_kmem_cache(), so it is too late to change
 >> +		 * our minds
 >> +		 */
 >> +		res_counter_charge_nofail(&memcg->res, delta, &fail_res);
 >> +		if (do_swap_account)
 >> +			res_counter_charge_nofail(&memcg->memsw, delta,
 >> +						  &fail_res);
 >
 > Hmmm, this is kind of ugly but I guess unvoidable with the current
 > implementation. Oh well...
 >
 
 Oh well...
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 06/11] memcg: kmem controller infrastructure [message #47509 is a reply to message #47508] | Wed, 15 August 2012 10:44   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/15/2012 01:42 PM, Glauber Costa wrote: >> Also, as I
 >> > have mentioned in the other email in this thread. Why should we reclaim
 >> > just because of kernel allocation when we are not reclaiming any of it
 >> > because shrink_slab is ignored in the memcg reclaim.
 >
 > Don't get too distracted by the fact that shrink_slab is ignored. It is
 > temporary, and while this being ignored now leads to suboptimal
 > behavior, it will 1st, only affect its users, and 2nd, not be disastrous.
 >
 > I see it this as more or less on pair with the soft limit reclaim
 > problem we had. It is not ideal, but it already provided functionality
 >
 
 Okay, I sent the e-mail before finishing it... duh
 
 What I meant in this last sentence, is that the situation while the
 memcg-aware shrinkers doesn't land in the kernel is more or less the
 same (obviously not exactly) as with the soft reclaim work. It is an
 evolutionary approach that provides some functionality that is not yet
 perfect but already solves lots of problems for people willing to live
 with its temporary drawbacks.
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47511 is a reply to message #47507] | Wed, 15 August 2012 12:39   |  
			| 
				
				
					|  Michal Hocko Messages: 109
 Registered: December 2011
 | Senior Member |  |  |  
	| On Wed 15-08-12 13:33:55, Glauber Costa wrote: [...]
 > > This can
 > > be quite confusing.  I am still not sure whether we should mix the two
 > > things together. If somebody wants to limit the kernel memory he has to
 > > touch the other limit anyway.  Do you have a strong reason to mix the
 > > user and kernel counters?
 >
 > This is funny, because the first opposition I found to this work was
 > "Why would anyone want to limit it separately?" =p
 >
 > It seems that a quite common use case is to have a container with a
 > unified view of "memory" that it can use the way he likes, be it with
 > kernel memory, or user memory. I believe those people would be happy to
 > just silently account kernel memory to user memory, or at the most have
 > a switch to enable it.
 >
 > What gets clear from this back and forth, is that there are people
 > interested in both use cases.
 
 I am still not 100% sure myself. It is just clear that the reclaim would
 need some work in order to do accounting like this.
 
 > > My impression was that kernel allocation should simply fail while user
 > > allocations might reclaim as well. Why should we reclaim just because of
 > > the kernel allocation (which is unreclaimable from hard limit reclaim
 > > point of view)?
 >
 > That is not what the kernel does, in general. We assume that if he wants
 > that memory and we can serve it, we should. Also, not all kernel memory
 > is unreclaimable. We can shrink the slabs, for instance. Ying Han
 > claims she has patches for that already...
 
 Are those patches somewhere around?
 
 [...]
 > > This doesn't check for the hierachy so kmem_accounted might not be in
 > > sync with it's parents. mem_cgroup_create (below) needs to copy
 > > kmem_accounted down from the parent and the above needs to check if this
 > > is a similar dance like mem_cgroup_oom_control_write.
 > >
 >
 > I don't see why we have to.
 >
 > I believe in a A/B/C hierarchy, C should be perfectly able to set a
 > different limit than its parents. Note that this is not a boolean.
 
 Ohh, I wasn't clear enough. I am not against setting the _limit_ I just
 meant that the kmem_accounted should be consistent within the hierarchy.
 
 --
 Michal Hocko
 SUSE Labs
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47512 is a reply to message #47511] | Wed, 15 August 2012 12:53   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/15/2012 04:39 PM, Michal Hocko wrote: > On Wed 15-08-12 13:33:55, Glauber Costa wrote:
 > [...]
 >>> This can
 >>> be quite confusing.  I am still not sure whether we should mix the two
 >>> things together. If somebody wants to limit the kernel memory he has to
 >>> touch the other limit anyway.  Do you have a strong reason to mix the
 >>> user and kernel counters?
 >>
 >> This is funny, because the first opposition I found to this work was
 >> "Why would anyone want to limit it separately?" =p
 >>
 >> It seems that a quite common use case is to have a container with a
 >> unified view of "memory" that it can use the way he likes, be it with
 >> kernel memory, or user memory. I believe those people would be happy to
 >> just silently account kernel memory to user memory, or at the most have
 >> a switch to enable it.
 >>
 >> What gets clear from this back and forth, is that there are people
 >> interested in both use cases.
 >
 > I am still not 100% sure myself. It is just clear that the reclaim would
 > need some work in order to do accounting like this.
 >
 
 Note: Besides what I've already said, right *now* in this series we are
 accounting just stack. So reclaimable vs not-reclaimable doesn't even
 get to play. It is used while the tasks are running, it gets freed after
 the tasks exited.
 
 I do agree we need to look to the whole picture, and reclaiming will be
 hard to get right.
 
 This is actually why we're addressing them separately: because they are
 a hard problem on their own, and the current status of accounting
 already solve real life problems for many, though not for all.
 
 >>> My impression was that kernel allocation should simply fail while user
 >>> allocations might reclaim as well. Why should we reclaim just because of
 >>> the kernel allocation (which is unreclaimable from hard limit reclaim
 >>> point of view)?
 >>
 >> That is not what the kernel does, in general. We assume that if he wants
 >> that memory and we can serve it, we should. Also, not all kernel memory
 >> is unreclaimable. We can shrink the slabs, for instance. Ying Han
 >> claims she has patches for that already...
 >
 > Are those patches somewhere around?
 >
 
 Ying Han ?
 
 > [...]
 >>> This doesn't check for the hierachy so kmem_accounted might not be in
 >>> sync with it's parents. mem_cgroup_create (below) needs to copy
 >>> kmem_accounted down from the parent and the above needs to check if this
 >>> is a similar dance like mem_cgroup_oom_control_write.
 >>>
 >>
 >> I don't see why we have to.
 >>
 >> I believe in a A/B/C hierarchy, C should be perfectly able to set a
 >> different limit than its parents. Note that this is not a boolean.
 >
 > Ohh, I wasn't clear enough. I am not against setting the _limit_ I just
 > meant that the kmem_accounted should be consistent within the hierarchy.
 >
 
 If a parent of yours is accounted, you get accounted as well. This is
 not the state in this patch, but gets added later. Isn't this enough ?
 |  
	|  |  |  
	|  |  
	|  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47515 is a reply to message #47514] | Wed, 15 August 2012 13:04   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/15/2012 05:02 PM, Michal Hocko wrote: > On Wed 15-08-12 16:53:40, Glauber Costa wrote:
 > [...]
 >>>>> This doesn't check for the hierachy so kmem_accounted might not be in
 >>>>> sync with it's parents. mem_cgroup_create (below) needs to copy
 >>>>> kmem_accounted down from the parent and the above needs to check if this
 >>>>> is a similar dance like mem_cgroup_oom_control_write.
 >>>>>
 >>>>
 >>>> I don't see why we have to.
 >>>>
 >>>> I believe in a A/B/C hierarchy, C should be perfectly able to set a
 >>>> different limit than its parents. Note that this is not a boolean.
 >>>
 >>> Ohh, I wasn't clear enough. I am not against setting the _limit_ I just
 >>> meant that the kmem_accounted should be consistent within the hierarchy.
 >>>
 >>
 >> If a parent of yours is accounted, you get accounted as well. This is
 >> not the state in this patch, but gets added later. Isn't this enough ?
 >
 > But if the parent is not accounted, you can set the children to be
 > accounted, right? Or maybe this is changed later in the series? I didn't
 > get to the end yet.
 >
 
 Yes, you can. Do you see any problem with that?
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 06/11] memcg: kmem controller infrastructure [message #47516 is a reply to message #47508] | Wed, 15 August 2012 13:09   |  
			| 
				
				
					|  Michal Hocko Messages: 109
 Registered: December 2011
 | Senior Member |  |  |  
	| On Wed 15-08-12 13:42:24, Glauber Costa wrote: [...]
 > >> +
 > >> +	ret = 0;
 > >> +
 > >> +	if (!memcg)
 > >> +		return ret;
 > >> +
 > >> +	_memcg = memcg;
 > >> +	ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
 > >> +	    &_memcg, may_oom);
 > >
 > > This is really dangerous because atomic allocation which seem to be
 > > possible could result in deadlocks because of the reclaim.
 >
 > Can you elaborate on how this would happen?
 
 Say you have an atomic allocation and we hit the limit so we get either
 to reclaim which can sleep or to oom which can sleep as well (depending
 on the oom_control).
 
 > > Also, as I have mentioned in the other email in this thread. Why
 > > should we reclaim just because of kernel allocation when we are not
 > > reclaiming any of it because shrink_slab is ignored in the memcg
 > > reclaim.
 >
 > Don't get too distracted by the fact that shrink_slab is ignored. It is
 > temporary, and while this being ignored now leads to suboptimal
 > behavior, it will 1st, only affect its users, and 2nd, not be disastrous.
 
 It's not just about shrink_slab it is also about triggering memcg-oom
 which doesn't consider kmem accounted memory so the wrong tasks could
 be killed. It is true that the impact is packed inside the group
 (hierarchy) so you are right it won't be disastrous.
 --
 Michal Hocko
 SUSE Labs
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg [message #47517 is a reply to message #47501] | Wed, 15 August 2012 13:22   |  
			| 
				
				
					|  Mel Gorman Messages: 8
 Registered: August 2012
 | Junior Member |  |  |  
	| On Wed, Aug 15, 2012 at 01:08:08PM +0400, Glauber Costa wrote: > On 08/14/2012 07:16 PM, Mel Gorman wrote:
 > > On Thu, Aug 09, 2012 at 05:01:15PM +0400, Glauber Costa wrote:
 > >> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
 > >> page allocator will call the corresponding memcg functions to validate
 > >> the allocation. Tasks in the root memcg can always proceed.
 > >>
 > >> To avoid adding markers to the page - and a kmem flag that would
 > >> necessarily follow, as much as doing page_cgroup lookups for no reason,
 > >
 > > As you already guessed, doing a page_cgroup in the page allocator free
 > > path would be a no-go.
 >
 > Specifically yes, but in general, you will be able to observe that I am
 > taking all the possible measures to make sure existing paths are
 > disturbed as little as possible.
 >
 > Thanks for your review here
 >
 > >>
 > >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
 > >> index b956cec..da341dc 100644
 > >> --- a/mm/page_alloc.c
 > >> +++ b/mm/page_alloc.c
 > >> @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 > >>  	struct page *page = NULL;
 > >>  	int migratetype = allocflags_to_migratetype(gfp_mask);
 > >>  	unsigned int cpuset_mems_cookie;
 > >> +	void *handle = NULL;
 > >>
 > >>  	gfp_mask &= gfp_allowed_mask;
 > >>
 > >> @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 > >>  		return NULL;
 > >>
 > >>  	/*
 > >> +	 * Will only have any effect when __GFP_KMEMCG is set.
 > >> +	 * This is verified in the (always inline) callee
 > >> +	 */
 > >> +	if (!memcg_kmem_new_page(gfp_mask, &handle, order))
 > >
 > > memcg_kmem_new_page takes a void * parameter already but here you are
 > > passing in a void **. This probably happens to work because you do this
 > >
 > > struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
 > >
 > > but that appears to defeat the purpose of having an opaque type as a
 > > "handle". You have to treat it different then passing it into the commit
 > > function because it expects a void *. The motivation for an opaque type
 > > is completely unclear to me and how it is managed with a mix of void *
 > > and void ** is very confusing.
 >
 > okay.
 >
 > The opaque exists because I am doing speculative charging.
 
 I do not get why speculative charing would mandate an opaque type or
 "handle". It looks like like a fairly standard prepare/commit pattern to me.
 
 > I believe it
 > to be a better and less complicated approach then letting a page appear
 > and then charging it. Besides being consistent with the rest of memcg,
 > it won't create unnecessary disturbance in the page allocator
 > when the allocation is to fail.
 >
 
 I still don't get why you did not just return a mem_cgroup instead of a
 handle.
 
 > Now, tasks can move between memcgs, so we can't rely on grabbing it from
 > current in commit_page, so we pass it around as a handle.
 
 You could just as easily passed around the mem_cgroup and it would have
 been less obscure. Maybe this makes sense from a memcg context and matches
 some coding pattern there that I'm not aware of.
 
 > Also, even if
 > the task could not move, we already got it once from the task, and that
 > is not for free. Better save it.
 >
 > Aside from the handle needed, the cost is more or less the same compared
 > to doing it in one pass. All we do by using speculative charging is to
 > split the cost in two, and doing it from two places.
 > We'd have to charge + update page_cgroup anyway.
 >
 > As for the type, do you think using struct mem_cgroup would be less
 > confusing?
 >
 
 Yes and returning the mem_cgroup or NULL instead of bool.
 
 > > On a similar note I spotted #define memcg_kmem_on 1 . That is also
 > > different just for the sake of it. The convension is to do something
 > > like this
 > >
 > > /* This helps us to avoid #ifdef CONFIG_NUMA */
 > > #ifdef CONFIG_NUMA
 > > #define NUMA_BUILD 1
 > > #else
 > > #define NUMA_BUILD 0
 > > #endif
 >
 > For simple defines, yes. But a later patch will turn this into a static
 > branch test. memcg_kmem_on will be always 0 when compile-disabled, but
 > when enable will expand to static_branch(&...).
 >
 
 I see.
 
 >
 > > memcg_kmem_on was difficult to guess based on its name. I thought initially
 > > that it would only be active if a memcg existed or at least something like
 > > mem_cgroup_disabled() but it's actually enabled if CONFIG_MEMCG_KMEM is set.
 >
 > For now. And I thought that adding the static branch in this patch would
 > only confuse matters.
 
 Ah, I see now. I had stopped reading the series once I reached this
 patch. I don't think it would have mattered much to collapse the two
 patches together but ok.
 
 The static key handling does look a little suspicious. You appear to do
 reference counting in memcg_update_kmem_limit for every mem_cgroup_write()
 but decrement it on memcg exit. This does not appear as if it would be
 symmetric if the memcg files were written to multiple times (maybe that's
 not allowed?). Either way, the comment says it can never be disabled but
 as you have static_key_slow_dec calls it would appear that you *do*
 support them being disabled. Confusing.
 
 > The placeholder is there, but it is later patched
 > to the final thing.
 > With that explained, if you want me to change it to something else, I
 > can do it. Should I ?
 >
 
 Not in this patch anyway. I would have preferred a pattern like this but
 that's about it.
 
 #ifdef CONFIG_MEMCG_KMEM
 extern struct static_key memcg_kmem_enabled_key;
 static inline int memcg_kmem_enabled(void)
 {
 return static_key_false(&memcg_kmem_enabled_key);
 }
 #else
 
 static inline bool memcg_kmem_enabled(void)
 {
 return false;
 }
 #endif
 
 Two reasons. One, it does not use the terms "on" and "enabled"
 interchangeably.  The other reason is down to taste as I'm copying the
 pattern I used myself for sk_memalloc_socks(). Of course I am biased.
 
 Also, why is the key exported?
 
 > > I also find it *very* strange to have a function named as if it is an
 > > allocation-style function when it in fact it's looking up a mem_cgroup
 > > and charging it (and uncharging it in the error path if necessary). If
 > > it was called memcg_kmem_newpage_charge I might have found it a little
 > > better.
 >
 > I don't feel strongly about names in general. I can change it.
 > Will update to memcg_kmem_newpage_charge() and memcg_kmem_page_uncharge().
 >
 
 I would prefer that anyway. Names have meaning and people make assumptions on
 the implementation depending on the name. We should try to be as consistent
 as possible or maintenance becomes harder. I know there are areas where
 we are not consistent at all but we should not compound the problem.
 
 > > This whole operation also looks very expensive (cgroup lookups, RCU locks
 > > taken etc) but I guess you're willing to take that cost in the same of
 > > isolating containers from each other. However, I strongly suggest that
 > > this overhead is measured in advance. It should not stop the series being
 > > merged as such but it should be understood because if the cost is high
 > > then this feature will be avoided like the plague. I am skeptical that
 > > distributions would enable this by default, at least not without support
 > > for cgroup_disable=kmem
 >
 > Enabling this feature will bring you nothing, therefore, no (or little)
 > overhead. Nothing of this will be patched in until the first memcg gets
 > kmem limited. The mere fact of moving tasks to memcgs won't trigger any
 > of this.
 >
 
 ok.
 
 > I haven't measured this series in particular, but I did measure the slab
 > series (which builds ontop of this). I found the per-allocation cost to
 > be in the order of 2-3 % for tasks living in limited memcgs, and
 > hard to observe when living in the root memcg (compared of course to the
 > case of a task running on root memcg without those patches)
 >
 
 Depending on the workload that 2-3% could be a lot but at least you're
 aware of it.
 
 > I also believe the folks from google also measured this. They may be
 > able to spit out numbers grabbed from a system bigger than mine =p
 >
 > > As this thing is called from within the allocator, it's not clear why
 > > __memcg_kmem_new_page is exported. I can't imagine why a module would call
 > > it directly although maybe you cover that somewhere else in the series.
 >
 > Okay, more people commented on this, so let me clarify: They shouldn't
 > be. They were initially exported when this was about the slab only,
 > because they could be called from inlined functions from the allocators.
 > Now that the charge/uncharge was moved to the page allocator - which
 > already allowed me the big benefit of separating this in two pieces,
 > none of this needs to be exported.
 >
 > Sorry for not noticing this myself, but thanks for the eyes =)
 >
 
 You're welcome. I expect to see all the exports disappear so. If there
 are any exports left I think it would be important to document why they
 have to be exported. This is particularly true because they are
 EXPORT_SYMBOL not EXPORT_SYMBOL_GPL. I think it would be good to know in
 advance why a module (particularly an out-of-tree one) would be
 interested.
 
 > > From the point of view of a hook, that is acceptable but just barely. I have
 > > slammed other hoo
...
 
 
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47518 is a reply to message #47515] | Wed, 15 August 2012 13:26   |  
			| 
				
				
					|  Michal Hocko Messages: 109
 Registered: December 2011
 | Senior Member |  |  |  
	| On Wed 15-08-12 17:04:31, Glauber Costa wrote: > On 08/15/2012 05:02 PM, Michal Hocko wrote:
 > > On Wed 15-08-12 16:53:40, Glauber Costa wrote:
 > > [...]
 > >>>>> This doesn't check for the hierachy so kmem_accounted might not be in
 > >>>>> sync with it's parents. mem_cgroup_create (below) needs to copy
 > >>>>> kmem_accounted down from the parent and the above needs to check if this
 > >>>>> is a similar dance like mem_cgroup_oom_control_write.
 > >>>>>
 > >>>>
 > >>>> I don't see why we have to.
 > >>>>
 > >>>> I believe in a A/B/C hierarchy, C should be perfectly able to set a
 > >>>> different limit than its parents. Note that this is not a boolean.
 > >>>
 > >>> Ohh, I wasn't clear enough. I am not against setting the _limit_ I just
 > >>> meant that the kmem_accounted should be consistent within the hierarchy.
 > >>>
 > >>
 > >> If a parent of yours is accounted, you get accounted as well. This is
 > >> not the state in this patch, but gets added later. Isn't this enough ?
 > >
 > > But if the parent is not accounted, you can set the children to be
 > > accounted, right? Or maybe this is changed later in the series? I didn't
 > > get to the end yet.
 > >
 >
 > Yes, you can. Do you see any problem with that?
 
 Well, if a child contributes with the kmem charges upwards the hierachy
 then a parent can have kmem.usage > 0 with disabled accounting.
 I am not saying this is a no-go but it definitely is confusing and I do
 not see any good reason for it. I've considered it as an overlook rather
 than a deliberate design decision.
 --
 Michal Hocko
 SUSE Labs
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47519 is a reply to message #47513] | Wed, 15 August 2012 13:29   |  
			| 
				
				
					|  James Bottomley Messages: 17
 Registered: May 2006
 | Junior Member |  |  |  
	| On Wed, 2012-08-15 at 14:55 +0200, Michal Hocko wrote: > On Wed 15-08-12 12:12:23, James Bottomley wrote:
 > > On Wed, 2012-08-15 at 13:33 +0400, Glauber Costa wrote:
 > > > > This can
 > > > > be quite confusing.  I am still not sure whether we should mix the two
 > > > > things together. If somebody wants to limit the kernel memory he has to
 > > > > touch the other limit anyway.  Do you have a strong reason to mix the
 > > > > user and kernel counters?
 > > >
 > > > This is funny, because the first opposition I found to this work was
 > > > "Why would anyone want to limit it separately?" =p
 > > >
 > > > It seems that a quite common use case is to have a container with a
 > > > unified view of "memory" that it can use the way he likes, be it with
 > > > kernel memory, or user memory. I believe those people would be happy to
 > > > just silently account kernel memory to user memory, or at the most have
 > > > a switch to enable it.
 > > >
 > > > What gets clear from this back and forth, is that there are people
 > > > interested in both use cases.
 > >
 > > Haven't we already had this discussion during the Prague get together?
 > > We discussed the use cases and finally agreed to separate accounting for
 > > k and then k+u mem because that satisfies both the Google and Parallels
 > > cases.  No-one was overjoyed by k and k+u but no-one had a better
 > > suggestion ... is there a better way of doing this that everyone can
 > > agree to?
 > > We do need to get this nailed down because it's the foundation of the
 > > patch series.
 >
 > There is a slot in MM/memcg minisum at KS so we have a slot to discuss
 > this.
 
 Sure, to get things moving, can you pre-prime us with what you're
 thinking in this area so we can be prepared (and if it doesn't work,
 tell you beforehand)?
 
 Thanks,
 
 James
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47520 is a reply to message #47518] | Wed, 15 August 2012 13:31   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/15/2012 05:26 PM, Michal Hocko wrote: > On Wed 15-08-12 17:04:31, Glauber Costa wrote:
 >> On 08/15/2012 05:02 PM, Michal Hocko wrote:
 >>> On Wed 15-08-12 16:53:40, Glauber Costa wrote:
 >>> [...]
 >>>>>>> This doesn't check for the hierachy so kmem_accounted might not be in
 >>>>>>> sync with it's parents. mem_cgroup_create (below) needs to copy
 >>>>>>> kmem_accounted down from the parent and the above needs to check if this
 >>>>>>> is a similar dance like mem_cgroup_oom_control_write.
 >>>>>>>
 >>>>>>
 >>>>>> I don't see why we have to.
 >>>>>>
 >>>>>> I believe in a A/B/C hierarchy, C should be perfectly able to set a
 >>>>>> different limit than its parents. Note that this is not a boolean.
 >>>>>
 >>>>> Ohh, I wasn't clear enough. I am not against setting the _limit_ I just
 >>>>> meant that the kmem_accounted should be consistent within the hierarchy.
 >>>>>
 >>>>
 >>>> If a parent of yours is accounted, you get accounted as well. This is
 >>>> not the state in this patch, but gets added later. Isn't this enough ?
 >>>
 >>> But if the parent is not accounted, you can set the children to be
 >>> accounted, right? Or maybe this is changed later in the series? I didn't
 >>> get to the end yet.
 >>>
 >>
 >> Yes, you can. Do you see any problem with that?
 >
 > Well, if a child contributes with the kmem charges upwards the hierachy
 > then a parent can have kmem.usage > 0 with disabled accounting.
 > I am not saying this is a no-go but it definitely is confusing and I do
 > not see any good reason for it. I've considered it as an overlook rather
 > than a deliberate design decision.
 >
 
 No, it is not an overlook.
 It is theoretically possible to skip accounting on non-limited parents,
 but how expensive is that? This is, indeed, confusing.
 
 Of course I can be biased, but the way I see it, once you have
 hierarchy, you account everything your child accounts.
 
 I really don't see what is the concern here.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg [message #47521 is a reply to message #47517] | Wed, 15 August 2012 13:39   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| >> >> As for the type, do you think using struct mem_cgroup would be less
 >> confusing?
 >>
 >
 > Yes and returning the mem_cgroup or NULL instead of bool.
 
 Ok. struct mem_cgroup it is.
 
 >
 >> The placeholder is there, but it is later patched
 >> to the final thing.
 >> With that explained, if you want me to change it to something else, I
 >> can do it. Should I ?
 >>
 >
 > Not in this patch anyway. I would have preferred a pattern like this but
 > that's about it.
 >
 > #ifdef CONFIG_MEMCG_KMEM
 > extern struct static_key memcg_kmem_enabled_key;
 > static inline int memcg_kmem_enabled(void)
 > {
 >        return static_key_false(&memcg_kmem_enabled_key);
 > }
 > #else
 >
 > static inline bool memcg_kmem_enabled(void)
 > {
 >        return false;
 > }
 > #endif
 >
 
 humm, I'll have to think about this name.
 "memcg_kmem_enabled" means it is enabled in this cgroup. It is actually
 used inside memcontrol.c to denote precisely that.
 
 Now the static branch, of course, means it is globally enabled. Or as I
 called here, "on".
 
 
 > Two reasons. One, it does not use the terms "on" and "enabled"
 > interchangeably.  The other reason is down to taste as I'm copying the
 > pattern I used myself for sk_memalloc_socks(). Of course I am biased.
 >
 > Also, why is the key exported?
 >
 
 Same reason. The slab will now have inline functions that will test
 against that. The alloc functions themselves, are inside the page
 allocator, and the exports can go away.
 
 But the static branch will still be tested inside inlined functions in
 the slab.
 
 That said, for the sake of simplicity, I can make it go away here, and
 add that to the right place later.
 
 >>> I also find it *very* strange to have a function named as if it is an
 >>> allocation-style function when it in fact it's looking up a mem_cgroup
 >>> and charging it (and uncharging it in the error path if necessary). If
 >>> it was called memcg_kmem_newpage_charge I might have found it a little
 >>> better.
 >>
 >> I don't feel strongly about names in general. I can change it.
 >> Will update to memcg_kmem_newpage_charge() and memcg_kmem_page_uncharge().
 >>
 >
 > I would prefer that anyway. Names have meaning and people make assumptions on
 > the implementation depending on the name. We should try to be as consistent
 > as possible or maintenance becomes harder. I know there are areas where
 > we are not consistent at all but we should not compound the problem.
 
 memcg_kmem_page_charge() is even better I believe, and that is what I
 changed this to in my tree.
 
 >>> As this thing is called from within the allocator, it's not clear why
 >>> __memcg_kmem_new_page is exported. I can't imagine why a module would call
 >>> it directly although maybe you cover that somewhere else in the series.
 >>
 >> Okay, more people commented on this, so let me clarify: They shouldn't
 >> be. They were initially exported when this was about the slab only,
 >> because they could be called from inlined functions from the allocators.
 >> Now that the charge/uncharge was moved to the page allocator - which
 >> already allowed me the big benefit of separating this in two pieces,
 >> none of this needs to be exported.
 >>
 >> Sorry for not noticing this myself, but thanks for the eyes =)
 >>
 >
 > You're welcome. I expect to see all the exports disappear so. If there
 > are any exports left I think it would be important to document why they
 > have to be exported. This is particularly true because they are
 > EXPORT_SYMBOL not EXPORT_SYMBOL_GPL. I think it would be good to know in
 > advance why a module (particularly an out-of-tree one) would be
 > interested.
 >
 
 I will remove them all for now.
 |  
	|  |  |  
	|  |  
	|  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47525 is a reply to message #47520] | Wed, 15 August 2012 14:10   |  
			| 
				
				
					|  Michal Hocko Messages: 109
 Registered: December 2011
 | Senior Member |  |  |  
	| On Wed 15-08-12 17:31:24, Glauber Costa wrote: > On 08/15/2012 05:26 PM, Michal Hocko wrote:
 > > On Wed 15-08-12 17:04:31, Glauber Costa wrote:
 > >> On 08/15/2012 05:02 PM, Michal Hocko wrote:
 > >>> On Wed 15-08-12 16:53:40, Glauber Costa wrote:
 > >>> [...]
 > >>>>>>> This doesn't check for the hierachy so kmem_accounted might not be in
 > >>>>>>> sync with it's parents. mem_cgroup_create (below) needs to copy
 > >>>>>>> kmem_accounted down from the parent and the above needs to check if this
 > >>>>>>> is a similar dance like mem_cgroup_oom_control_write.
 > >>>>>>>
 > >>>>>>
 > >>>>>> I don't see why we have to.
 > >>>>>>
 > >>>>>> I believe in a A/B/C hierarchy, C should be perfectly able to set a
 > >>>>>> different limit than its parents. Note that this is not a boolean.
 > >>>>>
 > >>>>> Ohh, I wasn't clear enough. I am not against setting the _limit_ I just
 > >>>>> meant that the kmem_accounted should be consistent within the hierarchy.
 > >>>>>
 > >>>>
 > >>>> If a parent of yours is accounted, you get accounted as well. This is
 > >>>> not the state in this patch, but gets added later. Isn't this enough ?
 > >>>
 > >>> But if the parent is not accounted, you can set the children to be
 > >>> accounted, right? Or maybe this is changed later in the series? I didn't
 > >>> get to the end yet.
 > >>>
 > >>
 > >> Yes, you can. Do you see any problem with that?
 > >
 > > Well, if a child contributes with the kmem charges upwards the hierachy
 > > then a parent can have kmem.usage > 0 with disabled accounting.
 > > I am not saying this is a no-go but it definitely is confusing and I do
 > > not see any good reason for it. I've considered it as an overlook rather
 > > than a deliberate design decision.
 > >
 >
 > No, it is not an overlook.
 > It is theoretically possible to skip accounting on non-limited parents,
 > but how expensive is that? This is, indeed, confusing.
 >
 > Of course I can be biased, but the way I see it, once you have
 > hierarchy, you account everything your child accounts.
 >
 > I really don't see what is the concern here.
 
 OK, I missed an important point that kmem_accounted is not exported to
 the userspace (I thought it would be done later in the series) which
 is not the case so actually nobody get's confused by the inconsistency
 because it is about RESOURCE_MAX which they see in both cases.
 Sorry about the confusion!
 --
 Michal Hocko
 SUSE Labs
 |  
	|  |  |  
	|  |  
	|  |  
	|  |  
	|  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47530 is a reply to message #47529] | Wed, 15 August 2012 15:11   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/15/2012 06:47 PM, Christoph Lameter wrote: > On Wed, 15 Aug 2012, Michal Hocko wrote:
 >
 >>> That is not what the kernel does, in general. We assume that if he wants
 >>> that memory and we can serve it, we should. Also, not all kernel memory
 >>> is unreclaimable. We can shrink the slabs, for instance. Ying Han
 >>> claims she has patches for that already...
 >>
 >> Are those patches somewhere around?
 >
 > You can already shrink the reclaimable slabs (dentries / inodes) via
 > calls to the subsystem specific shrinkers. Did Ying Han do anything to
 > go beyond that?
 >
 That is not enough for us.
 We would like to make sure that the objects being discarded belong to
 the memcg which is under pressure. We don't need to be perfect here, and
 an occasional slip is totally fine. But if in general, shrinking from
 memcg A will mostly wipe out objects from memcg B, we harmed the system
 in return for nothing good.
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47532 is a reply to message #47530] | Wed, 15 August 2012 15:34   |  
			| 
				
				
					|  Christoph Lameter Messages: 123
 Registered: September 2006
 | Senior Member |  |  |  
	| On Wed, 15 Aug 2012, Glauber Costa wrote: 
 > On 08/15/2012 06:47 PM, Christoph Lameter wrote:
 > > On Wed, 15 Aug 2012, Michal Hocko wrote:
 > >
 > >>> That is not what the kernel does, in general. We assume that if he wants
 > >>> that memory and we can serve it, we should. Also, not all kernel memory
 > >>> is unreclaimable. We can shrink the slabs, for instance. Ying Han
 > >>> claims she has patches for that already...
 > >>
 > >> Are those patches somewhere around?
 > >
 > > You can already shrink the reclaimable slabs (dentries / inodes) via
 > > calls to the subsystem specific shrinkers. Did Ying Han do anything to
 > > go beyond that?
 > >
 > That is not enough for us.
 > We would like to make sure that the objects being discarded belong to
 > the memcg which is under pressure. We don't need to be perfect here, and
 > an occasional slip is totally fine. But if in general, shrinking from
 > memcg A will mostly wipe out objects from memcg B, we harmed the system
 > in return for nothing good.
 
 How can you figure out which objects belong to which memcg? The ownerships
 of dentries and inodes is a dubious concept already.
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47534 is a reply to message #47532] | Wed, 15 August 2012 15:35   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/15/2012 07:34 PM, Christoph Lameter wrote: > On Wed, 15 Aug 2012, Glauber Costa wrote:
 >
 >> On 08/15/2012 06:47 PM, Christoph Lameter wrote:
 >>> On Wed, 15 Aug 2012, Michal Hocko wrote:
 >>>
 >>>>> That is not what the kernel does, in general. We assume that if he wants
 >>>>> that memory and we can serve it, we should. Also, not all kernel memory
 >>>>> is unreclaimable. We can shrink the slabs, for instance. Ying Han
 >>>>> claims she has patches for that already...
 >>>>
 >>>> Are those patches somewhere around?
 >>>
 >>> You can already shrink the reclaimable slabs (dentries / inodes) via
 >>> calls to the subsystem specific shrinkers. Did Ying Han do anything to
 >>> go beyond that?
 >>>
 >> That is not enough for us.
 >> We would like to make sure that the objects being discarded belong to
 >> the memcg which is under pressure. We don't need to be perfect here, and
 >> an occasional slip is totally fine. But if in general, shrinking from
 >> memcg A will mostly wipe out objects from memcg B, we harmed the system
 >> in return for nothing good.
 >
 > How can you figure out which objects belong to which memcg? The ownerships
 > of dentries and inodes is a dubious concept already.
 >
 
 Remember we copy over the metadata and create copies of the caches
 per-memcg. Therefore, a dentry belongs to a memcg if it was allocated
 from the slab pertaining to that memcg.
 
 It is not 100 % accurate, but it is good enough.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 06/11] memcg: kmem controller infrastructure [message #47542 is a reply to message #47502] | Wed, 15 August 2012 16:38   |  
			| 
				
				
					|  Greg Thelen Messages: 18
 Registered: February 2011
 | Junior Member |  |  |  
	| On Wed, Aug 15 2012, Glauber Costa wrote: 
 > On 08/14/2012 10:58 PM, Greg Thelen wrote:
 >> On Mon, Aug 13 2012, Glauber Costa wrote:
 >>
 >>>>>> +	WARN_ON(mem_cgroup_is_root(memcg));
 >>>>>> +	size = (1 << order) << PAGE_SHIFT;
 >>>>>> +	memcg_uncharge_kmem(memcg, size);
 >>>>>> +	mem_cgroup_put(memcg);
 >>>> Why do we need ref-counting here ? kmem res_counter cannot work as
 >>>> reference ?
 >>> This is of course the pair of the mem_cgroup_get() you commented on
 >>> earlier. If we need one, we need the other. If we don't need one, we
 >>> don't need the other =)
 >>>
 >>> The guarantee we're trying to give here is that the memcg structure will
 >>> stay around while there are dangling charges to kmem, that we decided
 >>> not to move (remember: moving it for the stack is simple, for the slab
 >>> is very complicated and ill-defined, and I believe it is better to treat
 >>> all kmem equally here)
 >>
 >> By keeping memcg structures hanging around until the last referring kmem
 >> page is uncharged do such zombie memcg each consume a css_id and thus
 >> put pressure on the 64k css_id space?  I imagine in pathological cases
 >> this would prevent creation of new cgroups until these zombies are
 >> dereferenced.
 >
 > Yes, but although this patch makes it more likely, it doesn't introduce
 > that. If the tasks, for instance, grab a reference to the cgroup dentry
 > in the filesystem (like their CWD, etc), they will also keep the cgroup
 > around.
 
 Fair point.  But this doesn't seems like a feature.  It's probably not
 needed initially, but what do you think about creating a
 memcg_kernel_context structure which is allocated when memcg is
 allocated?  Kernel pages charged to a memcg would have
 page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
 would allow the mem_cgroup and its css_id to be deleted when the cgroup
 is unlinked from cgroupfs while allowing for the active kernel pages to
 continue pointing to a valid memcg_kernel_context.  This would be a
 reference counted structure much like you are doing with memcg.  When a
 memcg is deleted the memcg_kernel_context would be linked into its
 surviving parent memcg.  This would avoid needing to visit each kernel
 page.
 
 >> Is there any way to see how much kmem such zombie memcg are consuming?
 >> I think we could find these with
 >> for_each_mem_cgroup_tree(root_mem_cgroup).
 >
 > Yes, just need an interface for that. But I think it is something that
 > can be addressed orthogonaly to this work, in a separate patch, not as
 > some fundamental limitation.
 
 Agreed.
 
 >> Basically, I'm wanting to know where kernel memory has been
 >> allocated.  For live memcg, an admin can cat
 >> memory.kmem.usage_in_bytes.  But for zombie memcg, I'm not sure how
 >> to get this info.  It looks like the root_mem_cgroup
 >> memory.kmem.usage_in_bytes is not hierarchically charged.
 >>
 >
 > Not sure what you mean by not being hierarchically charged. It should
 > be, when use_hierarchy = 1. As a matter of fact, I just tested it, and I
 > do see kmem being charged all the way to the root cgroup when hierarchy
 > is used. (we just can't limit it there)
 
 You're correct, my mistake.
 
 I think the procedure to determine out the amount of zombie kmem is:
 root_mem_cgroup.kmem_usage_in_bytes -
 sum(all top level memcg memory.kmem_usage_in_bytes)
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 06/11] memcg: kmem controller infrastructure [message #47545 is a reply to message #47542] | Wed, 15 August 2012 17:00   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/15/2012 08:38 PM, Greg Thelen wrote: > On Wed, Aug 15 2012, Glauber Costa wrote:
 >
 >> On 08/14/2012 10:58 PM, Greg Thelen wrote:
 >>> On Mon, Aug 13 2012, Glauber Costa wrote:
 >>>
 >>>>>>> +	WARN_ON(mem_cgroup_is_root(memcg));
 >>>>>>> +	size = (1 << order) << PAGE_SHIFT;
 >>>>>>> +	memcg_uncharge_kmem(memcg, size);
 >>>>>>> +	mem_cgroup_put(memcg);
 >>>>> Why do we need ref-counting here ? kmem res_counter cannot work as
 >>>>> reference ?
 >>>> This is of course the pair of the mem_cgroup_get() you commented on
 >>>> earlier. If we need one, we need the other. If we don't need one, we
 >>>> don't need the other =)
 >>>>
 >>>> The guarantee we're trying to give here is that the memcg structure will
 >>>> stay around while there are dangling charges to kmem, that we decided
 >>>> not to move (remember: moving it for the stack is simple, for the slab
 >>>> is very complicated and ill-defined, and I believe it is better to treat
 >>>> all kmem equally here)
 >>>
 >>> By keeping memcg structures hanging around until the last referring kmem
 >>> page is uncharged do such zombie memcg each consume a css_id and thus
 >>> put pressure on the 64k css_id space?  I imagine in pathological cases
 >>> this would prevent creation of new cgroups until these zombies are
 >>> dereferenced.
 >>
 >> Yes, but although this patch makes it more likely, it doesn't introduce
 >> that. If the tasks, for instance, grab a reference to the cgroup dentry
 >> in the filesystem (like their CWD, etc), they will also keep the cgroup
 >> around.
 >
 > Fair point.  But this doesn't seems like a feature.  It's probably not
 > needed initially, but what do you think about creating a
 > memcg_kernel_context structure which is allocated when memcg is
 > allocated?  Kernel pages charged to a memcg would have
 > page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
 > would allow the mem_cgroup and its css_id to be deleted when the cgroup
 > is unlinked from cgroupfs while allowing for the active kernel pages to
 > continue pointing to a valid memcg_kernel_context.  This would be a
 > reference counted structure much like you are doing with memcg.  When a
 > memcg is deleted the memcg_kernel_context would be linked into its
 > surviving parent memcg.  This would avoid needing to visit each kernel
 > page.
 
 You need more, you need at the res_counters to stay around as well. And
 probably other fields.
 
 So my fear here is that as you add fields to that structure, you can
 defeat a bit the goal of reducing memory consumption. Still leaves the
 css space, yes. But by doing this we can introduce some subtle bugs by
 having a field in the wrong structure.
 
 Did you observe that to be a big problem in your systems?
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 06/11] memcg: kmem controller infrastructure [message #47546 is a reply to message #47545] | Wed, 15 August 2012 17:12   |  
			| 
				
				
					|  Greg Thelen Messages: 18
 Registered: February 2011
 | Junior Member |  |  |  
	| On Wed, Aug 15 2012, Glauber Costa wrote: 
 > On 08/15/2012 08:38 PM, Greg Thelen wrote:
 >> On Wed, Aug 15 2012, Glauber Costa wrote:
 >>
 >>> On 08/14/2012 10:58 PM, Greg Thelen wrote:
 >>>> On Mon, Aug 13 2012, Glauber Costa wrote:
 >>>>
 >>>>>>>> +	WARN_ON(mem_cgroup_is_root(memcg));
 >>>>>>>> +	size = (1 << order) << PAGE_SHIFT;
 >>>>>>>> +	memcg_uncharge_kmem(memcg, size);
 >>>>>>>> +	mem_cgroup_put(memcg);
 >>>>>> Why do we need ref-counting here ? kmem res_counter cannot work as
 >>>>>> reference ?
 >>>>> This is of course the pair of the mem_cgroup_get() you commented on
 >>>>> earlier. If we need one, we need the other. If we don't need one, we
 >>>>> don't need the other =)
 >>>>>
 >>>>> The guarantee we're trying to give here is that the memcg structure will
 >>>>> stay around while there are dangling charges to kmem, that we decided
 >>>>> not to move (remember: moving it for the stack is simple, for the slab
 >>>>> is very complicated and ill-defined, and I believe it is better to treat
 >>>>> all kmem equally here)
 >>>>
 >>>> By keeping memcg structures hanging around until the last referring kmem
 >>>> page is uncharged do such zombie memcg each consume a css_id and thus
 >>>> put pressure on the 64k css_id space?  I imagine in pathological cases
 >>>> this would prevent creation of new cgroups until these zombies are
 >>>> dereferenced.
 >>>
 >>> Yes, but although this patch makes it more likely, it doesn't introduce
 >>> that. If the tasks, for instance, grab a reference to the cgroup dentry
 >>> in the filesystem (like their CWD, etc), they will also keep the cgroup
 >>> around.
 >>
 >> Fair point.  But this doesn't seems like a feature.  It's probably not
 >> needed initially, but what do you think about creating a
 >> memcg_kernel_context structure which is allocated when memcg is
 >> allocated?  Kernel pages charged to a memcg would have
 >> page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
 >> would allow the mem_cgroup and its css_id to be deleted when the cgroup
 >> is unlinked from cgroupfs while allowing for the active kernel pages to
 >> continue pointing to a valid memcg_kernel_context.  This would be a
 >> reference counted structure much like you are doing with memcg.  When a
 >> memcg is deleted the memcg_kernel_context would be linked into its
 >> surviving parent memcg.  This would avoid needing to visit each kernel
 >> page.
 >
 > You need more, you need at the res_counters to stay around as well. And
 > probably other fields.
 
 I am not sure the res_counters would need to stay around.  Once a
 memcg_kernel_context has been reparented, then any future kernel page
 uncharge calls will uncharge the parent res_counter.
 
 > So my fear here is that as you add fields to that structure, you can
 > defeat a bit the goal of reducing memory consumption. Still leaves the
 > css space, yes. But by doing this we can introduce some subtle bugs by
 > having a field in the wrong structure.
 >
 > Did you observe that to be a big problem in your systems?
 
 No I have not seen this yet.  But our past solutions have reparented
 kmem_cache's to root memcg so we have been avoiding zombie memcg.  My
 concerns with your approach are just a suspicion because we have been
 experimenting with accounting of even more kernel memory (e.g. vmalloc,
 kernel stacks, page tables).  As the scope of such accounting grows the
 chance of long lived charged pages grows and thus the chance of zombies
 which exhaust the css_id space grows.
 |  
	|  |  | 
 
 
 Current Time: Sun Oct 26 20:51:09 GMT 2025 
 Total time taken to generate the page: 0.25090 seconds |