| Home » Mailing lists » Devel » [PATCH 00/11] kmem controller for memcg: stripped down version Goto Forum:
	| 
		
			| [PATCH 00/11] kmem controller for memcg: stripped down version [message #46916] | Mon, 25 June 2012 14:15  |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| Hi, 
 What I am proposing with this series is a stripped down version of the
 kmem controller for memcg that would allow us to merge significant parts
 of the infrastructure, while leaving out, for now, the polemic bits about
 the slab while it is being reworked by Cristoph.
 
 Me reasoning for that is that after the last change to introduce a gfp
 flag to mark kernel allocations, it became clear to me that tracking other
 resources like the stack would then follow extremely naturaly. I figured
 that at some point we'd have to solve the issue pointed by David, and avoid
 testing the Slab flag in the page allocator, since it would soon be made
 more generic. I do that by having the callers to explicit mark it.
 
 So to demonstrate how it would work, I am introducing a stack tracker here,
 that is already a functionality per-se: it successfully stops fork bombs to
 happen. (Sorry for doing all your work, Frederic =p ). Note that after all
 memcg infrastructure is deployed, it becomes very easy to track anything.
 The last patch of this series is extremely simple.
 
 The infrastructure is exactly the same we had in memcg, but stripped down
 of the slab parts. And because what we have after those patches is a feature
 per-se, I think it could be considered for merging.
 
 Let me know what you think.
 
 Glauber Costa (9):
 memcg: change defines to an enum
 kmem slab accounting basic infrastructure
 Add a __GFP_KMEMCG flag
 memcg: kmem controller infrastructure
 mm: Allocate kernel pages to the right memcg
 memcg: disable kmem code when not in use.
 memcg: propagate kmem limiting information to children
 memcg: allow a memcg with kmem charges to be destructed.
 protect architectures where THREAD_SIZE >= PAGE_SIZE against fork
 bombs
 
 Suleiman Souhlal (2):
 memcg: Make it possible to use the stock for more than one page.
 memcg: Reclaim when more than one page needed.
 
 include/linux/gfp.h         |   11 +-
 include/linux/memcontrol.h  |   46 +++++
 include/linux/thread_info.h |    6 +
 kernel/fork.c               |    4 +-
 mm/memcontrol.c             |  395 ++++++++++++++++++++++++++++++++++++++++---
 mm/page_alloc.c             |   27 +++
 6 files changed, 464 insertions(+), 25 deletions(-)
 
 --
 1.7.10.2
 |  
	|  |  |  
	| 
		
			| [PATCH 02/11] memcg: Reclaim when more than one page needed. [message #46917 is a reply to message #46916] | Mon, 25 June 2012 14:15   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| From: Suleiman Souhlal <ssouhlal@FreeBSD.org> 
 mem_cgroup_do_charge() was written before slab accounting, and expects
 three cases: being called for 1 page, being called for a stock of 32 pages,
 or being called for a hugepage.  If we call for 2 or 3 pages (and several
 slabs used in process creation are such, at least with the debug options I
 had), it assumed it's being called for stock and just retried without reclaiming.
 
 Fix that by passing down a minsize argument in addition to the csize.
 
 And what to do about that (csize == PAGE_SIZE && ret) retry?  If it's
 needed at all (and presumably is since it's there, perhaps to handle
 races), then it should be extended to more than PAGE_SIZE, yet how far?
 And should there be a retry count limit, of what?  For now retry up to
 COSTLY_ORDER (as page_alloc.c does), stay safe with a cond_resched(),
 and make sure not to do it if __GFP_NORETRY.
 
 [v4: fixed nr pages calculation pointed out by Christoph Lameter ]
 
 Signed-off-by: Suleiman Souhlal <suleiman@google.com>
 Signed-off-by: Glauber Costa <glommer@parallels.com>
 Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 ---
 mm/memcontrol.c |   23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 9304db2..8e601e8 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -2158,8 +2158,16 @@ enum {
 CHARGE_OOM_DIE,		/* the current is killed because of OOM */
 };
 
 +/*
 + * We need a number that is small enough to be likely to have been
 + * reclaimed even under pressure, but not too big to trigger unnecessary
 + * retries
 + */
 +#define NR_PAGES_TO_RETRY 2
 +
 static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 -				unsigned int nr_pages, bool oom_check)
 +				unsigned int nr_pages, unsigned int min_pages,
 +				bool oom_check)
 {
 unsigned long csize = nr_pages * PAGE_SIZE;
 struct mem_cgroup *mem_over_limit;
 @@ -2182,18 +2190,18 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 } else
 mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
 /*
 -	 * nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
 -	 * of regular pages (CHARGE_BATCH), or a single regular page (1).
 -	 *
 * Never reclaim on behalf of optional batching, retry with a
 * single page instead.
 */
 -	if (nr_pages == CHARGE_BATCH)
 +	if (nr_pages > min_pages)
 return CHARGE_RETRY;
 
 if (!(gfp_mask & __GFP_WAIT))
 return CHARGE_WOULDBLOCK;
 
 +	if (gfp_mask & __GFP_NORETRY)
 +		return CHARGE_NOMEM;
 +
 ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
 if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 return CHARGE_RETRY;
 @@ -2206,7 +2214,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 * unlikely to succeed so close to the limit, and we fall back
 * to regular pages anyway in case of failure.
 */
 -	if (nr_pages == 1 && ret)
 +	if (nr_pages <= NR_PAGES_TO_RETRY && ret)
 return CHARGE_RETRY;
 
 /*
 @@ -2341,7 +2349,8 @@ again:
 nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
 }
 
 -		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
 +		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
 +		    oom_check);
 switch (ret) {
 case CHARGE_OK:
 break;
 --
 1.7.10.2
 |  
	|  |  |  
	| 
		
			| [PATCH 03/11] memcg: change defines to an enum [message #46918 is a reply to message #46916] | Mon, 25 June 2012 14:15   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| This is just a cleanup patch for clarity of expression. In earlier submissions, people asked it to be in a separate
 patch, so here it is.
 
 Signed-off-by: Glauber Costa <glommer@parallels.com>
 CC: Michal Hocko <mhocko@suse.cz>
 CC: Johannes Weiner <hannes@cmpxchg.org>
 Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 ---
 mm/memcontrol.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 8e601e8..9352d40 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -387,9 +387,12 @@ enum charge_type {
 };
 
 /* for encoding cft->private value on file */
 -#define _MEM			(0)
 -#define _MEMSWAP		(1)
 -#define _OOM_TYPE		(2)
 +enum res_type {
 +	_MEM,
 +	_MEMSWAP,
 +	_OOM_TYPE,
 +};
 +
 #define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))
 #define MEMFILE_TYPE(val)	((val) >> 16 & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
 --
 1.7.10.2
 |  
	|  |  |  
	| 
		
			| [PATCH 01/11] memcg: Make it possible to use the stock for more than one page. [message #46919 is a reply to message #46916] | Mon, 25 June 2012 14:15   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| From: Suleiman Souhlal <ssouhlal@FreeBSD.org> 
 Signed-off-by: Suleiman Souhlal <suleiman@google.com>
 Signed-off-by: Glauber Costa <glommer@parallels.com>
 Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 ---
 mm/memcontrol.c |   21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index f72b5e5..9304db2 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -1967,19 +1967,22 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 /*
 - * Try to consume stocked charge on this cpu. If success, one page is consumed
 - * from local stock and true is returned. If the stock is 0 or charges from a
 - * cgroup which is not current target, returns false. This stock will be
 - * refilled.
 + * Try to consume stocked charge on this cpu. If success, nr_pages pages are
 + * consumed from local stock and true is returned. If the stock is 0 or
 + * charges from a cgroup which is not current target, returns false.
 + * This stock will be refilled.
 */
 -static bool consume_stock(struct mem_cgroup *memcg)
 +static bool consume_stock(struct mem_cgroup *memcg, int nr_pages)
 {
 struct memcg_stock_pcp *stock;
 bool ret = true;
 
 +	if (nr_pages > CHARGE_BATCH)
 +		return false;
 +
 stock = &get_cpu_var(memcg_stock);
 -	if (memcg == stock->cached && stock->nr_pages)
 -		stock->nr_pages--;
 +	if (memcg == stock->cached && stock->nr_pages >= nr_pages)
 +		stock->nr_pages -= nr_pages;
 else /* need to call res_counter_charge */
 ret = false;
 put_cpu_var(memcg_stock);
 @@ -2278,7 +2281,7 @@ again:
 VM_BUG_ON(css_is_removed(&memcg->css));
 if (mem_cgroup_is_root(memcg))
 goto done;
 -		if (nr_pages == 1 && consume_stock(memcg))
 +		if (consume_stock(memcg, nr_pages))
 goto done;
 css_get(&memcg->css);
 } else {
 @@ -2303,7 +2306,7 @@ again:
 rcu_read_unlock();
 goto done;
 }
 -		if (nr_pages == 1 && consume_stock(memcg)) {
 +		if (consume_stock(memcg, nr_pages)) {
 /*
 * It seems dagerous to access memcg without css_get().
 * But considering how consume_stok works, it's not
 --
 1.7.10.2
 |  
	|  |  |  
	| 
		
			| [PATCH 05/11] Add a __GFP_KMEMCG flag [message #46920 is a reply to message #46916] | Mon, 25 June 2012 14:15   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| This flag is used to indicate to the callees that this allocation will be serviced to the kernel. It is not supposed to be passed by the callers
 of kmem_cache_alloc, but rather by the cache core itself.
 
 CC: Christoph Lameter <cl@linux.com>
 CC: Pekka Enberg <penberg@cs.helsinki.fi>
 CC: Michal Hocko <mhocko@suse.cz>
 CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 CC: Johannes Weiner <hannes@cmpxchg.org>
 CC: Suleiman Souhlal <suleiman@google.com>
 ---
 include/linux/gfp.h |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
 
 diff --git a/include/linux/gfp.h b/include/linux/gfp.h
 index 1e49be4..8f4079f 100644
 --- a/include/linux/gfp.h
 +++ b/include/linux/gfp.h
 @@ -37,6 +37,9 @@ struct vm_area_struct;
 #define ___GFP_NO_KSWAPD	0x400000u
 #define ___GFP_OTHER_NODE	0x800000u
 #define ___GFP_WRITE		0x1000000u
 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 +#define ___GFP_KMEMCG		0x2000000u
 +#endif
 
 /*
 * GFP bitmasks..
 @@ -88,13 +91,16 @@ struct vm_area_struct;
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Allocator intends to dirty page */
 
 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 +#define __GFP_KMEMCG	((__force gfp_t)___GFP_KMEMCG)/* Allocation comes from a memcg-accounted resource */
 +#endif
 /*
 * This may seem redundant, but it's a way of annotating false positives vs.
 * allocations that simply cannot be supported (e.g. page tables).
 */
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
 -#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
 +#define __GFP_BITS_SHIFT 26	/* Room for N __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /* This equals 0, but use constants in case they ever change */
 --
 1.7.10.2
 |  
	|  |  |  
	| 
		
			| [PATCH 04/11] kmem slab accounting basic infrastructure [message #46921 is a reply to message #46916] | Mon, 25 June 2012 14:15   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| This patch adds the basic infrastructure for the accounting of the slab caches. To control that, the following files are created:
 
 * memory.kmem.usage_in_bytes
 * memory.kmem.limit_in_bytes
 * memory.kmem.failcnt
 * memory.kmem.max_usage_in_bytes
 
 They have the same meaning of their user memory counterparts. They reflect
 the state of the "kmem" res_counter.
 
 The code is not enabled until a limit is set. This can be tested by the flag
 "kmem_accounted". This means that after the patch is applied, no behavioral
 changes exists for whoever is still using memcg to control their memory usage.
 
 We always account to both user and kernel resource_counters. This effectively
 means that an independent kernel limit is in place when the limit is set
 to a lower value than the user memory. A equal or higher value means that the
 user limit will always hit first, meaning that kmem is effectively unlimited.
 
 People who want to track kernel memory but not limit it, can set this limit
 to a very high number (like RESOURCE_MAX - 1page - that no one will ever hit,
 or equal to the user memory)
 
 Signed-off-by: Glauber Costa <glommer@parallels.com>
 CC: Michal Hocko <mhocko@suse.cz>
 CC: Johannes Weiner <hannes@cmpxchg.org>
 Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 ---
 mm/memcontrol.c |   78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 77 insertions(+), 1 deletion(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 9352d40..6f34b77 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -265,6 +265,10 @@ struct mem_cgroup {
 };
 
 /*
 +	 * the counter to account for kernel memory usage.
 +	 */
 +	struct res_counter kmem;
 +	/*
 * Per cgroup active and inactive list, similar to the
 * per zone LRU lists.
 */
 @@ -279,6 +283,7 @@ struct mem_cgroup {
 * Should the accounting and control be hierarchical, per subtree?
 */
 bool use_hierarchy;
 +	bool kmem_accounted;
 
 bool		oom_lock;
 atomic_t	under_oom;
 @@ -391,6 +396,7 @@ enum res_type {
 _MEM,
 _MEMSWAP,
 _OOM_TYPE,
 +	_KMEM,
 };
 
 #define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))
 @@ -1438,6 +1444,10 @@ done:
 res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
 res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
 res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
 +	printk(KERN_INFO "kmem: usage %llukB, limit %llukB, failcnt %llu\n",
 +		res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
 +		res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
 +		res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
 }
 
 /*
 @@ -3879,6 +3889,11 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
 else
 val = res_counter_read_u64(&memcg->memsw, name);
 break;
 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 +	case _KMEM:
 +		val = res_counter_read_u64(&memcg->kmem, name);
 +		break;
 +#endif
 default:
 BUG();
 }
 @@ -3916,8 +3931,26 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 break;
 if (type == _MEM)
 ret = mem_cgroup_resize_limit(memcg, val);
 -		else
 +		else if (type == _MEMSWAP)
 ret = mem_cgroup_resize_memsw_limit(memcg, val);
 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 +		else if (type == _KMEM) {
 +			ret = res_counter_set_limit(&memcg->kmem, val);
 +			if (ret)
 +				break;
 +			/*
 +			 * Once enabled, can't be disabled. We could in theory
 +			 * disable it if we haven't yet created any caches, or
 +			 * if we can shrink them all to death.
 +			 *
 +			 * But it is not worth the trouble
 +			 */
 +			if (!memcg->kmem_accounted && val != RESOURCE_MAX)
 +				memcg->kmem_accounted = true;
 +		}
 +#endif
 +		else
 +			return -EINVAL;
 break;
 case RES_SOFT_LIMIT:
 ret = res_counter_memparse_write_strategy(buffer, &val);
 @@ -3982,12 +4015,20 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
 case RES_MAX_USAGE:
 if (type == _MEM)
 res_counter_reset_max(&memcg->res);
 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 +		else if (type == _KMEM)
 +			res_counter_reset_max(&memcg->kmem);
 +#endif
 else
 res_counter_reset_max(&memcg->memsw);
 break;
 case RES_FAILCNT:
 if (type == _MEM)
 res_counter_reset_failcnt(&memcg->res);
 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 +		else if (type == _KMEM)
 +			res_counter_reset_failcnt(&memcg->kmem);
 +#endif
 else
 res_counter_reset_failcnt(&memcg->memsw);
 break;
 @@ -4549,6 +4590,33 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 +static struct cftype kmem_cgroup_files[] = {
 +	{
 +		.name = "kmem.limit_in_bytes",
 +		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
 +		.write_string = mem_cgroup_write,
 +		.read = mem_cgroup_read,
 +	},
 +	{
 +		.name = "kmem.usage_in_bytes",
 +		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
 +		.read = mem_cgroup_read,
 +	},
 +	{
 +		.name = "kmem.failcnt",
 +		.private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
 +		.trigger = mem_cgroup_reset,
 +		.read = mem_cgroup_read,
 +	},
 +	{
 +		.name = "kmem.max_usage_in_bytes",
 +		.private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
 +		.trigger = mem_cgroup_reset,
 +		.read = mem_cgroup_read,
 +	},
 +	{},
 +};
 +
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
 return mem_cgroup_sockets_init(memcg, ss);
 @@ -4892,6 +4960,12 @@ mem_cgroup_create(struct cgroup *cont)
 int cpu;
 enable_swap_cgroup();
 parent = NULL;
 +
 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 +		WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
 +					   kmem_cgroup_files));
 +#endif
 +
 if (mem_cgroup_soft_limit_tree_init())
 goto free_out;
 root_mem_cgroup = memcg;
 @@ -4910,6 +4984,7 @@ mem_cgroup_create(struct cgroup *cont)
 if (parent && parent->use_hierarchy) {
 res_counter_init(&memcg->res, &parent->res);
 res_counter_init(&memcg->memsw, &parent->memsw);
 +		res_counter_init(&memcg->kmem, &parent->kmem);
 /*
 * We increment refcnt of the parent to ensure that we can
 * safely access it on res_counter_charge/uncharge.
 @@ -4920,6 +4995,7 @@ mem_cgroup_create(struct cgroup *cont)
 } else {
 res_counter_init(&memcg->res, NULL);
 res_counter_init(&memcg->memsw, NULL);
 +		res_counter_init(&memcg->kmem, NULL);
 }
 memcg->last_scanned_node = MAX_NUMNODES;
 INIT_LIST_HEAD(&memcg->oom_notify);
 --
 1.7.10.2
 |  
	|  |  |  
	| 
		
			| [PATCH 10/11] memcg: allow a memcg with kmem charges to be destructed. [message #46922 is a reply to message #46916] | Mon, 25 June 2012 14:15   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| Because the ultimate goal of the kmem tracking in memcg is to track slab pages as well, we can't guarantee that we'll always
 be able to point a page to a particular process, and migrate
 the charges along with it - since in the common case, a page
 will contain data belonging to multiple processes.
 
 Because of that, when we destroy a memcg, we only make sure
 the destruction will succeed by discounting the kmem charges
 from the user charges when we try to empty the cgroup.
 
 Signed-off-by: Glauber Costa <glommer@parallels.com>
 CC: Christoph Lameter <cl@linux.com>
 CC: Pekka Enberg <penberg@cs.helsinki.fi>
 CC: Michal Hocko <mhocko@suse.cz>
 CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 CC: Johannes Weiner <hannes@cmpxchg.org>
 CC: Suleiman Souhlal <suleiman@google.com>
 ---
 mm/memcontrol.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index a6a440b..bb9b6fe 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -598,6 +598,11 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
 {
 if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
 static_key_slow_dec(&mem_cgroup_kmem_enabled_key);
 +	/*
 +	 * This check can't live in kmem destruction function,
 +	 * since the charges will outlive the cgroup
 +	 */
 +	BUG_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
 }
 #else
 static void disarm_kmem_keys(struct mem_cgroup *memcg)
 @@ -3838,6 +3843,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
 int node, zid, shrink;
 int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 struct cgroup *cgrp = memcg->css.cgroup;
 +	u64 usage;
 
 css_get(&memcg->css);
 
 @@ -3877,8 +3883,10 @@ move_account:
 if (ret == -ENOMEM)
 goto try_to_free;
 cond_resched();
 +		usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
 +			res_counter_read_u64(&memcg->kmem, RES_USAGE);
 /* "ret" should also be checked to ensure all lists are empty. */
 -	} while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);
 +	} while (usage > 0 || ret);
 out:
 css_put(&memcg->css);
 return ret;
 --
 1.7.10.2
 |  
	|  |  |  
	| 
		
			| [PATCH 07/11] mm: Allocate kernel pages to the right memcg [message #46923 is a reply to message #46916] | Mon, 25 June 2012 14:15   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| When a process tries to allocate a page with the __GFP_KMEMCG flag, the page allocator will call the corresponding memcg functions to
 validate the allocation. Tasks in the root memcg can always proceed.
 
 To avoid adding markers to the page - and a kmem flag that would
 necessarily follow, as much as doing page_cgroup lookups for no
 reason, whoever is marking its allocations with __GFP_KMEMCG flag
 is responsible for telling the page allocator that this is such an
 allocation at free_pages() time. This is done by the invocation of
 __free_accounted_pages() and free_accounted_pages().
 
 Signed-off-by: Glauber Costa <glommer@parallels.com>
 CC: Christoph Lameter <cl@linux.com>
 CC: Pekka Enberg <penberg@cs.helsinki.fi>
 CC: Michal Hocko <mhocko@suse.cz>
 CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 CC: Johannes Weiner <hannes@cmpxchg.org>
 CC: Suleiman Souhlal <suleiman@google.com>
 ---
 include/linux/gfp.h |    3 +++
 mm/page_alloc.c     |   27 +++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)
 
 diff --git a/include/linux/gfp.h b/include/linux/gfp.h
 index 8f4079f..4e27cd8 100644
 --- a/include/linux/gfp.h
 +++ b/include/linux/gfp.h
 @@ -368,6 +368,9 @@ extern void free_pages(unsigned long addr, unsigned int order);
 extern void free_hot_cold_page(struct page *page, int cold);
 extern void free_hot_cold_page_list(struct list_head *list, int cold);
 
 +extern void __free_accounted_pages(struct page *page, unsigned int order);
 +extern void free_accounted_pages(unsigned long addr, unsigned int order);
 +
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr), 0)
 
 diff --git a/mm/page_alloc.c b/mm/page_alloc.c
 index 4403009..cb8867e 100644
 --- a/mm/page_alloc.c
 +++ b/mm/page_alloc.c
 @@ -2479,6 +2479,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 struct page *page = NULL;
 int migratetype = allocflags_to_migratetype(gfp_mask);
 unsigned int cpuset_mems_cookie;
 +	void *handle = NULL;
 
 gfp_mask &= gfp_allowed_mask;
 
 @@ -2490,6 +2491,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 return NULL;
 
 /*
 +	 * Will only have any effect when __GFP_KMEMCG is set.
 +	 * This is verified in the (always inline) callee
 +	 */
 +	if (!mem_cgroup_new_kmem_page(gfp_mask, &handle, order))
 +		return NULL;
 +
 +	/*
 * Check the zones suitable for the gfp_mask contain at least one
 * valid zone. It's possible to have an empty zonelist as a result
 * of GFP_THISNODE and a memoryless node
 @@ -2528,6 +2536,8 @@ out:
 if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 goto retry_cpuset;
 
 +	mem_cgroup_commit_kmem_page(page, handle, order);
 +
 return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
 @@ -2580,6 +2590,23 @@ void free_pages(unsigned long addr, unsigned int order)
 
 EXPORT_SYMBOL(free_pages);
 
 +void __free_accounted_pages(struct page *page, unsigned int order)
 +{
 +	mem_cgroup_free_kmem_page(page, order);
 +	__free_pages(page, order);
 +}
 +EXPORT_SYMBOL(__free_accounted_pages);
 +
 +void free_accounted_pages(unsigned long addr, unsigned int order)
 +{
 +	if (addr != 0) {
 +		VM_BUG_ON(!virt_addr_valid((void *)addr));
 +		mem_cgroup_free_kmem_page(virt_to_page((void *)addr), order);
 +		__free_pages(virt_to_page((void *)addr), order);
 +	}
 +}
 +EXPORT_SYMBOL(free_accounted_pages);
 +
 static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
 {
 if (addr) {
 --
 1.7.10.2
 |  
	|  |  |  
	| 
		
			| [PATCH 11/11] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs [message #46924 is a reply to message #46916] | Mon, 25 June 2012 14:15   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| Because those architectures will draw their stacks directly from the page allocator, rather than the slab cache, we can directly
 pass __GFP_KMEMCG flag, and issue the corresponding free_pages.
 
 This code path is taken when the architecture doesn't define
 CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
 THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the
 remaining architectures fall in this category.
 
 This will guarantee that every stack page is accounted to the memcg
 the process currently lives on, and will have the allocations to fail
 if they go over limit.
 
 For the time being, I am defining a new variant of THREADINFO_GFP, not
 to mess with the other path. Once the slab is also tracked by memcg,
 we can get rid of that flag.
 
 Tested to successfully protect against :(){ :|:& };:
 
 Signed-off-by: Glauber Costa <glommer@parallels.com>
 CC: Christoph Lameter <cl@linux.com>
 CC: Pekka Enberg <penberg@cs.helsinki.fi>
 CC: Michal Hocko <mhocko@suse.cz>
 CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 CC: Johannes Weiner <hannes@cmpxchg.org>
 CC: Suleiman Souhlal <suleiman@google.com>
 CC: Frederic Weisbecker <fweisbec@redhat.com>
 ---
 include/linux/thread_info.h |    6 ++++++
 kernel/fork.c               |    4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)
 
 diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
 index ccc1899..914ec07 100644
 --- a/include/linux/thread_info.h
 +++ b/include/linux/thread_info.h
 @@ -61,6 +61,12 @@ extern long do_no_restart_syscall(struct restart_block *parm);
 # define THREADINFO_GFP		(GFP_KERNEL | __GFP_NOTRACK)
 #endif
 
 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 +# define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
 +#else
 +# define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP)
 +#endif
 +
 /*
 * flag set/clear/test wrappers
 * - pass TIF_xxxx constants to these functions
 diff --git a/kernel/fork.c b/kernel/fork.c
 index ab5211b..06b414f 100644
 --- a/kernel/fork.c
 +++ b/kernel/fork.c
 @@ -142,7 +142,7 @@ void __weak arch_release_thread_info(struct thread_info *ti) { }
 static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 int node)
 {
 -	struct page *page = alloc_pages_node(node, THREADINFO_GFP,
 +	struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
 THREAD_SIZE_ORDER);
 
 return page ? page_address(page) : NULL;
 @@ -151,7 +151,7 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 static inline void free_thread_info(struct thread_info *ti)
 {
 arch_release_thread_info(ti);
 -	free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
 +	free_accounted_pages((unsigned long)ti, THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_info_cache;
 --
 1.7.10.2
 |  
	|  |  |  
	| 
		
			| [PATCH 06/11] memcg: kmem controller infrastructure [message #46925 is a reply to message #46916] | Mon, 25 June 2012 14:15   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| This patch introduces infrastructure for tracking kernel memory pages to a given memcg. This will happen whenever the caller includes the
 flag __GFP_KMEMCG flag, and the task belong to a memcg other than
 the root.
 
 In memcontrol.h those functions are wrapped in inline acessors.
 The idea is to later on, patch those with jump labels, so we don't
 incur any overhead when no mem cgroups are being used.
 
 Signed-off-by: Glauber Costa <glommer@parallels.com>
 CC: Christoph Lameter <cl@linux.com>
 CC: Pekka Enberg <penberg@cs.helsinki.fi>
 CC: Michal Hocko <mhocko@suse.cz>
 CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 CC: Johannes Weiner <hannes@cmpxchg.org>
 ---
 include/linux/memcontrol.h |   44 ++++++++++++
 mm/memcontrol.c            |  172 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 216 insertions(+)
 
 diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
 index 83e7ba9..22479eb 100644
 --- a/include/linux/memcontrol.h
 +++ b/include/linux/memcontrol.h
 @@ -21,6 +21,7 @@
 #define _LINUX_MEMCONTROL_H
 #include <linux/cgroup.h>
 #include <linux/vm_event_item.h>
 +#include <linux/hardirq.h>
 
 struct mem_cgroup;
 struct page_cgroup;
 @@ -409,6 +410,12 @@ struct sock;
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
 +
 +#define mem_cgroup_kmem_on 1
 +bool __mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order);
 +void __mem_cgroup_commit_kmem_page(struct page *page, void *handle, int order);
 +void __mem_cgroup_free_kmem_page(struct page *page, int order);
 +#define is_kmem_tracked_alloc (gfp & __GFP_KMEMCG)
 #else
 static inline void sock_update_memcg(struct sock *sk)
 {
 @@ -416,6 +423,43 @@ static inline void sock_update_memcg(struct sock *sk)
 static inline void sock_release_memcg(struct sock *sk)
 {
 }
 +
 +#define mem_cgroup_kmem_on 0
 +#define __mem_cgroup_new_kmem_page(a, b, c) false
 +#define __mem_cgroup_free_kmem_page(a,b )
 +#define __mem_cgroup_commit_kmem_page(a, b, c)
 +#define is_kmem_tracked_alloc (false)
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 +
 +static __always_inline
 +bool mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order)
 +{
 +	if (!mem_cgroup_kmem_on)
 +		return true;
 +	if (!is_kmem_tracked_alloc)
 +		return true;
 +	if (!current->mm)
 +		return true;
 +	if (in_interrupt())
 +		return true;
 +	if (gfp & __GFP_NOFAIL)
 +		return true;
 +	return __mem_cgroup_new_kmem_page(gfp, handle, order);
 +}
 +
 +static __always_inline
 +void mem_cgroup_free_kmem_page(struct page *page, int order)
 +{
 +	if (mem_cgroup_kmem_on)
 +		__mem_cgroup_free_kmem_page(page, order);
 +}
 +
 +static __always_inline
 +void mem_cgroup_commit_kmem_page(struct page *page, struct mem_cgroup *handle,
 +				 int order)
 +{
 +	if (mem_cgroup_kmem_on)
 +		__mem_cgroup_commit_kmem_page(page, handle, order);
 +}
 #endif /* _LINUX_MEMCONTROL_H */
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 6f34b77..27b2b6f 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -10,6 +10,10 @@
 * Copyright (C) 2009 Nokia Corporation
 * Author: Kirill A. Shutemov
 *
 + * Kernel Memory Controller
 + * Copyright (C) 2012 Parallels Inc. and Google Inc.
 + * Authors: Glauber Costa and Suleiman Souhlal
 + *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 @@ -422,6 +426,9 @@ static void mem_cgroup_put(struct mem_cgroup *memcg);
 #include <net/ip.h>
 
 static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
 +static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta);
 +static void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta);
 +
 void sock_update_memcg(struct sock *sk)
 {
 if (mem_cgroup_sockets_enabled) {
 @@ -476,6 +483,105 @@ struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
 }
 EXPORT_SYMBOL(tcp_proto_cgroup);
 #endif /* CONFIG_INET */
 +
 +static inline bool mem_cgroup_kmem_enabled(struct mem_cgroup *memcg)
 +{
 +	return !mem_cgroup_disabled() && memcg &&
 +	       !mem_cgroup_is_root(memcg) && memcg->kmem_accounted;
 +}
 +
 +bool __mem_cgroup_new_kmem_page(gfp_t gfp, void *_handle, int order)
 +{
 +	struct mem_cgroup *memcg;
 +	struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
 +	bool ret = true;
 +	size_t size;
 +	struct task_struct *p;
 +
 +	*handle = NULL;
 +	rcu_read_lock();
 +	p = rcu_dereference(current->mm->owner);
 +	memcg = mem_cgroup_from_task(p);
 +	if (!mem_cgroup_kmem_enabled(memcg))
 +		goto out;
 +
 +	mem_cgroup_get(memcg);
 +
 +	size = (1 << order) << PAGE_SHIFT;
 +	ret = memcg_charge_kmem(memcg, gfp, size) == 0;
 +	if (!ret) {
 +		mem_cgroup_put(memcg);
 +		goto out;
 +	}
 +
 +	*handle = memcg;
 +out:
 +	rcu_read_unlock();
 +	return ret;
 +}
 +EXPORT_SYMBOL(__mem_cgroup_new_kmem_page);
 +
 +void __mem_cgroup_commit_kmem_page(struct page *page, void *handle, int order)
 +{
 +	struct page_cgroup *pc;
 +	struct mem_cgroup *memcg = handle;
 +	size_t size;
 +
 +	if (!memcg)
 +		return;
 +
 +	WARN_ON(mem_cgroup_is_root(memcg));
 +	/* The page allocation must have failed. Revert */
 +	if (!page) {
 +		size = (1 << order) << PAGE_SHIFT;
 +		memcg_uncharge_kmem(memcg, size);
 +		mem_cgroup_put(memcg);
 +		return;
 +	}
 +
 +	pc = lookup_page_cgroup(page);
 +	lock_page_cgroup(pc);
 +	pc->mem_cgroup = memcg;
 +	SetPageCgroupUsed(pc);
 +	unlock_page_cgroup(pc);
 +}
 +void __mem_cgroup_free_kmem_page(struct page *page, int order)
 +{
 +	struct mem_cgroup *memcg;
 +	size_t size;
 +	struct page_cgroup *pc;
 +
 +	if (mem_cgroup_disabled())
 +		return;
 +
 +	pc = lookup_page_cgroup(page);
 +	lock_page_cgroup(pc);
 +	memcg = pc->mem_cgroup;
 +	pc->mem_cgroup = NULL;
 +	if (!PageCgroupUsed(pc)) {
 +		unlock_page_cgroup(pc);
 +		return;
 +	}
 +	ClearPageCgroupUsed(pc);
 +	unlock_page_cgroup(pc);
 +
 +	/*
 +	 * The classical disabled check won't work
 +	 * for uncharge, since it is possible that the user enabled
 +	 * kmem tracking, allocated, and then disabled.
 +	 *
 +	 * We trust if there is a memcg associated with the page,
 +	 * it is a valid allocation
 +	 */
 +	if (!memcg)
 +		return;
 +
 +	WARN_ON(mem_cgroup_is_root(memcg));
 +	size = (1 << order) << PAGE_SHIFT;
 +	memcg_uncharge_kmem(memcg, size);
 +	mem_cgroup_put(memcg);
 +}
 +EXPORT_SYMBOL(__mem_cgroup_free_kmem_page);
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 
 #if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
 @@ -5645,3 +5751,69 @@ static int __init enable_swap_account(char *s)
 __setup("swapaccount=", enable_swap_account);
 
 #endif
 +
 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
 +{
 +	struct res_counter *fail_res;
 +	struct mem_cgroup *_memcg;
 +	int may_oom, ret;
 +	bool nofail = false;
 +
 +	may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
 +	    !(gfp & __GFP_NORETRY);
 +
 +	ret = 0;
 +
 +	if (!memcg)
 +		return ret;
 +
 +	_memcg = memcg;
 +	ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
 +	    &_memcg, may_oom);
 +
 +	if (ret == -EINTR)  {
 +		nofail = true;
 +		/*
 +		 * __mem_cgroup_try_charge() chose to bypass to root due
 +		 * to OOM kill or fatal signal.
 +		 * Since our only options are to either fail the
 +		 * allocation or charge it to this cgroup, do it as
 +		 * a temporary condition. But we can't fail. From a kmem/slab
 +		 * perspective, the cache has already been selected, by
 +		 * mem_cgroup_get_kmem_cache(), so it is too late to change our
 +		 * minds
 +		 */
 +		res_counter_charge_nofail(&memcg->res, delta, &fail_res);
 +		if (do_swap_account)
 +			res_counter_charge_nofail(&memcg->memsw, delta,
 +						  &fail_res);
 +		ret = 0;
 +	} else if (ret == -ENOMEM)
 +		return ret;
 +
 +	if (nofail)
 +		res_counter_charge_nofail(&memcg->kmem, delta, &fail_res);
 +	else
 +		ret = res_counter_charge(&memcg->kmem, delta, &fail_res);
 +
 +	if (ret) {
 +		res_counter_uncharge(&memcg->res, delta);
 +		if (do_swap_account)
 +			res_counter_uncharge(&memcg->memsw, delta);
 +	}
 +
 +	return ret;
 +}
 +
 +void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta)
 +{
 +	if (!memcg)
 +		return;
 +
 +	res_counter_uncharge(&memcg->kmem, delta);
 +	res_counter_uncharge(&memcg->res, delta);
 +	if (do_swap_account)
 +		res_counter_uncharge(&memcg->memsw, delta);
 +}
 +#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 --
 1.7.10.2
 |  
	|  |  |  
	| 
		
			| [PATCH 09/11] memcg: propagate kmem limiting information to children [message #46926 is a reply to message #46916] | Mon, 25 June 2012 14:15   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| The current memcg slab cache management fails to present satisfatory hierarchical behavior in the following scenario:
 
 -> /cgroups/memory/A/B/C
 
 * kmem limit set at A
 * A and B empty taskwise
 * bash in C does find /
 
 Because kmem_accounted is a boolean that was not set for C, no accounting
 would be done. This is, however, not what we expect.
 
 The basic idea, is that when a cgroup is limited, we walk the tree
 upwards (something Kame and I already thought about doing for other purposes),
 and make sure that we store the information about the parent being limited in
 kmem_accounted (that is turned into a bitmap: two booleans would not be space
 efficient). The code for that is taken from sched/core.c. My reasons for not
 putting it into a common place is to dodge the type issues that would arise
 from a common implementation between memcg and the scheduler - but I think
 that it should ultimately happen, so if you want me to do it now, let me
 know.
 
 We do the reverse operation when a formerly limited cgroup becomes unlimited.
 
 Signed-off-by: Glauber Costa <glommer@parallels.com>
 CC: Christoph Lameter <cl@linux.com>
 CC: Pekka Enberg <penberg@cs.helsinki.fi>
 CC: Michal Hocko <mhocko@suse.cz>
 CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 CC: Johannes Weiner <hannes@cmpxchg.org>
 CC: Suleiman Souhlal <suleiman@google.com>
 ---
 mm/memcontrol.c |   86 +++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 71 insertions(+), 15 deletions(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index fe5388e..a6a440b 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -287,7 +287,11 @@ struct mem_cgroup {
 * Should the accounting and control be hierarchical, per subtree?
 */
 bool use_hierarchy;
 -	bool kmem_accounted;
 +	/*
 +	 * bit0: accounted by this cgroup
 +	 * bit1: accounted by a parent.
 +	 */
 +	volatile unsigned long kmem_accounted;
 
 bool		oom_lock;
 atomic_t	under_oom;
 @@ -340,6 +344,9 @@ struct mem_cgroup {
 #endif
 };
 
 +#define KMEM_ACCOUNTED_THIS	0
 +#define KMEM_ACCOUNTED_PARENT	1
 +
 /* Stuffs for move charges at task migration. */
 /*
 * Types of charges to be moved. "move_charge_at_immitgrate" is treated as a
 @@ -589,7 +596,7 @@ EXPORT_SYMBOL(__mem_cgroup_free_kmem_page);
 
 static void disarm_kmem_keys(struct mem_cgroup *memcg)
 {
 -	if (memcg->kmem_accounted)
 +	if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
 static_key_slow_dec(&mem_cgroup_kmem_enabled_key);
 }
 #else
 @@ -4027,6 +4034,66 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
 len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
 return simple_read_from_buffer(buf, nbytes, ppos, str, len);
 }
 +
 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 +static void mem_cgroup_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
 +{
 +	struct mem_cgroup *iter;
 +
 +	mutex_lock(&set_limit_mutex);
 +	if (!test_and_set_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted) &&
 +		val != RESOURCE_MAX) {
 +
 +		/*
 +		 * Once enabled, can't be disabled. We could in theory
 +		 * disable it if we haven't yet created any caches, or
 +		 * if we can shrink them all to death.
 +		 *
 +		 * But it is not worth the trouble
 +		 */
 +		static_key_slow_inc(&mem_cgroup_kmem_enabled_key);
 +
 +		if (!memcg->use_hierarchy)
 +			goto out;
 +
 +		for_each_mem_cgroup_tree(iter, memcg) {
 +			if (iter == memcg)
 +				continue;
 +			set_bit(KMEM_ACCOUNTED_PARENT, &iter->kmem_accounted);
 +		}
 +
 +	} else if (test_and_clear_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted)
 +		&& val == RESOURCE_MAX) {
 +
 +		if (!memcg->use_hierarchy)
 +			goto out;
 +
 +		for_each_mem_cgroup_tree(iter, memcg) {
 +			struct mem_cgroup *parent;
 +			if (iter == memcg)
 +				continue;
 +			/*
 +			 * We should only have our parent bit cleared if none of
 +			 * ouri parents are accounted. The transversal order of
 +			 * our iter function forces us to always look at the
 +			 * parents.
 +			 */
 +			parent = parent_mem_cgroup(iter);
 +			while (parent && (parent != memcg)) {
 +				if (test_bit(KMEM_ACCOUNTED_THIS, &parent->kmem_accounted))
 +					goto noclear;
 +
 +				parent = parent_mem_cgroup(parent);
 +			}
 +			clear_bit(KMEM_ACCOUNTED_PARENT, &iter->kmem_accounted);
 +noclear:
 +			continue;
 +		}
 +	}
 +out:
 +	mutex_unlock(&set_limit_mutex);
 +}
 +#endif
 /*
 * The user of this function is...
 * RES_LIMIT.
 @@ -4064,19 +4131,8 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 ret = res_counter_set_limit(&memcg->kmem, val);
 if (ret)
 break;
 -			/*
 -			 * Once enabled, can't be disabled. We could in theory
 -			 * disable it if we haven't yet created any caches, or
 -			 * if we can shrink them all to death.
 -			 *
 -			 * But it is not worth the trouble
 -			 */
 -			mutex_lock(&set_limit_mutex);
 -			if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
 -				static_key_slow_inc(&mem_cgroup_kmem_enabled_key);
 -				memcg->kmem_accounted = true;
 -			}
 -			mutex_unlock(&set_limit_mutex);
 +			mem_cgroup_update_kmem_limit(memcg, val);
 +			break;
 }
 #endif
 else
 --
 1.7.10.2
 |  
	|  |  |  
	| 
		
			| [PATCH 08/11] memcg: disable kmem code when not in use. [message #46927 is a reply to message #46916] | Mon, 25 June 2012 14:15   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| We can use jump labels to patch the code in or out when not used.
 
 Because the assignment: memcg->kmem_accounted = true
 is done after the jump labels increment, we guarantee
 that the root memcg will always be selected until
 all call sites are patched (see mem_cgroup_kmem_enabled).
 This guarantees that no mischarges are applied.
 
 Jump label decrement happens when the last reference
 count from the memcg dies. This will only happen when
 the caches are all dead.
 
 Signed-off-by: Glauber Costa <glommer@parallels.com>
 CC: Christoph Lameter <cl@linux.com>
 CC: Pekka Enberg <penberg@cs.helsinki.fi>
 CC: Michal Hocko <mhocko@suse.cz>
 CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 CC: Johannes Weiner <hannes@cmpxchg.org>
 CC: Suleiman Souhlal <suleiman@google.com>
 ---
 include/linux/memcontrol.h |    4 +++-
 mm/memcontrol.c            |   28 ++++++++++++++++++++++++++--
 2 files changed, 29 insertions(+), 3 deletions(-)
 
 diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
 index 22479eb..4d69ff8 100644
 --- a/include/linux/memcontrol.h
 +++ b/include/linux/memcontrol.h
 @@ -22,6 +22,7 @@
 #include <linux/cgroup.h>
 #include <linux/vm_event_item.h>
 #include <linux/hardirq.h>
 +#include <linux/jump_label.h>
 
 struct mem_cgroup;
 struct page_cgroup;
 @@ -411,7 +412,8 @@ struct sock;
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
 
 -#define mem_cgroup_kmem_on 1
 +extern struct static_key mem_cgroup_kmem_enabled_key;
 +#define mem_cgroup_kmem_on static_key_false(&mem_cgroup_kmem_enabled_key)
 bool __mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order);
 void __mem_cgroup_commit_kmem_page(struct page *page, void *handle, int order);
 void __mem_cgroup_free_kmem_page(struct page *page, int order);
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 27b2b6f..fe5388e 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -425,6 +425,10 @@ static void mem_cgroup_put(struct mem_cgroup *memcg);
 #include <net/sock.h>
 #include <net/ip.h>
 
 +struct static_key mem_cgroup_kmem_enabled_key;
 +/* so modules can inline the checks */
 +EXPORT_SYMBOL(mem_cgroup_kmem_enabled_key);
 +
 static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
 static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta);
 static void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta);
 @@ -582,6 +586,16 @@ void __mem_cgroup_free_kmem_page(struct page *page, int order)
 mem_cgroup_put(memcg);
 }
 EXPORT_SYMBOL(__mem_cgroup_free_kmem_page);
 +
 +static void disarm_kmem_keys(struct mem_cgroup *memcg)
 +{
 +	if (memcg->kmem_accounted)
 +		static_key_slow_dec(&mem_cgroup_kmem_enabled_key);
 +}
 +#else
 +static void disarm_kmem_keys(struct mem_cgroup *memcg)
 +{
 +}
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 
 #if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
 @@ -597,6 +611,12 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
 }
 #endif
 
 +static void disarm_static_keys(struct mem_cgroup *memcg)
 +{
 +	disarm_sock_keys(memcg);
 +	disarm_kmem_keys(memcg);
 +}
 +
 static void drain_all_stock_async(struct mem_cgroup *memcg);
 
 static struct mem_cgroup_per_zone *
 @@ -4051,8 +4071,12 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 *
 * But it is not worth the trouble
 */
 -			if (!memcg->kmem_accounted && val != RESOURCE_MAX)
 +			mutex_lock(&set_limit_mutex);
 +			if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
 +				static_key_slow_inc(&mem_cgroup_kmem_enabled_key);
 memcg->kmem_accounted = true;
 +			}
 +			mutex_unlock(&set_limit_mutex);
 }
 #endif
 else
 @@ -4927,7 +4951,7 @@ static void free_work(struct work_struct *work)
 * to move this code around, and make sure it is outside
 * the cgroup_lock.
 */
 -	disarm_sock_keys(memcg);
 +	disarm_static_keys(memcg);
 if (size < PAGE_SIZE)
 kfree(memcg);
 else
 --
 1.7.10.2
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH 06/11] memcg: kmem controller infrastructure [message #46929 is a reply to message #46925] | Mon, 25 June 2012 18:06   |  
			| 
				
				
					|  Tejun Heo Messages: 184
 Registered: November 2006
 | Senior Member |  |  |  
	| Again, nits. 
 On Mon, Jun 25, 2012 at 06:15:23PM +0400, Glauber Costa wrote:
 > +#define mem_cgroup_kmem_on 1
 > +bool __mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order);
 > +void __mem_cgroup_commit_kmem_page(struct page *page, void *handle, int order);
 > +void __mem_cgroup_free_kmem_page(struct page *page, int order);
 > +#define is_kmem_tracked_alloc (gfp & __GFP_KMEMCG)
 
 Ugh... please do the following instead.
 
 static inline bool is_kmem_tracked_alloc(gfp_t gfp)
 {
 return gfp & __GFP_KMEMCG;
 }
 
 >  #else
 >  static inline void sock_update_memcg(struct sock *sk)
 >  {
 > @@ -416,6 +423,43 @@ static inline void sock_update_memcg(struct sock *sk)
 >  static inline void sock_release_memcg(struct sock *sk)
 >  {
 >  }
 > +
 > +#define mem_cgroup_kmem_on 0
 > +#define __mem_cgroup_new_kmem_page(a, b, c) false
 > +#define __mem_cgroup_free_kmem_page(a,b )
 > +#define __mem_cgroup_commit_kmem_page(a, b, c)
 > +#define is_kmem_tracked_alloc (false)
 
 I would prefer static inlines here too.  It's a bit more code in the
 header but leads to less surprises (e.g. arg evals w/ side effects or
 compiler warning about unused vars) and makes it easier to avoid
 cosmetic errors.
 
 Thanks.
 
 --
 tejun
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH 09/11] memcg: propagate kmem limiting information to children [message #46931 is a reply to message #46926] | Mon, 25 June 2012 18:29   |  
			| 
				
				
					|  Tejun Heo Messages: 184
 Registered: November 2006
 | Senior Member |  |  |  
	| Feeling like a nit pervert but.. 
 On Mon, Jun 25, 2012 at 06:15:26PM +0400, Glauber Costa wrote:
 > @@ -287,7 +287,11 @@ struct mem_cgroup {
 >  	 * Should the accounting and control be hierarchical, per subtree?
 >  	 */
 >  	bool use_hierarchy;
 > -	bool kmem_accounted;
 > +	/*
 > +	 * bit0: accounted by this cgroup
 > +	 * bit1: accounted by a parent.
 > +	 */
 > +	volatile unsigned long kmem_accounted;
 
 Is the volatile declaration really necessary?  Why is it necessary?
 Why no comment explaining it?
 
 > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 > +static void mem_cgroup_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
 > +{
 > +	struct mem_cgroup *iter;
 > +
 > +	mutex_lock(&set_limit_mutex);
 > +	if (!test_and_set_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted) &&
 > +		val != RESOURCE_MAX) {
 > +
 > +		/*
 > +		 * Once enabled, can't be disabled. We could in theory
 > +		 * disable it if we haven't yet created any caches, or
 > +		 * if we can shrink them all to death.
 > +		 *
 > +		 * But it is not worth the trouble
 > +		 */
 > +		static_key_slow_inc(&mem_cgroup_kmem_enabled_key);
 > +
 > +		if (!memcg->use_hierarchy)
 > +			goto out;
 > +
 > +		for_each_mem_cgroup_tree(iter, memcg) {
 > +			if (iter == memcg)
 > +				continue;
 > +			set_bit(KMEM_ACCOUNTED_PARENT, &iter->kmem_accounted);
 > +		}
 > +
 > +	} else if (test_and_clear_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted)
 > +		&& val == RESOURCE_MAX) {
 > +
 > +		if (!memcg->use_hierarchy)
 > +			goto out;
 > +
 > +		for_each_mem_cgroup_tree(iter, memcg) {
 > +			struct mem_cgroup *parent;
 
 Blank line between decl and body please.
 
 > +			if (iter == memcg)
 > +				continue;
 > +			/*
 > +			 * We should only have our parent bit cleared if none of
 > +			 * ouri parents are accounted. The transversal order of
 
 ^ type
 
 > +			 * our iter function forces us to always look at the
 > +			 * parents.
 
 Also, it's okay here but the text filling in comments and patch
 descriptions tend to be quite inconsistent.  If you're on emacs, alt-q
 is your friend and I'm sure vim can do text filling pretty nicely too.
 
 > +			 */
 > +			parent = parent_mem_cgroup(iter);
 > +			while (parent && (parent != memcg)) {
 > +				if (test_bit(KMEM_ACCOUNTED_THIS, &parent->kmem_accounted))
 > +					goto noclear;
 > +
 > +				parent = parent_mem_cgroup(parent);
 > +			}
 
 Better written in for (;;)?  Also, if we're breaking on parent ==
 memcg, can we ever hit NULL parent in the above loop?
 
 > +			clear_bit(KMEM_ACCOUNTED_PARENT, &iter->kmem_accounted);
 > +noclear:
 > +			continue;
 > +		}
 > +	}
 > +out:
 > +	mutex_unlock(&set_limit_mutex);
 
 Can we please branch on val != RECOURSE_MAX first?  I'm not even sure
 whether the above conditionals are correct.  If the user updates an
 existing kmem limit, the first test_and_set_bit() returns non-zero, so
 the code proceeds onto clearing KMEM_ACCOUNTED_THIS, which succeeds
 but val == RESOURCE_MAX fails so it doesn't do anything.  If the user
 changes it again, it will set ACCOUNTED_THIS again.  So, changing an
 existing kmem limit toggles KMEM_ACCOUNTED_THIS, which just seems
 wacky to me.
 
 Thanks.
 
 --
 tejun
 |  
	|  |  |  
	| 
		
			| Re: [PATCH 10/11] memcg: allow a memcg with kmem charges to be destructed. [message #46932 is a reply to message #46922] | Mon, 25 June 2012 18:34   |  
			| 
				
				
					|  Tejun Heo Messages: 184
 Registered: November 2006
 | Senior Member |  |  |  
	| On Mon, Jun 25, 2012 at 06:15:27PM +0400, Glauber Costa wrote: > Because the ultimate goal of the kmem tracking in memcg is to
 > track slab pages as well, we can't guarantee that we'll always
 > be able to point a page to a particular process, and migrate
 > the charges along with it - since in the common case, a page
 > will contain data belonging to multiple processes.
 >
 > Because of that, when we destroy a memcg, we only make sure
 > the destruction will succeed by discounting the kmem charges
 > from the user charges when we try to empty the cgroup.
 >
 > Signed-off-by: Glauber Costa <glommer@parallels.com>
 > CC: Christoph Lameter <cl@linux.com>
 > CC: Pekka Enberg <penberg@cs.helsinki.fi>
 > CC: Michal Hocko <mhocko@suse.cz>
 > CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 > CC: Johannes Weiner <hannes@cmpxchg.org>
 > CC: Suleiman Souhlal <suleiman@google.com>
 > ---
 >  mm/memcontrol.c |   10 +++++++++-
 >  1 file changed, 9 insertions(+), 1 deletion(-)
 >
 > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 > index a6a440b..bb9b6fe 100644
 > --- a/mm/memcontrol.c
 > +++ b/mm/memcontrol.c
 > @@ -598,6 +598,11 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
 >  {
 >  	if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
 >  		static_key_slow_dec(&mem_cgroup_kmem_enabled_key);
 > +	/*
 > +	 * This check can't live in kmem destruction function,
 > +	 * since the charges will outlive the cgroup
 > +	 */
 > +	BUG_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
 
 WARN_ON() please.  Misaccounted kernel usually is better than dead
 kernel.
 
 Thanks.
 
 --
 tejun
 |  
	|  |  |  
	| 
		
			| Re: [PATCH 11/11] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs [message #46933 is a reply to message #46924] | Mon, 25 June 2012 18:38   |  
			| 
				
				
					|  Tejun Heo Messages: 184
 Registered: November 2006
 | Senior Member |  |  |  
	| On Mon, Jun 25, 2012 at 06:55:35PM +0200, Frederic Weisbecker wrote: > On 06/25/2012 04:15 PM, Glauber Costa wrote:
 >
 > > Because those architectures will draw their stacks directly from
 > > the page allocator, rather than the slab cache, we can directly
 > > pass __GFP_KMEMCG flag, and issue the corresponding free_pages.
 > >
 > > This code path is taken when the architecture doesn't define
 > > CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
 > > THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the
 > > remaining architectures fall in this category.
 > >
 > > This will guarantee that every stack page is accounted to the memcg
 > > the process currently lives on, and will have the allocations to fail
 > > if they go over limit.
 > >
 > > For the time being, I am defining a new variant of THREADINFO_GFP, not
 > > to mess with the other path. Once the slab is also tracked by memcg,
 > > we can get rid of that flag.
 > >
 > > Tested to successfully protect against :(){ :|:& };:
 > >
 > > Signed-off-by: Glauber Costa <glommer@parallels.com>
 > > CC: Christoph Lameter <cl@linux.com>
 > > CC: Pekka Enberg <penberg@cs.helsinki.fi>
 > > CC: Michal Hocko <mhocko@suse.cz>
 > > CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 > > CC: Johannes Weiner <hannes@cmpxchg.org>
 > > CC: Suleiman Souhlal <suleiman@google.com>
 >
 >
 > Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
 
 Frederic, does this (with proper slab accounting added later) achieve
 what you wanted with the task counter?
 
 Thanks.
 
 --
 tejun
 |  
	|  |  |  
	| 
		
			| Re: [PATCH 11/11] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs [message #46935 is a reply to message #46933] | Mon, 25 June 2012 20:57   |  
			| 
				
				
					|  Frederic Weisbecker Messages: 25
 Registered: April 2012
 | Junior Member |  |  |  
	| On Mon, Jun 25, 2012 at 11:38:18AM -0700, Tejun Heo wrote: > On Mon, Jun 25, 2012 at 06:55:35PM +0200, Frederic Weisbecker wrote:
 > > On 06/25/2012 04:15 PM, Glauber Costa wrote:
 > >
 > > > Because those architectures will draw their stacks directly from
 > > > the page allocator, rather than the slab cache, we can directly
 > > > pass __GFP_KMEMCG flag, and issue the corresponding free_pages.
 > > >
 > > > This code path is taken when the architecture doesn't define
 > > > CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
 > > > THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the
 > > > remaining architectures fall in this category.
 > > >
 > > > This will guarantee that every stack page is accounted to the memcg
 > > > the process currently lives on, and will have the allocations to fail
 > > > if they go over limit.
 > > >
 > > > For the time being, I am defining a new variant of THREADINFO_GFP, not
 > > > to mess with the other path. Once the slab is also tracked by memcg,
 > > > we can get rid of that flag.
 > > >
 > > > Tested to successfully protect against :(){ :|:& };:
 > > >
 > > > Signed-off-by: Glauber Costa <glommer@parallels.com>
 > > > CC: Christoph Lameter <cl@linux.com>
 > > > CC: Pekka Enberg <penberg@cs.helsinki.fi>
 > > > CC: Michal Hocko <mhocko@suse.cz>
 > > > CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 > > > CC: Johannes Weiner <hannes@cmpxchg.org>
 > > > CC: Suleiman Souhlal <suleiman@google.com>
 > >
 > >
 > > Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
 >
 > Frederic, does this (with proper slab accounting added later) achieve
 > what you wanted with the task counter?
 
 I think so yeah. Relying on general kernel memory accounting should do
 the trick for us. And if we need more finegrained limitation on kernel
 stack accounting we can still add it incrementally. But I believe global
 limitation can be enough.
 
 Thanks.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH 10/11] memcg: allow a memcg with kmem charges to be destructed. [message #46936 is a reply to message #46932] | Mon, 25 June 2012 22:25   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 06/25/2012 10:34 PM, Tejun Heo wrote: > On Mon, Jun 25, 2012 at 06:15:27PM +0400, Glauber Costa wrote:
 >> Because the ultimate goal of the kmem tracking in memcg is to
 >> track slab pages as well, we can't guarantee that we'll always
 >> be able to point a page to a particular process, and migrate
 >> the charges along with it - since in the common case, a page
 >> will contain data belonging to multiple processes.
 >>
 >> Because of that, when we destroy a memcg, we only make sure
 >> the destruction will succeed by discounting the kmem charges
 >> from the user charges when we try to empty the cgroup.
 >>
 >> Signed-off-by: Glauber Costa <glommer@parallels.com>
 >> CC: Christoph Lameter <cl@linux.com>
 >> CC: Pekka Enberg <penberg@cs.helsinki.fi>
 >> CC: Michal Hocko <mhocko@suse.cz>
 >> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 >> CC: Johannes Weiner <hannes@cmpxchg.org>
 >> CC: Suleiman Souhlal <suleiman@google.com>
 >> ---
 >>   mm/memcontrol.c |   10 +++++++++-
 >>   1 file changed, 9 insertions(+), 1 deletion(-)
 >>
 >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 >> index a6a440b..bb9b6fe 100644
 >> --- a/mm/memcontrol.c
 >> +++ b/mm/memcontrol.c
 >> @@ -598,6 +598,11 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
 >>   {
 >>   	if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
 >>   		static_key_slow_dec(&mem_cgroup_kmem_enabled_key);
 >> +	/*
 >> +	 * This check can't live in kmem destruction function,
 >> +	 * since the charges will outlive the cgroup
 >> +	 */
 >> +	BUG_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
 >
 > WARN_ON() please.  Misaccounted kernel usually is better than dead
 > kernel.
 >
 
 You're absolutely right, will change.
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH 06/11] memcg: kmem controller infrastructure [message #46939 is a reply to message #46929] | Mon, 25 June 2012 22:28   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 06/25/2012 10:06 PM, Tejun Heo wrote: > Again, nits.
 >
 > On Mon, Jun 25, 2012 at 06:15:23PM +0400, Glauber Costa wrote:
 >> +#define mem_cgroup_kmem_on 1
 >> +bool __mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order);
 >> +void __mem_cgroup_commit_kmem_page(struct page *page, void *handle, int order);
 >> +void __mem_cgroup_free_kmem_page(struct page *page, int order);
 >> +#define is_kmem_tracked_alloc (gfp & __GFP_KMEMCG)
 >
 > Ugh... please do the following instead.
 >
 > static inline bool is_kmem_tracked_alloc(gfp_t gfp)
 > {
 > 	return gfp & __GFP_KMEMCG;
 > }
 >
 >>   #else
 >>   static inline void sock_update_memcg(struct sock *sk)
 >>   {
 >> @@ -416,6 +423,43 @@ static inline void sock_update_memcg(struct sock *sk)
 >>   static inline void sock_release_memcg(struct sock *sk)
 >>   {
 >>   }
 >> +
 >> +#define mem_cgroup_kmem_on 0
 >> +#define __mem_cgroup_new_kmem_page(a, b, c) false
 >> +#define __mem_cgroup_free_kmem_page(a,b )
 >> +#define __mem_cgroup_commit_kmem_page(a, b, c)
 >> +#define is_kmem_tracked_alloc (false)
 >
 > I would prefer static inlines here too.  It's a bit more code in the
 > header but leads to less surprises (e.g. arg evals w/ side effects or
 > compiler warning about unused vars) and makes it easier to avoid
 > cosmetic errors.
 >
 > Thanks.
 >
 
 Sure thing.
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH 01/11] memcg: Make it possible to use the stock for more than one page. [message #46941 is a reply to message #46940] | Mon, 25 June 2012 22:33   |  
			| 
				
				
					|  Tejun Heo Messages: 184
 Registered: November 2006
 | Senior Member |  |  |  
	| Hello, 
 On Mon, Jun 25, 2012 at 3:29 PM, Glauber Costa <glommer@parallels.com> wrote:
 >> It would be nice to explain why this is being done.  Just a simple
 >> statement like - "prepare for XXX" or "will be needed by XXX".
 >
 >
 > I picked this patch from Suleiman Souhlal, and tried to keep it as close as
 > possible to his version.
 >
 > But for the sake of documentation, I can do that, yes.
 
 Yeah, that would be great. Also, the patch does change behavior,
 right? Explaining / justifying that a bit would be nice too.
 
 Thanks!
 
 --
 tejun
 |  
	|  |  |  
	| 
		
			| Re: [PATCH 09/11] memcg: propagate kmem limiting information to children [message #46942 is a reply to message #46931] | Mon, 25 June 2012 22:36   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 06/25/2012 10:29 PM, Tejun Heo wrote: > Feeling like a nit pervert but..
 >
 > On Mon, Jun 25, 2012 at 06:15:26PM +0400, Glauber Costa wrote:
 >> @@ -287,7 +287,11 @@ struct mem_cgroup {
 >>   	 * Should the accounting and control be hierarchical, per subtree?
 >>   	 */
 >>   	bool use_hierarchy;
 >> -	bool kmem_accounted;
 >> +	/*
 >> +	 * bit0: accounted by this cgroup
 >> +	 * bit1: accounted by a parent.
 >> +	 */
 >> +	volatile unsigned long kmem_accounted;
 >
 > Is the volatile declaration really necessary?  Why is it necessary?
 > Why no comment explaining it?
 
 Seems to be required by set_bit and friends. gcc will complain if it is
 not volatile (take a look at the bit function headers)
 
 >> +
 >> +		for_each_mem_cgroup_tree(iter, memcg) {
 >> +			struct mem_cgroup *parent;
 >
 > Blank line between decl and body please.
 ok.
 
 >
 >> +			if (iter == memcg)
 >> +				continue;
 >> +			/*
 >> +			 * We should only have our parent bit cleared if none of
 >> +			 * ouri parents are accounted. The transversal order of
 >
 >                                ^ type
 >
 >> +			 * our iter function forces us to always look at the
 >> +			 * parents.
 >
 > Also, it's okay here but the text filling in comments and patch
 > descriptions tend to be quite inconsistent.  If you're on emacs, alt-q
 > is your friend and I'm sure vim can do text filling pretty nicely too.
 >
 >> +			 */
 >> +			parent = parent_mem_cgroup(iter);
 >> +			while (parent && (parent != memcg)) {
 >> +				if (test_bit(KMEM_ACCOUNTED_THIS, &parent->kmem_accounted))
 >> +					goto noclear;
 >> +
 >> +				parent = parent_mem_cgroup(parent);
 >> +			}
 >
 > Better written in for (;;)?  Also, if we're breaking on parent ==
 > memcg, can we ever hit NULL parent in the above loop?
 
 I can simplify to test parent != memcg only, indeed it is not expected
 to be NULL (but if it happens to be due to any kind of bug, we protect
 against NULL-dereference, that is why I like to write this way)
 
 >> +			continue;
 >> +		}
 >> +	}
 >> +out:
 >> +	mutex_unlock(&set_limit_mutex);
 >
 > Can we please branch on val != RECOURSE_MAX first?  I'm not even sure
 > whether the above conditionals are correct.  If the user updates an
 > existing kmem limit, the first test_and_set_bit() returns non-zero, so
 > the code proceeds onto clearing KMEM_ACCOUNTED_THIS, which succeeds
 > but val == RESOURCE_MAX fails so it doesn't do anything.  If the user
 > changes it again, it will set ACCOUNTED_THIS again.  So, changing an
 > existing kmem limit toggles KMEM_ACCOUNTED_THIS, which just seems
 > wacky to me.
 >
 
 I will take a look at that tomorrow as well.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH 09/11] memcg: propagate kmem limiting information to children [message #46943 is a reply to message #46942] | Mon, 25 June 2012 22:49   |  
			| 
				
				
					|  Tejun Heo Messages: 184
 Registered: November 2006
 | Senior Member |  |  |  
	| Hello, Glauber. 
 On Tue, Jun 26, 2012 at 02:36:27AM +0400, Glauber Costa wrote:
 > >Is the volatile declaration really necessary?  Why is it necessary?
 > >Why no comment explaining it?
 >
 > Seems to be required by set_bit and friends. gcc will complain if it
 > is not volatile (take a look at the bit function headers)
 
 Hmmm?  Are you sure gcc includes volatile in type check?  There are a
 lot of bitops users in the kernel but most of them don't use volatile
 decl on the variable.
 
 > >>+			 */
 > >>+			parent = parent_mem_cgroup(iter);
 > >>+			while (parent && (parent != memcg)) {
 > >>+				if (test_bit(KMEM_ACCOUNTED_THIS, &parent->kmem_accounted))
 > >>+					goto noclear;
 > >>+
 > >>+				parent = parent_mem_cgroup(parent);
 > >>+			}
 > >
 > >Better written in for (;;)?  Also, if we're breaking on parent ==
 > >memcg, can we ever hit NULL parent in the above loop?
 >
 > I can simplify to test parent != memcg only, indeed it is not
 > expected to be NULL (but if it happens to be due to any kind of bug,
 > we protect against NULL-dereference, that is why I like to write
 > this way)
 
 I personally don't really like that.  It doesn't really add meaningful
 protection (if that happens the tree walking is already severely
 broken) while causes confusion to future readers of the code (when can
 parent be NULL?).
 
 Thanks.
 
 --
 tejun
 |  
	|  |  |  
	| 
		
			| Re: [PATCH 06/11] memcg: kmem controller infrastructure [message #46944 is a reply to message #46925] | Mon, 25 June 2012 23:17   |  
			| 
				
				
					|  akpm Messages: 224
 Registered: March 2007
 | Senior Member |  |  |  
	| On Mon, 25 Jun 2012 18:15:23 +0400 Glauber Costa <glommer@parallels.com> wrote:
 
 > This patch introduces infrastructure for tracking kernel memory pages
 > to a given memcg. This will happen whenever the caller includes the
 > flag __GFP_KMEMCG flag, and the task belong to a memcg other than
 > the root.
 >
 > In memcontrol.h those functions are wrapped in inline acessors.
 > The idea is to later on, patch those with jump labels, so we don't
 > incur any overhead when no mem cgroups are being used.
 >
 >
 > ...
 >
 > @@ -416,6 +423,43 @@ static inline void sock_update_memcg(struct sock *sk)
 >  static inline void sock_release_memcg(struct sock *sk)
 >  {
 >  }
 > +
 > +#define mem_cgroup_kmem_on 0
 > +#define __mem_cgroup_new_kmem_page(a, b, c) false
 > +#define __mem_cgroup_free_kmem_page(a,b )
 > +#define __mem_cgroup_commit_kmem_page(a, b, c)
 
 I suggest that the naming consistently follow the model
 "mem_cgroup_kmem_foo".  So "mem_cgroup_kmem_" becomes the well-known
 identifier for this subsystem.
 
 Then, s/mem_cgroup/memcg/g/ - show us some mercy here!
 
 > +#define is_kmem_tracked_alloc (false)
 
 memcg_kmem_tracked_alloc, perhaps.  But what does this actually do?
 
 <looks>
 
 eww, ick.
 
 > +#define is_kmem_tracked_alloc (gfp & __GFP_KMEMCG)
 
 What Tejun said.  This:
 
 /*
 * Nice comment goes here
 */
 static inline bool memcg_kmem_tracked_alloc(gfp_t gfp)
 {
 return gfp & __GFP_KMEMCG;
 }
 
 >  #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 > +
 > +static __always_inline
 > +bool mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order)
 
 memcg_kmem_new_page().
 
 > +{
 > +	if (!mem_cgroup_kmem_on)
 > +		return true;
 > +	if (!is_kmem_tracked_alloc)
 > +		return true;
 > +	if (!current->mm)
 > +		return true;
 > +	if (in_interrupt())
 > +		return true;
 > +	if (gfp & __GFP_NOFAIL)
 > +		return true;
 > +	return __mem_cgroup_new_kmem_page(gfp, handle, order);
 > +}
 
 Add documentation, please.  The semantics of the return value are
 inscrutable.
 
 > +static __always_inline
 > +void mem_cgroup_free_kmem_page(struct page *page, int order)
 > +{
 > +	if (mem_cgroup_kmem_on)
 > +		__mem_cgroup_free_kmem_page(page, order);
 > +}
 
 memcg_kmem_free_page().
 
 > +static __always_inline
 > +void mem_cgroup_commit_kmem_page(struct page *page, struct mem_cgroup *handle,
 > +				 int order)
 > +{
 > +	if (mem_cgroup_kmem_on)
 > +		__mem_cgroup_commit_kmem_page(page, handle, order);
 > +}
 
 memcg_kmem_commit_page().
 
 >  #endif /* _LINUX_MEMCONTROL_H */
 >
 >
 > ...
 >
 > +static inline bool mem_cgroup_kmem_enabled(struct mem_cgroup *memcg)
 > +{
 > +	return !mem_cgroup_disabled() && memcg &&
 > +	       !mem_cgroup_is_root(memcg) && memcg->kmem_accounted;
 > +}
 
 Does this really need to handle a memcg==NULL?
 
 > +bool __mem_cgroup_new_kmem_page(gfp_t gfp, void *_handle, int order)
 > +{
 > +	struct mem_cgroup *memcg;
 > +	struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
 > +	bool ret = true;
 > +	size_t size;
 > +	struct task_struct *p;
 > +
 > +	*handle = NULL;
 > +	rcu_read_lock();
 > +	p = rcu_dereference(current->mm->owner);
 > +	memcg = mem_cgroup_from_task(p);
 > +	if (!mem_cgroup_kmem_enabled(memcg))
 > +		goto out;
 > +
 > +	mem_cgroup_get(memcg);
 > +
 > +	size = (1 << order) << PAGE_SHIFT;
 
 size = PAGE_SIZE << order;
 
 is simpler and more typical.
 
 > +	ret = memcg_charge_kmem(memcg, gfp, size) == 0;
 
 Odd.  memcg_charge_kmem() returns a nice errno, buit this conversion
 just drops that information on the floor.  If the
 mem_cgroup_new_kmem_page() return value had been documented, I might
 have been able to understand the thinking here.  But it wasn't, so I
 couldn't.
 
 > +	if (!ret) {
 > +		mem_cgroup_put(memcg);
 > +		goto out;
 > +	}
 > +
 > +	*handle = memcg;
 > +out:
 > +	rcu_read_unlock();
 > +	return ret;
 > +}
 > +EXPORT_SYMBOL(__mem_cgroup_new_kmem_page);
 > +
 > +void __mem_cgroup_commit_kmem_page(struct page *page, void *handle, int order)
 > +{
 > +	struct page_cgroup *pc;
 > +	struct mem_cgroup *memcg = handle;
 > +	size_t size;
 > +
 > +	if (!memcg)
 > +		return;
 > +
 > +	WARN_ON(mem_cgroup_is_root(memcg));
 > +	/* The page allocation must have failed. Revert */
 > +	if (!page) {
 > +		size = (1 << order) << PAGE_SHIFT;
 
 PAGE_SIZE << order
 
 > +		memcg_uncharge_kmem(memcg, size);
 > +		mem_cgroup_put(memcg);
 > +		return;
 > +	}
 
 This all looks a bit odd.  After a failure you run a commit which
 undoes the speculative charging.  I guess it makes sense.  It's the
 sort of thing which can be expounded upon in the documentation whcih
 isn't there, sigh.
 
 > +	pc = lookup_page_cgroup(page);
 > +	lock_page_cgroup(pc);
 > +	pc->mem_cgroup = memcg;
 > +	SetPageCgroupUsed(pc);
 > +	unlock_page_cgroup(pc);
 > +}
 
 missed a newline there.
 
 > +void __mem_cgroup_free_kmem_page(struct page *page, int order)
 > +{
 > +	struct mem_cgroup *memcg;
 > +	size_t size;
 > +	struct page_cgroup *pc;
 > +
 > +	if (mem_cgroup_disabled())
 > +		return;
 > +
 > +	pc = lookup_page_cgroup(page);
 > +	lock_page_cgroup(pc);
 > +	memcg = pc->mem_cgroup;
 > +	pc->mem_cgroup = NULL;
 > +	if (!PageCgroupUsed(pc)) {
 > +		unlock_page_cgroup(pc);
 > +		return;
 > +	}
 > +	ClearPageCgroupUsed(pc);
 > +	unlock_page_cgroup(pc);
 > +
 > +	/*
 > +	 * The classical disabled check won't work
 
 What is "The classical disabled check"?  Be specific.  Testing
 mem_cgroup_kmem_on?
 
 > +	 * for uncharge, since it is possible that the user enabled
 > +	 * kmem tracking, allocated, and then disabled.
 > +	 *
 > +	 * We trust if there is a memcg associated with the page,
 > +	 * it is a valid allocation
 > +	 */
 > +	if (!memcg)
 > +		return;
 > +
 > +	WARN_ON(mem_cgroup_is_root(memcg));
 > +	size = (1 << order) << PAGE_SHIFT;
 
 PAGE_SIZE << order
 
 > +	memcg_uncharge_kmem(memcg, size);
 > +	mem_cgroup_put(memcg);
 > +}
 > +EXPORT_SYMBOL(__mem_cgroup_free_kmem_page);
 >  #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 >
 >  #if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
 > @@ -5645,3 +5751,69 @@ static int __init enable_swap_account(char *s)
 >  __setup("swapaccount=", enable_swap_account);
 >
 >  #endif
 > +
 > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 
 gargh.  CONFIG_MEMCG_KMEM, please!
 
 > +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
 > +{
 > +	struct res_counter *fail_res;
 > +	struct mem_cgroup *_memcg;
 > +	int may_oom, ret;
 > +	bool nofail = false;
 > +
 > +	may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
 > +	    !(gfp & __GFP_NORETRY);
 
 may_oom should have bool type.
 
 > +	ret = 0;
 > +
 > +	if (!memcg)
 > +		return ret;
 > +
 > +	_memcg = memcg;
 > +	ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
 > +	    &_memcg, may_oom);
 > +
 > +	if (ret == -EINTR)  {
 > +		nofail = true;
 > +		/*
 > +		 * __mem_cgroup_try_charge() chose to bypass to root due
 > +		 * to OOM kill or fatal signal.
 
 Is "bypass" correct?  Maybe "fall back"?
 
 > +		 * Since our only options are to either fail the
 > +		 * allocation or charge it to this cgroup, do it as
 > +		 * a temporary condition. But we can't fail. From a kmem/slab
 > +		 * perspective, the cache has already been selected, by
 > +		 * mem_cgroup_get_kmem_cache(), so it is too late to change our
 > +		 * minds
 > +		 */
 > +		res_counter_charge_nofail(&memcg->res, delta, &fail_res);
 > +		if (do_swap_account)
 > +			res_counter_charge_nofail(&memcg->memsw, delta,
 > +						  &fail_res);
 > +		ret = 0;
 > +	} else if (ret == -ENOMEM)
 > +		return ret;
 > +
 > +	if (nofail)
 > +		res_counter_charge_nofail(&memcg->kmem, delta, &fail_res);
 > +	else
 > +		ret = res_counter_charge(&memcg->kmem, delta, &fail_res);
 > +
 > +	if (ret) {
 > +		res_counter_uncharge(&memcg->res, delta);
 > +		if (do_swap_account)
 > +			res_counter_uncharge(&memcg->memsw, delta);
 > +	}
 > +
 > +	return ret;
 > +}
 >
 > ...
 >
 |  
	|  |  |  
	|  |  
	|  |  
	| 
		
			| Re: [PATCH 00/11] kmem controller for memcg: stripped down version [message #46947 is a reply to message #46916] | Mon, 25 June 2012 23:27   |  
			| 
				
				
					|  akpm Messages: 224
 Registered: March 2007
 | Senior Member |  |  |  
	| On Mon, 25 Jun 2012 18:15:17 +0400 Glauber Costa <glommer@parallels.com> wrote:
 
 > What I am proposing with this series is a stripped down version of the
 > kmem controller for memcg that would allow us to merge significant parts
 > of the infrastructure, while leaving out, for now, the polemic bits about
 > the slab while it is being reworked by Cristoph.
 >
 > Me reasoning for that is that after the last change to introduce a gfp
 > flag to mark kernel allocations, it became clear to me that tracking other
 > resources like the stack would then follow extremely naturaly. I figured
 > that at some point we'd have to solve the issue pointed by David, and avoid
 > testing the Slab flag in the page allocator, since it would soon be made
 > more generic. I do that by having the callers to explicit mark it.
 >
 > So to demonstrate how it would work, I am introducing a stack tracker here,
 > that is already a functionality per-se: it successfully stops fork bombs to
 > happen. (Sorry for doing all your work, Frederic =p ). Note that after all
 > memcg infrastructure is deployed, it becomes very easy to track anything.
 > The last patch of this series is extremely simple.
 >
 > The infrastructure is exactly the same we had in memcg, but stripped down
 > of the slab parts. And because what we have after those patches is a feature
 > per-se, I think it could be considered for merging.
 
 hm.  None of this new code makes the kernel smaller, faster, easier to
 understand or more fun to read!
 
 Presumably we're getting some benefit for all the downside.  When the
 time is appropriate, please do put some time into explaining that
 benefit, so that others can agree that it is a worthwhile tradeoff.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH 02/11] memcg: Reclaim when more than one page needed. [message #46948 is a reply to message #46917] | Mon, 25 June 2012 23:33   |  
			| 
				
				
					|  Suleiman Souhlal Messages: 64
 Registered: February 2012
 | Member |  |  |  
	| On Mon, Jun 25, 2012 at 7:15 AM, Glauber Costa <glommer@parallels.com> wrote: > From: Suleiman Souhlal <ssouhlal@FreeBSD.org>
 >
 > mem_cgroup_do_charge() was written before slab accounting, and expects
 > three cases: being called for 1 page, being called for a stock of 32 pages,
 > or being called for a hugepage.  If we call for 2 or 3 pages (and several
 > slabs used in process creation are such, at least with the debug options I
 > had), it assumed it's being called for stock and just retried without reclaiming.
 >
 > Fix that by passing down a minsize argument in addition to the csize.
 >
 > And what to do about that (csize == PAGE_SIZE && ret) retry?  If it's
 > needed at all (and presumably is since it's there, perhaps to handle
 > races), then it should be extended to more than PAGE_SIZE, yet how far?
 > And should there be a retry count limit, of what?  For now retry up to
 > COSTLY_ORDER (as page_alloc.c does), stay safe with a cond_resched(),
 > and make sure not to do it if __GFP_NORETRY.
 
 The commit description mentions COSTLY_ORDER, but it's not actually
 used in the patch.
 
 -- Suleiman
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH 02/11] memcg: Reclaim when more than one page needed. [message #46951 is a reply to message #46917] | Tue, 26 June 2012 04:09   |  
			| 
				
				
					|  David Rientjes Messages: 59
 Registered: November 2006
 | Member |  |  |  
	| On Mon, 25 Jun 2012, Glauber Costa wrote: 
 > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 > index 9304db2..8e601e8 100644
 > --- a/mm/memcontrol.c
 > +++ b/mm/memcontrol.c
 > @@ -2158,8 +2158,16 @@ enum {
 >  	CHARGE_OOM_DIE,		/* the current is killed because of OOM */
 >  };
 >
 > +/*
 > + * We need a number that is small enough to be likely to have been
 > + * reclaimed even under pressure, but not too big to trigger unnecessary
 
 Whitespace.
 
 > + * retries
 > + */
 > +#define NR_PAGES_TO_RETRY 2
 > +
 
 Should be 1 << PAGE_ALLOC_COSTLY_ORDER?  Where does this number come from?
 The changelog doesn't specify.
 
 >  static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 > -				unsigned int nr_pages, bool oom_check)
 > +				unsigned int nr_pages, unsigned int min_pages,
 > +				bool oom_check)
 >  {
 >  	unsigned long csize = nr_pages * PAGE_SIZE;
 >  	struct mem_cgroup *mem_over_limit;
 > @@ -2182,18 +2190,18 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 >  	} else
 >  		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
 >  	/*
 > -	 * nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
 > -	 * of regular pages (CHARGE_BATCH), or a single regular page (1).
 > -	 *
 >  	 * Never reclaim on behalf of optional batching, retry with a
 >  	 * single page instead.
 >  	 */
 > -	if (nr_pages == CHARGE_BATCH)
 > +	if (nr_pages > min_pages)
 >  		return CHARGE_RETRY;
 >
 >  	if (!(gfp_mask & __GFP_WAIT))
 >  		return CHARGE_WOULDBLOCK;
 >
 > +	if (gfp_mask & __GFP_NORETRY)
 > +		return CHARGE_NOMEM;
 > +
 >  	ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
 >  	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 >  		return CHARGE_RETRY;
 > @@ -2206,7 +2214,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 >  	 * unlikely to succeed so close to the limit, and we fall back
 >  	 * to regular pages anyway in case of failure.
 >  	 */
 > -	if (nr_pages == 1 && ret)
 > +	if (nr_pages <= NR_PAGES_TO_RETRY && ret)
 >  		return CHARGE_RETRY;
 >
 >  	/*
 > @@ -2341,7 +2349,8 @@ again:
 >  			nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
 >  		}
 >
 > -		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
 > +		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
 > +		    oom_check);
 >  		switch (ret) {
 >  		case CHARGE_OK:
 >  			break;
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH 04/11] kmem slab accounting basic infrastructure [message #46953 is a reply to message #46921] | Tue, 26 June 2012 04:22   |  
			| 
				
				
					|  David Rientjes Messages: 59
 Registered: November 2006
 | Member |  |  |  
	| On Mon, 25 Jun 2012, Glauber Costa wrote: 
 > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 > index 9352d40..6f34b77 100644
 > --- a/mm/memcontrol.c
 > +++ b/mm/memcontrol.c
 > @@ -265,6 +265,10 @@ struct mem_cgroup {
 >  	};
 >
 >  	/*
 > +	 * the counter to account for kernel memory usage.
 > +	 */
 > +	struct res_counter kmem;
 > +	/*
 >  	 * Per cgroup active and inactive list, similar to the
 >  	 * per zone LRU lists.
 >  	 */
 > @@ -279,6 +283,7 @@ struct mem_cgroup {
 >  	 * Should the accounting and control be hierarchical, per subtree?
 >  	 */
 >  	bool use_hierarchy;
 > +	bool kmem_accounted;
 >
 >  	bool		oom_lock;
 >  	atomic_t	under_oom;
 > @@ -391,6 +396,7 @@ enum res_type {
 >  	_MEM,
 >  	_MEMSWAP,
 >  	_OOM_TYPE,
 > +	_KMEM,
 >  };
 >
 >  #define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))
 > @@ -1438,6 +1444,10 @@ done:
 >  		res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
 >  		res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
 >  		res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
 > +	printk(KERN_INFO "kmem: usage %llukB, limit %llukB, failcnt %llu\n",
 > +		res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
 > +		res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
 > +		res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
 >  }
 >
 >  /*
 > @@ -3879,6 +3889,11 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
 >  		else
 >  			val = res_counter_read_u64(&memcg->memsw, name);
 >  		break;
 > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 > +	case _KMEM:
 > +		val = res_counter_read_u64(&memcg->kmem, name);
 > +		break;
 > +#endif
 
 This shouldn't need an #ifdef, ->kmem is available on all
 CONFIG_CGROUP_MEM_RES_CTLR kernels.  Same with several of the other
 instances in this patch.
 
 Can't these instances be addressed by not adding kmem_cgroup_files without
 CONFIG_CGROUP_MEM_RES_CTLR_KMEM?
 
 >  	default:
 >  		BUG();
 >  	}
 > @@ -3916,8 +3931,26 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 >  			break;
 >  		if (type == _MEM)
 >  			ret = mem_cgroup_resize_limit(memcg, val);
 > -		else
 > +		else if (type == _MEMSWAP)
 >  			ret = mem_cgroup_resize_memsw_limit(memcg, val);
 > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 > +		else if (type == _KMEM) {
 > +			ret = res_counter_set_limit(&memcg->kmem, val);
 > +			if (ret)
 > +				break;
 > +			/*
 > +			 * Once enabled, can't be disabled. We could in theory
 > +			 * disable it if we haven't yet created any caches, or
 > +			 * if we can shrink them all to death.
 > +			 *
 > +			 * But it is not worth the trouble
 > +			 */
 > +			if (!memcg->kmem_accounted && val != RESOURCE_MAX)
 > +				memcg->kmem_accounted = true;
 > +		}
 > +#endif
 > +		else
 > +			return -EINVAL;
 >  		break;
 >  	case RES_SOFT_LIMIT:
 >  		ret = res_counter_memparse_write_strategy(buffer, &val);
 > @@ -3982,12 +4015,20 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
 >  	case RES_MAX_USAGE:
 >  		if (type == _MEM)
 >  			res_counter_reset_max(&memcg->res);
 > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 > +		else if (type == _KMEM)
 > +			res_counter_reset_max(&memcg->kmem);
 > +#endif
 >  		else
 >  			res_counter_reset_max(&memcg->memsw);
 >  		break;
 >  	case RES_FAILCNT:
 >  		if (type == _MEM)
 >  			res_counter_reset_failcnt(&memcg->res);
 > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 > +		else if (type == _KMEM)
 > +			res_counter_reset_failcnt(&memcg->kmem);
 > +#endif
 >  		else
 >  			res_counter_reset_failcnt(&memcg->memsw);
 >  		break;
 > @@ -4549,6 +4590,33 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
 >  }
 >
 >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 > +static struct cftype kmem_cgroup_files[] = {
 > +	{
 > +		.name = "kmem.limit_in_bytes",
 > +		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
 > +		.write_string = mem_cgroup_write,
 > +		.read = mem_cgroup_read,
 > +	},
 > +	{
 > +		.name = "kmem.usage_in_bytes",
 > +		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
 > +		.read = mem_cgroup_read,
 > +	},
 > +	{
 > +		.name = "kmem.failcnt",
 > +		.private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
 > +		.trigger = mem_cgroup_reset,
 > +		.read = mem_cgroup_read,
 > +	},
 > +	{
 > +		.name = "kmem.max_usage_in_bytes",
 > +		.private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
 > +		.trigger = mem_cgroup_reset,
 > +		.read = mem_cgroup_read,
 > +	},
 > +	{},
 > +};
 > +
 >  static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 >  {
 >  	return mem_cgroup_sockets_init(memcg, ss);
 > @@ -4892,6 +4960,12 @@ mem_cgroup_create(struct cgroup *cont)
 >  		int cpu;
 >  		enable_swap_cgroup();
 >  		parent = NULL;
 > +
 > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 > +		WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
 > +					   kmem_cgroup_files));
 > +#endif
 > +
 >  		if (mem_cgroup_soft_limit_tree_init())
 >  			goto free_out;
 >  		root_mem_cgroup = memcg;
 > @@ -4910,6 +4984,7 @@ mem_cgroup_create(struct cgroup *cont)
 >  	if (parent && parent->use_hierarchy) {
 >  		res_counter_init(&memcg->res, &parent->res);
 >  		res_counter_init(&memcg->memsw, &parent->memsw);
 > +		res_counter_init(&memcg->kmem, &parent->kmem);
 >  		/*
 >  		 * We increment refcnt of the parent to ensure that we can
 >  		 * safely access it on res_counter_charge/uncharge.
 > @@ -4920,6 +4995,7 @@ mem_cgroup_create(struct cgroup *cont)
 >  	} else {
 >  		res_counter_init(&memcg->res, NULL);
 >  		res_counter_init(&memcg->memsw, NULL);
 > +		res_counter_init(&memcg->kmem, NULL);
 >  	}
 >  	memcg->last_scanned_node = MAX_NUMNODES;
 >  	INIT_LIST_HEAD(&memcg->oom_notify);
 |  
	|  |  |  
	| 
		
			| Re: [PATCH 05/11] Add a __GFP_KMEMCG flag [message #46954 is a reply to message #46920] | Tue, 26 June 2012 04:25   |  
			| 
				
				
					|  David Rientjes Messages: 59
 Registered: November 2006
 | Member |  |  |  
	| On Mon, 25 Jun 2012, Glauber Costa wrote: 
 > This flag is used to indicate to the callees that this allocation will be
 > serviced to the kernel. It is not supposed to be passed by the callers
 > of kmem_cache_alloc, but rather by the cache core itself.
 >
 
 Not sure what "serviced to the kernel" means, does this mean that the
 memory will not be accounted for to the root memcg?
 
 > CC: Christoph Lameter <cl@linux.com>
 > CC: Pekka Enberg <penberg@cs.helsinki.fi>
 > CC: Michal Hocko <mhocko@suse.cz>
 > CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 > CC: Johannes Weiner <hannes@cmpxchg.org>
 > CC: Suleiman Souhlal <suleiman@google.com>
 > ---
 >  include/linux/gfp.h |    8 +++++++-
 >  1 file changed, 7 insertions(+), 1 deletion(-)
 >
 > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
 > index 1e49be4..8f4079f 100644
 > --- a/include/linux/gfp.h
 > +++ b/include/linux/gfp.h
 > @@ -37,6 +37,9 @@ struct vm_area_struct;
 >  #define ___GFP_NO_KSWAPD	0x400000u
 >  #define ___GFP_OTHER_NODE	0x800000u
 >  #define ___GFP_WRITE		0x1000000u
 > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 > +#define ___GFP_KMEMCG		0x2000000u
 > +#endif
 >
 >  /*
 >   * GFP bitmasks..
 > @@ -88,13 +91,16 @@ struct vm_area_struct;
 >  #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
 >  #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Allocator intends to dirty page */
 >
 > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 > +#define __GFP_KMEMCG	((__force gfp_t)___GFP_KMEMCG)/* Allocation comes from a memcg-accounted resource */
 > +#endif
 
 Needs a space.
 
 >  /*
 >   * This may seem redundant, but it's a way of annotating false positives vs.
 >   * allocations that simply cannot be supported (e.g. page tables).
 >   */
 >  #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 >
 > -#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
 > +#define __GFP_BITS_SHIFT 26	/* Room for N __GFP_FOO bits */
 >  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 >
 >  /* This equals 0, but use constants in case they ever change */
 |  
	|  |  |  
	|  |  
	|  |  
	| 
		
			| Re: [PATCH 09/11] memcg: propagate kmem limiting information to children [message #46957 is a reply to message #46946] | Tue, 26 June 2012 05:24   |  
			| 
				
				
					|  David Rientjes Messages: 59
 Registered: November 2006
 | Member |  |  |  
	| On Mon, 25 Jun 2012, Andrew Morton wrote: 
 > > --- a/mm/memcontrol.c
 > > +++ b/mm/memcontrol.c
 > > @@ -287,7 +287,11 @@ struct mem_cgroup {
 > >  	 * Should the accounting and control be hierarchical, per subtree?
 > >  	 */
 > >  	bool use_hierarchy;
 > > -	bool kmem_accounted;
 > > +	/*
 > > +	 * bit0: accounted by this cgroup
 > > +	 * bit1: accounted by a parent.
 > > +	 */
 > > +	volatile unsigned long kmem_accounted;
 >
 > I suggest
 >
 > 	unsigned long kmem_accounted;	/* See KMEM_ACCOUNTED_*, below */
 >
 > >  	bool		oom_lock;
 > >  	atomic_t	under_oom;
 > > @@ -340,6 +344,9 @@ struct mem_cgroup {
 > >  #endif
 > >  };
 > >
 > > +#define KMEM_ACCOUNTED_THIS	0
 > > +#define KMEM_ACCOUNTED_PARENT	1
 >
 > And then document the fields here.
 >
 
 In hex, please?
 |  
	|  |  |  
	|  | 
 
 
 Current Time: Sat Oct 25 13:10:17 GMT 2025 
 Total time taken to generate the page: 0.09021 seconds |