OpenVZ Forum


Home » Mailing lists » Devel » [PATCH 00/11] kmem controller for memcg: stripped down version
[PATCH 00/11] kmem controller for memcg: stripped down version [message #46916] Mon, 25 June 2012 14:15 Go to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
Hi,

What I am proposing with this series is a stripped down version of the
kmem controller for memcg that would allow us to merge significant parts
of the infrastructure, while leaving out, for now, the polemic bits about
the slab while it is being reworked by Cristoph.

Me reasoning for that is that after the last change to introduce a gfp
flag to mark kernel allocations, it became clear to me that tracking other
resources like the stack would then follow extremely naturaly. I figured
that at some point we'd have to solve the issue pointed by David, and avoid
testing the Slab flag in the page allocator, since it would soon be made
more generic. I do that by having the callers to explicit mark it.

So to demonstrate how it would work, I am introducing a stack tracker here,
that is already a functionality per-se: it successfully stops fork bombs to
happen. (Sorry for doing all your work, Frederic =p ). Note that after all
memcg infrastructure is deployed, it becomes very easy to track anything.
The last patch of this series is extremely simple.

The infrastructure is exactly the same we had in memcg, but stripped down
of the slab parts. And because what we have after those patches is a feature
per-se, I think it could be considered for merging.

Let me know what you think.

Glauber Costa (9):
memcg: change defines to an enum
kmem slab accounting basic infrastructure
Add a __GFP_KMEMCG flag
memcg: kmem controller infrastructure
mm: Allocate kernel pages to the right memcg
memcg: disable kmem code when not in use.
memcg: propagate kmem limiting information to children
memcg: allow a memcg with kmem charges to be destructed.
protect architectures where THREAD_SIZE >= PAGE_SIZE against fork
bombs

Suleiman Souhlal (2):
memcg: Make it possible to use the stock for more than one page.
memcg: Reclaim when more than one page needed.

include/linux/gfp.h | 11 +-
include/linux/memcontrol.h | 46 +++++
include/linux/thread_info.h | 6 +
kernel/fork.c | 4 +-
mm/memcontrol.c | 395 ++++++++++++++++++++++++++++++++++++++++---
mm/page_alloc.c | 27 +++
6 files changed, 464 insertions(+), 25 deletions(-)

--
1.7.10.2
[PATCH 02/11] memcg: Reclaim when more than one page needed. [message #46917 is a reply to message #46916] Mon, 25 June 2012 14:15 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
From: Suleiman Souhlal <ssouhlal@FreeBSD.org>

mem_cgroup_do_charge() was written before slab accounting, and expects
three cases: being called for 1 page, being called for a stock of 32 pages,
or being called for a hugepage. If we call for 2 or 3 pages (and several
slabs used in process creation are such, at least with the debug options I
had), it assumed it's being called for stock and just retried without reclaiming.

Fix that by passing down a minsize argument in addition to the csize.

And what to do about that (csize == PAGE_SIZE && ret) retry? If it's
needed at all (and presumably is since it's there, perhaps to handle
races), then it should be extended to more than PAGE_SIZE, yet how far?
And should there be a retry count limit, of what? For now retry up to
COSTLY_ORDER (as page_alloc.c does), stay safe with a cond_resched(),
and make sure not to do it if __GFP_NORETRY.

[v4: fixed nr pages calculation pointed out by Christoph Lameter ]

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9304db2..8e601e8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2158,8 +2158,16 @@ enum {
CHARGE_OOM_DIE, /* the current is killed because of OOM */
};

+/*
+ * We need a number that is small enough to be likely to have been
+ * reclaimed even under pressure, but not too big to trigger unnecessary
+ * retries
+ */
+#define NR_PAGES_TO_RETRY 2
+
static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
- unsigned int nr_pages, bool oom_check)
+ unsigned int nr_pages, unsigned int min_pages,
+ bool oom_check)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2182,18 +2190,18 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
} else
mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
/*
- * nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
- * of regular pages (CHARGE_BATCH), or a single regular page (1).
- *
* Never reclaim on behalf of optional batching, retry with a
* single page instead.
*/
- if (nr_pages == CHARGE_BATCH)
+ if (nr_pages > min_pages)
return CHARGE_RETRY;

if (!(gfp_mask & __GFP_WAIT))
return CHARGE_WOULDBLOCK;

+ if (gfp_mask & __GFP_NORETRY)
+ return CHARGE_NOMEM;
+
ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
return CHARGE_RETRY;
@@ -2206,7 +2214,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
* unlikely to succeed so close to the limit, and we fall back
* to regular pages anyway in case of failure.
*/
- if (nr_pages == 1 && ret)
+ if (nr_pages <= NR_PAGES_TO_RETRY && ret)
return CHARGE_RETRY;

/*
@@ -2341,7 +2349,8 @@ again:
nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
}

- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
+ oom_check);
switch (ret) {
case CHARGE_OK:
break;
--
1.7.10.2
[PATCH 03/11] memcg: change defines to an enum [message #46918 is a reply to message #46916] Mon, 25 June 2012 14:15 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
This is just a cleanup patch for clarity of expression.
In earlier submissions, people asked it to be in a separate
patch, so here it is.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Michal Hocko <mhocko@suse.cz>
CC: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8e601e8..9352d40 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -387,9 +387,12 @@ enum charge_type {
};

/* for encoding cft->private value on file */
-#define _MEM (0)
-#define _MEMSWAP (1)
-#define _OOM_TYPE (2)
+enum res_type {
+ _MEM,
+ _MEMSWAP,
+ _OOM_TYPE,
+};
+
#define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
#define MEMFILE_TYPE(val) ((val) >> 16 & 0xffff)
#define MEMFILE_ATTR(val) ((val) & 0xffff)
--
1.7.10.2
[PATCH 01/11] memcg: Make it possible to use the stock for more than one page. [message #46919 is a reply to message #46916] Mon, 25 June 2012 14:15 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
From: Suleiman Souhlal <ssouhlal@FreeBSD.org>

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f72b5e5..9304db2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1967,19 +1967,22 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
static DEFINE_MUTEX(percpu_charge_mutex);

/*
- * Try to consume stocked charge on this cpu. If success, one page is consumed
- * from local stock and true is returned. If the stock is 0 or charges from a
- * cgroup which is not current target, returns false. This stock will be
- * refilled.
+ * Try to consume stocked charge on this cpu. If success, nr_pages pages are
+ * consumed from local stock and true is returned. If the stock is 0 or
+ * charges from a cgroup which is not current target, returns false.
+ * This stock will be refilled.
*/
-static bool consume_stock(struct mem_cgroup *memcg)
+static bool consume_stock(struct mem_cgroup *memcg, int nr_pages)
{
struct memcg_stock_pcp *stock;
bool ret = true;

+ if (nr_pages > CHARGE_BATCH)
+ return false;
+
stock = &get_cpu_var(memcg_stock);
- if (memcg == stock->cached && stock->nr_pages)
- stock->nr_pages--;
+ if (memcg == stock->cached && stock->nr_pages >= nr_pages)
+ stock->nr_pages -= nr_pages;
else /* need to call res_counter_charge */
ret = false;
put_cpu_var(memcg_stock);
@@ -2278,7 +2281,7 @@ again:
VM_BUG_ON(css_is_removed(&memcg->css));
if (mem_cgroup_is_root(memcg))
goto done;
- if (nr_pages == 1 && consume_stock(memcg))
+ if (consume_stock(memcg, nr_pages))
goto done;
css_get(&memcg->css);
} else {
@@ -2303,7 +2306,7 @@ again:
rcu_read_unlock();
goto done;
}
- if (nr_pages == 1 && consume_stock(memcg)) {
+ if (consume_stock(memcg, nr_pages)) {
/*
* It seems dagerous to access memcg without css_get().
* But considering how consume_stok works, it's not
--
1.7.10.2
[PATCH 05/11] Add a __GFP_KMEMCG flag [message #46920 is a reply to message #46916] Mon, 25 June 2012 14:15 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
This flag is used to indicate to the callees that this allocation will be
serviced to the kernel. It is not supposed to be passed by the callers
of kmem_cache_alloc, but rather by the cache core itself.

CC: Christoph Lameter <cl@linux.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Michal Hocko <mhocko@suse.cz>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Suleiman Souhlal <suleiman@google.com>
---
include/linux/gfp.h | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1e49be4..8f4079f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -37,6 +37,9 @@ struct vm_area_struct;
#define ___GFP_NO_KSWAPD 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
#define ___GFP_WRITE 0x1000000u
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+#define ___GFP_KMEMCG 0x2000000u
+#endif

/*
* GFP bitmasks..
@@ -88,13 +91,16 @@ struct vm_area_struct;
#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG)/* Allocation comes from a memcg-accounted resource */
+#endif
/*
* This may seem redundant, but it's a way of annotating false positives vs.
* allocations that simply cannot be supported (e.g. page tables).
*/
#define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)

-#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

/* This equals 0, but use constants in case they ever change */
--
1.7.10.2
[PATCH 04/11] kmem slab accounting basic infrastructure [message #46921 is a reply to message #46916] Mon, 25 June 2012 14:15 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
This patch adds the basic infrastructure for the accounting of the slab
caches. To control that, the following files are created:

* memory.kmem.usage_in_bytes
* memory.kmem.limit_in_bytes
* memory.kmem.failcnt
* memory.kmem.max_usage_in_bytes

They have the same meaning of their user memory counterparts. They reflect
the state of the "kmem" res_counter.

The code is not enabled until a limit is set. This can be tested by the flag
"kmem_accounted". This means that after the patch is applied, no behavioral
changes exists for whoever is still using memcg to control their memory usage.

We always account to both user and kernel resource_counters. This effectively
means that an independent kernel limit is in place when the limit is set
to a lower value than the user memory. A equal or higher value means that the
user limit will always hit first, meaning that kmem is effectively unlimited.

People who want to track kernel memory but not limit it, can set this limit
to a very high number (like RESOURCE_MAX - 1page - that no one will ever hit,
or equal to the user memory)

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Michal Hocko <mhocko@suse.cz>
CC: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 77 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9352d40..6f34b77 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -265,6 +265,10 @@ struct mem_cgroup {
};

/*
+ * the counter to account for kernel memory usage.
+ */
+ struct res_counter kmem;
+ /*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
*/
@@ -279,6 +283,7 @@ struct mem_cgroup {
* Should the accounting and control be hierarchical, per subtree?
*/
bool use_hierarchy;
+ bool kmem_accounted;

bool oom_lock;
atomic_t under_oom;
@@ -391,6 +396,7 @@ enum res_type {
_MEM,
_MEMSWAP,
_OOM_TYPE,
+ _KMEM,
};

#define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
@@ -1438,6 +1444,10 @@ done:
res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
+ printk(KERN_INFO "kmem: usage %llukB, limit %llukB, failcnt %llu\n",
+ res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
+ res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
+ res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
}

/*
@@ -3879,6 +3889,11 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
else
val = res_counter_read_u64(&memcg->memsw, name);
break;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+ case _KMEM:
+ val = res_counter_read_u64(&memcg->kmem, name);
+ break;
+#endif
default:
BUG();
}
@@ -3916,8 +3931,26 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
break;
if (type == _MEM)
ret = mem_cgroup_resize_limit(memcg, val);
- else
+ else if (type == _MEMSWAP)
ret = mem_cgroup_resize_memsw_limit(memcg, val);
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+ else if (type == _KMEM) {
+ ret = res_counter_set_limit(&memcg->kmem, val);
+ if (ret)
+ break;
+ /*
+ * Once enabled, can't be disabled. We could in theory
+ * disable it if we haven't yet created any caches, or
+ * if we can shrink them all to death.
+ *
+ * But it is not worth the trouble
+ */
+ if (!memcg->kmem_accounted && val != RESOURCE_MAX)
+ memcg->kmem_accounted = true;
+ }
+#endif
+ else
+ return -EINVAL;
break;
case RES_SOFT_LIMIT:
ret = res_counter_memparse_write_strategy(buffer, &val);
@@ -3982,12 +4015,20 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
case RES_MAX_USAGE:
if (type == _MEM)
res_counter_reset_max(&memcg->res);
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+ else if (type == _KMEM)
+ res_counter_reset_max(&memcg->kmem);
+#endif
else
res_counter_reset_max(&memcg->memsw);
break;
case RES_FAILCNT:
if (type == _MEM)
res_counter_reset_failcnt(&memcg->res);
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+ else if (type == _KMEM)
+ res_counter_reset_failcnt(&memcg->kmem);
+#endif
else
res_counter_reset_failcnt(&memcg->memsw);
break;
@@ -4549,6 +4590,33 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
}

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+static struct cftype kmem_cgroup_files[] = {
+ {
+ .name = "kmem.limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
+ .write_string = mem_cgroup_write,
+ .read = mem_cgroup_read,
+ },
+ {
+ .name = "kmem.usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
+ .read = mem_cgroup_read,
+ },
+ {
+ .name = "kmem.failcnt",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
+ .trigger = mem_cgroup_reset,
+ .read = mem_cgroup_read,
+ },
+ {
+ .name = "kmem.max_usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
+ .trigger = mem_cgroup_reset,
+ .read = mem_cgroup_read,
+ },
+ {},
+};
+
static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
return mem_cgroup_sockets_init(memcg, ss);
@@ -4892,6 +4960,12 @@ mem_cgroup_create(struct cgroup *cont)
int cpu;
enable_swap_cgroup();
parent = NULL;
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+ WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
+ kmem_cgroup_files));
+#endif
+
if (mem_cgroup_soft_limit_tree_init())
goto free_out;
root_mem_cgroup = memcg;
@@ -4910,6 +4984,7 @@ mem_cgroup_create(struct cgroup *cont)
if (parent && parent->use_hierarchy) {
res_counter_init(&memcg->res, &parent->res);
res_counter_init(&memcg->memsw, &parent->memsw);
+ res_counter_init(&memcg->kmem, &parent->kmem);
/*
* We increment refcnt of the parent to ensure that we can
* safely access it on res_counter_charge/uncharge.
@@ -4920,6 +4995,7 @@ mem_cgroup_create(struct cgroup *cont)
} else {
res_counter_init(&memcg->res, NULL);
res_counter_init(&memcg->memsw, NULL);
+ res_counter_init(&memcg->kmem, NULL);
}
memcg->last_scanned_node = MAX_NUMNODES;
INIT_LIST_HEAD(&memcg->oom_notify);
--
1.7.10.2
[PATCH 10/11] memcg: allow a memcg with kmem charges to be destructed. [message #46922 is a reply to message #46916] Mon, 25 June 2012 14:15 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
Because the ultimate goal of the kmem tracking in memcg is to
track slab pages as well, we can't guarantee that we'll always
be able to point a page to a particular process, and migrate
the charges along with it - since in the common case, a page
will contain data belonging to multiple processes.

Because of that, when we destroy a memcg, we only make sure
the destruction will succeed by discounting the kmem charges
from the user charges when we try to empty the cgroup.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Christoph Lameter <cl@linux.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Michal Hocko <mhocko@suse.cz>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Suleiman Souhlal <suleiman@google.com>
---
mm/memcontrol.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a6a440b..bb9b6fe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -598,6 +598,11 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
{
if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
static_key_slow_dec(&mem_cgroup_kmem_enabled_key);
+ /*
+ * This check can't live in kmem destruction function,
+ * since the charges will outlive the cgroup
+ */
+ BUG_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
}
#else
static void disarm_kmem_keys(struct mem_cgroup *memcg)
@@ -3838,6 +3843,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
int node, zid, shrink;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct cgroup *cgrp = memcg->css.cgroup;
+ u64 usage;

css_get(&memcg->css);

@@ -3877,8 +3883,10 @@ move_account:
if (ret == -ENOMEM)
goto try_to_free;
cond_resched();
+ usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
+ res_counter_read_u64(&memcg->kmem, RES_USAGE);
/* "ret" should also be checked to ensure all lists are empty. */
- } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);
+ } while (usage > 0 || ret);
out:
css_put(&memcg->css);
return ret;
--
1.7.10.2
[PATCH 07/11] mm: Allocate kernel pages to the right memcg [message #46923 is a reply to message #46916] Mon, 25 June 2012 14:15 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
When a process tries to allocate a page with the __GFP_KMEMCG flag,
the page allocator will call the corresponding memcg functions to
validate the allocation. Tasks in the root memcg can always proceed.

To avoid adding markers to the page - and a kmem flag that would
necessarily follow, as much as doing page_cgroup lookups for no
reason, whoever is marking its allocations with __GFP_KMEMCG flag
is responsible for telling the page allocator that this is such an
allocation at free_pages() time. This is done by the invocation of
__free_accounted_pages() and free_accounted_pages().

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Christoph Lameter <cl@linux.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Michal Hocko <mhocko@suse.cz>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Suleiman Souhlal <suleiman@google.com>
---
include/linux/gfp.h | 3 +++
mm/page_alloc.c | 27 +++++++++++++++++++++++++++
2 files changed, 30 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 8f4079f..4e27cd8 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -368,6 +368,9 @@ extern void free_pages(unsigned long addr, unsigned int order);
extern void free_hot_cold_page(struct page *page, int cold);
extern void free_hot_cold_page_list(struct list_head *list, int cold);

+extern void __free_accounted_pages(struct page *page, unsigned int order);
+extern void free_accounted_pages(unsigned long addr, unsigned int order);
+
#define __free_page(page) __free_pages((page), 0)
#define free_page(addr) free_pages((addr), 0)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..cb8867e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2479,6 +2479,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct page *page = NULL;
int migratetype = allocflags_to_migratetype(gfp_mask);
unsigned int cpuset_mems_cookie;
+ void *handle = NULL;

gfp_mask &= gfp_allowed_mask;

@@ -2490,6 +2491,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
return NULL;

/*
+ * Will only have any effect when __GFP_KMEMCG is set.
+ * This is verified in the (always inline) callee
+ */
+ if (!mem_cgroup_new_kmem_page(gfp_mask, &handle, order))
+ return NULL;
+
+ /*
* Check the zones suitable for the gfp_mask contain at least one
* valid zone. It's possible to have an empty zonelist as a result
* of GFP_THISNODE and a memoryless node
@@ -2528,6 +2536,8 @@ out:
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;

+ mem_cgroup_commit_kmem_page(page, handle, order);
+
return page;
}
EXPORT_SYMBOL(__alloc_pages_nodemask);
@@ -2580,6 +2590,23 @@ void free_pages(unsigned long addr, unsigned int order)

EXPORT_SYMBOL(free_pages);

+void __free_accounted_pages(struct page *page, unsigned int order)
+{
+ mem_cgroup_free_kmem_page(page, order);
+ __free_pages(page, order);
+}
+EXPORT_SYMBOL(__free_accounted_pages);
+
+void free_accounted_pages(unsigned long addr, unsigned int order)
+{
+ if (addr != 0) {
+ VM_BUG_ON(!virt_addr_valid((void *)addr));
+ mem_cgroup_free_kmem_page(virt_to_page((void *)addr), order);
+ __free_pages(virt_to_page((void *)addr), order);
+ }
+}
+EXPORT_SYMBOL(free_accounted_pages);
+
static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
{
if (addr) {
--
1.7.10.2
[PATCH 11/11] protect architectures where THREAD_SIZE &gt;= PAGE_SIZE against fork bombs [message #46924 is a reply to message #46916] Mon, 25 June 2012 14:15 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
Because those architectures will draw their stacks directly from
the page allocator, rather than the slab cache, we can directly
pass __GFP_KMEMCG flag, and issue the corresponding free_pages.

This code path is taken when the architecture doesn't define
CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the
remaining architectures fall in this category.

This will guarantee that every stack page is accounted to the memcg
the process currently lives on, and will have the allocations to fail
if they go over limit.

For the time being, I am defining a new variant of THREADINFO_GFP, not
to mess with the other path. Once the slab is also tracked by memcg,
we can get rid of that flag.

Tested to successfully protect against :(){ :|:& };:

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Christoph Lameter <cl@linux.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Michal Hocko <mhocko@suse.cz>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Suleiman Souhlal <suleiman@google.com>
CC: Frederic Weisbecker <fweisbec@redhat.com>
---
include/linux/thread_info.h | 6 ++++++
kernel/fork.c | 4 ++--
2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index ccc1899..914ec07 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -61,6 +61,12 @@ extern long do_no_restart_syscall(struct restart_block *parm);
# define THREADINFO_GFP (GFP_KERNEL | __GFP_NOTRACK)
#endif

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+# define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
+#else
+# define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP)
+#endif
+
/*
* flag set/clear/test wrappers
* - pass TIF_xxxx constants to these functions
diff --git a/kernel/fork.c b/kernel/fork.c
index ab5211b..06b414f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -142,7 +142,7 @@ void __weak arch_release_thread_info(struct thread_info *ti) { }
static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
int node)
{
- struct page *page = alloc_pages_node(node, THREADINFO_GFP,
+ struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
THREAD_SIZE_ORDER);

return page ? page_address(page) : NULL;
@@ -151,7 +151,7 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
static inline void free_thread_info(struct thread_info *ti)
{
arch_release_thread_info(ti);
- free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
+ free_accounted_pages((unsigned long)ti, THREAD_SIZE_ORDER);
}
# else
static struct kmem_cache *thread_info_cache;
--
1.7.10.2
[PATCH 06/11] memcg: kmem controller infrastructure [message #46925 is a reply to message #46916] Mon, 25 June 2012 14:15 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
This patch introduces infrastructure for tracking kernel memory pages
to a given memcg. This will happen whenever the caller includes the
flag __GFP_KMEMCG flag, and the task belong to a memcg other than
the root.

In memcontrol.h those functions are wrapped in inline acessors.
The idea is to later on, patch those with jump labels, so we don't
incur any overhead when no mem cgroups are being used.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Christoph Lameter <cl@linux.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Michal Hocko <mhocko@suse.cz>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
---
include/linux/memcontrol.h | 44 ++++++++++++
mm/memcontrol.c | 172 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 216 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 83e7ba9..22479eb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -21,6 +21,7 @@
#define _LINUX_MEMCONTROL_H
#include <linux/cgroup.h>
#include <linux/vm_event_item.h>
+#include <linux/hardirq.h>

struct mem_cgroup;
struct page_cgroup;
@@ -409,6 +410,12 @@ struct sock;
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
void sock_update_memcg(struct sock *sk);
void sock_release_memcg(struct sock *sk);
+
+#define mem_cgroup_kmem_on 1
+bool __mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order);
+void __mem_cgroup_commit_kmem_page(struct page *page, void *handle, int order);
+void __mem_cgroup_free_kmem_page(struct page *page, int order);
+#define is_kmem_tracked_alloc (gfp & __GFP_KMEMCG)
#else
static inline void sock_update_memcg(struct sock *sk)
{
@@ -416,6 +423,43 @@ static inline void sock_update_memcg(struct sock *sk)
static inline void sock_release_memcg(struct sock *sk)
{
}
+
+#define mem_cgroup_kmem_on 0
+#define __mem_cgroup_new_kmem_page(a, b, c) false
+#define __mem_cgroup_free_kmem_page(a,b )
+#define __mem_cgroup_commit_kmem_page(a, b, c)
+#define is_kmem_tracked_alloc (false)
#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
+
+static __always_inline
+bool mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order)
+{
+ if (!mem_cgroup_kmem_on)
+ return true;
+ if (!is_kmem_tracked_alloc)
+ return true;
+ if (!current->mm)
+ return true;
+ if (in_interrupt())
+ return true;
+ if (gfp & __GFP_NOFAIL)
+ return true;
+ return __mem_cgroup_new_kmem_page(gfp, handle, order);
+}
+
+static __always_inline
+void mem_cgroup_free_kmem_page(struct page *page, int order)
+{
+ if (mem_cgroup_kmem_on)
+ __mem_cgroup_free_kmem_page(page, order);
+}
+
+static __always_inline
+void mem_cgroup_commit_kmem_page(struct page *page, struct mem_cgroup *handle,
+ int order)
+{
+ if (mem_cgroup_kmem_on)
+ __mem_cgroup_commit_kmem_page(page, handle, order);
+}
#endif /* _LINUX_MEMCONTROL_H */

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6f34b77..27b2b6f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -10,6 +10,10 @@
* Copyright (C) 2009 Nokia Corporation
* Author: Kirill A. Shutemov
*
+ * Kernel Memory Controller
+ * Copyright (C) 2012 Parallels Inc. and Google Inc.
+ * Authors: Glauber Costa and Suleiman Souhlal
+ *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
@@ -422,6 +426,9 @@ static void mem_cgroup_put(struct mem_cgroup *memcg);
#include <net/ip.h>

static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
+static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta);
+static void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta);
+
void sock_update_memcg(struct sock *sk)
{
if (mem_cgroup_sockets_enabled) {
@@ -476,6 +483,105 @@ struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
}
EXPORT_SYMBOL(tcp_proto_cgroup);
#endif /* CONFIG_INET */
+
+static inline bool mem_cgroup_kmem_enabled(struct mem_cgroup *memcg)
+{
+ return !mem_cgroup_disabled() && memcg &&
+ !mem_cgroup_is_root(memcg) && memcg->kmem_accounted;
+}
+
+bool __mem_cgroup_new_kmem_page(gfp_t gfp, void *_handle, int order)
+{
+ struct mem_cgroup *memcg;
+ struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
+ bool ret = true;
+ size_t size;
+ struct task_struct *p;
+
+ *handle = NULL;
+ rcu_read_lock();
+ p = rcu_dereference(current->mm->owner);
+ memcg = mem_cgroup_from_task(p);
+ if (!mem_cgroup_kmem_enabled(memcg))
+ goto out;
+
+ mem_cgroup_get(memcg);
+
+ size = (1 << order) << PAGE_SHIFT;
+ ret = memcg_charge_kmem(memcg, gfp, size) == 0;
+ if (!ret) {
+ mem_cgroup_put(memcg);
+ goto out;
+ }
+
+ *handle = memcg;
+out:
+ rcu_read_unlock();
+ return ret;
+}
+EXPORT_SYMBOL(__mem_cgroup_new_kmem_page);
+
+void __mem_cgroup_commit_kmem_page(struct page *page, void *handle, int order)
+{
+ struct page_cgroup *pc;
+ struct mem_cgroup *memcg = handle;
+ size_t size;
+
+ if (!memcg)
+ return;
+
+ WARN_ON(mem_cgroup_is_root(memcg));
+ /* The page allocation must have failed. Revert */
+ if (!page) {
+ size = (1 << order) << PAGE_SHIFT;
+ memcg_uncharge_kmem(memcg, size);
+ mem_cgroup_put(memcg);
+ return;
+ }
+
+ pc = lookup_page_cgroup(page);
+ lock_page_cgroup(pc);
+ pc->mem_cgroup = memcg;
+ SetPageCgroupUsed(pc);
+ unlock_page_cgroup(pc);
+}
+void __mem_cgroup_free_kmem_page(struct page *page, int order)
+{
+ struct mem_cgroup *memcg;
+ size_t size;
+ struct page_cgroup *pc;
+
+ if (mem_cgroup_disabled())
+ return;
+
+ pc = lookup_page_cgroup(page);
+ lock_page_cgroup(pc);
+ memcg = pc->mem_cgroup;
+ pc->mem_cgroup = NULL;
+ if (!PageCgroupUsed(pc)) {
+ unlock_page_cgroup(pc);
+ return;
+ }
+ ClearPageCgroupUsed(pc);
+ unlock_page_cgroup(pc);
+
+ /*
+ * The classical disabled check won't work
+ * for uncharge, since it is possible that the user enabled
+ * kmem tracking, allocated, and then disabled.
+ *
+ * We trust if there is a memcg associated with the page,
+ * it is a valid allocation
+ */
+ if (!memcg)
+ return;
+
+ WARN_ON(mem_cgroup_is_root(memcg));
+ size = (1 << order) << PAGE_SHIFT;
+ memcg_uncharge_kmem(memcg, size);
+ mem_cgroup_put(memcg);
+}
+EXPORT_SYMBOL(__mem_cgroup_free_kmem_page);
#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */

#if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
@@ -5645,3 +5751,69 @@ static int __init enable_swap_account(char *s)
__setup("swapaccount=", enable_swap_account);

#endif
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
+{
+ struct res_counter *fail_res;
+ struct mem_cgroup *_memcg;
+ int may_oom, ret;
+ bool nofail = false;
+
+ may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
+ !(gfp & __GFP_NORETRY);
+
+ ret = 0;
+
+ if (!memcg)
+ return ret;
+
+ _memcg = memcg;
+ ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
+ &_memcg, may_oom);
+
+ if (ret == -EINTR) {
+ nofail = true;
+ /*
+ * __mem_cgroup_try_charge() chose to bypass to root due
+ * to OOM kill or fatal signal.
+ * Since our only options are to either fail the
+ * allocation or charge it to this cgroup, do it as
+ * a temporary condition. But we can't fail. From a kmem/slab
+ * perspective, the cache has already been selected, by
+ * mem_cgroup_get_kmem_cache(), so it is too late to change our
+ * minds
+ */
+ res_counter_charge_nofail(&memcg->res, delta, &fail_res);
+ if (do_swap_account)
+ res_counter_charge_nofail(&memcg->memsw, delta,
+ &fail_res);
+ ret = 0;
+ } else if (ret == -ENOMEM)
+ return ret;
+
+ if (nofail)
+ res_counter_charge_nofail(&memcg->kmem, delta, &fail_res);
+ else
+ ret = res_counter_charge(&memcg->kmem, delta, &fail_res);
+
+ if (ret) {
+ res_counter_uncharge(&memcg->res, delta);
+ if (do_swap_account)
+ res_counter_uncharge(&memcg->memsw, delta);
+ }
+
+ return ret;
+}
+
+void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta)
+{
+ if (!memcg)
+ return;
+
+ res_counter_uncharge(&memcg->kmem, delta);
+ res_counter_uncharge(&memcg->res, delta);
+ if (do_swap_account)
+ res_counter_uncharge(&memcg->memsw, delta);
+}
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
--
1.7.10.2
[PATCH 09/11] memcg: propagate kmem limiting information to children [message #46926 is a reply to message #46916] Mon, 25 June 2012 14:15 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
The current memcg slab cache management fails to present satisfatory hierarchical
behavior in the following scenario:

-> /cgroups/memory/A/B/C

* kmem limit set at A
* A and B empty taskwise
* bash in C does find /

Because kmem_accounted is a boolean that was not set for C, no accounting
would be done. This is, however, not what we expect.

The basic idea, is that when a cgroup is limited, we walk the tree
upwards (something Kame and I already thought about doing for other purposes),
and make sure that we store the information about the parent being limited in
kmem_accounted (that is turned into a bitmap: two booleans would not be space
efficient). The code for that is taken from sched/core.c. My reasons for not
putting it into a common place is to dodge the type issues that would arise
from a common implementation between memcg and the scheduler - but I think
that it should ultimately happen, so if you want me to do it now, let me
know.

We do the reverse operation when a formerly limited cgroup becomes unlimited.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Christoph Lameter <cl@linux.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Michal Hocko <mhocko@suse.cz>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Suleiman Souhlal <suleiman@google.com>
---
mm/memcontrol.c | 86 +++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 71 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fe5388e..a6a440b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -287,7 +287,11 @@ struct mem_cgroup {
* Should the accounting and control be hierarchical, per subtree?
*/
bool use_hierarchy;
- bool kmem_accounted;
+ /*
+ * bit0: accounted by this cgroup
+ * bit1: accounted by a parent.
+ */
+ volatile unsigned long kmem_accounted;

bool oom_lock;
atomic_t under_oom;
@@ -340,6 +344,9 @@ struct mem_cgroup {
#endif
};

+#define KMEM_ACCOUNTED_THIS 0
+#define KMEM_ACCOUNTED_PARENT 1
+
/* Stuffs for move charges at task migration. */
/*
* Types of charges to be moved. "move_charge_at_immitgrate" is treated as a
@@ -589,7 +596,7 @@ EXPORT_SYMBOL(__mem_cgroup_free_kmem_page);

static void disarm_kmem_keys(struct mem_cgroup *memcg)
{
- if (memcg->kmem_accounted)
+ if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
static_key_slow_dec(&mem_cgroup_kmem_enabled_key);
}
#else
@@ -4027,6 +4034,66 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
return simple_read_from_buffer(buf, nbytes, ppos, str, len);
}
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+static void mem_cgroup_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
+{
+ struct mem_cgroup *iter;
+
+ mutex_lock(&set_limit_mutex);
+ if (!test_and_set_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted) &&
+ val != RESOURCE_MAX) {
+
+ /*
+ * Once enabled, can't be disabled. We could in theory
+ * disable it if we haven't yet created any caches, or
+ * if we can shrink them all to death.
+ *
+ * But it is not worth the trouble
+ */
+ static_key_slow_inc(&mem_cgroup_kmem_enabled_key);
+
+ if (!memcg->use_hierarchy)
+ goto out;
+
+ for_each_mem_cgroup_tree(iter, memcg) {
+ if (iter == memcg)
+ continue;
+ set_bit(KMEM_ACCOUNTED_PARENT, &iter->kmem_accounted);
+ }
+
+ } else if (test_and_clear_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted)
+ && val == RESOURCE_MAX) {
+
+ if (!memcg->use_hierarchy)
+ goto out;
+
+ for_each_mem_cgroup_tree(iter, memcg) {
+ struct mem_cgroup *parent;
+ if (iter == memcg)
+ continue;
+ /*
+ * We should only have our parent bit cleared if none of
+ * ouri parents are accounted. The transversal order of
+ * our iter function forces us to always look at the
+ * parents.
+ */
+ parent = parent_mem_cgroup(iter);
+ while (parent && (parent != memcg)) {
+ if (test_bit(KMEM_ACCOUNTED_THIS, &parent->kmem_accounted))
+ goto noclear;
+
+ parent = parent_mem_cgroup(parent);
+ }
+ clear_bit(KMEM_ACCOUNTED_PARENT, &iter->kmem_accounted);
+noclear:
+ continue;
+ }
+ }
+out:
+ mutex_unlock(&set_limit_mutex);
+}
+#endif
/*
* The user of this function is...
* RES_LIMIT.
@@ -4064,19 +4131,8 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
ret = res_counter_set_limit(&memcg->kmem, val);
if (ret)
break;
- /*
- * Once enabled, can't be disabled. We could in theory
- * disable it if we haven't yet created any caches, or
- * if we can shrink them all to death.
- *
- * But it is not worth the trouble
- */
- mutex_lock(&set_limit_mutex);
- if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
- static_key_slow_inc(&mem_cgroup_kmem_enabled_key);
- memcg->kmem_accounted = true;
- }
- mutex_unlock(&set_limit_mutex);
+ mem_cgroup_update_kmem_limit(memcg, val);
+ break;
}
#endif
else
--
1.7.10.2
[PATCH 08/11] memcg: disable kmem code when not in use. [message #46927 is a reply to message #46916] Mon, 25 June 2012 14:15 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
We can use jump labels to patch the code in or out
when not used.

Because the assignment: memcg->kmem_accounted = true
is done after the jump labels increment, we guarantee
that the root memcg will always be selected until
all call sites are patched (see mem_cgroup_kmem_enabled).
This guarantees that no mischarges are applied.

Jump label decrement happens when the last reference
count from the memcg dies. This will only happen when
the caches are all dead.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Christoph Lameter <cl@linux.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Michal Hocko <mhocko@suse.cz>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Suleiman Souhlal <suleiman@google.com>
---
include/linux/memcontrol.h | 4 +++-
mm/memcontrol.c | 28 ++++++++++++++++++++++++++--
2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 22479eb..4d69ff8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -22,6 +22,7 @@
#include <linux/cgroup.h>
#include <linux/vm_event_item.h>
#include <linux/hardirq.h>
+#include <linux/jump_label.h>

struct mem_cgroup;
struct page_cgroup;
@@ -411,7 +412,8 @@ struct sock;
void sock_update_memcg(struct sock *sk);
void sock_release_memcg(struct sock *sk);

-#define mem_cgroup_kmem_on 1
+extern struct static_key mem_cgroup_kmem_enabled_key;
+#define mem_cgroup_kmem_on static_key_false(&mem_cgroup_kmem_enabled_key)
bool __mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order);
void __mem_cgroup_commit_kmem_page(struct page *page, void *handle, int order);
void __mem_cgroup_free_kmem_page(struct page *page, int order);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 27b2b6f..fe5388e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -425,6 +425,10 @@ static void mem_cgroup_put(struct mem_cgroup *memcg);
#include <net/sock.h>
#include <net/ip.h>

+struct static_key mem_cgroup_kmem_enabled_key;
+/* so modules can inline the checks */
+EXPORT_SYMBOL(mem_cgroup_kmem_enabled_key);
+
static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta);
static void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta);
@@ -582,6 +586,16 @@ void __mem_cgroup_free_kmem_page(struct page *page, int order)
mem_cgroup_put(memcg);
}
EXPORT_SYMBOL(__mem_cgroup_free_kmem_page);
+
+static void disarm_kmem_keys(struct mem_cgroup *memcg)
+{
+ if (memcg->kmem_accounted)
+ static_key_slow_dec(&mem_cgroup_kmem_enabled_key);
+}
+#else
+static void disarm_kmem_keys(struct mem_cgroup *memcg)
+{
+}
#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */

#if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
@@ -597,6 +611,12 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
}
#endif

+static void disarm_static_keys(struct mem_cgroup *memcg)
+{
+ disarm_sock_keys(memcg);
+ disarm_kmem_keys(memcg);
+}
+
static void drain_all_stock_async(struct mem_cgroup *memcg);

static struct mem_cgroup_per_zone *
@@ -4051,8 +4071,12 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
*
* But it is not worth the trouble
*/
- if (!memcg->kmem_accounted && val != RESOURCE_MAX)
+ mutex_lock(&set_limit_mutex);
+ if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
+ static_key_slow_inc(&mem_cgroup_kmem_enabled_key);
memcg->kmem_accounted = true;
+ }
+ mutex_unlock(&set_limit_mutex);
}
#endif
else
@@ -4927,7 +4951,7 @@ static void free_work(struct work_struct *work)
* to move this code around, and make sure it is outside
* the cgroup_lock.
*/
- disarm_sock_keys(memcg);
+ disarm_static_keys(memcg);
if (size < PAGE_SIZE)
kfree(memcg);
else
--
1.7.10.2
Re: [PATCH 01/11] memcg: Make it possible to use the stock for more than one page. [message #46928 is a reply to message #46919] Mon, 25 June 2012 17:44 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
Hey, Glauber.

Just a couple nits.

On Mon, Jun 25, 2012 at 06:15:18PM +0400, Glauber Costa wrote:
> From: Suleiman Souhlal <ssouhlal@FreeBSD.org>

It would be nice to explain why this is being done. Just a simple
statement like - "prepare for XXX" or "will be needed by XXX".

> /*
> - * Try to consume stocked charge on this cpu. If success, one page is consumed
> - * from local stock and true is returned. If the stock is 0 or charges from a
> - * cgroup which is not current target, returns false. This stock will be
> - * refilled.
> + * Try to consume stocked charge on this cpu. If success, nr_pages pages are
> + * consumed from local stock and true is returned. If the stock is 0 or
> + * charges from a cgroup which is not current target, returns false.
> + * This stock will be refilled.

I hope this were converted to proper /** function comment with
arguments and return value description.

Thanks.

--
tejun
Re: [PATCH 06/11] memcg: kmem controller infrastructure [message #46929 is a reply to message #46925] Mon, 25 June 2012 18:06 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
Again, nits.

On Mon, Jun 25, 2012 at 06:15:23PM +0400, Glauber Costa wrote:
> +#define mem_cgroup_kmem_on 1
> +bool __mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order);
> +void __mem_cgroup_commit_kmem_page(struct page *page, void *handle, int order);
> +void __mem_cgroup_free_kmem_page(struct page *page, int order);
> +#define is_kmem_tracked_alloc (gfp & __GFP_KMEMCG)

Ugh... please do the following instead.

static inline bool is_kmem_tracked_alloc(gfp_t gfp)
{
return gfp & __GFP_KMEMCG;
}

> #else
> static inline void sock_update_memcg(struct sock *sk)
> {
> @@ -416,6 +423,43 @@ static inline void sock_update_memcg(struct sock *sk)
> static inline void sock_release_memcg(struct sock *sk)
> {
> }
> +
> +#define mem_cgroup_kmem_on 0
> +#define __mem_cgroup_new_kmem_page(a, b, c) false
> +#define __mem_cgroup_free_kmem_page(a,b )
> +#define __mem_cgroup_commit_kmem_page(a, b, c)
> +#define is_kmem_tracked_alloc (false)

I would prefer static inlines here too. It's a bit more code in the
header but leads to less surprises (e.g. arg evals w/ side effects or
compiler warning about unused vars) and makes it easier to avoid
cosmetic errors.

Thanks.

--
tejun
Re: [PATCH 07/11] mm: Allocate kernel pages to the right memcg [message #46930 is a reply to message #46923] Mon, 25 June 2012 18:07 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
On Mon, Jun 25, 2012 at 06:15:24PM +0400, Glauber Costa wrote:
> When a process tries to allocate a page with the __GFP_KMEMCG flag,
> the page allocator will call the corresponding memcg functions to
> validate the allocation. Tasks in the root memcg can always proceed.
>
> To avoid adding markers to the page - and a kmem flag that would
> necessarily follow, as much as doing page_cgroup lookups for no
> reason, whoever is marking its allocations with __GFP_KMEMCG flag
> is responsible for telling the page allocator that this is such an
> allocation at free_pages() time. This is done by the invocation of
> __free_accounted_pages() and free_accounted_pages().

Shouldn't we be documenting that in the code somewhere, preferably in
the function comments?

--
tejun
Re: [PATCH 09/11] memcg: propagate kmem limiting information to children [message #46931 is a reply to message #46926] Mon, 25 June 2012 18:29 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
Feeling like a nit pervert but..

On Mon, Jun 25, 2012 at 06:15:26PM +0400, Glauber Costa wrote:
> @@ -287,7 +287,11 @@ struct mem_cgroup {
> * Should the accounting and control be hierarchical, per subtree?
> */
> bool use_hierarchy;
> - bool kmem_accounted;
> + /*
> + * bit0: accounted by this cgroup
> + * bit1: accounted by a parent.
> + */
> + volatile unsigned long kmem_accounted;

Is the volatile declaration really necessary? Why is it necessary?
Why no comment explaining it?

> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +static void mem_cgroup_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
> +{
> + struct mem_cgroup *iter;
> +
> + mutex_lock(&set_limit_mutex);
> + if (!test_and_set_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted) &&
> + val != RESOURCE_MAX) {
> +
> + /*
> + * Once enabled, can't be disabled. We could in theory
> + * disable it if we haven't yet created any caches, or
> + * if we can shrink them all to death.
> + *
> + * But it is not worth the trouble
> + */
> + static_key_slow_inc(&mem_cgroup_kmem_enabled_key);
> +
> + if (!memcg->use_hierarchy)
> + goto out;
> +
> + for_each_mem_cgroup_tree(iter, memcg) {
> + if (iter == memcg)
> + continue;
> + set_bit(KMEM_ACCOUNTED_PARENT, &iter->kmem_accounted);
> + }
> +
> + } else if (test_and_clear_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted)
> + && val == RESOURCE_MAX) {
> +
> + if (!memcg->use_hierarchy)
> + goto out;
> +
> + for_each_mem_cgroup_tree(iter, memcg) {
> + struct mem_cgroup *parent;

Blank line between decl and body please.

> + if (iter == memcg)
> + continue;
> + /*
> + * We should only have our parent bit cleared if none of
> + * ouri parents are accounted. The transversal order of

^ type

> + * our iter function forces us to always look at the
> + * parents.

Also, it's okay here but the text filling in comments and patch
descriptions tend to be quite inconsistent. If you're on emacs, alt-q
is your friend and I'm sure vim can do text filling pretty nicely too.

> + */
> + parent = parent_mem_cgroup(iter);
> + while (parent && (parent != memcg)) {
> + if (test_bit(KMEM_ACCOUNTED_THIS, &parent->kmem_accounted))
> + goto noclear;
> +
> + parent = parent_mem_cgroup(parent);
> + }

Better written in for (;;)? Also, if we're breaking on parent ==
memcg, can we ever hit NULL parent in the above loop?

> + clear_bit(KMEM_ACCOUNTED_PARENT, &iter->kmem_accounted);
> +noclear:
> + continue;
> + }
> + }
> +out:
> + mutex_unlock(&set_limit_mutex);

Can we please branch on val != RECOURSE_MAX first? I'm not even sure
whether the above conditionals are correct. If the user updates an
existing kmem limit, the first test_and_set_bit() returns non-zero, so
the code proceeds onto clearing KMEM_ACCOUNTED_THIS, which succeeds
but val == RESOURCE_MAX fails so it doesn't do anything. If the user
changes it again, it will set ACCOUNTED_THIS again. So, changing an
existing kmem limit toggles KMEM_ACCOUNTED_THIS, which just seems
wacky to me.

Thanks.

--
tejun
Re: [PATCH 10/11] memcg: allow a memcg with kmem charges to be destructed. [message #46932 is a reply to message #46922] Mon, 25 June 2012 18:34 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
On Mon, Jun 25, 2012 at 06:15:27PM +0400, Glauber Costa wrote:
> Because the ultimate goal of the kmem tracking in memcg is to
> track slab pages as well, we can't guarantee that we'll always
> be able to point a page to a particular process, and migrate
> the charges along with it - since in the common case, a page
> will contain data belonging to multiple processes.
>
> Because of that, when we destroy a memcg, we only make sure
> the destruction will succeed by discounting the kmem charges
> from the user charges when we try to empty the cgroup.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Christoph Lameter <cl@linux.com>
> CC: Pekka Enberg <penberg@cs.helsinki.fi>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Suleiman Souhlal <suleiman@google.com>
> ---
> mm/memcontrol.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a6a440b..bb9b6fe 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -598,6 +598,11 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
> {
> if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
> static_key_slow_dec(&mem_cgroup_kmem_enabled_key);
> + /*
> + * This check can't live in kmem destruction function,
> + * since the charges will outlive the cgroup
> + */
> + BUG_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);

WARN_ON() please. Misaccounted kernel usually is better than dead
kernel.

Thanks.

--
tejun
Re: [PATCH 11/11] protect architectures where THREAD_SIZE &gt;= PAGE_SIZE against fork bombs [message #46933 is a reply to message #46924] Mon, 25 June 2012 18:38 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
On Mon, Jun 25, 2012 at 06:55:35PM +0200, Frederic Weisbecker wrote:
> On 06/25/2012 04:15 PM, Glauber Costa wrote:
>
> > Because those architectures will draw their stacks directly from
> > the page allocator, rather than the slab cache, we can directly
> > pass __GFP_KMEMCG flag, and issue the corresponding free_pages.
> >
> > This code path is taken when the architecture doesn't define
> > CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
> > THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the
> > remaining architectures fall in this category.
> >
> > This will guarantee that every stack page is accounted to the memcg
> > the process currently lives on, and will have the allocations to fail
> > if they go over limit.
> >
> > For the time being, I am defining a new variant of THREADINFO_GFP, not
> > to mess with the other path. Once the slab is also tracked by memcg,
> > we can get rid of that flag.
> >
> > Tested to successfully protect against :(){ :|:& };:
> >
> > Signed-off-by: Glauber Costa <glommer@parallels.com>
> > CC: Christoph Lameter <cl@linux.com>
> > CC: Pekka Enberg <penberg@cs.helsinki.fi>
> > CC: Michal Hocko <mhocko@suse.cz>
> > CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > CC: Johannes Weiner <hannes@cmpxchg.org>
> > CC: Suleiman Souhlal <suleiman@google.com>
>
>
> Acked-by: Frederic Weisbecker <fweisbec@redhat.com>

Frederic, does this (with proper slab accounting added later) achieve
what you wanted with the task counter?

Thanks.

--
tejun
Re: [PATCH 11/11] protect architectures where THREAD_SIZE &gt;= PAGE_SIZE against fork bombs [message #46935 is a reply to message #46933] Mon, 25 June 2012 20:57 Go to previous messageGo to next message
Frederic Weisbecker is currently offline  Frederic Weisbecker
Messages: 25
Registered: April 2012
Junior Member
On Mon, Jun 25, 2012 at 11:38:18AM -0700, Tejun Heo wrote:
> On Mon, Jun 25, 2012 at 06:55:35PM +0200, Frederic Weisbecker wrote:
> > On 06/25/2012 04:15 PM, Glauber Costa wrote:
> >
> > > Because those architectures will draw their stacks directly from
> > > the page allocator, rather than the slab cache, we can directly
> > > pass __GFP_KMEMCG flag, and issue the corresponding free_pages.
> > >
> > > This code path is taken when the architecture doesn't define
> > > CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
> > > THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the
> > > remaining architectures fall in this category.
> > >
> > > This will guarantee that every stack page is accounted to the memcg
> > > the process currently lives on, and will have the allocations to fail
> > > if they go over limit.
> > >
> > > For the time being, I am defining a new variant of THREADINFO_GFP, not
> > > to mess with the other path. Once the slab is also tracked by memcg,
> > > we can get rid of that flag.
> > >
> > > Tested to successfully protect against :(){ :|:& };:
> > >
> > > Signed-off-by: Glauber Costa <glommer@parallels.com>
> > > CC: Christoph Lameter <cl@linux.com>
> > > CC: Pekka Enberg <penberg@cs.helsinki.fi>
> > > CC: Michal Hocko <mhocko@suse.cz>
> > > CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > CC: Johannes Weiner <hannes@cmpxchg.org>
> > > CC: Suleiman Souhlal <suleiman@google.com>
> >
> >
> > Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
>
> Frederic, does this (with proper slab accounting added later) achieve
> what you wanted with the task counter?

I think so yeah. Relying on general kernel memory accounting should do
the trick for us. And if we need more finegrained limitation on kernel
stack accounting we can still add it incrementally. But I believe global
limitation can be enough.

Thanks.
Re: [PATCH 10/11] memcg: allow a memcg with kmem charges to be destructed. [message #46936 is a reply to message #46932] Mon, 25 June 2012 22:25 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 06/25/2012 10:34 PM, Tejun Heo wrote:
> On Mon, Jun 25, 2012 at 06:15:27PM +0400, Glauber Costa wrote:
>> Because the ultimate goal of the kmem tracking in memcg is to
>> track slab pages as well, we can't guarantee that we'll always
>> be able to point a page to a particular process, and migrate
>> the charges along with it - since in the common case, a page
>> will contain data belonging to multiple processes.
>>
>> Because of that, when we destroy a memcg, we only make sure
>> the destruction will succeed by discounting the kmem charges
>> from the user charges when we try to empty the cgroup.
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> CC: Christoph Lameter <cl@linux.com>
>> CC: Pekka Enberg <penberg@cs.helsinki.fi>
>> CC: Michal Hocko <mhocko@suse.cz>
>> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Johannes Weiner <hannes@cmpxchg.org>
>> CC: Suleiman Souhlal <suleiman@google.com>
>> ---
>> mm/memcontrol.c | 10 +++++++++-
>> 1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index a6a440b..bb9b6fe 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -598,6 +598,11 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
>> {
>> if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
>> static_key_slow_dec(&mem_cgroup_kmem_enabled_key);
>> + /*
>> + * This check can't live in kmem destruction function,
>> + * since the charges will outlive the cgroup
>> + */
>> + BUG_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
>
> WARN_ON() please. Misaccounted kernel usually is better than dead
> kernel.
>

You're absolutely right, will change.
Re: [PATCH 07/11] mm: Allocate kernel pages to the right memcg [message #46938 is a reply to message #46930] Mon, 25 June 2012 22:27 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 06/25/2012 10:07 PM, Tejun Heo wrote:
> On Mon, Jun 25, 2012 at 06:15:24PM +0400, Glauber Costa wrote:
>> When a process tries to allocate a page with the __GFP_KMEMCG flag,
>> the page allocator will call the corresponding memcg functions to
>> validate the allocation. Tasks in the root memcg can always proceed.
>>
>> To avoid adding markers to the page - and a kmem flag that would
>> necessarily follow, as much as doing page_cgroup lookups for no
>> reason, whoever is marking its allocations with __GFP_KMEMCG flag
>> is responsible for telling the page allocator that this is such an
>> allocation at free_pages() time. This is done by the invocation of
>> __free_accounted_pages() and free_accounted_pages().
>
> Shouldn't we be documenting that in the code somewhere, preferably in
> the function comments?
>
I can certainly do that.
Re: [PATCH 06/11] memcg: kmem controller infrastructure [message #46939 is a reply to message #46929] Mon, 25 June 2012 22:28 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 06/25/2012 10:06 PM, Tejun Heo wrote:
> Again, nits.
>
> On Mon, Jun 25, 2012 at 06:15:23PM +0400, Glauber Costa wrote:
>> +#define mem_cgroup_kmem_on 1
>> +bool __mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order);
>> +void __mem_cgroup_commit_kmem_page(struct page *page, void *handle, int order);
>> +void __mem_cgroup_free_kmem_page(struct page *page, int order);
>> +#define is_kmem_tracked_alloc (gfp & __GFP_KMEMCG)
>
> Ugh... please do the following instead.
>
> static inline bool is_kmem_tracked_alloc(gfp_t gfp)
> {
> return gfp & __GFP_KMEMCG;
> }
>
>> #else
>> static inline void sock_update_memcg(struct sock *sk)
>> {
>> @@ -416,6 +423,43 @@ static inline void sock_update_memcg(struct sock *sk)
>> static inline void sock_release_memcg(struct sock *sk)
>> {
>> }
>> +
>> +#define mem_cgroup_kmem_on 0
>> +#define __mem_cgroup_new_kmem_page(a, b, c) false
>> +#define __mem_cgroup_free_kmem_page(a,b )
>> +#define __mem_cgroup_commit_kmem_page(a, b, c)
>> +#define is_kmem_tracked_alloc (false)
>
> I would prefer static inlines here too. It's a bit more code in the
> header but leads to less surprises (e.g. arg evals w/ side effects or
> compiler warning about unused vars) and makes it easier to avoid
> cosmetic errors.
>
> Thanks.
>

Sure thing.
Re: [PATCH 01/11] memcg: Make it possible to use the stock for more than one page. [message #46940 is a reply to message #46928] Mon, 25 June 2012 22:29 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 06/25/2012 09:44 PM, Tejun Heo wrote:
> Hey, Glauber.
>
> Just a couple nits.
>
> On Mon, Jun 25, 2012 at 06:15:18PM +0400, Glauber Costa wrote:
>> From: Suleiman Souhlal <ssouhlal@FreeBSD.org>
>
> It would be nice to explain why this is being done. Just a simple
> statement like - "prepare for XXX" or "will be needed by XXX".

I picked this patch from Suleiman Souhlal, and tried to keep it as close
as possible to his version.

But for the sake of documentation, I can do that, yes.
Re: [PATCH 01/11] memcg: Make it possible to use the stock for more than one page. [message #46941 is a reply to message #46940] Mon, 25 June 2012 22:33 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
Hello,

On Mon, Jun 25, 2012 at 3:29 PM, Glauber Costa <glommer@parallels.com> wrote:
>> It would be nice to explain why this is being done.  Just a simple
>> statement like - "prepare for XXX" or "will be needed by XXX".
>
>
> I picked this patch from Suleiman Souhlal, and tried to keep it as close as
> possible to his version.
>
> But for the sake of documentation, I can do that, yes.

Yeah, that would be great. Also, the patch does change behavior,
right? Explaining / justifying that a bit would be nice too.

Thanks!

--
tejun
Re: [PATCH 09/11] memcg: propagate kmem limiting information to children [message #46942 is a reply to message #46931] Mon, 25 June 2012 22:36 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 06/25/2012 10:29 PM, Tejun Heo wrote:
> Feeling like a nit pervert but..
>
> On Mon, Jun 25, 2012 at 06:15:26PM +0400, Glauber Costa wrote:
>> @@ -287,7 +287,11 @@ struct mem_cgroup {
>> * Should the accounting and control be hierarchical, per subtree?
>> */
>> bool use_hierarchy;
>> - bool kmem_accounted;
>> + /*
>> + * bit0: accounted by this cgroup
>> + * bit1: accounted by a parent.
>> + */
>> + volatile unsigned long kmem_accounted;
>
> Is the volatile declaration really necessary? Why is it necessary?
> Why no comment explaining it?

Seems to be required by set_bit and friends. gcc will complain if it is
not volatile (take a look at the bit function headers)

>> +
>> + for_each_mem_cgroup_tree(iter, memcg) {
>> + struct mem_cgroup *parent;
>
> Blank line between decl and body please.
ok.

>
>> + if (iter == memcg)
>> + continue;
>> + /*
>> + * We should only have our parent bit cleared if none of
>> + * ouri parents are accounted. The transversal order of
>
> ^ type
>
>> + * our iter function forces us to always look at the
>> + * parents.
>
> Also, it's okay here but the text filling in comments and patch
> descriptions tend to be quite inconsistent. If you're on emacs, alt-q
> is your friend and I'm sure vim can do text filling pretty nicely too.
>
>> + */
>> + parent = parent_mem_cgroup(iter);
>> + while (parent && (parent != memcg)) {
>> + if (test_bit(KMEM_ACCOUNTED_THIS, &parent->kmem_accounted))
>> + goto noclear;
>> +
>> + parent = parent_mem_cgroup(parent);
>> + }
>
> Better written in for (;;)? Also, if we're breaking on parent ==
> memcg, can we ever hit NULL parent in the above loop?

I can simplify to test parent != memcg only, indeed it is not expected
to be NULL (but if it happens to be due to any kind of bug, we protect
against NULL-dereference, that is why I like to write this way)

>> + continue;
>> + }
>> + }
>> +out:
>> + mutex_unlock(&set_limit_mutex);
>
> Can we please branch on val != RECOURSE_MAX first? I'm not even sure
> whether the above conditionals are correct. If the user updates an
> existing kmem limit, the first test_and_set_bit() returns non-zero, so
> the code proceeds onto clearing KMEM_ACCOUNTED_THIS, which succeeds
> but val == RESOURCE_MAX fails so it doesn't do anything. If the user
> changes it again, it will set ACCOUNTED_THIS again. So, changing an
> existing kmem limit toggles KMEM_ACCOUNTED_THIS, which just seems
> wacky to me.
>

I will take a look at that tomorrow as well.
Re: [PATCH 09/11] memcg: propagate kmem limiting information to children [message #46943 is a reply to message #46942] Mon, 25 June 2012 22:49 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
Hello, Glauber.

On Tue, Jun 26, 2012 at 02:36:27AM +0400, Glauber Costa wrote:
> >Is the volatile declaration really necessary? Why is it necessary?
> >Why no comment explaining it?
>
> Seems to be required by set_bit and friends. gcc will complain if it
> is not volatile (take a look at the bit function headers)

Hmmm? Are you sure gcc includes volatile in type check? There are a
lot of bitops users in the kernel but most of them don't use volatile
decl on the variable.

> >>+ */
> >>+ parent = parent_mem_cgroup(iter);
> >>+ while (parent && (parent != memcg)) {
> >>+ if (test_bit(KMEM_ACCOUNTED_THIS, &parent->kmem_accounted))
> >>+ goto noclear;
> >>+
> >>+ parent = parent_mem_cgroup(parent);
> >>+ }
> >
> >Better written in for (;;)? Also, if we're breaking on parent ==
> >memcg, can we ever hit NULL parent in the above loop?
>
> I can simplify to test parent != memcg only, indeed it is not
> expected to be NULL (but if it happens to be due to any kind of bug,
> we protect against NULL-dereference, that is why I like to write
> this way)

I personally don't really like that. It doesn't really add meaningful
protection (if that happens the tree walking is already severely
broken) while causes confusion to future readers of the code (when can
parent be NULL?).

Thanks.

--
tejun
Re: [PATCH 06/11] memcg: kmem controller infrastructure [message #46944 is a reply to message #46925] Mon, 25 June 2012 23:17 Go to previous messageGo to next message
akpm is currently offline  akpm
Messages: 224
Registered: March 2007
Senior Member
On Mon, 25 Jun 2012 18:15:23 +0400
Glauber Costa <glommer@parallels.com> wrote:

> This patch introduces infrastructure for tracking kernel memory pages
> to a given memcg. This will happen whenever the caller includes the
> flag __GFP_KMEMCG flag, and the task belong to a memcg other than
> the root.
>
> In memcontrol.h those functions are wrapped in inline acessors.
> The idea is to later on, patch those with jump labels, so we don't
> incur any overhead when no mem cgroups are being used.
>
>
> ...
>
> @@ -416,6 +423,43 @@ static inline void sock_update_memcg(struct sock *sk)
> static inline void sock_release_memcg(struct sock *sk)
> {
> }
> +
> +#define mem_cgroup_kmem_on 0
> +#define __mem_cgroup_new_kmem_page(a, b, c) false
> +#define __mem_cgroup_free_kmem_page(a,b )
> +#define __mem_cgroup_commit_kmem_page(a, b, c)

I suggest that the naming consistently follow the model
"mem_cgroup_kmem_foo". So "mem_cgroup_kmem_" becomes the well-known
identifier for this subsystem.

Then, s/mem_cgroup/memcg/g/ - show us some mercy here!

> +#define is_kmem_tracked_alloc (false)

memcg_kmem_tracked_alloc, perhaps. But what does this actually do?

<looks>

eww, ick.

> +#define is_kmem_tracked_alloc (gfp & __GFP_KMEMCG)

What Tejun said. This:

/*
* Nice comment goes here
*/
static inline bool memcg_kmem_tracked_alloc(gfp_t gfp)
{
return gfp & __GFP_KMEMCG;
}

> #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
> +
> +static __always_inline
> +bool mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order)

memcg_kmem_new_page().

> +{
> + if (!mem_cgroup_kmem_on)
> + return true;
> + if (!is_kmem_tracked_alloc)
> + return true;
> + if (!current->mm)
> + return true;
> + if (in_interrupt())
> + return true;
> + if (gfp & __GFP_NOFAIL)
> + return true;
> + return __mem_cgroup_new_kmem_page(gfp, handle, order);
> +}

Add documentation, please. The semantics of the return value are
inscrutable.

> +static __always_inline
> +void mem_cgroup_free_kmem_page(struct page *page, int order)
> +{
> + if (mem_cgroup_kmem_on)
> + __mem_cgroup_free_kmem_page(page, order);
> +}

memcg_kmem_free_page().

> +static __always_inline
> +void mem_cgroup_commit_kmem_page(struct page *page, struct mem_cgroup *handle,
> + int order)
> +{
> + if (mem_cgroup_kmem_on)
> + __mem_cgroup_commit_kmem_page(page, handle, order);
> +}

memcg_kmem_commit_page().

> #endif /* _LINUX_MEMCONTROL_H */
>
>
> ...
>
> +static inline bool mem_cgroup_kmem_enabled(struct mem_cgroup *memcg)
> +{
> + return !mem_cgroup_disabled() && memcg &&
> + !mem_cgroup_is_root(memcg) && memcg->kmem_accounted;
> +}

Does this really need to handle a memcg==NULL?

> +bool __mem_cgroup_new_kmem_page(gfp_t gfp, void *_handle, int order)
> +{
> + struct mem_cgroup *memcg;
> + struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
> + bool ret = true;
> + size_t size;
> + struct task_struct *p;
> +
> + *handle = NULL;
> + rcu_read_lock();
> + p = rcu_dereference(current->mm->owner);
> + memcg = mem_cgroup_from_task(p);
> + if (!mem_cgroup_kmem_enabled(memcg))
> + goto out;
> +
> + mem_cgroup_get(memcg);
> +
> + size = (1 << order) << PAGE_SHIFT;

size = PAGE_SIZE << order;

is simpler and more typical.

> + ret = memcg_charge_kmem(memcg, gfp, size) == 0;

Odd. memcg_charge_kmem() returns a nice errno, buit this conversion
just drops that information on the floor. If the
mem_cgroup_new_kmem_page() return value had been documented, I might
have been able to understand the thinking here. But it wasn't, so I
couldn't.

> + if (!ret) {
> + mem_cgroup_put(memcg);
> + goto out;
> + }
> +
> + *handle = memcg;
> +out:
> + rcu_read_unlock();
> + return ret;
> +}
> +EXPORT_SYMBOL(__mem_cgroup_new_kmem_page);
> +
> +void __mem_cgroup_commit_kmem_page(struct page *page, void *handle, int order)
> +{
> + struct page_cgroup *pc;
> + struct mem_cgroup *memcg = handle;
> + size_t size;
> +
> + if (!memcg)
> + return;
> +
> + WARN_ON(mem_cgroup_is_root(memcg));
> + /* The page allocation must have failed. Revert */
> + if (!page) {
> + size = (1 << order) << PAGE_SHIFT;

PAGE_SIZE << order

> + memcg_uncharge_kmem(memcg, size);
> + mem_cgroup_put(memcg);
> + return;
> + }

This all looks a bit odd. After a failure you run a commit which
undoes the speculative charging. I guess it makes sense. It's the
sort of thing which can be expounded upon in the documentation whcih
isn't there, sigh.

> + pc = lookup_page_cgroup(page);
> + lock_page_cgroup(pc);
> + pc->mem_cgroup = memcg;
> + SetPageCgroupUsed(pc);
> + unlock_page_cgroup(pc);
> +}

missed a newline there.

> +void __mem_cgroup_free_kmem_page(struct page *page, int order)
> +{
> + struct mem_cgroup *memcg;
> + size_t size;
> + struct page_cgroup *pc;
> +
> + if (mem_cgroup_disabled())
> + return;
> +
> + pc = lookup_page_cgroup(page);
> + lock_page_cgroup(pc);
> + memcg = pc->mem_cgroup;
> + pc->mem_cgroup = NULL;
> + if (!PageCgroupUsed(pc)) {
> + unlock_page_cgroup(pc);
> + return;
> + }
> + ClearPageCgroupUsed(pc);
> + unlock_page_cgroup(pc);
> +
> + /*
> + * The classical disabled check won't work

What is "The classical disabled check"? Be specific. Testing
mem_cgroup_kmem_on?

> + * for uncharge, since it is possible that the user enabled
> + * kmem tracking, allocated, and then disabled.
> + *
> + * We trust if there is a memcg associated with the page,
> + * it is a valid allocation
> + */
> + if (!memcg)
> + return;
> +
> + WARN_ON(mem_cgroup_is_root(memcg));
> + size = (1 << order) << PAGE_SHIFT;

PAGE_SIZE << order

> + memcg_uncharge_kmem(memcg, size);
> + mem_cgroup_put(memcg);
> +}
> +EXPORT_SYMBOL(__mem_cgroup_free_kmem_page);
> #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
>
> #if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
> @@ -5645,3 +5751,69 @@ static int __init enable_swap_account(char *s)
> __setup("swapaccount=", enable_swap_account);
>
> #endif
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM

gargh. CONFIG_MEMCG_KMEM, please!

> +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
> +{
> + struct res_counter *fail_res;
> + struct mem_cgroup *_memcg;
> + int may_oom, ret;
> + bool nofail = false;
> +
> + may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
> + !(gfp & __GFP_NORETRY);

may_oom should have bool type.

> + ret = 0;
> +
> + if (!memcg)
> + return ret;
> +
> + _memcg = memcg;
> + ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
> + &_memcg, may_oom);
> +
> + if (ret == -EINTR) {
> + nofail = true;
> + /*
> + * __mem_cgroup_try_charge() chose to bypass to root due
> + * to OOM kill or fatal signal.

Is "bypass" correct? Maybe "fall back"?

> + * Since our only options are to either fail the
> + * allocation or charge it to this cgroup, do it as
> + * a temporary condition. But we can't fail. From a kmem/slab
> + * perspective, the cache has already been selected, by
> + * mem_cgroup_get_kmem_cache(), so it is too late to change our
> + * minds
> + */
> + res_counter_charge_nofail(&memcg->res, delta, &fail_res);
> + if (do_swap_account)
> + res_counter_charge_nofail(&memcg->memsw, delta,
> + &fail_res);
> + ret = 0;
> + } else if (ret == -ENOMEM)
> + return ret;
> +
> + if (nofail)
> + res_counter_charge_nofail(&memcg->kmem, delta, &fail_res);
> + else
> + ret = res_counter_charge(&memcg->kmem, delta, &fail_res);
> +
> + if (ret) {
> + res_counter_uncharge(&memcg->res, delta);
> + if (do_swap_account)
> + res_counter_uncharge(&memcg->memsw, delta);
> + }
> +
> + return ret;
> +}
>
> ...
>
...

Re: [PATCH 09/11] memcg: propagate kmem limiting information to children [message #46945 is a reply to message #46942] Mon, 25 June 2012 23:21 Go to previous messageGo to next message
akpm is currently offline  akpm
Messages: 224
Registered: March 2007
Senior Member
On Tue, 26 Jun 2012 02:36:27 +0400
Glauber Costa <glommer@parallels.com> wrote:

> On 06/25/2012 10:29 PM, Tejun Heo wrote:
> > Feeling like a nit pervert but..
> >
> > On Mon, Jun 25, 2012 at 06:15:26PM +0400, Glauber Costa wrote:
> >> @@ -287,7 +287,11 @@ struct mem_cgroup {
> >> * Should the accounting and control be hierarchical, per subtree?
> >> */
> >> bool use_hierarchy;
> >> - bool kmem_accounted;
> >> + /*
> >> + * bit0: accounted by this cgroup
> >> + * bit1: accounted by a parent.
> >> + */
> >> + volatile unsigned long kmem_accounted;
> >
> > Is the volatile declaration really necessary? Why is it necessary?
> > Why no comment explaining it?
>
> Seems to be required by set_bit and friends. gcc will complain if it is
> not volatile (take a look at the bit function headers)

That would be a broken gcc. We run test_bit()/set_bit() and friends
against plain old `unsigned long' in thousands of places. There's
nothing special about this one!
Re: [PATCH 09/11] memcg: propagate kmem limiting information to children [message #46946 is a reply to message #46926] Mon, 25 June 2012 23:23 Go to previous messageGo to next message
akpm is currently offline  akpm
Messages: 224
Registered: March 2007
Senior Member
On Mon, 25 Jun 2012 18:15:26 +0400
Glauber Costa <glommer@parallels.com> wrote:

> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -287,7 +287,11 @@ struct mem_cgroup {
> * Should the accounting and control be hierarchical, per subtree?
> */
> bool use_hierarchy;
> - bool kmem_accounted;
> + /*
> + * bit0: accounted by this cgroup
> + * bit1: accounted by a parent.
> + */
> + volatile unsigned long kmem_accounted;

I suggest

unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */

> bool oom_lock;
> atomic_t under_oom;
> @@ -340,6 +344,9 @@ struct mem_cgroup {
> #endif
> };
>
> +#define KMEM_ACCOUNTED_THIS 0
> +#define KMEM_ACCOUNTED_PARENT 1

And then document the fields here.
Re: [PATCH 00/11] kmem controller for memcg: stripped down version [message #46947 is a reply to message #46916] Mon, 25 June 2012 23:27 Go to previous messageGo to next message
akpm is currently offline  akpm
Messages: 224
Registered: March 2007
Senior Member
On Mon, 25 Jun 2012 18:15:17 +0400
Glauber Costa <glommer@parallels.com> wrote:

> What I am proposing with this series is a stripped down version of the
> kmem controller for memcg that would allow us to merge significant parts
> of the infrastructure, while leaving out, for now, the polemic bits about
> the slab while it is being reworked by Cristoph.
>
> Me reasoning for that is that after the last change to introduce a gfp
> flag to mark kernel allocations, it became clear to me that tracking other
> resources like the stack would then follow extremely naturaly. I figured
> that at some point we'd have to solve the issue pointed by David, and avoid
> testing the Slab flag in the page allocator, since it would soon be made
> more generic. I do that by having the callers to explicit mark it.
>
> So to demonstrate how it would work, I am introducing a stack tracker here,
> that is already a functionality per-se: it successfully stops fork bombs to
> happen. (Sorry for doing all your work, Frederic =p ). Note that after all
> memcg infrastructure is deployed, it becomes very easy to track anything.
> The last patch of this series is extremely simple.
>
> The infrastructure is exactly the same we had in memcg, but stripped down
> of the slab parts. And because what we have after those patches is a feature
> per-se, I think it could be considered for merging.

hm. None of this new code makes the kernel smaller, faster, easier to
understand or more fun to read!

Presumably we're getting some benefit for all the downside. When the
time is appropriate, please do put some time into explaining that
benefit, so that others can agree that it is a worthwhile tradeoff.
Re: [PATCH 02/11] memcg: Reclaim when more than one page needed. [message #46948 is a reply to message #46917] Mon, 25 June 2012 23:33 Go to previous messageGo to next message
Suleiman Souhlal is currently offline  Suleiman Souhlal
Messages: 64
Registered: February 2012
Member
On Mon, Jun 25, 2012 at 7:15 AM, Glauber Costa <glommer@parallels.com> wrote:
> From: Suleiman Souhlal <ssouhlal@FreeBSD.org>
>
> mem_cgroup_do_charge() was written before slab accounting, and expects
> three cases: being called for 1 page, being called for a stock of 32 pages,
> or being called for a hugepage. If we call for 2 or 3 pages (and several
> slabs used in process creation are such, at least with the debug options I
> had), it assumed it's being called for stock and just retried without reclaiming.
>
> Fix that by passing down a minsize argument in addition to the csize.
>
> And what to do about that (csize == PAGE_SIZE && ret) retry? If it's
> needed at all (and presumably is since it's there, perhaps to handle
> races), then it should be extended to more than PAGE_SIZE, yet how far?
> And should there be a retry count limit, of what? For now retry up to
> COSTLY_ORDER (as page_alloc.c does), stay safe with a cond_resched(),
> and make sure not to do it if __GFP_NORETRY.

The commit description mentions COSTLY_ORDER, but it's not actually
used in the patch.

-- Suleiman
Re: [PATCH 01/11] memcg: Make it possible to use the stock for more than one page. [message #46950 is a reply to message #46919] Tue, 26 June 2012 04:01 Go to previous messageGo to next message
David Rientjes is currently offline  David Rientjes
Messages: 59
Registered: November 2006
Member
On Mon, 25 Jun 2012, Glauber Costa wrote:

> From: Suleiman Souhlal <ssouhlal@FreeBSD.org>
>
> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: David Rientjes <rientjes@google.com>
Re: [PATCH 02/11] memcg: Reclaim when more than one page needed. [message #46951 is a reply to message #46917] Tue, 26 June 2012 04:09 Go to previous messageGo to next message
David Rientjes is currently offline  David Rientjes
Messages: 59
Registered: November 2006
Member
On Mon, 25 Jun 2012, Glauber Costa wrote:

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9304db2..8e601e8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2158,8 +2158,16 @@ enum {
> CHARGE_OOM_DIE, /* the current is killed because of OOM */
> };
>
> +/*
> + * We need a number that is small enough to be likely to have been
> + * reclaimed even under pressure, but not too big to trigger unnecessary

Whitespace.

> + * retries
> + */
> +#define NR_PAGES_TO_RETRY 2
> +

Should be 1 << PAGE_ALLOC_COSTLY_ORDER? Where does this number come from?
The changelog doesn't specify.

> static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> - unsigned int nr_pages, bool oom_check)
> + unsigned int nr_pages, unsigned int min_pages,
> + bool oom_check)
> {
> unsigned long csize = nr_pages * PAGE_SIZE;
> struct mem_cgroup *mem_over_limit;
> @@ -2182,18 +2190,18 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> } else
> mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> /*
> - * nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
> - * of regular pages (CHARGE_BATCH), or a single regular page (1).
> - *
> * Never reclaim on behalf of optional batching, retry with a
> * single page instead.
> */
> - if (nr_pages == CHARGE_BATCH)
> + if (nr_pages > min_pages)
> return CHARGE_RETRY;
>
> if (!(gfp_mask & __GFP_WAIT))
> return CHARGE_WOULDBLOCK;
>
> + if (gfp_mask & __GFP_NORETRY)
> + return CHARGE_NOMEM;
> +
> ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
> if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> return CHARGE_RETRY;
> @@ -2206,7 +2214,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> * unlikely to succeed so close to the limit, and we fall back
> * to regular pages anyway in case of failure.
> */
> - if (nr_pages == 1 && ret)
> + if (nr_pages <= NR_PAGES_TO_RETRY && ret)
> return CHARGE_RETRY;
>
> /*
> @@ -2341,7 +2349,8 @@ again:
> nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
> }
>
> - ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
> + ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
> + oom_check);
> switch (ret) {
> case CHARGE_OK:
> break;
Re: [PATCH 03/11] memcg: change defines to an enum [message #46952 is a reply to message #46918] Tue, 26 June 2012 04:11 Go to previous messageGo to next message
David Rientjes is currently offline  David Rientjes
Messages: 59
Registered: November 2006
Member
On Mon, 25 Jun 2012, Glauber Costa wrote:

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8e601e8..9352d40 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -387,9 +387,12 @@ enum charge_type {
> };
>
> /* for encoding cft->private value on file */
> -#define _MEM (0)
> -#define _MEMSWAP (1)
> -#define _OOM_TYPE (2)
> +enum res_type {
> + _MEM,
> + _MEMSWAP,
> + _OOM_TYPE,
> +};
> +
> #define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
> #define MEMFILE_TYPE(val) ((val) >> 16 & 0xffff)
> #define MEMFILE_ATTR(val) ((val) & 0xffff)

Shouldn't everything that does MEMFILE_TYPE() now be using type
enum res_type rather than int?
Re: [PATCH 04/11] kmem slab accounting basic infrastructure [message #46953 is a reply to message #46921] Tue, 26 June 2012 04:22 Go to previous messageGo to next message
David Rientjes is currently offline  David Rientjes
Messages: 59
Registered: November 2006
Member
On Mon, 25 Jun 2012, Glauber Costa wrote:

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9352d40..6f34b77 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -265,6 +265,10 @@ struct mem_cgroup {
> };
>
> /*
> + * the counter to account for kernel memory usage.
> + */
> + struct res_counter kmem;
> + /*
> * Per cgroup active and inactive list, similar to the
> * per zone LRU lists.
> */
> @@ -279,6 +283,7 @@ struct mem_cgroup {
> * Should the accounting and control be hierarchical, per subtree?
> */
> bool use_hierarchy;
> + bool kmem_accounted;
>
> bool oom_lock;
> atomic_t under_oom;
> @@ -391,6 +396,7 @@ enum res_type {
> _MEM,
> _MEMSWAP,
> _OOM_TYPE,
> + _KMEM,
> };
>
> #define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
> @@ -1438,6 +1444,10 @@ done:
> res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
> res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
> res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
> + printk(KERN_INFO "kmem: usage %llukB, limit %llukB, failcnt %llu\n",
> + res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
> + res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
> + res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
> }
>
> /*
> @@ -3879,6 +3889,11 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
> else
> val = res_counter_read_u64(&memcg->memsw, name);
> break;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> + case _KMEM:
> + val = res_counter_read_u64(&memcg->kmem, name);
> + break;
> +#endif

This shouldn't need an #ifdef, ->kmem is available on all
CONFIG_CGROUP_MEM_RES_CTLR kernels. Same with several of the other
instances in this patch.

Can't these instances be addressed by not adding kmem_cgroup_files without
CONFIG_CGROUP_MEM_RES_CTLR_KMEM?

> default:
> BUG();
> }
> @@ -3916,8 +3931,26 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
> break;
> if (type == _MEM)
> ret = mem_cgroup_resize_limit(memcg, val);
> - else
> + else if (type == _MEMSWAP)
> ret = mem_cgroup_resize_memsw_limit(memcg, val);
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> + else if (type == _KMEM) {
> + ret = res_counter_set_limit(&memcg->kmem, val);
> + if (ret)
> + break;
> + /*
> + * Once enabled, can't be disabled. We could in theory
> + * disable it if we haven't yet created any caches, or
> + * if we can shrink them all to death.
> + *
> + * But it is not worth the trouble
> + */
> + if (!memcg->kmem_accounted && val != RESOURCE_MAX)
> + memcg->kmem_accounted = true;
> + }
> +#endif
> + else
> + return -EINVAL;
> break;
> case RES_SOFT_LIMIT:
> ret = res_counter_memparse_write_strategy(buffer, &val);
> @@ -3982,12 +4015,20 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
> case RES_MAX_USAGE:
> if (type == _MEM)
> res_counter_reset_max(&memcg->res);
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> + else if (type == _KMEM)
> + res_counter_reset_max(&memcg->kmem);
> +#endif
> else
> res_counter_reset_max(&memcg->memsw);
> break;
> case RES_FAILCNT:
> if (type == _MEM)
> res_counter_reset_failcnt(&memcg->res);
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> + else if (type == _KMEM)
> + res_counter_reset_failcnt(&memcg->kmem);
> +#endif
> else
> res_counter_reset_failcnt(&memcg->memsw);
> break;
> @@ -4549,6 +4590,33 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
> }
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +static struct cftype kmem_cgroup_files[] = {
> + {
> + .name = "kmem.limit_in_bytes",
> + .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> + .write_string = mem_cgroup_write,
> + .read = mem_cgroup_read,
> + },
> + {
> + .name = "kmem.usage_in_bytes",
> + .private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
> + .read = mem_cgroup_read,
> + },
> + {
> + .name = "kmem.failcnt",
> + .private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
> + .trigger = mem_cgroup_reset,
> + .read = mem_cgroup_read,
> + },
> + {
> + .name = "kmem.max_usage_in_bytes",
> + .private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
> + .trigger = mem_cgroup_reset,
> + .read = mem_cgroup_read,
> + },
> + {},
> +};
> +
> static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> {
> return mem_cgroup_sockets_init(memcg, ss);
> @@ -4892,6 +4960,12 @@ mem_cgroup_create(struct cgroup *cont)
> int cpu;
> enable_swap_cgroup();
> parent = NULL;
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> + WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
> + kmem_cgroup_files));
> +#endif
> +
> if (mem_cgroup_soft_limit_tree_init())
> goto free_out;
> root_mem_cgroup = memcg;
> @@ -4910,6 +4984,7 @@ mem_cgroup_create(struct cgroup *cont)
> if (parent && parent->use_hierarchy) {
> res_counter_init(&memcg->res, &parent->res);
> res_counter_init(&memcg->memsw, &parent->memsw);
> + res_counter_init(&memcg->kmem, &parent->kmem);
> /*
> * We increment refcnt of the parent to ensure that we can
> * safely access it on res_counter_charge/uncharge.
> @@ -4920,6 +4995,7 @@ mem_cgroup_create(struct cgroup *cont)
> } else {
> res_counter_init(&memcg->res, NULL);
> res_counter_init(&memcg->memsw, NULL);
> + res_counter_init(&memcg->kmem, NULL);
> }
> memcg->last_scanned_node = MAX_NUMNODES;
> INIT_LIST_HEAD(&memcg->oom_notify);
Re: [PATCH 05/11] Add a __GFP_KMEMCG flag [message #46954 is a reply to message #46920] Tue, 26 June 2012 04:25 Go to previous messageGo to next message
David Rientjes is currently offline  David Rientjes
Messages: 59
Registered: November 2006
Member
On Mon, 25 Jun 2012, Glauber Costa wrote:

> This flag is used to indicate to the callees that this allocation will be
> serviced to the kernel. It is not supposed to be passed by the callers
> of kmem_cache_alloc, but rather by the cache core itself.
>

Not sure what "serviced to the kernel" means, does this mean that the
memory will not be accounted for to the root memcg?

> CC: Christoph Lameter <cl@linux.com>
> CC: Pekka Enberg <penberg@cs.helsinki.fi>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Suleiman Souhlal <suleiman@google.com>
> ---
> include/linux/gfp.h | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 1e49be4..8f4079f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -37,6 +37,9 @@ struct vm_area_struct;
> #define ___GFP_NO_KSWAPD 0x400000u
> #define ___GFP_OTHER_NODE 0x800000u
> #define ___GFP_WRITE 0x1000000u
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +#define ___GFP_KMEMCG 0x2000000u
> +#endif
>
> /*
> * GFP bitmasks..
> @@ -88,13 +91,16 @@ struct vm_area_struct;
> #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
> #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG)/* Allocation comes from a memcg-accounted resource */
> +#endif

Needs a space.

> /*
> * This may seem redundant, but it's a way of annotating false positives vs.
> * allocations that simply cannot be supported (e.g. page tables).
> */
> #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
>
> -#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */
> #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>
> /* This equals 0, but use constants in case they ever change */
Re: [PATCH 11/11] protect architectures where THREAD_SIZE &gt;= PAGE_SIZE against fork bombs [message #46955 is a reply to message #46924] Tue, 26 June 2012 04:57 Go to previous messageGo to next message
David Rientjes is currently offline  David Rientjes
Messages: 59
Registered: November 2006
Member
On Mon, 25 Jun 2012, Glauber Costa wrote:

> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index ccc1899..914ec07 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -61,6 +61,12 @@ extern long do_no_restart_syscall(struct restart_block *parm);
> # define THREADINFO_GFP (GFP_KERNEL | __GFP_NOTRACK)
> #endif
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +# define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
> +#else
> +# define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP)
> +#endif
> +

This type of requirement is going to become nasty very quickly if nobody
can use __GFP_KMEMCG without testing for CONFIG_CGROUP_MEM_RES_CTLR_KMEM.
Perhaps define __GFP_KMEMCG to be 0x0 if it's not enabled, similar to how
kmemcheck does?
Re: [PATCH 09/11] memcg: propagate kmem limiting information to children [message #46956 is a reply to message #46945] Tue, 26 June 2012 05:23 Go to previous messageGo to next message
David Rientjes is currently offline  David Rientjes
Messages: 59
Registered: November 2006
Member
On Mon, 25 Jun 2012, Andrew Morton wrote:

> > >> */
> > >> bool use_hierarchy;
> > >> - bool kmem_accounted;
> > >> + /*
> > >> + * bit0: accounted by this cgroup
> > >> + * bit1: accounted by a parent.
> > >> + */
> > >> + volatile unsigned long kmem_accounted;
> > >
> > > Is the volatile declaration really necessary? Why is it necessary?
> > > Why no comment explaining it?
> >
> > Seems to be required by set_bit and friends. gcc will complain if it is
> > not volatile (take a look at the bit function headers)
>
> That would be a broken gcc. We run test_bit()/set_bit() and friends
> against plain old `unsigned long' in thousands of places. There's
> nothing special about this one!
>

No version of gcc would complain about this, even with 4.6 and later with
-fstrict-volatile-bitfields, it's a qualifier that determines whether or
not the access to memory is the exact size of the bitfield and aligned to
its natural boundary. If the type isn't qualified as such then it's
simply going to compile to access the native word size of the
architecture. No special consideration is needed for a member of
struct mem_cgroup.
Re: [PATCH 09/11] memcg: propagate kmem limiting information to children [message #46957 is a reply to message #46946] Tue, 26 June 2012 05:24 Go to previous messageGo to next message
David Rientjes is currently offline  David Rientjes
Messages: 59
Registered: November 2006
Member
On Mon, 25 Jun 2012, Andrew Morton wrote:

> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -287,7 +287,11 @@ struct mem_cgroup {
> > * Should the accounting and control be hierarchical, per subtree?
> > */
> > bool use_hierarchy;
> > - bool kmem_accounted;
> > + /*
> > + * bit0: accounted by this cgroup
> > + * bit1: accounted by a parent.
> > + */
> > + volatile unsigned long kmem_accounted;
>
> I suggest
>
> unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */
>
> > bool oom_lock;
> > atomic_t under_oom;
> > @@ -340,6 +344,9 @@ struct mem_cgroup {
> > #endif
> > };
> >
> > +#define KMEM_ACCOUNTED_THIS 0
> > +#define KMEM_ACCOUNTED_PARENT 1
>
> And then document the fields here.
>

In hex, please?
Re: [PATCH 09/11] memcg: propagate kmem limiting information to children [message #46958 is a reply to message #46957] Tue, 26 June 2012 05:31 Go to previous messageGo to previous message
akpm is currently offline  akpm
Messages: 224
Registered: March 2007
Senior Member
On Mon, 25 Jun 2012 22:24:44 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> > > +#define KMEM_ACCOUNTED_THIS 0
> > > +#define KMEM_ACCOUNTED_PARENT 1
> >
> > And then document the fields here.
> >
>
> In hex, please?

Well, they're bit numbers, not masks. Decimal 0-31 is OK, or an enum.
Previous Topic: containers and cgroups mini-summit @ Linux Plumbers
Next Topic: [PATCH 0/4] fuse: optimize scatter-gather direct IO
Goto Forum:
  


Current Time: Sat Oct 25 22:32:38 GMT 2025

Total time taken to generate the page: 0.09889 seconds