Re: [PATCH v4 23/25] memcg: propagate kmem limiting information to children [message #46866 is a reply to message #46862]
Wed, 20 June 2012 08:59
Glauber Costa
On 06/19/2012 12:54 PM, Glauber Costa wrote:
> On 06/19/2012 12:35 PM, Glauber Costa wrote:
>> On 06/19/2012 04:16 AM, Kamezawa Hiroyuki wrote:
>>> (2012/06/18 21:43), Glauber Costa wrote:
>>>> On 06/18/2012 04:37 PM, Kamezawa Hiroyuki wrote:
>>>>> (2012/06/18 19:28), Glauber Costa wrote:
>>>>>> The current memcg slab cache management fails to present satisfactory hierarchical
>>>>>> behavior in the following scenario:
>>>>>>
>>>>>> -> /cgroups/memory/A/B/C
>>>>>>
>>>>>> * kmem limit set at A
>>>>>> * A and B empty taskwise
>>>>>> * bash in C does find /
>>>>>>
>>>>>> Because kmem_accounted is a boolean that was not set for C, no accounting
>>>>>> would be done. This is, however, not what we expect.
>>>>>>
>>>>>
>>>>> Hmm....do we need this new routines even while we have mem_cgroup_iter() ?
>>>>>
>>>>> Doesn't this work ?
>>>>>
>>>>> struct mem_cgroup {
>>>>> .....
>>>>> bool kmem_accounted_this;
>>>>> atomic_t kmem_accounted;
>>>>> ....
>>>>> }
>>>>>
>>>>> at set limit
>>>>>
>>>>> ....set_limit(memcg) {
>>>>>
>>>>> if (newly accounted) {
>>>>> mem_cgroup_iter() {
>>>>> atomic_inc(&iter->kmem_accounted)
>>>>> }
>>>>> } else {
>>>>> mem_cgroup_iter() {
>>>>> atomic_dec(&iter->kmem_accounted);
>>>>> }
>>>>> }
>>>>>
>>>>>
>>>>> hm ? Then, you can see kmem is accounted or not by atomic_read(&memcg->kmem_accounted);
>>>>>
>>>>
>>>> Accounted by itself / parent is still useful, and I see no reason to use
>>>> an atomic + bool if we can use a pair of bits.
>>>>
>>>> As for the routine, I guess mem_cgroup_iter will work... It does a lot
>>>> more than I need, but for the sake of using what's already in there, I
>>>> can switch to it with no problems.
>>>>
>>>
>>> Hmm. please start from reusing existing routines.
>>> If it's not enough, some enhancement for generic cgroup will be welcomed
>>> rather than completely new one only for memcg.
>>>
>>
>> And now that I am trying to adapt the code to the new function, I
>> remember clearly why I did it this way. Sorry, my memory failed me.
>>
>> It has to do with the order of the walk. I need to enforce hierarchy,
>> which means whenever a cgroup has !use_hierarchy, I need to cut out that
>> branch, but continue scanning the tree for other branches.
>>
>> That is a lot easier to do with depth-first tree walks like the one
>> proposed in this patch. for_each_mem_cgroup() seems to walk the tree in
>> css-creation order, which means we would need to keep track of parents
>> that have hierarchy disabled at all times (there can be many), and always
>> test for ancestry - which is expensive, but I don't particularly care.
>>
>> But I'll give another shot with this one.
>>
>
> Humm, silly me. I believed the hierarchical settings to be more
> flexible than they really are.
>
> I thought it could be possible for a child of a parent with
> use_hierarchy = 1 to have use_hierarchy = 0.
>
> It seems not to be the case. This makes my life a lot easier.
>
How about the following patch?
It is still expensive in the clear_bit case, because I can't just walk
the whole tree flipping the bit down: I need to stop whenever I see a
branch whose root is itself accounted - and the ordering of the iterator
forces me to always check the tree upwards (so we get O(n*h), with h being
the height, instead of O(n)).
Flipping the bit up is easy enough.
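To make that concrete, here is a minimal sketch (not the posted patch) of how such a walk could look on top of the existing for_each_mem_cgroup_tree()/parent_mem_cgroup() helpers in mm/memcontrol.c; the memcg_kmem_*_accounted() helpers are invented names for illustration only.

/*
 * Illustrative sketch only.  Propagate the kmem accounting bit from @root
 * to its subtree.  When clearing, a sub-tree whose own limit keeps it
 * accounted must be skipped, which forces the upward ancestor check
 * (the O(n*h) cost mentioned above).
 */
static bool kmem_accounted_up_to(struct mem_cgroup *memcg,
				 struct mem_cgroup *root)
{
	struct mem_cgroup *p;

	/* Walk from @memcg towards (but excluding) @root. */
	for (p = memcg; p != root; p = parent_mem_cgroup(p))
		if (memcg_kmem_accounted_by_own_limit(p))	/* hypothetical */
			return true;
	return false;
}

static void memcg_propagate_kmem_accounted(struct mem_cgroup *root, bool enable)
{
	struct mem_cgroup *iter;

	for_each_mem_cgroup_tree(iter, root) {
		if (enable)
			memcg_kmem_set_accounted(iter);		/* hypothetical */
		else if (!kmem_accounted_up_to(iter, root))
			memcg_kmem_clear_accounted(iter);	/* hypothetical */
	}
}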
Re: [PATCH v4 08/25] memcg: change defines to an enum [message #46872 is a reply to message #46819]
Wed, 20 June 2012 13:13
Michal Hocko
[Sorry for the late reply. I am aware of the series; I am just too busy to
give it the serious time needed for review. It doesn't make much sense to
delay these preparatory pieces, so...]
On Mon 18-06-12 14:28:01, Glauber Costa wrote:
> This is just a cleanup patch for clarity of expression.
> In earlier submissions, people asked it to be in a separate
> patch, so here it is.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
> ---
> mm/memcontrol.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b6cb075..cc1fdb4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -374,9 +374,12 @@ enum charge_type {
> };
>
> /* for encoding cft->private value on file */
> -#define _MEM (0)
> -#define _MEMSWAP (1)
> -#define _OOM_TYPE (2)
> +enum res_type {
> + _MEM,
> + _MEMSWAP,
> + _OOM_TYPE,
> +};
> +
> #define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
> #define MEMFILE_TYPE(val) (((val) >> 16) & 0xffff)
> #define MEMFILE_ATTR(val) ((val) & 0xffff)
> --
> 1.7.10.2
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
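For readers skimming the archive, a small illustration (not from the patch) of how the new enum combines with the MEMFILE_* macros when cft->private is packed and later decoded; RES_LIMIT is used here only as an example attribute value.

/* Illustration only: packing and unpacking cft->private with these macros. */
enum res_type { _MEM, _MEMSWAP, _OOM_TYPE };

#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
#define MEMFILE_ATTR(val)	((val) & 0xffff)

/*
 * A memsw limit file could be registered with
 *	.private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT);
 * and its handler then recovers both halves:
 */
static void memfile_decode(int private)
{
	enum res_type type = MEMFILE_TYPE(private);	/* e.g. _MEMSWAP */
	int attr = MEMFILE_ATTR(private);		/* e.g. RES_LIMIT */

	(void)type;
	(void)attr;
}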
Re: [PATCH v4 05/25] memcg: Always free struct memcg through schedule_work() [message #46873 is a reply to message #46828]
Wed, 20 June 2012 13:20
Michal Hocko
On Mon 18-06-12 14:27:58, Glauber Costa wrote:
> Right now we free struct memcg with kfree right after an
> rcu grace period, but defer it if we need to use vfree() to get
> rid of that memory area. We do that out of necessity, because vfree
> needs to be called in process context.
>
> This patch unifies this behavior by ensuring that even kfree will
> happen in a separate thread. The goal is to have a stable place to
> call the upcoming jump label destruction function outside the realm
> of the complicated and quite far-reaching cgroup lock (which can't be
> held when taking either the cpu_hotplug.lock or the jump_label_mutex).
This one has been in the memcg-devel (mmotm) tree for quite some time with
acks from me and Kamezawa.
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Tejun Heo <tj@kernel.org>
> CC: Li Zefan <lizefan@huawei.com>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Michal Hocko <mhocko@suse.cz>
> ---
> mm/memcontrol.c | 24 +++++++++++++-----------
> 1 file changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e3b528e..ce15be4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -245,8 +245,8 @@ struct mem_cgroup {
> */
> struct rcu_head rcu_freeing;
> /*
> - * But when using vfree(), that cannot be done at
> - * interrupt time, so we must then queue the work.
> + * We also need some space for a worker in deferred freeing.
> + * By the time we call it, rcu_freeing is no longer in use.
> */
> struct work_struct work_freeing;
> };
> @@ -4826,23 +4826,28 @@ out_free:
> }
>
> /*
> - * Helpers for freeing a vzalloc()ed mem_cgroup by RCU,
> + * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU,
> * but in process context. The work_freeing structure is overlaid
> * on the rcu_freeing structure, which itself is overlaid on memsw.
> */
> -static void vfree_work(struct work_struct *work)
> +static void free_work(struct work_struct *work)
> {
> struct mem_cgroup *memcg;
> + int size = sizeof(struct mem_cgroup);
>
> memcg = container_of(work, struct mem_cgroup, work_freeing);
> - vfree(memcg);
> + if (size < PAGE_SIZE)
> + kfree(memcg);
> + else
> + vfree(memcg);
> }
> -static void vfree_rcu(struct rcu_head *rcu_head)
> +
> +static void free_rcu(struct rcu_head *rcu_head)
> {
> struct mem_cgroup *memcg;
>
> memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
> - INIT_WORK(&memcg->work_freeing, vfree_work);
> + INIT_WORK(&memcg->work_freeing, free_work);
> schedule_work(&memcg->work_freeing);
> }
>
> @@ -4868,10 +4873,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
> free_mem_cgroup_per_zone_info(memcg, node);
>
> free_percpu(memcg->stat);
> - if (sizeof(struct mem_cgroup) < PAGE_SIZE)
> - kfree_rcu(memcg, rcu_freeing);
> - else
> - call_rcu(&memcg->rcu_freeing, vfree_rcu);
> + call_rcu(&memcg->rcu_freeing, free_rcu);
> }
>
> static void mem_cgroup_get(struct mem_cgroup *memcg)
> --
> 1.7.10.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
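As a side note, the pattern the patch converges on can be shown in a generic, stand-alone form. The sketch below (with a made-up struct my_obj) is only meant to illustrate why the RCU callback, which runs in softirq context where vfree() is not allowed, merely queues a work item that does the actual kfree()/vfree() from process context.

/* Generic illustration of the two-stage deferred free; struct my_obj is made up. */
struct my_obj {
	/* ... payload ... */
	union {
		struct rcu_head rcu_freeing;
		struct work_struct work_freeing;	/* reused once RCU is done */
	};
};

static void my_obj_free_work(struct work_struct *work)
{
	struct my_obj *obj = container_of(work, struct my_obj, work_freeing);

	/* Process context: both kfree() and vfree() are safe here. */
	if (sizeof(struct my_obj) < PAGE_SIZE)
		kfree(obj);
	else
		vfree(obj);
}

static void my_obj_free_rcu(struct rcu_head *rcu)
{
	struct my_obj *obj = container_of(rcu, struct my_obj, rcu_freeing);

	/* Softirq context: just hand the object to a worker. */
	INIT_WORK(&obj->work_freeing, my_obj_free_work);
	schedule_work(&obj->work_freeing);
}

/* Freeing an object: call_rcu(&obj->rcu_freeing, my_obj_free_rcu); */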
Re: [PATCH v4 06/25] memcg: Make it possible to use the stock for more than one page. [message #46874 is a reply to message #46820]
Wed, 20 June 2012 13:28
Michal Hocko
On Mon 18-06-12 14:27:59, Glauber Costa wrote:
> From: Suleiman Souhlal <ssouhlal@FreeBSD.org>
>
> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
I am not sure the patch is good to merge on its own, without the rest.
One comment below.
> ---
> mm/memcontrol.c | 18 +++++++++---------
> 1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ce15be4..00b9f1e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1998,19 +1998,19 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> static DEFINE_MUTEX(percpu_charge_mutex);
>
> /*
> - * Try to consume stocked charge on this cpu. If success, one page is consumed
> - * from local stock and true is returned. If the stock is 0 or charges from a
> - * cgroup which is not current target, returns false. This stock will be
> - * refilled.
> + * Try to consume stocked charge on this cpu. If success, nr_pages pages are
> + * consumed from local stock and true is returned. If the stock is 0 or
> + * charges from a cgroup which is not current target, returns false.
> + * This stock will be refilled.
> */
> -static bool consume_stock(struct mem_cgroup *memcg)
> +static bool consume_stock(struct mem_cgroup *memcg, int nr_pages)
> {
> struct memcg_stock_pcp *stock;
> bool ret = true;
I guess you want:
if (nr_pages > CHARGE_BATCH)
return false;
because you don't want to try to use stock for THP pages.
>
> stock = &get_cpu_var(memcg_stock);
> - if (memcg == stock->cached && stock->nr_pages)
> - stock->nr_pages--;
> + if (memcg == stock->cached && stock->nr_pages >= nr_pages)
> + stock->nr_pages -= nr_pages;
> else /* need to call res_counter_charge */
> ret = false;
> put_cpu_var(memcg_stock);
> @@ -2309,7 +2309,7 @@ again:
> VM_BUG_ON(css_is_removed(&memcg->css));
> if (mem_cgroup_is_root(memcg))
> goto done;
> - if (nr_pages == 1 && consume_stock(memcg))
> + if (consume_stock(memcg, nr_pages))
> goto done;
> css_get(&memcg->css);
> } else {
> @@ -2334,7 +2334,7 @@ again:
> rcu_read_unlock();
> goto done;
> }
> - if (nr_pages == 1 && consume_stock(memcg)) {
> + if (consume_stock(memcg, nr_pages)) {
> /*
> * It seems dagerous to access memcg without css_get().
> * But considering how consume_stok works, it's not
> --
> 1.7.10.2
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
Re: [PATCH v4 07/25] memcg: Reclaim when more than one page needed. [message #46875 is a reply to message #46822]
Wed, 20 June 2012 13:47
Michal Hocko
On Mon 18-06-12 14:28:00, Glauber Costa wrote:
> From: Suleiman Souhlal <ssouhlal@FreeBSD.org>
>
> mem_cgroup_do_charge() was written before slab accounting, and expects
> three cases: being called for 1 page, being called for a stock of 32 pages,
> or being called for a hugepage. If we call for 2 or 3 pages (and several
> slabs used in process creation are such, at least with the debug options I
> had), it assumed it's being called for stock and just retried without reclaiming.
>
> Fix that by passing down a minsize argument in addition to the csize.
>
> And what to do about that (csize == PAGE_SIZE && ret) retry? If it's
> needed at all (and presumably is since it's there, perhaps to handle
> races), then it should be extended to more than PAGE_SIZE, yet how far?
> And should there be a retry count limit, of what? For now retry up to
> COSTLY_ORDER (as page_alloc.c does), stay safe with a cond_resched(),
> and make sure not to do it if __GFP_NORETRY.
>
> [v4: fixed nr pages calculation pointed out by Christoph Lameter ]
>
> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
I think this is not ready to be merged yet.
Two comments below.
[...]
> @@ -2210,18 +2211,18 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> } else
> mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> /*
> - * nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
> - * of regular pages (CHARGE_BATCH), or a single regular page (1).
> - *
> * Never reclaim on behalf of optional batching, retry with a
> * single page instead.
> */
> - if (nr_pages == CHARGE_BATCH)
> + if (nr_pages > min_pages)
> return CHARGE_RETRY;
>
> if (!(gfp_mask & __GFP_WAIT))
> return CHARGE_WOULDBLOCK;
>
> + if (gfp_mask & __GFP_NORETRY)
> + return CHARGE_NOMEM;
This is kmem specific and should be separated out in case this should
be merged before the rest.
Btw. I assume that oom==false when called from kmem...
> +
> ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
> if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> return CHARGE_RETRY;
> @@ -2234,8 +2235,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> * unlikely to succeed so close to the limit, and we fall back
> * to regular pages anyway in case of failure.
> */
> - if (nr_pages == 1 && ret)
> + if (nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER) && ret) {
> + cond_resched();
> return CHARGE_RETRY;
> + }
What prevents us from looping for an unbounded amount of time here?
Maybe you need to consider the number of reclaimed pages here.
>
> /*
> * At task move, charge accounts can be doubly counted. So, it's
> @@ -2369,7 +2372,8 @@ again:
> nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
> }
>
> - ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
> + ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
> + oom_check);
> switch (ret) {
> case CHARGE_OK:
> break;
> --
> 1.7.10.2
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
Re: [PATCH v4 06/25] memcg: Make it possible to use the stock for more than one page. [message #46878 is a reply to message #46874]
Wed, 20 June 2012 19:36
Glauber Costa
On 06/20/2012 05:28 PM, Michal Hocko wrote:
> On Mon 18-06-12 14:27:59, Glauber Costa wrote:
>> From: Suleiman Souhlal <ssouhlal@FreeBSD.org>
>>
>> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> I am not sure the patch is good to merge on its own without the rest.
> One comment bellow.
>
>> ---
>> mm/memcontrol.c | 18 +++++++++---------
>> 1 file changed, 9 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index ce15be4..00b9f1e 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1998,19 +1998,19 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
>> static DEFINE_MUTEX(percpu_charge_mutex);
>>
>> /*
>> - * Try to consume stocked charge on this cpu. If success, one page is consumed
>> - * from local stock and true is returned. If the stock is 0 or charges from a
>> - * cgroup which is not current target, returns false. This stock will be
>> - * refilled.
>> + * Try to consume stocked charge on this cpu. If success, nr_pages pages are
>> + * consumed from local stock and true is returned. If the stock is 0 or
>> + * charges from a cgroup which is not current target, returns false.
>> + * This stock will be refilled.
>> */
>> -static bool consume_stock(struct mem_cgroup *memcg)
>> +static bool consume_stock(struct mem_cgroup *memcg, int nr_pages)
>> {
>> struct memcg_stock_pcp *stock;
>> bool ret = true;
>
> I guess you want:
> if (nr_pages > CHARGE_BATCH)
> return false;
>
> because you don't want to try to use stock for THP pages.
The code reads:
+ if (memcg == stock->cached && stock->nr_pages >= nr_pages)
+ stock->nr_pages -= nr_pages;
Isn't stock->nr_pages always <= CHARGE_BATCH by definition?
Re: [PATCH v4 07/25] memcg: Reclaim when more than one page needed. [message #46879 is a reply to message #46875]
Wed, 20 June 2012 19:43
Glauber Costa
On 06/20/2012 05:47 PM, Michal Hocko wrote:
> On Mon 18-06-12 14:28:00, Glauber Costa wrote:
>> From: Suleiman Souhlal <ssouhlal@FreeBSD.org>
>>
>> mem_cgroup_do_charge() was written before slab accounting, and expects
>> three cases: being called for 1 page, being called for a stock of 32 pages,
>> or being called for a hugepage. If we call for 2 or 3 pages (and several
>> slabs used in process creation are such, at least with the debug options I
>> had), it assumed it's being called for stock and just retried without reclaiming.
>>
>> Fix that by passing down a minsize argument in addition to the csize.
>>
>> And what to do about that (csize == PAGE_SIZE && ret) retry? If it's
>> needed at all (and presumably is since it's there, perhaps to handle
>> races), then it should be extended to more than PAGE_SIZE, yet how far?
>> And should there be a retry count limit, of what? For now retry up to
>> COSTLY_ORDER (as page_alloc.c does), stay safe with a cond_resched(),
>> and make sure not to do it if __GFP_NORETRY.
>>
>> [v4: fixed nr pages calculation pointed out by Christoph Lameter ]
>>
>> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> I think this is not ready to be merged yet.
Fair enough.
> Two comments below.
>
> [...]
>> @@ -2210,18 +2211,18 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>> } else
>> mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
>> /*
>> - * nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
>> - * of regular pages (CHARGE_BATCH), or a single regular page (1).
>> - *
>> * Never reclaim on behalf of optional batching, retry with a
>> * single page instead.
>> */
>> - if (nr_pages == CHARGE_BATCH)
>> + if (nr_pages > min_pages)
>> return CHARGE_RETRY;
>>
>> if (!(gfp_mask & __GFP_WAIT))
>> return CHARGE_WOULDBLOCK;
>>
>> + if (gfp_mask & __GFP_NORETRY)
>> + return CHARGE_NOMEM;
>
> This is kmem specific and should be preparated out in case this should
> be merged before the rest.
ok.
> Btw. I assume that oom==false when called from kmem...
What prevents the oom killer from being called for a reclaimable kmem
allocation that can be satisfied?
>> +
>> ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
>> if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>> return CHARGE_RETRY;
>> @@ -2234,8 +2235,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>> * unlikely to succeed so close to the limit, and we fall back
>> * to regular pages anyway in case of failure.
>> */
>> - if (nr_pages == 1 && ret)
>> + if (nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER) && ret) {
>> + cond_resched();
>> return CHARGE_RETRY;
>> + }
>
> What prevents us from looping for unbounded amount of time here?
> Maybe you need to consider the number of reclaimed pages here.
Why would we even loop here? It will just return CHARGE_RETRY; it is up
to the caller to decide whether or not it will retry.
Re: [PATCH v4 06/25] memcg: Make it possible to use the stock for more than one page. [message #46890 is a reply to message #46878]
Thu, 21 June 2012 21:14
Michal Hocko
On Wed 20-06-12 23:36:47, Glauber Costa wrote:
> On 06/20/2012 05:28 PM, Michal Hocko wrote:
> >On Mon 18-06-12 14:27:59, Glauber Costa wrote:
> >>From: Suleiman Souhlal <ssouhlal@FreeBSD.org>
> >>
> >>Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> >>Signed-off-by: Glauber Costa <glommer@parallels.com>
> >>Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> >I am not sure the patch is good to merge on its own without the rest.
> >One comment bellow.
> >
> >>---
> >> mm/memcontrol.c | 18 +++++++++---------
> >> 1 file changed, 9 insertions(+), 9 deletions(-)
> >>
> >>diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>index ce15be4..00b9f1e 100644
> >>--- a/mm/memcontrol.c
> >>+++ b/mm/memcontrol.c
> >>@@ -1998,19 +1998,19 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> >> static DEFINE_MUTEX(percpu_charge_mutex);
> >>
> >> /*
> >>- * Try to consume stocked charge on this cpu. If success, one page is consumed
> >>- * from local stock and true is returned. If the stock is 0 or charges from a
> >>- * cgroup which is not current target, returns false. This stock will be
> >>- * refilled.
> >>+ * Try to consume stocked charge on this cpu. If success, nr_pages pages are
> >>+ * consumed from local stock and true is returned. If the stock is 0 or
> >>+ * charges from a cgroup which is not current target, returns false.
> >>+ * This stock will be refilled.
> >> */
> >>-static bool consume_stock(struct mem_cgroup *memcg)
> >>+static bool consume_stock(struct mem_cgroup *memcg, int nr_pages)
> >> {
> >> struct memcg_stock_pcp *stock;
> >> bool ret = true;
> >
> >I guess you want:
> > if (nr_pages > CHARGE_BATCH)
> > return false;
> >
> >because you don't want to try to use stock for THP pages.
>
>
> The code reads:
>
> + if (memcg == stock->cached && stock->nr_pages >= nr_pages)
> + stock->nr_pages -= nr_pages;
>
> Isn't stock->nr_pages always <= CHARGE_BATCH by definition?
Yes it is, but why disable preemption if we know this has no chance
of succeeding at all?
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
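For reference, a sketch of what consume_stock() would look like with the early check Michal is suggesting; this is illustrative, and whether this exact form was merged is not shown in this thread.

/* Sketch: consume_stock() with an early bail-out for over-sized requests. */
static bool consume_stock(struct mem_cgroup *memcg, int nr_pages)
{
	struct memcg_stock_pcp *stock;
	bool ret = true;

	/*
	 * A per-cpu stock never holds more than CHARGE_BATCH pages, so a
	 * larger request (a THP charge, for instance) can never be served
	 * from it.  Bailing out here avoids the get_cpu_var()/put_cpu_var()
	 * preempt-disable dance for a lookup that is guaranteed to fail.
	 */
	if (nr_pages > CHARGE_BATCH)
		return false;

	stock = &get_cpu_var(memcg_stock);
	if (memcg == stock->cached && stock->nr_pages >= nr_pages)
		stock->nr_pages -= nr_pages;
	else /* need to call res_counter_charge */
		ret = false;
	put_cpu_var(memcg_stock);
	return ret;
}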
Re: [PATCH v4 07/25] memcg: Reclaim when more than one page needed. [message #46891 is a reply to message #46879]
Thu, 21 June 2012 21:19
Michal Hocko
On Wed 20-06-12 23:43:52, Glauber Costa wrote:
> On 06/20/2012 05:47 PM, Michal Hocko wrote:
> >On Mon 18-06-12 14:28:00, Glauber Costa wrote:
> >>From: Suleiman Souhlal <ssouhlal@FreeBSD.org>
> >>
> >>mem_cgroup_do_charge() was written before slab accounting, and expects
> >>three cases: being called for 1 page, being called for a stock of 32 pages,
> >>or being called for a hugepage. If we call for 2 or 3 pages (and several
> >>slabs used in process creation are such, at least with the debug options I
> >>had), it assumed it's being called for stock and just retried without reclaiming.
> >>
> >>Fix that by passing down a minsize argument in addition to the csize.
> >>
> >>And what to do about that (csize == PAGE_SIZE && ret) retry? If it's
> >>needed at all (and presumably is since it's there, perhaps to handle
> >>races), then it should be extended to more than PAGE_SIZE, yet how far?
> >>And should there be a retry count limit, of what? For now retry up to
> >>COSTLY_ORDER (as page_alloc.c does), stay safe with a cond_resched(),
> >>and make sure not to do it if __GFP_NORETRY.
> >>
> >>[v4: fixed nr pages calculation pointed out by Christoph Lameter ]
> >>
> >>Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> >>Signed-off-by: Glauber Costa <glommer@parallels.com>
> >>Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> >I think this is not ready to be merged yet.
> Fair Enough
>
> >Two comments below.
> >
> >[...]
> >>@@ -2210,18 +2211,18 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >> } else
> >> mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> >> /*
> >>- * nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
> >>- * of regular pages (CHARGE_BATCH), or a single regular page (1).
> >>- *
> >> * Never reclaim on behalf of optional batching, retry with a
> >> * single page instead.
> >> */
> >>- if (nr_pages == CHARGE_BATCH)
> >>+ if (nr_pages > min_pages)
> >> return CHARGE_RETRY;
> >>
> >> if (!(gfp_mask & __GFP_WAIT))
> >> return CHARGE_WOULDBLOCK;
> >>
> >>+ if (gfp_mask & __GFP_NORETRY)
> >>+ return CHARGE_NOMEM;
> >
> >This is kmem specific and should be preparated out in case this should
> >be merged before the rest.
> ok.
>
> >Btw. I assume that oom==false when called from kmem...
>
> What prevents the oom killer to be called for a reclaimable kmem
> allocation that can be satisfied ?
Well, I am not familiar with the rest of the patch series yet (sorry
about that), but playing with oom can be really nasty if the oom score
doesn't also consider kmem allocations. You could end up killing
unexpected processes just because of a kmem-hungry (and nasty) process.
Dunno, I have to think about that.
> >>+
> >> ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
> >> if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> >> return CHARGE_RETRY;
> >>@@ -2234,8 +2235,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >> * unlikely to succeed so close to the limit, and we fall back
> >> * to regular pages anyway in case of failure.
> >> */
> >>- if (nr_pages == 1 && ret)
> >>+ if (nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER) && ret) {
> >>+ cond_resched();
> >> return CHARGE_RETRY;
> >>+ }
> >
> >What prevents us from looping for unbounded amount of time here?
> >Maybe you need to consider the number of reclaimed pages here.
>
> Why would we even loop here? It will just return CHARGE_RETRY, it is
> up to the caller to decide whether or not it will retry.
Yes, but the test was originally there to prevent oom when we managed to
reclaim something. And something might be enough for a single page, but now
you have high-order allocations, so we can retry without any success.
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
Re: [PATCH v4 23/25] memcg: propagate kmem limiting information to children [message #46900 is a reply to message #46866]
Sat, 23 June 2012 04:19
KAMEZAWA Hiroyuki
(2012/06/20 17:59), Glauber Costa wrote:
> On 06/19/2012 12:54 PM, Glauber Costa wrote:
>> On 06/19/2012 12:35 PM, Glauber Costa wrote:
>>> On 06/19/2012 04:16 AM, Kamezawa Hiroyuki wrote:
>>>> (2012/06/18 21:43), Glauber Costa wrote:
>>>>> On 06/18/2012 04:37 PM, Kamezawa Hiroyuki wrote:
>>>>>> (2012/06/18 19:28), Glauber Costa wrote:
>>>>>>> The current memcg slab cache management fails to present satisfatory hierarchical
>>>>>>> behavior in the following scenario:
>>>>>>>
>>>>>>> -> /cgroups/memory/A/B/C
>>>>>>>
>>>>>>> * kmem limit set at A
>>>>>>> * A and B empty taskwise
>>>>>>> * bash in C does find /
>>>>>>>
>>>>>>> Because kmem_accounted is a boolean that was not set for C, no accounting
>>>>>>> would be done. This is, however, not what we expect.
>>>>>>>
>>>>>>
>>>>>> Hmm....do we need this new routines even while we have mem_cgroup_iter() ?
>>>>>>
>>>>>> Doesn't this work ?
>>>>>>
>>>>>> struct mem_cgroup {
>>>>>> .....
>>>>>> bool kmem_accounted_this;
>>>>>> atomic_t kmem_accounted;
>>>>>> ....
>>>>>> }
>>>>>>
>>>>>> at set limit
>>>>>>
>>>>>> ....set_limit(memcg) {
>>>>>>
>>>>>> if (newly accounted) {
>>>>>> mem_cgroup_iter() {
>>>>>> atomic_inc(&iter->kmem_accounted)
>>>>>> }
>>>>>> } else {
>>>>>> mem_cgroup_iter() {
>>>>>> atomic_dec(&iter->kmem_accounted);
>>>>>> }
>>>>>> }
>>>>>>
>>>>>>
>>>>>> hm ? Then, you can see kmem is accounted or not by atomic_read(&memcg->kmem_accounted);
>>>>>>
>>>>>
>>>>> Accounted by itself / parent is still useful, and I see no reason to use
>>>>> an atomic + bool if we can use a pair of bits.
>>>>>
>>>>> As for the routine, I guess mem_cgroup_iter will work... It does a lot
>>>>> more than I need, but for the sake of using what's already in there, I
>>>>> can switch to it with no problems.
>>>>>
>>>>
>>>> Hmm. please start from reusing existing routines.
>>>> If it's not enough, some enhancement for generic cgroup will be welcomed
>>>> rather than completely new one only for memcg.
>>>>
>>>
>>> And now that I am trying to adapt the code to the new function, I
>>> remember clearly why I done this way. Sorry for my failed memory.
>>>
>>> That has to do with the order of the walk. I need to enforce hierarchy,
>>> which means whenever a cgroup has !use_hierarchy, I need to cut out that
>>> branch, but continue scanning the tree for other branches.
>>>
>>> That is a lot easier to do with depth-search tree walks like the one
>>> proposed in this patch. for_each_mem_cgroup() seems to walk the tree in
>>> css-creation order. Which means we need to keep track of parents that
>>> has hierarchy disabled at all times ( can be many ), and always test for
>>> ancestorship - which is expensive, but I don't particularly care.
>>>
>>> But I'll give another shot with this one.
>>>
>>
>> Humm, silly me. I was believing the hierarchical settings to be more
>> flexible than they really are.
>>
>> I thought that it could be possible for a children of a parent with
>> use_hierarchy = 1 to have use_hierarchy = 0.
>>
>> It seems not to be the case. This makes my life a lot easier.
>>
>
> How about the following patch?
>
> It is still expensive in the clear_bit case, because I can't just walk
> the whole tree flipping the bit down: I need to stop whenever I see a
> branch whose root is itself accounted - and the ordering of iter forces
> me to always check the tree up (So we got O(n*h) h being height instead
> of O(n)).
>
> for flipping the bit up, it is easy enough.
>
>
Yes. It seems much nicer.
Thanks,
-Kame
Re: [PATCH v4 07/25] memcg: Reclaim when more than one page needed. [message #46913 is a reply to message #46891]
Mon, 25 June 2012 13:13
Glauber Costa
>>>> +
>>>> ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
>>>> if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>>>> return CHARGE_RETRY;
>>>> @@ -2234,8 +2235,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>> * unlikely to succeed so close to the limit, and we fall back
>>>> * to regular pages anyway in case of failure.
>>>> */
>>>> - if (nr_pages == 1 && ret)
>>>> + if (nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER) && ret) {
>>>> + cond_resched();
>>>> return CHARGE_RETRY;
>>>> + }
>>>
>>> What prevents us from looping for unbounded amount of time here?
>>> Maybe you need to consider the number of reclaimed pages here.
>>
>> Why would we even loop here? It will just return CHARGE_RETRY, it is
>> up to the caller to decide whether or not it will retry.
>
> Yes, but the test was original to prevent oom when we managed to reclaim
> something. And something might be enough for a single page but now you
> have high order allocations so we can retry without any success.
>
So,
most of the kmem allocations are likely to be quite small as well. For
the slab, we're dealing with the order of 2-3 pages, and other
allocations that may happen, like stacks, will be on the order of 2
pages as well.
So one thing I could do here is define a threshold, say 3, and only
retry below that very low threshold, instead of following COSTLY_ORDER.
I don't expect two or three pages to be much less likely to be freed
than a single page.
I am fine with ripping out the cond_resched as well.
Let me know if you would be okay with that.
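A sketch of what that would mean inside mem_cgroup_do_charge(), with a made-up NR_PAGES_RETRY_THRESHOLD constant standing in for the "say 3" above; this is not a posted patch.

/* Hypothetical constant: only small kmem allocations are worth a retry. */
#define NR_PAGES_RETRY_THRESHOLD	3

	/*
	 * Even though the limit is exceeded at this point, reclaim-and-retry
	 * is worthwhile for small requests: a handful of pages is about as
	 * likely to be freed as a single one, while larger allocations are
	 * unlikely to succeed so close to the limit, and we fall back to
	 * regular pages anyway in case of failure.
	 */
	if (nr_pages <= NR_PAGES_RETRY_THRESHOLD && ret)
		return CHARGE_RETRY;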
Re: [PATCH v4 07/25] memcg: Reclaim when more than one page needed. [message #46915 is a reply to message #46913]
Mon, 25 June 2012 14:04
Glauber Costa
On 06/25/2012 05:13 PM, Glauber Costa wrote:
>
>>>>> +
>>>>> ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
>>>>> if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>>>>> return CHARGE_RETRY;
>>>>> @@ -2234,8 +2235,10 @@ static int mem_cgroup_do_charge(struct
>>>>> mem_cgroup *memcg, gfp_t gfp_mask,
>>>>> * unlikely to succeed so close to the limit, and we fall back
>>>>> * to regular pages anyway in case of failure.
>>>>> */
>>>>> - if (nr_pages == 1 && ret)
>>>>> + if (nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER) && ret) {
>>>>> + cond_resched();
>>>>> return CHARGE_RETRY;
>>>>> + }
>>>>
>>>> What prevents us from looping for unbounded amount of time here?
>>>> Maybe you need to consider the number of reclaimed pages here.
>>>
>>> Why would we even loop here? It will just return CHARGE_RETRY, it is
>>> up to the caller to decide whether or not it will retry.
>>
>> Yes, but the test was original to prevent oom when we managed to reclaim
>> something. And something might be enough for a single page but now you
>> have high order allocations so we can retry without any success.
>>
>
> So,
>
> Most of the kmem allocations are likely to be quite small as well. For
> the slab, we're dealing with the order of 2-3 pages, and for other
> allocations that may happen, like stack, they will be in the order of 2
> pages as well.
>
> So one thing I could do here, is define a threshold, say, 3, and only
> retry for that very low threshold, instead of following COSTLY_ORDER.
> I don't expect two or three pages to be much less likely to be freed
> than a single page.
>
> I am fine with ripping of the cond_resched as well.
>
> Let me know if you would be okay with that.
>
>
For the record, here's the patch I would propose.
At this point, I think it would be nice for Suleiman to say whether he is
still okay with the changes.
Re: [PATCH v4 24/25] memcg/slub: shrink dead caches [message #47273 is a reply to message #47268]
Wed, 25 July 2012 18:15
Glauber Costa
On 07/25/2012 07:23 PM, Christoph Lameter wrote:
> On Fri, 20 Jul 2012, Glauber Costa wrote:
>
>>> This is the same btw in SLAB which keeps objects in per cpu caches and
>>> keeps empty slab pages on special queues.
>>>
>>>> This patch marks all memcg caches as dead. kmem_cache_shrink is called
>>>> for the ones who are not yet dead - this will force internal cache
>>>> reorganization, and then all references to empty pages will be removed.
>>>
>>> You need to call this also for slab to drain the caches and free the pages
>>> on the empty list.
>>>
>> Doesn't the SLAB have a time-based reaper for that?
>
> Yes but it will take a couple of minutes to drain the caches.
>
You might have seen that my last submission included this for the slab
as well.
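To make the point concrete, here is a rough sketch of what marking the per-memcg child caches dead and shrinking them could look like. The field and helper names (memcg_caches[], memcg_params->dead, the array bound) follow the spirit of the series but are assumptions here; kmem_cache_shrink() is the real allocator entry point under discussion.

/* Rough sketch only - field names are assumptions, not the posted patch. */
static void memcg_shrink_dead_caches(struct mem_cgroup *memcg)
{
	struct kmem_cache *cachep;
	int i;

	for (i = 0; i < memcg_kmem_cache_count; i++) {	/* assumed bound */
		cachep = memcg->memcg_caches[i];	/* assumed array */
		if (!cachep)
			continue;
		cachep->memcg_params->dead = true;	/* per the series */
		/*
		 * Force internal reorganization so empty slab pages are
		 * returned now instead of waiting for SLAB's periodic reaper.
		 */
		kmem_cache_shrink(cachep);
	}
}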