OpenVZ Forum


[PATCH v3 00/13] kmem controller for memcg.
Re: [PATCH v3 10/13] memcg: use static branches when code not in use [message #48168 is a reply to message #47891] Mon, 01 October 2012 12:25
Michal Hocko
On Tue 18-09-12 18:04:07, Glauber Costa wrote:
[...]
> include/linux/memcontrol.h | 4 +++-
> mm/memcontrol.c | 26 ++++++++++++++++++++++++--
> 2 files changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 82ede9a..4ec9fd5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -22,6 +22,7 @@
> #include <linux/cgroup.h>
> #include <linux/vm_event_item.h>
> #include <linux/hardirq.h>
> +#include <linux/jump_label.h>
>
> struct mem_cgroup;
> struct page_cgroup;
> @@ -401,9 +402,10 @@ struct sock;
> void sock_update_memcg(struct sock *sk);
> void sock_release_memcg(struct sock *sk);
>
> +extern struct static_key memcg_kmem_enabled_key;
> static inline bool memcg_kmem_enabled(void)
> {
> - return true;
> + return static_key_false(&memcg_kmem_enabled_key);
> }
>
> extern bool __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 720e4bb..aada601 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -467,6 +467,8 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
> #include <net/sock.h>
> #include <net/ip.h>
>
> +struct static_key memcg_kmem_enabled_key;
> +
> static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
> static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size);
> static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size);
> @@ -624,6 +626,16 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
> WARN_ON(mem_cgroup_is_root(memcg));
> memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
> }
> +
> +static void disarm_kmem_keys(struct mem_cgroup *memcg)
> +{
> + if (memcg_kmem_is_accounted(memcg))
> + static_key_slow_dec(&memcg_kmem_enabled_key);
> +}
> +#else
> +static void disarm_kmem_keys(struct mem_cgroup *memcg)
> +{
> +}
> #endif /* CONFIG_MEMCG_KMEM */
>
> #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
> @@ -639,6 +651,12 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
> }
> #endif
>
> +static void disarm_static_keys(struct mem_cgroup *memcg)
> +{
> + disarm_sock_keys(memcg);
> + disarm_kmem_keys(memcg);
> +}
> +
> static void drain_all_stock_async(struct mem_cgroup *memcg);
>
> static struct mem_cgroup_per_zone *
> @@ -4131,7 +4149,11 @@ static void memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
> */
> mutex_lock(&set_limit_mutex);
> if ((val != RESOURCE_MAX) && memcg_kmem_set_active(memcg)) {
> -
> + /*
> + * Once the static branch is enabled it will only be
> + * disabled when the last reference to memcg is gone.
> + */
> + static_key_slow_inc(&memcg_kmem_enabled_key);

I guess the reason why we do not need to inc also for children is that
we do not inherit kmem_accounted, right?

> mem_cgroup_get(memcg);
>
> for_each_mem_cgroup_tree(iter, memcg) {
> @@ -5066,7 +5088,7 @@ static void free_work(struct work_struct *work)
> * to move this code around, and make sure it is outside
> * the cgroup_lock.
> */
> - disarm_sock_keys(memcg);
> + disarm_static_keys(memcg);
> if (size < PAGE_SIZE)
> kfree(memcg);
> else
> --
> 1.7.11.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Michal Hocko
SUSE Labs
Re: [PATCH v3 11/13] memcg: allow a memcg with kmem charges to be destructed. [message #48169 is a reply to message #47893] Mon, 01 October 2012 12:30
Michal Hocko
On Tue 18-09-12 18:04:08, Glauber Costa wrote:
> Because the ultimate goal of the kmem tracking in memcg is to track slab
> pages as well, we can't guarantee that we'll always be able to point a
> page to a particular process, and migrate the charges along with it -
> since in the common case, a page will contain data belonging to multiple
> processes.
>
> Because of that, when we destroy a memcg, we only make sure the
> destruction will succeed by discounting the kmem charges from the user
> charges when we try to empty the cgroup.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Christoph Lameter <cl@linux.com>
> CC: Pekka Enberg <penberg@cs.helsinki.fi>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Suleiman Souhlal <suleiman@google.com>

Looks good.
Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
> mm/memcontrol.c | 17 ++++++++++++++++-
> 1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index aada601..b05ecac 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -631,6 +631,11 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
> {
> if (memcg_kmem_is_accounted(memcg))
> static_key_slow_dec(&memcg_kmem_enabled_key);
> + /*
> + * This check can't live in kmem destruction function,
> + * since the charges will outlive the cgroup
> + */
> + WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
> }
> #else
> static void disarm_kmem_keys(struct mem_cgroup *memcg)
> @@ -3933,6 +3938,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
> int node, zid, shrink;
> int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> struct cgroup *cgrp = memcg->css.cgroup;
> + u64 usage;
>
> css_get(&memcg->css);
>
> @@ -3966,8 +3972,17 @@ move_account:
> mem_cgroup_end_move(memcg);
> memcg_oom_recover(memcg);
> cond_resched();
> + /*
> + * Kernel memory may not necessarily be trackable to a specific
> + * process. So they are not migrated, and therefore we can't
> + * expect their value to drop to 0 here.
> + *
> + * having res filled up with kmem only is enough
> + */
> + usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
> + res_counter_read_u64(&memcg->kmem, RES_USAGE);
> /* "ret" should also be checked to ensure all lists are empty. */
> - } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);
> + } while (usage > 0 || ret);
> out:
> css_put(&memcg->css);
> return ret;
> --
> 1.7.11.4

--
Michal Hocko
SUSE Labs
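
The loop condition introduced by this patch can be sketched in user space as follows. This is only an illustration under stated assumptions: `fake_memcg`, `user_usage` and `must_retry` are hypothetical names, and plain integers stand in for the kernel's res_counter. It relies on the point made elsewhere in the series that kernel charges also draw from the user counter, so the kmem usage never exceeds the res usage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two res_counter usages of one memcg. */
struct fake_memcg {
	uint64_t res_usage;	/* user + kernel bytes charged to memcg->res */
	uint64_t kmem_usage;	/* kernel bytes, also included in res_usage */
};

/* Bytes that force_empty can still reclaim or move: the user part only. */
static uint64_t user_usage(const struct fake_memcg *memcg)
{
	/* kmem is charged to res as well, so it can never exceed res. */
	assert(memcg->res_usage >= memcg->kmem_usage);
	return memcg->res_usage - memcg->kmem_usage;
}

/* Loop condition from the patch: retry while user pages remain or on error. */
static bool must_retry(const struct fake_memcg *memcg, int ret)
{
	return user_usage(memcg) > 0 || ret != 0;
}
```

With 12288 bytes in res and 8192 of them kernel memory, the loop keeps going. Once res holds only the 8192 kernel bytes, the group counts as empty and can be destroyed even though the kmem charges outlive it.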
Re: [PATCH v3 10/13] memcg: use static branches when code not in use [message #48170 is a reply to message #48168] Mon, 01 October 2012 12:27
Glauber Costa
On 10/01/2012 04:25 PM, Michal Hocko wrote:
> On Tue 18-09-12 18:04:07, Glauber Costa wrote:
> [...]
>> include/linux/memcontrol.h | 4 +++-
>> mm/memcontrol.c | 26 ++++++++++++++++++++++++--
>> 2 files changed, 27 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 82ede9a..4ec9fd5 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -22,6 +22,7 @@
>> #include <linux/cgroup.h>
>> #include <linux/vm_event_item.h>
>> #include <linux/hardirq.h>
>> +#include <linux/jump_label.h>
>>
>> struct mem_cgroup;
>> struct page_cgroup;
>> @@ -401,9 +402,10 @@ struct sock;
>> void sock_update_memcg(struct sock *sk);
>> void sock_release_memcg(struct sock *sk);
>>
>> +extern struct static_key memcg_kmem_enabled_key;
>> static inline bool memcg_kmem_enabled(void)
>> {
>> - return true;
>> + return static_key_false(&memcg_kmem_enabled_key);
>> }
>>
>> extern bool __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg,
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 720e4bb..aada601 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -467,6 +467,8 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
>> #include <net/sock.h>
>> #include <net/ip.h>
>>
>> +struct static_key memcg_kmem_enabled_key;
>> +
>> static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
>> static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size);
>> static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size);
>> @@ -624,6 +626,16 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>> WARN_ON(mem_cgroup_is_root(memcg));
>> memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
>> }
>> +
>> +static void disarm_kmem_keys(struct mem_cgroup *memcg)
>> +{
>> + if (memcg_kmem_is_accounted(memcg))
>> + static_key_slow_dec(&memcg_kmem_enabled_key);
>> +}
>> +#else
>> +static void disarm_kmem_keys(struct mem_cgroup *memcg)
>> +{
>> +}
>> #endif /* CONFIG_MEMCG_KMEM */
>>
>> #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
>> @@ -639,6 +651,12 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
>> }
>> #endif
>>
>> +static void disarm_static_keys(struct mem_cgroup *memcg)
>> +{
>> + disarm_sock_keys(memcg);
>> + disarm_kmem_keys(memcg);
>> +}
>> +
>> static void drain_all_stock_async(struct mem_cgroup *memcg);
>>
>> static struct mem_cgroup_per_zone *
>> @@ -4131,7 +4149,11 @@ static void memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
>> */
>> mutex_lock(&set_limit_mutex);
>> if ((val != RESOURCE_MAX) && memcg_kmem_set_active(memcg)) {
>> -
>> + /*
>> + * Once the static branch is enabled it will only be
>> + * disabled when the last reference to memcg is gone.
>> + */
>> + static_key_slow_inc(&memcg_kmem_enabled_key);
>
> I guess the reason why we do not need to inc also for children is that
> we do not inherit kmem_accounted, right?
>

Yes, but I of course changed that in the upcoming version of the patch.
We now inherit the value every time, and the static branches are updated
accordingly.
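
The reference counting under discussion can be sketched in user space as follows. This is a minimal illustration, not the kernel implementation: `fake_static_key`, `fake_memcg`, `set_kmem_limit` and `free_memcg` are hypothetical names, and a plain counter stands in for `struct static_key`, which in the kernel patches code at run time instead of testing an integer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct static_key: just a reference count. */
struct fake_static_key { int enabled; };

/* Hypothetical stand-in for a memcg with its kmem_accounted state. */
struct fake_memcg { bool kmem_accounted; };

static struct fake_static_key kmem_enabled_key;

/* Mirrors memcg_kmem_enabled(): true while at least one group is accounted. */
static bool kmem_enabled(void)
{
	return kmem_enabled_key.enabled > 0;
}

/* The first limit set on a group takes one reference on the key. */
static void set_kmem_limit(struct fake_memcg *memcg)
{
	if (!memcg->kmem_accounted) {
		memcg->kmem_accounted = true;
		kmem_enabled_key.enabled++;	/* static_key_slow_inc() in the patch */
	}
}

/* Freeing the group drops the reference only if the group took one. */
static void free_memcg(struct fake_memcg *memcg)
{
	if (memcg->kmem_accounted) {
		memcg->kmem_accounted = false;
		kmem_enabled_key.enabled--;	/* static_key_slow_dec() in the patch */
	}
	assert(kmem_enabled_key.enabled >= 0);
}
```

The invariant is one increment and one decrement per accounted group. If children inherit the accounted state, as described for the next version, each child would presumably take its own reference so that the count stays balanced when groups are freed in any order.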
Re: [PATCH v3 09/13] memcg: kmem accounting lifecycle management [message #48171 is a reply to message #48167] Mon, 01 October 2012 12:29
Glauber Costa
On 10/01/2012 04:15 PM, Michal Hocko wrote:
> Based on the previous discussions I guess this one will get reworked,
> right?
>

Yes, but most of it stayed. The hierarchy part is gone, but because we
will still have kmem pages floating around (potentially), I am still
using the mark_dead() trick with the corresponding get when kmem_accounted.
Re: [PATCH v3 09/13] memcg: kmem accounting lifecycle management [message #48172 is a reply to message #48171] Mon, 01 October 2012 12:36
Michal Hocko
On Mon 01-10-12 16:29:11, Glauber Costa wrote:
> On 10/01/2012 04:15 PM, Michal Hocko wrote:
> > Based on the previous discussions I guess this one will get reworked,
> > right?
> >
>
> Yes, but most of it stayed. The hierarchy part is gone, but because we
> will still have kmem pages floating around (potentially), I am still
> using the mark_dead() trick with the corresponding get when kmem_accounted.

Is it OK if I hold on with the review of this one until the next
version?
--
Michal Hocko
SUSE Labs
Re: [PATCH v3 09/13] memcg: kmem accounting lifecycle management [message #48173 is a reply to message #48172] Mon, 01 October 2012 12:43
Glauber Costa
On 10/01/2012 04:36 PM, Michal Hocko wrote:
> On Mon 01-10-12 16:29:11, Glauber Costa wrote:
>> On 10/01/2012 04:15 PM, Michal Hocko wrote:
>>> Based on the previous discussions I guess this one will get reworked,
>>> right?
>>>
>>
>> Yes, but most of it stayed. The hierarchy part is gone, but because we
>> will still have kmem pages floating around (potentially), I am still
>> using the mark_dead() trick with the corresponding get when kmem_accounted.
>
> Is it OK if I hold on with the review of this one until the next
> version?
>
Of course.

I haven't sent it yet because I also received a lot more feedback for
the slab part (which is expected), and I want to get at least part of
that going before I send it again.
Re: [PATCH v3 13/13] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs [message #48174 is a reply to message #47896] Mon, 01 October 2012 13:17
Michal Hocko
On Tue 18-09-12 18:04:10, Glauber Costa wrote:
> Because those architectures will draw their stacks directly from the
> page allocator, rather than the slab cache, we can directly pass
> __GFP_KMEMCG flag, and issue the corresponding free_pages.
>
> This code path is taken when the architecture doesn't define
> CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
> THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the remaining
> architectures fall in this category.
>
> This will guarantee that every stack page is accounted to the memcg the
> process currently lives on, and will cause the allocations to fail if
> they go over limit.
>
> For the time being, I am defining a new variant of THREADINFO_GFP, not
> to mess with the other path. Once the slab is also tracked by memcg, we
> can get rid of that flag.
>
> Tested to successfully protect against :(){ :|:& };:

OK. Although I complained last time that this is not the full truth, I
do not insist on the gory details of the slaughter this will cause to
the rest of the group, or on the fact that whoever can fork in the group
can easily DoS the whole hierarchy. It has some interesting side effects
as well, but let's leave those to the careful reader ;)

The patch, as is, is still useful and an improvement because it reduces
the impact.
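
The effect of charging kernel stacks to the group can be sketched in user space as follows. This is a simplified model under stated assumptions: `fake_memcg`, `charge_stack`, `uncharge_stack` and `fork_until_failure` are hypothetical names, the stack is assumed to be two pages, and a single counter with a hard limit stands in for the memcg accounting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FAKE_PAGE_SIZE   4096u
#define FAKE_THREAD_SIZE (2u * FAKE_PAGE_SIZE)	/* assumed two-page stack */

/* Hypothetical stand-in for a memcg kmem counter with a hard limit. */
struct fake_memcg {
	size_t kmem_usage;
	size_t kmem_limit;
};

/* Charge one kernel stack; refuse when the limit would be exceeded. */
static bool charge_stack(struct fake_memcg *memcg)
{
	if (memcg->kmem_usage + FAKE_THREAD_SIZE > memcg->kmem_limit)
		return false;	/* the allocation, and thus the fork, fails */
	memcg->kmem_usage += FAKE_THREAD_SIZE;
	return true;
}

/* Uncharge the stack when the task is freed. */
static void uncharge_stack(struct fake_memcg *memcg)
{
	assert(memcg->kmem_usage >= FAKE_THREAD_SIZE);
	memcg->kmem_usage -= FAKE_THREAD_SIZE;
}

/* Fork until the first failure; returns how many tasks were created. */
static unsigned int fork_until_failure(struct fake_memcg *memcg)
{
	unsigned int created = 0;

	while (charge_stack(memcg))
		created++;
	return created;
}
```

The model also shows the caveat raised above: the limit is shared by the whole group, so once a fork bomb has consumed it, every other task in that group fails to fork as well. The damage is confined to the group rather than prevented.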

>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Christoph Lameter <cl@linux.com>
> CC: Pekka Enberg <penberg@cs.helsinki.fi>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Suleiman Souhlal <suleiman@google.com>

Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
> include/linux/thread_info.h | 2 ++
> kernel/fork.c | 4 ++--
> 2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index ccc1899..e7e0473 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -61,6 +61,8 @@ extern long do_no_restart_syscall(struct restart_block *parm);
> # define THREADINFO_GFP (GFP_KERNEL | __GFP_NOTRACK)
> #endif
>
> +#define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
> +
> /*
> * flag set/clear/test wrappers
> * - pass TIF_xxxx constants to these functions
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0ff2bf7..897e89c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -146,7 +146,7 @@ void __weak arch_release_thread_info(struct thread_info *ti)
> static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
> int node)
> {
> - struct page *page = alloc_pages_node(node, THREADINFO_GFP,
> + struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
> THREAD_SIZE_ORDER);
>
> return page ? page_address(page) : NULL;
> @@ -154,7 +154,7 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
>
> static inline void free_thread_info(struct thread_info *ti)
> {
> - free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
> + free_accounted_pages((unsigned long)ti, THREAD_SIZE_ORDER);
> }
> # else
> static struct kmem_cache *thread_info_cache;
> --
> 1.7.11.4

--
Michal Hocko
SUSE Labs
Re: [PATCH v3 12/13] execute the whole memcg freeing in rcu callback [message #48175 is a reply to message #47888] Mon, 01 October 2012 13:27
Michal Hocko
On Tue 18-09-12 18:04:09, Glauber Costa wrote:
> A lot of the initialization we do in mem_cgroup_create() is done with softirqs
> enabled. This includes grabbing a css id, which holds &ss->id_lock->rlock, and
> the per-zone trees, which holds rtpz->lock->rlock. All of those signal to the
> lockdep mechanism that those locks can be used in SOFTIRQ-ON-W context. This
> means that the freeing of memcg structure must happen in a compatible context,
> otherwise we'll get a deadlock.

Maybe I am missing something obvious, but why can't we simply disable
(soft)irqs in mem_cgroup_create rather than make the free path much more
complicated? It really feels strange to defer everything (e.g. soft
reclaim tree cleanup, which should be a no-op at that point because there
shouldn't be any user pages in the group).

> The reference counting mechanism we use allows the memcg structure to be freed
> later and outlive the actual memcg destruction from the filesystem. However, we
> have little, if any, means to guarantee in which context the last memcg_put
> will happen. The best we can do is test it and try to make sure no invalid
> context releases are happening. But as we add more code to memcg, the possible
> interactions grow in number and expose more ways to get context conflicts.
>
> We already moved a part of the freeing to a worker thread to be context-safe
> for the static branches disabling. I see no reason not to do it for the whole
> freeing action. I consider this to be the safe choice.


>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Tested-by: Greg Thelen <gthelen@google.com>
> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> ---
> mm/memcontrol.c | 66 +++++++++++++++++++++++++++++----------------------------
> 1 file changed, 34 insertions(+), 32 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b05ecac..74654f0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5082,16 +5082,29 @@ out_free:
> }
>
> /*
> - * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU,
> - * but in process context. The work_freeing structure is overlaid
> - * on the rcu_freeing structure, which itself is overlaid on memsw.
> + * At destroying mem_cgroup, references from swap_cgroup can remain.
> + * (scanning all at force_empty is too costly...)
> + *
> + * Instead of clearing all references at force_empty, we remember
> + * the number of reference from swap_cgroup and free mem_cgroup when
> + * it goes down to 0.
> + *
> + * Removal of cgroup itself succeeds regardless of refs from swap.
> */
> -static void free_work(struct work_struct *work)
> +
> +static void __mem_cgroup_free(struct mem_cgroup *memcg)
> {
> - struct mem_cgroup *memcg;
> + int node;
> int size = sizeof(struct mem_cgroup);
>
> - memcg = container_of(work, struct mem_cgroup, work_freeing);
> + mem_cgroup_remove_from_trees(memcg);
> + free_css_id(&mem_cgroup_subsys, &memcg->css);
> +
> + for_each_node(node)
> + free_mem_cgroup_per_zone_info(memcg, node);
> +
> + free_percpu(memcg->stat);
> +
> /*
> * We need to make sure that (at least for now), the jump label
> * destruction code runs outside of the cgroup lock. This is because
> @@ -5110,38 +5123,27 @@ static void free_work(struct work_struct *work)
> vfree(memcg);
> }
>
> -static void free_rcu(struct rcu_head *rcu_head)
> -{
> - struct mem_cgroup *memcg;
> -
> - memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
> - INIT_WORK(&memcg->work_freeing, free_work);
> - schedule_work(&memcg->work_freeing);
> -}
>
> /*
> - * At destroying mem_cgroup, references from swap_cgroup can remain.
> - * (scanning all at force_empty is too costly...)
> - *
> - * Instead of clearing all references at force_empty, we remember
> - * the number of reference from swap_cgroup and free mem_cgroup when
> - * it goes down to 0.
> - *
> - * Removal of cgroup itself succeeds regardless of refs from swap.
> + * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU,
> + * but in process context. The work_freeing structure is overlaid
> + * on the rcu_freeing structure, which itself is overlaid on memsw.
> */
> -
> -static void __mem_cgroup_free(struct mem_cgroup *memcg)
> +static void free_work(struct work_struct *work)
> {
> - int node;
> + struct mem_cgroup *memcg;
>
> - mem_cgroup_remove_from_trees(memcg);
> - free_css_id(&mem_cgroup_subsys, &memcg->css);
> + memcg = container_of(work, struct mem_cgroup, work_freeing);
> + __mem_cgroup_free(memcg);
> +}
>
> - for_each_node(node)
> - free_mem_cgroup_per_zone_info(memcg, node);
> +static void free_rcu(struct rcu_head *rcu_head)
> +{
> + struct mem_cgroup *memcg;
>
> - free_percpu(memcg->stat);
> - call_rcu(&memcg->rcu_freeing, free_rcu);
> + memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
> + INIT_WORK(&memcg->work_freeing, free_work);
> + schedule_work(&memcg->work_freeing);
> }
>
> static void mem_cgroup_get(struct mem_cgroup *memcg)
> @@ -5153,7 +5155,7 @@ static void __mem_cgroup_put(struct mem_cgroup *memcg, int count)
> {
> if (atomic_sub_and_test(count, &memcg->refcnt)) {
> struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> - __mem_cgroup_free(memcg);
> + call_rcu(&memcg->rcu_freeing, free_rcu);
> if (parent)
> mem_cgroup_put(parent);
> }
> --
> 1.7.11.4

--
Michal Hocko
SUSE Labs
Re: [PATCH v3 01/13] memcg: Make it possible to use the stock for more than one page. [message #48178 is a reply to message #47889] Mon, 01 October 2012 18:48
Johannes Weiner
On Tue, Sep 18, 2012 at 06:03:58PM +0400, Glauber Costa wrote:
> From: Suleiman Souhlal <ssouhlal@FreeBSD.org>
>
> We currently have a percpu stock cache scheme that charges one page at a
> time from memcg->res, the user counter. When the kernel memory
> controller comes into play, we'll need to charge more than that.
>
> This is because kernel memory allocations will also draw from the user
> counter, and can be bigger than a single page, as it is the case with
> the stack (usually 2 pages) or some higher order slabs.
>
> [ glommer@parallels.com: added a changelog ]
>
> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Acked-by: Michal Hocko <mhocko@suse.cz>

Independent of how the per-subtree enable-through-setting-limit
discussion pans out, we're going to need the charge cache, so:

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
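
The charge cache referred to here can be sketched in user space as follows. This is an illustration only: `fake_stock` and `fake_consume_stock` are hypothetical names, an integer id stands in for the memcg pointer, and the per-CPU aspect is left out. The point is that the stock is consumed in units of `nr_pages` rather than one page at a time.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU stock: pages pre-charged to one memcg. */
struct fake_stock {
	int cached_memcg;	/* id of the memcg the stock belongs to */
	unsigned int nr_pages;	/* pages already charged and held in reserve */
};

/* Take nr_pages from the stock if it belongs to memcg and holds enough. */
static bool fake_consume_stock(struct fake_stock *stock, int memcg,
			       unsigned int nr_pages)
{
	if (stock->cached_memcg != memcg || stock->nr_pages < nr_pages)
		return false;	/* caller falls back to charging the counter */
	stock->nr_pages -= nr_pages;
	return true;
}
```

A two-page stack allocation succeeds from a stock of three pages and leaves one behind; a second two-page request then misses the cache and has to take the slow path.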
Re: [PATCH v3 03/13] memcg: change defines to an enum [message #48180 is a reply to message #47887] Mon, 01 October 2012 19:06
Johannes Weiner
On Tue, Sep 18, 2012 at 06:04:00PM +0400, Glauber Costa wrote:
> This is just a cleanup patch for clarity of expression. In earlier
> submissions, people asked it to be in a separate patch, so here it is.
>
> [ v2: use named enum as type throughout the file as well ]
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Acked-by: Michal Hocko <mhocko@suse.cz>

Should probably be the first in the series to get the cleanups out of
the way :-)

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
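
For readers unfamiliar with this kind of cleanup, the following sketch shows the general shape of replacing defines with a named enum. The identifiers (`OLD_MEM`, `fake_res_type`, `res_type_name`) are hypothetical and are not taken from the patch, which is not quoted in this message.

```c
#include <assert.h>

/* Before: untyped constants, so any int is accepted where a type is meant. */
#define OLD_MEM     0
#define OLD_MEMSWAP 1

/* After: a named enum documents intent at every use site. */
enum fake_res_type {
	FAKE_MEM,
	FAKE_MEMSWAP,
	FAKE_KMEM,
};

/* Using the enum as the parameter type makes the expected values explicit. */
static const char *res_type_name(enum fake_res_type type)
{
	switch (type) {
	case FAKE_MEM:     return "mem";
	case FAKE_MEMSWAP: return "memsw";
	case FAKE_KMEM:    return "kmem";
	}
	return "unknown";
}
```

The numeric values stay the same, so such a change is behavior-neutral. A switch over the enum type also lets compilers that support it (for example GCC with -Wswitch) report a missing case when a new value is added.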
Re: [PATCH v3 05/13] Add a __GFP_KMEMCG flag [message #48181 is a reply to message #47895] Mon, 01 October 2012 19:09
Johannes Weiner
On Tue, Sep 18, 2012 at 06:04:02PM +0400, Glauber Costa wrote:
> This flag is used to indicate to the callees that this allocation is a
> kernel allocation in process context, and should be accounted to
> current's memcg. It takes the numerical place of the recently removed
> __GFP_NO_KSWAPD.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Christoph Lameter <cl@linux.com>
> CC: Pekka Enberg <penberg@cs.helsinki.fi>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Suleiman Souhlal <suleiman@google.com>
> CC: Rik van Riel <riel@redhat.com>
> CC: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

With the feedback from Christoph and Mel incorporated:

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Re: [PATCH v3 03/13] memcg: change defines to an enum [message #48182 is a reply to message #48180] Tue, 02 October 2012 09:10
Glauber Costa
On 10/01/2012 11:06 PM, Johannes Weiner wrote:
> On Tue, Sep 18, 2012 at 06:04:00PM +0400, Glauber Costa wrote:
>> This is just a cleanup patch for clarity of expression. In earlier
>> submissions, people asked it to be in a separate patch, so here it is.
>>
>> [ v2: use named enum as type throughout the file as well ]
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> CC: Johannes Weiner <hannes@cmpxchg.org>
>> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> Acked-by: Michal Hocko <mhocko@suse.cz>
>
> Should probably be the first in the series to get the cleanups out of
> the way :-)
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>
If you guys want to merge this separately, be my guest =)
Re: [PATCH v3 06/13] memcg: kmem controller infrastructure [message #48199 is a reply to message #48151] Wed, 03 October 2012 22:11
Tejun Heo
Hello, Glauber.

Sorry about the late replies. I've been traveling for the Korean
Thanksgiving holidays.

On Mon, Oct 01, 2012 at 12:28:28PM +0400, Glauber Costa wrote:
> > That synchronous ref draining is going away. Maybe we can do that
> > before kmemcg? Michal, do you have some timeframe in mind?
>
> Since you said yourself in other points in this thread that you are fine
> with some page references outliving the cgroup in the case of slab, this
> is a situation that comes with the code, not a situation that was
> incidentally there, and we're making use of.

Hmmm? Not sure what you're trying to say, but I wanted to say that
this should be okay once the scheduled memcg pre_destroy change
happens, and to nudge Michal once more.

Thanks.

--
tejun
Re: [PATCH v3 12/13] execute the whole memcg freeing in rcu callback [message #48205 is a reply to message #48175] Thu, 04 October 2012 10:53
Glauber Costa
On 10/01/2012 05:27 PM, Michal Hocko wrote:
> On Tue 18-09-12 18:04:09, Glauber Costa wrote:
>> A lot of the initialization we do in mem_cgroup_create() is done with softirqs
>> enabled. This includes grabbing a css id, which holds &ss->id_lock->rlock, and
>> the per-zone trees, which holds rtpz->lock->rlock. All of those signal to the
>> lockdep mechanism that those locks can be used in SOFTIRQ-ON-W context. This
>> means that the freeing of memcg structure must happen in a compatible context,
>> otherwise we'll get a deadlock.
>
> Maybe I am missing something obvious but why cannot we simply disable
> (soft)irqs in mem_cgroup_create rather than make the free path much more
> complicated. It really feels strange to defer everything (e.g. soft
> reclaim tree cleanup which should be a no-op at the time because there
> shouldn't be any user pages in the group).
>

Ok.

I was only able to come back to this today; I was mostly working on the
slab feedback over the past few days. I will answer your and Tejun's
concerns together:

Here is the situation: the backtrace I get is this one:

[ 124.956725] =================================
[ 124.957217] [ INFO: inconsistent lock state ]
[ 124.957217] 3.5.0+ #99 Not tainted
[ 124.957217] ---------------------------------
[ 124.957217] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[ 124.957217] ksoftirqd/0/3 [HC0[0]:SC1[1]:HE1:SE0] takes:
[ 124.957217] (&(&ss->id_lock)->rlock){+.?...}, at: [<ffffffff810aa7b2>] spin_lock+0x9/0xb
[ 124.957217] {SOFTIRQ-ON-W} state was registered at:
[ 124.957217] [<ffffffff810996ed>] __lock_acquire+0x31f/0xd68
[ 124.957217] [<ffffffff8109a660>] lock_acquire+0x108/0x15c
[ 124.957217] [<ffffffff81534ec4>] _raw_spin_lock+0x40/0x4f
[ 124.957217] [<ffffffff810aa7b2>] spin_lock+0x9/0xb
[ 124.957217] [<ffffffff810ad00e>] get_new_cssid+0x69/0xf3
[ 124.957217] [<ffffffff810ad0da>] cgroup_init_idr+0x42/0x60
[ 124.957217] [<ffffffff81b20e04>] cgroup_init+0x50/0x100
[ 124.957217] [<ffffffff81b05b9b>] start_kernel+0x3b9/0x3ee
[ 124.957217] [<ffffffff81b052d6>] x86_64_start_reservations+0xb1/0xb5
[ 124.957217] [<ffffffff81b053d8>] x86_64_start_kernel+0xfe/0x10b


So what we learn from it is this: we are acquiring a specific lock (the css
id one) from softirq context. It was previously taken in a
softirq-enabled context, which seems to come directly from
get_new_cssid.

Tejun correctly pointed out that we should never acquire that lock from
a softirq context.

But the situation changes slightly with kmem. Now, the following excerpt
of a backtrace is possible:

[ 48.602775] [<ffffffff81103095>] free_accounted_pages+0x47/0x4c
[ 48.602775] [<ffffffff81047f90>] free_task+0x31/0x5c
[ 48.602775] [<ffffffff8104807d>] __put_task_struct+0xc2/0xdb
[ 48.602775] [<ffffffff8104dfc7>] put_task_struct+0x1e/0x22
[ 48.602775] [<ffffffff8104e144>] delayed_put_task_struct+0x7a/0x98
[ 48.602775] [<ffffffff810cf0e5>] __rcu_process_callbacks+0x269/0x3df
[ 48.602775] [<ffffffff810cf28c>] rcu_process_callbacks+0x31/0x5b
[ 48.602775] [<ffffffff8105266d>] __do_softirq+0x122/0x277

So as you can see, free_accounted_pages (which will trigger a memcg_put()
-> mem_cgroup_free()) can now be called from softirq context, that is,
from an rcu callback (and I just realized I wrote the exact opposite in the
subj line: man, I really suck at that!!)
As a matter of fact, we cannot do the freeing from our own rcu callback
either: we need to move it to a worker thread with the rest.

We already have a worker thread: the reason we have it is not
static_branches. The reason is vfree(), which will BUG_ON(in_interrupt())
and so could not be called from an rcu callback either. We moved static
branches in there as well because of a similar problem, but they were not
the reason it was introduced.

Could we move just part of it to the worker thread? Absolutely yes.
Moving just free_css_id() is enough to make it work. But since it is not
the first context related problem we had, I thought: "to hell with that,
let's move everything and be safe".

I am fine moving free_css_id() only if you would prefer.

Can we disable softirqs when we initialize css_id? Maybe. My machine
seems to boot fine and survive the simple workload that would trigger
that bug if I use irqsave spinlocks instead of normal spinlocks. But
this has to be done from cgroup core: We have no control over css
creation in memcg.
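
For comparison, the irqsave alternative can be modelled the same way. This is a userspace sketch with hypothetical names (`lock_irqsave_model()`, `raise_softirq_model()` and so on), assuming a single CPU where a softirq raised while interrupts are masked is held back until they are unmasked.

```c
#include <stdbool.h>

static int irq_disable_depth;  /* > 0 while interrupts are masked */
static bool softirq_pending;   /* a softirq was raised while masked */
static bool lock_held;
static int deadlocks;          /* times the handler found the lock held */
static int handler_runs;

/* Softirq handler: needs the same lock as the process-context path. */
static void softirq_handler(void)
{
    if (lock_held)
        deadlocks++;           /* a real CPU would spin here forever */
    handler_runs++;
}

/* An interrupt arrives and wants to run the softirq on this CPU. */
static void raise_softirq_model(void)
{
    if (irq_disable_depth > 0)
        softirq_pending = true; /* held back until interrupts are back on */
    else
        softirq_handler();
}

/* Plain lock: interrupts stay enabled while it is held. */
static void lock_plain(void)   { lock_held = true; }
static void unlock_plain(void) { lock_held = false; }

/* irqsave-style lock: mask interrupts first, then take the lock. */
static void lock_irqsave_model(void)
{
    irq_disable_depth++;
    lock_held = true;
}

/* Drop the lock, unmask, and only then run any held-back softirq. */
static void unlock_irqrestore_model(void)
{
    lock_held = false;
    if (--irq_disable_depth == 0 && softirq_pending) {
        softirq_pending = false;
        softirq_handler();
    }
}
```

With the plain lock, a softirq raised inside the critical section finds the lock held; with the irqsave-style pair, the handler only runs after the lock has been dropped.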

How would you guys like me to handle this?
Re: [PATCH v3 12/13] execute the whole memcg freeing in rcu callback [message #48209 is a reply to message #48205] Thu, 04 October 2012 14:20
Glauber Costa
On 10/04/2012 02:53 PM, Glauber Costa wrote:
> [...]

One more thing: as I mentioned in the changelog,
mem_cgroup_remove_exceeded(), called from mem_cgroup_remove_from_trees(),
will lead to the same usage pattern.
Re: [PATCH v3 12/13] execute the whole memcg freeing in rcu callback [message #48213 is a reply to message #48205] Fri, 05 October 2012 15:31
Johannes Weiner
On Thu, Oct 04, 2012 at 02:53:13PM +0400, Glauber Costa wrote:
> [...]
> How would you guys like me to handle this?

Without the vfree callback, I would have preferred just making the
id_lock softirq safe. But since we have to defer (parts of) the freeing
anyway, I prefer your approach of just deferring the rest as well.

But please add comments why the stuff in there is actually deferred.
Just simple notes like:

"this can be called from atomic contexts, <examples>",

"vfree must run from process context" and "css_id locking is not soft
irq safe",

"to hell with that, let's just do everything from the workqueue and be
safe and simple".

(And this may be personal preference, but why have free_work call
__mem_cgroup_free()? Does anyone else need to call that code? There
are too many layers already, why not just keep it all in free_work()
and have one less stack frame on your mind? :))

As for the changelog, here is my attempt:

---

mm: memcg: defer whole memcg tear-down to workqueue

The final memcg put can already happen in atomic context and so the
freeing is deferred to a workqueue because it needs to use vfree().

Kmem tracking will add freeing from softirq context, but the id_lock
acquired when destroying the cgroup object is not softirq safe, e.g.:

> [ 48.602775] [<ffffffff81103095>] free_accounted_pages+0x47/0x4c
> [ 48.602775] [<ffffffff81047f90>] free_task+0x31/0x5c
> [ 48.602775] [<ffffffff8104807d>] __put_task_struct+0xc2/0xdb
> [ 48.602775] [<ffffffff8104dfc7>] put_task_struct+0x1e/0x22
> [ 48.602775] [<ffffffff8104e144>] delayed_put_task_struct+0x7a/0x98
> [ 48.602775] [<ffffffff810cf0e5>] __rcu_process_callbacks+0x269/0x3df
> [ 48.602775] [<ffffffff810cf28c>] rcu_process_callbacks+0x31/0x5b
> [ 48.602775] [<ffffffff8105266d>] __do_softirq+0x122/0x277

To avoid making tear-down too complicated - making locks softirq
safe, having half the cleanup in one function and the other half
somewhere else - just defer everything to the workqueue.
Re: [PATCH v3 12/13] execute the whole memcg freeing in rcu callback [message #48224 is a reply to message #48213] Mon, 08 October 2012 09:45
Glauber Costa
On 10/05/2012 07:31 PM, Johannes Weiner wrote:
> [...]
>
> (And this may be personal preference, but why have free_work call
> __mem_cgroup_free()? Does anyone else need to call that code? There
> are too many layers already, why not just keep it all in free_work()
> and have one less stack frame on your mind? :))
>
It is also used when mem_cgroup_create() fails.