Re: [PATCH v5 06/14] memcg: kmem controller infrastructure [message #48461 is a reply to message #48440]
Thu, 18 October 2012 09:16
Glauber Costa
On 10/18/2012 02:12 AM, Andrew Morton wrote:
> On Tue, 16 Oct 2012 14:16:43 +0400
> Glauber Costa <glommer@parallels.com> wrote:
>
>> This patch introduces infrastructure for tracking kernel memory pages
>> charged to a given memcg. This will happen whenever the caller includes
>> the __GFP_KMEMCG flag and the task belongs to a memcg other than the root.
>>
>> In memcontrol.h those functions are wrapped in inline accessors. The
>> idea is to later patch those with static branches, so we don't incur
>> any overhead when no mem cgroups with limited kmem are in use.
>>
>> Users of this functionality shall interact with the memcg core code
>> through the following functions:
>>
>> memcg_kmem_newpage_charge: will return true if the group can handle the
>> allocation. At this point, struct page is not
>> yet allocated.
>>
>> memcg_kmem_commit_charge: will either revert the charge, if struct page
>> allocation failed, or embed memcg information
>> into page_cgroup.
>>
>> memcg_kmem_uncharge_page: called at free time, will revert the charge.
>>
>> ...
>>
>> +static __always_inline bool
>> +memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
>> +{
>> + if (!memcg_kmem_enabled())
>> + return true;
>> +
>> + /*
>> + * __GFP_NOFAIL allocations will move on even if charging is not
>> + * possible. Therefore we don't even try, and have this allocation
>> + * unaccounted. We could in theory charge it with
>> + * res_counter_charge_nofail, but we hope those allocations are rare,
>> + * and won't be worth the trouble.
>> + */
>> + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
>> + return true;
>> + if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
>> + return true;
>> +
>> + /* If the task is dying, just let it go. */
>> + if (unlikely(test_thread_flag(TIF_MEMDIE)
>> + || fatal_signal_pending(current)))
>> + return true;
>> +
>> + return __memcg_kmem_newpage_charge(gfp, memcg, order);
>> +}
>
> That's a big function! Why was it __always_inline? I'd have thought
> it would be better to move the code after memcg_kmem_enabled() out of
> line.
>
It is big, but it is mostly bit testing. The goal here is to avoid a
function call at all costs, since this is a fast path.
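For the record, a minimal sketch (assuming the jump-label API available at
the time) of why memcg_kmem_enabled() is nearly free: once it is backed by
a static branch, the disabled case is a patched no-op, so none of the bit
testing below it ever runs:

	#include <linux/jump_label.h>

	struct static_key memcg_kmem_enabled_key;

	static inline bool memcg_kmem_enabled(void)
	{
		/* patched no-op branch while no memcg has a kmem limit set */
		return static_key_false(&memcg_kmem_enabled_key);
	}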
> Do we actually need to test PF_KTHREAD when current->mm == NULL?
> Perhaps because of aio threads which temporarily adopt a userspace mm?
I believe so. I remember I discussed this in the past with David
Rientjes and he advised me to test for both.
>
>> +/**
>> + * memcg_kmem_uncharge_page: uncharge pages from memcg
>> + * @page: pointer to struct page being freed
>> + * @order: allocation order.
>> + *
>> + * there is no need to specify memcg here, since it is embedded in page_cgroup
>> + */
>> +static __always_inline void
>> +memcg_kmem_uncharge_page(struct page *page, int order)
>> +{
>> + if (memcg_kmem_enabled())
>> + __memcg_kmem_uncharge_page(page, order);
>> +}
>> +
>> +/**
>> + * memcg_kmem_commit_charge: embeds correct memcg in a page
>> + * @page: pointer to struct page recently allocated
>> + * @memcg: the memcg structure we charged against
>> + * @order: allocation order.
>> + *
>> + * Needs to be called after memcg_kmem_newpage_charge, regardless of success or
>> + * failure of the allocation. If @page is NULL, this function will revert the
>> + * charges. Otherwise, it will commit the memcg given by @memcg to the
>> + * corresponding page_cgroup.
>> + */
>> +static __always_inline void
>> +memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
>> +{
>> + if (memcg_kmem_enabled() && memcg)
>> + __memcg_kmem_commit_charge(page, memcg, order);
>> +}
>
> I suspect the __always_inline's here are to do with static branch
> trickery. A code comment is warranted if so?
>
Not necessarily. Same thing as above: we want to avoid function calls at
those sites.
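For reference, a hedged sketch of how a caller (the page allocator, in the
follow-up patch) is expected to string the three entry points together; the
wrapper name is hypothetical, and gfp is assumed to already carry
__GFP_KMEMCG:

	static struct page *alloc_pages_accounted(gfp_t gfp, int order)
	{
		struct mem_cgroup *memcg = NULL;
		struct page *page;

		/* pre-charge; false means the kmem limit was hit */
		if (!memcg_kmem_newpage_charge(gfp, &memcg, order))
			return NULL;

		page = alloc_pages(gfp, order);

		/* commit on success, revert the pre-charge if page is NULL */
		memcg_kmem_commit_charge(page, memcg, order);
		return page;
	}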
Re: [PATCH v5 06/14] memcg: kmem controller infrastructure [message #48462 is a reply to message #48447]
Thu, 18 October 2012 09:23
Glauber Costa
On 10/18/2012 02:37 AM, David Rientjes wrote:
> On Tue, 16 Oct 2012, Glauber Costa wrote:
>
>> + /* If the task is dying, just let it go. */
>> + if (unlikely(test_thread_flag(TIF_MEMDIE)
>> + || fatal_signal_pending(current)))
>> + return true;
>
> This can be simplified to just check fatal_signal_pending(); all threads
> with TIF_MEMDIE also have a pending SIGKILL.
Yes, I believe it is better. I will change.
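Presumably along these lines:

	/*
	 * If the task is dying, just let it go. TIF_MEMDIE implies a
	 * pending SIGKILL, so fatal_signal_pending() covers both cases.
	 */
	if (unlikely(fatal_signal_pending(current)))
		return true;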
>> +
>> + return __memcg_kmem_newpage_charge(gfp, memcg, order);
>> +}
>> +
>> +/**
>> + * memcg_kmem_uncharge_page: uncharge pages from memcg
>
> Should be memcg_kmem_uncharge_pages() since it takes an order argument?
>
I tried to keep the naming as close as possible to user-memcg. But to be
fair, it is always called page-by-page there, so renaming this to pages()
won't be a problem here.
>> + * @page: pointer to struct page being freed
>> + * @order: allocation order.
>> + *
>> + * there is no need to specify memcg here, since it is embedded in page_cgroup
>> + */
>> +static __always_inline void
>> +memcg_kmem_uncharge_page(struct page *page, int order)
>> +{
>> + if (memcg_kmem_enabled())
>> + __memcg_kmem_uncharge_page(page, order);
>> +}
>> +
>> +/**
>> + * memcg_kmem_commit_charge: embeds correct memcg in a page
>> + * @page: pointer to struct page recently allocated
>> + * @memcg: the memcg structure we charged against
>> + * @order: allocation order.
>> + *
>> + * Needs to be called after memcg_kmem_newpage_charge, regardless of success or
>> + * failure of the allocation. If @page is NULL, this function will revert the
>> + * charges. Otherwise, it will commit the memcg given by @memcg to the
>> + * corresponding page_cgroup.
>> + */
>> +static __always_inline void
>> +memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
>> +{
>> + if (memcg_kmem_enabled() && memcg)
>> + __memcg_kmem_commit_charge(page, memcg, order);
>> +}
>> +
>> #else
>> static inline void sock_update_memcg(struct sock *sk)
>> {
>> @@ -406,6 +489,21 @@ static inline void sock_update_memcg(struct sock *sk)
>> static inline void sock_release_memcg(struct sock *sk)
>> {
>> }
>> +
>> +static inline bool
>> +memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
>> +{
>> + return true;
>> +}
>> +
>> +static inline void memcg_kmem_uncharge_page(struct page *page, int order)
>
> Two spaces.
>
Thanks.
>> +{
>> +}
>> +
>> +static inline void
>> +memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
>> +{
>> +}
>> #endif /* CONFIG_MEMCG_KMEM */
>> #endif /* _LINUX_MEMCONTROL_H */
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 30eafeb..1182188 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -10,6 +10,10 @@
>> * Copyright (C) 2009 Nokia Corporation
>> * Author: Kirill A. Shutemov
>> *
>> + * Kernel Memory Controller
>> + * Copyright (C) 2012 Parallels Inc. and Google Inc.
>> + * Authors: Glauber Costa and Suleiman Souhlal
>> + *
>> * This program is free software; you can redistribute it and/or modify
>> * it under the terms of the GNU General Public License as published by
>> * the Free Software Foundation; either version 2 of the License, or
>> @@ -2630,6 +2634,171 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
>> memcg_check_events(memcg, page);
>> }
>>
>> +#ifdef CONFIG_MEMCG_KMEM
>> +static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
>> +{
>> + return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
>> + (memcg->kmem_accounted & KMEM_ACCOUNTED_MASK);
>> +}
>> +
>> +static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
>> +{
>> + struct res_counter *fail_res;
>> + struct mem_cgroup *_memcg;
>> + int ret = 0;
>> + bool may_oom;
>> +
>> + ret = res_counter_charge(&memcg->kmem, size, &fail_res);
>> + if (ret)
>> + return ret;
>> +
>> + /*
>> + * Conditions under which we can wait for the oom_killer.
>> + * We have to be able to wait, but also, if we can't retry,
>> + * we obviously shouldn't go mess with oom.
>> + */
>> + may_oom = (gfp & __GFP_WAIT) && !(gfp & __GFP_NORETRY);
>
> What about gfp & __GFP_FS?
>
Do you intend to prevent or allow OOM under that flag? I personally
think that anything that accepts being OOM-killed should have __GFP_WAIT
set, so that ought to be enough.
>> +
>> + _memcg = memcg;
>> + ret = __mem_cgroup_try_charge(NULL, gfp, size >> PAGE_SHIFT,
>> + &_memcg, may_oom);
>> +
>> + if (ret == -EINTR) {
>> + /*
>> + * __mem_cgroup_try_charge() chose to bypass to root due to
>> + * OOM kill or fatal signal. Since our only options are to
>> + * either fail the allocation or charge it to this cgroup, do
>> + * it as a temporary condition. But we can't fail. From a
>> + * kmem/slab perspective, the cache has already been selected,
>> + * by mem_cgroup_get_kmem_cache(), so it is too late to change
>> + * our minds. This condition will only trigger if the task
>> + * entered memcg_charge_kmem in a sane state, but was
>> + * OOM-killed. during __mem_cgroup_try_charge. Tasks that are
>
> Looks like some copy-and-paste damage.
>
thanks.
>> +void __memcg_kmem_uncharge_page(struct page *page, int order)
>> +{
>> + struct mem_cgroup *memcg = NULL;
>> + struct page_cgroup *pc;
>> +
>> +
>> + pc = lookup_page_cgroup(page);
>> + /*
>> + * Fast unlocked return. Theoretically it might have changed; we have to
>> + * check again after locking.
>> + */
>> + if (!PageCgroupUsed(pc))
>> + return;
>> +
>> + lock_page_cgroup(pc);
>> + if (PageCgroupUsed(pc)) {
>> + memcg = pc->mem_cgroup;
>> + ClearPageCgroupUsed(pc);
>> + }
>> + unlock_page_cgroup(pc);
>> +
>> + /*
>> + * We trust that if there is a memcg associated with the page, it
>> + * is a valid allocation
>> + */
>> + if (!memcg)
>> + return;
>> +
>> + VM_BUG_ON(mem_cgroup_is_root(memcg));
>> + memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
>> + mem_cgroup_put(memcg);
>
> Should this mem_cgroup_put() be done conditionally on
> memcg->kmem_accounted & KMEM_ACCOUNTED_MASK?
>
> The next patch in the series does memcg_kmem_newpage_charge() in the page
> allocator which will return true for memcg_can_account_kmem() without
> doing mem_cgroup_get().
>
And then this put will go away as well.
I am not testing for memcg_can_account_kmem() here because whether or not
the PageCgroupUsed bit is set (and therefore a valid memcg is present) in
page_cgroup should be the most robust test here.
Re: [PATCH v5 07/14] mm: Allocate kernel pages to the right memcg [message #48463 is a reply to message #48442]
Thu, 18 October 2012 09:24
Glauber Costa
On 10/18/2012 02:12 AM, Andrew Morton wrote:
> On Tue, 16 Oct 2012 14:16:44 +0400
> Glauber Costa <glommer@parallels.com> wrote:
>
>> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
>> page allocator will call the corresponding memcg functions to validate
>> the allocation. Tasks in the root memcg can always proceed.
>>
>> To avoid adding markers to the page - and the kmem flag that would
>> necessarily follow - as well as doing page_cgroup lookups for no reason,
>> whoever marks its allocations with the __GFP_KMEMCG flag is responsible
>> for telling the page allocator that this is such an allocation at
>> free_pages() time.
>
> Well, why? Was that the correct decision?
>
I don't fully understand your question. Is this the same question you
posed in patch 0, about marking some versus marking all? If so, I
believe I should have answered it there.
If not, please explain.
>> This is done by the invocation of
>> __free_accounted_pages() and free_accounted_pages().
>
> These are very general-sounding names. I'd expect the identifiers to
> contain "memcg" and/or "kmem", to identify what's going on.
>
Changed.
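Something along these lines, then (names hedged; whoever allocated with
__GFP_KMEMCG frees through these so the allocator knows to uncharge):

	void __free_memcg_kmem_pages(struct page *page, unsigned int order)
	{
		memcg_kmem_uncharge_page(page, order);
		__free_pages(page, order);
	}

	void free_memcg_kmem_pages(unsigned long addr, unsigned int order)
	{
		if (addr != 0)
			__free_memcg_kmem_pages(virt_to_page((void *)addr),
						order);
	}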
Re: [PATCH v5 11/14] memcg: allow a memcg with kmem charges to be destructed. [message #48464 is a reply to message #48443]
Thu, 18 October 2012 09:33
Glauber Costa
On 10/18/2012 02:12 AM, Andrew Morton wrote:
> On Tue, 16 Oct 2012 14:16:48 +0400
> Glauber Costa <glommer@parallels.com> wrote:
>
>> Because the ultimate goal of the kmem tracking in memcg is to track slab
>> pages as well,
>
> It is? For a major patchset such as this, it's pretty important to
> discuss such long-term plans in the top-level discussion. Covering
> things such as expected complexity, expected performance hit, how these
> plans affected the current implementation, etc.
>
> The main reason for this is that if the future plans appear to be of
> doubtful feasibility and the current implementation isn't sufficiently
> useful without the future stuff, we shouldn't merge the current
> implementation. It's a big issue!
>
Not really. I am not talking about plans when it comes to slab. The code
is there, and is usually posted to linux-mm a few days after I post this
series. It also lives in the kmemcg-slab branch in my git tree.
I am trying to logically split it in two to aid reviewers' work. I may
have made a mistake by splitting it this way, but so far I think it was
the right decision: it allowed people to focus on a part of the work
first, instead of going all the way through a 30-patch series that would
be merged atomically.
I believe they should be merged separately, to make any issues easier to
find. But I also believe that this "separate" should ultimately live in
the same merge window.
Pekka, from the slab side, already stated that 3.8 would not be
unreasonable.
As for the performance hit, my latest benchmark, quoted in the opening
mail of this series, already includes results for both patchsets.
Re: [PATCH v5 13/14] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs [message #48465 is a reply to message #48444]
Thu, 18 October 2012 09:37
Glauber Costa
On 10/18/2012 02:12 AM, Andrew Morton wrote:
> On Tue, 16 Oct 2012 14:16:50 +0400
> Glauber Costa <glommer@parallels.com> wrote:
>
>> @@ -146,7 +146,7 @@ void __weak arch_release_thread_info(struct thread_info *ti)
>> static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
>> int node)
>> {
>> - struct page *page = alloc_pages_node(node, THREADINFO_GFP,
>> + struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
>> THREAD_SIZE_ORDER);
>
> yay, we actually used all this code for something ;)
>
Happy to be of use, sir!
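For context, the hunk relies on a one-line define along these lines (a
sketch; the exact placement is in the thread_info header):

	/* stack pages are charged against the allocating task's memcg */
	#define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)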
> I don't think we really saw a comprehensive list of what else the kmem
> controller will be used for, but I believe that all other envisaged
> applications will require slab accounting, yes?
>
>
> So it appears that all we have at present is a
> yet-another-fork-bomb-preventer, but one which requires that the
> culprit be in a container? That's reasonable, given your
> hosted-environment scenario. It's unclear (to me) that we should merge
> all this code for only this feature. Again, it would be good to have a
> clear listing of and plan for other applications of this code.
>
I agree. This doesn't buy me much without slab accounting. But
reiterating what I've just said in another e-mail, slab accounting is not
really at the planning stage: it has also been through extensive
development.
As a matter of fact, in the beginning it used to be only slab accounting,
without this. I split it out more recently because I believe it allows
people to do a more focused review, leading to better code.
Re: [PATCH v5 09/14] memcg: kmem accounting lifecycle management [message #48467 is a reply to message #48450]
Thu, 18 October 2012 09:42
Glauber Costa
On 10/18/2012 03:28 AM, David Rientjes wrote:
> On Tue, 16 Oct 2012, Glauber Costa wrote:
>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 1182188..e24b388 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -344,6 +344,7 @@ struct mem_cgroup {
>> /* internal only representation about the status of kmem accounting. */
>> enum {
>> KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
>> + KMEM_ACCOUNTED_DEAD, /* dead memcg, pending kmem charges */
>
> "dead memcg with pending kmem charges" seems better.
>
ok.
>> };
>>
>> #define KMEM_ACCOUNTED_MASK (1 << KMEM_ACCOUNTED_ACTIVE)
>> @@ -353,6 +354,22 @@ static void memcg_kmem_set_active(struct mem_cgroup *memcg)
>> {
>> set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted);
>> }
>> +
>> +static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
>> +{
>> + return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted);
>> +}
>
> I think all of these should be inline.
>
They end up being inlined: to the best of my knowledge, the compiler can
and will inline such simple functions regardless of their marking, unless
you explicitly mark them noinline.
>> +
>> +static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
>> +{
>> + if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted))
>> + set_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_accounted);
>> +}
>
> The set_bit() doesn't happen atomically with the test_bit(); what
> synchronization is required for this?
>
I believe the explanation Michal gave in answer to this is comprehensive.
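In short, spelling out the rationale as comments (my paraphrase of that
discussion, not a quote):

	static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
	{
		/*
		 * ACTIVE is set at most once, while the group is alive, and
		 * is never cleared; DEAD is only set here, on the
		 * destruction path, after which nobody can flip ACTIVE
		 * anymore. So the test_bit/set_bit pair cannot race with a
		 * concurrent writer and needs no stronger atomicity.
		 */
		if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted))
			set_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_accounted);
	}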
Re: [PATCH v5 00/14] kmem controller for memcg. [message #48497 is a reply to message #48457]
Thu, 18 October 2012 19:21
akpm
On Thu, 18 Oct 2012 20:51:05 +0400
Glauber Costa <glommer@parallels.com> wrote:
> On 10/18/2012 02:11 AM, Andrew Morton wrote:
> > On Tue, 16 Oct 2012 14:16:37 +0400
> > Glauber Costa <glommer@parallels.com> wrote:
> >
> >> ...
> >>
> >> A general explanation of what this is all about follows:
> >>
> >> The kernel memory limitation mechanism for memcg concerns itself with
> >> disallowing potentially non-reclaimable allocations to happen in exaggerated
> >> quantities by a particular set of processes (cgroup). Those allocations could
> >> create pressure that affects the behavior of a different and unrelated set of
> >> processes.
> >>
> >> Its basic working mechanism is to annotate some allocations with the
> >> __GFP_KMEMCG flag. When this flag is set, the current process allocating will
> >> have its memcg identified and charged against. When reaching a specific limit,
> >> further allocations will be denied.
> >
> > The need to set __GFP_KMEMCG is rather unpleasing, and makes one wonder
> > "why didn't it just track all allocations".
> >
> This was raised as well by Peter Zijlstra during the memcg summit.
Firstly: please treat any question from a reviewer as an indication
that information was missing from the changelog or from code comments.
Ideally all such queries are addressed in a later version of the patch
and changelog.
> The
> answer I gave to him still stands: there is a cost associated with it.
> We believe it comes down to a trade-off: how much does tracking a
> particular kind of allocation help vs how much does it cost?
>
> The free path is especially expensive, since it will always incur a
> page_cgroup lookup.
OK. But that is a quantitative argument, without any quantities! Do
we have even an estimate of what this cost will be? Perhaps it's the
case that, if well implemented, that cost will be acceptable. How do
we tell?
> > Does this mean that over time we can expect more sites to get the
> > _GFP_KMEMCG tagging?
>
> We have been doing kernel memory limitation in OpenVZ for a long time,
> using a quite different mechanism. What we do in this work (with
> slab included) allows us to achieve feature parity with that. It means
> it is good enough for production environments.
That's really good info.
> Whether or not more people will want other allocations to be tracked, I
> can't predict. What I can say is that stack + slab is a very
> significant part of the memory one potentially cares about, and if
> anyone else ever has the need for more, it will come down to a
> trade-off calculation.
OK.
> > If so, are there any special implications, or do
> > we just go in, do the one-line patch and expect everything to work?
>
> With the infrastructure in place, it shouldn't be hard. But it's not
> necessarily a one-liner either. It depends on the practical
> considerations for tying that specific kind of allocation to a
> memcg. The slab work, for instance, which follows this series, is far
> from a one-liner: it is, in fact, a 19-patch series.
>
>
>
> >
> > And how *accurate* is the proposed code? What percentage of kernel
> > memory allocations are unaccounted, typical case and worst case?
>
> With both patchsets applied, all memory used for the stack and most of
> the memory used for slab objects allocated in userspace process contexts
> are accounted.
>
> I honestly don't know what percentage of the total kernel memory this
> represents.
It sounds like the coverage will be good. What's left over? Random
get_free_pages() calls and interrupt-time slab allocations?
I suppose that there are situations in which network rx could consume
significant amounts of unaccounted memory?
> The accuracy for stack pages is very high: in this series, we don't move
> stack pages around when moving a task to other cgroups (for stack, it
> could be done), but other than that, all processes that pop up in a
> cgroup and stay there will have their memory accurately accounted.
>
> The slab is more complicated, and depends on the workload. It will be
> more accurate in workloads in which the level of object-sharing among
> cgroups is low. A container, for instance, is the perfect example of
> where this happens.
>
> >
> > All sorts of questions come to mind over this decision, but it was
> > unexplained. It should be, please. A lot!
> >
> >>
> >> ...
> >>
> >> Limits lower than
> >> the user limit effectively mean there is a separate kernel memory limit that
> >> may be reached independently of the user limit. Values equal to or greater than
> >> the user limit imply only that kernel memory is tracked. This provides a
> >> unified vision of "maximum memory", be it kernel or user memory.
> >>
> >
> > I'm struggling to understand that text much at all. Reading the
> > Documentation/cgroups/memory.txt patch helped.
> >
>
> Great. If you have any specific suggestions I can change that. Maybe I
> should just paste the documentation bit in here...
That's not a bad idea.
Re: [PATCH v5 04/14] kmem accounting basic infrastructure [message #48498 is a reply to message #48435]
Thu, 18 October 2012 19:37
Tejun Heo
Hello, David.
On Wed, Oct 17, 2012 at 03:08:04PM -0700, David Rientjes wrote:
> > +static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
> > +{
> > + int ret = -EINVAL;
> > +#ifdef CONFIG_MEMCG_KMEM
> > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> > + /*
> > + * For simplicity, we won't allow this to be disabled. It also can't
> > + * be changed if the cgroup has children already, or if tasks had
> > + * already joined.
> > + *
> > + * If tasks join before we set the limit, a person looking at
> > + * kmem.usage_in_bytes will have no way to determine when it took
> > + * place, which makes the value quite meaningless.
> > + *
> > + * After it first became limited, changes in the value of the limit are
> > + * of course permitted.
> > + *
> > + * Taking the cgroup_lock is really offensive, but it is so far the only
> > + * way to guarantee that no children will appear. There are plenty of
> > + * other offenders, and they should all go away. Fine grained locking
> > + * is probably the way to go here. When we are fully hierarchical, we
> > + * can also get rid of the use_hierarchy check.
>
> Not sure it's so offensive, it's a pretty standard way of ensuring that
> cont->children doesn't get manipulated in a race.
cgroup_lock is inherently one of the outermost locks as it protects
cgroup hierarchy and modifying cgroup hierarchy involves invoking
subsystem callbacks which may grab subsystem locks. Grabbing it
directly from subsystems thus creates a high likelihood of a
dependency loop, and that's nasty to break.
And I'm unsure whether making cgroup locks finer grained would help as
cpuset grabs cgroup_lock to perform actual task migration which would
require the outermost cgroup locking anyway. This one has already
shown up in a couple of lockdep warnings involving static_key usages.
A couple days ago, I posted a patchset to remove cgroup_lock usage
from cgroup_freezer and at least cgroup_freezer seems like it was
aiming for the wrong behavior which led to the wrong locking behavior
requiring grabbing cgroup_lock. I can't say whether others will be
that easy tho.
Anyways, so, cgroup_lock is in the process of being unexported and I'd
really prefer that another usage not be added, but maybe avoiding it
requires larger changes to memcg and isn't something which can be
achieved here. Dunno.
Will think more about it.
Thanks.
--
tejun
Re: [PATCH v5 07/14] mm: Allocate kernel pages to the right memcg [message #48501 is a reply to message #48463]
Thu, 18 October 2012 20:44
akpm
On Thu, 18 Oct 2012 13:24:47 +0400
Glauber Costa <glommer@parallels.com> wrote:
> On 10/18/2012 02:12 AM, Andrew Morton wrote:
> > On Tue, 16 Oct 2012 14:16:44 +0400
> > Glauber Costa <glommer@parallels.com> wrote:
> >
> >> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
> >> page allocator will call the corresponding memcg functions to validate
> >> the allocation. Tasks in the root memcg can always proceed.
> >>
> >> To avoid adding markers to the page - and the kmem flag that would
> >> necessarily follow - as well as doing page_cgroup lookups for no reason,
> >> whoever marks its allocations with the __GFP_KMEMCG flag is responsible
> >> for telling the page allocator that this is such an allocation at
> >> free_pages() time.
> >
> > Well, why? Was that the correct decision?
> >
>
> I don't fully understand your question. Is this the same question you
> posed in patch 0, about marking some versus marking all? If so, I
> believe I should have answered it there.
Yes, it's the same question. The one which has not yet been fully answered ;)
Re: [PATCH v5 06/14] memcg: kmem controller infrastructure [message #48502 is a reply to message #48462]
Thu, 18 October 2012 21:59
David Rientjes
On Thu, 18 Oct 2012, Glauber Costa wrote:
> >> @@ -2630,6 +2634,171 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
> >> memcg_check_events(memcg, page);
> >> }
> >>
> >> +#ifdef CONFIG_MEMCG_KMEM
> >> +static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
> >> +{
> >> + return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
> >> + (memcg->kmem_accounted & KMEM_ACCOUNTED_MASK);
> >> +}
> >> +
> >> +static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
> >> +{
> >> + struct res_counter *fail_res;
> >> + struct mem_cgroup *_memcg;
> >> + int ret = 0;
> >> + bool may_oom;
> >> +
> >> + ret = res_counter_charge(&memcg->kmem, size, &fail_res);
> >> + if (ret)
> >> + return ret;
> >> +
> >> + /*
> >> + * Conditions under which we can wait for the oom_killer.
> >> + * We have to be able to wait, but also, if we can't retry,
> >> + * we obviously shouldn't go mess with oom.
> >> + */
> >> + may_oom = (gfp & __GFP_WAIT) && !(gfp & __GFP_NORETRY);
> >
> > What about gfp & __GFP_FS?
> >
>
> Do you intend to prevent or allow OOM under that flag? I personally
> think that anything that accepts being OOM-killed should have __GFP_WAIT
> set, so that ought to be enough.
>
The oom killer in the page allocator cannot trigger without __GFP_FS
because direct reclaim has little chance of being very successful and
thus we end up needlessly killing processes, and that tends to happen
> quite a bit if we don't check for it. Seems like this would also happen
with memcg if mem_cgroup_reclaim() has a large probability of failing?
Re: [PATCH v5 00/14] kmem controller for memcg. [message #48513 is a reply to message #48497]
Fri, 19 October 2012 09:55
Glauber Costa
On 10/18/2012 11:21 PM, Andrew Morton wrote:
> On Thu, 18 Oct 2012 20:51:05 +0400
> Glauber Costa <glommer@parallels.com> wrote:
>
>> On 10/18/2012 02:11 AM, Andrew Morton wrote:
>>> On Tue, 16 Oct 2012 14:16:37 +0400
>>> Glauber Costa <glommer@parallels.com> wrote:
>>>
>>>> ...
>>>>
>>>> A general explanation of what this is all about follows:
>>>>
>>>> The kernel memory limitation mechanism for memcg concerns itself with
>>>> disallowing potentially non-reclaimable allocations to happen in exaggerated
>>>> quantities by a particular set of processes (cgroup). Those allocations could
>>>> create pressure that affects the behavior of a different and unrelated set of
>>>> processes.
>>>>
>>>> Its basic working mechanism is to annotate some allocations with the
>>>> __GFP_KMEMCG flag. When this flag is set, the current process allocating will
>>>> have its memcg identified and charged against. When reaching a specific limit,
>>>> further allocations will be denied.
>>>
>>> The need to set __GFP_KMEMCG is rather unpleasing, and makes one wonder
>>> "why didn't it just track all allocations".
>>>
>> This was raised as well by Peter Zijlstra during the memcg summit.
>
> Firstly: please treat any question from a reviewer as an indication
> that information was missing from the changelog or from code comments.
> Ideally all such queries are addressed in a later version of the patch
> and changelog.
>
This is in no way at odds with me mentioning that it has been raised
before! =)
>> The
>> answer I gave to him still stands: there is a cost associated with it.
>> We believe it comes down to a trade-off: how much does tracking a
>> particular kind of allocation help vs how much does it cost?
>>
>> The free path is especially expensive, since it will always incur a
>> page_cgroup lookup.
>
> OK. But that is a quantitative argument, without any quantities! Do
> we have even an estimate of what this cost will be? Perhaps it's the
> case that, if well implemented, that cost will be acceptable. How do
> we tell?
>
There are two ways:
1) Measuring on various workloads. The workload I measured here (link at
the beginning of this e-mail) showed a 2-3% penalty with the whole thing
applied. Truth be told, this was mostly pinpointed to the slab part,
which gets most of its cost from a relay function, and not from the page
allocation per se. But for me, this is enough to show that the cost is
high enough to bother some.
2) We can infer from past behavior of memcg. It has always shown itself
to be quite an expensive beast. Making it faster is a completely separate
endeavor. It seems only natural to me to reduce its reach even without
specific numbers for each of the to-be-tracked candidates.
Moreover, there is the cost question, but cost is not *the only*
question, as I underlined a few paragraphs below. It is not always
obvious how to pinpoint a kernel page to a specific process, so this
needs to be analyzed on a case-by-case basis. The slab is the hardest
one, and it is done. But even then...
If this is still not good enough, and you would like me to measure
something else, just let me know.
>>> Does this mean that over time we can expect more sites to get the
>>> _GFP_KMEMCG tagging?
>>
>> We have been doing kernel memory limitation in OpenVZ for a long time,
>> using a quite different mechanism. What we do in this work (with
>> slab included) allows us to achieve feature parity with that. It means
>> it is good enough for production environments.
>
> That's really good info.
>
>> Whether or not more people will want other allocations to be tracked, I
>> can't predict. What I can say is that stack + slab is a very
>> significant part of the memory one potentially cares about, and if
>> anyone else ever has the need for more, it will come down to a
>> trade-off calculation.
>
> OK.
>
>>> If so, are there any special implications, or do
>>> we just go in, do the one-line patch and expect everything to work?
>>
>> With the infrastructure in place, it shouldn't be hard. But it's not
>> necessarily a one-liner either. It depends on the practical
>> considerations for tying that specific kind of allocation to a
>> memcg. The slab work, for instance, which follows this series, is far
>> from a one-liner: it is, in fact, a 19-patch series.
>>
>>
>>
>>>
>>> And how *accurate* is the proposed code? What percentage of kernel
>>> memory allocations are unaccounted, typical case and worst case?
>>
>> With both patchsets applied, all memory used for the stack and most of
>> the memory used for slab objects allocated in userspace process contexts
>> are accounted.
>>
>> I honestly don't know what percentage of the total kernel memory this
>> represents.
>
> It sounds like the coverage will be good. What's left over? Random
> get_free_pages() calls and interrupt-time slab allocations?
>
Random get_free_pages() calls, vmalloc, ptes. Interrupt context is left
out on purpose, because we can't cgroup-track something that doesn't have
a process context.
> I suppose that there are situations in which network rx could consume
> significant amounts of unaccounted memory?
>
Not unaccounted. This is merged already =)
Re: [PATCH v5 06/14] memcg: kmem controller infrastructure [message #48514 is a reply to message #48510]
Fri, 19 October 2012 10:00
Glauber Costa
On 10/19/2012 01:31 PM, David Rientjes wrote:
> On Fri, 19 Oct 2012, Glauber Costa wrote:
>
>>>>> Do we actually need to test PF_KTHREAD when current->mm == NULL?
>>>>> Perhaps because of aio threads which temporarily adopt a userspace mm?
>>>>
>>>> I believe so. I remember I discussed this in the past with David
>>>> Rientjes and he advised me to test for both.
>>>>
>>>
>>> Threads with PF_KTHREAD can do use_mm() to assume an ->mm but hopefully they aren't
>>> allocating slab while doing so. Have you considered actually charging
>>> current->mm->owner for that memory, though, since the kthread will have
>>> freed the memory before unuse_mm() or otherwise have charged it on behalf
>>> of a user process, i.e. only exempting PF_KTHREAD?
>>>
>> I always charge current->mm->owner.
>>
>
> Yeah, I'm asking have you considered charging current->mm->owner for the
> memory when a kthread (current) assumes the mm of a user process via
> use_mm()? It may free the memory before calling unuse_mm(), but it's also
> allocating the memory on behalf of a user so this exemption might be
> dangerous if use_mm() becomes more popular. I don't think there's
> anything that prevents that charge, I'm just wondering if you considered
> doing it even for kthreads with an mm.
>
Well, I thought about it.
And I personally don't like it. I think all kthreads should be treated
the same: we have control over them, unlike userspace applications, and
we never expect their memory consumption to explode.
Especially considering that those allocations are supposed to be
short-lived, we would only be paying the res_counter cost for no reason.
Re: [PATCH v5 06/14] memcg: kmem controller infrastructure [message #48515 is a reply to message #48502]
Fri, 19 October 2012 10:08
Glauber Costa
On 10/19/2012 01:59 AM, David Rientjes wrote:
> On Thu, 18 Oct 2012, Glauber Costa wrote:
>
>>>> @@ -2630,6 +2634,171 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
>>>> memcg_check_events(memcg, page);
>>>> }
>>>>
>>>> +#ifdef CONFIG_MEMCG_KMEM
>>>> +static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
>>>> +{
>>>> + return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
>>>> + (memcg->kmem_accounted & KMEM_ACCOUNTED_MASK);
>>>> +}
>>>> +
>>>> +static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
>>>> +{
>>>> + struct res_counter *fail_res;
>>>> + struct mem_cgroup *_memcg;
>>>> + int ret = 0;
>>>> + bool may_oom;
>>>> +
>>>> + ret = res_counter_charge(&memcg->kmem, size, &fail_res);
>>>> + if (ret)
>>>> + return ret;
>>>> +
>>>> + /*
>>>> + * Conditions under which we can wait for the oom_killer.
>>>> + * We have to be able to wait, but also, if we can't retry,
>>>> + * we obviously shouldn't go mess with oom.
>>>> + */
>>>> + may_oom = (gfp & __GFP_WAIT) && !(gfp & __GFP_NORETRY);
>>>
>>> What about gfp & __GFP_FS?
>>>
>>
>> Do you intend to prevent or allow OOM under that flag? I personally
>> think that anything that accepts being OOM-killed should have __GFP_WAIT
>> set, so that ought to be enough.
>>
>
> The oom killer in the page allocator cannot trigger without __GFP_FS
> because direct reclaim has little chance of being very successful and
> thus we end up needlessly killing processes, and that tends to happen
> quite a bit if we don't check for it. Seems like this would also happen
> with memcg if mem_cgroup_reclaim() has a large probability of failing?
>
I can indeed see tests for __GFP_FS in some key locations in mm/ before
calling the OOM killer.
Should I test for __GFP_IO as well? If the idea is to let OOM trigger
only for allocations that can write their pages back, how would you
feel about the following test:
may_oom = (gfp & GFP_KERNEL) && !(gfp & __GFP_NORETRY) ?
Michal, what is your take here?
Re: [PATCH v5 06/14] memcg: kmem controller infrastructure [message #48556 is a reply to message #48554]
Mon, 22 October 2012 12:51
Michal Hocko
[Sorry for the late reply]
On Mon 22-10-12 16:34:15, Glauber Costa wrote:
> On 10/20/2012 12:34 AM, David Rientjes wrote:
> > On Fri, 19 Oct 2012, Glauber Costa wrote:
> >
> >>>>> What about gfp & __GFP_FS?
> >>>>>
> >>>>
> >>>> Do you intend to prevent or allow OOM under that flag? I personally
> >>>> think that anything that accepts being OOM-killed should have __GFP_WAIT
> >>>> set, so that ought to be enough.
> >>>>
> >>>
> >>> The oom killer in the page allocator cannot trigger without __GFP_FS
> >>> because direct reclaim has little chance of being very successful and
> >>> thus we end up needlessly killing processes, and that tends to happen
> >>> quite a bit if we don't check for it. Seems like this would also happen
> >>> with memcg if mem_cgroup_reclaim() has a large probability of failing?
> >>>
> >>
> >> I can indeed see tests for __GFP_FS in some key locations in mm/ before
> >> calling the OOM killer.
> >>
> >> Should I test for __GFP_IO as well?
> >
> > It's not really necessary: if __GFP_IO isn't set, then it wouldn't make
> > sense for __GFP_FS to be set.
> >
> >> If the idea is to let OOM trigger
> >> only for allocations that can write their pages back, how would you
> >> feel about the following test:
> >>
> >> may_oom = (gfp & GFP_KERNEL) && !(gfp & __GFP_NORETRY) ?
> >>
> >
> > I would simply copy the logic from the page allocator and only trigger oom
> > for __GFP_FS and !__GFP_NORETRY.
> >
>
> That seems reasonable to me. Michal ?
Yes, it makes sense to be consistent with the global case. While we are
at it, do we need to consider PF_DUMPCORE resp. !__GFP_NOFAIL?
--
Michal Hocko
SUSE Labs
Re: [PATCH v5 06/14] memcg: kmem controller infrastructure [message #48557 is a reply to message #48556]
Mon, 22 October 2012 12:52
Glauber Costa
On 10/22/2012 04:51 PM, Michal Hocko wrote:
> [Sorry for the late reply]
>
> On Mon 22-10-12 16:34:15, Glauber Costa wrote:
>> On 10/20/2012 12:34 AM, David Rientjes wrote:
>>> On Fri, 19 Oct 2012, Glauber Costa wrote:
>>>
>>>>>>> What about gfp & __GFP_FS?
>>>>>>>
>>>>>>
>>>>>> Do you intend to prevent or allow OOM under that flag? I personally
>>>>>> think that anything that accepts being OOM-killed should have __GFP_WAIT
>>>>>> set, so that ought to be enough.
>>>>>>
>>>>>
>>>>> The oom killer in the page allocator cannot trigger without __GFP_FS
>>>>> because direct reclaim has little chance of being very successful and
>>>>> thus we end up needlessly killing processes, and that tends to happen
>>>>> quite a bit if we don't check for it. Seems like this would also happen
>>>>> with memcg if mem_cgroup_reclaim() has a large probability of failing?
>>>>>
>>>>
>>>> I can indeed see tests for __GFP_FS in some key locations in mm/ before
>>>> calling the OOM killer.
>>>>
>>>> Should I test for __GFP_IO as well?
>>>
>>> It's not really necessary: if __GFP_IO isn't set, then it wouldn't make
>>> sense for __GFP_FS to be set.
>>>
>>>> If the idea is to let OOM trigger
>>>> only for allocations that can write their pages back, how would you
>>>> feel about the following test:
>>>>
>>>> may_oom = (gfp & GFP_KERNEL) && !(gfp & __GFP_NORETRY) ?
>>>>
>>>
>>> I would simply copy the logic from the page allocator and only trigger oom
>>> for __GFP_FS and !__GFP_NORETRY.
>>>
>>
>> That seems reasonable to me. Michal ?
>
> Yes, it makes sense to be consistent with the global case. While we are
> at it, do we need to consider PF_DUMPCORE resp. !__GFP_NOFAIL?
>
At least for kmem, __GFP_NOFAIL will not reach this code path; we
ditch it to the root in memcontrol.h.
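So, putting the agreed points together, the check would presumably end
up as:

	/*
	 * Mirror the page allocator: only involve the OOM killer when the
	 * allocation may use filesystem reclaim and is allowed to retry.
	 * __GFP_NOFAIL never reaches this point; it is bypassed to the
	 * root in memcontrol.h.
	 */
	may_oom = (gfp & __GFP_FS) && !(gfp & __GFP_NORETRY);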