Home » Mailing lists » Devel » [PATCH 00/11] kmem controller for memcg: stripped down version
|
|
Re: [PATCH 10/11] memcg: allow a memcg with kmem charges to be destructed. [message #46961 is a reply to message #46922] |
Tue, 26 June 2012 05:59   |
KAMEZAWA Hiroyuki
Messages: 463 Registered: September 2006
|
Senior Member |
|
|
(2012/06/25 23:15), Glauber Costa wrote:
> Because the ultimate goal of the kmem tracking in memcg is to
> track slab pages as well, we can't guarantee that we'll always
> be able to point a page to a particular process, and migrate
> the charges along with it - since in the common case, a page
> will contain data belonging to multiple processes.
>
> Because of that, when we destroy a memcg, we only make sure
> the destruction will succeed by discounting the kmem charges
> from the user charges when we try to empty the cgroup.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Christoph Lameter <cl@linux.com>
> CC: Pekka Enberg <penberg@cs.helsinki.fi>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Suleiman Souhlal <suleiman@google.com>
> ---
> mm/memcontrol.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a6a440b..bb9b6fe 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -598,6 +598,11 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
> {
> if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
> static_key_slow_dec(&mem_cgroup_kmem_enabled_key);
> + /*
> + * This check can't live in kmem destruction function,
> + * since the charges will outlive the cgroup
> + */
> + BUG_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
> }
> #else
> static void disarm_kmem_keys(struct mem_cgroup *memcg)
> @@ -3838,6 +3843,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
> int node, zid, shrink;
> int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> struct cgroup *cgrp = memcg->css.cgroup;
> + u64 usage;
>
> css_get(&memcg->css);
>
> @@ -3877,8 +3883,10 @@ move_account:
> if (ret == -ENOMEM)
> goto try_to_free;
> cond_resched();
> + usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
> + res_counter_read_u64(&memcg->kmem, RES_USAGE);
> /* "ret" should also be checked to ensure all lists are empty. */
> - } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);
> + } while (usage > 0 || ret);
> out:
> css_put(&memcg->css);
> return ret;
>
Hm....maybe work enough. Could you add more comments on the code ?
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
|
|
|
Re: [PATCH 05/11] Add a __GFP_KMEMCG flag [message #46962 is a reply to message #46954] |
Tue, 26 June 2012 07:08   |
Glauber Costa
Messages: 916 Registered: October 2011
|
Senior Member |
|
|
On 06/26/2012 08:25 AM, David Rientjes wrote:
> On Mon, 25 Jun 2012, Glauber Costa wrote:
>
>> >This flag is used to indicate to the callees that this allocation will be
>> >serviced to the kernel. It is not supposed to be passed by the callers
>> >of kmem_cache_alloc, but rather by the cache core itself.
>> >
> Not sure what "serviced to the kernel" means, does this mean that the
> memory will not be accounted for to the root memcg?
>
In this context, it means that is a kernel allocation, not a userspace
one (but in process context, of course), *and* it is to be accounted a
specific memcg.
|
|
|
Re: [PATCH 04/11] kmem slab accounting basic infrastructure [message #46963 is a reply to message #46953] |
Tue, 26 June 2012 07:09   |
Glauber Costa
Messages: 916 Registered: October 2011
|
Senior Member |
|
|
On 06/26/2012 08:22 AM, David Rientjes wrote:
> On Mon, 25 Jun 2012, Glauber Costa wrote:
>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 9352d40..6f34b77 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -265,6 +265,10 @@ struct mem_cgroup {
>> };
>>
>> /*
>> + * the counter to account for kernel memory usage.
>> + */
>> + struct res_counter kmem;
>> + /*
>> * Per cgroup active and inactive list, similar to the
>> * per zone LRU lists.
>> */
>> @@ -279,6 +283,7 @@ struct mem_cgroup {
>> * Should the accounting and control be hierarchical, per subtree?
>> */
>> bool use_hierarchy;
>> + bool kmem_accounted;
>>
>> bool oom_lock;
>> atomic_t under_oom;
>> @@ -391,6 +396,7 @@ enum res_type {
>> _MEM,
>> _MEMSWAP,
>> _OOM_TYPE,
>> + _KMEM,
>> };
>>
>> #define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
>> @@ -1438,6 +1444,10 @@ done:
>> res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
>> res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
>> res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
>> + printk(KERN_INFO "kmem: usage %llukB, limit %llukB, failcnt %llu\n",
>> + res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
>> + res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
>> + res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
>> }
>>
>> /*
>> @@ -3879,6 +3889,11 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
>> else
>> val = res_counter_read_u64(&memcg->memsw, name);
>> break;
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> + case _KMEM:
>> + val = res_counter_read_u64(&memcg->kmem, name);
>> + break;
>> +#endif
>
> This shouldn't need an #ifdef, ->kmem is available on all
> CONFIG_CGROUP_MEM_RES_CTLR kernels. Same with several of the other
> instances in this patch.
>
> Can't these instances be addressed by not adding kmem_cgroup_files without
> CONFIG_CGROUP_MEM_RES_CTLR_KMEM?
Yes, it can.
|
|
|
Re: [PATCH 02/11] memcg: Reclaim when more than one page needed. [message #46964 is a reply to message #46951] |
Tue, 26 June 2012 07:12   |
Glauber Costa
Messages: 916 Registered: October 2011
|
Senior Member |
|
|
>
>> + * retries
>> + */
>> +#define NR_PAGES_TO_RETRY 2
>> +
>
> Should be 1 << PAGE_ALLOC_COSTLY_ORDER? Where does this number come from?
> The changelog doesn't specify.
Hocko complained about that, and I changed. Where the number comes from,
is stated in the comments: it is a number small enough to have high
changes of had been freed by the previous reclaim, and yet around the
number of pages of a kernel allocation.
Of course there are allocations for nr_pages > 2. But 2 will already
service the stack most of the time, and most of the slab caches.
>> static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>> - unsigned int nr_pages, bool oom_check)
>> + unsigned int nr_pages, unsigned int min_pages,
>> + bool oom_check)
>> {
>> unsigned long csize = nr_pages * PAGE_SIZE;
>> struct mem_cgroup *mem_over_limit;
>> @@ -2182,18 +2190,18 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>> } else
>> mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
>> /*
>> - * nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
>> - * of regular pages (CHARGE_BATCH), or a single regular page (1).
>> - *
>> * Never reclaim on behalf of optional batching, retry with a
>> * single page instead.
>> */
>> - if (nr_pages == CHARGE_BATCH)
>> + if (nr_pages > min_pages)
>> return CHARGE_RETRY;
>>
>> if (!(gfp_mask & __GFP_WAIT))
>> return CHARGE_WOULDBLOCK;
>>
>> + if (gfp_mask & __GFP_NORETRY)
>> + return CHARGE_NOMEM;
>> +
>> ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
>> if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>> return CHARGE_RETRY;
>> @@ -2206,7 +2214,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>> * unlikely to succeed so close to the limit, and we fall back
>> * to regular pages anyway in case of failure.
>> */
>> - if (nr_pages == 1 && ret)
>> + if (nr_pages <= NR_PAGES_TO_RETRY && ret)
>> return CHARGE_RETRY;
>>
>> /*
>> @@ -2341,7 +2349,8 @@ again:
>> nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
>> }
>>
>> - ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
>> + ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
>> + oom_check);
>> switch (ret) {
>> case CHARGE_OK:
>> break;
|
|
|
Re: [PATCH 00/11] kmem controller for memcg: stripped down version [message #46965 is a reply to message #46947] |
Tue, 26 June 2012 07:17   |
Glauber Costa
Messages: 916 Registered: October 2011
|
Senior Member |
|
|
On 06/26/2012 03:27 AM, Andrew Morton wrote:
> On Mon, 25 Jun 2012 18:15:17 +0400
> Glauber Costa <glommer@parallels.com> wrote:
>
>> What I am proposing with this series is a stripped down version of the
>> kmem controller for memcg that would allow us to merge significant parts
>> of the infrastructure, while leaving out, for now, the polemic bits about
>> the slab while it is being reworked by Cristoph.
>>
>> Me reasoning for that is that after the last change to introduce a gfp
>> flag to mark kernel allocations, it became clear to me that tracking other
>> resources like the stack would then follow extremely naturaly. I figured
>> that at some point we'd have to solve the issue pointed by David, and avoid
>> testing the Slab flag in the page allocator, since it would soon be made
>> more generic. I do that by having the callers to explicit mark it.
>>
>> So to demonstrate how it would work, I am introducing a stack tracker here,
>> that is already a functionality per-se: it successfully stops fork bombs to
>> happen. (Sorry for doing all your work, Frederic =p ). Note that after all
>> memcg infrastructure is deployed, it becomes very easy to track anything.
>> The last patch of this series is extremely simple.
>>
>> The infrastructure is exactly the same we had in memcg, but stripped down
>> of the slab parts. And because what we have after those patches is a feature
>> per-se, I think it could be considered for merging.
>
> hm. None of this new code makes the kernel smaller, faster, easier to
> understand or more fun to read!
Not sure if this is a general comment - in case I agree - or if targeted
to my statement that this is "stripped down". If so, it is of course
smaller relative to my previous slab accounting patches.
The infrastructure is largely common, but I realized that a future user,
tracking the stack, would be a lot simpler and could be done first.
> Presumably we're getting some benefit for all the downside. When the
> time is appropriate, please do put some time into explaining that
> benefit, so that others can agree that it is a worthwhile tradeoff.
>
Well, for one thing, we stop fork bombs for processes inside cgroups.
I think the justification for that was already given when you asked
people about reasoning for merging Frederic's process tracking cgroup.
Just that wasn't merged because people were largely unhappy with the
form it took. I can't speak for everybody here, but AFAIK, tracking the
stack through the memory it used, therefore using my proposed kmem
controller, was an idea that good quite a bit of traction with the
memcg/memory people. So here you have something that people already
asked a lot for, in a shape and interface that seem to be acceptable.
|
|
|
Re: [PATCH 10/11] memcg: allow a memcg with kmem charges to be destructed. [message #46966 is a reply to message #46961] |
Tue, 26 June 2012 07:21   |
Glauber Costa
Messages: 916 Registered: October 2011
|
Senior Member |
|
|
On 06/26/2012 09:59 AM, Kamezawa Hiroyuki wrote:
> (2012/06/25 23:15), Glauber Costa wrote:
>> Because the ultimate goal of the kmem tracking in memcg is to
>> track slab pages as well, we can't guarantee that we'll always
>> be able to point a page to a particular process, and migrate
>> the charges along with it - since in the common case, a page
>> will contain data belonging to multiple processes.
>>
>> Because of that, when we destroy a memcg, we only make sure
>> the destruction will succeed by discounting the kmem charges
>> from the user charges when we try to empty the cgroup.
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> CC: Christoph Lameter <cl@linux.com>
>> CC: Pekka Enberg <penberg@cs.helsinki.fi>
>> CC: Michal Hocko <mhocko@suse.cz>
>> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Johannes Weiner <hannes@cmpxchg.org>
>> CC: Suleiman Souhlal <suleiman@google.com>
>> ---
>> mm/memcontrol.c | 10 +++++++++-
>> 1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index a6a440b..bb9b6fe 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -598,6 +598,11 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
>> {
>> if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
>> static_key_slow_dec(&mem_cgroup_kmem_enabled_key);
>> + /*
>> + * This check can't live in kmem destruction function,
>> + * since the charges will outlive the cgroup
>> + */
>> + BUG_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
>> }
>> #else
>> static void disarm_kmem_keys(struct mem_cgroup *memcg)
>> @@ -3838,6 +3843,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
>> int node, zid, shrink;
>> int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
>> struct cgroup *cgrp = memcg->css.cgroup;
>> + u64 usage;
>>
>> css_get(&memcg->css);
>>
>> @@ -3877,8 +3883,10 @@ move_account:
>> if (ret == -ENOMEM)
>> goto try_to_free;
>> cond_resched();
>> + usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
>> + res_counter_read_u64(&memcg->kmem, RES_USAGE);
>> /* "ret" should also be checked to ensure all lists are empty. */
>> - } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);
>> + } while (usage > 0 || ret);
>> out:
>> css_put(&memcg->css);
>> return ret;
>>
> Hm....maybe work enough. Could you add more comments on the code ?
>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
I always can.
|
|
|
Re: [PATCH 11/11] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs [message #46967 is a reply to message #46955] |
Tue, 26 June 2012 07:23   |
Glauber Costa
Messages: 916 Registered: October 2011
|
Senior Member |
|
|
On 06/26/2012 08:57 AM, David Rientjes wrote:
> On Mon, 25 Jun 2012, Glauber Costa wrote:
>
>> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
>> index ccc1899..914ec07 100644
>> --- a/include/linux/thread_info.h
>> +++ b/include/linux/thread_info.h
>> @@ -61,6 +61,12 @@ extern long do_no_restart_syscall(struct restart_block *parm);
>> # define THREADINFO_GFP (GFP_KERNEL | __GFP_NOTRACK)
>> #endif
>>
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> +# define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
>> +#else
>> +# define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP)
>> +#endif
>> +
>
> This type of requirement is going to become nasty very quickly if nobody
> can use __GFP_KMEMCG without testing for CONFIG_CGROUP_MEM_RES_CTLR_KMEM.
> Perhaps define __GFP_KMEMCG to be 0x0 if it's not enabled, similar to how
> kmemcheck does?
>
That is what I've done in my first version of this patch. At that time,
Christoph wanted it to be this way so we would make sure it would never
be used with #CONFIG_CGROUP_MEM_RES_CTLR_KMEM defined. A value of zero
will generate no errors. Undefined value will.
Now, if you ask me, I personally prefer following what kmemcheck does
here...
|
|
|
|
Re: [PATCH 03/11] memcg: change defines to an enum [message #46970 is a reply to message #46952] |
Tue, 26 June 2012 08:28   |
Glauber Costa
Messages: 916 Registered: October 2011
|
Senior Member |
|
|
On 06/26/2012 08:11 AM, David Rientjes wrote:
> On Mon, 25 Jun 2012, Glauber Costa wrote:
>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 8e601e8..9352d40 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -387,9 +387,12 @@ enum charge_type {
>> };
>>
>> /* for encoding cft->private value on file */
>> -#define _MEM (0)
>> -#define _MEMSWAP (1)
>> -#define _OOM_TYPE (2)
>> +enum res_type {
>> + _MEM,
>> + _MEMSWAP,
>> + _OOM_TYPE,
>> +};
>> +
>> #define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
>> #define MEMFILE_TYPE(val) ((val) >> 16 & 0xffff)
>> #define MEMFILE_ATTR(val) ((val) & 0xffff)
>
> Shouldn't everything that does MEMFILE_TYPE() now be using type
> enum res_type rather than int?
>
If you mean the following three fields, no, since they are masks and
operations.
If you mean something else, what do you mean ?
|
|
|
Re: [PATCH 02/11] memcg: Reclaim when more than one page needed. [message #46972 is a reply to message #46948] |
Tue, 26 June 2012 08:39   |
Glauber Costa
Messages: 916 Registered: October 2011
|
Senior Member |
|
|
On 06/26/2012 03:33 AM, Suleiman Souhlal wrote:
> On Mon, Jun 25, 2012 at 7:15 AM, Glauber Costa <glommer@parallels.com> wrote:
>> From: Suleiman Souhlal <ssouhlal@FreeBSD.org>
>>
>> mem_cgroup_do_charge() was written before slab accounting, and expects
>> three cases: being called for 1 page, being called for a stock of 32 pages,
>> or being called for a hugepage. If we call for 2 or 3 pages (and several
>> slabs used in process creation are such, at least with the debug options I
>> had), it assumed it's being called for stock and just retried without reclaiming.
>>
>> Fix that by passing down a minsize argument in addition to the csize.
>>
>> And what to do about that (csize == PAGE_SIZE && ret) retry? If it's
>> needed at all (and presumably is since it's there, perhaps to handle
>> races), then it should be extended to more than PAGE_SIZE, yet how far?
>> And should there be a retry count limit, of what? For now retry up to
>> COSTLY_ORDER (as page_alloc.c does), stay safe with a cond_resched(),
>> and make sure not to do it if __GFP_NORETRY.
>
> The commit description mentions COSTLY_ORDER, but it's not actually
> used in the patch.
>
> -- Suleiman
>
Yeah, forgot to update the changelog =(
But much more importantly, are you still happy with those changes?
|
|
|
Re: [PATCH 11/11] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs [message #46973 is a reply to message #46967] |
Tue, 26 June 2012 08:45   |
David Rientjes
Messages: 59 Registered: November 2006
|
Member |
|
|
On Tue, 26 Jun 2012, Glauber Costa wrote:
> > > diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> > > index ccc1899..914ec07 100644
> > > --- a/include/linux/thread_info.h
> > > +++ b/include/linux/thread_info.h
> > > @@ -61,6 +61,12 @@ extern long do_no_restart_syscall(struct restart_block
> > > *parm);
> > > # define THREADINFO_GFP (GFP_KERNEL | __GFP_NOTRACK)
> > > #endif
> > >
> > > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> > > +# define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
> > > +#else
> > > +# define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP)
> > > +#endif
> > > +
> >
> > This type of requirement is going to become nasty very quickly if nobody
> > can use __GFP_KMEMCG without testing for CONFIG_CGROUP_MEM_RES_CTLR_KMEM.
> > Perhaps define __GFP_KMEMCG to be 0x0 if it's not enabled, similar to how
> > kmemcheck does?
> >
> That is what I've done in my first version of this patch. At that time,
> Christoph wanted it to be this way so we would make sure it would never be
> used with #CONFIG_CGROUP_MEM_RES_CTLR_KMEM defined. A value of zero will
> generate no errors. Undefined value will.
>
> Now, if you ask me, I personally prefer following what kmemcheck does here...
>
Right, because I'm sure that __GFP_KMEMCG will be used in additional
places outside of this patchset and it will be a shame if we have to
always add #ifdef's. I see no reason why we would care if __GFP_KMEMCG
was used when CONFIG_CGROUP_MEM_RES_CTLR_KMEM=n with the semantics that it
as in this patchset. It's much cleaner by making it 0x0 when disabled.
|
|
|
Re: [PATCH 11/11] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs [message #46974 is a reply to message #46973] |
Tue, 26 June 2012 08:44   |
Glauber Costa
Messages: 916 Registered: October 2011
|
Senior Member |
|
|
On 06/26/2012 12:45 PM, David Rientjes wrote:
> On Tue, 26 Jun 2012, Glauber Costa wrote:
>
>>>> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
>>>> index ccc1899..914ec07 100644
>>>> --- a/include/linux/thread_info.h
>>>> +++ b/include/linux/thread_info.h
>>>> @@ -61,6 +61,12 @@ extern long do_no_restart_syscall(struct restart_block
>>>> *parm);
>>>> # define THREADINFO_GFP (GFP_KERNEL | __GFP_NOTRACK)
>>>> #endif
>>>>
>>>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>>>> +# define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
>>>> +#else
>>>> +# define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP)
>>>> +#endif
>>>> +
>>>
>>> This type of requirement is going to become nasty very quickly if nobody
>>> can use __GFP_KMEMCG without testing for CONFIG_CGROUP_MEM_RES_CTLR_KMEM.
>>> Perhaps define __GFP_KMEMCG to be 0x0 if it's not enabled, similar to how
>>> kmemcheck does?
>>>
>> That is what I've done in my first version of this patch. At that time,
>> Christoph wanted it to be this way so we would make sure it would never be
>> used with #CONFIG_CGROUP_MEM_RES_CTLR_KMEM defined. A value of zero will
>> generate no errors. Undefined value will.
>>
>> Now, if you ask me, I personally prefer following what kmemcheck does here...
>>
>
> Right, because I'm sure that __GFP_KMEMCG will be used in additional
> places outside of this patchset and it will be a shame if we have to
> always add #ifdef's. I see no reason why we would care if __GFP_KMEMCG
> was used when CONFIG_CGROUP_MEM_RES_CTLR_KMEM=n with the semantics that it
> as in this patchset. It's much cleaner by making it 0x0 when disabled.
>
What I can do, instead, is to WARN_ON conditionally to the config option
in the page allocator, and make sure no one is actually passing the flag
in that case.
|
|
|
Re: [PATCH 02/11] memcg: Reclaim when more than one page needed. [message #46975 is a reply to message #46964] |
Tue, 26 June 2012 08:54   |
David Rientjes
Messages: 59 Registered: November 2006
|
Member |
|
|
On Tue, 26 Jun 2012, Glauber Costa wrote:
> > > + * retries
> > > + */
> > > +#define NR_PAGES_TO_RETRY 2
> > > +
> >
> > Should be 1 << PAGE_ALLOC_COSTLY_ORDER? Where does this number come from?
> > The changelog doesn't specify.
>
> Hocko complained about that, and I changed. Where the number comes from, is
> stated in the comments: it is a number small enough to have high changes of
> had been freed by the previous reclaim, and yet around the number of pages of
> a kernel allocation.
>
PAGE_ALLOC_COSTLY_ORDER _is_ the threshold used to determine where reclaim
and compaction is deemed to be too costly to continuously retry, I'm not
sure why this is any different?
And this is certainly not "around the number of pages of a kernel
allocation", that depends very heavily on the slab allocator being used;
slub very often uses order-2 and order-3 page allocations as the default
settings (it is capped at, you guessed it, PAGE_ALLOC_COSTLY_ORDER
internally by default) and can be significantly increased on the command
line.
> Of course there are allocations for nr_pages > 2. But 2 will already service
> the stack most of the time, and most of the slab caches.
>
Nope, have you checked the output of /sys/kernel/slab/.../order when
running slub? On my workstation 127 out of 316 caches have order-2 or
higher by default.
|
|
|
Re: [PATCH 03/11] memcg: change defines to an enum [message #46976 is a reply to message #46970] |
Tue, 26 June 2012 09:01   |
David Rientjes
Messages: 59 Registered: November 2006
|
Member |
|
|
On Tue, 26 Jun 2012, Glauber Costa wrote:
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 8e601e8..9352d40 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -387,9 +387,12 @@ enum charge_type {
> > > };
> > >
> > > /* for encoding cft->private value on file */
> > > -#define _MEM (0)
> > > -#define _MEMSWAP (1)
> > > -#define _OOM_TYPE (2)
> > > +enum res_type {
> > > + _MEM,
> > > + _MEMSWAP,
> > > + _OOM_TYPE,
> > > +};
> > > +
> > > #define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
> > > #define MEMFILE_TYPE(val) ((val) >> 16 & 0xffff)
> > > #define MEMFILE_ATTR(val) ((val) & 0xffff)
> >
> > Shouldn't everything that does MEMFILE_TYPE() now be using type
> > enum res_type rather than int?
> >
> If you mean the following three fields, no, since they are masks and
> operations.
>
No, I mean everything in mm/memcontrol.c that does
int type = MEMFILE_TYPE(...).
Why define a non-anonymous enum if you're not going to use its type?
Either use enum res_type in place of int or define the enum to be
anonymous.
It's actually quite effective since gcc will warn if you're using the
value of an enum type in your switch() statements later in this series and
one of the enum fields is missing (if you avoid using a "default" case
statement) if you pass -Wswitch, which is included in -Wall.
|
|
|
Re: [PATCH 05/11] Add a __GFP_KMEMCG flag [message #46977 is a reply to message #46962] |
Tue, 26 June 2012 09:03   |
David Rientjes
Messages: 59 Registered: November 2006
|
Member |
|
|
On Tue, 26 Jun 2012, Glauber Costa wrote:
> > > >This flag is used to indicate to the callees that this allocation will be
> > > >serviced to the kernel. It is not supposed to be passed by the callers
> > > >of kmem_cache_alloc, but rather by the cache core itself.
> > > >
> > Not sure what "serviced to the kernel" means, does this mean that the
> > memory will not be accounted for to the root memcg?
> >
> In this context, it means that is a kernel allocation, not a userspace one
> (but in process context, of course), *and* it is to be accounted a
> specific memcg.
>
Ah, that makes sense. I think it would help if this was included in the
changelog as well as a specifying that it is accounted to current's memcg
at the time of the allocation in a comment in the code.
|
|
|
|
Re: [PATCH 02/11] memcg: Reclaim when more than one page needed. [message #46979 is a reply to message #46975] |
Tue, 26 June 2012 09:08   |
Glauber Costa
Messages: 916 Registered: October 2011
|
Senior Member |
|
|
On 06/26/2012 12:54 PM, David Rientjes wrote:
> On Tue, 26 Jun 2012, Glauber Costa wrote:
>
>>>> + * retries
>>>> + */
>>>> +#define NR_PAGES_TO_RETRY 2
>>>> +
>>>
>>> Should be 1 << PAGE_ALLOC_COSTLY_ORDER? Where does this number come from?
>>> The changelog doesn't specify.
>>
>> Hocko complained about that, and I changed. Where the number comes from, is
>> stated in the comments: it is a number small enough to have high changes of
>> had been freed by the previous reclaim, and yet around the number of pages of
>> a kernel allocation.
>>
>
> PAGE_ALLOC_COSTLY_ORDER _is_ the threshold used to determine where reclaim
> and compaction is deemed to be too costly to continuously retry, I'm not
> sure why this is any different?
>
> And this is certainly not "around the number of pages of a kernel
> allocation", that depends very heavily on the slab allocator being used;
> slub very often uses order-2 and order-3 page allocations as the default
> settings (it is capped at, you guessed it, PAGE_ALLOC_COSTLY_ORDER
> internally by default) and can be significantly increased on the command
> line.
I am obviously okay with either.
Maybe Michal can comment on this?
>> Of course there are allocations for nr_pages > 2. But 2 will already service
>> the stack most of the time, and most of the slab caches.
>>
>
> Nope, have you checked the output of /sys/kernel/slab/.../order when
> running slub? On my workstation 127 out of 316 caches have order-2 or
> higher by default.
>
Well, this is still on the side of my argument, since this is still a
majority of them being low ordered. The code here does not necessarily
have to retry - if I understand it correctly - we just retry for very
small allocations because that is where our likelihood of succeeding is.
|
|
|
Re: [PATCH 06/11] memcg: kmem controller infrastructure [message #46980 is a reply to message #46925] |
Tue, 26 June 2012 09:12   |
David Rientjes
Messages: 59 Registered: November 2006
|
Member |
|
|
On Mon, 25 Jun 2012, Glauber Costa wrote:
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 83e7ba9..22479eb 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -21,6 +21,7 @@
> #define _LINUX_MEMCONTROL_H
> #include <linux/cgroup.h>
> #include <linux/vm_event_item.h>
> +#include <linux/hardirq.h>
>
> struct mem_cgroup;
> struct page_cgroup;
> @@ -409,6 +410,12 @@ struct sock;
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> void sock_update_memcg(struct sock *sk);
> void sock_release_memcg(struct sock *sk);
> +
> +#define mem_cgroup_kmem_on 1
> +bool __mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order);
> +void __mem_cgroup_commit_kmem_page(struct page *page, void *handle, int order);
> +void __mem_cgroup_free_kmem_page(struct page *page, int order);
> +#define is_kmem_tracked_alloc (gfp & __GFP_KMEMCG)
> #else
> static inline void sock_update_memcg(struct sock *sk)
> {
> @@ -416,6 +423,43 @@ static inline void sock_update_memcg(struct sock *sk)
> static inline void sock_release_memcg(struct sock *sk)
> {
> }
> +
> +#define mem_cgroup_kmem_on 0
> +#define __mem_cgroup_new_kmem_page(a, b, c) false
> +#define __mem_cgroup_free_kmem_page(a,b )
> +#define __mem_cgroup_commit_kmem_page(a, b, c)
> +#define is_kmem_tracked_alloc (false)
> #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
> +
> +static __always_inline
> +bool mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order)
> +{
> + if (!mem_cgroup_kmem_on)
> + return true;
> + if (!is_kmem_tracked_alloc)
> + return true;
> + if (!current->mm)
> + return true;
> + if (in_interrupt())
> + return true;
You can't test for current->mm in irq context, so you need to check for
in_interrupt() first. Also, what prevents __mem_cgroup_new_kmem_page()
from being called for a kthread that has called use_mm() before
unuse_mm()?
> + if (gfp & __GFP_NOFAIL)
> + return true;
> + return __mem_cgroup_new_kmem_page(gfp, handle, order);
> +}
> +
> +static __always_inline
> +void mem_cgroup_free_kmem_page(struct page *page, int order)
> +{
> + if (mem_cgroup_kmem_on)
> + __mem_cgroup_free_kmem_page(page, order);
> +}
> +
> +static __always_inline
> +void mem_cgroup_commit_kmem_page(struct page *page, struct mem_cgroup *handle,
> + int order)
> +{
> + if (mem_cgroup_kmem_on)
> + __mem_cgroup_commit_kmem_page(page, handle, order);
> +}
> #endif /* _LINUX_MEMCONTROL_H */
>
|
|
|
|
Re: [PATCH 06/11] memcg: kmem controller infrastructure [message #46982 is a reply to message #46980] |
Tue, 26 June 2012 09:17   |
Glauber Costa
Messages: 916 Registered: October 2011
|
Senior Member |
|
|
On 06/26/2012 01:12 PM, David Rientjes wrote:
> On Mon, 25 Jun 2012, Glauber Costa wrote:
>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 83e7ba9..22479eb 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -21,6 +21,7 @@
>> #define _LINUX_MEMCONTROL_H
>> #include <linux/cgroup.h>
>> #include <linux/vm_event_item.h>
>> +#include <linux/hardirq.h>
>>
>> struct mem_cgroup;
>> struct page_cgroup;
>> @@ -409,6 +410,12 @@ struct sock;
>> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> void sock_update_memcg(struct sock *sk);
>> void sock_release_memcg(struct sock *sk);
>> +
>> +#define mem_cgroup_kmem_on 1
>> +bool __mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order);
>> +void __mem_cgroup_commit_kmem_page(struct page *page, void *handle, int order);
>> +void __mem_cgroup_free_kmem_page(struct page *page, int order);
>> +#define is_kmem_tracked_alloc (gfp & __GFP_KMEMCG)
>> #else
>> static inline void sock_update_memcg(struct sock *sk)
>> {
>> @@ -416,6 +423,43 @@ static inline void sock_update_memcg(struct sock *sk)
>> static inline void sock_release_memcg(struct sock *sk)
>> {
>> }
>> +
>> +#define mem_cgroup_kmem_on 0
>> +#define __mem_cgroup_new_kmem_page(a, b, c) false
>> +#define __mem_cgroup_free_kmem_page(a,b )
>> +#define __mem_cgroup_commit_kmem_page(a, b, c)
>> +#define is_kmem_tracked_alloc (false)
>> #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
>> +
>> +static __always_inline
>> +bool mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order)
>> +{
>> + if (!mem_cgroup_kmem_on)
>> + return true;
>> + if (!is_kmem_tracked_alloc)
>> + return true;
>> + if (!current->mm)
>> + return true;
>> + if (in_interrupt())
>> + return true;
>
> You can't test for current->mm in irq context, so you need to check for
> in_interrupt() first.
>
Right, thanks.
> Also, what prevents __mem_cgroup_new_kmem_page()
> from being called for a kthread that has called use_mm() before
> unuse_mm()?
Nothing, but I also don't see how to prevent that.
At a first glance, it seems fair to me to say that if a kernel thread
uses the mm of a process, it poses as this process for any accounting
purpose.
|
|
|
Re: [PATCH 02/11] memcg: Reclaim when more than one page needed. [message #46983 is a reply to message #46981] |
Tue, 26 June 2012 09:23   |
Glauber Costa
Messages: 916 Registered: October 2011
|
Senior Member |
|
|
On 06/26/2012 01:17 PM, David Rientjes wrote:
> On Tue, 26 Jun 2012, Glauber Costa wrote:
>
>>> Nope, have you checked the output of /sys/kernel/slab/.../order when
>>> running slub? On my workstation 127 out of 316 caches have order-2 or
>>> higher by default.
>>>
>>
>> Well, this is still on the side of my argument, since this is still a majority
>> of them being low ordered.
>
> Ok, so what happens if I pass slub_min_order=2 on the command line? We
> never retry?
Well, indeed. The function has many other RETRY points, but I believe
we'd reach none of them without triggering oom first.
>> The code here does not necessarily have to retry -
>> if I understand it correctly - we just retry for very small allocations
>> because that is where our likelihood of succeeding is.
>>
>
> Well, the comment for NR_PAGES_TO_RETRY says
>
> /*
> * We need a number that is small enough to be likely to have been
> * reclaimed even under pressure, but not too big to trigger unnecessary
> * retries
> */
>
> and mmzone.h says
>
> /*
> * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
> * costly to service. That is between allocation orders which should
> * coalesce naturally under reasonable reclaim pressure and those which
> * will not.
> */
> #define PAGE_ALLOC_COSTLY_ORDER 3
>
> so I'm trying to reconcile which one is correct.
>
I am not myself against reverting back to costly order. The check we
have here, of course, is stricter than costly order, so there is nothing
to reconcile. Now if we really need a stricter check or not, is the
subject of discussion.
|
|
|
Re: [PATCH 11/11] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs [message #46987 is a reply to message #46933] |
Tue, 26 June 2012 12:48   |
Glauber Costa
Messages: 916 Registered: October 2011
|
Senior Member |
|
|
On 06/25/2012 10:38 PM, Tejun Heo wrote:
> On Mon, Jun 25, 2012 at 06:55:35PM +0200, Frederic Weisbecker wrote:
>> On 06/25/2012 04:15 PM, Glauber Costa wrote:
>>
>>> Because those architectures will draw their stacks directly from
>>> the page allocator, rather than the slab cache, we can directly
>>> pass __GFP_KMEMCG flag, and issue the corresponding free_pages.
>>>
>>> This code path is taken when the architecture doesn't define
>>> CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
>>> THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the
>>> remaining architectures fall in this category.
>>>
>>> This will guarantee that every stack page is accounted to the memcg
>>> the process currently lives on, and will have the allocations to fail
>>> if they go over limit.
>>>
>>> For the time being, I am defining a new variant of THREADINFO_GFP, not
>>> to mess with the other path. Once the slab is also tracked by memcg,
>>> we can get rid of that flag.
>>>
>>> Tested to successfully protect against :(){ :|:& };:
>>>
>>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>>> CC: Christoph Lameter <cl@linux.com>
>>> CC: Pekka Enberg <penberg@cs.helsinki.fi>
>>> CC: Michal Hocko <mhocko@suse.cz>
>>> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>> CC: Johannes Weiner <hannes@cmpxchg.org>
>>> CC: Suleiman Souhlal <suleiman@google.com>
>>
>>
>> Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
>
> Frederic, does this (with proper slab accounting added later) achieve
> what you wanted with the task counter?
>
A note: Frederic may confirm, but I think he doesn't even need
the slab accounting to follow to achieve that goal.
|
|
|
Re: [PATCH 11/11] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs [message #46988 is a reply to message #46987] |
Tue, 26 June 2012 13:38   |
Frederic Weisbecker
Messages: 25 Registered: April 2012
|
Junior Member |
|
|
On Tue, Jun 26, 2012 at 04:48:08PM +0400, Glauber Costa wrote:
> On 06/25/2012 10:38 PM, Tejun Heo wrote:
> >On Mon, Jun 25, 2012 at 06:55:35PM +0200, Frederic Weisbecker wrote:
> >>On 06/25/2012 04:15 PM, Glauber Costa wrote:
> >>
> >>>Because those architectures will draw their stacks directly from
> >>>the page allocator, rather than the slab cache, we can directly
> >>>pass __GFP_KMEMCG flag, and issue the corresponding free_pages.
> >>>
> >>>This code path is taken when the architecture doesn't define
> >>>CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
> >>>THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the
> >>>remaining architectures fall in this category.
> >>>
> >>>This will guarantee that every stack page is accounted to the memcg
> >>>the process currently lives on, and will have the allocations to fail
> >>>if they go over limit.
> >>>
> >>>For the time being, I am defining a new variant of THREADINFO_GFP, not
> >>>to mess with the other path. Once the slab is also tracked by memcg,
> >>>we can get rid of that flag.
> >>>
> >>>Tested to successfully protect against :(){ :|:& };:
> >>>
> >>>Signed-off-by: Glauber Costa <glommer@parallels.com>
> >>>CC: Christoph Lameter <cl@linux.com>
> >>>CC: Pekka Enberg <penberg@cs.helsinki.fi>
> >>>CC: Michal Hocko <mhocko@suse.cz>
> >>>CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >>>CC: Johannes Weiner <hannes@cmpxchg.org>
> >>>CC: Suleiman Souhlal <suleiman@google.com>
> >>
> >>
> >>Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
> >
> >Frederic, does this (with proper slab accounting added later) achieve
> >what you wanted with the task counter?
> >
>
> A note: Frederic may confirm, but I think he doesn't even need
> the slab accounting to follow to achieve that goal.
Limiting is enough. But that requires internal accounting.
|
|
|
Re: [PATCH 11/11] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs [message #46989 is a reply to message #46988] |
Tue, 26 June 2012 13:37   |
Glauber Costa
Messages: 916 Registered: October 2011
|
Senior Member |
|
|
On 06/26/2012 05:38 PM, Frederic Weisbecker wrote:
> On Tue, Jun 26, 2012 at 04:48:08PM +0400, Glauber Costa wrote:
>> On 06/25/2012 10:38 PM, Tejun Heo wrote:
>>> On Mon, Jun 25, 2012 at 06:55:35PM +0200, Frederic Weisbecker wrote:
>>>> On 06/25/2012 04:15 PM, Glauber Costa wrote:
>>>>
>>>>> Because those architectures will draw their stacks directly from
>>>>> the page allocator, rather than the slab cache, we can directly
>>>>> pass __GFP_KMEMCG flag, and issue the corresponding free_pages.
>>>>>
>>>>> This code path is taken when the architecture doesn't define
>>>>> CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
>>>>> THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the
>>>>> remaining architectures fall in this category.
>>>>>
>>>>> This will guarantee that every stack page is accounted to the memcg
>>>>> the process currently lives on, and will have the allocations to fail
>>>>> if they go over limit.
>>>>>
>>>>> For the time being, I am defining a new variant of THREADINFO_GFP, not
>>>>> to mess with the other path. Once the slab is also tracked by memcg,
>>>>> we can get rid of that flag.
>>>>>
>>>>> Tested to successfully protect against :(){ :|:& };:
>>>>>
>>>>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>>>>> CC: Christoph Lameter <cl@linux.com>
>>>>> CC: Pekka Enberg <penberg@cs.helsinki.fi>
>>>>> CC: Michal Hocko <mhocko@suse.cz>
>>>>> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>>>> CC: Johannes Weiner <hannes@cmpxchg.org>
>>>>> CC: Suleiman Souhlal <suleiman@google.com>
>>>>
>>>>
>>>> Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
>>>
>>> Frederic, does this (with proper slab accounting added later) achieve
>>> what you wanted with the task counter?
>>>
>>
>> A note: Frederic may confirm, but I think he doesn't even need
>> the slab accounting to follow to achieve that goal.
>
> Limiting is enough. But that requires internal accounting.
>
Yes, but why the *slab* needs to get involved?
accounting task stack pages should be equivalent to what you
were doing, even without slab accounting. Right ?
|
|
|
Re: [PATCH 11/11] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs [message #46990 is a reply to message #46989] |
Tue, 26 June 2012 13:44   |
Frederic Weisbecker
Messages: 25 Registered: April 2012
|
Junior Member |
|
|
On Tue, Jun 26, 2012 at 05:37:41PM +0400, Glauber Costa wrote:
> On 06/26/2012 05:38 PM, Frederic Weisbecker wrote:
> >On Tue, Jun 26, 2012 at 04:48:08PM +0400, Glauber Costa wrote:
> >>On 06/25/2012 10:38 PM, Tejun Heo wrote:
> >>>On Mon, Jun 25, 2012 at 06:55:35PM +0200, Frederic Weisbecker wrote:
> >>>>On 06/25/2012 04:15 PM, Glauber Costa wrote:
> >>>>
> >>>>>Because those architectures will draw their stacks directly from
> >>>>>the page allocator, rather than the slab cache, we can directly
> >>>>>pass __GFP_KMEMCG flag, and issue the corresponding free_pages.
> >>>>>
> >>>>>This code path is taken when the architecture doesn't define
> >>>>>CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
> >>>>>THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the
> >>>>>remaining architectures fall in this category.
> >>>>>
> >>>>>This will guarantee that every stack page is accounted to the memcg
> >>>>>the process currently lives on, and will have the allocations to fail
> >>>>>if they go over limit.
> >>>>>
> >>>>>For the time being, I am defining a new variant of THREADINFO_GFP, not
> >>>>>to mess with the other path. Once the slab is also tracked by memcg,
> >>>>>we can get rid of that flag.
> >>>>>
> >>>>>Tested to successfully protect against :(){ :|:& };:
> >>>>>
> >>>>>Signed-off-by: Glauber Costa <glommer@parallels.com>
> >>>>>CC: Christoph Lameter <cl@linux.com>
> >>>>>CC: Pekka Enberg <penberg@cs.helsinki.fi>
> >>>>>CC: Michal Hocko <mhocko@suse.cz>
> >>>>>CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >>>>>CC: Johannes Weiner <hannes@cmpxchg.org>
> >>>>>CC: Suleiman Souhlal <suleiman@google.com>
> >>>>
> >>>>
> >>>>Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
> >>>
> >>>Frederic, does this (with proper slab accounting added later) achieve
> >>>what you wanted with the task counter?
> >>>
> >>
> >>A note: Frederic may confirm, but I think he doesn't even need
> >>the slab accounting to follow to achieve that goal.
> >
> >Limiting is enough. But that requires internal accounting.
> >
> Yes, but why the *slab* needs to get involved?
> accounting task stack pages should be equivalent to what you
> were doing, even without slab accounting. Right ?
Yeah that alone should be fine.
|
|
|
Re: [PATCH 06/11] memcg: kmem controller infrastructure [message #46993 is a reply to message #46944] |
Tue, 26 June 2012 14:40   |
Glauber Costa
Messages: 916 Registered: October 2011
|
Senior Member |
|
|
On 06/26/2012 03:17 AM, Andrew Morton wrote:
>> }
>> >+
>> >+#define mem_cgroup_kmem_on 0
>> >+#define __mem_cgroup_new_kmem_page(a, b, c) false
>> >+#define __mem_cgroup_free_kmem_page(a,b )
>> >+#define __mem_cgroup_commit_kmem_page(a, b, c)
> I suggest that the naming consistently follow the model
> "mem_cgroup_kmem_foo". So "mem_cgroup_kmem_" becomes the well-known
> identifier for this subsystem.
>
> Then, s/mem_cgroup/memcg/g/ - show us some mercy here!
>
I always prefer shorter names, but mem_cgroup, and not memcg, seems to
be the default for external functions.
I am nothing but a follower =)
|
|
|
|
|
Re: [PATCH 06/11] memcg: kmem controller infrastructure [message #46997 is a reply to message #46994] |
Tue, 26 June 2012 18:01   |
akpm
Messages: 224 Registered: March 2007
|
Senior Member |
|
|
On Tue, 26 Jun 2012 19:01:15 +0400 Glauber Costa <glommer@parallels.com> wrote:
> On 06/26/2012 03:17 AM, Andrew Morton wrote:
> >> + memcg_uncharge_kmem(memcg, size);
> >> >+ mem_cgroup_put(memcg);
> >> >+}
> >> >+EXPORT_SYMBOL(__mem_cgroup_free_kmem_page);
> >> > #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
> >> >
> >> > #if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
> >> >@@ -5645,3 +5751,69 @@ static int __init enable_swap_account(char *s)
> >> > __setup("swapaccount=", enable_swap_account);
> >> >
> >> > #endif
> >> >+
> >> >+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> > gargh. CONFIG_MEMCG_KMEM, please!
> >
>
> Here too. I like it as much as you do.
>
> But that is consistent with the rest of the file, and I'd rather have
> it this way.
There's not much point in being consistent with something which is so
unpleasant. I'm on a little campaign to rename
CONFIG_CGROUP_MEM_RES_CTLR to CONFIG_MEMCG, only nobody has taken my
bait yet. Be first!
|
|
|
Re: [PATCH 06/11] memcg: kmem controller infrastructure [message #46998 is a reply to message #46997] |
Tue, 26 June 2012 18:08   |
Tejun Heo
Messages: 184 Registered: November 2006
|
Senior Member |
|
|
On Tue, Jun 26, 2012 at 11:01:42AM -0700, Andrew Morton wrote:
> On Tue, 26 Jun 2012 19:01:15 +0400 Glauber Costa <glommer@parallels.com> wrote:
>
> > On 06/26/2012 03:17 AM, Andrew Morton wrote:
> > >> + memcg_uncharge_kmem(memcg, size);
> > >> >+ mem_cgroup_put(memcg);
> > >> >+}
> > >> >+EXPORT_SYMBOL(__mem_cgroup_free_kmem_page);
> > >> > #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
> > >> >
> > >> > #if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
> > >> >@@ -5645,3 +5751,69 @@ static int __init enable_swap_account(char *s)
> > >> > __setup("swapaccount=", enable_swap_account);
> > >> >
> > >> > #endif
> > >> >+
> > >> >+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> > > gargh. CONFIG_MEMCG_KMEM, please!
> > >
> >
> > Here too. I like it as much as you do.
> >
> > But that is consistent with the rest of the file, and I'd rather have
> > it this way.
>
> There's not much point in being consistent with something which is so
> unpleasant. I'm on a little campaign to rename
> CONFIG_CGROUP_MEM_RES_CTLR to CONFIG_MEMCG, only nobody has taken my
> bait yet. Be first!
+1.
Block cgroup recently did blkio / blkiocg / blkio_cgroup -> blkcg.
Join the cool crowd! :P
--
tejun
|
|
|
Re: [PATCH 06/11] memcg: kmem controller infrastructure [message #46999 is a reply to message #46997] |
Tue, 26 June 2012 18:14   |
Glauber Costa
Messages: 916 Registered: October 2011
|
Senior Member |
|
|
On 06/26/2012 10:01 PM, Andrew Morton wrote:
> On Tue, 26 Jun 2012 19:01:15 +0400 Glauber Costa <glommer@parallels.com> wrote:
>
>> On 06/26/2012 03:17 AM, Andrew Morton wrote:
>>>> + memcg_uncharge_kmem(memcg, size);
>>>>> + mem_cgroup_put(memcg);
>>>>> +}
>>>>> +EXPORT_SYMBOL(__mem_cgroup_free_kmem_page);
>>>>> #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
>>>>>
>>>>> #if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
>>>>> @@ -5645,3 +5751,69 @@ static int __init enable_swap_account(char *s)
>>>>> __setup("swapaccount=", enable_swap_account);
>>>>>
>>>>> #endif
>>>>> +
>>>>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>>> gargh. CONFIG_MEMCG_KMEM, please!
>>>
>>
>> Here too. I like it as much as you do.
>>
>> But that is consistent with the rest of the file, and I'd rather have
>> it this way.
>
> There's not much point in being consistent with something which is so
> unpleasant. I'm on a little campaign to rename
> CONFIG_CGROUP_MEM_RES_CTLR to CONFIG_MEMCG, only nobody has taken my
> bait yet. Be first!
>
If you are okay with a preparation mechanical patch to convert the whole
file, I can change mine too.
But you'll be responsible for arguing with whoever stepping up opposing
this =p
|
|
|
Re: [PATCH 06/11] memcg: kmem controller infrastructure [message #47000 is a reply to message #46999] |
Tue, 26 June 2012 19:20   |
akpm
Messages: 224 Registered: March 2007
|
Senior Member |
|
|
On Tue, 26 Jun 2012 22:14:51 +0400
Glauber Costa <glommer@parallels.com> wrote:
> On 06/26/2012 10:01 PM, Andrew Morton wrote:
> > On Tue, 26 Jun 2012 19:01:15 +0400 Glauber Costa <glommer@parallels.com> wrote:
> >
> >> On 06/26/2012 03:17 AM, Andrew Morton wrote:
> >>>> + memcg_uncharge_kmem(memcg, size);
> >>>>> + mem_cgroup_put(memcg);
> >>>>> +}
> >>>>> +EXPORT_SYMBOL(__mem_cgroup_free_kmem_page);
> >>>>> #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
> >>>>>
> >>>>> #if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
> >>>>> @@ -5645,3 +5751,69 @@ static int __init enable_swap_account(char *s)
> >>>>> __setup("swapaccount=", enable_swap_account);
> >>>>>
> >>>>> #endif
> >>>>> +
> >>>>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> >>> gargh. CONFIG_MEMCG_KMEM, please!
> >>>
> >>
> >> Here too. I like it as much as you do.
> >>
> >> But that is consistent with the rest of the file, and I'd rather have
> >> it this way.
> >
> > There's not much point in being consistent with something which is so
> > unpleasant. I'm on a little campaign to rename
> > CONFIG_CGROUP_MEM_RES_CTLR to CONFIG_MEMCG, only nobody has taken my
> > bait yet. Be first!
> >
>
> If you are okay with a preparation mechanical patch to convert the whole
> file, I can change mine too.
>
> But you'll be responsible for arguing with whoever stepping up opposing
> this =p
>
From: Andrew Morton <akpm@linux-foundation.org>
Subject: memcg: rename config variables
Sanity:
CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM
Cc: Glauber Costa <glommer@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/cgroup_subsys.h | 2 +-
include/linux/memcontrol.h | 14 +++++++-------
include/linux/mmzone.h | 8 ++++----
include/linux/page_cgroup.h | 10 +++++-----
include/linux/sched.h | 2 +-
init/Kconfig | 14 +++++++-------
kernel/fork.c | 2 +-
mm/hwpoison-inject.c | 2 +-
mm/memcontrol.c | 20 ++++++++++----------
mm/memory-failure.c | 2 +-
mm/mmzone.c | 2 +-
mm/oom_kill.c | 2 +-
mm/page_cgroup.c | 2 +-
mm/vmscan.c | 4 ++--
14 files changed, 43 insertions(+), 43 deletions(-)
diff -puN kernel/fork.c~a kernel/fork.c
--- a/kernel/fork.c~a
+++ a/kernel/fork.c
@@ -1302,7 +1302,7 @@ static struct task_struct *copy_process(
#ifdef CONFIG_DEBUG_MUTEXES
p->blocked_on = NULL; /* not blocked yet */
#endif
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_MEMCG
p->memcg_batch.do_batch = 0;
p->memcg_batch.memcg = NULL;
#endif
diff -puN mm/hwpoison-inject.c~a mm/hwpoison-inject.c
--- a/mm/hwpoison-inject.c~a
+++ a/mm/hwpoison-inject.c
@@ -123,7 +123,7 @@ static int pfn_inject_init(void)
if (!dentry)
goto fail;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+#ifdef CONFIG_MEMCG_SWAP
dentry = debugfs_create_u64("corrupt-filter-memcg", 0600,
hwpoison_dir, &hwpoison_filter_memcg);
if (!dentry)
diff -puN mm/memcontrol.c~a mm/memcontrol.c
--- a/mm/memcontrol.c~a
+++ a/mm/memcontrol.c
@@ -61,12 +61,12 @@ struct cgroup_subsys mem_cgroup_subsys _
#define MEM_CGROUP_RECLAIM_RETRIES 5
static struct mem_cgroup *root_mem_cgroup __read_mostly;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+#ifdef CONFIG_MEMCG_SWAP
/* Turned on only when memory cgroup is enabled && really_do_swap_account = 1 */
int do_swap_account __read_mostly;
/* for remember boot option*/
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED
+#ifdef CONFIG_MEMCG_SWAP_ENABLED
static int really_do_swap_account __initdata = 1;
#else
static int really_do_swap_account __initdata = 0;
@@ -407,7 +407,7 @@ static void mem_cgroup_get(struct mem_cg
static void mem_cgroup_put(struct mem_cgroup *memcg);
/* Writing them here to avoid exposing memcg's inner layout */
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+#ifdef CONFIG_MEMCG_KMEM
#include <net/sock.h>
#include <net/ip.h>
@@ -466,9 +466,9 @@ struct cg_proto *tcp_proto_cgroup(struct
}
EXPORT_SYMBOL(tcp_proto_cgroup);
#endif /* CONFIG_INET */
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
+#endif /* CONFIG_MEMCG_KMEM */
-#if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
+#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
static void disarm_sock_keys(struct mem_cgroup *memcg)
{
if (!memcg_proto_activated(&memcg->tcp_mem.cg_proto))
@@ -3085,7 +3085,7 @@ mem_cgroup_uncharge_swapcache(struct pag
}
#endif
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+#ifdef CONFIG_MEMCG_SWAP
/*
* called from swap_entry_free(). remove record in swap_cgroup and
* uncharge "memsw" account.
@@ -4518,7 +4518,7 @@ static int mem_cgroup_oom_control_write(
return 0;
}
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+#ifdef CONFIG_MEMCG_KMEM
static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
return mem_cgroup_sockets_init(memcg, ss);
@@ -4608,7 +4608,7 @@ static struct cftype mem_cgroup_files[]
.read_seq_string = mem_control_numa_stat_show,
},
#endif
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+#ifdef CONFIG_MEMCG_SWAP
{
.name = "memsw.usage_in_bytes",
.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
@@ -4795,7 +4795,7 @@ struct mem_cgroup *parent_mem_cgroup(str
}
EXPORT_SYMBOL(parent_mem_cgroup);
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+#ifdef CONFIG_MEMCG_SWAP
static void __init enable_swap_cgroup(void)
{
if (!mem_cgroup_disabled() && really_do_swap_account)
@@ -5526,7 +5526,7 @@ struct cgroup_subsys mem_cgroup_subsys =
.__DEPRECATED_clear_css_refs = true,
};
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+#ifdef CONFIG_MEMCG_SWAP
static int __init enable_swap_account(char *s)
{
/* consider enabled if no parameter or 1 is given */
diff -puN mm/memory-failure.c~a mm/memory-failure.c
--- a/mm/memory-failure.c~a
+++ a/mm/memory-failure.c
@@ -128,7 +128,7 @@ static int hwpoison_filter_flags(struct
* can only guarantee that the page either belongs to the memcg tasks, or is
* a freed page.
*/
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+#ifdef CONFIG_MEMCG_SWAP
u64 hwpoison_filter_memcg;
EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
static int hwpoison_filter_task(struct page *p)
diff -puN mm/mmzone.c~a mm/mmzone.c
--- a/mm/mmzone.c~a
+++ a/mm/mmzone.c
@@ -96,7 +96,7 @@ void lruvec_init(struct lruvec *lruvec,
for_each_lru(lru)
INIT_LIST_HEAD(&lruvec->lists[lru]);
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_MEMCG
lruvec->zone = zone;
#endif
}
diff -puN mm/oom_kill.c~a mm/oom_kill.c
--- a/mm/oom_kill.c~a
+++ a/mm/oom_kill.c
@@ -541,7 +541,7 @@ static void check_panic_on_oom(enum oom_
sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
}
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_MEMCG
void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
int order)
{
diff -puN mm/page_cgroup.c~a mm/page_cgroup.c
--- a/mm/page_cgroup.c~a
+++ a/mm/page_cgroup.c
@@ -317,7 +317,7 @@ void __meminit pgdat_page_cgroup_init(st
#endif
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+#ifdef CONFIG_MEMCG_SWAP
static DEFINE_MUTEX(swap_cgroup_mutex);
struct swap_cgroup_ctrl {
diff -puN mm/vmscan.c~a mm/vmscan.c
--- a/mm/vmscan.c~a
+++ a/mm/vmscan.c
@@ -133,7 +133,7 @@ long vm_total_pages; /* The total number
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_MEMCG
static bool global_reclaim(struct scan_control *sc)
{
return !sc->target_mem_cgroup;
@@ -2152,7 +2152,7 @@ unsigned long try_to_free_pages(struct z
return nr_reclaimed;
}
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_MEMCG
unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
gfp_t gfp_mask, bool noswap,
diff -puN init/Kconfig~a init/Kconfig
--- a/init/Kconfig~a
+++ a/init/Kconfig
@@ -686,7 +686,7 @@ config RESOURCE_COUNTERS
This option enables controller independent resource accounting
infrastructure that works with cgroups.
-config CGROUP_MEM_RES_CTLR
+config MEMCG
bool "Memory Resource Controller for Control Groups"
depends on RESOURCE_COUNTERS
select MM_OWNER
@@ -709,9 +709,9 @@ config CGROUP_MEM_RES_CTLR
This config option also selects MM_OWNER config option, which
could in turn add some fork/exit overhead.
-config CGROUP_MEM_RES_CTLR_SWAP
+config MEMCG_SWAP
bool "Memory Resource Controller Swap Extension"
- depends on CGROUP_MEM_RES_CTLR && SWAP
+ depends on MEMCG && SWAP
help
Add swap management feature to memory resource controller. When you
enable this, you can limit mem+swap usage per cgroup. In other words,
@@ -726,9 +726,9 @@ config CGROUP_MEM_RES_CTLR_SWAP
if boot option "swapac
...
|
|
|
Re: [PATCH 00/11] kmem controller for memcg: stripped down version [message #47001 is a reply to message #46965] |
Tue, 26 June 2012 21:55   |
akpm
Messages: 224 Registered: March 2007
|
Senior Member |
|
|
On Tue, 26 Jun 2012 11:17:49 +0400
Glauber Costa <glommer@parallels.com> wrote:
> On 06/26/2012 03:27 AM, Andrew Morton wrote:
> > On Mon, 25 Jun 2012 18:15:17 +0400
> > Glauber Costa <glommer@parallels.com> wrote:
> >
> >> What I am proposing with this series is a stripped down version of the
> >> kmem controller for memcg that would allow us to merge significant parts
> >> of the infrastructure, while leaving out, for now, the polemic bits about
> >> the slab while it is being reworked by Cristoph.
> >>
> >> Me reasoning for that is that after the last change to introduce a gfp
> >> flag to mark kernel allocations, it became clear to me that tracking other
> >> resources like the stack would then follow extremely naturaly. I figured
> >> that at some point we'd have to solve the issue pointed by David, and avoid
> >> testing the Slab flag in the page allocator, since it would soon be made
> >> more generic. I do that by having the callers to explicit mark it.
> >>
> >> So to demonstrate how it would work, I am introducing a stack tracker here,
> >> that is already a functionality per-se: it successfully stops fork bombs to
> >> happen. (Sorry for doing all your work, Frederic =p ). Note that after all
> >> memcg infrastructure is deployed, it becomes very easy to track anything.
> >> The last patch of this series is extremely simple.
> >>
> >> The infrastructure is exactly the same we had in memcg, but stripped down
> >> of the slab parts. And because what we have after those patches is a feature
> >> per-se, I think it could be considered for merging.
> >
> > hm. None of this new code makes the kernel smaller, faster, easier to
> > understand or more fun to read!
> Not sure if this is a general comment - in case I agree - or if targeted
> to my statement that this is "stripped down". If so, it is of course
> smaller relative to my previous slab accounting patches.
It's a general comment. The patch adds overhead: runtime costs and
maintenance costs. Do its benefits justify that cost?
> The infrastructure is largely common, but I realized that a future user,
> tracking the stack, would be a lot simpler and could be done first.
>
> > Presumably we're getting some benefit for all the downside. When the
> > time is appropriate, please do put some time into explaining that
> > benefit, so that others can agree that it is a worthwhile tradeoff.
> >
>
> Well, for one thing, we stop fork bombs for processes inside cgroups.
"inside cgroups" is a significant limitation! Is this capability
important enough to justify adding the new code? That's unobvious
(to me).
Are there any other user-facing things which we can do with this
feature? Present, or planned?
> I can't speak for everybody here, but AFAIK, tracking the stack through
> the memory it used, therefore using my proposed kmem controller, was an
> idea that good quite a bit of traction with the memcg/memory people.
> So here you have something that people already asked a lot for, in a
> shape and interface that seem to be acceptable.
mm, maybe. Kernel developers tend to look at code from the point of
view "does it work as designed", "is it clean", "is it efficient", "do
I understand it", etc. We often forget to step back and really
consider whether or not it should be merged at all.
I mean, unless the code is an explicit simplification, we should have
a very strong bias towards "don't merge".
|
|
|
Re: [PATCH 00/11] kmem controller for memcg: stripped down version [message #47006 is a reply to message #47001] |
Wed, 27 June 2012 01:08   |
David Rientjes
Messages: 59 Registered: November 2006
|
Member |
|
|
On Tue, 26 Jun 2012, Andrew Morton wrote:
> mm, maybe. Kernel developers tend to look at code from the point of
> view "does it work as designed", "is it clean", "is it efficient", "do
> I understand it", etc. We often forget to step back and really
> consider whether or not it should be merged at all.
>
It's appropriate for true memory isolation so that applications cannot
cause an excess of slab to be consumed. This allows other applications to
have higher reservations without the risk of incurring a global oom
condition as the result of the usage of other memcgs.
I'm not sure whether it would ever be appropriate to limit the amount of
slab for an individual slab cache, however, instead of limiting the sum of
all slab for a set of processes. With cache merging in slub this would
seem to be difficult to do correctly.
|
|
|
Re: [PATCH 06/11] memcg: kmem controller infrastructure [message #47008 is a reply to message #46982] |
Wed, 27 June 2012 04:01   |
David Rientjes
Messages: 59 Registered: November 2006
|
Member |
|
|
On Tue, 26 Jun 2012, Glauber Costa wrote:
> > > @@ -416,6 +423,43 @@ static inline void sock_update_memcg(struct sock *sk)
> > > static inline void sock_release_memcg(struct sock *sk)
> > > {
> > > }
> > > +
> > > +#define mem_cgroup_kmem_on 0
> > > +#define __mem_cgroup_new_kmem_page(a, b, c) false
> > > +#define __mem_cgroup_free_kmem_page(a,b )
> > > +#define __mem_cgroup_commit_kmem_page(a, b, c)
> > > +#define is_kmem_tracked_alloc (false)
> > > #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
> > > +
> > > +static __always_inline
> > > +bool mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order)
> > > +{
> > > + if (!mem_cgroup_kmem_on)
> > > + return true;
> > > + if (!is_kmem_tracked_alloc)
> > > + return true;
> > > + if (!current->mm)
> > > + return true;
> > > + if (in_interrupt())
> > > + return true;
> >
> > You can't test for current->mm in irq context, so you need to check for
> > in_interrupt() first.
> >
> Right, thanks.
>
> > Also, what prevents __mem_cgroup_new_kmem_page()
> > from being called for a kthread that has called use_mm() before
> > unuse_mm()?
>
> Nothing, but I also don't see how to prevent that.
You can test for current->flags & PF_KTHREAD following the check for
in_interrupt() and return true, it's what you were trying to do with the
check for !current->mm.
|
|
|
Re: [PATCH 00/11] kmem controller for memcg: stripped down version [message #47010 is a reply to message #47006] |
Wed, 27 June 2012 08:39   |
Glauber Costa
Messages: 916 Registered: October 2011
|
Senior Member |
|
|
On 06/27/2012 05:08 AM, David Rientjes wrote:
> On Tue, 26 Jun 2012, Andrew Morton wrote:
>
>> mm, maybe. Kernel developers tend to look at code from the point of
>> view "does it work as designed", "is it clean", "is it efficient", "do
>> I understand it", etc. We often forget to step back and really
>> consider whether or not it should be merged at all.
>>
>
> It's appropriate for true memory isolation so that applications cannot
> cause an excess of slab to be consumed. This allows other applications to
> have higher reservations without the risk of incurring a global oom
> condition as the result of the usage of other memcgs.
Just a note for Andrew, we we're in the same page: The slab cache
limitation is not included in *this* particular series. The goal was
always to have other kernel resources limited as well, and the general
argument from David holds: we want a set of applications to run truly
independently from others, without creating memory pressure on the
global system.
The way history develop in this series, I started from the slab cache,
and a page-level tracking appeared on that series. I then figured it
would be better to start tracking something that is totally page-based,
such as the stack - that already accounts for 70 % of the
infrastructure, and then merge the slab code later. In this sense, it
was just a strategy inversion. But both are, and were, in the goals.
> I'm not sure whether it would ever be appropriate to limit the amount of
> slab for an individual slab cache, however, instead of limiting the sum of
> all slab for a set of processes. With cache merging in slub this would
> seem to be difficult to do correctly.
Yes, I do agree.
|
|
|
Fork bomb limitation in memcg WAS: Re: [PATCH 00/11] kmem controller for memcg: stripped down versio [message #47012 is a reply to message #47001] |
Wed, 27 June 2012 09:29   |
Glauber Costa
Messages: 916 Registered: October 2011
|
Senior Member |
|
|
On 06/27/2012 01:55 AM, Andrew Morton wrote:
> On Tue, 26 Jun 2012 11:17:49 +0400
> Glauber Costa <glommer@parallels.com> wrote:
>
>> On 06/26/2012 03:27 AM, Andrew Morton wrote:
>>> On Mon, 25 Jun 2012 18:15:17 +0400
>>> Glauber Costa <glommer@parallels.com> wrote:
>>>
>>>> What I am proposing with this series is a stripped down version of the
>>>> kmem controller for memcg that would allow us to merge significant parts
>>>> of the infrastructure, while leaving out, for now, the polemic bits about
>>>> the slab while it is being reworked by Cristoph.
>>>>
>>>> Me reasoning for that is that after the last change to introduce a gfp
>>>> flag to mark kernel allocations, it became clear to me that tracking other
>>>> resources like the stack would then follow extremely naturaly. I figured
>>>> that at some point we'd have to solve the issue pointed by David, and avoid
>>>> testing the Slab flag in the page allocator, since it would soon be made
>>>> more generic. I do that by having the callers to explicit mark it.
>>>>
>>>> So to demonstrate how it would work, I am introducing a stack tracker here,
>>>> that is already a functionality per-se: it successfully stops fork bombs to
>>>> happen. (Sorry for doing all your work, Frederic =p ). Note that after all
>>>> memcg infrastructure is deployed, it becomes very easy to track anything.
>>>> The last patch of this series is extremely simple.
>>>>
>>>> The infrastructure is exactly the same we had in memcg, but stripped down
>>>> of the slab parts. And because what we have after those patches is a feature
>>>> per-se, I think it could be considered for merging.
>>>
>>> hm. None of this new code makes the kernel smaller, faster, easier to
>>> understand or more fun to read!
>> Not sure if this is a general comment - in case I agree - or if targeted
>> to my statement that this is "stripped down". If so, it is of course
>> smaller relative to my previous slab accounting patches.
>
> It's a general comment. The patch adds overhead: runtime costs and
> maintenance costs. Do its benefits justify that cost?
Despite potential disagreement on the following statement from my peer
and friends,
I am not crazy. This means I of course believe so =)
I will lay down my views below, but I believe the general justification
was given already by people commenting in the task counter subsystem,
and I invite them (newly CCd) to chime in here if they want.
For whoever is arriving in the thread now, this is about limiting the
amount of kernel memory used by a set of processes, aka cgroup.
In this particular state - more is to follow - we're able to limit
fork bombs in a clean way, since overlimit processes will be unable
to allocate their stack.
>> The infrastructure is largely common, but I realized that a future user,
>> tracking the stack, would be a lot simpler and could be done first.
>>
>>> Presumably we're getting some benefit for all the downside. When the
>>> time is appropriate, please do put some time into explaining that
>>> benefit, so that others can agree that it is a worthwhile tradeoff.
>>>
>>
>> Well, for one thing, we stop fork bombs for processes inside cgroups.
>
> "inside cgroups" is a significant limitation! Is this capability
> important enough to justify adding the new code? That's unobvious
> (to me).
Again, for one thing. The general mechanism here is to limit the amount
of kernel memory used by a particular set of processes. Processes use
stack, and if they can't grab more pages for the stack, no new
processes can be created.
But there are other things the general mechanism protects against.
Using too much of pinned dentry and inode cache, by touching files
an leaving them in memory forever.
In fact, a simple:
while true; do mkdir x; cd x; done
can halt your system easily, because the file system limits are hard
to reach (big disks), but the kernel memory is not.
Those are examples, but the list certainly don't stop here.
Our particular use case is concerned with people offering hosting
services. In a physical box, we can put a limit to some resources,
like total number of processes or threads. But in an environment
where each independent user gets its own piece of the machine, we
don't want a potentially malicious user to destroy good users' services
This might be true for systemd as well, that now groups services inside
cgroups. They generally want to put forward a set of guarantees that
limits the running service in a variety of ways, so that if they become
badly behaved, they won't interfere with the rest of the system.
fork bombs are a way bad behaved processes interfere with the rest of
the system. In here, I propose fork bomb stopping as a natural
consequence of the fact that the amount of kernel memory can be limited,
and each process uses 1 or 2 pages for the stack, that are freed when
the process goes away.
The limitation "inside cgroups" is not as important as you seem to state.
Everything can be put "inside cgroups". Being inside cgroups is just a
way to state "I want this feature". And in this case: I want to pay the
price.
Yes, because as you said yourself, of course there is a cost for that.
But I am extensively using static branches to make sure that even if
the feature is compiled in, even if we have a bunch of memory cgroups
deployed (that already pay a price for that), this code will only be
enabled after the first user of this service configures any limit.
The impact before that, is as low as it can be.
After that is enabled, processes living outside of limited cgroups will
perform an inline test for a flag, and continue normal page allocation
with only that price paid: a flag test.
> Are there any other user-facing things which we can do with this
> feature? Present, or planned?
>
Yes. We can establish all sorts of boundaries to a group of processes.
The dentry / icache bombing limitation is an example of that as well.
But that will need the slab part of this patchset to follow. I not only
have the code for that, but I posted it for review many times. That is
not progressing fully so far because we're still looking for ways to
keep the changes in the allocators to a minimum. Until I realized that
the infrastructure could be used as-is, and stripped down from its
original slab version to account pages, its proposed second user. And
this is what you see here now.
So everything that uses resources in the kernel can proceed until it
uses too much. Be it processes, files, or anything that consumes kernel
memory - and may potentially leave it there forever.
>> I can't speak for everybody here, but AFAIK, tracking the stack through
>> the memory it used, therefore using my proposed kmem controller, was an
>> idea that good quite a bit of traction with the memcg/memory people.
>> So here you have something that people already asked a lot for, in a
>> shape and interface that seem to be acceptable.
>
> mm, maybe. Kernel developers tend to look at code from the point of
> view "does it work as designed", "is it clean", "is it efficient", "do
> I understand it", etc. We often forget to step back and really
> consider whether or not it should be merged at all.
>
> I mean, unless the code is an explicit simplification, we should have
> a very strong bias towards "don't merge".
Well, simplifications are welcome - this series itself was simplified
beyond what I thought initially possible through the valuable comments
of other people.
But of course, this adds more complexity to the kernel as a whole. And
this is true to every single new feature we may add, now or in the
future.
What I can tell you about this particular one, is that the justification
for it doesn't come out of nowhere, but from a rather real use case that
we support and maintain in OpenVZ and our line of products for years.
It can potentially achieve more than that by being well used by schemes
such as systemd that aim at limiting the extent through which a service
can damage a system.
I hope all the above justification was enough to clarify any points you
may have. I'll be happy to go further if necessary.
|
|
|
|
Goto Forum:
Current Time: Fri Oct 24 17:22:37 GMT 2025
Total time taken to generate the page: 0.10868 seconds
|