| Home » Mailing lists » Devel » [PATCH v2 00/11] Request for Inclusion: kmem controller for memcg. Goto Forum:
	|  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47549 is a reply to message #47511] | Wed, 15 August 2012 18:01   |  
			| 
				
				
					|  Ying Han Messages: 8
 Registered: February 2012
 | Junior Member |  |  |  
	| On Wed, Aug 15, 2012 at 5:39 AM, Michal Hocko <mhocko@suse.cz> wrote: > On Wed 15-08-12 13:33:55, Glauber Costa wrote:
 > [...]
 >> > This can
 >> > be quite confusing.  I am still not sure whether we should mix the two
 >> > things together. If somebody wants to limit the kernel memory he has to
 >> > touch the other limit anyway.  Do you have a strong reason to mix the
 >> > user and kernel counters?
 >>
 >> This is funny, because the first opposition I found to this work was
 >> "Why would anyone want to limit it separately?" =p
 >>
 >> It seems that a quite common use case is to have a container with a
 >> unified view of "memory" that it can use the way he likes, be it with
 >> kernel memory, or user memory. I believe those people would be happy to
 >> just silently account kernel memory to user memory, or at the most have
 >> a switch to enable it.
 >>
 >> What gets clear from this back and forth, is that there are people
 >> interested in both use cases.
 >
 > I am still not 100% sure myself. It is just clear that the reclaim would
 > need some work in order to do accounting like this.
 >
 >> > My impression was that kernel allocation should simply fail while user
 >> > allocations might reclaim as well. Why should we reclaim just because of
 >> > the kernel allocation (which is unreclaimable from hard limit reclaim
 >> > point of view)?
 >>
 >> That is not what the kernel does, in general. We assume that if he wants
 >> that memory and we can serve it, we should. Also, not all kernel memory
 >> is unreclaimable. We can shrink the slabs, for instance. Ying Han
 >> claims she has patches for that already...
 >
 > Are those patches somewhere around?
 
 Yes, I am working on it to post it sometime *this week*. My last
 rebase is based on v3.3 and now I am trying to get it rebased to
 github-memcg. The patch itself has a functional dependency on kernel
 slab accounting, and I am trying to get that rebased on Glauber's tree
 but has some difficulty now. What I am planning to do is post the RFC
 w/ only complied version by far.
 
 The patch handles dentry cache shrinker only at this moment. That is
 what we discussed last time as well, where dentry contributes most of
 the reclaimable objects. (it pins inode, so we leave inode behind)
 
 
 --Ying
 >
 > [...]
 >> > This doesn't check for the hierachy so kmem_accounted might not be in
 >> > sync with it's parents. mem_cgroup_create (below) needs to copy
 >> > kmem_accounted down from the parent and the above needs to check if this
 >> > is a similar dance like mem_cgroup_oom_control_write.
 >> >
 >>
 >> I don't see why we have to.
 >>
 >> I believe in a A/B/C hierarchy, C should be perfectly able to set a
 >> different limit than its parents. Note that this is not a boolean.
 >
 > Ohh, I wasn't clear enough. I am not against setting the _limit_ I just
 > meant that the kmem_accounted should be consistent within the hierarchy.
 >
 > --
 > Michal Hocko
 > SUSE Labs
 >
 > --
 > To unsubscribe, send a message with 'unsubscribe linux-mm' in
 > the body to majordomo@kvack.org.  For more info on Linux MM,
 > see: http://www.linux-mm.org/ .
 > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47550 is a reply to message #47549] | Wed, 15 August 2012 18:00   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/15/2012 10:01 PM, Ying Han wrote: > On Wed, Aug 15, 2012 at 5:39 AM, Michal Hocko <mhocko@suse.cz> wrote:
 >> On Wed 15-08-12 13:33:55, Glauber Costa wrote:
 >> [...]
 >>>> This can
 >>>> be quite confusing.  I am still not sure whether we should mix the two
 >>>> things together. If somebody wants to limit the kernel memory he has to
 >>>> touch the other limit anyway.  Do you have a strong reason to mix the
 >>>> user and kernel counters?
 >>>
 >>> This is funny, because the first opposition I found to this work was
 >>> "Why would anyone want to limit it separately?" =p
 >>>
 >>> It seems that a quite common use case is to have a container with a
 >>> unified view of "memory" that it can use the way he likes, be it with
 >>> kernel memory, or user memory. I believe those people would be happy to
 >>> just silently account kernel memory to user memory, or at the most have
 >>> a switch to enable it.
 >>>
 >>> What gets clear from this back and forth, is that there are people
 >>> interested in both use cases.
 >>
 >> I am still not 100% sure myself. It is just clear that the reclaim would
 >> need some work in order to do accounting like this.
 >>
 >>>> My impression was that kernel allocation should simply fail while user
 >>>> allocations might reclaim as well. Why should we reclaim just because of
 >>>> the kernel allocation (which is unreclaimable from hard limit reclaim
 >>>> point of view)?
 >>>
 >>> That is not what the kernel does, in general. We assume that if he wants
 >>> that memory and we can serve it, we should. Also, not all kernel memory
 >>> is unreclaimable. We can shrink the slabs, for instance. Ying Han
 >>> claims she has patches for that already...
 >>
 >> Are those patches somewhere around?
 >
 > Yes, I am working on it to post it sometime *this week*. My last
 > rebase is based on v3.3 and now I am trying to get it rebased to
 > github-memcg. The patch itself has a functional dependency on kernel
 > slab accounting, and I am trying to get that rebased on Glauber's tree
 > but has some difficulty now. What I am planning to do is post the RFC
 > w/ only complied version by far.
 
 That would be great, so we can start looking at its design, at least.
 
 > The patch handles dentry cache shrinker only at this moment. That is
 > what we discussed last time as well, where dentry contributes most of
 > the reclaimable objects. (it pins inode, so we leave inode behind)
 >
 
 This will mark the inodes as reclaimable, but will leave them in memory.
 If we are assuming memory pressure, it would be good to shrink them too.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47551 is a reply to message #47530] | Wed, 15 August 2012 18:07   |  
			| 
				
				
					|  Ying Han Messages: 8
 Registered: February 2012
 | Junior Member |  |  |  
	| On Wed, Aug 15, 2012 at 8:11 AM, Glauber Costa <glommer@parallels.com> wrote: > On 08/15/2012 06:47 PM, Christoph Lameter wrote:
 >> On Wed, 15 Aug 2012, Michal Hocko wrote:
 >>
 >>>> That is not what the kernel does, in general. We assume that if he wants
 >>>> that memory and we can serve it, we should. Also, not all kernel memory
 >>>> is unreclaimable. We can shrink the slabs, for instance. Ying Han
 >>>> claims she has patches for that already...
 >>>
 >>> Are those patches somewhere around?
 >>
 >> You can already shrink the reclaimable slabs (dentries / inodes) via
 >> calls to the subsystem specific shrinkers. Did Ying Han do anything to
 >> go beyond that?
 >>
 > That is not enough for us.
 > We would like to make sure that the objects being discarded belong to
 > the memcg which is under pressure. We don't need to be perfect here, and
 > an occasional slip is totally fine. But if in general, shrinking from
 > memcg A will mostly wipe out objects from memcg B, we harmed the system
 > in return for nothing good.
 
 Correct. For example, we have per-superblock shrinker today for vfs
 caches. That is not enough since we need to isolate the dentry caches
 per-memcg basis.
 
 --Ying
 >
 > --
 > To unsubscribe, send a message with 'unsubscribe linux-mm' in
 > the body to majordomo@kvack.org.  For more info on Linux MM,
 > see: http://www.linux-mm.org/ .
 > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47552 is a reply to message #47532] | Wed, 15 August 2012 18:11   |  
			| 
				
				
					|  Ying Han Messages: 8
 Registered: February 2012
 | Junior Member |  |  |  
	| On Wed, Aug 15, 2012 at 8:34 AM, Christoph Lameter <cl@linux.com> wrote: > On Wed, 15 Aug 2012, Glauber Costa wrote:
 >
 >> On 08/15/2012 06:47 PM, Christoph Lameter wrote:
 >> > On Wed, 15 Aug 2012, Michal Hocko wrote:
 >> >
 >> >>> That is not what the kernel does, in general. We assume that if he wants
 >> >>> that memory and we can serve it, we should. Also, not all kernel memory
 >> >>> is unreclaimable. We can shrink the slabs, for instance. Ying Han
 >> >>> claims she has patches for that already...
 >> >>
 >> >> Are those patches somewhere around?
 >> >
 >> > You can already shrink the reclaimable slabs (dentries / inodes) via
 >> > calls to the subsystem specific shrinkers. Did Ying Han do anything to
 >> > go beyond that?
 >> >
 >> That is not enough for us.
 >> We would like to make sure that the objects being discarded belong to
 >> the memcg which is under pressure. We don't need to be perfect here, and
 >> an occasional slip is totally fine. But if in general, shrinking from
 >> memcg A will mostly wipe out objects from memcg B, we harmed the system
 >> in return for nothing good.
 >
 > How can you figure out which objects belong to which memcg? The ownerships
 > of dentries and inodes is a dubious concept already.
 
 I figured it out based on the kernel slab accounting.
 obj->page->kmem_cache->memcg
 
 --Ying
 >
 > --
 > To unsubscribe, send a message with 'unsubscribe linux-mm' in
 > the body to majordomo@kvack.org.  For more info on Linux MM,
 > see: http://www.linux-mm.org/ .
 > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
 |  
	|  |  |  
	|  |  
	|  |  
	| 
		
			| Re: [PATCH v2 06/11] memcg: kmem controller infrastructure [message #47555 is a reply to message #47546] | Wed, 15 August 2012 19:31   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/15/2012 09:12 PM, Greg Thelen wrote: > On Wed, Aug 15 2012, Glauber Costa wrote:
 >
 >> On 08/15/2012 08:38 PM, Greg Thelen wrote:
 >>> On Wed, Aug 15 2012, Glauber Costa wrote:
 >>>
 >>>> On 08/14/2012 10:58 PM, Greg Thelen wrote:
 >>>>> On Mon, Aug 13 2012, Glauber Costa wrote:
 >>>>>
 >>>>>>>>> +	WARN_ON(mem_cgroup_is_root(memcg));
 >>>>>>>>> +	size = (1 << order) << PAGE_SHIFT;
 >>>>>>>>> +	memcg_uncharge_kmem(memcg, size);
 >>>>>>>>> +	mem_cgroup_put(memcg);
 >>>>>>> Why do we need ref-counting here ? kmem res_counter cannot work as
 >>>>>>> reference ?
 >>>>>> This is of course the pair of the mem_cgroup_get() you commented on
 >>>>>> earlier. If we need one, we need the other. If we don't need one, we
 >>>>>> don't need the other =)
 >>>>>>
 >>>>>> The guarantee we're trying to give here is that the memcg structure will
 >>>>>> stay around while there are dangling charges to kmem, that we decided
 >>>>>> not to move (remember: moving it for the stack is simple, for the slab
 >>>>>> is very complicated and ill-defined, and I believe it is better to treat
 >>>>>> all kmem equally here)
 >>>>>
 >>>>> By keeping memcg structures hanging around until the last referring kmem
 >>>>> page is uncharged do such zombie memcg each consume a css_id and thus
 >>>>> put pressure on the 64k css_id space?  I imagine in pathological cases
 >>>>> this would prevent creation of new cgroups until these zombies are
 >>>>> dereferenced.
 >>>>
 >>>> Yes, but although this patch makes it more likely, it doesn't introduce
 >>>> that. If the tasks, for instance, grab a reference to the cgroup dentry
 >>>> in the filesystem (like their CWD, etc), they will also keep the cgroup
 >>>> around.
 >>>
 >>> Fair point.  But this doesn't seems like a feature.  It's probably not
 >>> needed initially, but what do you think about creating a
 >>> memcg_kernel_context structure which is allocated when memcg is
 >>> allocated?  Kernel pages charged to a memcg would have
 >>> page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
 >>> would allow the mem_cgroup and its css_id to be deleted when the cgroup
 >>> is unlinked from cgroupfs while allowing for the active kernel pages to
 >>> continue pointing to a valid memcg_kernel_context.  This would be a
 >>> reference counted structure much like you are doing with memcg.  When a
 >>> memcg is deleted the memcg_kernel_context would be linked into its
 >>> surviving parent memcg.  This would avoid needing to visit each kernel
 >>> page.
 >>
 >> You need more, you need at the res_counters to stay around as well. And
 >> probably other fields.
 >
 > I am not sure the res_counters would need to stay around.  Once a
 > memcg_kernel_context has been reparented, then any future kernel page
 > uncharge calls will uncharge the parent res_counter.
 
 Well, if you hold the memcg due to a reference, like in the dentry case,
 then fine. But if this is a dangling charge, as will be the case with
 the slab, then you have to uncharge it.
 
 An arbitrary number of parents might have been deleted as well, so you
 need to transverse them all until you reach a live parent to uncharge from.
 
 To do that, your counters have to be still alive.
 
 >
 >> So my fear here is that as you add fields to that structure, you can
 >> defeat a bit the goal of reducing memory consumption. Still leaves the
 >> css space, yes. But by doing this we can introduce some subtle bugs by
 >> having a field in the wrong structure.
 >>
 >> Did you observe that to be a big problem in your systems?
 >
 > No I have not seen this yet.  But our past solutions have reparented
 > kmem_cache's to root memcg so we have been avoiding zombie memcg.  My
 > concerns with your approach are just a suspicion because we have been
 > experimenting with accounting of even more kernel memory (e.g. vmalloc,
 > kernel stacks, page tables).  As the scope of such accounting grows the
 > chance of long lived charged pages grows and thus the chance of zombies
 > which exhaust the css_id space grows.
 
 Well, since we agree this can all be done under the hood, I'd say let's
 wait until a problem actually exists, since the solution is likely to be
 a bit convoluted...
 
 I personally believe that if won't have a lot of task movement, most of
 the data will go away as the cgroup dies. The remainder shouldn't be too
 much to hold it in memory for a lot of time. This is of course assuming
 a real use case, not an adversarial scenario, which is quite easy to
 come up with: just create a task, hold a bunch of kmem, move the task
 away, delete the cgroup, etc.
 
 That said, nothing stops us to actively try to create a scenario that
 would demonstrate such a problem.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47557 is a reply to message #47498] | Wed, 15 August 2012 19:50   |  
			| 
				
				
					|  Ying Han Messages: 8
 Registered: February 2012
 | Junior Member |  |  |  
	| On Tue, Aug 14, 2012 at 9:21 AM, Michal Hocko <mhocko@suse.cz> wrote: > On Thu 09-08-12 17:01:12, Glauber Costa wrote:
 >> This patch adds the basic infrastructure for the accounting of the slab
 >> caches. To control that, the following files are created:
 >>
 >>  * memory.kmem.usage_in_bytes
 >>  * memory.kmem.limit_in_bytes
 >>  * memory.kmem.failcnt
 >>  * memory.kmem.max_usage_in_bytes
 >>
 >> They have the same meaning of their user memory counterparts. They
 >> reflect the state of the "kmem" res_counter.
 >>
 >> The code is not enabled until a limit is set. This can be tested by the
 >> flag "kmem_accounted". This means that after the patch is applied, no
 >> behavioral changes exists for whoever is still using memcg to control
 >> their memory usage.
 >>
 >> We always account to both user and kernel resource_counters. This
 >> effectively means that an independent kernel limit is in place when the
 >> limit is set to a lower value than the user memory. A equal or higher
 >> value means that the user limit will always hit first, meaning that kmem
 >> is effectively unlimited.
 >
 > Well, it contributes to the user limit so it is not unlimited. It just
 > falls under a different limit and it tends to contribute less. This can
 > be quite confusing.  I am still not sure whether we should mix the two
 > things together. If somebody wants to limit the kernel memory he has to
 > touch the other limit anyway.  Do you have a strong reason to mix the
 > user and kernel counters?
 
 The reason to mix the two together is a compromise of the two use
 cases we've heard by far. In google, we only need one limit which
 limits u & k, and the reclaim kicks in when the total usage hits the
 limit.
 
 > My impression was that kernel allocation should simply fail while user
 > allocations might reclaim as well. Why should we reclaim just because of
 > the kernel allocation (which is unreclaimable from hard limit reclaim
 > point of view)?
 
 Some of kernel objects are reclaimable if we have per-memcg shrinker.
 
 > I also think that the whole thing would get much simpler if those two
 > are split. Anyway if this is really a must then this should be
 > documented here.
 
 What would be the use case you have in your end?
 
 --Ying
 
 > One nit bellow.
 >
 >> People who want to track kernel memory but not limit it, can set this
 >> limit to a very high number (like RESOURCE_MAX - 1page - that no one
 >> will ever hit, or equal to the user memory)
 >>
 >> Signed-off-by: Glauber Costa <glommer@parallels.com>
 >> CC: Michal Hocko <mhocko@suse.cz>
 >> CC: Johannes Weiner <hannes@cmpxchg.org>
 >> Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 >> ---
 >>  mm/memcontrol.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 >>  1 file changed, 68 insertions(+), 1 deletion(-)
 >>
 >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 >> index b0e29f4..54e93de 100644
 >> --- a/mm/memcontrol.c
 >> +++ b/mm/memcontrol.c
 > [...]
 >> @@ -4046,8 +4059,23 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 >>                       break;
 >>               if (type == _MEM)
 >>                       ret = mem_cgroup_resize_limit(memcg, val);
 >> -             else
 >> +             else if (type == _MEMSWAP)
 >>                       ret = mem_cgroup_resize_memsw_limit(memcg, val);
 >> +             else if (type == _KMEM) {
 >> +                     ret = res_counter_set_limit(&memcg->kmem, val);
 >> +                     if (ret)
 >> +                             break;
 >> +                     /*
 >> +                      * Once enabled, can't be disabled. We could in theory
 >> +                      * disable it if we haven't yet created any caches, or
 >> +                      * if we can shrink them all to death.
 >> +                      *
 >> +                      * But it is not worth the trouble
 >> +                      */
 >> +                     if (!memcg->kmem_accounted && val != RESOURCE_MAX)
 >> +                             memcg->kmem_accounted = true;
 >> +             } else
 >> +                     return -EINVAL;
 >>               break;
 >
 > This doesn't check for the hierachy so kmem_accounted might not be in
 > sync with it's parents. mem_cgroup_create (below) needs to copy
 > kmem_accounted down from the parent and the above needs to check if this
 > is a similar dance like mem_cgroup_oom_control_write.
 >
 > [...]
 >
 >> @@ -5033,6 +5098,7 @@ mem_cgroup_create(struct cgroup *cont)
 >>       if (parent && parent->use_hierarchy) {
 >>               res_counter_init(&memcg->res, &parent->res);
 >>               res_counter_init(&memcg->memsw, &parent->memsw);
 >> +             res_counter_init(&memcg->kmem, &parent->kmem);
 >>               /*
 >>                * We increment refcnt of the parent to ensure that we can
 >>                * safely access it on res_counter_charge/uncharge.
 >> @@ -5043,6 +5109,7 @@ mem_cgroup_create(struct cgroup *cont)
 >>       } else {
 >>               res_counter_init(&memcg->res, NULL);
 >>               res_counter_init(&memcg->memsw, NULL);
 >> +             res_counter_init(&memcg->kmem, NULL);
 >>       }
 >>       memcg->last_scanned_node = MAX_NUMNODES;
 >>       INIT_LIST_HEAD(&memcg->oom_notify);
 >> --
 >> 1.7.11.2
 >>
 >> --
 >> To unsubscribe from this list: send the line "unsubscribe cgroups" in
 >> the body of a message to majordomo@vger.kernel.org
 >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
 >
 > --
 > Michal Hocko
 > SUSE Labs
 >
 > --
 > To unsubscribe, send a message with 'unsubscribe linux-mm' in
 > the body to majordomo@kvack.org.  For more info on Linux MM,
 > see: http://www.linux-mm.org/ .
 > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 06/11] memcg: kmem controller infrastructure [message #47561 is a reply to message #47555] | Thu, 16 August 2012 03:37   |  
			| 
				
				
					|  Greg Thelen Messages: 18
 Registered: February 2011
 | Junior Member |  |  |  
	| On Wed, Aug 15 2012, Glauber Costa wrote: 
 > On 08/15/2012 09:12 PM, Greg Thelen wrote:
 >> On Wed, Aug 15 2012, Glauber Costa wrote:
 >>
 >>> On 08/15/2012 08:38 PM, Greg Thelen wrote:
 >>>> On Wed, Aug 15 2012, Glauber Costa wrote:
 >>>>
 >>>>> On 08/14/2012 10:58 PM, Greg Thelen wrote:
 >>>>>> On Mon, Aug 13 2012, Glauber Costa wrote:
 >>>>>>
 >>>>>>>>>> +	WARN_ON(mem_cgroup_is_root(memcg));
 >>>>>>>>>> +	size = (1 << order) << PAGE_SHIFT;
 >>>>>>>>>> +	memcg_uncharge_kmem(memcg, size);
 >>>>>>>>>> +	mem_cgroup_put(memcg);
 >>>>>>>> Why do we need ref-counting here ? kmem res_counter cannot work as
 >>>>>>>> reference ?
 >>>>>>> This is of course the pair of the mem_cgroup_get() you commented on
 >>>>>>> earlier. If we need one, we need the other. If we don't need one, we
 >>>>>>> don't need the other =)
 >>>>>>>
 >>>>>>> The guarantee we're trying to give here is that the memcg structure will
 >>>>>>> stay around while there are dangling charges to kmem, that we decided
 >>>>>>> not to move (remember: moving it for the stack is simple, for the slab
 >>>>>>> is very complicated and ill-defined, and I believe it is better to treat
 >>>>>>> all kmem equally here)
 >>>>>>
 >>>>>> By keeping memcg structures hanging around until the last referring kmem
 >>>>>> page is uncharged do such zombie memcg each consume a css_id and thus
 >>>>>> put pressure on the 64k css_id space?  I imagine in pathological cases
 >>>>>> this would prevent creation of new cgroups until these zombies are
 >>>>>> dereferenced.
 >>>>>
 >>>>> Yes, but although this patch makes it more likely, it doesn't introduce
 >>>>> that. If the tasks, for instance, grab a reference to the cgroup dentry
 >>>>> in the filesystem (like their CWD, etc), they will also keep the cgroup
 >>>>> around.
 >>>>
 >>>> Fair point.  But this doesn't seems like a feature.  It's probably not
 >>>> needed initially, but what do you think about creating a
 >>>> memcg_kernel_context structure which is allocated when memcg is
 >>>> allocated?  Kernel pages charged to a memcg would have
 >>>> page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
 >>>> would allow the mem_cgroup and its css_id to be deleted when the cgroup
 >>>> is unlinked from cgroupfs while allowing for the active kernel pages to
 >>>> continue pointing to a valid memcg_kernel_context.  This would be a
 >>>> reference counted structure much like you are doing with memcg.  When a
 >>>> memcg is deleted the memcg_kernel_context would be linked into its
 >>>> surviving parent memcg.  This would avoid needing to visit each kernel
 >>>> page.
 >>>
 >>> You need more, you need at the res_counters to stay around as well. And
 >>> probably other fields.
 >>
 >> I am not sure the res_counters would need to stay around.  Once a
 >> memcg_kernel_context has been reparented, then any future kernel page
 >> uncharge calls will uncharge the parent res_counter.
 >
 > Well, if you hold the memcg due to a reference, like in the dentry case,
 > then fine. But if this is a dangling charge, as will be the case with
 > the slab, then you have to uncharge it.
 >
 > An arbitrary number of parents might have been deleted as well, so you
 > need to transverse them all until you reach a live parent to uncharge from.
 
 I was thinking that each time a memcg is deleted move the
 memcg_kernel_context from the victim memcg to its parent.  When moving,
 also update the context to refer to the parent and link context to
 parent:
 for_each_kernel_context(kernel_context, memcg) {
 kernel_context->memcg = memcg->parent;
 list_add(&kernel_context->list, &memcg->parent->kernel_contexts);
 }
 
 Whenever pages referring to a memcg_kernel_context are uncharged they
 will uncharge the nearest surviving parent memcg.
 
 > To do that, your counters have to be still alive.
 
 The counters of nearest surviving parent will be alive and pointed to by
 memcg_kernel_context->memcg.
 
 >>> So my fear here is that as you add fields to that structure, you can
 >>> defeat a bit the goal of reducing memory consumption. Still leaves the
 >>> css space, yes. But by doing this we can introduce some subtle bugs by
 >>> having a field in the wrong structure.
 >>>
 >>> Did you observe that to be a big problem in your systems?
 >>
 >> No I have not seen this yet.  But our past solutions have reparented
 >> kmem_cache's to root memcg so we have been avoiding zombie memcg.  My
 >> concerns with your approach are just a suspicion because we have been
 >> experimenting with accounting of even more kernel memory (e.g. vmalloc,
 >> kernel stacks, page tables).  As the scope of such accounting grows the
 >> chance of long lived charged pages grows and thus the chance of zombies
 >> which exhaust the css_id space grows.
 >
 > Well, since we agree this can all be done under the hood, I'd say let's
 > wait until a problem actually exists, since the solution is likely to be
 > a bit convoluted...
 >
 > I personally believe that if won't have a lot of task movement, most of
 > the data will go away as the cgroup dies. The remainder shouldn't be too
 > much to hold it in memory for a lot of time. This is of course assuming
 > a real use case, not an adversarial scenario, which is quite easy to
 > come up with: just create a task, hold a bunch of kmem, move the task
 > away, delete the cgroup, etc.
 >
 > That said, nothing stops us to actively try to create a scenario that
 > would demonstrate such a problem.
 
 With our in-house per-memcg slab accounting (similar to what's discussed
 here), we're seeing a few slab allocations (mostly radix_tree_node) that
 survive a long time after memcg deletion.  This isn't meant as criticism
 of this patch series, just an fyi that I expect there will be scenarios
 where some dead kmem caches will live for a long time.  Though I think
 that in your patches a dead kmem cache does not hold reference to the
 memcg.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 06/11] memcg: kmem controller infrastructure [message #47562 is a reply to message #47561] | Thu, 16 August 2012 07:47   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/16/2012 07:37 AM, Greg Thelen wrote: > On Wed, Aug 15 2012, Glauber Costa wrote:
 >
 >> On 08/15/2012 09:12 PM, Greg Thelen wrote:
 >>> On Wed, Aug 15 2012, Glauber Costa wrote:
 >>>
 >>>> On 08/15/2012 08:38 PM, Greg Thelen wrote:
 >>>>> On Wed, Aug 15 2012, Glauber Costa wrote:
 >>>>>
 >>>>>> On 08/14/2012 10:58 PM, Greg Thelen wrote:
 >>>>>>> On Mon, Aug 13 2012, Glauber Costa wrote:
 >>>>>>>
 >>>>>>>>>>> +	WARN_ON(mem_cgroup_is_root(memcg));
 >>>>>>>>>>> +	size = (1 << order) << PAGE_SHIFT;
 >>>>>>>>>>> +	memcg_uncharge_kmem(memcg, size);
 >>>>>>>>>>> +	mem_cgroup_put(memcg);
 >>>>>>>>> Why do we need ref-counting here ? kmem res_counter cannot work as
 >>>>>>>>> reference ?
 >>>>>>>> This is of course the pair of the mem_cgroup_get() you commented on
 >>>>>>>> earlier. If we need one, we need the other. If we don't need one, we
 >>>>>>>> don't need the other =)
 >>>>>>>>
 >>>>>>>> The guarantee we're trying to give here is that the memcg structure will
 >>>>>>>> stay around while there are dangling charges to kmem, that we decided
 >>>>>>>> not to move (remember: moving it for the stack is simple, for the slab
 >>>>>>>> is very complicated and ill-defined, and I believe it is better to treat
 >>>>>>>> all kmem equally here)
 >>>>>>>
 >>>>>>> By keeping memcg structures hanging around until the last referring kmem
 >>>>>>> page is uncharged do such zombie memcg each consume a css_id and thus
 >>>>>>> put pressure on the 64k css_id space?  I imagine in pathological cases
 >>>>>>> this would prevent creation of new cgroups until these zombies are
 >>>>>>> dereferenced.
 >>>>>>
 >>>>>> Yes, but although this patch makes it more likely, it doesn't introduce
 >>>>>> that. If the tasks, for instance, grab a reference to the cgroup dentry
 >>>>>> in the filesystem (like their CWD, etc), they will also keep the cgroup
 >>>>>> around.
 >>>>>
 >>>>> Fair point.  But this doesn't seems like a feature.  It's probably not
 >>>>> needed initially, but what do you think about creating a
 >>>>> memcg_kernel_context structure which is allocated when memcg is
 >>>>> allocated?  Kernel pages charged to a memcg would have
 >>>>> page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
 >>>>> would allow the mem_cgroup and its css_id to be deleted when the cgroup
 >>>>> is unlinked from cgroupfs while allowing for the active kernel pages to
 >>>>> continue pointing to a valid memcg_kernel_context.  This would be a
 >>>>> reference counted structure much like you are doing with memcg.  When a
 >>>>> memcg is deleted the memcg_kernel_context would be linked into its
 >>>>> surviving parent memcg.  This would avoid needing to visit each kernel
 >>>>> page.
 >>>>
 >>>> You need more, you need at the res_counters to stay around as well. And
 >>>> probably other fields.
 >>>
 >>> I am not sure the res_counters would need to stay around.  Once a
 >>> memcg_kernel_context has been reparented, then any future kernel page
 >>> uncharge calls will uncharge the parent res_counter.
 >>
 >> Well, if you hold the memcg due to a reference, like in the dentry case,
 >> then fine. But if this is a dangling charge, as will be the case with
 >> the slab, then you have to uncharge it.
 >>
 >> An arbitrary number of parents might have been deleted as well, so you
 >> need to transverse them all until you reach a live parent to uncharge from.
 >
 > I was thinking that each time a memcg is deleted move the
 > memcg_kernel_context from the victim memcg to its parent.  When moving,
 > also update the context to refer to the parent and link context to
 > parent:
 >   for_each_kernel_context(kernel_context, memcg) {
 >     kernel_context->memcg = memcg->parent;
 >     list_add(&kernel_context->list, &memcg->parent->kernel_contexts);
 >   }
 >
 > Whenever pages referring to a memcg_kernel_context are uncharged they
 > will uncharge the nearest surviving parent memcg.
 >
 >> To do that, your counters have to be still alive.
 >
 > The counters of nearest surviving parent will be alive and pointed to by
 > memcg_kernel_context->memcg.
 >
 >>>> So my fear here is that as you add fields to that structure, you can
 >>>> defeat a bit the goal of reducing memory consumption. Still leaves the
 >>>> css space, yes. But by doing this we can introduce some subtle bugs by
 >>>> having a field in the wrong structure.
 >>>>
 >>>> Did you observe that to be a big problem in your systems?
 >>>
 >>> No I have not seen this yet.  But our past solutions have reparented
 >>> kmem_cache's to root memcg so we have been avoiding zombie memcg.  My
 >>> concerns with your approach are just a suspicion because we have been
 >>> experimenting with accounting of even more kernel memory (e.g. vmalloc,
 >>> kernel stacks, page tables).  As the scope of such accounting grows the
 >>> chance of long lived charged pages grows and thus the chance of zombies
 >>> which exhaust the css_id space grows.
 >>
 
 Can't we just free the css_id, and convention that it should not be used
 after mem_cgroup_destroy()? The memory will still stay around,
 sure, but at least the pressure on the css_id space goes away.
 
 I am testing a patch that does precisely that here, and will let you
 know of the results. But if you were willing to have a smaller structure
 just to serve as a zombie, any approach that works for it would have to
 assume the css_id was already freed, so I don't anticipate huge problems.
 
 >> Well, since we agree this can all be done under the hood, I'd say let's
 >> wait until a problem actually exists, since the solution is likely to be
 >> a bit convoluted...
 >>
 >> I personally believe that if won't have a lot of task movement, most of
 >> the data will go away as the cgroup dies. The remainder shouldn't be too
 >> much to hold it in memory for a lot of time. This is of course assuming
 >> a real use case, not an adversarial scenario, which is quite easy to
 >> come up with: just create a task, hold a bunch of kmem, move the task
 >> away, delete the cgroup, etc.
 >>
 >> That said, nothing stops us to actively try to create a scenario that
 >> would demonstrate such a problem.
 >
 > With our in-house per-memcg slab accounting (similar to what's discussed
 > here), we're seeing a few slab allocations (mostly radix_tree_node) that
 > survive a long time after memcg deletion.  This isn't meant as criticism
 > of this patch series, just an fyi that I expect there will be scenarios
 > where some dead kmem caches will live for a long time.  Though I think
 > that in your patches a dead kmem cache does not hold reference to the
 > memcg.
 >
 
 Does shrinking help?
 
 One of the things I was thinking about doing when we have proper
 per-memcg shrinking, is to shrink all caches when destroying the memcg.
 
 Because the memcg is dead, we'll have no more memcg pressure, and those
 will go away only when global pressure comes to play. Which means that
 the references will then be around for a very long time. What is the
 best behavior is debatable, but at least at first, I'd stand by the side
 of getting rid of everything the memcg created as much as possible.
 
 Also, if you are concerned with memory usage due to the memcg structure,
 bear in mind that the caches metadata may be considerably more...
 |  
	|  |  |  
	|  |  
	|  |  
	|  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47572 is a reply to message #47557] | Thu, 16 August 2012 15:25   |  
			| 
				
				
					|  Michal Hocko Messages: 109
 Registered: December 2011
 | Senior Member |  |  |  
	| On Wed 15-08-12 12:50:55, Ying Han wrote: > On Tue, Aug 14, 2012 at 9:21 AM, Michal Hocko <mhocko@suse.cz> wrote:
 > > On Thu 09-08-12 17:01:12, Glauber Costa wrote:
 > >> This patch adds the basic infrastructure for the accounting of the slab
 > >> caches. To control that, the following files are created:
 > >>
 > >>  * memory.kmem.usage_in_bytes
 > >>  * memory.kmem.limit_in_bytes
 > >>  * memory.kmem.failcnt
 > >>  * memory.kmem.max_usage_in_bytes
 > >>
 > >> They have the same meaning of their user memory counterparts. They
 > >> reflect the state of the "kmem" res_counter.
 > >>
 > >> The code is not enabled until a limit is set. This can be tested by the
 > >> flag "kmem_accounted". This means that after the patch is applied, no
 > >> behavioral changes exists for whoever is still using memcg to control
 > >> their memory usage.
 > >>
 > >> We always account to both user and kernel resource_counters. This
 > >> effectively means that an independent kernel limit is in place when the
 > >> limit is set to a lower value than the user memory. A equal or higher
 > >> value means that the user limit will always hit first, meaning that kmem
 > >> is effectively unlimited.
 > >
 > > Well, it contributes to the user limit so it is not unlimited. It just
 > > falls under a different limit and it tends to contribute less. This can
 > > be quite confusing.  I am still not sure whether we should mix the two
 > > things together. If somebody wants to limit the kernel memory he has to
 > > touch the other limit anyway.  Do you have a strong reason to mix the
 > > user and kernel counters?
 >
 > The reason to mix the two together is a compromise of the two use
 > cases we've heard by far. In google, we only need one limit which
 > limits u & k, and the reclaim kicks in when the total usage hits the
 > limit.
 >
 > > My impression was that kernel allocation should simply fail while user
 > > allocations might reclaim as well. Why should we reclaim just because of
 > > the kernel allocation (which is unreclaimable from hard limit reclaim
 > > point of view)?
 >
 > Some of kernel objects are reclaimable if we have per-memcg shrinker.
 
 Agreed and I think we need that before this is merged as I state in
 other email.
 
 > > I also think that the whole thing would get much simpler if those two
 > > are split. Anyway if this is really a must then this should be
 > > documented here.
 >
 > What would be the use case you have in your end?
 
 I do not have any specific unfortunately but I would like to prevent us
 from closing other possible. I realize this sounds hand wavy and that is
 why I do not want to block this work but I think we should give it some
 time before this gets merged.
 
 > --Ying
 --
 Michal Hocko
 SUSE Labs
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 06/11] memcg: kmem controller infrastructure [message #47573 is a reply to message #47569] | Thu, 16 August 2012 15:22   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/16/2012 07:05 PM, Michal Hocko wrote: > On Thu 16-08-12 13:57:07, Glauber Costa wrote:
 >> On 08/16/2012 01:53 PM, Michal Hocko wrote:
 >>> On Wed 15-08-12 18:27:45, Glauber Costa wrote:
 >>>>
 >>>>>>
 >>>>>> I see now, you seem to be right.
 >>>>>
 >>>>> No I am not because it seems that I am really blind these days...
 >>>>> We were doing this in mem_cgroup_do_charge for ages:
 >>>>> 	if (!(gfp_mask & __GFP_WAIT))
 >>>>>                 return CHARGE_WOULDBLOCK;
 >>>>>
 >>>>> /me goes to hide and get with further feedback with a clean head.
 >>>>>
 >>>>> Sorry about that.
 >>>>>
 >>>> I am as well, since I went to look at mem_cgroup_do_charge() and missed
 >>>> that.
 >>>
 >>> I thought we are not doing atomic allocations in user pages accounting
 >>> but I was obviously wrong because at least shmem uses atomic
 >>> allocations for ages.
 >>>
 >>>> Do you have any other concerns specific to this patch ?
 >>>
 >>> I understood you changed also handle thingy. So the patch should be
 >>> correct.
 >>> Do you plan to send an updated version?
 >>>
 >> That depends more on you than on me! =)
 >>
 >> Do you still have any concerns regarding the u+k charging as it stands
 >> now? That would be the last big concern I heard during this iteration.
 >
 > Well, I am still not 100% sure because I still see technical
 > difficulties that are not addressed by the patchset (memcg-oom, memcg
 > slab shrinking, possibly others). More importantly this is changing the
 > current semantic of the limit so we should better be careful about it
 > and check that we are not making the code tight to specific workloads
 > without a way out.
 >
 > On the other hand I do not want to block the progress here without
 > having _really_ good arguments against that couldn't be handled later
 > (and it seems that some of my concerns are work in progress already).
 >
 > I have to admit I like several things about the patchset. Especially the
 > way how it enables easy-to-setup (aka don't care about kmem details just
 > make sure you can cap the thing) as well as "I know exactly what I want
 > to do" usecases.
 > It is also good nice that only users of the feature are affected by
 > potential issues.
 >
 > So I think it is worth a broader attention which could produce other use
 > cases which could show potential drawbacks from the u+k semantic but I
 > would be still very careful about merging it to the Linus tree and only
 > merge it after at least the memcg reclaim path is slab aware. Living in
 > the -mm tree should help us with the testing converage.
 >
 > Does it sounds reasonable?
 >
 What I really want is to have it in an "official" tree so it starts
 getting used and tested without me having to rebase at every single change.
 
 If Andrew is okay merging this into -mm, it is fine for me.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 06/11] memcg: kmem controller infrastructure [message #47583 is a reply to message #47471] | Fri, 17 August 2012 02:36   |  
			| 
				
				
					|  KAMEZAWA Hiroyuki Messages: 463
 Registered: September 2006
 | Senior Member |  |  |  
	| (2012/08/13 17:28), Glauber Costa wrote: >>>> + * Needs to be called after memcg_kmem_new_page, regardless of success or
 >>>> + * failure of the allocation. if @page is NULL, this function will revert the
 >>>> + * charges. Otherwise, it will commit the memcg given by @handle to the
 >>>> + * corresponding page_cgroup.
 >>>> + */
 >>>> +static __always_inline void
 >>>> +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
 >>>> +{
 >>>> +	if (memcg_kmem_on)
 >>>> +		__memcg_kmem_commit_page(page, handle, order);
 >>>> +}
 >> Doesn't this 2 functions has no short-cuts ?
 >
 > Sorry kame, what exactly do you mean?
 >
 I meant avoinding function call. But please ignore, I missed following patches.
 
 
 >> if (memcg_kmem_on && handle) ?
 > I guess this can be done to avoid a function call.
 >
 >> Maybe free() needs to access page_cgroup...
 >>
 > Can you also be a bit more specific here?
 >
 
 Please ignore, I misunderstood the usage of free_accounted_pages().
 
 >>>> +bool __memcg_kmem_new_page(gfp_t gfp, void *_handle, int order)
 >>>> +{
 >>>> +	struct mem_cgroup *memcg;
 >>>> +	struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
 >>>> +	bool ret = true;
 >>>> +	size_t size;
 >>>> +	struct task_struct *p;
 >>>> +
 >>>> +	*handle = NULL;
 >>>> +	rcu_read_lock();
 >>>> +	p = rcu_dereference(current->mm->owner);
 >>>> +	memcg = mem_cgroup_from_task(p);
 >>>> +	if (!memcg_kmem_enabled(memcg))
 >>>> +		goto out;
 >>>> +
 >>>> +	mem_cgroup_get(memcg);
 >>>> +
 >> This mem_cgroup_get() will be a potentioal performance problem.
 >> Don't you have good idea to avoid accessing atomic counter here ?
 >> I think some kind of percpu counter or a feature to disable "move task"
 >> will be a help.
 >
 >
 >
 >
 >>>> +	pc = lookup_page_cgroup(page);
 >>>> +	lock_page_cgroup(pc);
 >>>> +	pc->mem_cgroup = memcg;
 >>>> +	SetPageCgroupUsed(pc);
 >>>> +	unlock_page_cgroup(pc);
 >>>> +}
 >>>> +
 >>>> +void __memcg_kmem_free_page(struct page *page, int order)
 >>>> +{
 >>>> +	struct mem_cgroup *memcg;
 >>>> +	size_t size;
 >>>> +	struct page_cgroup *pc;
 >>>> +
 >>>> +	if (mem_cgroup_disabled())
 >>>> +		return;
 >>>> +
 >>>> +	pc = lookup_page_cgroup(page);
 >>>> +	lock_page_cgroup(pc);
 >>>> +	memcg = pc->mem_cgroup;
 >>>> +	pc->mem_cgroup = NULL;
 >
 >> shouldn't this happen after checking "Used" bit ?
 >> Ah, BTW, why do you need to clear pc->memcg ?
 >
 > As for clearing pc->memcg, I think I'm just being overzealous. I can't
 > foresee any problems due to removing it.
 >
 > As for the Used bit, what difference does it make when we clear it?
 >
 I just want to see the same logic used in mem_cgroup_uncharge_common().
 Hmm, at setting pc->mem_cgroup, the things happens in
 set pc->mem_cgroup
 set Used bit
 order. If you clear pc->mem_cgroup
 unset Used bit
 clear pc->mem_cgroup
 seems reasonable.
 
 
 
 >>>> +	if (!PageCgroupUsed(pc)) {
 >>>> +		unlock_page_cgroup(pc);
 >>>> +		return;
 >>>> +	}
 >>>> +	ClearPageCgroupUsed(pc);
 >>>> +	unlock_page_cgroup(pc);
 >>>> +
 >>>> +	/*
 >>>> +	 * Checking if kmem accounted is enabled won't work for uncharge, since
 >>>> +	 * it is possible that the user enabled kmem tracking, allocated, and
 >>>> +	 * then disabled it again.
 >>>> +	 *
 >>>> +	 * We trust if there is a memcg associated with the page, it is a valid
 >>>> +	 * allocation
 >>>> +	 */
 >>>> +	if (!memcg)
 >>>> +		return;
 >>>> +
 >>>> +	WARN_ON(mem_cgroup_is_root(memcg));
 >>>> +	size = (1 << order) << PAGE_SHIFT;
 >>>> +	memcg_uncharge_kmem(memcg, size);
 >>>> +	mem_cgroup_put(memcg);
 >> Why do we need ref-counting here ? kmem res_counter cannot work as
 >> reference ?
 > This is of course the pair of the mem_cgroup_get() you commented on
 > earlier. If we need one, we need the other. If we don't need one, we
 > don't need the other =)
 >
 > The guarantee we're trying to give here is that the memcg structure will
 > stay around while there are dangling charges to kmem, that we decided
 > not to move (remember: moving it for the stack is simple, for the slab
 > is very complicated and ill-defined, and I believe it is better to treat
 > all kmem equally here)
 >
 > So maybe we can be clever here, and avoid reference counting at all
 > times. We call mem_cgroup_get() when the first charge occurs, and then
 > go for mem_cgroup_put() when our count reaches 0.
 >
 > What do you think about that?
 >
 
 I think that should work. I don't want to add not-optimized atomic counter ops
 in this very hot path.
 
 
 >
 >>>> +#ifdef CONFIG_MEMCG_KMEM
 >>>> +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
 >>>> +{
 >> What does 'delta' means ?
 >>
 > I can change it to something like nr_bytes, more informative.
 >
 >>>> +	struct res_counter *fail_res;
 >>>> +	struct mem_cgroup *_memcg;
 >>>> +	int ret;
 >>>> +	bool may_oom;
 >>>> +	bool nofail = false;
 >>>> +
 >>>> +	may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
 >>>> +	    !(gfp & __GFP_NORETRY);
 >>>> +
 >>>> +	ret = 0;
 >>>> +
 >>>> +	if (!memcg)
 >>>> +		return ret;
 >>>> +
 >>>> +	_memcg = memcg;
 >>>> +	ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
 >>>> +	    &_memcg, may_oom);
 >>>> +
 >>>> +	if (ret == -EINTR)  {
 >>>> +		nofail = true;
 >>>> +		/*
 >>>> +		 * __mem_cgroup_try_charge() chosed to bypass to root due to
 >>>> +		 * OOM kill or fatal signal.  Since our only options are to
 >>>> +		 * either fail the allocation or charge it to this cgroup, do
 >>>> +		 * it as a temporary condition. But we can't fail. From a
 >>>> +		 * kmem/slab perspective, the cache has already been selected,
 >>>> +		 * by mem_cgroup_get_kmem_cache(), so it is too late to change
 >>>> +		 * our minds
 >>>> +		 */
 >>>> +		res_counter_charge_nofail(&memcg->res, delta, &fail_res);
 >>>> +		if (do_swap_account)
 >>>> +			res_counter_charge_nofail(&memcg->memsw, delta,
 >>>> +						  &fail_res);
 >>>> +		ret = 0;
 >> Hm, you returns 0 and this charge may never be uncharged....right ?
 >>
 >
 > Can't see why. By returning 0 we inform our caller that the allocation
 > succeeded. It is up to him to undo it later through a call to uncharge.
 >
 Hmm, okay. You trust callers.
 
 Thanks,
 -Kame
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47584 is a reply to message #47472] | Fri, 17 August 2012 02:38   |  
			| 
				
				
					|  KAMEZAWA Hiroyuki Messages: 463
 Registered: September 2006
 | Senior Member |  |  |  
	| (2012/08/13 17:36), Glauber Costa wrote: > On 08/10/2012 09:02 PM, Kamezawa Hiroyuki wrote:
 >> (2012/08/09 22:01), Glauber Costa wrote:
 >>> This patch adds the basic infrastructure for the accounting of the slab
 >>> caches. To control that, the following files are created:
 >>>
 >>>    * memory.kmem.usage_in_bytes
 >>>    * memory.kmem.limit_in_bytes
 >>>    * memory.kmem.failcnt
 >>>    * memory.kmem.max_usage_in_bytes
 >>>
 >>> They have the same meaning of their user memory counterparts. They
 >>> reflect the state of the "kmem" res_counter.
 >>>
 >>> The code is not enabled until a limit is set. This can be tested by the
 >>> flag "kmem_accounted". This means that after the patch is applied, no
 >>> behavioral changes exists for whoever is still using memcg to control
 >>> their memory usage.
 >>>
 >>> We always account to both user and kernel resource_counters. This
 >>> effectively means that an independent kernel limit is in place when the
 >>> limit is set to a lower value than the user memory. A equal or higher
 >>> value means that the user limit will always hit first, meaning that kmem
 >>> is effectively unlimited.
 >>>
 >>> People who want to track kernel memory but not limit it, can set this
 >>> limit to a very high number (like RESOURCE_MAX - 1page - that no one
 >>> will ever hit, or equal to the user memory)
 >>>
 >>> Signed-off-by: Glauber Costa <glommer@parallels.com>
 >>> CC: Michal Hocko <mhocko@suse.cz>
 >>> CC: Johannes Weiner <hannes@cmpxchg.org>
 >>> Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 >>
 >> Could you add  a patch for documentation of this new interface and a text
 >> explaining the behavior of "kmem_accounting" ?
 >>
 >> Hm, my concern is the difference of behavior between user page accounting and
 >> kmem accounting...but this is how tcp-accounting is working.
 >>
 >> Once you add Documentation, it's okay to add my Ack.
 >>
 > I plan to add documentation in a separate patch. Due to that, can I add
 > your ack to this patch here?
 >
 > Also, I find that the description text in patch0 grew to be quite
 > informative and complete. I plan to add that to the documentation
 > if that is ok with you
 >
 
 Ack to this patch.
 
 -Kame
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 04/11] kmem accounting basic infrastructure [message #47585 is a reply to message #47572] | Fri, 17 August 2012 05:58   |  
			| 
				
				
					|  Ying Han Messages: 8
 Registered: February 2012
 | Junior Member |  |  |  
	| On Thu, Aug 16, 2012 at 8:25 AM, Michal Hocko <mhocko@suse.cz> wrote: > On Wed 15-08-12 12:50:55, Ying Han wrote:
 >> On Tue, Aug 14, 2012 at 9:21 AM, Michal Hocko <mhocko@suse.cz> wrote:
 >> > On Thu 09-08-12 17:01:12, Glauber Costa wrote:
 >> >> This patch adds the basic infrastructure for the accounting of the slab
 >> >> caches. To control that, the following files are created:
 >> >>
 >> >>  * memory.kmem.usage_in_bytes
 >> >>  * memory.kmem.limit_in_bytes
 >> >>  * memory.kmem.failcnt
 >> >>  * memory.kmem.max_usage_in_bytes
 >> >>
 >> >> They have the same meaning of their user memory counterparts. They
 >> >> reflect the state of the "kmem" res_counter.
 >> >>
 >> >> The code is not enabled until a limit is set. This can be tested by the
 >> >> flag "kmem_accounted". This means that after the patch is applied, no
 >> >> behavioral changes exists for whoever is still using memcg to control
 >> >> their memory usage.
 >> >>
 >> >> We always account to both user and kernel resource_counters. This
 >> >> effectively means that an independent kernel limit is in place when the
 >> >> limit is set to a lower value than the user memory. A equal or higher
 >> >> value means that the user limit will always hit first, meaning that kmem
 >> >> is effectively unlimited.
 >> >
 >> > Well, it contributes to the user limit so it is not unlimited. It just
 >> > falls under a different limit and it tends to contribute less. This can
 >> > be quite confusing.  I am still not sure whether we should mix the two
 >> > things together. If somebody wants to limit the kernel memory he has to
 >> > touch the other limit anyway.  Do you have a strong reason to mix the
 >> > user and kernel counters?
 >>
 >> The reason to mix the two together is a compromise of the two use
 >> cases we've heard by far. In google, we only need one limit which
 >> limits u & k, and the reclaim kicks in when the total usage hits the
 >> limit.
 >>
 >> > My impression was that kernel allocation should simply fail while user
 >> > allocations might reclaim as well. Why should we reclaim just because of
 >> > the kernel allocation (which is unreclaimable from hard limit reclaim
 >> > point of view)?
 >>
 >> Some of kernel objects are reclaimable if we have per-memcg shrinker.
 >
 > Agreed and I think we need that before this is merged as I state in
 > other email.
 >
 >> > I also think that the whole thing would get much simpler if those two
 >> > are split. Anyway if this is really a must then this should be
 >> > documented here.
 >>
 >> What would be the use case you have in your end?
 >
 > I do not have any specific unfortunately but I would like to prevent us
 > from closing other possible. I realize this sounds hand wavy and that is
 > why I do not want to block this work but I think we should give it some
 > time before this gets merged.
 
 Agreed that we don't want to rush merge anything.
 
 On the other hand, I was trying to understand your concern of the k &
 u+k counter. After reading your previous replies,
 I think I understand your concern of missing the target shrinker. I
 posted the patch and please take a look :)
 
 Meanwhile, can you help to clarify other concerns in your mind on
 having the two counters? Please ignore me if you answered
 the question somewhere and just give me the pointer.
 
 --Ying
 
 >
 >> --Ying
 > --
 > Michal Hocko
 > SUSE Labs
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 08/11] memcg: disable kmem code when not in use. [message #47586 is a reply to message #47397] | Fri, 17 August 2012 07:02   |  
			| 
				
				
					|  Michal Hocko Messages: 109
 Registered: December 2011
 | Senior Member |  |  |  
	| On Thu 09-08-12 17:01:16, Glauber Costa wrote: > We can use jump labels to patch the code in or out when not used.
 >
 > Because the assignment: memcg->kmem_accounted = true is done after the
 > jump labels increment, we guarantee that the root memcg will always be
 > selected until all call sites are patched (see memcg_kmem_enabled).
 
 Not that it would be really important because kmem_accounted goes away
 in a subsequent patch but I think the wording is a bit misleading here.
 First of all there is no guanratee that kmem_accounted=true is seen
 before atomic_inc(&key->enabled) because there is no memory barrier and
 the lock serves just a leave barrier. But I do not think this is
 important at all because key->enabled is what matters here. Even if
 memcg_kmem_enabled is true we do not consider it if the key is disabled,
 right?
 
 > This guarantees that no mischarges are applied.
 >
 > Jump label decrement happens when the last reference count from the
 > memcg dies. This will only happen when the caches are all dead.
 >
 > Signed-off-by: Glauber Costa <glommer@parallels.com>
 > Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 > CC: Christoph Lameter <cl@linux.com>
 > CC: Pekka Enberg <penberg@cs.helsinki.fi>
 > CC: Michal Hocko <mhocko@suse.cz>
 > CC: Johannes Weiner <hannes@cmpxchg.org>
 > CC: Suleiman Souhlal <suleiman@google.com>
 
 Anyway the code looks correct.
 Reviewed-by: Michal Hocko <mhocko@suse.cz>
 
 > ---
 >  include/linux/memcontrol.h |  5 ++++-
 >  mm/memcontrol.c            | 50 ++++++++++++++++++++++++++++++++++++----------
 >  2 files changed, 44 insertions(+), 11 deletions(-)
 >
 > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
 > index 75b247e..f39d933 100644
 > --- a/include/linux/memcontrol.h
 > +++ b/include/linux/memcontrol.h
 > @@ -22,6 +22,7 @@
 >  #include <linux/cgroup.h>
 >  #include <linux/vm_event_item.h>
 >  #include <linux/hardirq.h>
 > +#include <linux/jump_label.h>
 >
 >  struct mem_cgroup;
 >  struct page_cgroup;
 > @@ -401,7 +402,9 @@ struct sock;
 >  void sock_update_memcg(struct sock *sk);
 >  void sock_release_memcg(struct sock *sk);
 >
 > -#define memcg_kmem_on 1
 > +extern struct static_key memcg_kmem_enabled_key;
 > +#define memcg_kmem_on static_key_false(&memcg_kmem_enabled_key)
 > +
 >  bool __memcg_kmem_new_page(gfp_t gfp, void *handle, int order);
 >  void __memcg_kmem_commit_page(struct page *page, void *handle, int order);
 >  void __memcg_kmem_free_page(struct page *page, int order);
 > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 > index e9824c1..3216292 100644
 > --- a/mm/memcontrol.c
 > +++ b/mm/memcontrol.c
 > @@ -437,6 +437,10 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
 >  #include <net/sock.h>
 >  #include <net/ip.h>
 >
 > +struct static_key memcg_kmem_enabled_key;
 > +/* so modules can inline the checks */
 > +EXPORT_SYMBOL(memcg_kmem_enabled_key);
 > +
 >  static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
 >  static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta);
 >  static void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta);
 > @@ -607,6 +611,16 @@ void __memcg_kmem_free_page(struct page *page, int order)
 >  	mem_cgroup_put(memcg);
 >  }
 >  EXPORT_SYMBOL(__memcg_kmem_free_page);
 > +
 > +static void disarm_kmem_keys(struct mem_cgroup *memcg)
 > +{
 > +	if (memcg->kmem_accounted)
 > +		static_key_slow_dec(&memcg_kmem_enabled_key);
 > +}
 > +#else
 > +static void disarm_kmem_keys(struct mem_cgroup *memcg)
 > +{
 > +}
 >  #endif /* CONFIG_MEMCG_KMEM */
 >
 >  #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 > @@ -622,6 +636,12 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
 >  }
 >  #endif
 >
 > +static void disarm_static_keys(struct mem_cgroup *memcg)
 > +{
 > +	disarm_sock_keys(memcg);
 > +	disarm_kmem_keys(memcg);
 > +}
 > +
 >  static void drain_all_stock_async(struct mem_cgroup *memcg);
 >
 >  static struct mem_cgroup_per_zone *
 > @@ -4147,6 +4167,24 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
 >  	len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
 >  	return simple_read_from_buffer(buf, nbytes, ppos, str, len);
 >  }
 > +
 > +static void memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
 > +{
 > +#ifdef CONFIG_MEMCG_KMEM
 > +	/*
 > +	 * Once enabled, can't be disabled. We could in theory disable it if we
 > +	 * haven't yet created any caches, or if we can shrink them all to
 > +	 * death. But it is not worth the trouble.
 > +	 */
 > +	mutex_lock(&set_limit_mutex);
 > +	if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
 > +		static_key_slow_inc(&memcg_kmem_enabled_key);
 > +		memcg->kmem_accounted = true;
 > +	}
 > +	mutex_unlock(&set_limit_mutex);
 > +#endif
 > +}
 > +
 >  /*
 >   * The user of this function is...
 >   * RES_LIMIT.
 > @@ -4184,15 +4222,7 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 >  			ret = res_counter_set_limit(&memcg->kmem, val);
 >  			if (ret)
 >  				break;
 > -			/*
 > -			 * Once enabled, can't be disabled. We could in theory
 > -			 * disable it if we haven't yet created any caches, or
 > -			 * if we can shrink them all to death.
 > -			 *
 > -			 * But it is not worth the trouble
 > -			 */
 > -			if (!memcg->kmem_accounted && val != RESOURCE_MAX)
 > -				memcg->kmem_accounted = true;
 > +			memcg_update_kmem_limit(memcg, val);
 >  		} else
 >  			return -EINVAL;
 >  		break;
 > @@ -5054,7 +5084,7 @@ static void free_work(struct work_struct *work)
 >  	 * to move this code around, and make sure it is outside
 >  	 * the cgroup_lock.
 >  	 */
 > -	disarm_sock_keys(memcg);
 > +	disarm_static_keys(memcg);
 >  	if (size < PAGE_SIZE)
 >  		kfree(memcg);
 >  	else
 > --
 > 1.7.11.2
 >
 > --
 > To unsubscribe from this list: send the line "unsubscribe cgroups" in
 > the body of a message to majordomo@vger.kernel.org
 > More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 --
 Michal Hocko
 SUSE Labs
 |  
	|  |  |  
	|  |  
	|  |  
	|  |  
	| 
		
			| Re: [PATCH v2 09/11] memcg: propagate kmem limiting information to children [message #47590 is a reply to message #47396] | Fri, 17 August 2012 09:00   |  
			| 
				
				
					|  Michal Hocko Messages: 109
 Registered: December 2011
 | Senior Member |  |  |  
	| On Thu 09-08-12 17:01:17, Glauber Costa wrote: > The current memcg slab cache management fails to present satisfatory
 > hierarchical behavior in the following scenario:
 >
 > -> /cgroups/memory/A/B/C
 >
 > * kmem limit set at A,
 > * A and B have no tasks,
 > * span a new task in in C.
 >
 > Because kmem_accounted is a boolean that was not set for C, no
 > accounting would be done. This is, however, not what we expect.
 >
 > The basic idea, is that when a cgroup is limited, we walk the tree
 > upwards
 
 Isn't it rather downwards? We start at A and then mark all children so
 we go down the tree. Moreover the walk is not atomic wrt. parallel
 charges nor to a new child creation. First one seems to be acceptable
 as the charges go to the root. The second one requires cgroup_lock.
 
 It also seems that you are missing memcg_kmem_account_parent in
 mem_cgroup_create (use_hierarchy path) if memcg_kmem_is_accounted(parent).
 
 Some further "wording" comments below. Other than that the patch looks
 correct.
 
 > (something Kame and I already thought about doing for other
 > purposes), and make sure that we store the information about the parent
 > being limited in kmem_accounted (that is turned into a bitmap: two
 > booleans would not be space efficient).
 
 Two booleans even don't serve the purpose because you want to test this
 atomically, right?
 
 > The code for that is taken from sched/core.c. My reasons for not
 > putting it into a common place is to dodge the type issues that would
 > arise from a common implementation between memcg and the scheduler -
 > but I think that it should ultimately happen, so if you want me to do
 > it now, let me know.
 
 Is this really relevant for the patch?
 
 > We do the reverse operation when a formerly limited cgroup becomes
 > unlimited.
 >
 > Signed-off-by: Glauber Costa <glommer@parallels.com>
 > CC: Christoph Lameter <cl@linux.com>
 > CC: Pekka Enberg <penberg@cs.helsinki.fi>
 > CC: Michal Hocko <mhocko@suse.cz>
 > CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 > CC: Johannes Weiner <hannes@cmpxchg.org>
 > CC: Suleiman Souhlal <suleiman@google.com>
 > ---
 >  mm/memcontrol.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++++------
 >  1 file changed, 79 insertions(+), 9 deletions(-)
 >
 > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 > index 3216292..3d30b79 100644
 > --- a/mm/memcontrol.c
 > +++ b/mm/memcontrol.c
 > @@ -295,7 +295,8 @@ struct mem_cgroup {
 >  	 * Should the accounting and control be hierarchical, per subtree?
 >  	 */
 >  	bool use_hierarchy;
 > -	bool kmem_accounted;
 > +
 > +	unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */
 >
 >  	bool		oom_lock;
 >  	atomic_t	under_oom;
 > @@ -348,6 +349,38 @@ struct mem_cgroup {
 >  #endif
 >  };
 >
 > +enum {
 > +	KMEM_ACCOUNTED_THIS, /* accounted by this cgroup itself */
 > +	KMEM_ACCOUNTED_PARENT, /* accounted by any of its parents. */
 
 How it can be accounted by its parent, the charge doesn't go downwards.
 Shouldn't it rather be /* a parent is accounted */
 
 > +};
 > +
 > +#ifdef CONFIG_MEMCG_KMEM
 > +static bool memcg_kmem_account(struct mem_cgroup *memcg)
 
 memcg_kmem_set_account? It matches _clear_ counterpart and it makes
 obvious that the value is changed actually.
 
 [...]
 > +static bool memcg_kmem_is_accounted(struct mem_cgroup *memcg)
 > +{
 > +	return test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted);
 > +}
 > +
 > +static void memcg_kmem_account_parent(struct mem_cgroup *memcg)
 
 same here _set_parent
 
 [...]
 > @@ -614,7 +647,7 @@ EXPORT_SYMBOL(__memcg_kmem_free_page);
 >
 >  static void disarm_kmem_keys(struct mem_cgroup *memcg)
 >  {
 > -	if (memcg->kmem_accounted)
 > +	if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
 
 memcg_kmem_is_accounted. I do not see any reason to open code this.
 
 >  		static_key_slow_dec(&memcg_kmem_enabled_key);
 >  }
 >  #else
 > @@ -4171,17 +4204,54 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
 >  static void memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
 >  {
 >  #ifdef CONFIG_MEMCG_KMEM
 > -	/*
 > -	 * Once enabled, can't be disabled. We could in theory disable it if we
 > -	 * haven't yet created any caches, or if we can shrink them all to
 > -	 * death. But it is not worth the trouble.
 > -	 */
 > +	struct mem_cgroup *iter;
 > +
 >  	mutex_lock(&set_limit_mutex);
 > -	if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
 > +	if ((val != RESOURCE_MAX) && memcg_kmem_account(memcg)) {
 > +
 > +		/*
 > +		 * Once enabled, can't be disabled. We could in theory disable
 > +		 * it if we haven't yet created any caches, or if we can shrink
 > +		 * them all to death. But it is not worth the trouble
 > +		 */
 >  		static_key_slow_inc(&memcg_kmem_enabled_key);
 > -		memcg->kmem_accounted = true;
 > +
 > +		if (!memcg->use_hierarchy)
 > +			goto out;
 > +
 > +		for_each_mem_cgroup_tree(iter, memcg) {
 
 for_each_mem_cgroup_tree does respect use_hierarchy so the above
 shortcut is not necessary. Dunno but IMHO we should get rid of explicit
 tests as much as possible. This doesn't look like a hot path anyway.
 
 > +			if (iter == memcg)
 > +				continue;
 > +			memcg_kmem_account_parent(iter);
 > +		}
 > +	} else if ((val == RESOURCE_MAX) && memcg_kmem_clear_account(memcg)) {
 
 Above you said "Once enabled, can't be disabled." and now you can
 disable it? Say you are a leaf group with non accounted parents. This
 will clear the flag and so no further accounting is done. Shouldn't
 unlimited mean that we will never reach the limit? Or am I missing
 something?
 
 > +
 > +		if (!memcg->use_hierarchy)
 > +			goto out;
 > +
 > +		for_each_mem_cgroup_tree(iter, memcg) {
 > +			struct mem_cgroup *parent;
 > +
 > +			if (iter == memcg)
 > +				continue;
 > +			/*
 > +			 * We should only have our parent bit cleared if none
 > +			 * of our parents are accounted. The transversal order
 > +			 * of our iter function forces us to always look at the
 > +			 * parents.
 > +			 */
 > +			parent = parent_mem_cgroup(iter);
 > +			for (; parent != memcg; parent = parent_mem_cgroup(iter))
 > +				if (memcg_kmem_is_accounted(parent))
 > +					goto noclear;
 > +			memcg_kmem_clear_account_parent(iter);
 
 Brain hurts...
 Yes we are iterating in the creation ordering so we cannot rely on the
 first encountered accounted memcg
 A(a) - B - D
 - C (a) - E
 
 
 > +noclear:
 > +			continue;
 > +		}
 >  	}
 > +out:
 >  	mutex_unlock(&set_limit_mutex);
 > +
 >  #endif
 >  }
 >
 > --
 > 1.7.11.2
 >
 > --
 > To unsubscribe from this list: send the line "unsubscribe cgroups" in
 > the body of a message to majordomo@vger.kernel.org
 > More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 --
 Michal Hocko
 SUSE Labs
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 09/11] memcg: propagate kmem limiting information to children [message #47591 is a reply to message #47590] | Fri, 17 August 2012 09:15   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/17/2012 01:00 PM, Michal Hocko wrote: > On Thu 09-08-12 17:01:17, Glauber Costa wrote:
 >> The current memcg slab cache management fails to present satisfatory
 >> hierarchical behavior in the following scenario:
 >>
 >> -> /cgroups/memory/A/B/C
 >>
 >> * kmem limit set at A,
 >> * A and B have no tasks,
 >> * span a new task in in C.
 >>
 >> Because kmem_accounted is a boolean that was not set for C, no
 >> accounting would be done. This is, however, not what we expect.
 >>
 >> The basic idea, is that when a cgroup is limited, we walk the tree
 >> upwards
 >
 > Isn't it rather downwards? We start at A and then mark all children so
 > we go down the tree. Moreover the walk is not atomic wrt. parallel
 > charges nor to a new child creation. First one seems to be acceptable
 > as the charges go to the root. The second one requires cgroup_lock.
 >
 
 Yes, it is downwards. I've already noticed that yesterday and updated
 in my tree.
 
 As for the lock, can't we take set_limit lock in cgroup creation just
 around the place that updates that field in the child? It is a lot more
 fine grained - everything except the dead bkl is -  and what we're
 actually protecting is the limit.
 
 If you prefer, I can use cgroup lock just fine. But then I won't sleep
 at night and probably pee my pants, which is something I don't do for at
 least two decades now.
 
 > It also seems that you are missing memcg_kmem_account_parent in
 > mem_cgroup_create (use_hierarchy path) if memcg_kmem_is_accounted(parent).
 >
 
 You mean when we create a cgroup ontop of an already limited parent?
 Humm, you are very right.
 
 > Some further "wording" comments below. Other than that the patch looks
 > correct.
 >
 >> (something Kame and I already thought about doing for other
 >> purposes), and make sure that we store the information about the parent
 >> being limited in kmem_accounted (that is turned into a bitmap: two
 >> booleans would not be space efficient).
 >
 > Two booleans even don't serve the purpose because you want to test this
 > atomically, right?
 >
 
 Well, yes, we have that extra problem as well.
 >> The code for that is taken from sched/core.c. My reasons for not
 >> putting it into a common place is to dodge the type issues that would
 >> arise from a common implementation between memcg and the scheduler -
 >> but I think that it should ultimately happen, so if you want me to do
 >> it now, let me know.
 >
 > Is this really relevant for the patch?
 >
 
 Not at all. Besides not being relevant, it is also not true, since I now
 use the memcg iterator. I would prefer the tree walk instead of having
 to cope with the order imposed by the memcg iterator, but we add
 less code this way...
 
 Again, already modified that in my yesterday's update.
 
 >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 >> index 3216292..3d30b79 100644
 >> --- a/mm/memcontrol.c
 >> +++ b/mm/memcontrol.c
 >> @@ -295,7 +295,8 @@ struct mem_cgroup {
 >>  	 * Should the accounting and control be hierarchical, per subtree?
 >>  	 */
 >>  	bool use_hierarchy;
 >> -	bool kmem_accounted;
 >> +
 >> +	unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */
 >>
 >>  	bool		oom_lock;
 >>  	atomic_t	under_oom;
 >> @@ -348,6 +349,38 @@ struct mem_cgroup {
 >>  #endif
 >>  };
 >>
 >> +enum {
 >> +	KMEM_ACCOUNTED_THIS, /* accounted by this cgroup itself */
 >> +	KMEM_ACCOUNTED_PARENT, /* accounted by any of its parents. */
 >
 > How it can be accounted by its parent, the charge doesn't go downwards.
 > Shouldn't it rather be /* a parent is accounted */
 >
 indeed.
 
 >> +};
 >> +
 >> +#ifdef CONFIG_MEMCG_KMEM
 >> +static bool memcg_kmem_account(struct mem_cgroup *memcg)
 >
 > memcg_kmem_set_account? It matches _clear_ counterpart and it makes
 > obvious that the value is changed actually.
 >
 
 Ok.
 
 > [...]
 >> +static bool memcg_kmem_is_accounted(struct mem_cgroup *memcg)
 >> +{
 >> +	return test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted);
 >> +}
 >> +
 >> +static void memcg_kmem_account_parent(struct mem_cgroup *memcg)
 >
 > same here _set_parent
 >
 
 Ok, agreed.
 
 > [...]
 >> @@ -614,7 +647,7 @@ EXPORT_SYMBOL(__memcg_kmem_free_page);
 >>
 >>  static void disarm_kmem_keys(struct mem_cgroup *memcg)
 >>  {
 >> -	if (memcg->kmem_accounted)
 >> +	if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
 >
 > memcg_kmem_is_accounted. I do not see any reason to open code this.
 >
 
 ok.
 
 >>  #ifdef CONFIG_MEMCG_KMEM
 >> -	/*
 >> -	 * Once enabled, can't be disabled. We could in theory disable it if we
 >> -	 * haven't yet created any caches, or if we can shrink them all to
 >> -	 * death. But it is not worth the trouble.
 >> -	 */
 >> +	struct mem_cgroup *iter;
 >> +
 >>  	mutex_lock(&set_limit_mutex);
 >> -	if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
 >> +	if ((val != RESOURCE_MAX) && memcg_kmem_account(memcg)) {
 >> +
 >> +		/*
 >> +		 * Once enabled, can't be disabled. We could in theory disable
 >> +		 * it if we haven't yet created any caches, or if we can shrink
 >> +		 * them all to death. But it is not worth the trouble
 >> +		 */
 >>  		static_key_slow_inc(&memcg_kmem_enabled_key);
 >> -		memcg->kmem_accounted = true;
 >> +
 >> +		if (!memcg->use_hierarchy)
 >> +			goto out;
 >> +
 >> +		for_each_mem_cgroup_tree(iter, memcg) {
 >
 > for_each_mem_cgroup_tree does respect use_hierarchy so the above
 > shortcut is not necessary. Dunno but IMHO we should get rid of explicit
 > tests as much as possible. This doesn't look like a hot path anyway.
 >
 
 I can't remember any reason for doing so other than gaining some time.
 I will remove it.
 
 >> +			if (iter == memcg)
 >> +				continue;
 >> +			memcg_kmem_account_parent(iter);
 >> +		}
 >> +	} else if ((val == RESOURCE_MAX) && memcg_kmem_clear_account(memcg)) {
 >
 > Above you said "Once enabled, can't be disabled." and now you can
 > disable it? Say you are a leaf group with non accounted parents. This
 > will clear the flag and so no further accounting is done. Shouldn't
 > unlimited mean that we will never reach the limit? Or am I missing
 > something?
 >
 
 You are missing something, and maybe I should be more clear about that.
 The static branches can't be disabled (it is only safe to disable them
 from disarm_static_branches(), when all references are gone). Note that
 when unlimited, we flip bits, do a transversal, but there is no mention
 to the static branch.
 
 The limiting can come and go at will.
 
 >> +
 >> +		if (!memcg->use_hierarchy)
 >> +			goto out;
 >> +
 >> +		for_each_mem_cgroup_tree(iter, memcg) {
 >> +			struct mem_cgroup *parent;
 >> +
 >> +			if (iter == memcg)
 >> +				continue;
 >> +			/*
 >> +			 * We should only have our parent bit cleared if none
 >> +			 * of our parents are accounted. The transversal order
 >> +			 * of our iter function forces us to always look at the
 >> +			 * parents.
 >> +			 */
 >> +			parent = parent_mem_cgroup(iter);
 >> +			for (; parent != memcg; parent = parent_mem_cgroup(iter))
 >> +				if (memcg_kmem_is_accounted(parent))
 >> +					goto noclear;
 >> +			memcg_kmem_clear_account_parent(iter);
 >
 > Brain hurts...
 > Yes we are iterating in the creation ordering so we cannot rely on the
 > first encountered accounted memcg
 > A(a) - B - D
 >      - C (a) - E
 >
 >
 
 That's why I said I preferred the iterator the scheduler uses. The
 actual transverse code was much simpler, because it will stop at an
 unlimited parent. But this is the only drawback I see in the memcg
 iterator, so I decided that just documenting this "interesting" piece of
 code well would do...
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 09/11] memcg: propagate kmem limiting information to children [message #47592 is a reply to message #47591] | Fri, 17 August 2012 09:35   |  
			| 
				
				
					|  Michal Hocko Messages: 109
 Registered: December 2011
 | Senior Member |  |  |  
	| On Fri 17-08-12 13:15:47, Glauber Costa wrote: > On 08/17/2012 01:00 PM, Michal Hocko wrote:
 > > On Thu 09-08-12 17:01:17, Glauber Costa wrote:
 > >> The current memcg slab cache management fails to present satisfatory
 > >> hierarchical behavior in the following scenario:
 > >>
 > >> -> /cgroups/memory/A/B/C
 > >>
 > >> * kmem limit set at A,
 > >> * A and B have no tasks,
 > >> * span a new task in in C.
 > >>
 > >> Because kmem_accounted is a boolean that was not set for C, no
 > >> accounting would be done. This is, however, not what we expect.
 > >>
 > >> The basic idea, is that when a cgroup is limited, we walk the tree
 > >> upwards
 > >
 > > Isn't it rather downwards? We start at A and then mark all children so
 > > we go down the tree. Moreover the walk is not atomic wrt. parallel
 > > charges nor to a new child creation. First one seems to be acceptable
 > > as the charges go to the root. The second one requires cgroup_lock.
 > >
 >
 > Yes, it is downwards. I've already noticed that yesterday and updated
 > in my tree.
 >
 > As for the lock, can't we take set_limit lock in cgroup creation just
 > around the place that updates that field in the child? It is a lot more
 > fine grained - everything except the dead bkl is -  and what we're
 > actually protecting is the limit.
 
 That should work as well. It is less obvious because we are not
 considering the parent limit (maybe we should rename the lock but that
 is just a detail).
 
 > If you prefer, I can use cgroup lock just fine. But then I won't sleep
 > at night and probably pee my pants, which is something I don't do for at
 > least two decades now.
 
 Heh, please no, I would feel terrible then
 
 > > It also seems that you are missing memcg_kmem_account_parent in
 > > mem_cgroup_create (use_hierarchy path) if memcg_kmem_is_accounted(parent).
 > >
 >
 > You mean when we create a cgroup ontop of an already limited parent?
 
 I would prefer bellow but yes
 A (a) - B (a, pa)
 - C (new)
 
 > Humm, you are very right.
 >
 > > Some further "wording" comments below. Other than that the patch looks
 > > correct.
 > >
 > >> (something Kame and I already thought about doing for other
 > >> purposes), and make sure that we store the information about the parent
 > >> being limited in kmem_accounted (that is turned into a bitmap: two
 > >> booleans would not be space efficient).
 > >
 > > Two booleans even don't serve the purpose because you want to test this
 > > atomically, right?
 > >
 >
 > Well, yes, we have that extra problem as well.
 > >> The code for that is taken from sched/core.c. My reasons for not
 > >> putting it into a common place is to dodge the type issues that would
 > >> arise from a common implementation between memcg and the scheduler -
 > >> but I think that it should ultimately happen, so if you want me to do
 > >> it now, let me know.
 > >
 > > Is this really relevant for the patch?
 > >
 >
 > Not at all. Besides not being relevant, it is also not true, since I now
 > use the memcg iterator. I would prefer the tree walk instead of having
 > to cope with the order imposed by the memcg iterator, but we add
 > less code this way...
 >
 > Again, already modified that in my yesterday's update.
 
 OK
 
 > >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 > >> index 3216292..3d30b79 100644
 > >> --- a/mm/memcontrol.c
 > >> +++ b/mm/memcontrol.c
 > >> @@ -295,7 +295,8 @@ struct mem_cgroup {
 > >>  	 * Should the accounting and control be hierarchical, per subtree?
 > >>  	 */
 > >>  	bool use_hierarchy;
 > >> -	bool kmem_accounted;
 > >> +
 > >> +	unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */
 > >>
 > >>  	bool		oom_lock;
 > >>  	atomic_t	under_oom;
 > >> @@ -348,6 +349,38 @@ struct mem_cgroup {
 > >>  #endif
 > >>  };
 > >>
 > >> +enum {
 > >> +	KMEM_ACCOUNTED_THIS, /* accounted by this cgroup itself */
 > >> +	KMEM_ACCOUNTED_PARENT, /* accounted by any of its parents. */
 > >
 > > How it can be accounted by its parent, the charge doesn't go downwards.
 > > Shouldn't it rather be /* a parent is accounted */
 > >
 > indeed.
 >
 > >> +};
 > >> +
 > >> +#ifdef CONFIG_MEMCG_KMEM
 > >> +static bool memcg_kmem_account(struct mem_cgroup *memcg)
 > >
 > > memcg_kmem_set_account? It matches _clear_ counterpart and it makes
 > > obvious that the value is changed actually.
 > >
 >
 > Ok.
 >
 > > [...]
 > >> +static bool memcg_kmem_is_accounted(struct mem_cgroup *memcg)
 > >> +{
 > >> +	return test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted);
 > >> +}
 > >> +
 > >> +static void memcg_kmem_account_parent(struct mem_cgroup *memcg)
 > >
 > > same here _set_parent
 > >
 >
 > Ok, agreed.
 
 Thanks
 
 >
 > > [...]
 > >> @@ -614,7 +647,7 @@ EXPORT_SYMBOL(__memcg_kmem_free_page);
 > >>
 > >>  static void disarm_kmem_keys(struct mem_cgroup *memcg)
 > >>  {
 > >> -	if (memcg->kmem_accounted)
 > >> +	if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
 > >
 > > memcg_kmem_is_accounted. I do not see any reason to open code this.
 > >
 >
 > ok.
 >
 > >>  #ifdef CONFIG_MEMCG_KMEM
 > >> -	/*
 > >> -	 * Once enabled, can't be disabled. We could in theory disable it if we
 > >> -	 * haven't yet created any caches, or if we can shrink them all to
 > >> -	 * death. But it is not worth the trouble.
 > >> -	 */
 > >> +	struct mem_cgroup *iter;
 > >> +
 > >>  	mutex_lock(&set_limit_mutex);
 > >> -	if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
 > >> +	if ((val != RESOURCE_MAX) && memcg_kmem_account(memcg)) {
 > >> +
 > >> +		/*
 > >> +		 * Once enabled, can't be disabled. We could in theory disable
 > >> +		 * it if we haven't yet created any caches, or if we can shrink
 > >> +		 * them all to death. But it is not worth the trouble
 > >> +		 */
 > >>  		static_key_slow_inc(&memcg_kmem_enabled_key);
 > >> -		memcg->kmem_accounted = true;
 > >> +
 > >> +		if (!memcg->use_hierarchy)
 > >> +			goto out;
 > >> +
 > >> +		for_each_mem_cgroup_tree(iter, memcg) {
 > >
 > > for_each_mem_cgroup_tree does respect use_hierarchy so the above
 > > shortcut is not necessary. Dunno but IMHO we should get rid of explicit
 > > tests as much as possible. This doesn't look like a hot path anyway.
 > >
 >
 > I can't remember any reason for doing so other than gaining some time.
 > I will remove it.
 
 Well it involves a bit more code because you would basically do expand
 to a loop which does one iteration (continue) and terminates also take
 and drop the reference on the group. That all seems unnecessary but as I
 said this is not a hot path and we better get rid of direct checks.
 I am not insisting on this so use your good taste...
 
 >
 > >> +			if (iter == memcg)
 > >> +				continue;
 > >> +			memcg_kmem_account_parent(iter);
 > >> +		}
 > >> +	} else if ((val == RESOURCE_MAX) && memcg_kmem_clear_account(memcg)) {
 > >
 > > Above you said "Once enabled, can't be disabled." and now you can
 > > disable it? Say you are a leaf group with non accounted parents. This
 > > will clear the flag and so no further accounting is done. Shouldn't
 > > unlimited mean that we will never reach the limit? Or am I missing
 > > something?
 > >
 >
 > You are missing something, and maybe I should be more clear about that.
 > The static branches can't be disabled (it is only safe to disable them
 > from disarm_static_branches(), when all references are gone). Note that
 > when unlimited, we flip bits, do a transversal, but there is no mention
 > to the static branch.
 
 My little brain still doesn't get this. I wasn't concerned about static
 branches. I was worried about memcg_can_account_kmem which will return
 false now, doesn't it.
 
 >
 > The limiting can come and go at will.
 >
 > >> +
 > >> +		if (!memcg->use_hierarchy)
 > >> +			goto out;
 > >> +
 > >> +		for_each_mem_cgroup_tree(iter, memcg) {
 > >> +			struct mem_cgroup *parent;
 > >> +
 > >> +			if (iter == memcg)
 > >> +				continue;
 > >> +			/*
 > >> +			 * We should only have our parent bit cleared if none
 > >> +			 * of our parents are accounted. The transversal order
 > >> +			 * of our iter function forces us to always look at the
 > >> +			 * parents.
 > >> +			 */
 > >> +			parent = parent_mem_cgroup(iter);
 > >> +			for (; parent != memcg; parent = parent_mem_cgroup(iter))
 > >> +				if (memcg_kmem_is_accounted(parent))
 > >> +					goto noclear;
 > >> +			memcg_kmem_clear_account_parent(iter);
 > >
 > > Brain hurts...
 > > Yes we are iterating in the creation ordering so we cannot rely on the
 > > first encountered accounted memcg
 > > A(a) - B - D
 > >      - C (a) - E
 >
 > That's why I said I preferred the iterator the scheduler uses. The
 > actual transverse code was much 
...
 
 
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 09/11] memcg: propagate kmem limiting information to children [message #47593 is a reply to message #47592] | Fri, 17 August 2012 10:07   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/17/2012 01:35 PM, Michal Hocko wrote: >>> Above you said "Once enabled, can't be disabled." and now you can
 >>> > > disable it? Say you are a leaf group with non accounted parents. This
 >>> > > will clear the flag and so no further accounting is done. Shouldn't
 >>> > > unlimited mean that we will never reach the limit? Or am I missing
 >>> > > something?
 >>> > >
 >> >
 >> > You are missing something, and maybe I should be more clear about that.
 >> > The static branches can't be disabled (it is only safe to disable them
 >> > from disarm_static_branches(), when all references are gone). Note that
 >> > when unlimited, we flip bits, do a transversal, but there is no mention
 >> > to the static branch.
 > My little brain still doesn't get this. I wasn't concerned about static
 > branches. I was worried about memcg_can_account_kmem which will return
 > false now, doesn't it.
 >
 
 Yes, it will. If I got you right, you are concerned because I said that
 can't happen. But it will.
 
 But I never said that can't happen. I said (ok, I meant) the static
 branches can't be disabled.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 09/11] memcg: propagate kmem limiting information to children [message #47594 is a reply to message #47593] | Fri, 17 August 2012 10:35   |  
			| 
				
				
					|  Michal Hocko Messages: 109
 Registered: December 2011
 | Senior Member |  |  |  
	| On Fri 17-08-12 14:07:00, Glauber Costa wrote: > On 08/17/2012 01:35 PM, Michal Hocko wrote:
 > >>> Above you said "Once enabled, can't be disabled." and now you can
 > >>> > > disable it? Say you are a leaf group with non accounted parents. This
 > >>> > > will clear the flag and so no further accounting is done. Shouldn't
 > >>> > > unlimited mean that we will never reach the limit? Or am I missing
 > >>> > > something?
 > >>> > >
 > >> >
 > >> > You are missing something, and maybe I should be more clear about that.
 > >> > The static branches can't be disabled (it is only safe to disable them
 > >> > from disarm_static_branches(), when all references are gone). Note that
 > >> > when unlimited, we flip bits, do a transversal, but there is no mention
 > >> > to the static branch.
 > > My little brain still doesn't get this. I wasn't concerned about static
 > > branches. I was worried about memcg_can_account_kmem which will return
 > > false now, doesn't it.
 > >
 >
 > Yes, it will. If I got you right, you are concerned because I said that
 > can't happen. But it will.
 >
 > But I never said that can't happen. I said (ok, I meant) the static
 > branches can't be disabled.
 
 Ok, then I misunderstood that because the comment was there even before
 static branches were introduced and it made sense to me. This is
 inconsistent with what we do for user accounting because even if we set
 limit to unlimitted we still account. Why should we differ here?
 
 --
 Michal Hocko
 SUSE Labs
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH v2 09/11] memcg: propagate kmem limiting information to children [message #47596 is a reply to message #47594] | Fri, 17 August 2012 10:39   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/17/2012 02:35 PM, Michal Hocko wrote: > On Fri 17-08-12 14:07:00, Glauber Costa wrote:
 >> On 08/17/2012 01:35 PM, Michal Hocko wrote:
 >>>>> Above you said "Once enabled, can't be disabled." and now you can
 >>>>>>> disable it? Say you are a leaf group with non accounted parents. This
 >>>>>>> will clear the flag and so no further accounting is done. Shouldn't
 >>>>>>> unlimited mean that we will never reach the limit? Or am I missing
 >>>>>>> something?
 >>>>>>>
 >>>>>
 >>>>> You are missing something, and maybe I should be more clear about that.
 >>>>> The static branches can't be disabled (it is only safe to disable them
 >>>>> from disarm_static_branches(), when all references are gone). Note that
 >>>>> when unlimited, we flip bits, do a transversal, but there is no mention
 >>>>> to the static branch.
 >>> My little brain still doesn't get this. I wasn't concerned about static
 >>> branches. I was worried about memcg_can_account_kmem which will return
 >>> false now, doesn't it.
 >>>
 >>
 >> Yes, it will. If I got you right, you are concerned because I said that
 >> can't happen. But it will.
 >>
 >> But I never said that can't happen. I said (ok, I meant) the static
 >> branches can't be disabled.
 >
 > Ok, then I misunderstood that because the comment was there even before
 > static branches were introduced and it made sense to me. This is
 > inconsistent with what we do for user accounting because even if we set
 > limit to unlimitted we still account. Why should we differ here?
 >
 There is another thing as well. Mel was right in his comment: I am
 actually abusing this bit (because it is flippable), and it seems the
 static branch can be updated more than once...
 
 I'll merge your comments, and fix this.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 00/11] Request for Inclusion: kmem controller for memcg. [message #47600 is a reply to message #47389] | Fri, 17 August 2012 21:37   |  
			| 
				
				
					|  Ying Han Messages: 8
 Registered: February 2012
 | Junior Member |  |  |  
	| On Thu, Aug 9, 2012 at 6:01 AM, Glauber Costa <glommer@parallels.com> wrote: > Hi,
 >
 > This is the first part of the kernel memory controller for memcg. It has been
 > discussed many times, and I consider this stable enough to be on tree. A follow
 > up to this series are the patches to also track slab memory. They are not
 > included here because I believe we could benefit from merging them separately
 > for better testing coverage. If there are any issues preventing this to be
 > merged, let me know. I'll be happy to address them.
 >
 > The slab patches are also mature in my self evaluation and could be merged not
 > too long after this. For the reference, the last discussion about them happened
 > at http://lwn.net/Articles/508087/
 >
 > A (throwaway) git tree with them is placed at:
 >
 >         git://github.com/glommer/linux.git kmemcg-slab
 
 I would like to make a kernel on the tree and run some perf tests on
 it. However the kernel
 doesn't boot due to "divide error: 0000 [#1] SMP".
 https://lkml.org/lkml/2012/5/21/502
 
 I believe the issue has been fixed ( didn't look through) and can you
 do a rebase on your tree?
 
 --Ying
 
 >
 > A general explanation of what this is all about follows:
 >
 > The kernel memory limitation mechanism for memcg concerns itself with
 > disallowing potentially non-reclaimable allocations to happen in exaggerate
 > quantities by a particular set of processes (cgroup). Those allocations could
 > create pressure that affects the behavior of a different and unrelated set of
 > processes.
 >
 > Its basic working mechanism is to annotate some allocations with the
 > _GFP_KMEMCG flag. When this flag is set, the current process allocating will
 > have its memcg identified and charged against. When reaching a specific limit,
 > further allocations will be denied.
 >
 > One example of such problematic pressure that can be prevented by this work is
 > a fork bomb conducted in a shell. We prevent it by noting that processes use a
 > limited amount of stack pages. Seen this way, a fork bomb is just a special
 > case of resource abuse. If the offender is unable to grab more pages for the
 > stack, no new processes can be created.
 >
 > There are also other things the general mechanism protects against. For
 > example, using too much of pinned dentry and inode cache, by touching files an
 > leaving them in memory forever.
 >
 > In fact, a simple:
 >
 > while true; do mkdir x; cd x; done
 >
 > can halt your system easily because the file system limits are hard to reach
 > (big disks), but the kernel memory is not. Those are examples, but the list
 > certainly don't stop here.
 >
 > An important use case for all that, is concerned with people offering hosting
 > services through containers. In a physical box we can put a limit to some
 > resources, like total number of processes or threads. But in an environment
 > where each independent user gets its own piece of the machine, we don't want a
 > potentially malicious user to destroy good users' services.
 >
 > This might be true for systemd as well, that now groups services inside
 > cgroups. They generally want to put forward a set of guarantees that limits the
 > running service in a variety of ways, so that if they become badly behaved,
 > they won't interfere with the rest of the system.
 >
 > There is, of course, a cost for that. To attempt to mitigate that, static
 > branches are used to make sure that even if the feature is compiled in with
 > potentially a lot of memory cgroups deployed this code will only be enabled
 > after the first user of this service configures any limit. Limits lower than
 > the user limit effectively means there is a separate kernel memory limit that
 > may be reached independently than the user limit. Values equal or greater than
 > the user limit implies only that kernel memory is tracked. This provides a
 > unified vision of "maximum memory", be it kernel or user memory. Because this
 > is all default-off, existing deployments will see no change in behavior.
 >
 > Glauber Costa (9):
 >   memcg: change defines to an enum
 >   kmem accounting basic infrastructure
 >   Add a __GFP_KMEMCG flag
 >   memcg: kmem controller infrastructure
 >   mm: Allocate kernel pages to the right memcg
 >   memcg: disable kmem code when not in use.
 >   memcg: propagate kmem limiting information to children
 >   memcg: allow a memcg with kmem charges to be destructed.
 >   protect architectures where THREAD_SIZE >= PAGE_SIZE against fork
 >     bombs
 >
 > Suleiman Souhlal (2):
 >   memcg: Make it possible to use the stock for more than one page.
 >   memcg: Reclaim when more than one page needed.
 >
 >  include/linux/gfp.h         |  10 +-
 >  include/linux/memcontrol.h  |  82 ++++++++
 >  include/linux/thread_info.h |   2 +
 >  kernel/fork.c               |   4 +-
 >  mm/memcontrol.c             | 443 +++++++++++++++++++++++++++++++++++++++++---
 >  mm/page_alloc.c             |  38 ++++
 >  6 files changed, 546 insertions(+), 33 deletions(-)
 >
 > --
 > 1.7.11.2
 >
 > --
 > To unsubscribe, send a message with 'unsubscribe linux-mm' in
 > the body to majordomo@kvack.org.  For more info on Linux MM,
 > see: http://www.linux-mm.org/ .
 > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH v2 06/11] memcg: kmem controller infrastructure [message #47611 is a reply to message #47545] | Mon, 20 August 2012 13:36   |  
			| 
				
				
					|  KAMEZAWA Hiroyuki Messages: 463
 Registered: September 2006
 | Senior Member |  |  |  
	| (2012/08/16 2:00), Glauber Costa wrote: > On 08/15/2012 08:38 PM, Greg Thelen wrote:
 >> On Wed, Aug 15 2012, Glauber Costa wrote:
 >>
 >>> On 08/14/2012 10:58 PM, Greg Thelen wrote:
 >>>> On Mon, Aug 13 2012, Glauber Costa wrote:
 >>>>
 >>>>>>>> +	WARN_ON(mem_cgroup_is_root(memcg));
 >>>>>>>> +	size = (1 << order) << PAGE_SHIFT;
 >>>>>>>> +	memcg_uncharge_kmem(memcg, size);
 >>>>>>>> +	mem_cgroup_put(memcg);
 >>>>>> Why do we need ref-counting here ? kmem res_counter cannot work as
 >>>>>> reference ?
 >>>>> This is of course the pair of the mem_cgroup_get() you commented on
 >>>>> earlier. If we need one, we need the other. If we don't need one, we
 >>>>> don't need the other =)
 >>>>>
 >>>>> The guarantee we're trying to give here is that the memcg structure will
 >>>>> stay around while there are dangling charges to kmem, that we decided
 >>>>> not to move (remember: moving it for the stack is simple, for the slab
 >>>>> is very complicated and ill-defined, and I believe it is better to treat
 >>>>> all kmem equally here)
 >>>>
 >>>> By keeping memcg structures hanging around until the last referring kmem
 >>>> page is uncharged do such zombie memcg each consume a css_id and thus
 >>>> put pressure on the 64k css_id space?  I imagine in pathological cases
 >>>> this would prevent creation of new cgroups until these zombies are
 >>>> dereferenced.
 >>>
 >>> Yes, but although this patch makes it more likely, it doesn't introduce
 >>> that. If the tasks, for instance, grab a reference to the cgroup dentry
 >>> in the filesystem (like their CWD, etc), they will also keep the cgroup
 >>> around.
 >>
 >> Fair point.  But this doesn't seems like a feature.  It's probably not
 >> needed initially, but what do you think about creating a
 >> memcg_kernel_context structure which is allocated when memcg is
 >> allocated?  Kernel pages charged to a memcg would have
 >> page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
 >> would allow the mem_cgroup and its css_id to be deleted when the cgroup
 >> is unlinked from cgroupfs while allowing for the active kernel pages to
 >> continue pointing to a valid memcg_kernel_context.  This would be a
 >> reference counted structure much like you are doing with memcg.  When a
 >> memcg is deleted the memcg_kernel_context would be linked into its
 >> surviving parent memcg.  This would avoid needing to visit each kernel
 >> page.
 >
 > You need more, you need at the res_counters to stay around as well. And
 > probably other fields.
 >
 > So my fear here is that as you add fields to that structure, you can
 > defeat a bit the goal of reducing memory consumption. Still leaves the
 > css space, yes. But by doing this we can introduce some subtle bugs by
 > having a field in the wrong structure.
 >
 
 Hm, can't we free css_id and delete css structure from the css_id idr tree
 when a memcg goes zombie ?
 
 Thanks,
 -Kame
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 06/11] memcg: kmem controller infrastructure [message #47627 is a reply to message #47611] | Mon, 20 August 2012 15:29   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/20/2012 05:36 PM, Kamezawa Hiroyuki wrote: > (2012/08/16 2:00), Glauber Costa wrote:
 >> On 08/15/2012 08:38 PM, Greg Thelen wrote:
 >>> On Wed, Aug 15 2012, Glauber Costa wrote:
 >>>
 >>>> On 08/14/2012 10:58 PM, Greg Thelen wrote:
 >>>>> On Mon, Aug 13 2012, Glauber Costa wrote:
 >>>>>
 >>>>>>>>> +    WARN_ON(mem_cgroup_is_root(memcg));
 >>>>>>>>> +    size = (1 << order) << PAGE_SHIFT;
 >>>>>>>>> +    memcg_uncharge_kmem(memcg, size);
 >>>>>>>>> +    mem_cgroup_put(memcg);
 >>>>>>> Why do we need ref-counting here ? kmem res_counter cannot work as
 >>>>>>> reference ?
 >>>>>> This is of course the pair of the mem_cgroup_get() you commented on
 >>>>>> earlier. If we need one, we need the other. If we don't need one, we
 >>>>>> don't need the other =)
 >>>>>>
 >>>>>> The guarantee we're trying to give here is that the memcg
 >>>>>> structure will
 >>>>>> stay around while there are dangling charges to kmem, that we decided
 >>>>>> not to move (remember: moving it for the stack is simple, for the
 >>>>>> slab
 >>>>>> is very complicated and ill-defined, and I believe it is better to
 >>>>>> treat
 >>>>>> all kmem equally here)
 >>>>>
 >>>>> By keeping memcg structures hanging around until the last referring
 >>>>> kmem
 >>>>> page is uncharged do such zombie memcg each consume a css_id and thus
 >>>>> put pressure on the 64k css_id space?  I imagine in pathological cases
 >>>>> this would prevent creation of new cgroups until these zombies are
 >>>>> dereferenced.
 >>>>
 >>>> Yes, but although this patch makes it more likely, it doesn't introduce
 >>>> that. If the tasks, for instance, grab a reference to the cgroup dentry
 >>>> in the filesystem (like their CWD, etc), they will also keep the cgroup
 >>>> around.
 >>>
 >>> Fair point.  But this doesn't seems like a feature.  It's probably not
 >>> needed initially, but what do you think about creating a
 >>> memcg_kernel_context structure which is allocated when memcg is
 >>> allocated?  Kernel pages charged to a memcg would have
 >>> page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
 >>> would allow the mem_cgroup and its css_id to be deleted when the cgroup
 >>> is unlinked from cgroupfs while allowing for the active kernel pages to
 >>> continue pointing to a valid memcg_kernel_context.  This would be a
 >>> reference counted structure much like you are doing with memcg.  When a
 >>> memcg is deleted the memcg_kernel_context would be linked into its
 >>> surviving parent memcg.  This would avoid needing to visit each kernel
 >>> page.
 >>
 >> You need more, you need at the res_counters to stay around as well. And
 >> probably other fields.
 >>
 >> So my fear here is that as you add fields to that structure, you can
 >> defeat a bit the goal of reducing memory consumption. Still leaves the
 >> css space, yes. But by doing this we can introduce some subtle bugs by
 >> having a field in the wrong structure.
 >>
 >
 > Hm, can't we free css_id and delete css structure from the css_id idr tree
 > when a memcg goes zombie ?
 >
 > Thanks,
 > -Kame
 
 Kame,
 
 I wrote a patch that does exactly that. Can you take a look? (I posted
 it already)
 I actually need to go back to it, because greg seems to be right saying
 that that will break things for memsw. But a simplified version may work.
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH v2 10/11] memcg: allow a memcg with kmem charges to be destructed. [message #47637 is a reply to message #47399] | Tue, 21 August 2012 08:22   |  
			| 
				
				
					|  Michal Hocko Messages: 109
 Registered: December 2011
 | Senior Member |  |  |  
	| On Thu 09-08-12 17:01:18, Glauber Costa wrote: > Because the ultimate goal of the kmem tracking in memcg is to track slab
 > pages as well, we can't guarantee that we'll always be able to point a
 > page to a particular process, and migrate the charges along with it -
 > since in the common case, a page will contain data belonging to multiple
 > processes.
 >
 > Because of that, when we destroy a memcg, we only make sure the
 > destruction will succeed by discounting the kmem charges from the user
 > charges when we try to empty the cgroup.
 
 This changes the semantic of memory.force_empty file because the usage
 should be 0 on success but it will show kmem usage in fact now. I guess
 it is inevitable with u+k accounting so you should be explicit about
 that and also update the documentation. If some tests (I am not 100%
 sure but I guess LTP) rely on that then they could be fixed by checking
 the kmem limit as well.
 
 > Signed-off-by: Glauber Costa <glommer@parallels.com>
 > Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 > CC: Christoph Lameter <cl@linux.com>
 > CC: Pekka Enberg <penberg@cs.helsinki.fi>
 > CC: Michal Hocko <mhocko@suse.cz>
 > CC: Johannes Weiner <hannes@cmpxchg.org>
 > CC: Suleiman Souhlal <suleiman@google.com>
 > ---
 >  mm/memcontrol.c | 17 ++++++++++++++++-
 >  1 file changed, 16 insertions(+), 1 deletion(-)
 >
 > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 > index 3d30b79..7c1ea49 100644
 > --- a/mm/memcontrol.c
 > +++ b/mm/memcontrol.c
 > @@ -649,6 +649,11 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
 >  {
 >  	if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
 >  		static_key_slow_dec(&memcg_kmem_enabled_key);
 > +	/*
 > +	 * This check can't live in kmem destruction function,
 > +	 * since the charges will outlive the cgroup
 > +	 */
 > +	WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
 >  }
 >  #else
 >  static void disarm_kmem_keys(struct mem_cgroup *memcg)
 > @@ -4005,6 +4010,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
 >  	int node, zid, shrink;
 >  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 >  	struct cgroup *cgrp = memcg->css.cgroup;
 > +	u64 usage;
 >
 >  	css_get(&memcg->css);
 >
 > @@ -4038,8 +4044,17 @@ move_account:
 >  		mem_cgroup_end_move(memcg);
 >  		memcg_oom_recover(memcg);
 >  		cond_resched();
 > +		/*
 > +		 * Kernel memory may not necessarily be trackable to a specific
 > +		 * process. So they are not migrated, and therefore we can't
 > +		 * expect their value to drop to 0 here.
 > +		 *
 > +		 * having res filled up with kmem only is enough
 > +		 */
 > +		usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
 > +			res_counter_read_u64(&memcg->kmem, RES_USAGE);
 >  	/* "ret" should also be checked to ensure all lists are empty. */
 > -	} while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);
 > +	} while (usage > 0 || ret);
 >  out:
 >  	css_put(&memcg->css);
 >  	return ret;
 > --
 > 1.7.11.2
 >
 > --
 > To unsubscribe from this list: send the line "unsubscribe cgroups" in
 > the body of a message to majordomo@vger.kernel.org
 > More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 --
 Michal Hocko
 SUSE Labs
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH v2 09/11] memcg: propagate kmem limiting information to children [message #47639 is a reply to message #47638] | Tue, 21 August 2012 09:17   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/21/2012 12:35 PM, Michal Hocko wrote: > On Tue 21-08-12 09:54:30, Michal Hocko wrote:
 >> E.g. how do you handle charges you left behind? Say you charged some
 >> pages for stack?
 >
 > I got to the last patch and see how you do it. You are relying on
 > free_accounted_pages directly which doesn't check kmem_accounted and
 > uses PageUsed bit instead. So this is correct. I guess you are relying
 > on the life cycle of the object in general so other types of objects
 > should be safe as well and there shouldn't be any leaks. It is just that
 > the memcg life time is not bounded now. Will think about that.
 >
 Unless you have a better way, I believe any kind of transversal in the
 free page path is performance detrimental. So the best way is to be
 explicit and mark a specific callsite as a memcg free.
 
 As for the unbounded time, you are correct. However, I believe it is
 possible to move a lot of the work we do for free (such as freeing the
 percpu counters and the css_id itself) to an earlier time.
 
 Also, if it ever becomes a problem, it is theoretically possible to
 avoid this, by tracking the kmem pages in a per-memcg list. We would
 then transverse such list as we do for user pages, and reparent them.
 The problem is that this is also a bit space inefficient, since we can't
 reuse any more fields in page_struct for the list_head, so we'd need an
 external structure. There is a list_head + a pointer per tracked page.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 09/11] memcg: propagate kmem limiting information to children [message #47640 is a reply to message #47636] | Tue, 21 August 2012 09:22   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 08/21/2012 11:54 AM, Michal Hocko wrote: > On Fri 17-08-12 14:36:00, Glauber Costa wrote:
 >> On 08/17/2012 02:35 PM, Michal Hocko wrote:
 >>>>> But I never said that can't happen. I said (ok, I meant) the static
 >>>>> branches can't be disabled.
 >>> Ok, then I misunderstood that because the comment was there even before
 >>> static branches were introduced and it made sense to me. This is
 >>> inconsistent with what we do for user accounting because even if we set
 >>> limit to unlimitted we still account. Why should we differ here?
 >>
 >> Well, we account even without a limit for user accounting. This is a
 >> fundamental difference, no ?
 >
 > Yes, user memory accounting is either on or off all the time (switchable
 > at boot time).
 > My understanding of kmem is that the feature is off by default because
 > it brings an overhead that is worth only special use cases. And that
 > sounds good to me. I do not see a good reason to have runtime switch
 > off. It makes the code more complicated for no good reason. E.g. how do
 > you handle charges you left behind? Say you charged some pages for
 > stack?
 >
 Answered in your other e-mail. About the code complication, yes, it does
 make the code more complicated. See below.
 
 > But maybe you have a good use case for that?
 >
 Honestly, I don't. For my particular use case, this would be always on,
 and end of story. I was operating under the belief that being able to
 say "Oh, I regret", and then turning it off would be beneficial, even at
 the expense of the - self contained - complication.
 
 For the general sanity of the interface, it is also a bit simpler to say
 "if kmem is unlimited, x happens", which is a verifiable statement, than
 to have a statement that is dependent on past history. But all of those
 need of course, as you pointed out, to be traded off by the code complexity.
 
 I am fine with either, I just need a clear sign from you guys so I don't
 keep deimplementing and reimplementing this forever.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v2 11/11] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs [message #47642 is a reply to message #47395] | Tue, 21 August 2012 09:35   |  
			| 
				
				
					|  Michal Hocko Messages: 109
 Registered: December 2011
 | Senior Member |  |  |  
	| On Thu 09-08-12 17:01:19, Glauber Costa wrote: > Because those architectures will draw their stacks directly from the
 > page allocator, rather than the slab cache, we can directly pass
 > __GFP_KMEMCG flag, and issue the corresponding free_pages.
 >
 > This code path is taken when the architecture doesn't define
 > CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
 > THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the remaining
 > architectures fall in this category.
 
 quick git grep "define *THREAD_SIZE\>" arch says that there is no such
 architecture.
 
 > This will guarantee that every stack page is accounted to the memcg the
 > process currently lives on, and will have the allocations to fail if
 > they go over limit.
 >
 > For the time being, I am defining a new variant of THREADINFO_GFP, not
 > to mess with the other path. Once the slab is also tracked by memcg, we
 > can get rid of that flag.
 >
 > Tested to successfully protect against :(){ :|:& };:
 
 I guess there were no other tasks in the same group (except for the
 parent shell), right? I am asking because this should trigger memcg-oom
 but that one will usually pick up something else than the fork bomb
 which would have a small memory footprint. But that needs to be handled
 on the oom level obviously.
 
 > Signed-off-by: Glauber Costa <glommer@parallels.com>
 > Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
 > CC: Christoph Lameter <cl@linux.com>
 > CC: Pekka Enberg <penberg@cs.helsinki.fi>
 > CC: Michal Hocko <mhocko@suse.cz>
 > CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
 > CC: Johannes Weiner <hannes@cmpxchg.org>
 > CC: Suleiman Souhlal <suleiman@google.com>
 
 Reviewed-by: Michal Hocko <mhocko@suse.cz>
 
 > ---
 >  include/linux/thread_info.h | 2 ++
 >  kernel/fork.c               | 4 ++--
 >  2 files changed, 4 insertions(+), 2 deletions(-)
 >
 > diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
 > index ccc1899..e7e0473 100644
 > --- a/include/linux/thread_info.h
 > +++ b/include/linux/thread_info.h
 > @@ -61,6 +61,8 @@ extern long do_no_restart_syscall(struct restart_block *parm);
 >  # define THREADINFO_GFP		(GFP_KERNEL | __GFP_NOTRACK)
 >  #endif
 >
 > +#define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
 > +
 >  /*
 >   * flag set/clear/test wrappers
 >   * - pass TIF_xxxx constants to these functions
 > diff --git a/kernel/fork.c b/kernel/fork.c
 > index dc3ff16..b0b90c3 100644
 > --- a/kernel/fork.c
 > +++ b/kernel/fork.c
 > @@ -142,7 +142,7 @@ void __weak arch_release_thread_info(struct thread_info *ti) { }
 >  static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 >  						  int node)
 >  {
 > -	struct page *page = alloc_pages_node(node, THREADINFO_GFP,
 > +	struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
 >  					     THREAD_SIZE_ORDER);
 >
 >  	return page ? page_address(page) : NULL;
 > @@ -151,7 +151,7 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 >  static inline void free_thread_info(struct thread_info *ti)
 >  {
 >  	arch_release_thread_info(ti);
 > -	free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
 > +	free_accounted_pages((unsigned long)ti, THREAD_SIZE_ORDER);
 >  }
 >  # else
 >  static struct kmem_cache *thread_info_cache;
 > --
 > 1.7.11.2
 >
 > --
 > To unsubscribe from this list: send the line "unsubscribe cgroups" in
 > the body of a message to majordomo@vger.kernel.org
 > More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 --
 Michal Hocko
 SUSE Labs
 |  
	|  |  | 
 
 
 Current Time: Sat Oct 25 11:26:40 GMT 2025 
 Total time taken to generate the page: 0.08458 seconds |