Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48113 is a reply to message #48111]
From: Glauber Costa
Date: Thu, 27 September 2012 14:57

 On 09/27/2012 06:49 PM, Tejun Heo wrote:
 > Hello, Mel.
 >
 > On Thu, Sep 27, 2012 at 03:28:22PM +0100, Mel Gorman wrote:
 >>> In addition, how is userland supposed to know which
 >>> workload is shared kmem heavy or not?
 >>
 >> By using a bit of common sense.
 >>
 >> An application may not be able to figure this out but the administrator
 >> is going to be able to make a very educated guess. If processes running
 >> within two containers are not sharing a filesystem hierarchy for example
 >> then it'll be clear they are not sharing dentries.
 >>
 >> If there was a suspicion they were then it could be analysed with
 >> something like SystemTap probing when files are opened and see if files
 >> are being opened that are shared between containers.
 >>
 >> It's not super-easy but it's not impossible either and I fail to see why
 >> it's such a big deal for you.
 >
 > Because we're not even trying to actually solve the problem but just
 > dumping it to userland.  If dentry/inode usage is the only case we're
 > being worried about, there can be better ways to solve it or at least
 > we should strive for that.
 >
 
 Not only is it not the only case we care about, this is not even touched
 in this series. (It is only touched in the next one). This one, for
 instance, cares about the stack. The reason everything is being dumped
 into "kmem", is precisely to make things simpler. I argue that at some
 point it makes sense to draw a line, and "kmem" is a much better line
 than any fine grained control - precisely because it is conceptually
 easier to grasp.
 
 
 > Also, the problem is not that it is impossible if you know and
 > carefully plan for things beforehand (that would be one extremely
 > competent admin) but that the problem is undiscoverable.  With kmemcg
 > accounting disabled, there's no way to tell a looking cgroup the admin
 > thinks running something which doesn't tax kmem much could be
 > generating a ton without the admin ever noticing.
 >
 >>> The fact that the numbers don't really mean what they apparently
 >>> should mean.
 >>
 >> I think it is a reasonable limitation that only some kernel allocations are
 >> accounted for although I'll freely admit I'm not a cgroup or memcg user
 >> either.
 >>
 >> My understanding is that this comes down to cost -- accounting for the
 >> kernel memory usage is expensive so it is limited only to the allocations
 >> that are easy to abuse by an unprivileged process. Hence this is
 >> initially concerned with stack pages with dentries and TCP usage to
 >> follow in later patches.
 >
 > I think the cost isn't too prohibitive considering it's already using
 > memcg.  Charging / uncharging happens only as pages enter and leave
 > slab caches and the hot path overhead is essentially single
 > indirection.  Glauber's benchmark seemed pretty reasonable to me and I
 > don't yet think that warrants exposing this subtle tree of
 > configuration.
 >
 
 Only so we can get some numbers: the cost is really minor if this is all
 disabled. If this is fully enabled, it can get to some 2 or 3 %, which
 may or may not be acceptable to an application. But for me this is not
 even about cost, and that's why I haven't brought it up so far.
 
 >>> Sure, conferences are useful for building consensus but that's the
 >>> extent of it.  Sorry that I didn't realize the implications then but
 >>> conferences don't really add any finality to decisions.
 >>>
 >>> So, this seems properly crazy to me at the similar level of
 >>> use_hierarchy fiasco.  I'm gonna NACK on this.
 >>
 >> I think you're over-reacting to say the very least :|
 >
 > The part I nacked is enabling kmemcg on a populated cgroup and then
 > starting accounting from then without any apparent indication that any
 > past allocation hasn't been considered.  You end up with numbers which
 > nobody can tell what they really mean and there's no mechanism to
 > guarantee any kind of ordering between populating the cgroup and
 > configuring it and there's *no* way to find out what happened
 > afterwards either.  This is properly crazy and definitely deserves a
 > nack.
 >
 
 Mel's suggestion of not allowing this to happen once the cgroup has tasks
 takes care of this, and is something I thought of myself.
 
 This would remove this particular piece of objection, and maintain the
 per-subtree control.

Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48114 is a reply to message #48109]
From: Michal Hocko
Date: Thu, 27 September 2012 15:09

 On Thu 27-09-12 07:33:00, Tejun Heo wrote:
 > Hello, Michal.
 >
 > On Thu, Sep 27, 2012 at 02:08:06PM +0200, Michal Hocko wrote:
 > > Yes, because we have many users (basically almost all) who care only
 > > about the user memory because that's what occupies the vast majority of
 > > the memory. They usually want to isolate workload which would disrupt
 > > the global memory otherwise (e.g. backup process vs. database). You
 > > really do not want to pay an additional overhead for kmem accounting
 > > here.
 >
 > I'm not too convinced.  First of all, the overhead added by kmemcg
 > isn't big.
 
 You are probably talking about memory overhead which is indeed not that
 big (except for possible side effects such as fragmentation, which you
 mention below). But the runtime overhead is present, as far as I understand
 from what Glauber said. But, on the other hand, it is fair to say that those
 who _want_ to use the feature should pay for it.
 
 > The hot path overhead is quite minimal - it doesn't do much more than
 > indirecting one more time.  In terms of memory usage, it sure could
 > lead to a bit more fragmentation but even if it gets to several megs
 > per cgroup, I don't think that's something excessive.  So, there is
 > overhead but I don't believe it to be prohibitive.
 
 Remember that users do not want to pay even "something minimal" when the
 feature is not needed.
 
 > > > So your question for global vs local switch (that again, doesn't
 > > > exist; only a local *limit* exists) should really be posed in the
 > > > following way:  "Can two different use cases with different needs be
 > > > hosted in the same box?"
 > >
 > > I think this is a good and a relevant question. I think this boils down
 > > to whether you want to have trusted and untrusted workloads at the same
 > > machine.
 > > Trusted loads usually only need user memory accounting because kmem
 > > consumption should be really negligible (unless kernel is doing
 > > something really stupid and no kmem limit will help here).
 > > On the other hand, untrusted workloads can do nasty things that an
 > > administrator has a hard time mitigating, and setting a kmem limit can
 > > help significantly.
 > >
 > > IMHO such different loads exist on a single machine quite often (a Web
 > > server and a backup process being the most simplistic example). The per
 > > hierarchy accounting, therefore, sounds like a good idea without too
 > > much added complexity (actually the only added complexity is in the
 > > proper kmem.limit_in_bytes handling which is a single place).
 >
 > The distinction between "trusted" and "untrusted" is something
 > artificially created due to the assumed deficiency of kmemcg
 > implementation.
 
 Not really. It doesn't have anything to do with the overhead (be it
 memory or runtime). It really boils down to "do I need/want it at all".
 Why would I want to think about how much kernel memory is in use in the
 first place? Or do you think that user memory accounting should be
 deprecated?
 
 > Making things like this visible to userland is a bad
 > idea because it locks us into a place where we can't or don't need to
 > improve the said deficiencies and end up pushing the difficult
 > problems to somewhere else where it will likely be implemented in a
 > shabbier way.  There sure are cases when such approach simply cannot
 > be avoided, but I really don't think that's the case here - the
 > overhead already seems to be at an acceptable level and we're not
 > taking away the escape switch.
 >
 > This is userland visible API.
 
 I am not sure which API visible part you have in mind but
 kmem.limit_in_bytes will be there whether we go with global knob or "no
 limit no accounting" approach.
 
 > We better err on the side of being conservative than going overboard
 > with flexibility.  Even if we eventually need to make this switching
 > fully hierarchical, we really should be doing,
 >
 > 1. Implement simple global switching and look for problem cases.
 >
 > 2. Analyze them and see whether the problem case can't be solved in a
 >    better, more intelligent way.
 >
 > 3. If the problem is something structurally inherent or reasonably too
 >    difficult to solve any other way, consider dumping the problem as
 >    config parameters to userland.
 >
 > We can always expand the flexibility.  Let's do the simple thing
 > first.  As an added bonus, it would enable using static_keys for
 > accounting branches too.
 
 While I do agree with you in general, and being careful is in order in
 this area as time has shown several times, this seems to be too
 restrictive in this particular case.
 We will save almost no code with the global knob, so I am not sure
 what we are actually saving here. A global knob will just give us
 all-or-nothing semantics without making the whole thing simpler. You will
 stick with static branches and checks of whether the group is accountable
 anyway, right?
 --
 Michal Hocko
 SUSE Labs

Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48120 is a reply to message #48113]
From: Tejun Heo
Date: Thu, 27 September 2012 17:46

 Hello,

 On Thu, Sep 27, 2012 at 06:57:59PM +0400, Glauber Costa wrote:
 > > Because we're not even trying to actually solve the problem but just
 > > dumping it to userland.  If dentry/inode usage is the only case we're
 > > being worried about, there can be better ways to solve it or at least
 > > we should strive for that.
 >
 > Not only is it not the only case we care about, this is not even touched
 > in this series. (It is only touched in the next one). This one, for
 > instance, cares about the stack. The reason everything is being dumped
 > into "kmem", is precisely to make things simpler. I argue that at some
 > point it makes sense to draw a line, and "kmem" is a much better line
 > than any fine grained control - precisely because it is conceptually
 > easier to grasp.
 
 Can you please give other examples of cases where this type of issue
 exists (plenty of shared kernel data structure which is inherent to
 the workload at hand)?  Until now, this has been the only example for
 this type of issues.
 
 > > I think the cost isn't too prohibitive considering it's already using
 > > memcg.  Charging / uncharging happens only as pages enter and leave
 > > slab caches and the hot path overhead is essentially single
 > > indirection.  Glauber's benchmark seemed pretty reasonable to me and I
 > > don't yet think that warrants exposing this subtle tree of
 > > configuration.
 >
 > Only so we can get some numbers: the cost is really minor if this is all
 > disabled. If this is fully enabled, it can get to some 2 or 3 %, which
 > may or may not be acceptable to an application. But for me this is not
 > even about cost, and that's why I haven't brought it up so far
 
 It seems like Mel's concern is mostly based on performance overhead
 concerns tho.
 
 > > The part I nacked is enabling kmemcg on a populated cgroup and then
 > > starting accounting from then without any apparent indication that any
 > > past allocation hasn't been considered.  You end up with numbers which
 > > nobody can tell what they really mean and there's no mechanism to
 > > guarantee any kind of ordering between populating the cgroup and
 > > configuring it and there's *no* way to find out what happened
 > > afterwards either.  This is properly crazy and definitely deserves a
 > > nack.
 > >
 >
 > Mel's suggestion of not allowing this to happen once the cgroup has tasks
 > takes care of this, and is something I thought of myself.
 
 You mean Michal's?  It should also disallow switching if there are
 children cgroups, right?
 
 > This would remove this particular piece of objection, and maintain the
 > per-subtree control.
 
 Yeah, I don't see anything broken with that although I'll try to argue
 for simpler one a bit more.
 
 Thanks.
 
 --
 tejun

Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48123 is a reply to message #48112]
From: Glauber Costa
Date: Thu, 27 September 2012 18:30

 On 09/27/2012 06:58 PM, Tejun Heo wrote:
 > Hello, Mel.
 >
 > On Thu, Sep 27, 2012 at 03:43:07PM +0100, Mel Gorman wrote:
 >>> I'm not too convinced.  First of all, the overhead added by kmemcg
 >>> isn't big.
 >>
 >> Really?
 >>
 >> If kmemcg was globally accounted then every __GFP_KMEMCG allocation in
 >> the page allocator potentially ends up down in
 >> __memcg_kmem_newpage_charge which
 >>
 >> 1. takes RCU read lock
 >> 2. looks up cgroup from task
 >> 3. takes a reference count
 >> 4. memcg_charge_kmem -> __mem_cgroup_try_charge
 >> 5. release reference count
 >>
 >> That's a *LOT* of work to incur for cgroups that do not care about kernel
 >> accounting. This is why I thought it was reasonable that the kmem accounting
 >> not be global.
 >
 > But that happens only when pages enter and leave slab and if it still
 > is significant, we can try to further optimize charging.  Given that
 > this is only for cases where memcg is already in use and we provide a
 > switch to disable it globally, I really don't think this warrants
 > implementing fully hierarchy configuration.
 >
 
 Not totally true. We still have to match every allocation to the right
 cache, and that is actually our heaviest hit, responsible for the 2, 3 %
 we're seeing when this is enabled. It is the kind of path so hot that
 people frown upon branches being added, so I don't think we'll ever get
 this close to being free.
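
 To make the cost argument concrete, here is a rough user-space model of
 the charge sequence Mel lists above, with a per-cgroup opt-in check in
 front of it. It is only a sketch: every toy_* name is hypothetical, the
 RCU read lock (step 1) has no user-space equivalent here and is left
 out, and the real charge is reduced to a bounds check. It also does not
 model the per-allocation cache matching that Glauber names as the
 heaviest cost.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a memory cgroup; field names are illustrative only. */
struct toy_memcg {
	bool kmem_accounted;	/* per-cgroup opt-in to kmem accounting */
	int refcount;
	size_t kmem_usage;	/* bytes currently charged */
	size_t kmem_limit;	/* maximum bytes that may be charged */
};

struct toy_task {
	struct toy_memcg *memcg;
};

/* Counts how many times the full charge sequence actually ran. */
static unsigned long toy_slow_path_calls;

/*
 * Model of charging a new page to the task's cgroup.
 * Returns true if the allocation may proceed.
 */
static bool toy_newpage_charge(struct toy_task *task, size_t bytes)
{
	struct toy_memcg *cg;
	bool ok;

	assert(task != NULL);
	cg = task->memcg;		/* step 2: look up cgroup from task */

	/* Groups that did not opt in skip everything below. */
	if (cg == NULL || !cg->kmem_accounted)
		return true;

	toy_slow_path_calls++;
	cg->refcount++;			/* step 3: take a reference */

	/* step 4: charge, refusing anything that would exceed the limit */
	ok = cg->kmem_usage <= cg->kmem_limit &&
	     bytes <= cg->kmem_limit - cg->kmem_usage;
	if (ok)
		cg->kmem_usage += bytes;

	cg->refcount--;			/* step 5: release the reference */
	return ok;
}
```

 In this model a task in a group that never opted in pays for one pointer
 load and one branch, while a task in an accounted group runs the whole
 sequence on every charged page.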

Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48124 is a reply to message #48120]
From: Glauber Costa
Date: Thu, 27 September 2012 18:45

 On 09/27/2012 09:46 PM, Tejun Heo wrote:
 > Hello,
 >
 > On Thu, Sep 27, 2012 at 06:57:59PM +0400, Glauber Costa wrote:
 >>> Because we're not even trying to actually solve the problem but just
 >>> dumping it to userland.  If dentry/inode usage is the only case we're
 >>> being worried about, there can be better ways to solve it or at least
 >>> we should strive for that.
 >>
 >> Not only is it not the only case we care about, this is not even touched
 >> in this series. (It is only touched in the next one). This one, for
 >> instance, cares about the stack. The reason everything is being dumped
 >> into "kmem", is precisely to make things simpler. I argue that at some
 >> point it makes sense to draw a line, and "kmem" is a much better line
 >> than any fine grained control - precisely because it is conceptually
 >> easier to grasp.
 >
 > Can you please give other examples of cases where this type of issue
 > exists (plenty of shared kernel data structure which is inherent to
 > the workload at hand)?  Until now, this has been the only example for
 > this type of issues.
 >
 
 Yes. the namespace related caches (*), all kinds of sockets and network
 structures, other file system structures like file struct, vm areas, and
 pretty much everything a full container does.
 
 (*) we run full userspace, so we have namespaces + cgroups combination.
 
 >>> I think the cost isn't too prohibitive considering it's already using
 >>> memcg.  Charging / uncharging happens only as pages enter and leave
 >>> slab caches and the hot path overhead is essentially single
 >>> indirection.  Glauber's benchmark seemed pretty reasonable to me and I
 >>> don't yet think that warrants exposing this subtle tree of
 >>> configuration.
 >>
 >> Only so we can get some numbers: the cost is really minor if this is all
 >> disabled. If this is fully enabled, it can get to some 2 or 3 %, which
 >> may or may not be acceptable to an application. But for me this is not
 >> even about cost, and that's why I haven't brought it up so far
 >
 > It seems like Mel's concern is mostly based on performance overhead
 > concerns tho.
 >
 >>> The part I nacked is enabling kmemcg on a populated cgroup and then
 >>> starting accounting from then without any apparent indication that any
 >>> past allocation hasn't been considered.  You end up with numbers which
 >>> nobody can tell what they really mean and there's no mechanism to
 >>> guarantee any kind of ordering between populating the cgroup and
 >>> configuring it and there's *no* way to find out what happened
 >>> afterwards either.  This is properly crazy and definitely deserves a
 >>> nack.
 >>>
 >>
 >> Mel's suggestion of not allowing this to happen once the cgroup has tasks
 >> takes care of this, and is something I thought of myself.
 >
 > You mean Michal's?  It should also disallow switching if there are
 > children cgroups, right?
 >
 
 No, I meant Mel, quoting this:
 
 "Further I would expect that an administrator would be aware of these
 limitations and set kmem_accounting at cgroup creation time before any
 processes start. Maybe that should be enforced but it's not a
 fundamental problem."
 
 But I guess it is pretty much the same thing Michal proposes, in essence.
 
 Or IOW, if your concern is with the fact that charges may have happened
 in the past before this is enabled, we can make sure this cannot happen
 by disallowing the limit to be set if currently unset (value changes are
 obviously fine) if you have children or any tasks already in the group.
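
 The rule in the paragraph above can be sketched in a few lines of
 user-space C. This is a toy model, not the patch itself: the toy_*
 names, the KMEM_LIMIT_UNSET sentinel and the choice of -EBUSY as the
 error code are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for "no limit set"; not the kernel's constant. */
#define KMEM_LIMIT_UNSET UINT64_MAX

/* Toy model of a memory cgroup; field names are illustrative only. */
struct toy_memcg {
	uint64_t kmem_limit;		/* KMEM_LIMIT_UNSET until first set */
	unsigned int nr_tasks;		/* tasks currently in the group */
	unsigned int nr_children;	/* child cgroups below this one */
};

/* A group is accounted once a limit has been written to it. */
static bool toy_kmem_accounted(const struct toy_memcg *cg)
{
	return cg->kmem_limit != KMEM_LIMIT_UNSET;
}

/*
 * Set the kmem limit. The first write (unset -> set) switches accounting
 * on, so it is refused when the group already has tasks or children:
 * their earlier allocations would never have been charged. Changing an
 * existing limit is always allowed.
 */
static int toy_set_kmem_limit(struct toy_memcg *cg, uint64_t limit)
{
	assert(limit != KMEM_LIMIT_UNSET);

	if (!toy_kmem_accounted(cg) &&
	    (cg->nr_tasks > 0 || cg->nr_children > 0))
		return -EBUSY;	/* assumed error code */

	cg->kmem_limit = limit;
	return 0;
}
```

 With this check, a group is either accounted from the moment its first
 task arrives or not accounted at all, so the numbers cannot silently
 miss past allocations.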

Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48138 is a reply to message #48124]
From: Tejun Heo
Date: Sun, 30 September 2012 07:57

 Hello, Glauber.

 On Thu, Sep 27, 2012 at 10:45:01PM +0400, Glauber Costa wrote:
 > > Can you please give other examples of cases where this type of issue
 > > exists (plenty of shared kernel data structure which is inherent to
 > > the workload at hand)?  Until now, this has been the only example for
 > > this type of issues.
 >
 > Yes. the namespace related caches (*), all kinds of sockets and network
 > structures, other file system structures like file struct, vm areas, and
 > pretty much everything a full container does.
 >
 > (*) we run full userspace, so we have namespaces + cgroups combination.
 
 This is probably me being dumb but wouldn't resources used by full
 namespaces be mostly independent?  Which parts get shared?  Also, if
 you do full namespace, isn't it more likely that you would want fuller
 resource isolation too?
 
 > >> Mel's suggestion of not allowing this to happen once the cgroup has tasks
 > >> takes care of this, and is something I thought of myself.
 > >
 > > You mean Michal's?  It should also disallow switching if there are
 > > children cgroups, right?
 >
 > No, I meant Mel, quoting this:
 >
 > "Further I would expect that an administrator would be aware of these
 > limitations and set kmem_accounting at cgroup creation time before any
 > processes start. Maybe that should be enforced but it's not a
 > fundamental problem."
 >
 > But I guess it is pretty much the same thing Michal proposes, in essence.
 >
 > Or IOW, if your concern is with the fact that charges may have happened
 > in the past before this is enabled, we can make sure this cannot happen
 > by disallowing the limit to be set if currently unset (value changes are
 > obviously fine) if you have children or any tasks already in the group.
 
 Yeah, please do that.
 
 Thanks.
 
 --
 tejun

Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48139 is a reply to message #48138]
From: Tejun Heo
Date: Sun, 30 September 2012 08:02

 On Sun, Sep 30, 2012 at 04:57:00PM +0900, Tejun Heo wrote:
 > On Thu, Sep 27, 2012 at 10:45:01PM +0400, Glauber Costa wrote:
 > > > Can you please give other examples of cases where this type of issue
 > > > exists (plenty of shared kernel data structure which is inherent to
 > > > the workload at hand)?  Until now, this has been the only example for
 > > > this type of issues.
 > >
 > > Yes. the namespace related caches (*), all kinds of sockets and network
 > > structures, other file system structures like file struct, vm areas, and
 > > pretty much everything a full container does.
 > >
 > > (*) we run full userspace, so we have namespaces + cgroups combination.
 >
 > This is probably me being dumb but wouldn't resources used by full
 > namespaces be mostly independent?  Which parts get shared?  Also, if
 > you do full namespace, isn't it more likely that you would want fuller
 > resource isolation too?
 
 Just a thought about dentry/inode.  Would it make sense to count total
 number of references per cgroup and charge the total amount according
 to that?  Reference counts are how the shared ownership is represented
 after all.  Counting total per cgroup isn't accurate and pathological
 cases could be weird tho.
 
 Thanks.
 
 --
 tejun

Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48140 is a reply to message #48123]
From: Tejun Heo
Date: Sun, 30 September 2012 08:23

 Hello, Glauber.

 On Thu, Sep 27, 2012 at 10:30:36PM +0400, Glauber Costa wrote:
 > > But that happens only when pages enter and leave slab and if it still
 > > is significant, we can try to further optimize charging.  Given that
 > > this is only for cases where memcg is already in use and we provide a
 > > switch to disable it globally, I really don't think this warrants
 > > implementing fully hierarchy configuration.
 >
 > Not totally true. We still have to match every allocation to the right
 > cache, and that is actually our heaviest hit, responsible for the 2, 3 %
 > we're seeing when this is enabled. It is the kind of path so hot that
 > people frown upon branches being added, so I don't think we'll ever get
 > this close to being free.
 
 Sure, depending on workload, any addition to alloc/free could be
 noticeable.  I don't know.  I'll write more about it when replying to
 Michal's message.  BTW, __memcg_kmem_get_cache() does seem a bit
 heavy.  I wonder whether indexing from cache side would make it
 cheaper?  e.g. something like the following.
 
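 struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 {
 	struct kmem_cache *c;

 	/* the per-cpu index selects the current memcg's copy of this cache */
 	c = cachep->memcg_params->caches[percpu_read(kmemcg_slab_idx)];
 	if (likely(c))
 		return c;
 	/* try to create and then fall back to cachep */
 	return cachep;
 }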
 
 where kmemcg_slab_idx is updated from sched notifier (or maybe add and
 use current->kmemcg_slab_idx?).  You would still need __GFP_* and
 in_interrupt() tests but current->mm and PF_KTHREAD tests can be
 rolled into index selection.
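
 As a user-space illustration of the lookup suggested above, the sketch
 below indexes per-memcg copies from the cache side. It is a toy model
 under stated assumptions: the toy_* types, the fixed-size array and the
 use of -1 for "not accounted" are invented here, and cache creation is
 left out, so a missing copy simply falls back to the root cache.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_MEMCGS 4	/* arbitrary bound for the sketch */

/* Toy slab cache: the root cache owns an array of per-memcg copies. */
struct toy_cache {
	const char *name;
	struct toy_cache *memcg_caches[TOY_MAX_MEMCGS];	/* NULL until created */
};

/* Toy task carrying a precomputed index, in the spirit of
 * current->kmemcg_slab_idx; -1 stands for "not accounted". */
struct toy_task {
	int kmemcg_slab_idx;
};

/*
 * Pick the cache to allocate from: the memcg's own copy if one exists,
 * otherwise the root cache. The hot path is one array index and one
 * NULL test.
 */
static struct toy_cache *toy_get_cache(struct toy_cache *cachep,
				       const struct toy_task *task)
{
	int idx = task->kmemcg_slab_idx;
	struct toy_cache *c;

	assert(cachep != NULL);

	/* Kernel threads, mm-less tasks and so on would carry idx == -1. */
	if (idx < 0 || idx >= TOY_MAX_MEMCGS)
		return cachep;

	c = cachep->memcg_caches[idx];
	if (c != NULL)
		return c;

	/* Creation of the per-memcg copy is omitted; fall back to cachep. */
	return cachep;
}
```

 Folding the "is this task accounted at all" tests into the stored index
 is what keeps them out of the allocation path in this model.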
 
 Thanks.
 
 --
 tejun

Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48142 is a reply to message #48114]
From: Tejun Heo
Date: Sun, 30 September 2012 08:47

 Hello, Michal.

 On Thu, Sep 27, 2012 at 05:09:50PM +0200, Michal Hocko wrote:
 > On Thu 27-09-12 07:33:00, Tejun Heo wrote:
 > > I'm not too convinced.  First of all, the overhead added by kmemcg
 > > isn't big.
 >
 > You are probably talking about memory overhead which is indeed not that
 > big (except for possible side effects such as fragmentation, which you
 > mention below). But the runtime overhead is present, as far as I understand
 > from what Glauber said. But, on the other hand, it is fair to say that those
 > who _want_ to use the feature should pay for it.
 
 Yeah, as long as the overhead is reasonable and it doesn't affect
 non-users, I think we should put more emphasis on simplicity.  cgroup
 is pretty hairy (from both implementation and interface POVs) to begin
 with.  Unfortunately, what's reasonable or how much more emphasis
 varies widely depending on who one asks.
 
 > > The hot path overhead is quite minimal - it doesn't do much more than
 > > indirecting one more time.  In terms of memory usage, it sure could
 > > lead to a bit more fragmentation but even if it gets to several megs
 > > per cgroup, I don't think that's something excessive.  So, there is
 > > overhead but I don't believe it to be prohibitive.
 >
 > Remember that users do not want to pay even "something minimal" when the
 > feature is not needed.
 
 Yeah but, if we can get it down to, say, around 1% under most
 workloads for memcg users, it is quite questionable to introduce full
 hierarchical configuration to allow avoiding that, isn't it?
 
 > > The distinction between "trusted" and "untrusted" is something
 > > artificially created due to the assumed deficiency of kmemcg
 > > implementation.
 >
 > Not really. It doesn't have anything to do with the overhead (be it
 > memory or runtime). It really boils down to "do I need/want it at all".
 > Why would I want to think about how much kernel memory is in use in the
 > first place? Or do you think that user memory accounting should be
 > deprecated?
 
 But you can apply the same "do I need/want it at all" question to the
 configuration parameter too.  I can see your point but the decision
 seems muddy to me, and if muddy, I prefer to err on the side of being
 too conservative.
 
 > > This is userland visible API.
 >
 > I am not sure which API visible part you have in mind but
 > kmem.limit_in_bytes will be there whether we go with global knob or "no
 > limit no accounting" approach.
 
 I mean full hierarchical configuration of it.  It becomes something
 which each memcg user cares about instead of something which the base
 system / admin flips on system boot.
 
 > > We can always expand the flexibility.  Let's do the simple thing
 > > first.  As an added bonus, it would enable using static_keys for
 > > accounting branches too.
 >
 > While I do agree with you in general, and being careful is in order in
 > this area as time has shown several times, this seems to be too
 > restrictive in this particular case.
 > We will save almost no code with the global knob, so I am not sure
 > what we are actually saving here. A global knob will just give us
 > all-or-nothing semantics without making the whole thing simpler. You will
 > stick with static branches and checks of whether the group is accountable
 > anyway, right?
 
 The thing is that about the same argument can be made about .use_hierarchy
 too.  It doesn't necessarily make the code much hairier.  Especially
 because the code is structured with that feature in mind, removing
 .use_hierarchy might not remove a whole lot of code; however, the wider
 range of behavior which got exposed through that poses a much larger
 problem when we try to make modifications on related behaviors.  We
 get a lot more locked down by seemingly not too much code and our long
 term maintainability / sustainability suffers as a result.
 
 I'm not trying to say this is as bad as .use_hierarchy but want to
 point out that memcg and cgroup in general have had pretty strong
 tendency to choose overly flexible and complex designs and interfaces
 and it's probably about time we become more careful especially about
 stuff which is visible to userland.
 
 Thanks.
 
 --
 tejun

Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48143 is a reply to message #48139]
From: James Bottomley
Date: Sun, 30 September 2012 08:56

 On Sun, 2012-09-30 at 17:02 +0900, Tejun Heo wrote:
 > On Sun, Sep 30, 2012 at 04:57:00PM +0900, Tejun Heo wrote:
 > > On Thu, Sep 27, 2012 at 10:45:01PM +0400, Glauber Costa wrote:
 > > > > Can you please give other examples of cases where this type of issue
 > > > > exists (plenty of shared kernel data structure which is inherent to
 > > > > the workload at hand)?  Until now, this has been the only example for
 > > > > this type of issues.
 > > >
 > > > Yes. the namespace related caches (*), all kinds of sockets and network
 > > > structures, other file system structures like file struct, vm areas, and
 > > > pretty much everything a full container does.
 > > >
 > > > (*) we run full userspace, so we have namespaces + cgroups combination.
 > >
 > > This is probably me being dumb but wouldn't resources used by full
 > > namespaces be mostly independent?  Which parts get shared?  Also, if
 > > you do full namespace, isn't it more likely that you would want fuller
 > > resource isolation too?
 >
 > Just a thought about dentry/inode.  Would it make sense to count total
 > number of references per cgroup and charge the total amount according
 > to that?  Reference counts are how the shared ownership is represented
 > after all.  Counting total per cgroup isn't accurate and pathological
 > cases could be weird tho.
 
 The beancounter approach originally used by OpenVZ does exactly this.
 There are two specific problems, though.  Firstly, you can't count
 references in generic code, so now you have to extend the cgroup
 tentacles into every object, an invasiveness which people didn't really
 like.  Secondly, split accounting causes oddities too: your total
 kernel memory usage can appear to go down even though you do nothing,
 just because someone else added a share.  Worse, if someone drops the
 reference, your usage can go up, even though you did nothing, and push
 you over your limit, at which point action gets taken against the
 container.  This leads to nasty system unpredictability (the whole point
 of cgroup isolation is supposed to be preventing resource usage in one
 cgroup from affecting that in another).
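
 The arithmetic behind that oddity is easy to show with a toy model. The
 sketch below is not the beancounter code; the toy_* names are
 hypothetical, and it assumes the simplest possible split (object size
 times own references over total references) next to first-use
 accounting, where whoever touched the object first pays for all of it.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CGROUPS 4	/* arbitrary bound for the sketch */

/* A shared kernel object (say, a dentry) and who holds references. */
struct toy_object {
	size_t size;				/* bytes the object occupies */
	unsigned int refs[TOY_MAX_CGROUPS];	/* references held per cgroup */
	int first_user;				/* cgroup that first used it */
};

/* Split accounting: each cgroup pays size * own_refs / total_refs. */
static size_t toy_split_charge(const struct toy_object *obj, int cg)
{
	unsigned int total = 0;
	int i;

	assert(cg >= 0 && cg < TOY_MAX_CGROUPS);
	for (i = 0; i < TOY_MAX_CGROUPS; i++)
		total += obj->refs[i];
	if (total == 0)
		return 0;
	return obj->size * obj->refs[cg] / total;
}

/* First-use accounting: the first user pays for the whole object. */
static size_t toy_first_use_charge(const struct toy_object *obj, int cg)
{
	assert(cg >= 0 && cg < TOY_MAX_CGROUPS);
	return cg == obj->first_user ? obj->size : 0;
}
```

 Take a 4096-byte object referenced once each by cgroups 0 and 1: under
 the split, each is charged 2048 bytes. When cgroup 1 drops its
 reference, cgroup 0's charge doubles to 4096 although cgroup 0 did
 nothing. Under first-use accounting cgroup 0 is charged 4096 bytes the
 whole time and cgroup 1 is charged nothing, which is stable but can be
 far from the load each group really causes.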
 
 We discussed this pretty heavily at the Containers Mini Summit in Santa
 Rosa.  The emergent consensus was that no-one really likes first use
 accounting, but it does solve all the problems and it has the fewest
 unexpected side effects.
 
 If you have an alternative that wasn't considered then, I'm sure
 everyone would be interested, but it isn't split accounting.
 
 James

Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48144 is a reply to message #48143]
From: Tejun Heo
Date: Sun, 30 September 2012 10:37

 Hello, James.

 On Sun, Sep 30, 2012 at 09:56:28AM +0100, James Bottomley wrote:
 > The beancounter approach originally used by OpenVZ does exactly this.
 > There are two specific problems, though, firstly you can't count
 > references in generic code, so now you have to extend the cgroup
 > tentacles into every object, an invasiveness which people didn't really
 > like.
 
 Yeah, it will need some hooks.  For dentry and inode, I think it would
 be pretty well isolated tho.  Wasn't it?
 
 > Secondly split accounting causes oddities too, like your total
 > kernel memory usage can appear to go down even though you do nothing
 > just because someone else added a share.  Worse, if someone drops the
 > reference, your usage can go up, even though you did nothing, and push
 > you over your limit, at which point action gets taken against the
 > container.  This leads to nasty system unpredictability (The whole point
 > of cgroup isolation is supposed to be preventing resource usage in one
 > cgroup from affecting that in another).
 
 In a sense, the fluctuating amount is the actual resource burden the
 cgroup is putting on the system, so maybe it just needs to be handled
 better or maybe we should charge fixed amount per refcnt?  I don't
 know.
 
 > We discussed this pretty heavily at the Containers Mini Summit in Santa
 > Rosa.  The emergent consensus was that no-one really likes first use
 > accounting, but it does solve all the problems and it has the fewest
 > unexpected side effects.
 
 But that's like fitting the problem to the mechanism.  Maybe that is
 the best which can be done, but the side effect there is way-off
 accounting under pretty common workload, which sounds pretty nasty to
 me.
 
 Thanks.
 
 --
 tejun
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48145 is a reply to message #48144] | Sun, 30 September 2012 11:25   |  
			| 
				
				
					|  James Bottomley Messages: 17
 Registered: May 2006
 | Junior Member |  |  |  
	| On Sun, 2012-09-30 at 19:37 +0900, Tejun Heo wrote:
 > Hello, James.
 >
 > On Sun, Sep 30, 2012 at 09:56:28AM +0100, James Bottomley wrote:
 > > The beancounter approach originally used by OpenVZ does exactly this.
 > > There are two specific problems, though, firstly you can't count
 > > references in generic code, so now you have to extend the cgroup
 > > tentacles into every object, an invasiveness which people didn't really
 > > like.
 >
 > Yeah, it will need some hooks.  For dentry and inode, I think it would
 > be pretty well isolated tho.  Wasn't it?
 
 But you've got to ask yourself who cares about accurate accounting per
 container of dentry and inode objects? They're not objects that any
 administrator is used to limiting.  What we at Parallels care about
 isn't accurately accounting them, it's that one container can't DoS
 another by exhausting system resources.  That's achieved equally well by
 first charge slab accounting, so we don't really have an interest in
 pushing object accounting code for which there's no use case.
 
 > > Secondly split accounting causes oddities too, like your total
 > > kernel memory usage can appear to go down even though you do nothing
 > > just because someone else added a share.  Worse, if someone drops the
 > > reference, your usage can go up, even though you did nothing, and push
 > > you over your limit, at which point action gets taken against the
 > > container.  This leads to nasty system unpredictability (The whole point
 > > of cgroup isolation is supposed to be preventing resource usage in one
 > > cgroup from affecting that in another).
 >
 > In a sense, the fluctuating amount is the actual resource burden the
 > cgroup is putting on the system, so maybe it just needs to be handled
 > better or maybe we should charge fixed amount per refcnt?  I don't
 > know.
 
 Yes, we considered single charge per reference accounting as well
 (although I don't believe anyone went as far as producing an
 implementation).  The problem here is that the sum of the container
 resources no longer bears any relation to the host consumption.  This
 makes it very difficult for virtualisation orchestration systems to make
 accurate decisions when doing dynamic resource scheduling (DRS).
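
The mismatch can be shown with a few lines of arithmetic. This is only a sketch of a hypothetical "charge every referencing container the full object size" scheme; both function names are made up for illustration.

```c
#include <stddef.h>

/* Sum of per-container charges when each of nr_refs containers is
 * charged the full object size for holding a reference. */
size_t per_ref_total_charged(size_t obj_size, unsigned nr_refs)
{
    return obj_size * nr_refs;
}

/* Bytes the host actually spends on the object, however many
 * containers share it. */
size_t host_consumption(size_t obj_size, unsigned nr_refs)
{
    return nr_refs ? obj_size : 0;
}
```

Three containers sharing one 4096-byte object would be charged 12288 bytes in total while the host only spends 4096, so summing container usage overstates what the host has consumed.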
 
 Conversely, as ugly as you think it is, first use accounting is actually
 pretty good at identifying problem containers (at least with regard to
 memory usage) for DRS because containers which are stretching the memory
 tend to accumulate the greatest number of first charges over the system
 lifetime.
 
 > > We discussed this pretty heavily at the Containers Mini Summit in Santa
 > > Rosa.  The emergent consensus was that no-one really likes first use
 > > accounting, but it does solve all the problems and it has the fewest
 > > unexpected side effects.
 >
 > But that's like fitting the problem to the mechanism.  Maybe that is
 > the best which can be done, but the side effect there is way-off
 > accounting under pretty common workload, which sounds pretty nasty to
 > me.
 
 All we need kernel memory accounting and limiting for is DoS prevention.
 There aren't really any system administrators who care about Kernel
 Memory accounting (at least until the system goes oom) because there are
 no absolute knobs for it (all there is is a set of weird and wonderful
 heuristics, like dirty limit ratio and drop caches).  Kernel memory
 usage has a whole set of regulatory infrastructure for trying to make it
 transparent to the user.
 
 Don't get me wrong: if there were some easy way to get proper memory
 accounting for free, we'd be happy but, because it has no practical
 application for any of our customers, there's a limited price we're
 willing to pay to get it.
 
 James
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48150 is a reply to message #48145] | Mon, 01 October 2012 00:57   |  
			| 
				
				
					|  Tejun Heo Messages: 184
 Registered: November 2006
 | Senior Member |  |  |  
	| Hello, James. 
 On Sun, Sep 30, 2012 at 12:25:52PM +0100, James Bottomley wrote:
 > But you've got to ask yourself who cares about accurate accounting per
 > container of dentry and inode objects? They're not objects that any
 > administrator is used to limiting.  What we at Parallels care about
 > isn't accurately accounting them, it's that one container can't DoS
 > another by exhausting system resources.  That's achieved equally well by
 > first charge slab accounting, so we don't really have an interest in
 > pushing object accounting code for which there's no use case.
 
 Isn't it more because the use cases you have in mind don't share
 dentries/inodes too much?  Wildly incorrect accounting definitely
 degrades container isolation and can lead to unexpected behaviors.
 
 > All we need kernel memory accounting and limiting for is DoS prevention.
 > There aren't really any system administrators who care about Kernel
 > Memory accounting (at least until the system goes oom) because there are
 > no absolute knobs for it (all there is are a set of weird and wonderful
 > heuristics, like dirty limit ratio and drop caches).  Kernel memory
 
 I think that's because the mechanism currently doesn't exist.  If one
 wants to control how memory is distributed across different cgroups,
 it's logical to control kernel memory too.  The resource in question
 is the actual memory after all.  I think at least Google would be
 interested in it, so, no, I don't agree that nobody wants it.  If that
 is the case, we're working towards the wrong direction.
 
 > usage has a whole set of regulatory infrastructure for trying to make it
 > transparent to the user.
 >
 > Don't get me wrong: if there were some easy way to get proper memory
 > accounting for free, we'd be happy but, because it has no practical
 > application for any of our customers, there's a limited price we're
 > willing to pay to get it.
 
 Even on purely technical ground, it could be that first-use is the
 right trade off if other more accurate approaches are too difficult
 and most workloads are happy with such approach.  I'm still a bit
 wary of basing userland interface decisions on that tho.
 
 Thanks.
 
 --
 tejun
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48152 is a reply to message #48138] | Mon, 01 October 2012 08:36   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 09/30/2012 11:57 AM, Tejun Heo wrote:
 > Hello, Glauber.
 >
 > On Thu, Sep 27, 2012 at 10:45:01PM +0400, Glauber Costa wrote:
 >>> Can you please give other examples of cases where this type of issue
 >>> exists (plenty of shared kernel data structure which is inherent to
 >>> the workload at hand)?  Until now, this has been the only example for
 >>> this type of issues.
 >>
 >> Yes. the namespace related caches (*), all kinds of sockets and network
 >> structures, other file system structures like file struct, vm areas, and
 >> pretty much everything a full container does.
 >>
 >> (*) we run full userspace, so we have namespaces + cgroups combination.
 >
 > This is probably me being dumb but wouldn't resources used by full
 > namespaces be mostly independent?  Which parts get shared?  Also, if
 > you do full namespace, isn't it more likely that you would want fuller
 > resource isolation too?
 >
 
 Not necessarily. Namespaces are pretty flexible. If you are using the
 network namespace, for instance, you can create interfaces, routes,
 addresses, etc. But because this deals with the network only, there is
 nothing unreasonable in saying that your webserver and database live in
 the same network (which is a different network namespace), but are
 entitled to different memory limits - which is the cgroups realm. With
 application-only containers being championed these days by multiple
 users, I would expect this situation to become non-negligible.
 
 The full container scenario, of course, is very different. Most of the
 accesses tend to be local.
 
 I will second what Michal said, since I believe this is also very
 important: User memory is completely under the control of the application,
 while kernel memory is not, and never will be. It is perfectly fine to
 imagine applications that want their memory to be limited by a very
 predictable amount, and there is no reason to believe that those
 cannot live in the same box as full containers - the biggest example of
 kmem-interested folks. It could be, for instance, that a management tool
 for containers lives in there, and that application wants to be umem
 limited but not kmem limited. (If it goes touching files and data inside
 each container, for instance, it is obviously not local.)
 
 >>>> Mel's suggestion of not allowing this to happen once the cgroup has tasks
 >>>> takes care of this, and is something I thought of myself.
 >>>
 >>> You mean Michal's?  It should also disallow switching if there are
 >>> children cgroups, right?
 >>
 >> No, I meant Mel, quoting this:
 >>
 >> "Further I would expect that an administrator would be aware of these
 >> limitations and set kmem_accounting at cgroup creation time before any
 >> processes start. Maybe that should be enforced but it's not a
 >> fundamental problem."
 >>
 >> But I guess it is pretty much the same thing Michal proposes, in essence.
 >>
 >> Or IOW, if your concern is with the fact that charges may have happened
 >> in the past before this is enabled, we can make sure this cannot happen
 >> by disallowing the limit to be set if currently unset (value changes are
 >> obviously fine) if you have children or any tasks already in the group.
 >
 > Yeah, please do that.
 >
 
 I did already, patches soon! =)
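
For readers following along, the rule agreed on above can be sketched roughly as follows. This is a simplified userspace model, not the actual patch: `struct memcg_model` and `kmem_limit_write_allowed()` are hypothetical names, and returning `-EBUSY` is an assumption about how a refusal might be reported.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified view of a memory cgroup. */
struct memcg_model {
    bool kmem_limit_set;   /* has a kmem limit been set before? */
    unsigned nr_tasks;     /* tasks currently attached */
    unsigned nr_children;  /* child cgroups */
};

/* Returns 0 if the kmem limit may be written, -EBUSY otherwise. */
int kmem_limit_write_allowed(const struct memcg_model *cg)
{
    /* Changing the value of an already-set limit is always fine. */
    if (cg->kmem_limit_set)
        return 0;
    /* Enabling for the first time is refused once the group is
     * populated, so no charges can predate the accounting. */
    if (cg->nr_tasks || cg->nr_children)
        return -EBUSY;
    return 0;
}
```

The point of the check is that nothing can have been charged "in the past" if the limit can only be turned on while the group is still empty.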
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48153 is a reply to message #48150] | Mon, 01 October 2012 08:43   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 10/01/2012 04:57 AM, Tejun Heo wrote:
 > Hello, James.
 >
 > On Sun, Sep 30, 2012 at 12:25:52PM +0100, James Bottomley wrote:
 >> But you've got to ask yourself who cares about accurate accounting per
 >> container of dentry and inode objects? They're not objects that any
 >> administrator is used to limiting.  What we at Parallels care about
 >> isn't accurately accounting them, it's that one container can't DoS
 >> another by exhausting system resources.  That's achieved equally well by
 >> first charge slab accounting, so we don't really have an interest in
 >> pushing object accounting code for which there's no use case.
 >
 > Isn't it more because the use cases you have in mind don't share
 > dentries/inodes too much?  Wildly incorrect accounting definitely
 > degrades container isolation and can lead to unexpected behaviors.
 >
 >> All we need kernel memory accounting and limiting for is DoS prevention.
 >> There aren't really any system administrators who care about Kernel
 >> Memory accounting (at least until the system goes oom) because there are
 >> no absolute knobs for it (all there is are a set of weird and wonderful
 >> heuristics, like dirty limit ratio and drop caches).  Kernel memory
 >
 > I think that's because the mechanism currently doesn't exist.  If one
 > wants to control how memory is distributed across different cgroups,
 > it's logical to control kernel memory too.  The resource in question
 > is the actual memory after all.  I think at least Google would be
 > interested in it, so, no, I don't agree that nobody wants it.  If that
 > is the case, we're working towards the wrong direction.
 >
 >> usage has a whole set of regulatory infrastructure for trying to make it
 >> transparent to the user.
 >>
 >> Don't get me wrong: if there were some easy way to get proper memory
 >> accounting for free, we'd be happy but, because it has no practical
 >> application for any of our customers, there's a limited price we're
 >> willing to pay to get it.
 >
 > Even on purely technical ground, it could be that first-use is the
 > right trade off if other more accurate approaches are too difficult
 > and most workloads are happy with such approach.  I'm still a bit
 > wary of basing userland interface decisions on that tho.
 >
 
 For the record, user memory also suffers a bit from being always
 constrained to first-touch accounting. Greg Thelen is working on
 alternative solutions to make first-touch accounting the default in a
 configurable environment, as he explained at the Kernel Summit.
 
 When that happens, kernel memory can take advantage of it for free.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48154 is a reply to message #48140] | Mon, 01 October 2012 08:45   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 09/30/2012 12:23 PM, Tejun Heo wrote:
 > Hello, Glauber.
 >
 > On Thu, Sep 27, 2012 at 10:30:36PM +0400, Glauber Costa wrote:
 >>> But that happens only when pages enter and leave slab and if it still
 >>> is significant, we can try to further optimize charging.  Given that
 >>> this is only for cases where memcg is already in use and we provide a
 >>> switch to disable it globally, I really don't think this warrants
 >>> implementing fully hierarchy configuration.
 >>
 >> Not totally true. We still have to match every allocation to the right
 >> cache, and that is actually our heaviest hit, responsible for the 2, 3 %
 >> we're seeing when this is enabled. It is the kind of path so hot that
 >> people frown upon branches being added, so I don't think we'll ever get
 >> this close to being free.
 >
 > Sure, depending on workload, any addition to alloc/free could be
 > noticeable.  I don't know.  I'll write more about it when replying to
 > Michal's message.  BTW, __memcg_kmem_get_cache() does seem a bit
 > heavy.  I wonder whether indexing from cache side would make it
 > cheaper?  e.g. something like the following.
 >
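 > 	struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep,
 > 						  gfp_t gfp)
 > 	{
 > 		struct kmem_cache *c;
 >
 > 		/* per-memcg copy of cachep, indexed by the current memcg slot */
 > 		c = cachep->memcg_params->caches[percpu_read(kmemcg_slab_idx)];
 > 		if (likely(c))
 > 			return c;
 > 		/* try to create and then fall back to cachep */
 > 		return cachep;
 > 	}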
 >
 > where kmemcg_slab_idx is updated from sched notifier (or maybe add and
 > use current->kmemcg_slab_idx?).  You would still need __GFP_* and
 > in_interrupt() tests but current->mm and PF_KTHREAD tests can be
 > rolled into index selection.
 >
 
 How big would this array be? There can be a lot more kmem_caches than
 there are memcgs. That is why it is done from memcg side.
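
A rough userspace model of the cache-side indexing being discussed may help. It is only a sketch under stated assumptions: `struct cache_model`, `cache_for_memcg()` and the fixed `MAX_MEMCG_IDX` are hypothetical, and the real structures are considerably more involved.

```c
#include <stddef.h>

#define MAX_MEMCG_IDX 4   /* illustrative size only */

/* Hypothetical stand-in for struct kmem_cache. */
struct cache_model {
    const char *name;
    struct cache_model *memcg_caches[MAX_MEMCG_IDX]; /* per-memcg copies */
};

/* Pick the per-memcg copy of root at slot idx, or fall back to root
 * when the slot is empty or idx is out of range. */
struct cache_model *cache_for_memcg(struct cache_model *root, int idx)
{
    if (idx < 0 || idx >= MAX_MEMCG_IDX)
        return root;
    return root->memcg_caches[idx] ? root->memcg_caches[idx] : root;
}
```

The size question is visible here: every root cache carries one pointer slot per possible memcg index, whether or not that memcg ever allocates from it.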
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48155 is a reply to message #48144] | Mon, 01 October 2012 08:46   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 09/30/2012 02:37 PM, Tejun Heo wrote:
 > Hello, James.
 >
 > On Sun, Sep 30, 2012 at 09:56:28AM +0100, James Bottomley wrote:
 >> The beancounter approach originally used by OpenVZ does exactly this.
 >> There are two specific problems, though, firstly you can't count
 >> references in generic code, so now you have to extend the cgroup
 >> tentacles into every object, an invasiveness which people didn't really
 >> like.
 >
 > Yeah, it will need some hooks.  For dentry and inode, I think it would
 > be pretty well isolated tho.  Wasn't it?
 >
 
 We would still need something for the stack, for open files, and for
 everything that becomes a potential problem. We then end up with 35
 different knobs instead of one. One of the perceived advantages of this
 approach is that it condenses as much data into a single knob as
 possible, reducing complexity and over-flexibility.
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48156 is a reply to message #48142] | Mon, 01 October 2012 09:27   |  
			| 
				
				
					|  Michal Hocko Messages: 109
 Registered: December 2011
 | Senior Member |  |  |  
	| Hi, 
 On Sun 30-09-12 17:47:50, Tejun Heo wrote:
 [...]
 > > > The hot path overhead is quite minimal - it doesn't do much more than
 > > > indirecting one more time.  In terms of memory usage, it sure could
 > > > lead to a bit more fragmentation but even if it gets to several megs
 > > > per cgroup, I don't think that's something excessive.  So, there is
 > > > overhead but I don't believe it to be prohibitive.
 > >
 > > Remember that users do not want to pay even "something minimal" when the
 > > feature is not needed.
 >
 > Yeah but, if we can get it down to, say, around 1% under most
 > workloads for memcg users, it is quite questionable to introduce full
 > hierarchical configuration to allow avoiding that, isn't it?
 
 Remember that the kmem memory is still accounted to u+k if it is enabled,
 which could be a no-go because some workloads (I gave the example that
 trusted workloads can generally ignore kernel memory overhead safely)
 simply don't want to consider additional memory which is mostly
 invisible to them.
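
The "accounted to u+k" behaviour can be sketched as a double charge. This is a simplified userspace model, not the memcg implementation: `struct counter`, `counter_charge()` and `charge_kmem()` are hypothetical names, and the rollback order is an assumption.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical byte counter with a hard limit (usage <= limit always). */
struct counter {
    size_t usage;
    size_t limit;
};

/* Add nr bytes to c, or return -ENOMEM and leave c untouched. */
int counter_charge(struct counter *c, size_t nr)
{
    if (nr > c->limit - c->usage)
        return -ENOMEM;
    c->usage += nr;
    return 0;
}

/* Charge a kernel allocation to the kmem counter and to the combined
 * user+kernel (u+k) counter; undo the first charge if the second fails. */
int charge_kmem(struct counter *kmem, struct counter *total, size_t nr)
{
    int ret = counter_charge(kmem, nr);

    if (ret)
        return ret;
    ret = counter_charge(total, nr);
    if (ret)
        kmem->usage -= nr;
    return ret;
}
```

In this model a group sitting close to its u+k limit with user pages can have a kernel allocation refused even though its kmem counter is nearly empty, which is the extra, mostly invisible pressure referred to above.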
 
 > > > The distinction between "trusted" and "untrusted" is something
 > > > artificially created due to the assumed deficiency of kmemcg
 > > > implementation.
 > >
 > > Not really. It doesn't have to do anything with the overhead (be it
 > > memory or runtime). It really boils down to "do I need/want it at all".
 > > Why would I want to think about how much kernel memory is in use in the
 > > first place? Or do you think that user memory accounting should be
 > > deprecated?
 >
 > But you can apply the same "do I need/want it at all" question to the
 > configuration parameter too.
 
 Yes but, as I've said, the global configuration parameter is too
 coarse. You can have a mix of untrusted workloads (e.g. a web server,
 which is inherently untrusted) and trusted ones (local batch jobs which
 just need special LRU aging) on the same machine.
 
 > I can see your point but the decision seems muddy to me, and if muddy,
 > I prefer to err on the side of being too conservative.
 >
 > > > This is userland visible API.
 > >
 > > I am not sure which API visible part you have in mind but
 > > kmem.limit_in_bytes will be there whether we go with global knob or "no
 > > limit no accounting" approach.
 >
 > I mean full hierarchical configuration of it.  It becomes something
 > which each memcg user cares about instead of something which the base
 > system / admin flips on system boot.
 >
 > > > We can always expand the flexibility.  Let's do the simple thing
 > > > first.  As an added bonus, it would enable using static_keys for
 > > > accounting branches too.
 > >
 > > While I do agree with you in general, and being careful is in order in
 > > this area as time has shown several times, this seems to be too restrictive
 > > in this particular case.
 > > We will save almost no code with the global knob so I am not sure
 > > what we are actually saving here. Global knob will just give us all or
 > > nothing semantic without making the whole thing simpler. You will stick
 > > with static branches and checks whether the group is accountable anyway,
 > > right?
 >
 > The thing is about the same argument can be made about .use_hierarchy
 > too.  It doesn't necessarily make the code much hairier.  Especially
 > because the code is structured with that feature in mind, removing
 > .use_hierarchy might not remove a whole lot of code; however, the wider
 > range of behavior which got exposed through that poses a much larger
 > problem when we try to make modifications on related behaviors.  We
 > get a lot more locked down by seemingly not too much code and our long
 > term maintainability / sustainability suffers as a result.
 
 I think that comparing kmem accounting with use_hierarchy is not fair.
 Glauber tried to explain why already so I will not repeat it here.
 I will just mention one thing. use_hierarchy was introduced because
 hierarchies were expensive at the time. kmem accounting is about whether
 we should do u or u+k accounting. So there is a crucial difference.
 
 > I'm not trying to say this is as bad as .use_hierarchy but want to
 > point out that memcg and cgroup in general have had pretty strong
 > tendency to choose overly flexible and complex designs and interfaces
 > and it's probably about time we become more careful especially about
 > stuff which is visible to userland.
 
 That is right but I think that the current discussion shows that mixed
 workloads (kmem-disabled and kmem-enabled hierarchies) are far from
 being theoretical and a global knob is just too coarse. I am afraid we
 will see "we want that per hierarchy" requests shortly and that would
 just add new confusion where a global knob would complicate it
 considerably (do we really want an on/off/per_hierarchy global knob?).
 --
 Michal Hocko
 SUSE Labs
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48200 is a reply to message #48156] | Wed, 03 October 2012 22:43   |  
			| 
				
				
					|  Tejun Heo Messages: 184
 Registered: November 2006
 | Senior Member |  |  |  
	| Hey, Michal. 
 On Mon, Oct 01, 2012 at 11:27:56AM +0200, Michal Hocko wrote:
 > > Yeah but, if we can get it down to, say, around 1% under most
 > > workloads for memcg users, it is quite questionable to introduce full
 > > hierarchical configuration to allow avoiding that, isn't it?
 >
 > Remember that the kmem memory is still accounted to u+k if it is enabled
 > which could be a no-go because some workloads (I have provided an
 > example that those which are trusted are generally safe to ignore kernel
 > memory overhead) simply don't want to consider additional memory which
 > is mostly invisible for them.
 
 Maybe it's because my exposure to cgroup usage is different from yours
 but the argument that not accounting kernel memory is something
 inherently beneficial is lost on me.  For compatibility, overhead
 and/or implementation complexity issues, yeah, sure, we can't always
 (well more like usually) have all we want but I don't get how not
 accounting kernel memory is something inherently necessary or
 beneficial.  This is all about provisioning physical memory to
 different groups of users and memory is memory (that's why u+k makes
 sense, right?).  Without kmemcg enabled, we not only lack a way to
 control kernel memory usage but also a way to even watch per-group
 usages.
 
 > > But you can apply the same "do I need/want it at all" question to the
 > > configuration parameter too.
 >
 > Yes but, as I've said, the global configuration parameter is too
 > coarse. You can have a mix of trusted and untrusted workloads at the
 > same machine (e.g. web server which is inherently untrusted) and trusted
 > (local batch jobs which just needs a special LRU aging).
 
 This too stems from the same difference stated above.  You think
 there's inherent distinction between trusted and untrusted workloads
 and they need different features from the kernel while I think why
 trust anyone if you can untrust everyone and consider the knob as a
 compatibility thing.
 
 > I think that comparing kmem accounting with use_hierarchy is not fair.
 > Glauber tried to explain why already so I will not repeat it here.
 > I will just mention one thing. use_hierarchy was introduced because
 > hierarchies were expensive at the time. kmem accounting is about whether
 > we should do u or u+k accounting. So there is a crucial difference.
 
 It may be less crazy but I think there are enough commonalities to at
 least make a comparison.  Mel seems to think it's mostly about
 performance overhead.  You think that not accounting kmem is something
 inherently necessary.
 
 > That is right but I think that the current discussion shows that a mixed
 > (kmem disabled and kmem enabled hierarchies) workloads are far from
 > being theoretical and a global knob is just too coarse. I am afraid we
 
 I'm not sure there's much evidence in this thread.  The strongest up to
 this point seems to be performance overhead / difficulty of general
 enough implementation.  As for "trusted" workload, what are the
 inherent benefits of trusting if you don't have to?
 
 > will see "we want that per hierarchy" requests shortly and that would
 > just add a new confusion where global knob would complicate it
 > considerably (do we really want on/off/per_hierarchy global knob?).
 
 Hmmm?  The global knob is just the same per_hierarchy knob at the
 root.  It's hierarchical after all.
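
One way to picture that claim is a knob that is inherited down the tree, so that setting it at the root covers everything. The sketch below assumes exactly that inheritance rule; `struct cg_node` and `kmem_accounting_active()` are hypothetical names, not anything from the patchset.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cgroup node with a per-node enable knob. */
struct cg_node {
    struct cg_node *parent;   /* NULL for the root */
    bool kmem_enabled;        /* knob set on this node */
};

/* Accounting applies if this group or any ancestor enabled it. */
bool kmem_accounting_active(const struct cg_node *cg)
{
    for (; cg; cg = cg->parent)
        if (cg->kmem_enabled)
            return true;
    return false;
}
```

Under this assumption, enabling the knob on an intermediate group affects only its subtree, while enabling it on the root behaves like a global switch.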
 
 Anyways, as long as the "we silently ignore what happened before being
 enabled" is gone, I won't fight this anymore.  It isn't broken after
 all.  But, please think about making things simpler in general, cgroup
 is riddled with mis-designed complexities and memcg seems to be
 leading the charge at times.
 
 Thanks.
 
 --
 tejun
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48202 is a reply to message #48155] | Wed, 03 October 2012 22:59   |  
			| 
				
				
					|  Tejun Heo Messages: 184
 Registered: November 2006
 | Senior Member |  |  |  
	| Hello, Glauber. 
 On Mon, Oct 01, 2012 at 12:46:02PM +0400, Glauber Costa wrote:
 > > Yeah, it will need some hooks.  For dentry and inode, I think it would
 > > be pretty well isolated tho.  Wasn't it?
 >
 > We would still need something for the stack. For open files, and for
 > everything that becomes a potential problem. We then end up with 35
 > different knobs instead of one. One of the perceived advantages of this
 > approach is that it condenses as much data into a single knob as
 > possible, reducing complexity and over-flexibility.
 
 Oh, I didn't mean to use object-specific counting for all of them.
 Most resources don't have such common misaccounting problem.  I mean,
 for stack, it doesn't exist by definition (other than cgroup
 migration).  There's no reason to use anything other than first-use
 kmem based accounting for them.  My point was that for particularly
 problematic ones like dentry/inode, it might be better to treat them
 differently.
 
 Thanks.
 
 --
 tejun
 |  
	|  |  |  
	| 
		
			| Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48207 is a reply to message #48201] | Thu, 04 October 2012 11:55   |  
			| 
				
				
					|  Glauber Costa Messages: 916
 Registered: October 2011
 | Senior Member |  |  |  
	| On 10/04/2012 02:54 AM, Tejun Heo wrote:
 > Hello, Glauber.
 >
 > On Mon, Oct 01, 2012 at 12:45:11PM +0400, Glauber Costa wrote:
 >>> where kmemcg_slab_idx is updated from sched notifier (or maybe add and
 >>> use current->kmemcg_slab_idx?).  You would still need __GFP_* and
 >>> in_interrupt() tests but current->mm and PF_KTHREAD tests can be
 >>> rolled into index selection.
 >>
 >> How big would this array be? There can be a lot more kmem_caches than
 >> there are memcgs. That is why it is done from memcg side.
 >
 > The total number of memcgs is pretty limited due to the ID thing,
 > right?  And kmemcg is only applied to subset of caches.  I don't think
 > the array size would be a problem in terms of memory overhead, would
 > it?  If so, RCU synchronize and dynamically grow them?
 >
 > Thanks.
 >
 
 I don't want to assume the number of memcgs will always be that limited.
 Sure, the ID limitation sounds like a pretty big one, but people doing
 VMs usually want to stack as many VMs as they possibly can in an
 environment, and the fewer things preventing that from happening, the better.
 
 That said, now that I've experimented with this a bit, indexing from the
 cache may have some advantages: it can get too complicated to propagate
 new caches appearing to all memcgs that are already in flight. We don't have
 this problem from the cache side, because instances of it are guaranteed
 not to exist at this point by definition.
 
 I don't want to bloat unrelated kmem_cache structures, so I can't embed
 a memcg array in there: I would have to have a pointer to a memcg array
 that gets assigned at first use. But if we don't want to have a static
 number, as you and Christoph already frowned upon heavily, we may have
 to do that memcg side as well.
 
 The array gets bigger, though, because it pretty much has to be enough
 to accommodate all css_ids. Even now, they are more than the 400 I used
 in this patchset. Not allocating all of them at once will lead to more
 complication and pointer chasing in here.
 
 I'll take a look at the alternatives today and tomorrow.
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48216 is a reply to message #48207] | Sat, 06 October 2012 02:19  |  
			| 
				
				
					|  Tejun Heo Messages: 184
 Registered: November 2006
 | Senior Member |  |  |  
	| Hello, Glauber. 
 On Thu, Oct 04, 2012 at 03:55:14PM +0400, Glauber Costa wrote:
 > I don't want to bloat unrelated kmem_cache structures, so I can't embed
 > a memcg array in there: I would have to have a pointer to a memcg array
 > that gets assigned at first use. But if we don't want to have a static
 > number, as you and Christoph already frowned upon heavily, we may have
 > to do that memcg side as well.
 >
 > The array gets bigger, though, because it pretty much has to be enough
 > to accommodate all css_ids. Even now, they are more than the 400 I used
 > in this patchset. Not allocating all of them at once will lead to more
 > complication and pointer chasing in here.
 
 I don't think it would require more pointer chasing.  At the simplest,
 we can just compare the array size each time.  If you wanna be more
 efficient, all arrays can be kept at the same size and resized when
 the number of memcgs crosses the current array size.  The only runtime
 overhead would be one pointer deref which I don't think can be avoided
 regardless of the indexing direction.
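
A minimal single-threaded sketch of the "compare the array size each time, grow when needed" idea follows. It is only an illustration: `struct memcg_cache_array`, `array_lookup()` and `array_grow()` are hypothetical names, and it ignores the synchronization (RCU or otherwise) that concurrent lookups would need in the kernel.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-cache array: slot i belongs to memcg index i. */
struct memcg_cache_array {
    size_t size;     /* number of slots currently allocated */
    void **caches;   /* NULL slot means "no per-memcg cache yet" */
};

/* Lookup costs one size comparison plus one pointer dereference. */
void *array_lookup(const struct memcg_cache_array *a, size_t idx)
{
    return idx < a->size ? a->caches[idx] : NULL;
}

/* Grow the array to hold at least new_size slots; returns 0 or -1. */
int array_grow(struct memcg_cache_array *a, size_t new_size)
{
    void **p;

    if (new_size <= a->size)
        return 0;
    p = realloc(a->caches, new_size * sizeof(*p));
    if (!p)
        return -1;
    /* New slots start out empty; existing entries are preserved. */
    memset(p + a->size, 0, (new_size - a->size) * sizeof(*p));
    a->caches = p;
    a->size = new_size;
    return 0;
}
```

Keeping every array at the same size would let the bounds check in `array_lookup()` be dropped, at the cost of resizing all arrays whenever the memcg count crosses the current size.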
 
 Thanks.
 
 --
 tejun
 |  
	|  |  | 
 
 