OpenVZ Forum


Home » Mailing lists » Devel » [PATCH v3 04/13] kmem accounting basic infrastructure
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48113 is a reply to message #48111] Thu, 27 September 2012 14:57 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 09/27/2012 06:49 PM, Tejun Heo wrote:
> Hello, Mel.
>
> On Thu, Sep 27, 2012 at 03:28:22PM +0100, Mel Gorman wrote:
>>> In addition, how is userland supposed to know which
>>> workload is shared kmem heavy or not?
>>
>> By using a bit of common sense.
>>
>> An application may not be able to figure this out but the administrator
>> is going to be able to make a very educated guess. If processes running
>> within two containers are not sharing a filesystem hierarchy for example
>> then it'll be clear they are not sharing dentries.
>>
>> If there was a suspicion they were then it could be analysed with
>> something like SystemTap probing when files are opened and see if files
>> are being opened that are shared between containers.
>>
>> It's not super-easy but it's not impossible either and I fail to see why
>> it's such a big deal for you.
>
> Because we're not even trying to actually solve the problem but just
> dumping it to userland. If dentry/inode usage is the only case we're
> being worried about, there can be better ways to solve it or at least
> we should strive for that.
>

Not only it is not the only case we care about, this is not even touched
in this series. (It is only touched in the next one). This one, for
instance, cares about the stack. The reason everything is being dumped
into "kmem", is precisely to make things simpler. I argue that at some
point it makes sense to draw a line, and "kmem" is a much better line
than any fine grained control - precisely because it is conceptually
easier to grasp.


> Also, the problem is not that it is impossible if you know and
> carefully plan for things beforehand (that would be one extremely
> competent admin) but that the problem is undiscoverable. With kmemcg
> accounting disabled, there's no way to tell a looking cgroup the admin
> thinks running something which doesn'ft tax kmem much could be
> generating a ton without the admin ever noticing.
>
>>> The fact that the numbers don't really mean what they apparently
>>> should mean.
>>
>> I think it is a reasonable limitation that only some kernel allocations are
>> accounted for although I'll freely admit I'm not a cgroup or memcg user
>> either.
>>
>> My understanding is that this comes down to cost -- accounting for the
>> kernel memory usage is expensive so it is limited only to the allocations
>> that are easy to abuse by an unprivileged process. Hence this is
>> initially concerned with stack pages with dentries and TCP usage to
>> follow in later patches.
>
> I think the cost isn't too prohibitive considering it's already using
> memcg. Charging / uncharging happens only as pages enter and leave
> slab caches and the hot path overhead is essentially single
> indirection. Glauber's benchmark seemed pretty reasonable to me and I
> don't yet think that warrants exposing this subtle tree of
> configuration.
>

Only so we can get some numbers: the cost is really minor if this is all
disabled. It this is fully enable, it can get to some 2 or 3 %, which
may or may not be acceptable to an application. But for me this is not
even about cost, and that's why I haven't brought it up so far

>>> Sure, conferences are useful for building consensus but that's the
>>> extent of it. Sorry that I didn't realize the implications then but
>>> conferences don't really add any finality to decisions.
>>>
>>> So, this seems properly crazy to me at the similar level of
>>> use_hierarchy fiasco. I'm gonna NACK on this.
>>
>> I think you're over-reacting to say the very least :|
>
> The part I nacked is enabling kmemcg on a populated cgroup and then
> starting accounting from then without any apparent indication that any
> past allocation hasn't been considered. You end up with numbers which
> nobody can't tell what they really mean and there's no mechanism to
> guarantee any kind of ordering between populating the cgroup and
> configuring it and there's *no* way to find out what happened
> afterwards neither. This is properly crazy and definitely deserves a
> nack.
>

Mel suggestion of not allowing this to happen once the cgroup has tasks
takes care of this, and is something I thought of myself.

This would remove this particular piece of objection, and maintain the
per-subtree control.
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48114 is a reply to message #48109] Thu, 27 September 2012 15:09 Go to previous messageGo to next message
Michal Hocko is currently offline  Michal Hocko
Messages: 109
Registered: December 2011
Senior Member
On Thu 27-09-12 07:33:00, Tejun Heo wrote:
> Hello, Michal.
>
> On Thu, Sep 27, 2012 at 02:08:06PM +0200, Michal Hocko wrote:
> > Yes, because we have many users (basically almost all) who care only
> > about the user memory because that's what occupies the vast majority of
> > the memory. They usually want to isolate workload which would disrupt
> > the global memory otherwise (e.g. backup process vs. database). You
> > really do not want to pay an additional overhead for kmem accounting
> > here.
>
> I'm not too convinced. First of all, the overhead added by kmemcg
> isn't big.

You are probably talking about memory overhead which is indeed not that
big (except for possible side effects as fragmentation which you mention
bellow). But the runtime overhead is present, as far as I understand from
what Glauber said. But, on the other hand, it is fair to say that those
who _want_ use the feature should pay for it.

> The hot path overhead is quite minimal - it doesn't do much more than
> indirecting one more time. In terms of memory usage, it sure could
> lead to a bit more fragmentation but even if it gets to several megs
> per cgroup, I don't think that's something excessive. So, there is
> overhead but I don't believe it to be prohibitive.

Remember that users do not want to pay even "something minimal" when the
feature is not needed.

> > > So your question for global vs local switch (that again, doesn't
> > > exist; only a local *limit* exists) should really be posed in the
> > > following way: "Can two different use cases with different needs be
> > > hosted in the same box?"
> >
> > I think this is a good and a relevant question. I think this boils down
> > to whether you want to have trusted and untrusted workloads at the same
> > machine.
> > Trusted loads usually only need user memory accounting because kmem
> > consumption should be really negligible (unless kernel is doing
> > something really stupid and no kmem limit will help here).
> > On the other hand, untrusted workloads can do nasty things that
> > administrator has hard time to mitigate and setting a kmem limit can
> > help significantly.
> >
> > IMHO such a different loads exist on a single machine quite often (Web
> > server and a back up process as the most simplistic one). The per
> > hierarchy accounting, therefore, sounds like a good idea without too
> > much added complexity (actually the only added complexity is in the
> > proper kmem.limit_in_bytes handling which is a single place).
>
> The distinction between "trusted" and "untrusted" is something
> artificially created due to the assumed deficiency of kmemcg
> implementation.

Not really. It doesn't have to do anything with the overhead (be it
memory or runtime). It really boils down to "do I need/want it at all".
Why would I want to think about how much kernel memory is in use in the
first place? Or do you think that user memory accounting should be
deprecated?

> Making things like this visible to userland is a bad
> idea because it locks us into a place where we can't or don't need to
> improve the said deficiencies and end up pushing the difficult
> problems to somewhere else where it will likely be implemented in a
> shabbier way. There sure are cases when such approach simply cannot
> be avoided, but I really don't think that's the case here - the
> overhead already seems to be at an acceptable level and we're not
> taking away the escape switch.
>
> This is userland visible API.

I am not sure which API visible part you have in mind but
kmem.limit_in_bytes will be there whether we go with global knob or "no
limit no accounting" approach.

> We better err on the side of being conservative than going overboard
> with flexibility. Even if we eventually need to make this switching
> fullly hierarchical, we really should be doing,
>
> 1. Implement simple global switching and look for problem cases.
>
> 2. Analyze them and see whether the problem case can't be solved in a
> better, more intelligent way.
>
> 3. If the problem is something structurally inherent or reasonably too
> difficult to solve any other way, consider dumping the problem as
> config parameters to userland.
>
> We can always expand the flexibility. Let's do the simple thing
> first. As an added bonus, it would enable using static_keys for
> accounting branches too.

While I do agree with you in general and being careful is at place in
this area as time shown several times, this seems to be too restrictive
in this particular case.
We won't save almost no code with the global knob so I am not sure
what we are actually saving here. Global knob will just give us all or
nothing semantic without making the whole thing simpler. You will stick
with static branches and checkes whether the group accountable anyway,
right?
--
Michal Hocko
SUSE Labs
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48120 is a reply to message #48113] Thu, 27 September 2012 17:46 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
Hello,

On Thu, Sep 27, 2012 at 06:57:59PM +0400, Glauber Costa wrote:
> > Because we're not even trying to actually solve the problem but just
> > dumping it to userland. If dentry/inode usage is the only case we're
> > being worried about, there can be better ways to solve it or at least
> > we should strive for that.
>
> Not only it is not the only case we care about, this is not even touched
> in this series. (It is only touched in the next one). This one, for
> instance, cares about the stack. The reason everything is being dumped
> into "kmem", is precisely to make things simpler. I argue that at some
> point it makes sense to draw a line, and "kmem" is a much better line
> than any fine grained control - precisely because it is conceptually
> easier to grasp.

Can you please give other examples of cases where this type of issue
exists (plenty of shared kernel data structure which is inherent to
the workload at hand)? Until now, this has been the only example for
this type of issues.

> > I think the cost isn't too prohibitive considering it's already using
> > memcg. Charging / uncharging happens only as pages enter and leave
> > slab caches and the hot path overhead is essentially single
> > indirection. Glauber's benchmark seemed pretty reasonable to me and I
> > don't yet think that warrants exposing this subtle tree of
> > configuration.
>
> Only so we can get some numbers: the cost is really minor if this is all
> disabled. It this is fully enable, it can get to some 2 or 3 %, which
> may or may not be acceptable to an application. But for me this is not
> even about cost, and that's why I haven't brought it up so far

It seems like Mel's concern is mostly based on performance overhead
concerns tho.

> > The part I nacked is enabling kmemcg on a populated cgroup and then
> > starting accounting from then without any apparent indication that any
> > past allocation hasn't been considered. You end up with numbers which
> > nobody can't tell what they really mean and there's no mechanism to
> > guarantee any kind of ordering between populating the cgroup and
> > configuring it and there's *no* way to find out what happened
> > afterwards neither. This is properly crazy and definitely deserves a
> > nack.
> >
>
> Mel suggestion of not allowing this to happen once the cgroup has tasks
> takes care of this, and is something I thought of myself.

You mean Michal's? It should also disallow switching if there are
children cgroups, right?

> This would remove this particular piece of objection, and maintain the
> per-subtree control.

Yeah, I don't see anything broken with that although I'll try to argue
for simpler one a bit more.

Thanks.

--
tejun
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48121 is a reply to message #48120] Thu, 27 September 2012 17:56 Go to previous messageGo to next message
Michal Hocko is currently offline  Michal Hocko
Messages: 109
Registered: December 2011
Senior Member
On Thu 27-09-12 10:46:05, Tejun Heo wrote:
[...]
> > > The part I nacked is enabling kmemcg on a populated cgroup and then
> > > starting accounting from then without any apparent indication that any
> > > past allocation hasn't been considered. You end up with numbers which
> > > nobody can't tell what they really mean and there's no mechanism to
> > > guarantee any kind of ordering between populating the cgroup and
> > > configuring it and there's *no* way to find out what happened
> > > afterwards neither. This is properly crazy and definitely deserves a
> > > nack.
> > >
> >
> > Mel suggestion of not allowing this to happen once the cgroup has tasks
> > takes care of this, and is something I thought of myself.
>
> You mean Michal's? It should also disallow switching if there are
> children cgroups, right?

Right.

--
Michal Hocko
SUSE Labs
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48123 is a reply to message #48112] Thu, 27 September 2012 18:30 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 09/27/2012 06:58 PM, Tejun Heo wrote:
> Hello, Mel.
>
> On Thu, Sep 27, 2012 at 03:43:07PM +0100, Mel Gorman wrote:
>>> I'm not too convinced. First of all, the overhead added by kmemcg
>>> isn't big.
>>
>> Really?
>>
>> If kmemcg was globally accounted then every __GFP_KMEMCG allocation in
>> the page allocator potentially ends up down in
>> __memcg_kmem_newpage_charge which
>>
>> 1. takes RCU read lock
>> 2. looks up cgroup from task
>> 3. takes a reference count
>> 4. memcg_charge_kmem -> __mem_cgroup_try_charge
>> 5. release reference count
>>
>> That's a *LOT* of work to incur for cgroups that do not care about kernel
>> accounting. This is why I thought it was reasonable that the kmem accounting
>> not be global.
>
> But that happens only when pages enter and leave slab and if it still
> is significant, we can try to further optimize charging. Given that
> this is only for cases where memcg is already in use and we provide a
> switch to disable it globally, I really don't think this warrants
> implementing fully hierarchy configuration.
>

Not totally true. We still have to match every allocation to the right
cache, and that is actually our heaviest hit, responsible for the 2, 3 %
we're seeing when this is enabled. It is the kind of path so hot that
people frown upon branches being added, so I don't think we'll ever get
this close to being free.
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48124 is a reply to message #48120] Thu, 27 September 2012 18:45 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 09/27/2012 09:46 PM, Tejun Heo wrote:
> Hello,
>
> On Thu, Sep 27, 2012 at 06:57:59PM +0400, Glauber Costa wrote:
>>> Because we're not even trying to actually solve the problem but just
>>> dumping it to userland. If dentry/inode usage is the only case we're
>>> being worried about, there can be better ways to solve it or at least
>>> we should strive for that.
>>
>> Not only it is not the only case we care about, this is not even touched
>> in this series. (It is only touched in the next one). This one, for
>> instance, cares about the stack. The reason everything is being dumped
>> into "kmem", is precisely to make things simpler. I argue that at some
>> point it makes sense to draw a line, and "kmem" is a much better line
>> than any fine grained control - precisely because it is conceptually
>> easier to grasp.
>
> Can you please give other examples of cases where this type of issue
> exists (plenty of shared kernel data structure which is inherent to
> the workload at hand)? Until now, this has been the only example for
> this type of issues.
>

Yes. the namespace related caches (*), all kinds of sockets and network
structures, other file system structures like file struct, vm areas, and
pretty much everything a full container does.

(*) we run full userspace, so we have namespaces + cgroups combination.

>>> I think the cost isn't too prohibitive considering it's already using
>>> memcg. Charging / uncharging happens only as pages enter and leave
>>> slab caches and the hot path overhead is essentially single
>>> indirection. Glauber's benchmark seemed pretty reasonable to me and I
>>> don't yet think that warrants exposing this subtle tree of
>>> configuration.
>>
>> Only so we can get some numbers: the cost is really minor if this is all
>> disabled. It this is fully enable, it can get to some 2 or 3 %, which
>> may or may not be acceptable to an application. But for me this is not
>> even about cost, and that's why I haven't brought it up so far
>
> It seems like Mel's concern is mostly based on performance overhead
> concerns tho.
>
>>> The part I nacked is enabling kmemcg on a populated cgroup and then
>>> starting accounting from then without any apparent indication that any
>>> past allocation hasn't been considered. You end up with numbers which
>>> nobody can't tell what they really mean and there's no mechanism to
>>> guarantee any kind of ordering between populating the cgroup and
>>> configuring it and there's *no* way to find out what happened
>>> afterwards neither. This is properly crazy and definitely deserves a
>>> nack.
>>>
>>
>> Mel suggestion of not allowing this to happen once the cgroup has tasks
>> takes care of this, and is something I thought of myself.
>
> You mean Michal's? It should also disallow switching if there are
> children cgroups, right?
>

No, I meant Mel, quoting this:

"Further I would expect that an administrator would be aware of these
limitations and set kmem_accounting at cgroup creation time before any
processes start. Maybe that should be enforced but it's not a
fundamental problem."

But I guess it is pretty much the same thing Michal proposes, in essence.

Or IOW, if your concern is with the fact that charges may have happened
in the past before this is enabled, we can make sure this cannot happen
by disallowing the limit to be set if currently unset (value changes are
obviously fine) if you have children or any tasks already in the group.
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48138 is a reply to message #48124] Sun, 30 September 2012 07:57 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
Hello, Glauber.

On Thu, Sep 27, 2012 at 10:45:01PM +0400, Glauber Costa wrote:
> > Can you please give other examples of cases where this type of issue
> > exists (plenty of shared kernel data structure which is inherent to
> > the workload at hand)? Until now, this has been the only example for
> > this type of issues.
>
> Yes. the namespace related caches (*), all kinds of sockets and network
> structures, other file system structures like file struct, vm areas, and
> pretty much everything a full container does.
>
> (*) we run full userspace, so we have namespaces + cgroups combination.

This is probably me being dumb but wouldn't resources used by full
namespaces be mostly independent? Which parts get shared? Also, if
you do full namespace, isn't it more likely that you would want fuller
resource isolation too?

> >> Mel suggestion of not allowing this to happen once the cgroup has tasks
> >> takes care of this, and is something I thought of myself.
> >
> > You mean Michal's? It should also disallow switching if there are
> > children cgroups, right?
>
> No, I meant Mel, quoting this:
>
> "Further I would expect that an administrator would be aware of these
> limitations and set kmem_accounting at cgroup creation time before any
> processes start. Maybe that should be enforced but it's not a
> fundamental problem."
>
> But I guess it is pretty much the same thing Michal proposes, in essence.
>
> Or IOW, if your concern is with the fact that charges may have happened
> in the past before this is enabled, we can make sure this cannot happen
> by disallowing the limit to be set if currently unset (value changes are
> obviously fine) if you have children or any tasks already in the group.

Yeah, please do that.

Thanks.

--
tejun
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48139 is a reply to message #48138] Sun, 30 September 2012 08:02 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
On Sun, Sep 30, 2012 at 04:57:00PM +0900, Tejun Heo wrote:
> On Thu, Sep 27, 2012 at 10:45:01PM +0400, Glauber Costa wrote:
> > > Can you please give other examples of cases where this type of issue
> > > exists (plenty of shared kernel data structure which is inherent to
> > > the workload at hand)? Until now, this has been the only example for
> > > this type of issues.
> >
> > Yes. the namespace related caches (*), all kinds of sockets and network
> > structures, other file system structures like file struct, vm areas, and
> > pretty much everything a full container does.
> >
> > (*) we run full userspace, so we have namespaces + cgroups combination.
>
> This is probably me being dumb but wouldn't resources used by full
> namespaces be mostly independent? Which parts get shared? Also, if
> you do full namespace, isn't it more likely that you would want fuller
> resource isolation too?

Just a thought about dentry/inode. Would it make sense to count total
number of references per cgroup and charge the total amount according
to that? Reference counts are how the shared ownership is represented
after all. Counting total per cgroup isn't accurate and pathological
cases could be weird tho.

Thanks.

--
tejun
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48140 is a reply to message #48123] Sun, 30 September 2012 08:23 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
Hello, Glauber.

On Thu, Sep 27, 2012 at 10:30:36PM +0400, Glauber Costa wrote:
> > But that happens only when pages enter and leave slab and if it still
> > is significant, we can try to further optimize charging. Given that
> > this is only for cases where memcg is already in use and we provide a
> > switch to disable it globally, I really don't think this warrants
> > implementing fully hierarchy configuration.
>
> Not totally true. We still have to match every allocation to the right
> cache, and that is actually our heaviest hit, responsible for the 2, 3 %
> we're seeing when this is enabled. It is the kind of path so hot that
> people frown upon branches being added, so I don't think we'll ever get
> this close to being free.

Sure, depening on workload, any addition to alloc/free could be
noticeable. I don't know. I'll write more about it when replying to
Michal's message. BTW, __memcg_kmem_get_cache() does seem a bit
heavy. I wonder whether indexing from cache side would make it
cheaper? e.g. something like the following.

kmem_cache *__memcg_kmem_get_cache(cachep, gfp)
{
struct kmem_cache *c;

c = cachep->memcg_params->caches[percpu_read(kmemcg_slab_idx)];
if (likely(c))
return c;
/* try to create and then fall back to cachep */
}

where kmemcg_slab_idx is updated from sched notifier (or maybe add and
use current->kmemcg_slab_idx?). You would still need __GFP_* and
in_interrupt() tests but current->mm and PF_KTHREAD tests can be
rolled into index selection.

Thanks.

--
tejun
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48142 is a reply to message #48114] Sun, 30 September 2012 08:47 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
Hello, Michal.

On Thu, Sep 27, 2012 at 05:09:50PM +0200, Michal Hocko wrote:
> On Thu 27-09-12 07:33:00, Tejun Heo wrote:
> > I'm not too convinced. First of all, the overhead added by kmemcg
> > isn't big.
>
> You are probably talking about memory overhead which is indeed not that
> big (except for possible side effects as fragmentation which you mention
> bellow). But the runtime overhead is present, as far as I understand from
> what Glauber said. But, on the other hand, it is fair to say that those
> who _want_ use the feature should pay for it.

Yeah, as long as the overhead is reasonable and it doesn't affect
non-users, I think we should put more emphasis on simplicity. cgroup
is pretty hairy (from both implementation and interface POVs) to begin
with. Unfortunately, what's reasonable or how much more emphasis
varies widely depending on who one asks.

> > The hot path overhead is quite minimal - it doesn't do much more than
> > indirecting one more time. In terms of memory usage, it sure could
> > lead to a bit more fragmentation but even if it gets to several megs
> > per cgroup, I don't think that's something excessive. So, there is
> > overhead but I don't believe it to be prohibitive.
>
> Remember that users do not want to pay even "something minimal" when the
> feature is not needed.

Yeah but, if we can get it down to, say, around 1% under most
workloads for memcg users, it is quite questionable to introduce full
hierarchical configuration to allow avoiding that, isn't it?

> > The distinction between "trusted" and "untrusted" is something
> > artificially created due to the assumed deficiency of kmemcg
> > implementation.
>
> Not really. It doesn't have to do anything with the overhead (be it
> memory or runtime). It really boils down to "do I need/want it at all".
> Why would I want to think about how much kernel memory is in use in the
> first place? Or do you think that user memory accounting should be
> deprecated?

But you can apply the same "do I need/want it at all" question to the
configuration parameter too. I can see your point but the decision
seems muddy to me, and if muddy, I prefer to err on the side of being
too conservative.

> > This is userland visible API.
>
> I am not sure which API visible part you have in mind but
> kmem.limit_in_bytes will be there whether we go with global knob or "no
> limit no accounting" approach.

I mean full hierarchical configuration of it. It becomes something
which each memcg user cares about instead of something which the base
system / admin flips on system boot.

> > We can always expand the flexibility. Let's do the simple thing
> > first. As an added bonus, it would enable using static_keys for
> > accounting branches too.
>
> While I do agree with you in general and being careful is at place in
> this area as time shown several times, this seems to be too restrictive
> in this particular case.
> We won't save almost no code with the global knob so I am not sure
> what we are actually saving here. Global knob will just give us all or
> nothing semantic without making the whole thing simpler. You will stick
> with static branches and checkes whether the group accountable anyway,
> right?

The thing is about the same argument can be made about .use_hierarchy
too. It doesn't necessarily make the code much harier. Especially
because the code is structured with that feature on mind, removing
.use_hierarchy might not remove whole lot of code; however, the wider
range of behavior which got exposed through that poses a much larger
problem when we try to make modifications on related behaviors. We
get a lot more locked down by seemingly not too much code and our long
term maintainability / sustainability suffers as a result.

I'm not trying to say this is as bad as .use_hierarchy but want to
point out that memcg and cgroup in general have had pretty strong
tendency to choose overly flexible and complex designs and interfaces
and it's probably about time we become more careful especially about
stuff which is visible to userland.

Thanks.

--
tejun
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48143 is a reply to message #48139] Sun, 30 September 2012 08:56 Go to previous messageGo to next message
James Bottomley is currently offline  James Bottomley
Messages: 17
Registered: May 2006
Junior Member
On Sun, 2012-09-30 at 17:02 +0900, Tejun Heo wrote:
> On Sun, Sep 30, 2012 at 04:57:00PM +0900, Tejun Heo wrote:
> > On Thu, Sep 27, 2012 at 10:45:01PM +0400, Glauber Costa wrote:
> > > > Can you please give other examples of cases where this type of issue
> > > > exists (plenty of shared kernel data structure which is inherent to
> > > > the workload at hand)? Until now, this has been the only example for
> > > > this type of issues.
> > >
> > > Yes. the namespace related caches (*), all kinds of sockets and network
> > > structures, other file system structures like file struct, vm areas, and
> > > pretty much everything a full container does.
> > >
> > > (*) we run full userspace, so we have namespaces + cgroups combination.
> >
> > This is probably me being dumb but wouldn't resources used by full
> > namespaces be mostly independent? Which parts get shared? Also, if
> > you do full namespace, isn't it more likely that you would want fuller
> > resource isolation too?
>
> Just a thought about dentry/inode. Would it make sense to count total
> number of references per cgroup and charge the total amount according
> to that? Reference counts are how the shared ownership is represented
> after all. Counting total per cgroup isn't accurate and pathological
> cases could be weird tho.

The beancounter approach originally used by OpenVZ does exactly this.
There are two specific problems, though, firstly you can't count
references in generic code, so now you have to extend the cgroup
tentacles into every object, an invasiveness which people didn't really
like. Secondly split accounting causes oddities too, like your total
kernel memory usage can appear to go down even though you do nothing
just because someone else added a share. Worse, if someone drops the
reference, your usage can go up, even though you did nothing, and push
you over your limit, at which point action gets taken against the
container. This leads to nasty system unpredictability (The whole point
of cgroup isolation is supposed to be preventing resource usage in one
cgroup from affecting that in another).

We discussed this pretty heavily at the Containers Mini Summit in Santa
Rosa. The emergent consensus was that no-one really likes first use
accounting, but it does solve all the problems and it has the fewest
unexpected side effects.

If you have an alternative that wasn't considered then, I'm sure
everyone would be interested, but it isn't split accounting.

James
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48144 is a reply to message #48143] Sun, 30 September 2012 10:37 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
Hello, James.

On Sun, Sep 30, 2012 at 09:56:28AM +0100, James Bottomley wrote:
> The beancounter approach originally used by OpenVZ does exactly this.
> There are two specific problems, though, firstly you can't count
> references in generic code, so now you have to extend the cgroup
> tentacles into every object, an invasiveness which people didn't really
> like.

Yeah, it will need some hooks. For dentry and inode, I think it would
be pretty well isolated tho. Wasn't it?

> Secondly split accounting causes oddities too, like your total
> kernel memory usage can appear to go down even though you do nothing
> just because someone else added a share. Worse, if someone drops the
> reference, your usage can go up, even though you did nothing, and push
> you over your limit, at which point action gets taken against the
> container. This leads to nasty system unpredictability (The whole point
> of cgroup isolation is supposed to be preventing resource usage in one
> cgroup from affecting that in another).

In a sense, the fluctuating amount is the actual resource burden the
cgroup is putting on the system, so maybe it just needs to be handled
better or maybe we should charge fixed amount per refcnt? I don't
know.

> We discussed this pretty heavily at the Containers Mini Summit in Santa
> Rosa. The emergent consensus was that no-one really likes first use
> accounting, but it does solve all the problems and it has the fewest
> unexpected side effects.

But that's like fitting the problem to the mechanism. Maybe that is
the best which can be done, but the side effect there is way-off
accounting under pretty common workload, which sounds pretty nasty to
me.

Thanks.

--
tejun
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48145 is a reply to message #48144] Sun, 30 September 2012 11:25 Go to previous messageGo to next message
James Bottomley is currently offline  James Bottomley
Messages: 17
Registered: May 2006
Junior Member
On Sun, 2012-09-30 at 19:37 +0900, Tejun Heo wrote:
> Hello, James.
>
> On Sun, Sep 30, 2012 at 09:56:28AM +0100, James Bottomley wrote:
> > The beancounter approach originally used by OpenVZ does exactly this.
> > There are two specific problems, though, firstly you can't count
> > references in generic code, so now you have to extend the cgroup
> > tentacles into every object, an invasiveness which people didn't really
> > like.
>
> Yeah, it will need some hooks. For dentry and inode, I think it would
> be pretty well isolated tho. Wasn't it?

But you've got to ask yourself who cares about accurate accounting per
container of dentry and inode objects? They're not objects that any
administrator is used to limiting. What we at parallels care about
isn't accurately accounting them, it's that one container can't DoS
another by exhausting system resources. That's achieved equally well by
first charge slab accounting, so we don't really have an interest in
pushing object accounting code for which there's no use case.

> > Secondly split accounting causes oddities too, like your total
> > kernel memory usage can appear to go down even though you do nothing
> > just because someone else added a share. Worse, if someone drops the
> > reference, your usage can go up, even though you did nothing, and push
> > you over your limit, at which point action gets taken against the
> > container. This leads to nasty system unpredictability (The whole point
> > of cgroup isolation is supposed to be preventing resource usage in one
> > cgroup from affecting that in another).
>
> In a sense, the fluctuating amount is the actual resource burden the
> cgroup is putting on the system, so maybe it just needs to be handled
> better or maybe we should charge fixed amount per refcnt? I don't
> know.

Yes, we considered single charge per reference accounting as well
(although I don't believe anyone went as far as producing an
implementation). The problem here is now that the sum of the container
resources no-longer bears any relation to the host consumption. This
makes it very difficult for virtualisation orchestration systems to make
accurate decisions when doing dynamic resource scheduling (DRS).

Conversely, as ugly as you think it, first use accounting is actually
pretty good at identifying problem containers (at least with regard to
memory usage) for DRS because containers which are stretching the memory
tend to accumulate the greatest number of first charges over the system
lifetime.

> > We discussed this pretty heavily at the Containers Mini Summit in Santa
> > Rosa. The emergent consensus was that no-one really likes first use
> > accounting, but it does solve all the problems and it has the fewest
> > unexpected side effects.
>
> But that's like fitting the problem to the mechanism. Maybe that is
> the best which can be done, but the side effect there is way-off
> accounting under pretty common workload, which sounds pretty nasty to
> me.

All we need kernel memory accounting and limiting for is DoS prevention.
There aren't really any system administrators who care about Kernel
Memory accounting (at least until the system goes oom) because there are
no absolute knobs for it (all there is are a set of weird and wonderful
heuristics, like dirty limit ratio and drop caches). Kernel memory
usage has a whole set of regulatory infrastructure for trying to make it
transparent to the user.

Don't get me wrong: if there were some easy way to get proper memory
accounting for free, we'd be happy but, because it has no practical
application for any of our customers, there's a limited price we're
willing to pay to get it.

James
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48150 is a reply to message #48145] Mon, 01 October 2012 00:57 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
Hello, James.

On Sun, Sep 30, 2012 at 12:25:52PM +0100, James Bottomley wrote:
> But you've got to ask yourself who cares about accurate accounting per
> container of dentry and inode objects? They're not objects that any
> administrator is used to limiting. What we at parallels care about
> isn't accurately accounting them, it's that one container can't DoS
> another by exhausting system resources. That's achieved equally well by
> first charge slab accounting, so we don't really have an interest in
> pushing object accounting code for which there's no use case.

Isn't it more because the use cases you have on mind don't share
dentries/inodes too much? Wildly incorrect accounting definitely
degrades container isolation and can lead to unexpected behaviors.

> All we need kernel memory accounting and limiting for is DoS prevention.
> There aren't really any system administrators who care about Kernel
> Memory accounting (at least until the system goes oom) because there are
> no absolute knobs for it (all there is are a set of weird and wonderful
> heuristics, like dirty limit ratio and drop caches). Kernel memory

I think that's because the mechanism currently doesn't exist. If one
wants to control how memory is distributed across different cgroups,
it's logical to control kernel memory too. The resource in question
is the actual memory after all. I think at least google would be
interested in it, so, no, I don't agree that nobody wants it. If that
is the case, we're working towards the wrong direction.

> usage has a whole set of regulatory infrastructure for trying to make it
> transparent to the user.
>
> Don't get me wrong: if there were some easy way to get proper memory
> accounting for free, we'd be happy but, because it has no practical
> application for any of our customers, there's a limited price we're
> willing to pay to get it.

Even on purely technical ground, it could be that first-use is the
right trade off if other more accurate approaches are too difficult
and most workloads are happy with such approach. I'm still a bit
weary to base userland interface decisions on that tho.

Thanks.

--
tejun
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48152 is a reply to message #48138] Mon, 01 October 2012 08:36 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 09/30/2012 11:57 AM, Tejun Heo wrote:
> Hello, Glauber.
>
> On Thu, Sep 27, 2012 at 10:45:01PM +0400, Glauber Costa wrote:
>>> Can you please give other examples of cases where this type of issue
>>> exists (plenty of shared kernel data structure which is inherent to
>>> the workload at hand)? Until now, this has been the only example for
>>> this type of issues.
>>
>> Yes. the namespace related caches (*), all kinds of sockets and network
>> structures, other file system structures like file struct, vm areas, and
>> pretty much everything a full container does.
>>
>> (*) we run full userspace, so we have namespaces + cgroups combination.
>
> This is probably me being dumb but wouldn't resources used by full
> namespaces be mostly independent? Which parts get shared? Also, if
> you do full namespace, isn't it more likely that you would want fuller
> resource isolation too?
>

Not necessarily. Namespaces are pretty flexible. If you are using the
network namespace, for instance, you can create interfaces, routes,
addresses, etc. But because this deals with the network only, there is
nothing unreasonable in saying that your webserver and database lives in
the same network (which is a different network namespace), but are
entitled to different memory limits - which is cgroups realm. With
application-only containers being championed these days by multiple
users, I would expect this situation to become non-negligible.

The full container scenario, of course, is very different. Most of the
accesses tends to be local.

I will second what Michal said, since I believe this is also very
important: User memory is completely at the control of the application,
while kernel memory is not, and never will. It is perfectly fine to
imagine applications that want its memory to be limited by a very
predictable amount, and there are no reasons to believe that those
cannot live in the same box as full containers - the biggest example of
kmem interested folks. It could be, for instance, that a management tool
for containers lives in there, and that application wants to be umem
limited but not kmem limited. (If it goes touching files and data inside
each container, for instance, it is obviously not local)

>>>> Mel suggestion of not allowing this to happen once the cgroup has tasks
>>>> takes care of this, and is something I thought of myself.
>>>
>>> You mean Michal's? It should also disallow switching if there are
>>> children cgroups, right?
>>
>> No, I meant Mel, quoting this:
>>
>> "Further I would expect that an administrator would be aware of these
>> limitations and set kmem_accounting at cgroup creation time before any
>> processes start. Maybe that should be enforced but it's not a
>> fundamental problem."
>>
>> But I guess it is pretty much the same thing Michal proposes, in essence.
>>
>> Or IOW, if your concern is with the fact that charges may have happened
>> in the past before this is enabled, we can make sure this cannot happen
>> by disallowing the limit to be set if currently unset (value changes are
>> obviously fine) if you have children or any tasks already in the group.
>
> Yeah, please do that.
>

I did already, patches soon! =)
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48153 is a reply to message #48150] Mon, 01 October 2012 08:43 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 10/01/2012 04:57 AM, Tejun Heo wrote:
> Hello, James.
>
> On Sun, Sep 30, 2012 at 12:25:52PM +0100, James Bottomley wrote:
>> But you've got to ask yourself who cares about accurate accounting per
>> container of dentry and inode objects? They're not objects that any
>> administrator is used to limiting. What we at parallels care about
>> isn't accurately accounting them, it's that one container can't DoS
>> another by exhausting system resources. That's achieved equally well by
>> first charge slab accounting, so we don't really have an interest in
>> pushing object accounting code for which there's no use case.
>
> Isn't it more because the use cases you have on mind don't share
> dentries/inodes too much? Wildly incorrect accounting definitely
> degrades container isolation and can lead to unexpected behaviors.
>
>> All we need kernel memory accounting and limiting for is DoS prevention.
>> There aren't really any system administrators who care about Kernel
>> Memory accounting (at least until the system goes oom) because there are
>> no absolute knobs for it (all there is are a set of weird and wonderful
>> heuristics, like dirty limit ratio and drop caches). Kernel memory
>
> I think that's because the mechanism currently doesn't exist. If one
> wants to control how memory is distributed across different cgroups,
> it's logical to control kernel memory too. The resource in question
> is the actual memory after all. I think at least google would be
> interested in it, so, no, I don't agree that nobody wants it. If that
> is the case, we're working towards the wrong direction.
>
>> usage has a whole set of regulatory infrastructure for trying to make it
>> transparent to the user.
>>
>> Don't get me wrong: if there were some easy way to get proper memory
>> accounting for free, we'd be happy but, because it has no practical
>> application for any of our customers, there's a limited price we're
>> willing to pay to get it.
>
> Even on purely technical ground, it could be that first-use is the
> right trade off if other more accurate approaches are too difficult
> and most workloads are happy with such approach. I'm still a bit
> weary to base userland interface decisions on that tho.
>

For the record, user memory also suffers a bit from being always
constrained to first-touch accounting. Greg Thelen is working on
alternative solutions to make first-accounting the default in a
configurable environment, as he explained in the kernel summit.

When that happens, kernel memory can take advantage of it for free.
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48154 is a reply to message #48140] Mon, 01 October 2012 08:45 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 09/30/2012 12:23 PM, Tejun Heo wrote:
> Hello, Glauber.
>
> On Thu, Sep 27, 2012 at 10:30:36PM +0400, Glauber Costa wrote:
>>> But that happens only when pages enter and leave slab and if it still
>>> is significant, we can try to further optimize charging. Given that
>>> this is only for cases where memcg is already in use and we provide a
>>> switch to disable it globally, I really don't think this warrants
>>> implementing fully hierarchy configuration.
>>
>> Not totally true. We still have to match every allocation to the right
>> cache, and that is actually our heaviest hit, responsible for the 2, 3 %
>> we're seeing when this is enabled. It is the kind of path so hot that
>> people frown upon branches being added, so I don't think we'll ever get
>> this close to being free.
>
> Sure, depening on workload, any addition to alloc/free could be
> noticeable. I don't know. I'll write more about it when replying to
> Michal's message. BTW, __memcg_kmem_get_cache() does seem a bit
> heavy. I wonder whether indexing from cache side would make it
> cheaper? e.g. something like the following.
>
> kmem_cache *__memcg_kmem_get_cache(cachep, gfp)
> {
> struct kmem_cache *c;
>
> c = cachep->memcg_params->caches[percpu_read(kmemcg_slab_idx)];
> if (likely(c))
> return c;
> /* try to create and then fall back to cachep */
> }
>
> where kmemcg_slab_idx is updated from sched notifier (or maybe add and
> use current->kmemcg_slab_idx?). You would still need __GFP_* and
> in_interrupt() tests but current->mm and PF_KTHREAD tests can be
> rolled into index selection.
>

How big would this array be? there can be a lot more kmem_caches than
there are memcgs. That is why it is done from memcg side.
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48155 is a reply to message #48144] Mon, 01 October 2012 08:46 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 09/30/2012 02:37 PM, Tejun Heo wrote:
> Hello, James.
>
> On Sun, Sep 30, 2012 at 09:56:28AM +0100, James Bottomley wrote:
>> The beancounter approach originally used by OpenVZ does exactly this.
>> There are two specific problems, though, firstly you can't count
>> references in generic code, so now you have to extend the cgroup
>> tentacles into every object, an invasiveness which people didn't really
>> like.
>
> Yeah, it will need some hooks. For dentry and inode, I think it would
> be pretty well isolated tho. Wasn't it?
>

We would still need something for the stack. For open files, and for
everything that becomes a potential problem. We then end up with 35
different knobs instead of one. One of the perceived advantages of this
approach, is that it condenses as much data as a single knob as
possible, reducing complexity and over flexibility.
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48156 is a reply to message #48142] Mon, 01 October 2012 09:27 Go to previous messageGo to next message
Michal Hocko is currently offline  Michal Hocko
Messages: 109
Registered: December 2011
Senior Member
Hi,

On Sun 30-09-12 17:47:50, Tejun Heo wrote:
[...]
> > > The hot path overhead is quite minimal - it doesn't do much more than
> > > indirecting one more time. In terms of memory usage, it sure could
> > > lead to a bit more fragmentation but even if it gets to several megs
> > > per cgroup, I don't think that's something excessive. So, there is
> > > overhead but I don't believe it to be prohibitive.
> >
> > Remember that users do not want to pay even "something minimal" when the
> > feature is not needed.
>
> Yeah but, if we can get it down to, say, around 1% under most
> workloads for memcg users, it is quite questionable to introduce full
> hierarchical configuration to allow avoiding that, isn't it?

Remember that the kmem memory is still accounted to u+k if it is enabled
which could be a no-go because some workloads (I have provided an
example that those which are trusted are generally safe to ignore kernel
memory overhead) simply don't want to consider additional memory which
is mostly invisible for them.

> > > The distinction between "trusted" and "untrusted" is something
> > > artificially created due to the assumed deficiency of kmemcg
> > > implementation.
> >
> > Not really. It doesn't have to do anything with the overhead (be it
> > memory or runtime). It really boils down to "do I need/want it at all".
> > Why would I want to think about how much kernel memory is in use in the
> > first place? Or do you think that user memory accounting should be
> > deprecated?
>
> But you can apply the same "do I need/want it at all" question to the
> configuration parameter too.

Yes but, as I've said, the global configuration parameter is too
coarse. You can have a mix of trusted and untrusted workloads at the
same machine (e.g. web server which is inherently untrusted) and trusted
(local batch jobs which just needs a special LRU aging).

> I can see your point but the decision seems muddy to me, and if muddy,
> I prefer to err on the side of being too conservative.
>
> > > This is userland visible API.
> >
> > I am not sure which API visible part you have in mind but
> > kmem.limit_in_bytes will be there whether we go with global knob or "no
> > limit no accounting" approach.
>
> I mean full hierarchical configuration of it. It becomes something
> which each memcg user cares about instead of something which the base
> system / admin flips on system boot.
>
> > > We can always expand the flexibility. Let's do the simple thing
> > > first. As an added bonus, it would enable using static_keys for
> > > accounting branches too.
> >
> > While I do agree with you in general and being careful is at place in
> > this area as time shown several times, this seems to be too restrictive
> > in this particular case.
> > We won't save almost no code with the global knob so I am not sure
> > what we are actually saving here. Global knob will just give us all or
> > nothing semantic without making the whole thing simpler. You will stick
> > with static branches and checkes whether the group accountable anyway,
> > right?
>
> The thing is about the same argument can be made about .use_hierarchy
> too. It doesn't necessarily make the code much harier. Especially
> because the code is structured with that feature on mind, removing
> .use_hierarchy might not remove whole lot of code; however, the wider
> range of behavior which got exposed through that poses a much larger
> problem when we try to make modifications on related behaviors. We
> get a lot more locked down by seemingly not too much code and our long
> term maintainability / sustainability suffers as a result.

I think that comparing kmem accounting with use_hierarchy is not fair.
Glauber tried to explain why already so I will not repeat it here.
I will just mention one thing. use_hierarchy has been introduces becuase
hierarchies were expensive at the time. kmem accounting is about should
we do u or u+k accounting. So there is a crucial difference.

> I'm not trying to say this is as bad as .use_hierarchy but want to
> point out that memcg and cgroup in general have had pretty strong
> tendency to choose overly flexible and complex designs and interfaces
> and it's probably about time we become more careful especially about
> stuff which is visible to userland.

That is right but I think that the current discussion shows that a mixed
(kmem disabled and kmem enabled hierarchies) workloads are far from
being theoretical and a global knob is just too coarse. I am afraid we
will see "we want that per hierarchy" requests shortly and that would
just add a new confusion where global knob would complicate it
considerably (do we really want on/off/per_hierarchy global knob?).
--
Michal Hocko
SUSE Labs
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48200 is a reply to message #48156] Wed, 03 October 2012 22:43 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
Hey, Michal.

On Mon, Oct 01, 2012 at 11:27:56AM +0200, Michal Hocko wrote:
> > Yeah but, if we can get it down to, say, around 1% under most
> > workloads for memcg users, it is quite questionable to introduce full
> > hierarchical configuration to allow avoiding that, isn't it?
>
> Remember that the kmem memory is still accounted to u+k if it is enabled
> which could be a no-go because some workloads (I have provided an
> example that those which are trusted are generally safe to ignore kernel
> memory overhead) simply don't want to consider additional memory which
> is mostly invisible for them.

Maybe it's because my exposure to cgroup usage is different from yours
but the argument that not accounting kernel memory is something
inherently beneficial is lost on me. For compatibility, overhead
and/or implementation complexity issues, yeah, sure, we can't always
(well more like usually) have all we want but I don't get how not
accounting kernel memory is something inherently necessary or
beneficial. This is all about provisioning physical memory to
different groups of users and memory is memory (that's why u+k makes
sense, right?). Without kmemcg enabled, we not only lack a way to
control kernel memory usage but also a way to even watch per-group
usages.

> > But you can apply the same "do I need/want it at all" question to the
> > configuration parameter too.
>
> Yes but, as I've said, the global configuration parameter is too
> coarse. You can have a mix of trusted and untrusted workloads at the
> same machine (e.g. web server which is inherently untrusted) and trusted
> (local batch jobs which just needs a special LRU aging).

This too stems from the same difference stated above. You think
there's inherent distinction between trusted and untrusted workloads
and they need different features from the kernel while I think why
trust anyone if you can untrust everyone and consider the knob as a
compatibility thing.

> I think that comparing kmem accounting with use_hierarchy is not fair.
> Glauber tried to explain why already so I will not repeat it here.
> I will just mention one thing. use_hierarchy has been introduces becuase
> hierarchies were expensive at the time. kmem accounting is about should
> we do u or u+k accounting. So there is a crucial difference.

It may be less crazy but I think there are enough commonalities to at
least make a comparison. Mel seems to think it's mostly about
performance overhead. You think that not accounting kmem is something
inherently necessary.

> That is right but I think that the current discussion shows that a mixed
> (kmem disabled and kmem enabled hierarchies) workloads are far from
> being theoretical and a global knob is just too coarse. I am afraid we

I'm not sure there's much evidence in this thread. The strongest upto
this point seems to be performance overhead / difficulty of general
enough implementation. As for "trusted" workload, what are the
inherent benefits of trusting if you don't have to?

> will see "we want that per hierarchy" requests shortly and that would
> just add a new confusion where global knob would complicate it
> considerably (do we really want on/off/per_hierarchy global knob?).

Hmmm? The global knob is just the same per_hierarchy knob at the
root. It's hierarchical after all.

Anyways, as long as the "we silently ignore what happened before being
enabled" is gone, I won't fight this anymore. It isn't broken after
all. But, please think about making things simpler in general, cgroup
is riddled with mis-designed complexities and memcg seems to be
leading the charge at times.

Thanks.

--
tejun
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48201 is a reply to message #48154] Wed, 03 October 2012 22:54 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
Hello, Glauber.

On Mon, Oct 01, 2012 at 12:45:11PM +0400, Glauber Costa wrote:
> > where kmemcg_slab_idx is updated from sched notifier (or maybe add and
> > use current->kmemcg_slab_idx?). You would still need __GFP_* and
> > in_interrupt() tests but current->mm and PF_KTHREAD tests can be
> > rolled into index selection.
>
> How big would this array be? there can be a lot more kmem_caches than
> there are memcgs. That is why it is done from memcg side.

The total number of memcgs are pretty limited due to the ID thing,
right? And kmemcg is only applied to subset of caches. I don't think
the array size would be a problem in terms of memory overhead, would
it? If so, RCU synchronize and dynamically grow them?

Thanks.

--
tejun
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48202 is a reply to message #48155] Wed, 03 October 2012 22:59 Go to previous messageGo to next message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
Hello, Glauber.

On Mon, Oct 01, 2012 at 12:46:02PM +0400, Glauber Costa wrote:
> > Yeah, it will need some hooks. For dentry and inode, I think it would
> > be pretty well isolated tho. Wasn't it?
>
> We would still need something for the stack. For open files, and for
> everything that becomes a potential problem. We then end up with 35
> different knobs instead of one. One of the perceived advantages of this
> approach, is that it condenses as much data as a single knob as
> possible, reducing complexity and over flexibility.

Oh, I didn't mean to use object-specific counting for all of them.
Most resources don't have such common misaccounting problem. I mean,
for stack, it doesn't exist by definition (other than cgroup
migration). There's no reason to use anything other than first-use
kmem based accounting for them. My point was that for particularly
problematic ones like dentry/inode, it might be better to treat them
differently.

Thanks.

--
tejun
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48207 is a reply to message #48201] Thu, 04 October 2012 11:55 Go to previous messageGo to next message
Glauber Costa is currently offline  Glauber Costa
Messages: 916
Registered: October 2011
Senior Member
On 10/04/2012 02:54 AM, Tejun Heo wrote:
> Hello, Glauber.
>
> On Mon, Oct 01, 2012 at 12:45:11PM +0400, Glauber Costa wrote:
>>> where kmemcg_slab_idx is updated from sched notifier (or maybe add and
>>> use current->kmemcg_slab_idx?). You would still need __GFP_* and
>>> in_interrupt() tests but current->mm and PF_KTHREAD tests can be
>>> rolled into index selection.
>>
>> How big would this array be? there can be a lot more kmem_caches than
>> there are memcgs. That is why it is done from memcg side.
>
> The total number of memcgs are pretty limited due to the ID thing,
> right? And kmemcg is only applied to subset of caches. I don't think
> the array size would be a problem in terms of memory overhead, would
> it? If so, RCU synchronize and dynamically grow them?
>
> Thanks.
>

I don't want to assume the number of memcgs will always be that limited.
Sure, the ID limitation sounds pretty much a big one, but people doing
VMs usually want to stack as many VMs as they possibly can in an
environment, and the less things preventing that from happening, the better.

That said, now that I've experimented with this a bit, indexing from the
cache may have some advantages: it can get too complicated to propagate
new caches appearing to all memcgs that already in-flight. We don't have
this problem from the cache side, because instances of it are guaranteed
not to exist at this point by definition.

I don't want to bloat unrelated kmem_cache structures, so I can't embed
a memcg array in there: I would have to have a pointer to a memcg array
that gets assigned at first use. But if we don't want to have a static
number, as you and christoph already frowned upon heavily, we may have
to do that memcg side as well.

The array gets bigger, though, because it pretty much has to be enough
to accomodate all css_ids. Even now, they are more than the 400 I used
in this patchset. Not allocating all of them at once will lead to more
complication and pointer chasing in here.

I'll take a look at the alternatives today and tomorrow.
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48214 is a reply to message #48200] Fri, 05 October 2012 13:47 Go to previous messageGo to next message
Michal Hocko is currently offline  Michal Hocko
Messages: 109
Registered: December 2011
Senior Member
On Thu 04-10-12 07:43:16, Tejun Heo wrote:
[...]
> > That is right but I think that the current discussion shows that a mixed
> > (kmem disabled and kmem enabled hierarchies) workloads are far from
> > being theoretical and a global knob is just too coarse. I am afraid we
>
> I'm not sure there's much evidence in this thread. The strongest upto
> this point seems to be performance overhead / difficulty of general
> enough implementation. As for "trusted" workload, what are the
> inherent benefits of trusting if you don't have to?

One advantage is that you do _not have_ to consider kernel memory
allocations (which are inherently bound to the kernel version) so the
sizing is much easier and version independent. If you set a limit to XY
because you know what you are doing you certainly do not want to regress
(e.g. because of unnecessary reclaim) just because a certain kernel
allocation got bigger, right?

> > will see "we want that per hierarchy" requests shortly and that would
> > just add a new confusion where global knob would complicate it
> > considerably (do we really want on/off/per_hierarchy global knob?).
>
> Hmmm? The global knob is just the same per_hierarchy knob at the
> root. It's hierarchical after all.

When you said global knob I imagined mount or boot option. If you want
to have yet another memory.enable_kmem then IMHO it is much easier to
use set-accounted semantic (which is hierarchical as well).

> Anyways, as long as the "we silently ignore what happened before being
> enabled" is gone, I won't fight this anymore. It isn't broken after
> all.

OK, it is good that we settled this.

> But, please think about making things simpler in general, cgroup
> is riddled with mis-designed complexities and memcg seems to be
> leading the charge at times.

Yes this is an evolution and it seems that we are slowly getting there.

>
> Thanks.

--
Michal Hocko
SUSE Labs
Re: [PATCH v3 04/13] kmem accounting basic infrastructure [message #48216 is a reply to message #48207] Sat, 06 October 2012 02:19 Go to previous message
Tejun Heo is currently offline  Tejun Heo
Messages: 184
Registered: November 2006
Senior Member
Hello, Glauber.

On Thu, Oct 04, 2012 at 03:55:14PM +0400, Glauber Costa wrote:
> I don't want to bloat unrelated kmem_cache structures, so I can't embed
> a memcg array in there: I would have to have a pointer to a memcg array
> that gets assigned at first use. But if we don't want to have a static
> number, as you and christoph already frowned upon heavily, we may have
> to do that memcg side as well.
>
> The array gets bigger, though, because it pretty much has to be enough
> to accomodate all css_ids. Even now, they are more than the 400 I used
> in this patchset. Not allocating all of them at once will lead to more
> complication and pointer chasing in here.

I don't think it would require more pointer chasing. At the simplest,
we can just compare the array size each time. If you wanna be more
efficient, all arrays can be kept at the same size and resized when
the number of memcgs cross the current number. The only runtime
overhead would be one pointer deref which I don't think can be avoided
regardless of the indexing direction.

Thanks.

--
tejun
Previous Topic: [PATCH v3 00/16] slab accounting for memcg
Next Topic: [RFC PATCH 0/5] net: socket bind to file descriptor introduced
Goto Forum:
  


Current Time: Sun Oct 26 03:11:32 GMT 2025

Total time taken to generate the page: 0.10026 seconds