Home » Mailing lists » Devel » [PATCH] BC: resource beancounters (v2)
|
Re: [PATCH 2/6] BC: beancounters core (API) [message #5621 is a reply to message #5612] |
Thu, 24 August 2006 15:00   |
Andrew Morton
Messages: 127 Registered: December 2005
|
Senior Member |
|
|
On Thu, 24 Aug 2006 16:06:11 +0400
Kirill Korotaev <dev@sw.ru> wrote:
> >>+#define bc_charge_locked(bc, r, v, s) (0)
> >>> +#define bc_charge(bc, r, v) (0)
> >
> >akpm:/home/akpm> cat t.c
> >void foo(void)
> >{
> > (0);
> >}
> >akpm:/home/akpm> gcc -c -Wall t.c
> >t.c: In function 'foo':
> >t.c:4: warning: statement with no effect
>
> these functions return value should always be checked (!).
We have __must_check for that.
> i.e. it is never called like:
> ub_charge(bc, r, v);
Also...
if (bc_charge(tpyo, undefined_variable, syntax_error))
will happily compile if !CONFIG_BEANCOUNTER.
Turning these stubs into static inline __must_check functions fixes all this.
|
|
|
|
|
Re: [PATCH 2/6] BC: beancounters core (API) [message #5626 is a reply to message #5544] |
Thu, 24 August 2006 17:09   |
Oleg Nesterov
Messages: 143 Registered: August 2006
|
Senior Member |
|
|
On 08/23, Kirill Korotaev wrote:
>
> +struct beancounter *beancounter_findcreate(uid_t uid, int mask)
> +{
> + struct beancounter *new_bc, *bc;
> + unsigned long flags;
> + struct hlist_head *slot;
> + struct hlist_node *pos;
> +
> + slot = &bc_hash[bc_hash_fun(uid)];
> + new_bc = NULL;
> +
> +retry:
> + spin_lock_irqsave(&bc_hash_lock, flags);
> + hlist_for_each_entry (bc, pos, slot, hash)
> + if (bc->bc_id == uid)
> + break;
> +
> + if (pos != NULL) {
> + get_beancounter(bc);
> + spin_unlock_irqrestore(&bc_hash_lock, flags);
> +
> + if (new_bc != NULL)
> + kmem_cache_free(bc_cachep, new_bc);
> + return bc;
> + }
> +
> + if (!(mask & BC_ALLOC))
> + goto out_unlock;
Very minor nit: it is not clear why we are doing this check under
bc_hash_lock. I'd suggest to do
if (!(mask & BC_ALLOC))
goto out;
after unlock(bc_hash_lock) and kill out_unlock label.
> + if (new_bc != NULL)
> + goto out_install;
> +
> + spin_unlock_irqrestore(&bc_hash_lock, flags);
> +
> + new_bc = kmem_cache_alloc(bc_cachep,
> + mask & BC_ALLOC_ATOMIC ? GFP_ATOMIC : GFP_KERNEL);
> + if (new_bc == NULL)
> + goto out;
> +
> + memcpy(new_bc, &default_beancounter, sizeof(*new_bc));
May be it is just me, but I need a couple of seconds to parse this 'memcpy'.
How about
*new_bc = default_beancounter;
?
Oleg.
|
|
|
|
|
Re: [PATCH 2/6] BC: beancounters core (API) [message #5649 is a reply to message #5621] |
Fri, 25 August 2006 10:51   |
dev
Messages: 1693 Registered: September 2005 Location: Moscow
|
Senior Member |

|
|
Andrew Morton wrote:
> On Thu, 24 Aug 2006 16:06:11 +0400
> Kirill Korotaev <dev@sw.ru> wrote:
>
>
>>>>+#define bc_charge_locked(bc, r, v, s) (0)
>>>>
>>>>>+#define bc_charge(bc, r, v) (0)
>>>
>>>akpm:/home/akpm> cat t.c
>>>void foo(void)
>>>{
>>> (0);
>>>}
>>>akpm:/home/akpm> gcc -c -Wall t.c
>>>t.c: In function 'foo':
>>>t.c:4: warning: statement with no effect
>>
>>these functions return value should always be checked (!).
>
>
> We have __must_check for that.
>
>
>>i.e. it is never called like:
>> ub_charge(bc, r, v);
>
>
> Also...
>
> if (bc_charge(tpyo, undefined_variable, syntax_error))
>
> will happily compile if !CONFIG_BEANCOUNTER.
>
> Turning these stubs into static inline __must_check functions fixes all this.
ok. will replace all empty stubs with inlines (with __must_check where appropriate)
Thanks,
Kirill
|
|
|
|
Re: [PATCH 1/6] BC: kconfig [message #5652 is a reply to message #5582] |
Fri, 25 August 2006 11:27   |
dev
Messages: 1693 Registered: September 2005 Location: Moscow
|
Senior Member |

|
|
Matt Helsley wrote:
> On Wed, 2006-08-23 at 15:04 -0700, Dave Hansen wrote:
>
>>On Wed, 2006-08-23 at 15:01 +0400, Kirill Korotaev wrote:
>>
>>>--- ./arch/sparc64/Kconfig.arkcfg 2006-07-17 17:01:11.000000000 +0400
>>>+++ ./arch/sparc64/Kconfig 2006-08-10 17:56:36.000000000 +0400
>>>@@ -432,3 +432,5 @@ source "security/Kconfig"
>>> source "crypto/Kconfig"
>>>
>>> source "lib/Kconfig"
>>>+
>>>+source "kernel/bc/Kconfig"
>>
>>...
>>
>>>--- ./arch/sparc64/Kconfig.arkcfg 2006-07-17 17:01:11.000000000 +0400
>>>+++ ./arch/sparc64/Kconfig 2006-08-10 17:56:36.000000000 +0400
>>>@@ -432,3 +432,5 @@ source "security/Kconfig"
>>> source "crypto/Kconfig"
>>>
>>> source "lib/Kconfig"
>>>+
>>>+source "kernel/bc/Kconfig"
>>
>>Is it just me, or do these patches look a little funky? Looks like it
>>is trying to patch the same thing into the same file, twice. Also, the
>>patches look to be -p0 instead of -p1.
>
>
> They do appear to be -p0
it is -p1. patches are generated with gendiff and ./ in names is for -p1
> They aren't adding the same thing twice to the same file. This patch
> makes different arches source the same Kconfig.
>
> I seem to recall Chandra suggested that instead of doing it this way it
> would be more appropriate to add the source line to init/Kconfig because
> it's more central and arch-independent. I tend to agree.
agreed. init/Kconfig looks like a good place for including
kernel/bc/Kconfig
Kirill
|
|
|
Re: [PATCH 1/6] BC: kconfig [message #5653 is a reply to message #5579] |
Fri, 25 August 2006 11:31   |
dev
Messages: 1693 Registered: September 2005 Location: Moscow
|
Senior Member |

|
|
Dave Hansen wrote:
> On Wed, 2006-08-23 at 15:01 +0400, Kirill Korotaev wrote:
>
>>--- ./arch/sparc64/Kconfig.arkcfg 2006-07-17 17:01:11.000000000 +0400
>>+++ ./arch/sparc64/Kconfig 2006-08-10 17:56:36.000000000 +0400
>>@@ -432,3 +432,5 @@ source "security/Kconfig"
>> source "crypto/Kconfig"
>>
>> source "lib/Kconfig"
>>+
>>+source "kernel/bc/Kconfig"
>
> ...
>
>>--- ./arch/sparc64/Kconfig.arkcfg 2006-07-17 17:01:11.000000000 +0400
>>+++ ./arch/sparc64/Kconfig 2006-08-10 17:56:36.000000000 +0400
>>@@ -432,3 +432,5 @@ source "security/Kconfig"
>> source "crypto/Kconfig"
>>
>> source "lib/Kconfig"
>>+
>>+source "kernel/bc/Kconfig"
>
>
> Is it just me, or do these patches look a little funky? Looks like it
> is trying to patch the same thing into the same file, twice. Also, the
> patches look to be -p0 instead of -p1.
>
> I'm having a few problems applying them.
Oh, it's my fault. I pasted text twice :/
Kirill
|
|
|
Re: [PATCH] BC: resource beancounters (v2) [message #5654 is a reply to message #5570] |
Fri, 25 August 2006 11:47   |
dev
Messages: 1693 Registered: September 2005 Location: Moscow
|
Senior Member |

|
|
Andrew Morton wrote:
>>As the first step we want to propose for discussion
>>the most complicated parts of resource management:
>>kernel memory and virtual memory.
>
>
> The patches look reasonable to me - mergeable after updating them for
> today's batch of review commentlets.
sure. will do updates as long as there are reasonable comments.
> I have two high-level problems though.
>
> a) I don't yet have a sense of whether this implementation
> is appropriate/sufficient for the various other
> applications which people are working on.
>
> If the general shape is OK and we think this
> implementation can be grown into one which everyone can
> use then fine.
>
> And...
>
>
>>The patch set to be sent provides core for BC and
>>management of kernel memory only. Virtual memory
>>management will be sent in a couple of days.
>
>
> We need to go over this work before we can commit to the BC
> core. Last time I looked at the VM accounting patch it
> seemed rather unpleasing from a maintainability POV.
hmmm... in which regard?
> And, if I understand it correctly, the only response to a job
> going over its VM limits is to kill it, rather than trimming
> it. Which sounds like a big problem?
No, UBC virtual memory management refuses occur on mmap()'s.
Andrey Savochkin wrote already a brief summary on vm resource management:
------------- cut ----------------
The task of limiting a container to 4.5GB of memory bottles down to the
question: what to do when the container starts to use more than assigned
4.5GB of memory?
At this moment there are only 3 viable alternatives.
A) Have separate memory management for each container,
with separate buddy allocator, lru lists, page replacement mechanism.
That implies a considerable overhead, and the main challenge there
is sharing of pages between these separate memory managers.
B) Return errors on extension of mappings, but not on page faults, where
memory is actually consumed.
In this case it makes sense to take into account not only the size of used
memory, but the size of created mappings as well.
This is approximately what "privvmpages" accounting/limiting provides in
UBC.
C) Rely on OOM killer.
This is a fall-back method in UBC, for the case "privvmpages" limits
still leave the possibility to overload the system.
It would be nice, indeed, to invent something new.
The ideal mechanism would
- slow down the container over-using memory, to signal the user that
he is over his limits,
- at the same time this slowdown shouldn't lead to the increase of memory
usage: for example, a simple slowdown of apache web server would lead
to the growth of the number of serving children and consumption of more
memory while showing worse performance,
- and, at the same time, it shouldn't penalize the rest of the system from
the performance point of view...
May be this can be achieved via carefully tuned swapout mechanism together
with disk bandwidth management capable of tracking asynchronous write
requests, may be something else is required.
It's really a big challenge.
Meanwhile, I guess we can only make small steps in improving Linux resource
management features for this moment.
------------- cut ----------------
Thanks,
Kirill
|
|
|
Re: [PATCH] BC: resource beancounters (v2) [message #5655 is a reply to message #5654] |
Fri, 25 August 2006 14:30   |
Andrew Morton
Messages: 127 Registered: December 2005
|
Senior Member |
|
|
On Fri, 25 Aug 2006 15:49:15 +0400
Kirill Korotaev <dev@sw.ru> wrote:
> > We need to go over this work before we can commit to the BC
> > core. Last time I looked at the VM accounting patch it
> > seemed rather unpleasing from a maintainability POV.
> hmmm... in which regard?
Little changes all over the MM code which might get accidentally broken.
> > And, if I understand it correctly, the only response to a job
> > going over its VM limits is to kill it, rather than trimming
> > it. Which sounds like a big problem?
> No, UBC virtual memory management refuses occur on mmap()'s.
That's worse, isn't it? Firstly it rules out big sparse mappings and secondly
mmap_and_use(80% of container size)
fork_and_immediately_exec(/bin/true)
will fail at the fork?
> Andrey Savochkin wrote already a brief summary on vm resource management:
>
> ------------- cut ----------------
> The task of limiting a container to 4.5GB of memory bottles down to the
> question: what to do when the container starts to use more than assigned
> 4.5GB of memory?
>
> At this moment there are only 3 viable alternatives.
>
> A) Have separate memory management for each container,
> with separate buddy allocator, lru lists, page replacement mechanism.
> That implies a considerable overhead, and the main challenge there
> is sharing of pages between these separate memory managers.
>
> B) Return errors on extension of mappings, but not on page faults, where
> memory is actually consumed.
> In this case it makes sense to take into account not only the size of used
> memory, but the size of created mappings as well.
> This is approximately what "privvmpages" accounting/limiting provides in
> UBC.
>
> C) Rely on OOM killer.
> This is a fall-back method in UBC, for the case "privvmpages" limits
> still leave the possibility to overload the system.
>
D) Virtual scan of mm's in the over-limit container
E) Modify existing physical scanner to be able to skip pages which
belong to not-over-limit containers.
F) Something else ;)
|
|
|
|
|
|
Re: BC: resource beancounters (v2) [message #5661 is a reply to message #5655] |
Fri, 25 August 2006 16:30   |
Andrey Savochkin
Messages: 47 Registered: December 2005
|
Member |
|
|
On Fri, Aug 25, 2006 at 07:30:03AM -0700, Andrew Morton wrote:
> On Fri, 25 Aug 2006 15:49:15 +0400
> Kirill Korotaev <dev@sw.ru> wrote:
>
> > Andrey Savochkin wrote already a brief summary on vm resource management:
> >
> > ------------- cut ----------------
> > The task of limiting a container to 4.5GB of memory bottles down to the
> > question: what to do when the container starts to use more than assigned
> > 4.5GB of memory?
> >
> > At this moment there are only 3 viable alternatives.
> >
> > A) Have separate memory management for each container,
> > with separate buddy allocator, lru lists, page replacement mechanism.
> > That implies a considerable overhead, and the main challenge there
> > is sharing of pages between these separate memory managers.
> >
> > B) Return errors on extension of mappings, but not on page faults, where
> > memory is actually consumed.
> > In this case it makes sense to take into account not only the size of used
> > memory, but the size of created mappings as well.
> > This is approximately what "privvmpages" accounting/limiting provides in
> > UBC.
> >
> > C) Rely on OOM killer.
> > This is a fall-back method in UBC, for the case "privvmpages" limits
> > still leave the possibility to overload the system.
> >
>
> D) Virtual scan of mm's in the over-limit container
>
> E) Modify existing physical scanner to be able to skip pages which
> belong to not-over-limit containers.
I've actually tried (E), but it didn't work as I wished.
It didn't handle well shared pages.
Then, in my experiments such modified scanner was unable to regulate
quality-of-service. When I ran 2 over-the-limit containers, they worked
equally slow regardless of their limits and work set size.
That is, I didn't observe a smooth transition "under limit, maximum
performance" to "slightly over limit, a bit reduced performance" to
"significantly over limit, poor performance". Neither did I see any fairness
in how containers got penalized for exceeding their limits.
My explanation of what I observed is that
- since filesystem caches play a huge role in performance, page scanner will
be very limited in controlling container's performance if caches
stay shared between containers,
- in the absence of decent disk I/O manager, stalls due to swapin/swapout
are more influenced by disk subsystem than by page scanner policy.
So in fact modified page scanner provides control over memory usage only as
"stay under limits or die", and doesn't show many advantages over (B) or (C).
At the same time, skipping pages visibly penalizes "good citizens", not only
in disk bandwidth but in CPU overhead as well.
So I settled for (A)-(C) for now.
But it certainly would be interesting to hear if someone else makes such
experiments.
Best regards
Andrey
|
|
|
Re: BC: resource beancounters (v2) [message #5663 is a reply to message #5661] |
Fri, 25 August 2006 17:50   |
Andrew Morton
Messages: 127 Registered: December 2005
|
Senior Member |
|
|
On Fri, 25 Aug 2006 20:30:26 +0400
Andrey Savochkin <saw@sw.ru> wrote:
> On Fri, Aug 25, 2006 at 07:30:03AM -0700, Andrew Morton wrote:
> >
> > D) Virtual scan of mm's in the over-limit container
> >
> > E) Modify existing physical scanner to be able to skip pages which
> > belong to not-over-limit containers.
>
> I've actually tried (E), but it didn't work as I wished.
>
> It didn't handle well shared pages.
> Then, in my experiments such modified scanner was unable to regulate
> quality-of-service. When I ran 2 over-the-limit containers, they worked
> equally slow regardless of their limits and work set size.
> That is, I didn't observe a smooth transition "under limit, maximum
> performance" to "slightly over limit, a bit reduced performance" to
> "significantly over limit, poor performance". Neither did I see any fairness
> in how containers got penalized for exceeding their limits.
>
> My explanation of what I observed is that
> - since filesystem caches play a huge role in performance, page scanner will
> be very limited in controlling container's performance if caches
> stay shared between containers,
> - in the absence of decent disk I/O manager, stalls due to swapin/swapout
> are more influenced by disk subsystem than by page scanner policy.
> So in fact modified page scanner provides control over memory usage only as
> "stay under limits or die", and doesn't show many advantages over (B) or (C).
> At the same time, skipping pages visibly penalizes "good citizens", not only
> in disk bandwidth but in CPU overhead as well.
>
> So I settled for (A)-(C) for now.
> But it certainly would be interesting to hear if someone else makes such
> experiments.
>
Makes sense. If one is looking for good machine partitioning then a shared
disk is obviously a great contention point. To address that we'd need to
be able to say "container A swaps to /dev/sda1 and container B swaps to
/dev/sdb1". But the swap system at present can't do that.
|
|
|
Re: BC: resource beancounters (v2) [message #5666 is a reply to message #5661] |
Fri, 25 August 2006 19:00   |
Chandra Seetharaman
Messages: 88 Registered: August 2006
|
Member |
|
|
Have you seen/tried the memory controller in CKRM/Resource Groups ?
http://sourceforge.net/projects/ckrm
It maintains a per resource group LRU lists and also maintains a list of
over-guarantee groups (with ordering based on where they are in their
guarantee-limit scale). So, when a reclaim needs to happen, pages are
first freed from a group that is way over its limit, and then the next
one and so on.
Few things that it does that are not good:
- doesn't account shared pages accurately
- moves all pages from a task when the task moves to a different group
- totally new reclamation path
regards,
chandra
On Fri, 2006-08-25 at 20:30 +0400, Andrey Savochkin wrote:
> On Fri, Aug 25, 2006 at 07:30:03AM -0700, Andrew Morton wrote:
> > On Fri, 25 Aug 2006 15:49:15 +0400
> > Kirill Korotaev <dev@sw.ru> wrote:
> >
> > > Andrey Savochkin wrote already a brief summary on vm resource management:
> > >
> > > ------------- cut ----------------
> > > The task of limiting a container to 4.5GB of memory bottles down to the
> > > question: what to do when the container starts to use more than assigned
> > > 4.5GB of memory?
> > >
> > > At this moment there are only 3 viable alternatives.
> > >
> > > A) Have separate memory management for each container,
> > > with separate buddy allocator, lru lists, page replacement mechanism.
> > > That implies a considerable overhead, and the main challenge there
> > > is sharing of pages between these separate memory managers.
> > >
> > > B) Return errors on extension of mappings, but not on page faults, where
> > > memory is actually consumed.
> > > In this case it makes sense to take into account not only the size of used
> > > memory, but the size of created mappings as well.
> > > This is approximately what "privvmpages" accounting/limiting provides in
> > > UBC.
> > >
> > > C) Rely on OOM killer.
> > > This is a fall-back method in UBC, for the case "privvmpages" limits
> > > still leave the possibility to overload the system.
> > >
> >
> > D) Virtual scan of mm's in the over-limit container
> >
> > E) Modify existing physical scanner to be able to skip pages which
> > belong to not-over-limit containers.
>
> I've actually tried (E), but it didn't work as I wished.
>
> It didn't handle well shared pages.
> Then, in my experiments such modified scanner was unable to regulate
> quality-of-service. When I ran 2 over-the-limit containers, they worked
> equally slow regardless of their limits and work set size.
> That is, I didn't observe a smooth transition "under limit, maximum
> performance" to "slightly over limit, a bit reduced performance" to
> "significantly over limit, poor performance". Neither did I see any fairness
> in how containers got penalized for exceeding their limits.
>
> My explanation of what I observed is that
> - since filesystem caches play a huge role in performance, page scanner will
> be very limited in controlling container's performance if caches
> stay shared between containers,
> - in the absence of decent disk I/O manager, stalls due to swapin/swapout
> are more influenced by disk subsystem than by page scanner policy.
> So in fact modified page scanner provides control over memory usage only as
> "stay under limits or die", and doesn't show many advantages over (B) or (C).
> At the same time, skipping pages visibly penalizes "good citizens", not only
> in disk bandwidth but in CPU overhead as well.
>
> So I settled for (A)-(C) for now.
> But it certainly would be interesting to hear if someone else makes such
> experiments.
>
> Best regards
>
> Andrey
--
------------------------------------------------------------ ----------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
------------------------------------------------------------ ----------
|
|
|
Re: BC: resource beancounters (v2) [message #5676 is a reply to message #5661] |
Sat, 26 August 2006 02:15   |
Rohit Seth
Messages: 101 Registered: August 2006
|
Senior Member |
|
|
On Fri, 2006-08-25 at 20:30 +0400, Andrey Savochkin wrote:
> On Fri, Aug 25, 2006 at 07:30:03AM -0700, Andrew Morton wrote:
> > On Fri, 25 Aug 2006 15:49:15 +0400
> > Kirill Korotaev <dev@sw.ru> wrote:
> >
> > > Andrey Savochkin wrote already a brief summary on vm resource management:
> > >
> > > ------------- cut ----------------
> > > The task of limiting a container to 4.5GB of memory bottles down to the
> > > question: what to do when the container starts to use more than assigned
> > > 4.5GB of memory?
> > >
> > > At this moment there are only 3 viable alternatives.
> > >
> > > A) Have separate memory management for each container,
> > > with separate buddy allocator, lru lists, page replacement mechanism.
> > > That implies a considerable overhead, and the main challenge there
> > > is sharing of pages between these separate memory managers.
> > >
Yes, sharing of pages across different containers/managers will be a
problem. Why not just disallow that scenario (that is what fake nodes
proposal would also end up doing).
> > > B) Return errors on extension of mappings, but not on page faults, where
> > > memory is actually consumed.
> > > In this case it makes sense to take into account not only the size of used
> > > memory, but the size of created mappings as well.
> > > This is approximately what "privvmpages" accounting/limiting provides in
> > > UBC.
> > >
Keeping a tab on all the virtual mappings in a container must also be
troublesome. And IMO is not the right way to go...this is even a
stricter version of overcommit_memory...right?
>
> > > C) Rely on OOM killer.
> > > This is a fall-back method in UBC, for the case "privvmpages" limits
> > > still leave the possibility to overload the system.
> > >
> >
> > D) Virtual scan of mm's in the over-limit container
> >
This seems like an interesting choice. If we can quickly inactivate
some pages belonging to tasks in over_the_limit container.
> > E) Modify existing physical scanner to be able to skip pages which
> > belong to not-over-limit containers.
>
> I've actually tried (E), but it didn't work as I wished.
>
> It didn't handle well shared pages.
> Then, in my experiments such modified scanner was unable to regulate
> quality-of-service. When I ran 2 over-the-limit containers, they worked
> equally slow regardless of their limits and work set size.
> That is, I didn't observe a smooth transition "under limit, maximum
> performance" to "slightly over limit, a bit reduced performance" to
> "significantly over limit, poor performance". Neither did I see any fairness
> in how containers got penalized for exceeding their limits.
>
That sure is an interesting observation though I think it really depends
on if you are doing the same amount of work when counts have just gone
above the limits to the point where they are way over the limit.
> My explanation of what I observed is that
> - since filesystem caches play a huge role in performance, page scanner will
> be very limited in controlling container's performance if caches
> stay shared between containers,
Yeah, if a page is shared between containers then you can end up doing
nothing useful. And that is where containers dedicated to individual
filesystem could be useful.
> - in the absence of decent disk I/O manager, stalls due to swapin/swapout
> are more influenced by disk subsystem than by page scanner policy.
> So in fact modified page scanner provides control over memory usage only as
> "stay under limits or die", and doesn't show many advantages over (B) or (C).
> At the same time, skipping pages visibly penalizes "good citizens", not only
> in disk bandwidth but in CPU overhead as well.
>
Sure that CPU, disk and other variables will kick in when you start
swapping. But then apps are expected to suffer when gone over limit.
The drawback is the apps that are not hit the limit will also suffer,
but then that is where extra controllers like CPU will kick in.
Maybe, we have a flag for each container indicating whether the tasks
belonging to that container should be killed immediately or they are
okay to run with lower performance as far as they can.
-rohit
|
|
|
Re: [PATCH] BC: resource beancounters (v2) [message #5678 is a reply to message #5657] |
Sat, 26 August 2006 03:55   |
Nick Piggin
Messages: 35 Registered: March 2006
|
Member |
|
|
Alan Cox wrote:
> Ar Sad, 2006-08-26 am 01:14 +1000, ysgrifennodd Nick Piggin:
>
>>I still think doing simple accounting per-page would be a better way to
>>go than trying to pin down all "user allocatable" kernel allocations.
>>And would require all of about 2 hooks in the page allocator. And would
>>track *actual* RAM allocated by that container.
>
>
> You have a variety of kernel objects you want to worry about and they
> have very differing properties.
>
> Some are basically shared resources - page cache, dentries, inodes, etc
> and can be balanced pretty well by the kernel (ok the dentries are a bit
> of a problem right now). Others are very specific "owned" resources -
> like file handles, sockets and vmas.
That's true (OTOH I'd argue it would still be very useful for things
like pagecache, so one container can't start a couple of 'dd' loops
and turn everyone else to crap). And while the sharing may not be
exactly captured, statistically things should balance over time.
So I'm not arguing about _also_ accounting resources that are limited
in other ways (than just the RAM they consume).
But as a DoS protection measure on RAM usage, trying to account all
kernel allocations that are user triggerable just sounds hard to
maintain, holey, ugly, invsive (and not perfect either -- in fact it
still isn't clear to me that it is any better than my proposal).
>
> Tracking actual RAM use by container/user/.. isn't actually that
> interesting. It's also inconveniently sub page granularity.
If it isn't interesting, then I don't think we want it (at least, until
someone does get an interest in it).
>
> Its a whole seperate question whether you want a separate bean counter
> limit for sockets, file handles, vmas etc.
Yeah that's fair enough. We obviously want to avoid exposing limits on
things that it doesn't make sense to limit, or that is a kernel
implementation detail as much as possible.
eg. so I would be happy to limit virtual address, less happy to limit
vmas alone (unless that is in the context of accounting their RAM usage
or their implied vaddr charge).
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
|
|
|
|
|
|
Re: Re: BC: resource beancounters (v2) [message #5709 is a reply to message #5707] |
Mon, 28 August 2006 17:40   |
|
Rohit Seth wrote:
> On Sat, 2006-08-26 at 17:37 +0100, Alan Cox wrote:
>
>> Ar Gwe, 2006-08-25 am 19:15 -0700, ysgrifennodd Rohit Seth:
>>
>>> Yes, sharing of pages across different containers/managers will be a
>>> problem. Why not just disallow that scenario (that is what fake nodes
>>> proposal would also end up doing).
>>>
>> Because it destroys the entire point of using containers instead of
>> something like Xen - which is sharing. Also at the point I am using
>> beancounters per user I don't want glibc per use, libX11 per use glib
>> per use gtk per user etc..
>>
>>
>>
>
> I'm not saying per use glibc etc. That will indeed be useless and bring
> it to virtualization world. Just like fake node, one should be allowed
> to use pages that are already in (for example) page cache- so that you
> don't end up duplicating all shared stuff. But as far as charging is
> concerned, charge it to container who either got the page in page cache
> OR if FS based semantics exist then charge it to the container where the
> file belongs. What I was suggesting is to not charge a page to
> different counters.
>
Consider the following simple scenario: there are 50 containers
(numbered, say, 1 to 50) all sharing a single installation of Fedora
Core 5. They all run sshd, apache, syslogd, crond and some other stuff
like that. This is actually quite a real scenario.
In the world that you propose the container which was unlucky to start
first (probably the one with ID of either 1 or 50) will be charged for
all the memory, and all the
others will have most of their memory for free. And in such a world
per-container memory accounting or limiting is just not possible.
|
|
|
Re: Re: BC: resource beancounters (v2) [message #5721 is a reply to message #5709] |
Mon, 28 August 2006 22:28   |
Rohit Seth
Messages: 101 Registered: August 2006
|
Senior Member |
|
|
On Mon, 2006-08-28 at 21:41 +0400, Kir Kolyshkin wrote:
> Rohit Seth wrote:
> >
> > I'm not saying per use glibc etc. That will indeed be useless and bring
> > it to virtualization world. Just like fake node, one should be allowed
> > to use pages that are already in (for example) page cache- so that you
> > don't end up duplicating all shared stuff. But as far as charging is
> > concerned, charge it to container who either got the page in page cache
> > OR if FS based semantics exist then charge it to the container where the
> > file belongs. What I was suggesting is to not charge a page to
> > different counters.
> >
>
> Consider the following simple scenario: there are 50 containers
> (numbered, say, 1 to 50) all sharing a single installation of Fedora
> Core 5. They all run sshd, apache, syslogd, crond and some other stuff
> like that. This is actually quite a real scenario.
>
> In the world that you propose the container which was unlucky to start
> first (probably the one with ID of either 1 or 50) will be charged for
> all the memory, and all the
> others will have most of their memory for free. And in such a world
> per-container memory accounting or limiting is just not possible.
If you are only having task based accounting then yes the first
container using a page will be charged. And when it hit its limit then
it will inactivate some of the pages. If some other container now uses
the same page (that got inactivated) again then this next container will
be charged for that page.
Though if we have file/directory based accounting then shared pages
belonging to /usr/lib or /usr/bin can go to a common container.
-rohit
|
|
|
|
Re: [PATCH 6/6] BC: kernel memory accounting (marks) [message #5731 is a reply to message #5573] |
Tue, 29 August 2006 09:52   |
dev
Messages: 1693 Registered: September 2005 Location: Moscow
|
Senior Member |

|
|
Dave Hansen wrote:
> I'm still a bit concerned about if we actually need the 'struct page'
> pointer. I've gone through all of the users, and I'm not sure that I
> see any that _require_ having a pointer in 'struct page'. I think it
> will take some rework, especially with the pagetables, but it should be
> quite doable.
don't worry:
1. we will introduce a separate patch moving this pointer
into mirroring array
2. this pointer is still required for _user_ pages tracking,
that's why I don't follow your suggestion right now...
> vmalloc:
> Store in vm_struct
> fd_set_bits:
> poll_get:
> mount hashtable:
> Don't need alignment. use the slab?
> pagetables:
> either store in an extra field of 'struct page', or use the
> mm's. mm should always be available when alloc/freeing a
> pagetable page
>
> Did I miss any?
flocks, pipe buffers, task_struct, sighand, signal, vmas,
posix timers, uid_cache, shmem dirs,
Thanks,
Kirill
|
|
|
|
Re: [PATCH] BC: resource beancounters (v2) [message #5747 is a reply to message #5655] |
Tue, 29 August 2006 15:33   |
dev
Messages: 1693 Registered: September 2005 Location: Moscow
|
Senior Member |

|
|
Andrew Morton wrote:
> On Fri, 25 Aug 2006 15:49:15 +0400
> Kirill Korotaev <dev@sw.ru> wrote:
>
>
>>>We need to go over this work before we can commit to the BC
>>>core. Last time I looked at the VM accounting patch it
>>>seemed rather unpleasing from a maintainability POV.
>>
>>hmmm... in which regard?
>
>
> Little changes all over the MM code which might get accidentally broken.
>
>
>>>And, if I understand it correctly, the only response to a job
>>>going over its VM limits is to kill it, rather than trimming
>>>it. Which sounds like a big problem?
>>
>>No, UBC virtual memory management refuses occur on mmap()'s.
>
>
> That's worse, isn't it? Firstly it rules out big sparse mappings and secondly
1) if mappings are private then yes, you can not mmap too much. This is logical,
since this whole mappings are potentially allocatable and there is no way to control
it later except for SIGKILL.
2) if mappings are shared file mappings (shmem is handled in similar way) then
you can mmap as much as you want, since these pages can be reclaimed.
> mmap_and_use(80% of container size)
> fork_and_immediately_exec(/bin/true)
>
> will fail at the fork?
yes, it will fail on fork() or exec() in case of much private (1) mappings.
fail on fork() and exec() is much friendlier then SIGKILL, don't you think so?
private mappings parameter which is limited by UBC is a kind of upper estimation
of container RSS. From our experience such an estimation is ~ 5-20% higher then
a real physical memory used (with real-life applications).
>>Andrey Savochkin wrote already a brief summary on vm resource management:
>>
>>------------- cut ----------------
>>The task of limiting a container to 4.5GB of memory bottles down to the
>>question: what to do when the container starts to use more than assigned
>>4.5GB of memory?
>>
>>At this moment there are only 3 viable alternatives.
>>
>>A) Have separate memory management for each container,
>> with separate buddy allocator, lru lists, page replacement mechanism.
>> That implies a considerable overhead, and the main challenge there
>> is sharing of pages between these separate memory managers.
>>
>>B) Return errors on extension of mappings, but not on page faults, where
>> memory is actually consumed.
>> In this case it makes sense to take into account not only the size of used
>> memory, but the size of created mappings as well.
>> This is approximately what "privvmpages" accounting/limiting provides in
>> UBC.
>>
>>C) Rely on OOM killer.
>> This is a fall-back method in UBC, for the case "privvmpages" limits
>> still leave the possibility to overload the system.
>>
>
>
> D) Virtual scan of mm's in the over-limit container
>
> E) Modify existing physical scanner to be able to skip pages which
> belong to not-over-limit containers.
>
> F) Something else ;)
We fully agree that other possible algorithms can and should exist.
My idea only is that any of them would need accounting anyway
(which is the most part of beancounters).
Throtling, modified scanners etc. can be implemented as a separate
BC parameters. Thus, an administrator will be able to select
which policy should be applied to the container which is near its limit.
So the patches I'm trying to send are a step-by-step accounting of all
the resources and their simple limitations. More comprehensive limitation
policy will be built on top of it later.
BTW, UBC page beancounters allow to distinguish pages used by only one
container and pages which are shared. So scanner can try to reclaim
container private pages first, thus not influencing other containers.
Thanks,
Kirill
|
|
|
|
|
|
|
Re: Re: BC: resource beancounters (v2) [message #5756 is a reply to message #5755] |
Tue, 29 August 2006 19:15  |
Rohit Seth
Messages: 101 Registered: August 2006
|
Senior Member |
|
|
On Tue, 2006-08-29 at 20:06 +0100, Alan Cox wrote:
> Ar Maw, 2006-08-29 am 10:30 -0700, ysgrifennodd Rohit Seth:
> > On Tue, 2006-08-29 at 11:15 +0100, Alan Cox wrote:
> > > Ar Llu, 2006-08-28 am 15:28 -0700, ysgrifennodd Rohit Seth:
> > > > Though if we have file/directory based accounting then shared pages
> > > > belonging to /usr/lib or /usr/bin can go to a common container.
> > >
> > > So that one user can map all the spare libraries and config files and
> > > DoS the system by preventing people from accessing the libraries they do
> > > need ?
> > >
> >
> > Well, there is a risk whenever there is sharing across containers. The
> > point though is, give the choice to sysadmin to configure the platform
> > the way it is appropriate.
>
> In other words your suggestion doesn't actually work for the real world
> cases like web serving.
>
Containers are not going to solve all the problems particularly the
scenarios like when a machine is a web server and an odd user can log on
to the same machine and (w/o any ulimits) claim all the memory that is
present in the system.
Though it is quite possible to implement a combination of two (task and
fs based) policies in containers and sysadmin can set a preference of
each each container. [this probably is another reason for having a per
page container pointer].
-rohit
|
|
|
Re: [PATCH] BC: resource beancounters (v2) [message #5763 is a reply to message #5747] |
Tue, 29 August 2006 17:08  |
Balbir Singh
Messages: 491 Registered: August 2006
|
Senior Member |
|
|
Kirill Korotaev wrote:
>>> ------------- cut ----------------
>>> The task of limiting a container to 4.5GB of memory bottles down to the
>>> question: what to do when the container starts to use more than assigned
>>> 4.5GB of memory?
>>>
>>> At this moment there are only 3 viable alternatives.
>>>
>>> A) Have separate memory management for each container,
>>> with separate buddy allocator, lru lists, page replacement mechanism.
>>> That implies a considerable overhead, and the main challenge there
>>> is sharing of pages between these separate memory managers.
>>>
>>> B) Return errors on extension of mappings, but not on page faults, where
>>> memory is actually consumed.
>>> In this case it makes sense to take into account not only the size
>>> of used
>>> memory, but the size of created mappings as well.
>>> This is approximately what "privvmpages" accounting/limiting
>>> provides in
>>> UBC.
>>>
>>> C) Rely on OOM killer.
>>> This is a fall-back method in UBC, for the case "privvmpages" limits
>>> still leave the possibility to overload the system.
>>>
>>
>>
>> D) Virtual scan of mm's in the over-limit container
>>
>> E) Modify existing physical scanner to be able to skip pages which
>> belong to not-over-limit containers.
>>
>> F) Something else ;)
> We fully agree that other possible algorithms can and should exist.
> My idea only is that any of them would need accounting anyway
> (which is the most part of beancounters).
> Throtling, modified scanners etc. can be implemented as a separate
> BC parameters. Thus, an administrator will be able to select
> which policy should be applied to the container which is near its limit.
>
> So the patches I'm trying to send are a step-by-step accounting of all
> the resources and their simple limitations. More comprehensive limitation
> policy will be built on top of it later.
>
One of the issues I see is that bean counters are not very flexible. Tasks
cannot change bean counters dynamically after fork()/exec() that is - can they?
> BTW, UBC page beancounters allow to distinguish pages used by only one
> container and pages which are shared. So scanner can try to reclaim
> container private pages first, thus not influencing other containers.
>
But can you select the specific container for which we intend to scan pages?
> Thanks,
> Kirill
>
--
Thanks,
Balbir Singh,
Linux Technology Center,
IBM Software Labs
|
|
|
Goto Forum:
Current Time: Fri Oct 24 09:32:28 GMT 2025
Total time taken to generate the page: 0.09686 seconds
|