OpenVZ Forum


Home » Mailing lists » Devel » [PATCH] BC: resource beancounters (v2)
Re: [PATCH 4/6] BC: user interface (syscalls) [message #5618 is a reply to message #5604] Thu, 24 August 2006 13:08 Go to previous messageGo to next message
Alexey Dobriyan is currently offline  Alexey Dobriyan
Messages: 195
Registered: August 2006
Senior Member
On Thu, Aug 24, 2006 at 12:04:16PM +0100, Alan Cox wrote:
> Ar Mer, 2006-08-23 am 21:35 -0700, ysgrifennodd Andrew Morton:
> > > Its a uid_t because of setluid() and twenty odd years of existing unix
> > > practice.
> > >
> >
> > I don't understand. This number is an identifier for an accounting
> > container, which was somehow dreamed up by userspace.
>
> Which happens to be a uid_t. It could easily be anyother_t of itself and
> you can create a container_id_t or whatever. It is just a number.
>
> The ancient Unix implementations of this kind of resource management and
> security are built around setluid() which sets a uid value that cannot
> be changed again and is normally used for security purposes. That
> happened to be a uid_t and in simple setups at login uid = luid = euid
> would be the norm.
>
> Thus the Linux one happens to be a uid_t. It could be something else but
> for the "container per user" model whatever a container is must be able
> to hold all possible uid_t values. So we can certainly do something like
>
> typedef uid_t container_id_t;

What about cid_t? Google mentions cid_t was used in HP-UX specific IPC (only if
_INCLUDE_HPUX_SOURCE is defined).
Re: [PATCH 2/6] BC: beancounters core (API) [message #5621 is a reply to message #5612] Thu, 24 August 2006 15:00 Go to previous messageGo to next message
Andrew Morton is currently offline  Andrew Morton
Messages: 127
Registered: December 2005
Senior Member
On Thu, 24 Aug 2006 16:06:11 +0400
Kirill Korotaev <dev@sw.ru> wrote:

> >>+#define bc_charge_locked(bc, r, v, s) (0)
> >>> +#define bc_charge(bc, r, v) (0)
> >
> >akpm:/home/akpm> cat t.c
> >void foo(void)
> >{
> > (0);
> >}
> >akpm:/home/akpm> gcc -c -Wall t.c
> >t.c: In function 'foo':
> >t.c:4: warning: statement with no effect
>
> these functions return value should always be checked (!).

We have __must_check for that.

> i.e. it is never called like:
> ub_charge(bc, r, v);

Also...

if (bc_charge(tpyo, undefined_variable, syntax_error))

will happily compile if !CONFIG_BEANCOUNTER.

Turning these stubs into static inline __must_check functions fixes all this.
Re: [PATCH 6/6] BC: kernel memory accounting (marks) [message #5622 is a reply to message #5602] Thu, 24 August 2006 15:52 Go to previous messageGo to next message
Dave Hansen is currently offline  Dave Hansen
Messages: 240
Registered: October 2005
Senior Member
On Thu, 2006-08-24 at 11:30 +0200, Geert Uytterhoeven wrote:
> > +#define THREAD_SHIFT 1
> ^
> Shouldn't this be 13?

Yep. Thanks!

I need to go over the whole things again and proofread.

-- Dave
Re: [PATCH 5/6] BC: kernel memory accounting (core) [message #5625 is a reply to message #5547] Thu, 24 August 2006 16:59 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 08/23, Kirill Korotaev wrote:
>
> +int bc_slab_charge(kmem_cache_t *cachep, void *objp, gfp_t flags)
> +{
> + unsigned int size;
> + struct beancounter *bc, **slab_bcp;
> +
> + bc = get_exec_bc();
> + if (bc == NULL)
> + return 0;

Is it possible to .exec_bc == NULL ?

If yes, why do we need init_bc? We can do 'set_exec_bc(NULL)' in __do_IRQ()
instead.

Oleg.
Re: [PATCH 2/6] BC: beancounters core (API) [message #5626 is a reply to message #5544] Thu, 24 August 2006 17:09 Go to previous messageGo to next message
Oleg Nesterov is currently offline  Oleg Nesterov
Messages: 143
Registered: August 2006
Senior Member
On 08/23, Kirill Korotaev wrote:
>
> +struct beancounter *beancounter_findcreate(uid_t uid, int mask)
> +{
> + struct beancounter *new_bc, *bc;
> + unsigned long flags;
> + struct hlist_head *slot;
> + struct hlist_node *pos;
> +
> + slot = &bc_hash[bc_hash_fun(uid)];
> + new_bc = NULL;
> +
> +retry:
> + spin_lock_irqsave(&bc_hash_lock, flags);
> + hlist_for_each_entry (bc, pos, slot, hash)
> + if (bc->bc_id == uid)
> + break;
> +
> + if (pos != NULL) {
> + get_beancounter(bc);
> + spin_unlock_irqrestore(&bc_hash_lock, flags);
> +
> + if (new_bc != NULL)
> + kmem_cache_free(bc_cachep, new_bc);
> + return bc;
> + }
> +
> + if (!(mask & BC_ALLOC))
> + goto out_unlock;

Very minor nit: it is not clear why we are doing this check under
bc_hash_lock. I'd suggest to do

if (!(mask & BC_ALLOC))
goto out;

after unlock(bc_hash_lock) and kill out_unlock label.

> + if (new_bc != NULL)
> + goto out_install;
> +
> + spin_unlock_irqrestore(&bc_hash_lock, flags);
> +
> + new_bc = kmem_cache_alloc(bc_cachep,
> + mask & BC_ALLOC_ATOMIC ? GFP_ATOMIC : GFP_KERNEL);
> + if (new_bc == NULL)
> + goto out;
> +
> + memcpy(new_bc, &default_beancounter, sizeof(*new_bc));

May be it is just me, but I need a couple of seconds to parse this 'memcpy'.
How about

*new_bc = default_beancounter;

?

Oleg.
Re: [ckrm-tech] [PATCH 1/6] BC: kconfig [message #5637 is a reply to message #5610] Thu, 24 August 2006 22:23 Go to previous messageGo to next message
Matt Helsley is currently offline  Matt Helsley
Messages: 86
Registered: August 2006
Member
On Thu, 2006-08-24 at 15:47 +0400, Kirill Korotaev wrote:
> > Is there a reason why these can be moved to a arch-neutral place ?
> I think a good place for BC would be kernel/Kconfig.bc

I think kernel/bc/Kconfig is fine.

> but still this should be added into archs.
> ok?

Sourcing the bc Kconfig from arch Kconfigs would seem to suggest that
resource management is only possible on a proper subset of archs. Since
this is not the case wouldn't it be better to source the bc Kconfig from
an arch-independent Kconfig (init/Kconfig for example)?

> > PS: Please keep ckrm-tech on Cc: please.

Also, thank you for CC'ing CKRM-Tech with your earlier posting of these
patches.

> Sorry, it is very hard to track emails coming from authors and 3 mailing lists.

Yes, it can be difficult to keep track of all the email authors.

> Better tell me the interested people emails.

CC'ing only the known-interested people wouldn't be better. If anything
I think it's harder for everyone than simply CC'ing a relevant mailing
list like LKML and CKRM-Tech in this case.

Cheers,
-Matt Helsley
Re: [PATCH 5/6] BC: kernel memory accounting (core) [message #5648 is a reply to message #5625] Fri, 25 August 2006 10:06 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Oleg Nesterov wrote:
> On 08/23, Kirill Korotaev wrote:
>
>>+int bc_slab_charge(kmem_cache_t *cachep, void *objp, gfp_t flags)
>>+{
>>+ unsigned int size;
>>+ struct beancounter *bc, **slab_bcp;
>>+
>>+ bc = get_exec_bc();
>>+ if (bc == NULL)
>>+ return 0;
>
>
> Is it possible to .exec_bc == NULL ?
>
> If yes, why do we need init_bc? We can do 'set_exec_bc(NULL)' in __do_IRQ()
> instead.
no, exec_bc can't be NULL. thanks for catching old hunks which historically exist
due to old times when host system was not accounted (bc was NULL).

Thanks,
Kirill
Re: [PATCH 2/6] BC: beancounters core (API) [message #5649 is a reply to message #5621] Fri, 25 August 2006 10:51 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Andrew Morton wrote:
> On Thu, 24 Aug 2006 16:06:11 +0400
> Kirill Korotaev <dev@sw.ru> wrote:
>
>
>>>>+#define bc_charge_locked(bc, r, v, s) (0)
>>>>
>>>>>+#define bc_charge(bc, r, v) (0)
>>>
>>>akpm:/home/akpm> cat t.c
>>>void foo(void)
>>>{
>>> (0);
>>>}
>>>akpm:/home/akpm> gcc -c -Wall t.c
>>>t.c: In function 'foo':
>>>t.c:4: warning: statement with no effect
>>
>>these functions return value should always be checked (!).
>
>
> We have __must_check for that.
>
>
>>i.e. it is never called like:
>> ub_charge(bc, r, v);
>
>
> Also...
>
> if (bc_charge(tpyo, undefined_variable, syntax_error))
>
> will happily compile if !CONFIG_BEANCOUNTER.
>
> Turning these stubs into static inline __must_check functions fixes all this.

ok. will replace all empty stubs with inlines (with __must_check where appropriate)

Thanks,
Kirill
Re: [PATCH 4/6] BC: user interface (syscalls) [message #5650 is a reply to message #5618] Fri, 25 August 2006 10:54 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Alexey Dobriyan wrote:
> On Thu, Aug 24, 2006 at 12:04:16PM +0100, Alan Cox wrote:
>
>>Ar Mer, 2006-08-23 am 21:35 -0700, ysgrifennodd Andrew Morton:
>>
>>>>Its a uid_t because of setluid() and twenty odd years of existing unix
>>>>practice.
>>>>
>>>
>>>I don't understand. This number is an identifier for an accounting
>>>container, which was somehow dreamed up by userspace.
>>
>>Which happens to be a uid_t. It could easily be anyother_t of itself and
>>you can create a container_id_t or whatever. It is just a number.
>>
>>The ancient Unix implementations of this kind of resource management and
>>security are built around setluid() which sets a uid value that cannot
>>be changed again and is normally used for security purposes. That
>>happened to be a uid_t and in simple setups at login uid = luid = euid
>>would be the norm.
>>
>>Thus the Linux one happens to be a uid_t. It could be something else but
>>for the "container per user" model whatever a container is must be able
>>to hold all possible uid_t values. So we can certainly do something like
>>
>>typedef uid_t container_id_t;
>
>
> What about cid_t? Google mentions cid_t was used in HP-UX specific IPC (only if
> _INCLUDE_HPUX_SOURCE is defined).
bcid_t?

Thanks,
Kirill
Re: [PATCH 1/6] BC: kconfig [message #5652 is a reply to message #5582] Fri, 25 August 2006 11:27 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Matt Helsley wrote:
> On Wed, 2006-08-23 at 15:04 -0700, Dave Hansen wrote:
>
>>On Wed, 2006-08-23 at 15:01 +0400, Kirill Korotaev wrote:
>>
>>>--- ./arch/sparc64/Kconfig.arkcfg 2006-07-17 17:01:11.000000000 +0400
>>>+++ ./arch/sparc64/Kconfig 2006-08-10 17:56:36.000000000 +0400
>>>@@ -432,3 +432,5 @@ source "security/Kconfig"
>>> source "crypto/Kconfig"
>>>
>>> source "lib/Kconfig"
>>>+
>>>+source "kernel/bc/Kconfig"
>>
>>...
>>
>>>--- ./arch/sparc64/Kconfig.arkcfg 2006-07-17 17:01:11.000000000 +0400
>>>+++ ./arch/sparc64/Kconfig 2006-08-10 17:56:36.000000000 +0400
>>>@@ -432,3 +432,5 @@ source "security/Kconfig"
>>> source "crypto/Kconfig"
>>>
>>> source "lib/Kconfig"
>>>+
>>>+source "kernel/bc/Kconfig"
>>
>>Is it just me, or do these patches look a little funky? Looks like it
>>is trying to patch the same thing into the same file, twice. Also, the
>>patches look to be -p0 instead of -p1.
>
>
> They do appear to be -p0
it is -p1. patches are generated with gendiff and ./ in names is for -p1

> They aren't adding the same thing twice to the same file. This patch
> makes different arches source the same Kconfig.
>
> I seem to recall Chandra suggested that instead of doing it this way it
> would be more appropriate to add the source line to init/Kconfig because
> it's more central and arch-independent. I tend to agree.
agreed. init/Kconfig looks like a good place for including
kernel/bc/Kconfig

Kirill
Re: [PATCH 1/6] BC: kconfig [message #5653 is a reply to message #5579] Fri, 25 August 2006 11:31 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Dave Hansen wrote:
> On Wed, 2006-08-23 at 15:01 +0400, Kirill Korotaev wrote:
>
>>--- ./arch/sparc64/Kconfig.arkcfg 2006-07-17 17:01:11.000000000 +0400
>>+++ ./arch/sparc64/Kconfig 2006-08-10 17:56:36.000000000 +0400
>>@@ -432,3 +432,5 @@ source "security/Kconfig"
>> source "crypto/Kconfig"
>>
>> source "lib/Kconfig"
>>+
>>+source "kernel/bc/Kconfig"
>
> ...
>
>>--- ./arch/sparc64/Kconfig.arkcfg 2006-07-17 17:01:11.000000000 +0400
>>+++ ./arch/sparc64/Kconfig 2006-08-10 17:56:36.000000000 +0400
>>@@ -432,3 +432,5 @@ source "security/Kconfig"
>> source "crypto/Kconfig"
>>
>> source "lib/Kconfig"
>>+
>>+source "kernel/bc/Kconfig"
>
>
> Is it just me, or do these patches look a little funky? Looks like it
> is trying to patch the same thing into the same file, twice. Also, the
> patches look to be -p0 instead of -p1.
>
> I'm having a few problems applying them.
Oh, it's my fault. I pasted text twice :/

Kirill
Re: [PATCH] BC: resource beancounters (v2) [message #5654 is a reply to message #5570] Fri, 25 August 2006 11:47 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Andrew Morton wrote:
>>As the first step we want to propose for discussion
>>the most complicated parts of resource management:
>>kernel memory and virtual memory.
>
>
> The patches look reasonable to me - mergeable after updating them for
> today's batch of review commentlets.
sure. will do updates as long as there are reasonable comments.

> I have two high-level problems though.
>
> a) I don't yet have a sense of whether this implementation
> is appropriate/sufficient for the various other
> applications which people are working on.
>
> If the general shape is OK and we think this
> implementation can be grown into one which everyone can
> use then fine.
>
> And...
>
>
>>The patch set to be sent provides core for BC and
>>management of kernel memory only. Virtual memory
>>management will be sent in a couple of days.
>
>
> We need to go over this work before we can commit to the BC
> core. Last time I looked at the VM accounting patch it
> seemed rather unpleasing from a maintainability POV.
hmmm... in which regard?

> And, if I understand it correctly, the only response to a job
> going over its VM limits is to kill it, rather than trimming
> it. Which sounds like a big problem?
No, UBC virtual memory management refuses occur on mmap()'s.
Andrey Savochkin wrote already a brief summary on vm resource management:

------------- cut ----------------
The task of limiting a container to 4.5GB of memory bottles down to the
question: what to do when the container starts to use more than assigned
4.5GB of memory?

At this moment there are only 3 viable alternatives.

A) Have separate memory management for each container,
with separate buddy allocator, lru lists, page replacement mechanism.
That implies a considerable overhead, and the main challenge there
is sharing of pages between these separate memory managers.

B) Return errors on extension of mappings, but not on page faults, where
memory is actually consumed.
In this case it makes sense to take into account not only the size of used
memory, but the size of created mappings as well.
This is approximately what "privvmpages" accounting/limiting provides in
UBC.

C) Rely on OOM killer.
This is a fall-back method in UBC, for the case "privvmpages" limits
still leave the possibility to overload the system.

It would be nice, indeed, to invent something new.
The ideal mechanism would
- slow down the container over-using memory, to signal the user that
he is over his limits,
- at the same time this slowdown shouldn't lead to the increase of memory
usage: for example, a simple slowdown of apache web server would lead
to the growth of the number of serving children and consumption of more
memory while showing worse performance,
- and, at the same time, it shouldn't penalize the rest of the system from
the performance point of view...
May be this can be achieved via carefully tuned swapout mechanism together
with disk bandwidth management capable of tracking asynchronous write
requests, may be something else is required.
It's really a big challenge.

Meanwhile, I guess we can only make small steps in improving Linux resource
management features for this moment.
------------- cut ----------------

Thanks,
Kirill
Re: [PATCH] BC: resource beancounters (v2) [message #5655 is a reply to message #5654] Fri, 25 August 2006 14:30 Go to previous messageGo to next message
Andrew Morton is currently offline  Andrew Morton
Messages: 127
Registered: December 2005
Senior Member
On Fri, 25 Aug 2006 15:49:15 +0400
Kirill Korotaev <dev@sw.ru> wrote:

> > We need to go over this work before we can commit to the BC
> > core. Last time I looked at the VM accounting patch it
> > seemed rather unpleasing from a maintainability POV.
> hmmm... in which regard?

Little changes all over the MM code which might get accidentally broken.

> > And, if I understand it correctly, the only response to a job
> > going over its VM limits is to kill it, rather than trimming
> > it. Which sounds like a big problem?
> No, UBC virtual memory management refuses occur on mmap()'s.

That's worse, isn't it? Firstly it rules out big sparse mappings and secondly

mmap_and_use(80% of container size)
fork_and_immediately_exec(/bin/true)

will fail at the fork?


> Andrey Savochkin wrote already a brief summary on vm resource management:
>
> ------------- cut ----------------
> The task of limiting a container to 4.5GB of memory bottles down to the
> question: what to do when the container starts to use more than assigned
> 4.5GB of memory?
>
> At this moment there are only 3 viable alternatives.
>
> A) Have separate memory management for each container,
> with separate buddy allocator, lru lists, page replacement mechanism.
> That implies a considerable overhead, and the main challenge there
> is sharing of pages between these separate memory managers.
>
> B) Return errors on extension of mappings, but not on page faults, where
> memory is actually consumed.
> In this case it makes sense to take into account not only the size of used
> memory, but the size of created mappings as well.
> This is approximately what "privvmpages" accounting/limiting provides in
> UBC.
>
> C) Rely on OOM killer.
> This is a fall-back method in UBC, for the case "privvmpages" limits
> still leave the possibility to overload the system.
>

D) Virtual scan of mm's in the over-limit container

E) Modify existing physical scanner to be able to skip pages which
belong to not-over-limit containers.

F) Something else ;)
Re: [PATCH] BC: resource beancounters (v2) [message #5656 is a reply to message #5655] Fri, 25 August 2006 14:48 Go to previous messageGo to next message
Andi Kleen is currently offline  Andi Kleen
Messages: 33
Registered: February 2006
Member
> D) Virtual scan of mm's in the over-limit container
>
> E) Modify existing physical scanner to be able to skip pages which
> belong to not-over-limit containers.

The same applies to dentries/inodes too, doesn't it? But their
scan is already virtual so their LRUs could be just split into
a per container list (but then didn't Christoph L. plan to
rewrite that code anyways?)

For memory my guess would be that (E) would be easier than (D) for user/file
memory though less efficient.

-Andi
Re: [PATCH] BC: resource beancounters (v2) [message #5657 is a reply to message #5542] Fri, 25 August 2006 15:36 Go to previous messageGo to next message
Alan Cox is currently offline  Alan Cox
Messages: 48
Registered: May 2006
Member
Ar Sad, 2006-08-26 am 01:14 +1000, ysgrifennodd Nick Piggin:
> I still think doing simple accounting per-page would be a better way to
> go than trying to pin down all "user allocatable" kernel allocations.
> And would require all of about 2 hooks in the page allocator. And would
> track *actual* RAM allocated by that container.

You have a variety of kernel objects you want to worry about and they
have very differing properties.

Some are basically shared resources - page cache, dentries, inodes, etc
and can be balanced pretty well by the kernel (ok the dentries are a bit
of a problem right now). Others are very specific "owned" resources -
like file handles, sockets and vmas.

Tracking actual RAM use by container/user/.. isn't actually that
interesting. It's also inconveniently sub page granularity.

Its a whole seperate question whether you want a separate bean counter
limit for sockets, file handles, vmas etc.

Alan
Re: [PATCH] BC: resource beancounters (v2) [message #5659 is a reply to message #5655] Fri, 25 August 2006 15:14 Go to previous messageGo to next message
Nick Piggin is currently offline  Nick Piggin
Messages: 35
Registered: March 2006
Member
Andrew Morton wrote:
> On Fri, 25 Aug 2006 15:49:15 +0400
> Kirill Korotaev <dev@sw.ru> wrote:
>
>
>>>We need to go over this work before we can commit to the BC
>>>core. Last time I looked at the VM accounting patch it
>>>seemed rather unpleasing from a maintainability POV.
>>
>>hmmm... in which regard?
>
>
> Little changes all over the MM code which might get accidentally broken.

I still think doing simple accounting per-page would be a better way to
go than trying to pin down all "user allocatable" kernel allocations.
And would require all of about 2 hooks in the page allocator. And would
track *actual* RAM allocated by that container.

Can we continue that discussion (ie. why it isn't good enough). Last I
was told it is not perfect and can be unfair... sounds like it fits the
semantics perfectly ;)

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Re: BC: resource beancounters (v2) [message #5661 is a reply to message #5655] Fri, 25 August 2006 16:30 Go to previous messageGo to next message
Andrey Savochkin is currently offline  Andrey Savochkin
Messages: 47
Registered: December 2005
Member
On Fri, Aug 25, 2006 at 07:30:03AM -0700, Andrew Morton wrote:
> On Fri, 25 Aug 2006 15:49:15 +0400
> Kirill Korotaev <dev@sw.ru> wrote:
>
> > Andrey Savochkin wrote already a brief summary on vm resource management:
> >
> > ------------- cut ----------------
> > The task of limiting a container to 4.5GB of memory bottles down to the
> > question: what to do when the container starts to use more than assigned
> > 4.5GB of memory?
> >
> > At this moment there are only 3 viable alternatives.
> >
> > A) Have separate memory management for each container,
> > with separate buddy allocator, lru lists, page replacement mechanism.
> > That implies a considerable overhead, and the main challenge there
> > is sharing of pages between these separate memory managers.
> >
> > B) Return errors on extension of mappings, but not on page faults, where
> > memory is actually consumed.
> > In this case it makes sense to take into account not only the size of used
> > memory, but the size of created mappings as well.
> > This is approximately what "privvmpages" accounting/limiting provides in
> > UBC.
> >
> > C) Rely on OOM killer.
> > This is a fall-back method in UBC, for the case "privvmpages" limits
> > still leave the possibility to overload the system.
> >
>
> D) Virtual scan of mm's in the over-limit container
>
> E) Modify existing physical scanner to be able to skip pages which
> belong to not-over-limit containers.

I've actually tried (E), but it didn't work as I wished.

It didn't handle well shared pages.
Then, in my experiments such modified scanner was unable to regulate
quality-of-service. When I ran 2 over-the-limit containers, they worked
equally slow regardless of their limits and work set size.
That is, I didn't observe a smooth transition "under limit, maximum
performance" to "slightly over limit, a bit reduced performance" to
"significantly over limit, poor performance". Neither did I see any fairness
in how containers got penalized for exceeding their limits.

My explanation of what I observed is that
- since filesystem caches play a huge role in performance, page scanner will
be very limited in controlling container's performance if caches
stay shared between containers,
- in the absence of decent disk I/O manager, stalls due to swapin/swapout
are more influenced by disk subsystem than by page scanner policy.
So in fact modified page scanner provides control over memory usage only as
"stay under limits or die", and doesn't show many advantages over (B) or (C).
At the same time, skipping pages visibly penalizes "good citizens", not only
in disk bandwidth but in CPU overhead as well.

So I settled for (A)-(C) for now.
But it certainly would be interesting to hear if someone else makes such
experiments.

Best regards

Andrey
Re: BC: resource beancounters (v2) [message #5663 is a reply to message #5661] Fri, 25 August 2006 17:50 Go to previous messageGo to next message
Andrew Morton is currently offline  Andrew Morton
Messages: 127
Registered: December 2005
Senior Member
On Fri, 25 Aug 2006 20:30:26 +0400
Andrey Savochkin <saw@sw.ru> wrote:

> On Fri, Aug 25, 2006 at 07:30:03AM -0700, Andrew Morton wrote:
> >
> > D) Virtual scan of mm's in the over-limit container
> >
> > E) Modify existing physical scanner to be able to skip pages which
> > belong to not-over-limit containers.
>
> I've actually tried (E), but it didn't work as I wished.
>
> It didn't handle well shared pages.
> Then, in my experiments such modified scanner was unable to regulate
> quality-of-service. When I ran 2 over-the-limit containers, they worked
> equally slow regardless of their limits and work set size.
> That is, I didn't observe a smooth transition "under limit, maximum
> performance" to "slightly over limit, a bit reduced performance" to
> "significantly over limit, poor performance". Neither did I see any fairness
> in how containers got penalized for exceeding their limits.
>
> My explanation of what I observed is that
> - since filesystem caches play a huge role in performance, page scanner will
> be very limited in controlling container's performance if caches
> stay shared between containers,
> - in the absence of decent disk I/O manager, stalls due to swapin/swapout
> are more influenced by disk subsystem than by page scanner policy.
> So in fact modified page scanner provides control over memory usage only as
> "stay under limits or die", and doesn't show many advantages over (B) or (C).
> At the same time, skipping pages visibly penalizes "good citizens", not only
> in disk bandwidth but in CPU overhead as well.
>
> So I settled for (A)-(C) for now.
> But it certainly would be interesting to hear if someone else makes such
> experiments.
>

Makes sense. If one is looking for good machine partitioning then a shared
disk is obviously a great contention point. To address that we'd need to
be able to say "container A swaps to /dev/sda1 and container B swaps to
/dev/sdb1". But the swap system at present can't do that.
Re: BC: resource beancounters (v2) [message #5666 is a reply to message #5661] Fri, 25 August 2006 19:00 Go to previous messageGo to next message
Chandra Seetharaman is currently offline  Chandra Seetharaman
Messages: 88
Registered: August 2006
Member
Have you seen/tried the memory controller in CKRM/Resource Groups ?
http://sourceforge.net/projects/ckrm

It maintains a per resource group LRU lists and also maintains a list of
over-guarantee groups (with ordering based on where they are in their
guarantee-limit scale). So, when a reclaim needs to happen, pages are
first freed from a group that is way over its limit, and then the next
one and so on.

Few things that it does that are not good:
- doesn't account shared pages accurately
- moves all pages from a task when the task moves to a different group
- totally new reclamation path

regards,

chandra
On Fri, 2006-08-25 at 20:30 +0400, Andrey Savochkin wrote:
> On Fri, Aug 25, 2006 at 07:30:03AM -0700, Andrew Morton wrote:
> > On Fri, 25 Aug 2006 15:49:15 +0400
> > Kirill Korotaev <dev@sw.ru> wrote:
> >
> > > Andrey Savochkin wrote already a brief summary on vm resource management:
> > >
> > > ------------- cut ----------------
> > > The task of limiting a container to 4.5GB of memory bottles down to the
> > > question: what to do when the container starts to use more than assigned
> > > 4.5GB of memory?
> > >
> > > At this moment there are only 3 viable alternatives.
> > >
> > > A) Have separate memory management for each container,
> > > with separate buddy allocator, lru lists, page replacement mechanism.
> > > That implies a considerable overhead, and the main challenge there
> > > is sharing of pages between these separate memory managers.
> > >
> > > B) Return errors on extension of mappings, but not on page faults, where
> > > memory is actually consumed.
> > > In this case it makes sense to take into account not only the size of used
> > > memory, but the size of created mappings as well.
> > > This is approximately what "privvmpages" accounting/limiting provides in
> > > UBC.
> > >
> > > C) Rely on OOM killer.
> > > This is a fall-back method in UBC, for the case "privvmpages" limits
> > > still leave the possibility to overload the system.
> > >
> >
> > D) Virtual scan of mm's in the over-limit container
> >
> > E) Modify existing physical scanner to be able to skip pages which
> > belong to not-over-limit containers.
>
> I've actually tried (E), but it didn't work as I wished.
>
> It didn't handle well shared pages.
> Then, in my experiments such modified scanner was unable to regulate
> quality-of-service. When I ran 2 over-the-limit containers, they worked
> equally slow regardless of their limits and work set size.
> That is, I didn't observe a smooth transition "under limit, maximum
> performance" to "slightly over limit, a bit reduced performance" to
> "significantly over limit, poor performance". Neither did I see any fairness
> in how containers got penalized for exceeding their limits.
>
> My explanation of what I observed is that
> - since filesystem caches play a huge role in performance, page scanner will
> be very limited in controlling container's performance if caches
> stay shared between containers,
> - in the absence of decent disk I/O manager, stalls due to swapin/swapout
> are more influenced by disk subsystem than by page scanner policy.
> So in fact modified page scanner provides control over memory usage only as
> "stay under limits or die", and doesn't show many advantages over (B) or (C).
> At the same time, skipping pages visibly penalizes "good citizens", not only
> in disk bandwidth but in CPU overhead as well.
>
> So I settled for (A)-(C) for now.
> But it certainly would be interesting to hear if someone else makes such
> experiments.
>
> Best regards
>
> Andrey
--

------------------------------------------------------------ ----------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
------------------------------------------------------------ ----------
Re: BC: resource beancounters (v2) [message #5676 is a reply to message #5661] Sat, 26 August 2006 02:15 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Fri, 2006-08-25 at 20:30 +0400, Andrey Savochkin wrote:
> On Fri, Aug 25, 2006 at 07:30:03AM -0700, Andrew Morton wrote:
> > On Fri, 25 Aug 2006 15:49:15 +0400
> > Kirill Korotaev <dev@sw.ru> wrote:
> >
> > > Andrey Savochkin wrote already a brief summary on vm resource management:
> > >
> > > ------------- cut ----------------
> > > The task of limiting a container to 4.5GB of memory bottles down to the
> > > question: what to do when the container starts to use more than assigned
> > > 4.5GB of memory?
> > >
> > > At this moment there are only 3 viable alternatives.
> > >
> > > A) Have separate memory management for each container,
> > > with separate buddy allocator, lru lists, page replacement mechanism.
> > > That implies a considerable overhead, and the main challenge there
> > > is sharing of pages between these separate memory managers.
> > >

Yes, sharing of pages across different containers/managers will be a
problem. Why not just disallow that scenario (that is what fake nodes
proposal would also end up doing).

> > > B) Return errors on extension of mappings, but not on page faults, where
> > > memory is actually consumed.
> > > In this case it makes sense to take into account not only the size of used
> > > memory, but the size of created mappings as well.
> > > This is approximately what "privvmpages" accounting/limiting provides in
> > > UBC.
> > >

Keeping a tab on all the virtual mappings in a container must also be
troublesome. And IMO is not the right way to go...this is even a
stricter version of overcommit_memory...right?
>
> > > C) Rely on OOM killer.
> > > This is a fall-back method in UBC, for the case "privvmpages" limits
> > > still leave the possibility to overload the system.
> > >
> >
> > D) Virtual scan of mm's in the over-limit container
> >

This seems like an interesting choice. If we can quickly inactivate
some pages belonging to tasks in over_the_limit container.

> > E) Modify existing physical scanner to be able to skip pages which
> > belong to not-over-limit containers.
>
> I've actually tried (E), but it didn't work as I wished.
>
> It didn't handle well shared pages.
> Then, in my experiments such modified scanner was unable to regulate
> quality-of-service. When I ran 2 over-the-limit containers, they worked
> equally slow regardless of their limits and work set size.
> That is, I didn't observe a smooth transition "under limit, maximum
> performance" to "slightly over limit, a bit reduced performance" to
> "significantly over limit, poor performance". Neither did I see any fairness
> in how containers got penalized for exceeding their limits.
>

That sure is an interesting observation though I think it really depends
on if you are doing the same amount of work when counts have just gone
above the limits to the point where they are way over the limit.

> My explanation of what I observed is that
> - since filesystem caches play a huge role in performance, page scanner will
> be very limited in controlling container's performance if caches
> stay shared between containers,

Yeah, if a page is shared between containers then you can end up doing
nothing useful. And that is where containers dedicated to individual
filesystem could be useful.

> - in the absence of decent disk I/O manager, stalls due to swapin/swapout
> are more influenced by disk subsystem than by page scanner policy.
> So in fact modified page scanner provides control over memory usage only as
> "stay under limits or die", and doesn't show many advantages over (B) or (C).
> At the same time, skipping pages visibly penalizes "good citizens", not only
> in disk bandwidth but in CPU overhead as well.
>

Sure that CPU, disk and other variables will kick in when you start
swapping. But then apps are expected to suffer when gone over limit.
The drawback is the apps that are not hit the limit will also suffer,
but then that is where extra controllers like CPU will kick in.

Maybe, we have a flag for each container indicating whether the tasks
belonging to that container should be killed immediately or they are
okay to run with lower performance as far as they can.

-rohit
Re: [PATCH] BC: resource beancounters (v2) [message #5678 is a reply to message #5657] Sat, 26 August 2006 03:55 Go to previous messageGo to next message
Nick Piggin is currently offline  Nick Piggin
Messages: 35
Registered: March 2006
Member
Alan Cox wrote:
> Ar Sad, 2006-08-26 am 01:14 +1000, ysgrifennodd Nick Piggin:
>
>>I still think doing simple accounting per-page would be a better way to
>>go than trying to pin down all "user allocatable" kernel allocations.
>>And would require all of about 2 hooks in the page allocator. And would
>>track *actual* RAM allocated by that container.
>
>
> You have a variety of kernel objects you want to worry about and they
> have very differing properties.
>
> Some are basically shared resources - page cache, dentries, inodes, etc
> and can be balanced pretty well by the kernel (ok the dentries are a bit
> of a problem right now). Others are very specific "owned" resources -
> like file handles, sockets and vmas.

That's true (OTOH I'd argue it would still be very useful for things
like pagecache, so one container can't start a couple of 'dd' loops
and turn everyone else to crap). And while the sharing may not be
exactly captured, statistically things should balance over time.

So I'm not arguing about _also_ accounting resources that are limited
in other ways (than just the RAM they consume).

But as a DoS protection measure on RAM usage, trying to account all
kernel allocations that are user triggerable just sounds hard to
maintain, holey, ugly, invsive (and not perfect either -- in fact it
still isn't clear to me that it is any better than my proposal).

>
> Tracking actual RAM use by container/user/.. isn't actually that
> interesting. It's also inconveniently sub page granularity.

If it isn't interesting, then I don't think we want it (at least, until
someone does get an interest in it).

>
> Its a whole seperate question whether you want a separate bean counter
> limit for sockets, file handles, vmas etc.

Yeah that's fair enough. We obviously want to avoid exposing limits on
things that it doesn't make sense to limit, or that is a kernel
implementation detail as much as possible.

eg. so I would be happy to limit virtual address, less happy to limit
vmas alone (unless that is in the context of accounting their RAM usage
or their implied vaddr charge).

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Re: BC: resource beancounters (v2) [message #5681 is a reply to message #5676] Sat, 26 August 2006 16:16 Go to previous messageGo to next message
Alan Cox is currently offline  Alan Cox
Messages: 48
Registered: May 2006
Member
Ar Gwe, 2006-08-25 am 19:15 -0700, ysgrifennodd Rohit Seth:
> Yes, sharing of pages across different containers/managers will be a
> problem. Why not just disallow that scenario (that is what fake nodes
> proposal would also end up doing).

Because it destroys the entire point of using containers instead of
something like Xen - which is sharing. Also at the point I am using
beancounters per user I don't want glibc per use, libX11 per use glib
per use gtk per user etc..
Re: [PATCH] BC: resource beancounters (v2) [message #5698 is a reply to message #5656] Mon, 28 August 2006 08:27 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Andi Kleen wrote:
>>D) Virtual scan of mm's in the over-limit container
>>
>>E) Modify existing physical scanner to be able to skip pages which
>> belong to not-over-limit containers.
>
>
> The same applies to dentries/inodes too, doesn't it? But their
> scan is already virtual so their LRUs could be just split into
> a per container list (but then didn't Christoph L. plan to
> rewrite that code anyways?)
how do you propose to handle shared files in this case?

> For memory my guess would be that (E) would be easier than (D) for user/file
> memory though less efficient.

Kirill
Re: BC: resource beancounters (v2) [message #5707 is a reply to message #5681] Mon, 28 August 2006 16:48 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Sat, 2006-08-26 at 17:37 +0100, Alan Cox wrote:
> Ar Gwe, 2006-08-25 am 19:15 -0700, ysgrifennodd Rohit Seth:
> > Yes, sharing of pages across different containers/managers will be a
> > problem. Why not just disallow that scenario (that is what fake nodes
> > proposal would also end up doing).
>
> Because it destroys the entire point of using containers instead of
> something like Xen - which is sharing. Also at the point I am using
> beancounters per user I don't want glibc per use, libX11 per use glib
> per use gtk per user etc..
>
>

I'm not saying per use glibc etc. That will indeed be useless and bring
it to virtualization world. Just like fake node, one should be allowed
to use pages that are already in (for example) page cache- so that you
don't end up duplicating all shared stuff. But as far as charging is
concerned, charge it to container who either got the page in page cache
OR if FS based semantics exist then charge it to the container where the
file belongs. What I was suggesting is to not charge a page to
different counters.

-rohit
Re: Re: BC: resource beancounters (v2) [message #5709 is a reply to message #5707] Mon, 28 August 2006 17:40 Go to previous messageGo to next message
kir is currently offline  kir
Messages: 1645
Registered: August 2005
Location: Moscow, Russia
Senior Member

Rohit Seth wrote:
> On Sat, 2006-08-26 at 17:37 +0100, Alan Cox wrote:
>
>> Ar Gwe, 2006-08-25 am 19:15 -0700, ysgrifennodd Rohit Seth:
>>
>>> Yes, sharing of pages across different containers/managers will be a
>>> problem. Why not just disallow that scenario (that is what fake nodes
>>> proposal would also end up doing).
>>>
>> Because it destroys the entire point of using containers instead of
>> something like Xen - which is sharing. Also at the point I am using
>> beancounters per user I don't want glibc per use, libX11 per use glib
>> per use gtk per user etc..
>>
>>
>>
>
> I'm not saying per use glibc etc. That will indeed be useless and bring
> it to virtualization world. Just like fake node, one should be allowed
> to use pages that are already in (for example) page cache- so that you
> don't end up duplicating all shared stuff. But as far as charging is
> concerned, charge it to container who either got the page in page cache
> OR if FS based semantics exist then charge it to the container where the
> file belongs. What I was suggesting is to not charge a page to
> different counters.
>

Consider the following simple scenario: there are 50 containers
(numbered, say, 1 to 50) all sharing a single installation of Fedora
Core 5. They all run sshd, apache, syslogd, crond and some other stuff
like that. This is actually quite a real scenario.

In the world that you propose the container which was unlucky to start
first (probably the one with ID of either 1 or 50) will be charged for
all the memory, and all the
others will have most of their memory for free. And in such a world
per-container memory accounting or limiting is just not possible.
Re: Re: BC: resource beancounters (v2) [message #5721 is a reply to message #5709] Mon, 28 August 2006 22:28 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Mon, 2006-08-28 at 21:41 +0400, Kir Kolyshkin wrote:
> Rohit Seth wrote:
> >
> > I'm not saying per use glibc etc. That will indeed be useless and bring
> > it to virtualization world. Just like fake node, one should be allowed
> > to use pages that are already in (for example) page cache- so that you
> > don't end up duplicating all shared stuff. But as far as charging is
> > concerned, charge it to container who either got the page in page cache
> > OR if FS based semantics exist then charge it to the container where the
> > file belongs. What I was suggesting is to not charge a page to
> > different counters.
> >
>
> Consider the following simple scenario: there are 50 containers
> (numbered, say, 1 to 50) all sharing a single installation of Fedora
> Core 5. They all run sshd, apache, syslogd, crond and some other stuff
> like that. This is actually quite a real scenario.
>
> In the world that you propose the container which was unlucky to start
> first (probably the one with ID of either 1 or 50) will be charged for
> all the memory, and all the
> others will have most of their memory for free. And in such a world
> per-container memory accounting or limiting is just not possible.

If you are only having task based accounting then yes the first
container using a page will be charged. And when it hit its limit then
it will inactivate some of the pages. If some other container now uses
the same page (that got inactivated) again then this next container will
be charged for that page.

Though if we have file/directory based accounting then shared pages
belonging to /usr/lib or /usr/bin can go to a common container.

-rohit
Re: Re: BC: resource beancounters (v2) [message #5730 is a reply to message #5721] Tue, 29 August 2006 10:15 Go to previous messageGo to next message
Alan Cox is currently offline  Alan Cox
Messages: 48
Registered: May 2006
Member
Ar Llu, 2006-08-28 am 15:28 -0700, ysgrifennodd Rohit Seth:
> Though if we have file/directory based accounting then shared pages
> belonging to /usr/lib or /usr/bin can go to a common container.

So that one user can map all the spare libraries and config files and
DoS the system by preventing people from accessing the libraries they do
need ?
Re: [PATCH 6/6] BC: kernel memory accounting (marks) [message #5731 is a reply to message #5573] Tue, 29 August 2006 09:52 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Dave Hansen wrote:
> I'm still a bit concerned about if we actually need the 'struct page'
> pointer. I've gone through all of the users, and I'm not sure that I
> see any that _require_ having a pointer in 'struct page'. I think it
> will take some rework, especially with the pagetables, but it should be
> quite doable.
don't worry:
1. we will introduce a separate patch moving this pointer
into mirroring array
2. this pointer is still required for _user_ pages tracking,
that's why I don't follow your suggestion right now...

> vmalloc:
> Store in vm_struct
> fd_set_bits:
> poll_get:
> mount hashtable:
> Don't need alignment. use the slab?
> pagetables:
> either store in an extra field of 'struct page', or use the
> mm's. mm should always be available when alloc/freeing a
> pagetable page
>
> Did I miss any?
flocks, pipe buffers, task_struct, sighand, signal, vmas,
posix timers, uid_cache, shmem dirs,

Thanks,
Kirill
Re: [PATCH 6/6] BC: kernel memory accounting (marks) [message #5739 is a reply to message #5583] Tue, 29 August 2006 14:34 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Dave Hansen wrote:
> On Wed, 2006-08-23 at 15:08 +0400, Kirill Korotaev wrote:
>
>> include/asm-i386/thread_info.h | 4 ++--
>> include/asm-ia64/pgalloc.h | 24 +++++++++++++++++-------
>> include/asm-x86_64/pgalloc.h | 12 ++++++++----
>> include/asm-x86_64/thread_info.h | 5 +++--
>
>
> Do you think we need to cover a few more architectures before
> considering merging this, or should we just fix them up as we need them?
I think doing a part of job is usually bad as it never gets fixed fully then :/

> I'm working on a patch to unify as many of the alloc_thread_info()
> functions as I can. That should at least give you one place to modify
> and track the thread_info allocations. I've only compiled for x86_64
> and i386, but I'm working on more. A preliminary version is attached.
Oh, I think such code unification is nice!
Would be perferct if all of them could be merged to some generic
function. Please, keep me on CC!

Thanks,
Kirill
Re: [PATCH] BC: resource beancounters (v2) [message #5747 is a reply to message #5655] Tue, 29 August 2006 15:33 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Andrew Morton wrote:
> On Fri, 25 Aug 2006 15:49:15 +0400
> Kirill Korotaev <dev@sw.ru> wrote:
>
>
>>>We need to go over this work before we can commit to the BC
>>>core. Last time I looked at the VM accounting patch it
>>>seemed rather unpleasing from a maintainability POV.
>>
>>hmmm... in which regard?
>
>
> Little changes all over the MM code which might get accidentally broken.
>
>
>>>And, if I understand it correctly, the only response to a job
>>>going over its VM limits is to kill it, rather than trimming
>>>it. Which sounds like a big problem?
>>
>>No, UBC virtual memory management refuses occur on mmap()'s.
>
>
> That's worse, isn't it? Firstly it rules out big sparse mappings and secondly
1) if mappings are private then yes, you can not mmap too much. This is logical,
since this whole mappings are potentially allocatable and there is no way to control
it later except for SIGKILL.
2) if mappings are shared file mappings (shmem is handled in similar way) then
you can mmap as much as you want, since these pages can be reclaimed.

> mmap_and_use(80% of container size)
> fork_and_immediately_exec(/bin/true)
>
> will fail at the fork?
yes, it will fail on fork() or exec() in case of much private (1) mappings.
fail on fork() and exec() is much friendlier then SIGKILL, don't you think so?

private mappings parameter which is limited by UBC is a kind of upper estimation
of container RSS. From our experience such an estimation is ~ 5-20% higher then
a real physical memory used (with real-life applications).

>>Andrey Savochkin wrote already a brief summary on vm resource management:
>>
>>------------- cut ----------------
>>The task of limiting a container to 4.5GB of memory bottles down to the
>>question: what to do when the container starts to use more than assigned
>>4.5GB of memory?
>>
>>At this moment there are only 3 viable alternatives.
>>
>>A) Have separate memory management for each container,
>> with separate buddy allocator, lru lists, page replacement mechanism.
>> That implies a considerable overhead, and the main challenge there
>> is sharing of pages between these separate memory managers.
>>
>>B) Return errors on extension of mappings, but not on page faults, where
>> memory is actually consumed.
>> In this case it makes sense to take into account not only the size of used
>> memory, but the size of created mappings as well.
>> This is approximately what "privvmpages" accounting/limiting provides in
>> UBC.
>>
>>C) Rely on OOM killer.
>> This is a fall-back method in UBC, for the case "privvmpages" limits
>> still leave the possibility to overload the system.
>>
>
>
> D) Virtual scan of mm's in the over-limit container
>
> E) Modify existing physical scanner to be able to skip pages which
> belong to not-over-limit containers.
>
> F) Something else ;)
We fully agree that other possible algorithms can and should exist.
My idea only is that any of them would need accounting anyway
(which is the most part of beancounters).
Throtling, modified scanners etc. can be implemented as a separate
BC parameters. Thus, an administrator will be able to select
which policy should be applied to the container which is near its limit.

So the patches I'm trying to send are a step-by-step accounting of all
the resources and their simple limitations. More comprehensive limitation
policy will be built on top of it later.

BTW, UBC page beancounters allow to distinguish pages used by only one
container and pages which are shared. So scanner can try to reclaim
container private pages first, thus not influencing other containers.

Thanks,
Kirill
Re: [PATCH 6/6] BC: kernel memory accounting (marks) [message #5748 is a reply to message #5548] Tue, 29 August 2006 15:53 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Dave Hansen wrote:
> On Tue, 2006-08-29 at 13:52 +0400, Kirill Korotaev wrote:
>
>>1. we will introduce a separate patch moving this pointer
>> into mirroring array
>>2. this pointer is still required for _user_ pages tracking,
>> that's why I don't follow your suggestion right now...
>
>
> You hadn't mentioned that part. ;)
I was crying about it 2 or 3 times :)
But no suprise that you hadn't notice that due to too many emails on BC.

> I guess we'll wait for the user patches before these can go any further.
:))) are you the one to decide? :) then lets drink some beer :)))

Thanks,
Kirill
Re: [PATCH 6/6] BC: kernel memory accounting (marks) [message #5751 is a reply to message #5731] Tue, 29 August 2006 15:48 Go to previous messageGo to next message
Dave Hansen is currently offline  Dave Hansen
Messages: 240
Registered: October 2005
Senior Member
On Tue, 2006-08-29 at 13:52 +0400, Kirill Korotaev wrote:
> 1. we will introduce a separate patch moving this pointer
> into mirroring array
> 2. this pointer is still required for _user_ pages tracking,
> that's why I don't follow your suggestion right now...

You hadn't mentioned that part. ;)

I guess we'll wait for the user patches before these can go any further.

-- Dave
Re: Re: BC: resource beancounters (v2) [message #5752 is a reply to message #5730] Tue, 29 August 2006 17:30 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Tue, 2006-08-29 at 11:15 +0100, Alan Cox wrote:
> Ar Llu, 2006-08-28 am 15:28 -0700, ysgrifennodd Rohit Seth:
> > Though if we have file/directory based accounting then shared pages
> > belonging to /usr/lib or /usr/bin can go to a common container.
>
> So that one user can map all the spare libraries and config files and
> DoS the system by preventing people from accessing the libraries they do
> need ?
>

Well, there is a risk whenever there is sharing across containers. The
point though is, give the choice to sysadmin to configure the platform
the way it is appropriate.

-rohit
Re: Re: BC: resource beancounters (v2) [message #5755 is a reply to message #5752] Tue, 29 August 2006 18:46 Go to previous messageGo to next message
Alan Cox is currently offline  Alan Cox
Messages: 48
Registered: May 2006
Member
Ar Maw, 2006-08-29 am 10:30 -0700, ysgrifennodd Rohit Seth:
> On Tue, 2006-08-29 at 11:15 +0100, Alan Cox wrote:
> > Ar Llu, 2006-08-28 am 15:28 -0700, ysgrifennodd Rohit Seth:
> > > Though if we have file/directory based accounting then shared pages
> > > belonging to /usr/lib or /usr/bin can go to a common container.
> >
> > So that one user can map all the spare libraries and config files and
> > DoS the system by preventing people from accessing the libraries they do
> > need ?
> >
>
> Well, there is a risk whenever there is sharing across containers. The
> point though is, give the choice to sysadmin to configure the platform
> the way it is appropriate.

In other words your suggestion doesn't actually work for the real world
cases like web serving.
Re: Re: BC: resource beancounters (v2) [message #5756 is a reply to message #5755] Tue, 29 August 2006 19:15 Go to previous message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Tue, 2006-08-29 at 20:06 +0100, Alan Cox wrote:
> Ar Maw, 2006-08-29 am 10:30 -0700, ysgrifennodd Rohit Seth:
> > On Tue, 2006-08-29 at 11:15 +0100, Alan Cox wrote:
> > > Ar Llu, 2006-08-28 am 15:28 -0700, ysgrifennodd Rohit Seth:
> > > > Though if we have file/directory based accounting then shared pages
> > > > belonging to /usr/lib or /usr/bin can go to a common container.
> > >
> > > So that one user can map all the spare libraries and config files and
> > > DoS the system by preventing people from accessing the libraries they do
> > > need ?
> > >
> >
> > Well, there is a risk whenever there is sharing across containers. The
> > point though is, give the choice to sysadmin to configure the platform
> > the way it is appropriate.
>
> In other words your suggestion doesn't actually work for the real world
> cases like web serving.
>

Containers are not going to solve all the problems particularly the
scenarios like when a machine is a web server and an odd user can log on
to the same machine and (w/o any ulimits) claim all the memory that is
present in the system.

Though it is quite possible to implement a combination of two (task and
fs based) policies in containers and sysadmin can set a preference of
each each container. [this probably is another reason for having a per
page container pointer].

-rohit
Re: [PATCH] BC: resource beancounters (v2) [message #5763 is a reply to message #5747] Tue, 29 August 2006 17:08 Go to previous message
Balbir Singh is currently offline  Balbir Singh
Messages: 491
Registered: August 2006
Senior Member
Kirill Korotaev wrote:
>>> ------------- cut ----------------
>>> The task of limiting a container to 4.5GB of memory bottles down to the
>>> question: what to do when the container starts to use more than assigned
>>> 4.5GB of memory?
>>>
>>> At this moment there are only 3 viable alternatives.
>>>
>>> A) Have separate memory management for each container,
>>> with separate buddy allocator, lru lists, page replacement mechanism.
>>> That implies a considerable overhead, and the main challenge there
>>> is sharing of pages between these separate memory managers.
>>>
>>> B) Return errors on extension of mappings, but not on page faults, where
>>> memory is actually consumed.
>>> In this case it makes sense to take into account not only the size
>>> of used
>>> memory, but the size of created mappings as well.
>>> This is approximately what "privvmpages" accounting/limiting
>>> provides in
>>> UBC.
>>>
>>> C) Rely on OOM killer.
>>> This is a fall-back method in UBC, for the case "privvmpages" limits
>>> still leave the possibility to overload the system.
>>>
>>
>>
>> D) Virtual scan of mm's in the over-limit container
>>
>> E) Modify existing physical scanner to be able to skip pages which
>> belong to not-over-limit containers.
>>
>> F) Something else ;)
> We fully agree that other possible algorithms can and should exist.
> My idea only is that any of them would need accounting anyway
> (which is the most part of beancounters).
> Throtling, modified scanners etc. can be implemented as a separate
> BC parameters. Thus, an administrator will be able to select
> which policy should be applied to the container which is near its limit.
>
> So the patches I'm trying to send are a step-by-step accounting of all
> the resources and their simple limitations. More comprehensive limitation
> policy will be built on top of it later.
>

One of the issues I see is that bean counters are not very flexible. Tasks
cannot change bean counters dynamically after fork()/exec() that is - can they?


> BTW, UBC page beancounters allow to distinguish pages used by only one
> container and pages which are shared. So scanner can try to reclaim
> container private pages first, thus not influencing other containers.
>

But can you select the specific container for which we intend to scan pages?

> Thanks,
> Kirill
>

--
Thanks,
Balbir Singh,
Linux Technology Center,
IBM Software Labs
Previous Topic: Re: pspace child_reaper
Next Topic: Re: pspace child_reaper
Goto Forum:
  


Current Time: Mon Nov 18 18:24:44 GMT 2024

Total time taken to generate the page: 0.03054 seconds