[RFC][PATCH 0/7] Resource controllers based on process containers
Re: [RFC][PATCH 2/7] RSS controller core [message #17795 is a reply to message #17786] Tue, 13 March 2007 15:05
Herbert Poetzl
On Tue, Mar 13, 2007 at 10:17:54AM +0300, Pavel Emelianov wrote:
> Herbert Poetzl wrote:
> > On Mon, Mar 12, 2007 at 12:02:01PM +0300, Pavel Emelianov wrote:
> >>>>> Maybe you have some ideas how we can decide on this?
> >>>> We need to work out what the requirements are before we can 
> >>>> settle on an implementation.
> >>> Linux-VServer (and probably OpenVZ):
> >>>
> >>>  - shared mappings of 'shared' files (binaries 
> >>>    and libraries) to allow for reduced memory
> >>>    footprint when N identical guests are running
> >> This is done in current patches.
> > 
> > nice, but the question was about _requirements_
> > (so your requirements are?)
> > 
> >>>  - virtual 'physical' limit should not cause
> >>>    swap out when there are still pages left on
> >>>    the host system (but pages of over limit guests
> >>>    can be preferred for swapping)
> >> So what to do when virtual physical limit is hit?
> >> OOM-kill current task?
> > 
> > when the RSS limit is hit, but there _are_ enough
> > pages left on the physical system, there is no
> > good reason to swap out the page at all
> > 
> >  - there is no benefit in doing so (performance
> >    wise, that is)
> > 
> >  - it actually hurts performance, and could
> >    become a separate source for DoS
> > 
> > what should happen instead (in an ideal world :)
> > is that the page is considered swapped out for
> > the guest (add guest penalty for swapout), and 
> 
> Does the page stay mapped for the container or not?
> If yes, then what's the use of limits? The container has
> mapped more pages than the limit allows, but all the pages
> are still in memory. Sounds weird.

sounds weird, but makes sense if you look at the full picture

just because the guest is over its page limit doesn't 
mean that you actually want the system to swap stuff
out, what you really want to happen is the following:

 - somehow mark those pages as 'gone' for the guest
 - penalize the guest (and only the guest) for the
   'virtual' swap/page operation
 - penalize the guest again for paging in the page
 - drop/swap/page out those pages when the host system
   decides to reclaim pages (from the host PoV)
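
for illustration, the bookkeeping for such a 'virtual swap' scheme could
look roughly like the sketch below; every name in it (guest_mem,
guest_charge_penalty, ...) is made up for the example and does not come
from any posted patch:

/* hypothetical per-guest counters, illustrative only */
struct guest_mem {
	atomic_t rss_pages;		/* pages currently charged to the guest */
	atomic_t vswap_pages;		/* pages marked 'virtually' swapped out */
	unsigned long rss_limit;
};

/* hypothetical penalty hook: e.g. a short delay or a scheduler charge */
static void guest_charge_penalty(struct guest_mem *gm)
{
	(void)gm;	/* left empty in this sketch */
}

/* guest is over its limit: mark a page as 'gone' for the guest only */
static void guest_virtual_swapout(struct guest_mem *gm)
{
	atomic_dec(&gm->rss_pages);
	atomic_inc(&gm->vswap_pages);
	guest_charge_penalty(gm);	/* first penalty: the 'virtual' swapout */
	/* the page itself stays in host RAM until the host decides to reclaim it */
}

/* guest touches a 'virtually swapped' page again */
static void guest_virtual_swapin(struct guest_mem *gm)
{
	atomic_dec(&gm->vswap_pages);
	atomic_inc(&gm->rss_pages);
	guest_charge_penalty(gm);	/* second penalty: the 'virtual' page-in */
	/* if still over the limit, another page gets virtually evicted */
}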

> > when the page would be swapped in again, the guest
> > takes a penalty (for the 'virtual' page in) and
> > the page is returned to the guest, possibly kicking
> > out (again virtually) a different page
> > 
> >>>  - accounting and limits have to be consistent
> >>>    and should roughly represent the actual used
> >>>    memory/swap (modulo optimizations, I can go
> >>>    into detail here, if necessary)
> >> This is true for the current implementation of
> >> both - this patchset and OpenVZ beancounters.
> >>
> >> If you sum up the physpages values for all containers
> >> you'll get the exact number of RAM pages used.
> > 
> > hmm, including or excluding the host pages?
> 
> Depends on whether you account host pages or not.

you tell me? or is that an option in OpenVZ?

best,
Herbert

> >>>  - OOM handling on a per guest basis, i.e. some
> >>>    out of memory condition in guest A must not
> >>>    affect guest B
> >> This is done in current patches.
> > 
> >> Herbert, did you look at the patches before
> >> sending this mail or do you just want to
> >> 'take part' in conversation w/o understanding
> >> of hat is going on?
> > 
> > again, the question was about requirements, not
> > your patches, and yes, I had a look at them _and_
> > the OpenVZ implementations ...
> > 
> > best,
> > Herbert
> > 
> > PS: hat is going on? :)
> > 
> >>> HTC,
> >>> Herbert
> >>>
> >>>> Sigh.  Who is running this show?   Anyone?
> >>>>
> >>>> You can actually do a form of overcommittment by allowing multiple
> >>>> containers to share one or more of the zones. Whether that is
> >>>> sufficient or suitable I don't know. That depends on the requirements,
> >>>> and we haven't even discussed those, let alone agreed to them.
> >>>>
> > 
Re: [RFC][PATCH 2/7] RSS controller core [message #17796 is a reply to message #10890] Tue, 13 March 2007 15:11
Herbert Poetzl
On Tue, Mar 13, 2007 at 06:10:55PM +0300, Kirill Korotaev wrote:
> >>So what to do when virtual physical limit is hit?
> >>OOM-kill current task?
> > 
> > 
> > when the RSS limit is hit, but there _are_ enough
> > pages left on the physical system, there is no
> > good reason to swap out the page at all
> > 
> >  - there is no benefit in doing so (performance
> >    wise, that is)
> > 
> >  - it actually hurts performance, and could
> >    become a separate source for DoS
> > 
> > what should happen instead (in an ideal world :)
> > is that the page is considered swapped out for
> > the guest (add guest penalty for swapout), and 
> > when the page would be swapped in again, the guest
> > takes a penalty (for the 'virtual' page in) and
> > the page is returned to the guest, possibly kicking
> > out (again virtually) a different page
> 
> great. I agree with that.

> Just curious why current vserver code kills arbitrary
> task in container then?

because it obviously lacks the finesse of OpenVZ code :)

seriously, handling the OOM kills inside a container
has never been a real world issue: once you are
really out of memory (and the OOM killer starts killing) you
usually have lost the game anyway (i.e. a guest restart
or similar is required to get your services up and
running again), and OOM killer decisions are not perfect
in mainline either. but you've probably seen the
FIXME and TODO entries in the code showing that this
is work in progress ...

> >>> - accounting and limits have to be consistent
> >>>   and should roughly represent the actual used
> >>>   memory/swap (modulo optimizations, I can go
> >>>   into detail here, if necessary)
> >>
> >>This is true for the current implementation of
> >>both - this patchset and OpenVZ beancounters.
> >>
> >>If you sum up the physpages values for all containers
> >>you'll get the exact number of RAM pages used.
> > 
> > 
> > hmm, including or excluding the host pages?
> 
> depends on whether you will include beancounter 0 usages or not :)

so that is an option then?

best,
Herbert

> Kirill
Re: [RFC][PATCH 1/7] Resource counters [message #17797 is a reply to message #17774] Tue, 13 March 2007 15:21
Herbert Poetzl
On Tue, Mar 13, 2007 at 03:09:06AM -0600, Eric W. Biederman wrote:
> Herbert Poetzl <herbert@13thfloor.at> writes:
> 
> > On Sun, Mar 11, 2007 at 01:00:15PM -0600, Eric W. Biederman wrote:
> >> Herbert Poetzl <herbert@13thfloor.at> writes:
> >> 
> >> >
> >> > Linux-VServer does the accounting with atomic counters,
> >> > so that works quite fine, just do the checks at the
> >> > beginning of whatever resource allocation and the
> >> > accounting once the resource is acquired ...
> >> 
> >> Atomic operations versus locks is only a granularity thing.
> >> You still need the cache line which is the cost on SMP.
> >> 
> >> Are you using atomic_add_return or atomic_add_unless or 
> >> are you performing your actions in two separate steps 
> >> which is racy? What I have seen indicates you are using 
> >> a racy two separate operation form.
> >
> > yes, this is the current implementation which
> > is more than sufficient, but I'm aware of the
> > potential issues here, and I have an experimental
> > patch sitting here which removes this race with
> > the following change:
> >
> >  - doesn't store the accounted value but
> >    limit - accounted (i.e. the free resource)
> >  - uses atomic_add_return() 
> >  - when negative, an error is returned and
> >    the resource amount is added back
> >
> > changes to the limit have to adjust the 'current'
> > value too, but that is again simple and atomic
> >
> > best,
> > Herbert
> >
> > PS: atomic_add_unless() didn't exist back then
> > (at least I think so) but that might be an option
> > too ...
> 
> I think as far as having this discussion if you can remove that race
> people will be more willing to talk about what vserver does.

well, shouldn't be a big deal to brush that patch up
(if somebody actually _is_ interested)
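
(for reference, the scheme quoted above - store 'limit - accounted' as
the free amount and charge with atomic_add_return() - boils down to
something like the following sketch; the names are illustrative, this
is not the actual Linux-VServer patch:)

/* illustrative: 'free' holds limit - accounted */
struct res_limit {
	atomic_t free;
};

/* try to account 'amount'; 0 on success, -ENOMEM when over the limit */
static int res_charge(struct res_limit *res, int amount)
{
	if (atomic_add_return(-amount, &res->free) < 0) {
		/* went negative: add the amount back and fail, no lock needed */
		atomic_add(amount, &res->free);
		return -ENOMEM;
	}
	return 0;
}

static void res_uncharge(struct res_limit *res, int amount)
{
	atomic_add(amount, &res->free);
}

/* changing the limit just adjusts 'free' by the delta, also atomically */
static void res_set_limit(struct res_limit *res, int old_limit, int new_limit)
{
	atomic_add(new_limit - old_limit, &res->free);
}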

> That said anything that uses locks or atomic operations (finer grained
> locks) because of the cache line ping pong is going to have scaling
> issues on large boxes.

right, but atomic ops have much less impact on most
architectures than locks :)

> So in that sense anything short of per cpu variables sucks at scale.
> That said I would much rather get a simple correct version without the
> complexity of per cpu counters, before we optimize the counters that
> much.

actually I thought about per cpu counters quite a lot, and
we (Linux-VServer) use them for accounting, but please
tell me how you use per cpu structures for implementing 
limits

TIA,
Herbert


> Eric
Re: [RFC][PATCH 2/7] RSS controller core [message #17800 is a reply to message #17777] Tue, 13 March 2007 16:06
Herbert Poetzl
On Tue, Mar 13, 2007 at 07:27:06AM +0530, Balbir Singh wrote:
> > hmm, it is very unlikely that this would happen,
> > for several reasons ... and indeed, checking the
> > thread in my mailbox shows that akpm dropped you ...
> 
> But, I got Andrew's email.
> 
> > --------------------------------------------------------------------
> > Subject: [RFC][PATCH 2/7] RSS controller core
> > From: Pavel Emelianov <xemul@sw.ru>
> > To: Andrew Morton <akpm@osdl.org>, Paul Menage <menage@google.com>,
> >         Srivatsa Vaddagiri <vatsa@in.ibm.com>,
> >         Balbir Singh <balbir@in.ibm.com>
> > Cc: containers@lists.osdl.org,
> >         Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
> > Date: Tue, 06 Mar 2007 17:55:29 +0300
> > --------------------------------------------------------------------
> > Subject: Re: [RFC][PATCH 2/7] RSS controller core
> > From: Andrew Morton <akpm@linux-foundation.org>
> > To: Pavel Emelianov <xemul@sw.ru>
> > Cc: Kirill@smtp.osdl.org, Linux@smtp.osdl.org, containers@lists.osdl.org,
> >         Paul Menage <menage@google.com>,
> >         List <linux-kernel@vger.kernel.org>
> > Date: Tue, 6 Mar 2007 14:00:36 -0800
> > --------------------------------------------------------------------
> > that's the one I 'group' replied to ...
> >
> > > Could you please not modify the "cc" list.
> >
> > I never modify the cc unless explicitly asked
> > to do so. I wish others would have it that way
> > too :)
> 
> That's good to know, but my mailer shows
> 
> 
> Andrew Morton <akpm@linux-foundation.org>
> 	to		Pavel Emelianov <xemul@sw.ru>	
> 	cc	
> 	Paul Menage <menage@google.com>,
> Srivatsa Vaddagiri <vatsa@in.ibm.com>,
> Balbir Singh <balbir@in.ibm.com> (see I am <<HERE>>),
> devel@openvz.org,
> Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
> containers@lists.osdl.org,
> Kirill Korotaev <dev@sw.ru>	
> 	date		Mar 7, 2007 3:30 AM	
> 	subject		Re: [RFC][PATCH 2/7] RSS controller core	
> 	mailed-by		vger.kernel.org	
> On Tue, 06 Mar 2007 17:55:29 +0300
> 
> and your reply as
> 
> Andrew Morton <akpm@linux-foundation.org>,
> Pavel Emelianov <xemul@sw.ru>,
> Kirill@smtp.osdl.org,
> Linux@smtp.osdl.org,
> containers@lists.osdl.org,
> Paul Menage <menage@google.com>,
> List <linux-kernel@vger.kernel.org>	
> 	to		Andrew Morton <akpm@linux-foundation.org>	
> 	cc	
> 	Pavel Emelianov <xemul@sw.ru>,
> Kirill@smtp.osdl.org,
> Linux@smtp.osdl.org,
> containers@lists.osdl.org,
> Paul Menage <menage@google.com>,
> List <linux-kernel@vger.kernel.org>	
> 	date		Mar 9, 2007 10:18 PM	
> 	subject		Re: [RFC][PATCH 2/7] RSS controller core	
> 	mailed-by		vger.kernel.org
> 
> I am not sure what went wrong. Could you please check your mail
> client, because it seemed to even change the email address to smtp.osdl.org,
> which bounced when I wrote to you earlier.

my mail client is not involved in receiving the emails,
so the email I replied to was already missing you in the cc
(i.e. I doubt that mutt would hide you from the cc if
it were present in the mailbox :)

maybe one of the mailing lists is removing recipients
according to some strange scheme?

here are the full headers for the email I replied to:

-8<------------------------------------------------------------------------
From containers-bounces@lists.osdl.org  Tue Mar  6 23:01:21 2007
Return-Path: containers-bounces@lists.osdl.org
X-Original-To: herbert@13thfloor.at
Delivered-To: herbert@13thfloor.at
Received: from smtp.osdl.org (smtp.osdl.org [65.172.181.24])
       	(using TLSv1 with cipher EDH-RSA-DES-CBC3-SHA (168/168 bits))
       	(No client certificate requested)
       	by mail.13thfloor.at (Postfix) with ESMTP id 0CD0F707FC
       	for <herbert@13thfloor.at>; Tue,  6 Mar 2007 23:00:52 +0100 (CET)
Received: from fire-2.osdl.org (localhost [127.0.0.1])
       	by smtp.osdl.org (8.12.8/8.12.8) with ESMTP id l26M0eqA023167;
       	Tue, 6 Mar 2007 14:00:47 -0800
Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6])
       	by smtp.osdl.org (8.12.8/8.12.8) with ESMTP id l26M0bq8023159
       	(version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO);
       	Tue, 6 Mar 2007 14:00:37 -0800
Received: from akpm.corp.google.com (shell0.pdx.osdl.net [10.9.0.31])
       	by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id l26M0ate010730;
       	Tue, 6 Mar 2007 14:00:36 -0800
Date: Tue, 6 Mar 2007 14:00:36 -0800
From: Andrew Morton <akpm@linux-foundation.org>
To: Pavel Emelianov <xemul@sw.ru>
Subject: Re: [RFC][PATCH 2/7] RSS controller core
Message-Id: <20070306140036.4e85bd2f.akpm@linux-foundation.org>
In-Reply-To: <45ED80E1.7030406@sw.ru>
References: <45ED7DEC.7010403@sw.ru>
       	<45ED80E1.7030406@sw.ru>
X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.6; i686-pc-linux-gnu)
Mime-Version: 1.0
X-Spam-Status: No, hits=-1.453 required=5
+tests=AWL,OSDL_HEADER_LISTID_KNOWN,OSDL_HEADER_SUBJECT_BRACKETED
X-Spam-Checker-Version: SpamAssassin 2.63-osdl_revision__1.119__
X-MIMEDefang-Filter: osdl$Revision: 1.176 $
Cc: Kirill@smtp.osdl.org, Linux@smtp.osdl.org, containers@lists.osdl.org,
       	Paul Menage <menage@google.com>,
       	List <linux-kernel@vger.kernel.org>
X-BeenThere: containers@lists.osdl.org
X-Mailman-Version: 2.1.8
Precedence: list
List-Id: Linux Containers <containers.lists.osdl.org>
List-Unsubscribe: <https://lists.osdl.org/mailman/listinfo/containers>,
       	<mailto:containers-request@lists.osdl.org?subject=unsubscribe>
List-Archive: <http://lists.osdl.org/pipermail/containers>
List-Post: <mailto:containers@lists.osdl.org>
List-Help: <mailto:containers-request@lists.osdl.org?subject=help>
List-Subscribe: <https://lists.osdl.org/mailman/listinfo/containers>,
       	<mailto:containers-request@lists.osdl.org?subject=subscribe>
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Sender: containers-bounces@lists.osdl.org
Errors-To: containers-bounces@lists.osdl.org
Received-SPF: pass (localhost is always allowed.)
Status: RO
X-Status: A
Content-Length: 854
Lines: 27

-8<------------------------------------------------------------------------

> > best,
> > Herbert
> >
> 
> Cheers,
> Balbir
Re: [RFC][PATCH 1/7] Resource counters [message #17802 is a reply to message #10889] Tue, 13 March 2007 16:32
Herbert Poetzl
On Tue, Mar 13, 2007 at 06:41:05PM +0300, Pavel Emelianov wrote:
> >>> PS: atomic_add_unless() didn't exist back then
> >>> (at least I think so) but that might be an option
> >>> too ...
> >> I think as far as having this discussion if you can remove that race
> >> people will be more willing to talk about what vserver does.
> > 
> > well, shouldn't be a big deal to brush that patch up
> > (if somebody actually _is_ interested)
> > 
> >> That said anything that uses locks or atomic operations (finer grained
> >> locks) because of the cache line ping pong is going to have scaling
> >> issues on large boxes.
> > 
> > right, but atomic ops have much less impact on most
> > architectures than locks :)
> 
> Right. But atomic_add_unless() is slower as it is
> essentially a loop. See my previous letter in this sub-thread.

fine, nobody actually uses atomic_add_unless(), or am I
missing something?

using two locks will be slower than using a single
lock, adding a loop which counts from 0 to 100 will
eat up some cpu, so what? don't do it :)

> >> So in that sense anything short of per cpu variables sucks at scale.
> >> That said I would much rather get a simple correct version without the
> >> complexity of per cpu counters, before we optimize the counters that
> >> much.
> > 
> > actually I thought about per cpu counters quite a lot, and
> > we (Linux-VServer) use them for accounting, but please
> > tell me how you use per cpu structures for implementing 
> > limits
> 
> Did you ever look at how get_empty_filp() works?
> I agree that this is not a "strict" limit, but it
> limits the usage with some "precision".
> 
> /* off-the-topic */ Herbert, you've lost Balbir again:
> a few messages up in this sub-thread Eric wrote a mail with
> Balbir in Cc:. The next reply from you doesn't include him.

I can happily add him to every email I reply to, but he
definitely isn't removed by my mailer (as I already stated,
it might be the mailing list which does this). the fact is, the
email arrives here without him in the cc, so a reply does
not contain him either ...

best,
Herbert

Re: [RFC][PATCH 2/7] RSS controller core [message #17803 is a reply to message #17761] Tue, 13 March 2007 15:10
dev (Kirill Korotaev)

>>So what to do when virtual physical limit is hit?
>>OOM-kill current task?
> 
> 
> when the RSS limit is hit, but there _are_ enough
> pages left on the physical system, there is no
> good reason to swap out the page at all
> 
>  - there is no benefit in doing so (performance
>    wise, that is)
> 
>  - it actually hurts performance, and could
>    become a separate source for DoS
> 
> what should happen instead (in an ideal world :)
> is that the page is considered swapped out for
> the guest (add guest penalty for swapout), and 
> when the page would be swapped in again, the guest
> takes a penalty (for the 'virtual' page in) and
> the page is returned to the guest, possibly kicking
> out (again virtually) a different page

great. I agree with that.
Just curious why current vserver code kills arbitrary
task in container then?


>>> - accounting and limits have to be consistent
>>>   and should roughly represent the actual used
>>>   memory/swap (modulo optimizations, I can go
>>>   into detail here, if necessary)
>>
>>This is true for the current implementation of
>>both - this patchset and OpenVZ beancounters.
>>
>>If you sum up the physpages values for all containers
>>you'll get the exact number of RAM pages used.
> 
> 
> hmm, including or excluding the host pages?

depends on whether you will include beancounter 0 usages or not :)

Kirill
Re: [RFC][PATCH 2/7] RSS controller core [message #17804 is a reply to message #17795] Tue, 13 March 2007 15:32
xemul (Pavel Emelianov)
Herbert Poetzl wrote:
> On Tue, Mar 13, 2007 at 10:17:54AM +0300, Pavel Emelianov wrote:
>> Herbert Poetzl wrote:
>>> On Mon, Mar 12, 2007 at 12:02:01PM +0300, Pavel Emelianov wrote:
>>>>>>> Maybe you have some ideas how we can decide on this?
>>>>>> We need to work out what the requirements are before we can 
>>>>>> settle on an implementation.
>>>>> Linux-VServer (and probably OpenVZ):
>>>>>
>>>>>  - shared mappings of 'shared' files (binaries 
>>>>>    and libraries) to allow for reduced memory
>>>>>    footprint when N identical guests are running
>>>> This is done in current patches.
>>> nice, but the question was about _requirements_
>>> (so your requirements are?)
>>>
>>>>>  - virtual 'physical' limit should not cause
>>>>>    swap out when there are still pages left on
>>>>>    the host system (but pages of over limit guests
>>>>>    can be preferred for swapping)
>>>> So what to do when virtual physical limit is hit?
>>>> OOM-kill current task?
>>> when the RSS limit is hit, but there _are_ enough
>>> pages left on the physical system, there is no
>>> good reason to swap out the page at all
>>>
>>>  - there is no benefit in doing so (performance
>>>    wise, that is)
>>>
>>>  - it actually hurts performance, and could
>>>    become a separate source for DoS
>>>
>>> what should happen instead (in an ideal world :)
>>> is that the page is considered swapped out for
>>> the guest (add guest penalty for swapout), and 
>> Does the page stay mapped for the container or not?
>> If yes, then what's the use of limits? The container has
>> mapped more pages than the limit allows, but all the pages
>> are still in memory. Sounds weird.
> 
> sounds weird, but makes sense if you look at the full picture
> 
> just because the guest is over its page limit doesn't 
> mean that you actually want the system to swap stuff
> out, what you really want to happen is the following:
> 
>  - somehow mark those pages as 'gone' for the guest
>  - penalize the guest (and only the guest) for the
>    'virtual' swap/page operation
>  - penalize the guest again for paging in the page
>  - drop/swap/page out those pages when the host system
>    decides to reclaim pages (from the host PoV)

Yeah! And slow down the container which caused the global
limit hit (w/o hitting its own limit!) by swapping
some other containers' pages out. This breaks the idea of isolation.

>>> when the page would be swapped in again, the guest
>>> takes a penalty (for the 'virtual' page in) and
>>> the page is returned to the guest, possibly kicking
>>> out (again virtually) a different page
>>>
>>>>>  - accounting and limits have to be consistent
>>>>>    and should roughly represent the actual used
>>>>>    memory/swap (modulo optimizations, I can go
>>>>>    into detail here, if necessary)
>>>> This is true for the current implementation of
>>>> both - this patchset and OpenVZ beancounters.
>>>>
>>>> If you sum up the physpages values for all containers
>>>> you'll get the exact number of RAM pages used.
>>> hmm, including or excluding the host pages?
>> Depends on whether you account host pages or not.
> 
> you tell me? or is that an option in OpenVZ?

In OpenVZ we account resources in the host system as well.
However, we have an option to turn this off.

> best,
> Herbert
> 
>>>>>  - OOM handling on a per guest basis, i.e. some
>>>>>    out of memory condition in guest A must not
>>>>>    affect guest B
>>>> This is done in current patches.
>>>> Herbert, did you look at the patches before
>>>> sending this mail or do you just want to
>>>> 'take part' in conversation w/o understanding
>>>> of hat is going on?
>>> again, the question was about requirements, not
>>> your patches, and yes, I had a look at them _and_
>>> the OpenVZ implementations ...
>>>
>>> best,
>>> Herbert
>>>
>>> PS: hat is going on? :)
>>>
>>>>> HTC,
>>>>> Herbert
>>>>>
>>>>>> Sigh.  Who is running this show?   Anyone?
>>>>>>
>>>>>> You can actually do a form of overcommittment by allowing multiple
>>>>>> containers to share one or more of the zones. Whether that is
>>>>>> sufficient or suitable I don't know. That depends on the requirements,
>>>>>> and we haven't even discussed those, let alone agreed to them.
>>>>>>
> 

Re: [RFC][PATCH 1/7] Resource counters [message #17805 is a reply to message #17797] Tue, 13 March 2007 15:41
xemul (Pavel Emelianov)
>>> PS: atomic_add_unless() didn't exist back then
>>> (at least I think so) but that might be an option
>>> too ...
>> I think as far as having this discussion if you can remove that race
>> people will be more willing to talk about what vserver does.
> 
> well, shouldn't be a big deal to brush that patch up
> (if somebody actually _is_ interested)
> 
>> That said anything that uses locks or atomic operations (finer grained
>> locks) because of the cache line ping pong is going to have scaling
>> issues on large boxes.
> 
> right, but atomic ops have much less impact on most
> architectures than locks :)

Right. But atomic_add_unless() is slower as it is
essentially a loop. See my previous letter in this sub-thread.

>> So in that sense anything short of per cpu variables sucks at scale.
>> That said I would much rather get a simple correct version without the
>> complexity of per cpu counters, before we optimize the counters that
>> much.
> 
> actually I thought about per cpu counters quite a lot, and
> we (Linux-VServer) use them for accounting, but please
> tell me how you use per cpu structures for implementing 
> limits

Did you ever look at how get_empty_filp() works?
I agree that this is not a "strict" limit, but it
limits the usage with some "precision".
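
(from memory, the pattern in get_empty_filp() is roughly the one below:
read the per-cpu counter cheaply, and only take the expensive exact sum
when the approximate value crosses the limit - a sketch, not the
verbatim kernel source:)

static struct percpu_counter nr_files;	/* cheap, slightly inaccurate */

static int charge_file(unsigned long max_files)
{
	/* fast path: approximate read, may be off by a few per cpu */
	if (percpu_counter_read_positive(&nr_files) >= max_files) {
		/* slow path: take the exact (expensive) sum before failing */
		if (percpu_counter_sum(&nr_files) >= max_files)
			return -ENFILE;
	}
	percpu_counter_inc(&nr_files);
	return 0;
}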

/* off-the-topic */ Herbert, you've lost Balbir again:
a few messages up in this sub-thread Eric wrote a mail with
Balbir in Cc:. The next reply from you doesn't include him.
Re: [RFC][PATCH 2/7] RSS controller core [message #17806 is a reply to message #17796] Tue, 13 March 2007 15:54
dev (Kirill Korotaev)

Herbert,

>>Just curious why current vserver code kills arbitrary
>>task in container then?
> 
> 
> because it obviously lacks the finesse of OpenVZ code :)
> 
> seriously, handling the OOM kills inside a container
> has never been a real world issue: once you are
> really out of memory (and the OOM killer starts killing) you
> usually have lost the game anyway (i.e. a guest restart
> or similar is required to get your services up and
> running again), and OOM killer decisions are not perfect
> in mainline either. but you've probably seen the
> FIXME and TODO entries in the code showing that this
> is work in progress ...

I'm not talking about the finesse of the code,
but rather about the lack of isolation,
i.e. one VE can affect others.

Kirill
Re: [RFC][PATCH 2/7] RSS controller core [message #17807 is a reply to message #17787] Tue, 13 March 2007 19:09
Alan Cox
> stuff is happening by comparing page->count and page->_mapcount, but it
> certainly wouldn't be conclusive.  But, does this kind of nonsense even
> happen in practice?  

"Is it useful for me as a bad guy to make it happen ?"

Alan
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code [message #17809 is a reply to message #17776] Tue, 13 March 2007 10:25
Nick Piggin
Eric W. Biederman wrote:
> Herbert Poetzl <herbert@13thfloor.at> writes:
> 
> 
>>On Mon, Mar 12, 2007 at 09:50:08AM -0700, Dave Hansen wrote:
>>
>>>On Mon, 2007-03-12 at 19:23 +0300, Kirill Korotaev wrote:
>>>
>>>>For these you essentially need a per-container page->_mapcount counter,
>>>>otherwise you can't detect whether the rss group still has the page
>>>>in question being mapped in its processes' address spaces or not. 
>>
>>>What do you mean by this?  You can always tell whether a process has a
>>>particular page mapped.  Could you explain the issue a bit more.  I'm
>>>not sure I get it.
>>
>>OpenVZ wants to account _shared_ pages in a guest
>>differently from separate pages, so that the RSS
>>accounted values reflect the actual used RAM instead
>>of the sum of all processes' RSS pages, which for
>>sure is more relevant to the administrator, but IMHO
>>not so terribly important to justify memory consuming
>>structures and sacrifice performance to get it right
>>
>>YMMV, but maybe we can find a smart solution to the
>>issue too :)
> 
> 
> I will tell you what I want.
> 
> I want a shared page cache that has nothing to do with RSS limits.
> 
> I want an RSS limit such that, once I know I can run a deterministic
> application with a fixed set of inputs in it, I know it will
> always run.
> 
> First touch page ownership does not guarantee to give me anything useful
> for knowing if I can run my application or not.  Because of page
> sharing my application might run inside the rss limit only because
> I got lucky and happened to share a lot of pages with another running
> application.  If the next time I run it that other application isn't
> running, my application will fail.  That is ridiculous.

Let's be practical here, what you're asking is basically impossible.

Unless by deterministic you mean that it never enters a non-trivial
syscall, in which case you just want to know about the maximum
RSS of the process (which we already account).

> I don't want sharing between vservers/VE/containers to affect how many
> pages I can have mapped into my processes at once.

You seem to want total isolation. You could use virtualization?

> Now sharing is sufficiently rare that I'm pretty certain that problems
> come up rarely.  So maybe these problems have not shown up in testing
> yet.  But until I see the proof that actually doing the accounting for
> sharing properly has intolerable overhead.  I want proper accounting
> not this hand waving that is only accurate on the third Tuesday of the
> month.

It is basically handwaving anyway. The only approach I've seen with
a sane (not perfect, but good) way of accounting memory use is this
one. If you care to define "proper", then we could discuss that.

-- 
SUSE Labs, Novell Inc.

Re: [RFC][PATCH 1/7] Resource counters [message #17810 is a reply to message #17805] Tue, 13 March 2007 16:07
Srivatsa Vaddagiri
On Tue, Mar 13, 2007 at 06:41:05PM +0300, Pavel Emelianov wrote:
> > right, but atomic ops have much less impact on most
> > architectures than locks :)
> 
> Right. But atomic_add_unless() is slower as it is
> essentially a loop. See my previous letter in this sub-thread.

If I am not mistaken, you shouldn't loop in normal cases, which means
it boils down to an atomic_read() + an atomic_cmpxchg()
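
For reference, the generic implementation at that time looked roughly
like this (the loop only retries when the cmpxchg loses a race):

static inline int atomic_add_unless(atomic_t *v, int a, int u)
{
	int c, old;

	c = atomic_read(v);
	for (;;) {
		if (unlikely(c == u))
			break;		/* hit the 'unless' value, don't add */
		old = atomic_cmpxchg(v, c, c + a);
		if (likely(old == c))
			break;		/* common case: one read + one cmpxchg */
		c = old;		/* lost a race, retry with the fresh value */
	}
	return c != u;
}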


-- 
Regards,
vatsa
Re: [RFC][PATCH 1/7] Resource counters [message #17815 is a reply to message #17810] Wed, 14 March 2007 07:12
xemul (Pavel Emelianov)
Srivatsa Vaddagiri wrote:
> On Tue, Mar 13, 2007 at 06:41:05PM +0300, Pavel Emelianov wrote:
>>> right, but atomic ops have much less impact on most
>>> architectures than locks :)
>> Right. But atomic_add_unless() is slower as it is
>> essentially a loop. See my previous letter in this sub-thread.
> 
> If I am not mistaken, you shouldn't loop in normal cases, which means
> it boils down to an atomic_read() + an atomic_cmpxchg()
> 
> 

So does the lock - in a normal case (when it's not
heavily contended) it will boil down to atomic_dec_and_test().

Nevertheless, making charge like in this patchset
requires two atomic ops with atomic_xxx and only
one with spin_lock().
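
(Schematically, with illustrative names - this is not the res_counter
code from the patchset - the contrast looks like the sketch below; the
second atomic op shows up on the undo path when the limit is overshot:)

struct counter_sketch {
	spinlock_t lock;
	unsigned long usage, limit;	/* protected by 'lock' */
	atomic_long_t free;		/* for the lock-free variant */
};

/* spin_lock variant: one locked section covers both check and update */
static int charge_locked(struct counter_sketch *c, unsigned long val)
{
	int ret = 0;

	spin_lock(&c->lock);
	if (c->usage + val > c->limit)
		ret = -ENOMEM;
	else
		c->usage += val;
	spin_unlock(&c->lock);
	return ret;
}

/* atomic variant: one atomic op to charge, a second one to undo on failure */
static int charge_atomic(struct counter_sketch *c, unsigned long val)
{
	if (atomic_long_sub_return(val, &c->free) < 0) {
		atomic_long_add(val, &c->free);
		return -ENOMEM;
	}
	return 0;
}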
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code [message #17817 is a reply to message #10892] Wed, 14 March 2007 06:42
Balbir Singh
Nick Piggin wrote:
> Eric W. Biederman wrote:
>> Nick Piggin <nickpiggin@yahoo.com.au> writes:
>>
>>
>>> Eric W. Biederman wrote:
>>>
>>>> First touch page ownership does not guarantee to give me anything useful
>>>> for knowing if I can run my application or not.  Because of page
>>>> sharing my application might run inside the rss limit only because
>>>> I got lucky and happened to share a lot of pages with another running
>>>> application.  If the next time I run it that other application isn't
>>>> running, my application will fail.  That is ridiculous.
>>>
>>> Let's be practical here, what you're asking is basically impossible.
>>>
>>> Unless by deterministic you mean that it never enters a non-trivial
>>> syscall, in which case you just want to know about the maximum
>>> RSS of the process (which we already account).
>>
>>
>> Not per process - I want this on a group of processes, and yes, that
>> is all I want.  I just want accounting of the maximum RSS of
>> a group of processes and then a mechanism to limit that maximum rss.
> 
> Well don't you just sum up the maximum for each process?
> 
> Or do you want to only count shared pages inside a container once,
> or something difficult like that?
> 
> 
>>>> I don't want sharing between vservers/VE/containers to affect how many
>>>> pages I can have mapped into my processes at once.
>>>
>>> You seem to want total isolation. You could use virtualization?
>>
>>
>> No.  I don't want the meaning of my rss limit to be affected by what
>> other processes are doing.  We have constraints of how many resources
>> the box actually has.  But I don't want accounting so sloppy that
>> processes outside my group of processes can artificially
>> lower my rss value, which magically raises my rss limit.
> 
> So what are you going to do about all the shared caches and slabs
> inside the kernel?
> 
> 
>>> It is basically handwaving anyway. The only approach I've seen with
>>> a sane (not perfect, but good) way of accounting memory use is this
>>> one. If you care to define "proper", then we could discuss that.
>>
>>
>> I will agree that this patchset is probably in the right general 
>> ballpark.
>> But the fact that pages are assigned exactly one owner is pure non-sense.
>> We can do better.  That is all I am asking for someone to at least 
>> attempt
>> to actually account for the rss of a group of processes and get the 
>> numbers
>> right when we have shared pages, between different groups of
>> processes.  We have the data structures to support this with rmap.
> 
> Well rmap only supports mapped, userspace pages.
> 
> 
>> Let me describe the situation where I think the accounting in the
>> patchset goes totally wonky.
>>
>> Gcc as I recall maps the pages it is compiling with mmap.
>> If in a single kernel tree I do:
>> make -jN O=../compile1 &
>> make -jN O=../compile2 &
>>
>> But set it up so that the two compiles are in different rss groups.
>> If I run them concurrently they will use the same files at the same
>> time, and most likely, because of the first touch rss limit rule, even
>> if I have a draconian rss limit the compiles will both be able to
>> complete and finish.   However, if I run either of them alone with the
>> most draconian rss limit that still allows both compiles to
>> finish, I won't be able to compile a single kernel tree.
> 
> Yeah it is not perfect. Fortunately, there is no perfect solution,
> so we don't have to be too upset about that.
> 
> And strangely, this example does not go outside the parameters of
> what you asked for AFAIKS. In the worst case of one container getting
> _all_ the shared pages, they will still remain inside their maximum
> rss limit.
> 

When that does happen and a container hits its limit, with an LRU
per-container, if the container is not actually using those pages,
they'll get thrown out of that container and get mapped into the
container that is using those pages most frequently.

> So they might get penalised a bit on reclaim, but maximum rss limits
> will work fine, and you can (almost) guarantee X amount of memory for
> a given container, and it will _work_.
> 
> But I also take back my comments about this being the only design I
> have seen that gets everything, because the node-per-container idea
> is a really good one on the surface. And it could mean even less impact
> on the core VM than this patch. That is also a first-touch scheme.
> 

With the proposed node-per-container, we will need to make massive core
VM changes to reorganize zones and nodes. We would want to allow

1. For sharing of nodes
2. Resizing nodes
3. May be more

With the node-per-container idea, it will be hard to control page cache
limits, independent of RSS limits or mlock limits.

NOTE: page cache == unmapped page cache here.

> 
>> However the messed up accounting that doesn't handle sharing between
>> groups of processes properly really bugs me.  Especially when we have
>> the infrastructure to do it right.
>>
>> Does that make more sense?
> 
> I think it is simplistic.
> 
> Sure you could probably use some of the rmap stuff to account shared
> mapped _user_ pages once for each container that touches them. And
> this patchset isn't preventing that.
> 
> But how do you account kernel allocations? How do you account unmapped
> pagecache?
> 
> What's the big deal so many accounting people have with just RSS? I'm
> not a container person, this is an honest question. Because from my
> POV if you conveniently ignore everything else... you may as well just
> not do any accounting at all.
> 

We decided to implement accounting and control in phases

1. RSS control
2. unmapped page cache control
3. mlock control
4. Kernel accounting and limits

This has several advantages

1. The limits can be individually set and controlled.
2. The code is broken down into simpler chunks for review and merging.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code [message #17818 is a reply to message #10892] Wed, 14 March 2007 07:48
Balbir Singh
Nick Piggin wrote:
> Balbir Singh wrote:
>> Nick Piggin wrote:
> 
>>> And strangely, this example does not go outside the parameters of
>>> what you asked for AFAIKS. In the worst case of one container getting
>>> _all_ the shared pages, they will still remain inside their maximum
>>> rss limit.
>>>
>>
>> When that does happen and a container hits its limit, with an LRU
>> per-container, if the container is not actually using those pages,
>> they'll get thrown out of that container and get mapped into the
>> container that is using those pages most frequently.
> 
> Exactly. Statistically, first touch will work OK. It may mean some
> reclaim inefficiencies in corner cases, but things will tend to
> even out.
> 

Exactly!

>>> So they might get penalised a bit on reclaim, but maximum rss limits
>>> will work fine, and you can (almost) guarantee X amount of memory for
>>> a given container, and it will _work_.
>>>
>>> But I also take back my comments about this being the only design I
>>> have seen that gets everything, because the node-per-container idea
>>> is a really good one on the surface. And it could mean even less impact
>>> on the core VM than this patch. That is also a first-touch scheme.
>>>
>>
>> With the proposed node-per-container, we will need to make massive core
>> VM changes to reorganize zones and nodes. We would want to allow
>>
>> 1. For sharing of nodes
>> 2. Resizing nodes
>> 3. May be more
> 
> But a lot of that is happening anyway for other reasons (eg. memory
> plug/unplug). And I don't consider node/zone setup to be part of the
> "core VM" as such... it is _good_ if we can move extra work into setup
> rather than have it in the mm.
> 
> That said, I don't think this patch is terribly intrusive either.
> 

Thanks, thats one of our goals, to keep it simple, understandable and
non-intrusive.

> 
>> With the node-per-container idea, it will be hard to control page cache
>> limits, independent of RSS limits or mlock limits.
>>
>> NOTE: page cache == unmapped page cache here.
> 
> I don't know that it would be particularly harder than any other
> first-touch scheme. If one container ends up being charged with too
> much pagecache, eventually they'll reclaim a bit of it and the pages
> will get charged to more frequent users.
> 
> 

Yes, true, but what if a user does not want to control the page
cache usage in a particular container or wants to turn off
RSS control?

>>>> However the messed up accounting that doesn't handle sharing between
>>>> groups of processes properly really bugs me.  Especially when we have
>>>> the infrastructure to do it right.
>>>>
>>>> Does that make more sense?
>>>
>>>
>>> I think it is simplistic.
>>>
>>> Sure you could probably use some of the rmap stuff to account shared
>>> mapped _user_ pages once for each container that touches them. And
>>> this patchset isn't preventing that.
>>>
>>> But how do you account kernel allocations? How do you account unmapped
>>> pagecache?
>>>
>>> What's the big deal so many accounting people have with just RSS? I'm
>>> not a container person, this is an honest question. Because from my
>>> POV if you conveniently ignore everything else... you may as well just
>>> not do any accounting at all.
>>>
>>
>> We decided to implement accounting and control in phases
>>
>> 1. RSS control
>> 2. unmapped page cache control
>> 3. mlock control
>> 4. Kernel accounting and limits
>>
>> This has several advantages
>>
>> 1. The limits can be individually set and controlled.
>> 2. The code is broken down into simpler chunks for review and merging.
> 
> But this patch gives the groundwork to handle 1-4, and it is in a small
> chunk, and one would be able to apply different limits to different types
> of pages with it. Just using rmap to handle 1 does not really seem like a
> viable alternative because it fundamentally isn't going to handle 2 or 4.
> 

For (2), we have the basic setup in the form of a per-container LRU list
and a pointer from struct page to the container that first brought in
the page.
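
Schematically (with made-up field names, not the ones in the posted
patches), that basic setup looks something like:

/* illustrative only, not the structures from the posted patches */
struct rss_container_sketch {
	struct list_head lru;		/* per-container LRU of charged pages */
	spinlock_t lru_lock;
	atomic_long_t rss;		/* pages currently charged */
	unsigned long limit;
};

/* one of these per charged page, linking the page to its container */
struct page_container_sketch {
	struct page *page;
	struct rss_container_sketch *cont;	/* container that first touched it */
	struct list_head lru;			/* linkage into cont->lru */
};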

> I'm not saying that you couldn't _later_ add something that uses rmap or
> our current RSS accounting to tweak container-RSS semantics. But isn't it
> sensible to lay the groundwork first? Get a clear path to something that
> is good (not perfect), but *works*?
> 

I agree with your development model suggestion. One of the things we are going
to do in the near future is to build (2) and then add (3) and (4). So far,
we've not encountered any difficulties building on top of (1).

Vaidy, any comments?

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
Re: [RFC][PATCH 2/7] RSS controller core [message #17819 is a reply to message #10890] Wed, 14 March 2007 20:42
Dave Hansen
On Wed, 2007-03-14 at 15:38 +0000, Mel Gorman wrote:
> On (13/03/07 10:05), Dave Hansen didst pronounce:
> > How do we determine what is shared, and goes into the shared zones?
> 
> Assuming we had a means of creating a zone that was assigned to a container,
> and a second zone for shared data between a set of containers: for shared
> data, the pages are allocated at page fault time. At that point, the
> faulting VMA is known and you also know if it's MAP_SHARED or not.

Well, but MAP_SHARED does not necessarily mean shared outside of the
container, right?  Somebody wishing to get around resource limits could
just MAP_SHARED any data they wished to use, and get it into the shared
area before their initial use, right?

How do normal read/write()s fit into this?

> > There's a conflict between the resize granularity of the zones, and the
> > storage space their lookup consumes.  We'd want a container to have a
> > limited ability to fill up memory with stuff like the dcache, so we'd
> > appear to need to put the dentries inside the software zone.  But, that
> > gets us to our inability to evict arbitrary dentries. 
> 
> Stuff like shrinking dentry caches is already pretty course-grained.
> Last I looked, we couldn't even shrink within a specific node, let alone
> a zone or a specific dentry. This is a separate problem.

I shouldn't have used dentries as an example.  I'm just saying that if
we end up (or can end up) with a whole ton of these software zones,
we might have troubles storing them.  I would imagine the issue would
come immediately from lack of page->flags to address lots of them.

> > After a while,
> > would containers tend to pin an otherwise empty zone into place?  We
> > could resize it, but what is the cost of keeping zones that can be
> > resized down to a small enough size that we don't mind keeping it there?
> > We could merge those "orphaned" zones back into the shared zone.
> 
> Merging "orphaned" zones back into the "main" zone would seem a sensible
> choice.

OK, but merging wouldn't be possible if they're not physically
contiguous.  I guess this could be worked around by just calling it a
shared zone, no matter where it is physically.

> > Were there any requirements about physical contiguity? 
> 
> For the lookup to software zone to be efficient, it would be easiest to have
> them as MAX_ORDER_NR_PAGES contiguous. This would avoid having to break the
> existing assumptions in the buddy allocator about MAX_ORDER_NR_PAGES
> always being in the same zone.

I was mostly wondering about zones spanning other zones.  We _do_
support this today, and it might make quite a bit more merging possible.

> > If we really do bind a set of processes strongly to a set of memory on a
> > set of nodes, then those really do become its home NUMA nodes.  If the
> > CPUs there get overloaded, running it elsewhere will continue to grab
> > pages from the home.  Would this basically keep us from ever being able
> > to move tasks around a NUMA system?
> 
> Moving the tasks around would not be easy. It would require a new zone
> to be created based on the new NUMA node and all the data migrated. hmm

I know we _try_ to avoid this these days, but I'm not sure how taking it
away as an option will affect anything.

-- Dave

Re: [RFC][PATCH 4/7] RSS accounting hooks over the code [message #17821 is a reply to message #17782] Wed, 14 March 2007 03:51
Nick Piggin
Eric W. Biederman wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> writes:
> 
> 
>>Eric W. Biederman wrote:
>>
>>>First touch page ownership does not guarantee to give me anything useful
>>>for knowing if I can run my application or not.  Because of page
>>>sharing my application might run inside the rss limit only because
>>>I got lucky and happened to share a lot of pages with another running
>>>application.  If the next time I run it that other application isn't
>>>running, my application will fail.  That is ridiculous.
>>
>>Let's be practical here, what you're asking is basically impossible.
>>
>>Unless by deterministic you mean that it never enters a non-trivial
>>syscall, in which case you just want to know about the maximum
>>RSS of the process (which we already account).
> 
> 
> Not per process - I want this on a group of processes, and yes, that
> is all I want.  I just want accounting of the maximum RSS of
> a group of processes and then a mechanism to limit that maximum rss.

Well don't you just sum up the maximum for each process?

Or do you want to only count shared pages inside a container once,
or something difficult like that?


>>>I don't want sharing between vservers/VE/containers to affect how many
>>>pages I can have mapped into my processes at once.
>>
>>You seem to want total isolation. You could use virtualization?
> 
> 
> No.  I don't want the meaning of my rss limit to be affected by what
> other processes are doing.  We have constraints of how many resources
> the box actually has.  But I don't want accounting so sloppy that
> processes outside my group of processes can artificially
> lower my rss value, which magically raises my rss limit.

So what are you going to do about all the shared caches and slabs
inside the kernel?


>>It is basically handwaving anyway. The only approach I've seen with
>>a sane (not perfect, but good) way of accounting memory use is this
>>one. If you care to define "proper", then we could discuss that.
> 
> 
> I will agree that this patchset is probably in the right general ballpark.
> But the fact that pages are assigned exactly one owner is pure non-sense.
> We can do better.  That is all I am asking for someone to at least attempt
> to actually account for the rss of a group of processes and get the numbers
> right when we have shared pages, between different groups of
> processes.  We have the data structures to support this with rmap.

Well rmap only supports mapped, userspace pages.


> Let me describe the situation where I think the accounting in the
> patchset goes totally wonky. 
> 
> 
> Gcc as I recall maps the pages it is compiling with mmap.
> If in a single kernel tree I do:
> make -jN O=../compile1 &
> make -jN O=../compile2 &
> 
> But set it up so that the two compiles are in different rss groups.
> If I run them concurrently they will use the same files at the same
> time, and most likely, because of the first touch rss limit rule, even
> if I have a draconian rss limit the compiles will both be able to
> complete and finish.   However, if I run either of them alone with the
> most draconian rss limit that still allows both compiles to
> finish, I won't be able to compile a single kernel tree.

Yeah it is not perfect. Fortunately, there is no perfect solution,
so we don't have to be too upset about that.

And strangely, this example does not go outside the parameters of
what you asked for AFAIKS. In the worst case of one container getting
_all_ the shared pages, they will still remain inside their maximum
rss limit.

So they might get penalised a bit on reclaim, but maximum rss limits
will work fine, and you can (almost) guarantee X amount of memory for
a given container, and it will _work_.

But I also take back my comments about this being the only design I
have seen that gets everything, because the node-per-container idea
is a really good one on the surface. And it could mean even less impact
on the core VM than this patch. That is also a first-touch scheme.


> However the messed up accounting that doesn't handle sharing between
> groups of processes properly really bugs me.  Especially when we have
> the infrastructure to do it right.
> 
> Does that make more sense?

I think it is simplistic.

Sure you could probably use some of the rmap stuff to account shared
mapped _user_ pages once for each container that touches them. And
this patchset isn't preventing that.

But how do you account kernel allocations? How do you account unmapped
pagecache?

What's the big deal so many accounting people have with just RSS? I'm
not a container person, this is an honest question. Because from my
POV if you conveniently ignore everything else... you may as well just
not do any accounting at all.

-- 
SUSE Labs, Novell Inc.

Re: [RFC][PATCH 2/7] RSS controller core [message #17822 is a reply to message #17783] Wed, 14 March 2007 15:38
mel (Mel Gorman)
On (13/03/07 10:05), Dave Hansen didst pronounce:
> On Tue, 2007-03-13 at 03:48 -0800, Andrew Morton wrote: 
> > If we use a physical zone-based containment scheme: fake-numa,
> > variable-sized zones, etc then it all becomes moot.  You set up a container
> > which has 1.5GB of physical memory then toss processes into it.  As that
> > process set increases in size it will toss out stray pages which shouldn't
> > be there, then it will start reclaiming and swapping out its own pages and
> > eventually it'll get an oom-killing.
> 
> I was just reading through the (comprehensive) thread about this from
> last week, so forgive me if I missed some of it.  The idea is really
> tempting, precisely because I don't think anyone really wants to have to
> screw with the reclaim logic.  
> 
> I'm just brain-dumping here, hoping that somebody has already thought
> through some of this stuff.  It's not a bitch-fest, I promise. :)
> 
> How do we determine what is shared, and goes into the shared zones?

Assuming we had a means of creating a zone that was assigned to a container,
and a second zone for shared data between a set of containers: for shared
data, the pages are allocated at page fault time. At that point, the
faulting VMA is known and you also know if it's MAP_SHARED or not.

The caller allocating the page would select (or create) a zonelist that
is appropriate for the container. For shared mappings, it would be one
zone - the shared zone for the set. For private mappings, it would be
one zone - the container's own private zone.
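
As a sketch of that selection step (everything below is hypothetical;
none of these structures or hooks exist in mainline):

/* hypothetical per-container memory info */
struct container_mem {
	struct zonelist *private_zonelist;	/* the container's own software zone */
	struct zonelist *shared_zonelist;	/* shared zone for the container set */
};

/* pick the zonelist to fault into, based on the faulting VMA */
static struct zonelist *container_fault_zonelist(struct container_mem *cm,
						 struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_SHARED)
		return cm->shared_zonelist;
	return cm->private_zonelist;
}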

For overcommit, the allowable zones for overcommit could be included.
Allowing overcommit opens the possibility for containers to interfere with
each other but I'm guessing that if overcommit is enabled, the administrator
is willing to live with that interference.

This has the awkward possibility of having two "shared" zones for two container
sets and one file that needs sharing. Similarly, there is a possibility for
having a container that has no shared zone and faulted in shared data. In
that case, the page ends up in the first faulting container set and it's
too bad it got "charged" for the page use on behalf of other containers. I'm
not sure there is a sane way of accounting this situation fairly.

I think it's important to note that once data is shared between containers
at all, they have the potential to interfere with each other (by reclaiming
within the shared zone, for example).

> Once we've allocated a page, it's too late because we already picked.

We'd choose the appropriate zonelist before faulting. Once allocated,
the page stays there.

> Do we just assume all page cache is shared?  Base it on filesystem,
> mount, ...?  Mount seems the most logical to me, that a sysadmin would
> have to set up a container's fs, anyway, and will likely be doing
> special things to shared data, anyway (r/o bind mounts :).
> 

I have no strong feelings here. To me, it's "who do I assign this fake
zone to?" I guess you would have at least one zone per container mount
for private data.

> There's a conflict between the resize granularity of the zones, and the
> storage space their lookup consumes.  We'd want a container to have a
> limited ability to fill up memory with stuff like the dcache, so we'd
> appear to need to put the dentries inside the software zone.  But, that
> gets us to our inability to evict arbitrary dentries. 

Stuff like shrinking dentry caches is already pretty coarse-grained.
Last I looked, we couldn't even shrink within a specific node, let alone
a zone or a specific dentry. This is a separate problem.

> After a while,
> would containers tend to pin an otherwise empty zone into place?  We
> could resize it, but what is the cost of keeping zones that can be
> resized down to a small enough size that we don't mind keeping it there?
> We could merge those "orphaned" zones back into the shared zone.

Merging "orphaned" zones back into the "main" zone would seem a sensible
choice.

> Were there any requirements about physical contiguity? 

For the page-to-software-zone lookup to be efficient, it would be easiest to
have them MAX_ORDER_NR_PAGES contiguous. This would avoid having to break the
existing assumption in the buddy allocator that MAX_ORDER_NR_PAGES pages
always sit in the same zone.
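
Roughly, the zone setup would just clamp whatever the admin asks for to those
boundaries; purely illustrative, not from any patchset:

static void clamp_soft_zone(unsigned long *start_pfn, unsigned long *nr_pages)
{
        /* keep a MAX_ORDER buddy block from straddling two software zones */
        *start_pfn = ALIGN(*start_pfn, MAX_ORDER_NR_PAGES);
        *nr_pages &= ~(MAX_ORDER_NR_PAGES - 1UL);
}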

> What about minimum
> zone sizes?
> 

MAX_ORDER_NR_PAGES would be the minimum zone size.

> If we really do bind a set of processes strongly to a set of memory on a
> set of nodes, then those really do become its home NUMA nodes.  If the
> CPUs there get overloaded, running it elsewhere will continue to grab
> pages from the home.  Would this basically keep us from ever being able
> to move tasks around a NUMA system?
> 

Moving the tasks around would not be easy. It would require a new zone
to be created based on the new NUMA node and all the data migrated. hmm

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 2/7] RSS controller core [message #17823 is a reply to message #17787] Wed, 14 March 2007 16:47
mel
Messages: 4
Registered: March 2007
Junior Member
On (13/03/07 10:26), Dave Hansen didst pronounce:
> On Mon, 2007-03-12 at 22:04 -0800, Andrew Morton wrote:
> > So these mmapped pages will contiue to be shared across all guests.  The
> > problem boils down to "which guest(s) get charged for each shared page".
> > 
> > A simple and obvious and easy-to-implement answer is "the guest which paged
> > it in".  I think we should firstly explain why that is insufficient.
> 
> My first worry was that this approach is unfair to the poor bastard that
> happened to get started up first.  If we have a bunch of containerized
> web servers, the poor guy who starts Apache first will pay the price for
> keeping it in memory for everybody else.
> 

I think it would be very difficult in practice to exploit a situation where
an evil guy forces another container to hold shared pages that the container
is not using itself.

> That said, I think this is naturally worked around. The guy charged
> unfairly will get reclaim started on himself sooner.  This will tend to
> page out those pages that he was being unfairly charged for. 

Exactly. That said, the "poor bastard" will have to be pretty determined
to get those pages paged out, because they will appear active, but it should
happen eventually, especially if the container is under pressure.

> Hopefully,
> they will eventually get pretty randomly (eventually evenly) spread
> among all users.  We just might want to make sure that we don't allow
> ptes (or other new references) to be re-established to pages like this
> when we're trying to reclaim them. 

I don't think anything like that currently exists. It's almost the opposite
of what the current reclaim algorithm would be trying to do because it has no
notion of containers. Currently, the idea of paging out something in active
use is a mad plan.

Maybe what would be needed is something where the shared page is unmapped from
page tables and the next faulter must copy the page instead of reestablishing
the PTE. The data copy is less than ideal but it'd be cheaper than reclaim
and help the accounting. However, it would require a counter to track "how
many processes in this container have mapped the page".
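
A minimal sketch of what I mean by that counter (struct container_page and the
helpers are invented for illustration, not something in the patchset):

struct container_page {
        struct container *owner;        /* container charged for the page */
        atomic_t cnt_mapcount;          /* mappers inside the owning container */
};

static void container_page_map(struct container_page *cp)
{
        atomic_inc(&cp->cnt_mapcount);
}

static int container_page_unmap(struct container_page *cp)
{
        /* true when the last mapper inside the container has gone away */
        return atomic_dec_and_test(&cp->cnt_mapcount);
}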

> Either that, or force the next
> toucher to take ownership of the thing.  But, that kind of arbitrary
> ownership transfer can't happen if we have rigidly defined boundaries
> for the containers.
> 

Right, charging the next toucher would not work in the zones case. The next
toucher would establish a PTE to the page which is still in the zone of the
container being unfairly charged.  It would need to be paged out or copied.

> The other concern is that the memory load on the system doesn't come
> from the first user ("the guy who paged it in").  The long-term load
> comes from "the guy who keeps using it."  The best way to exemplify this
> is somebody who read()s a page in, followed by another guy mmap()ing the
> same page.  The guy who did the read will get charged, and the mmap()er
> will get a free ride.  We could probably get an idea when this kind of
> stuff is happening by comparing page->count and page->_mapcount, but it
> certainly wouldn't be conclusive.  But, does this kind of nonsense even
> happen in practice?  
> 

I think this problem would happen with other accounting mechanisms as
well. However, it's more pronounced with zones because there are harder
limits on memory usage.

If the counter existed to track "how many processes in this container have
mapped the page", the problem of free-riders could be investigated by comparing
_mapcount to the container count. That would determine if additional steps
are required or not to force another container to assume the accounting cost.
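
With the hypothetical container_page sketched earlier, the free-rider check
itself would be cheap (page_mapcount() is real, the rest is illustration):

static int page_has_free_riders(struct page *page, struct container_page *cp)
{
        /* more mappers system-wide than inside the charged container? */
        return page_mapcount(page) > atomic_read(&cp->cnt_mapcount);
}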

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code [message #17824 is a reply to message #17817] Wed, 14 March 2007 06:57
Nick Piggin
Messages: 35
Registered: March 2006
Member
Balbir Singh wrote:
> Nick Piggin wrote:

>> And strangely, this example does not go outside the parameters of
>> what you asked for AFAIKS. In the worst case of one container getting
>> _all_ the shared pages, they will still remain inside their maximum
>> rss limit.
>>
> 
> When that does happen and if a container hits it limit, with a LRU
> per-container, if the container is not actually using those pages,
> they'll get thrown out of that container and get mapped into the
> container that is using those pages most frequently.

Exactly. Statistically, first touch will work OK. It may mean some
reclaim inefficiencies in corner cases, but things will tend to
even out.

>> So they might get penalised a bit on reclaim, but maximum rss limits
>> will work fine, and you can (almost) guarantee X amount of memory for
>> a given container, and it will _work_.
>>
>> But I also take back my comments about this being the only design I
>> have seen that gets everything, because the node-per-container idea
>> is a really good one on the surface. And it could mean even less impact
>> on the core VM than this patch. That is also a first-touch scheme.
>>
> 
> With the proposed node-per-container, we will need to make massive core
> VM changes to reorganize zones and nodes. We would want to allow
> 
> 1. For sharing of nodes
> 2. Resizing nodes
> 3. May be more

But a lot of that is happening anyway for other reasons (eg. memory
plug/unplug). And I don't consider node/zone setup to be part of the
"core VM" as such... it is _good_ if we can move extra work into setup
rather than have it in the mm.

That said, I don't think this patch is terribly intrusive either.


> With the node-per-container idea, it will hard to control page cache
> limits, independent of RSS limits or mlock limits.
> 
> NOTE: page cache == unmapped page cache here.

I don't know that it would be particularly harder than any other
first-touch scheme. If one container ends up being charged with too
much pagecache, eventually they'll reclaim a bit of it and the pages
will get charged to more frequent users.


>>> However the messed up accounting that doesn't handle sharing between
>>> groups of processes properly really bugs me.  Especially when we have
>>> the infrastructure to do it right.
>>>
>>> Does that make more sense?
>>
>>
>> I think it is simplistic.
>>
>> Sure you could probably use some of the rmap stuff to account shared
>> mapped _user_ pages once for each container that touches them. And
>> this patchset isn't preventing that.
>>
>> But how do you account kernel allocations? How do you account unmapped
>> pagecache?
>>
>> What's the big deal so many accounting people have with just RSS? I'm
>> not a container person, this is an honest question. Because from my
>> POV if you conveniently ignore everything else... you may as well just
>> not do any accounting at all.
>>
> 
> We decided to implement accounting and control in phases
> 
> 1. RSS control
> 2. unmapped page cache control
> 3. mlock control
> 4. Kernel accounting and limits
> 
> This has several advantages
> 
> 1. The limits can be individually set and controlled.
> 2. The code is broken down into simpler chunks for review and merging.

But this patch gives the groundwork to handle 1-4, and it is in a small
chunk, and one would be able to apply different limits to different types
of pages with it. Just using rmap to handle 1 does not really seem like a
viable alternative because it fundamentally isn't going to handle 2 or 4.
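
To illustrate what I mean by groundwork: the same counter machinery could
carry a separate limit per phase. res_counter is the type from patch 1/7, but
the grouping struct and field names below are invented, not part of the patchset:

struct container_mem_limits {
        struct res_counter rss;         /* 1. mapped user pages */
        struct res_counter pagecache;   /* 2. unmapped page cache */
        struct res_counter mlocked;     /* 3. mlock()ed pages */
        struct res_counter kmem;        /* 4. kernel allocations */
};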

I'm not saying that you couldn't _later_ add something that uses rmap or
our current RSS accounting to tweak container-RSS semantics. But isn't it
sensible to lay the groundwork first? Get a clear path to something that
is good (not perfect), but *works*?

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code [message #17826 is a reply to message #17818] Wed, 14 March 2007 13:25
Vaidyanathan Srinivas
Messages: 49
Registered: February 2007
Member
Balbir Singh wrote:
> Nick Piggin wrote:
>> Balbir Singh wrote:
>>> Nick Piggin wrote:
>>>> And strangely, this example does not go outside the parameters of
>>>> what you asked for AFAIKS. In the worst case of one container getting
>>>> _all_ the shared pages, they will still remain inside their maximum
>>>> rss limit.
>>>>
>>> When that does happen and if a container hits it limit, with a LRU
>>> per-container, if the container is not actually using those pages,
>>> they'll get thrown out of that container and get mapped into the
>>> container that is using those pages most frequently.
>> Exactly. Statistically, first touch will work OK. It may mean some
>> reclaim inefficiencies in corner cases, but things will tend to
>> even out.
>>
> 
> Exactly!
> 
>>>> So they might get penalised a bit on reclaim, but maximum rss limits
>>>> will work fine, and you can (almost) guarantee X amount of memory for
>>>> a given container, and it will _work_.
>>>>
>>>> But I also take back my comments about this being the only design I
>>>> have seen that gets everything, because the node-per-container idea
>>>> is a really good one on the surface. And it could mean even less impact
>>>> on the core VM than this patch. That is also a first-touch scheme.
>>>>
>>> With the proposed node-per-container, we will need to make massive core
>>> VM changes to reorganize zones and nodes. We would want to allow
>>>
>>> 1. For sharing of nodes
>>> 2. Resizing nodes
>>> 3. May be more
>> But a lot of that is happening anyway for other reasons (eg. memory
>> plug/unplug). And I don't consider node/zone setup to be part of the
>> "core VM" as such... it is _good_ if we can move extra work into setup
>> rather than have it in the mm.
>>
>> That said, I don't think this patch is terribly intrusive either.
>>
> 
> Thanks, thats one of our goals, to keep it simple, understandable and
> non-intrusive.
> 
>>> With the node-per-container idea, it will hard to control page cache
>>> limits, independent of RSS limits or mlock limits.
>>>
>>> NOTE: page cache == unmapped page cache here.
>> I don't know that it would be particularly harder than any other
>> first-touch scheme. If one container ends up being charged with too
>> much pagecache, eventually they'll reclaim a bit of it and the pages
>> will get charged to more frequent users.
>>
>>
> 
> Yes, true, but what if a user does not want to control the page
> cache usage in a particular container or wants to turn off
> RSS control.
> 
>>>>> However the messed up accounting that doesn't handle sharing between
>>>>> groups of processes properly really bugs me.  Especially when we have
>>>>> the infrastructure to do it right.
>>>>>
>>>>> Does that make more sense?
>>>>
>>>> I think it is simplistic.
>>>>
>>>> Sure you could probably use some of the rmap stuff to account shared
>>>> mapped _user_ pages once for each container that touches them. And
>>>> this patchset isn't preventing that.
>>>>
>>>> But how do you account kernel allocations? How do you account unmapped
>>>> pagecache?
>>>>
>>>> What's the big deal so many accounting people have with just RSS? I'm
>>>> not a container person, this is an honest question. Because from my
>>>> POV if you conveniently ignore everything else... you may as well just
>>>> not do any accounting at all.
>>>>
>>> We decided to implement accounting and control in phases
>>>
>>> 1. RSS control
>>> 2. unmapped page cache control
>>> 3. mlock control
>>> 4. Kernel accounting and limits
>>>
>>> This has several advantages
>>>
>>> 1. The limits can be individually set and controlled.
>>> 2. The code is broken down into simpler chunks for review and merging.
>> But this patch gives the groundwork to handle 1-4, and it is in a small
>> chunk, and one would be able to apply different limits to different types
>> of pages with it. Just using rmap to handle 1 does not really seem like a
>> viable alternative because it fundamentally isn't going to handle 2 or 4.
>>
> 
> For (2), we have the basic setup in the form of a per-container LRU list
> and a pointer from struct page to the container that first brought in
> the page.
> 
>> I'm not saying that you couldn't _later_ add something that uses rmap or
>> our current RSS accounting to tweak container-RSS semantics. But isn't it
>> sensible to lay the groundwork first? Get a clear path to something that
>> is good (not perfect), but *works*?
>>
> 
> I agree with your development model suggestion. One of things we are going 
> to do in the near future is to build (2) and then add (3) and (4). So far,
> we've not encountered any difficulties on building on top of (1).
> 
> Vaidy, any comments?

Accounting becomes easy if we have a container pointer in struct page.
This can form the foundation for building controllers, since any
memory-related controller will be interested in tracking pages.  However,
we still want to evaluate whether we can build them without bloating
struct page.  The pagecache controller (2) can be implemented with a
container pointer in struct page or a container pointer in struct
address_space.
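
A rough sketch of the per-page pointer being discussed; the field name and
config symbol are invented for illustration, and the other placement mentioned
would be an analogous pointer in struct address_space:

struct page {
        /* ... existing fields ... */
#ifdef CONFIG_RSS_CONTAINER
        struct rss_container *rss_container;    /* container charged for the page */
#endif
};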

Building on this patchset is much simpler, and we hope the bloat in
struct page will be compensated for by the benefits the memory controllers
bring in terms of performance and simplicity.

Adding too many controllers and accounting parameters to start with
will make the patch too big and complex.  As Balbir mentioned, we have
a plan and we shall add new control parameters in stages.

--Vaidy
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code [message #17827 is a reply to message #10892] Wed, 14 March 2007 16:16
dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Nick,

>>Accounting becomes easy if we have a container pointer in struct page.
>> This can form base ground for building controllers since any memory
>>related controller would be interested in tracking pages.  However we
>>still want to evaluate if we can build them without bloating the
>>struct page.  Pagecache controller (2) we can implement with container
>>pointer in struct page or container pointer in struct address space.
> 
> 
> The thing is, you have to worry about actually getting anything in the
> kernel rather than trying to do fancy stuff.
> 
> The approaches I have seen that don't have a struct page pointer, do
> intrusive things like try to put hooks everywhere throughout the kernel
> where a userspace task can cause an allocation (and of course end up
> missing many, so they aren't secure anyway)... and basically just
> nasty stuff that will never get merged.

The user beancounters patch has got through all of these...
The approach where each charged object has a pointer to the owner container
that charged it is the easiest/cleanest way to handle
all the problems with dynamic context change, races, etc.,
and 1 pointer in the page struct is just 0.1% overhead.

> Struct page overhead really isn't bad. Sure, nobody who doesn't use
> containers will want to turn it on, but unless you're using a big PAE
> system you're actually unlikely to notice.

big PAE doesn't make any difference IMHO
(as long as struct pages are not created for non-present physical memory areas)

> But again, I'll say the node-container approach of course does avoid
> this nicely (because we already can get the node from the page). So
> definitely that approach needs to be discredited before going with this
> one.

But it lacks some other features:
1. A page can't be shared easily with another container.
2. A shared page can't be accounted honestly to containers
   as fraction=PAGE_SIZE/containers-using-it (sketched below).
3. It doesn't help accounting of kernel memory structures.
   e.g. in OpenVZ we use exactly the same pointer on the page
   to track which container owns it; pages used for page
   tables, for example, are accounted this way.
4. I guess container destroy requires destroying the memory zone,
   which means writing out dirty data. That doesn't sound
   good to me either.
5. Memory reclamation in case of global memory shortage
   becomes a tricky/unfair task.
6. You cannot overcommit. AFAIU, the memory has to be granted
   to the node for exclusive usage and cannot be used by other
   containers, even if it is unused. This is not an option for us.
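
To illustrate point 2, an "honest" fractional charge would look roughly like
this (invented structure, not how beancounters or this patchset actually work):

struct shared_charge {
        atomic_t nr_users;      /* containers currently mapping the page */
};

static unsigned long shared_page_share(struct shared_charge *sc)
{
        int users = atomic_read(&sc->nr_users);

        /* each of the N containers using the page is charged PAGE_SIZE / N */
        return users ? PAGE_SIZE / users : PAGE_SIZE;
}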

>>Building on this patchset is much simple and and we hope the bloat in
>>struct page will be compensated by the benefits in memory controllers
>>in terms of performance and simplicity.
>>
>>Adding too many controllers and accounting parameters to start with
>>will make the patch too big and complex.  As Balbir mentioned, we have
>>a plan and we shall add new control parameters in stages.
> 
> Everyone seems to have a plan ;) I don't read the containers list...
> does everyone still have *different* plans, or is any sort of consensus
> being reached?

hope we'll have it soon :)

Thanks,
Kirill

_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code [message #17828 is a reply to message #10892] Wed, 14 March 2007 14:43
Vaidyanathan Srinivas
Messages: 49
Registered: February 2007
Member
Nick Piggin wrote:
> Vaidyanathan Srinivasan wrote:
> 
>> Accounting becomes easy if we have a container pointer in struct page.
>>  This can form base ground for building controllers since any memory
>> related controller would be interested in tracking pages.  However we
>> still want to evaluate if we can build them without bloating the
>> struct page.  Pagecache controller (2) we can implement with container
>> pointer in struct page or container pointer in struct address space.
> 
> The thing is, you have to worry about actually getting anything in the
> kernel rather than trying to do fancy stuff.
> 
> The approaches I have seen that don't have a struct page pointer, do
> intrusive things like try to put hooks everywhere throughout the kernel
> where a userspace task can cause an allocation (and of course end up
> missing many, so they aren't secure anyway)... and basically just
> nasty stuff that will never get merged.
> 
> Struct page overhead really isn't bad. Sure, nobody who doesn't use
> containers will want to turn it on, but unless you're using a big PAE
> system you're actually unlikely to notice.
> 
> But again, I'll say the node-container approach of course does avoid
> this nicely (because we already can get the node from the page). So
> definitely that approach needs to be discredited before going with this
> one.

I agree :)

>> Building on this patchset is much simple and and we hope the bloat in
>> struct page will be compensated by the benefits in memory controllers
>> in terms of performance and simplicity.
>>
>> Adding too many controllers and accounting parameters to start with
>> will make the patch too big and complex.  As Balbir mentioned, we have
>> a plan and we shall add new control parameters in stages.
> 
> Everyone seems to have a plan ;) I don't read the containers list...
> does everyone still have *different* plans, or is any sort of consensus
> being reached?

Consensus?  I believe at this point we have a sort of consensus on the
base container infrastructure and the need for a memory controller to
control RSS, pagecache, mlock, kernel memory, etc.  However, the
implementation and approach taken are still being discussed :)

--Vaidy

_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code [message #17829 is a reply to message #17826] Wed, 14 March 2007 13:49
Nick Piggin
Messages: 35
Registered: March 2006
Member
Vaidyanathan Srinivasan wrote:

> Accounting becomes easy if we have a container pointer in struct page.
>  This can form base ground for building controllers since any memory
> related controller would be interested in tracking pages.  However we
> still want to evaluate if we can build them without bloating the
> struct page.  Pagecache controller (2) we can implement with container
> pointer in struct page or container pointer in struct address space.

The thing is, you have to worry about actually getting anything in the
kernel rather than trying to do fancy stuff.

The approaches I have seen that don't have a struct page pointer, do
intrusive things like try to put hooks everywhere throughout the kernel
where a userspace task can cause an allocation (and of course end up
missing many, so they aren't secure anyway)... and basically just
nasty stuff that will never get merged.

Struct page overhead really isn't bad. Sure, nobody who doesn't use
containers will want to turn it on, but unless you're using a big PAE
system you're actually unlikely to notice.

But again, I'll say the node-container approach of course does avoid
this nicely (because we already can get the node from the page). So
definitely that approach needs to be discredited before going with this
one.

> Building on this patchset is much simple and and we hope the bloat in
> struct page will be compensated by the benefits in memory controllers
> in terms of performance and simplicity.
> 
> Adding too many controllers and accounting parameters to start with
> will make the patch too big and complex.  As Balbir mentioned, we have
> a plan and we shall add new control parameters in stages.

Everyone seems to have a plan ;) I don't read the containers list...
does everyone still have *different* plans, or is any sort of consensus
being reached?

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 1/7] Resource counters [message #17831 is a reply to message #17815] Thu, 15 March 2007 16:51
ebiederm
Messages: 1354
Registered: February 2006
Senior Member
Pavel Emelianov <xemul@sw.ru> writes:

> Srivatsa Vaddagiri wrote:
>> On Tue, Mar 13, 2007 at 06:41:05PM +0300, Pavel Emelianov wrote:
>>>> right, but atomic ops have much less impact on most
>>>> architectures than locks :)
>>> Right. But atomic_add_unless() is slower as it is
>>> essentially a loop. See my previous letter in this sub-thread.
>> 
>> If I am not mistaken, you shouldn't loop in normal cases, which means
>> it boils down to a atomic_read() + atomic_cmpxch()
>> 
>> 
>
> So does the lock - in a normal case (when it's not
> heavily contented) it will boil down to atomic_dec_and_test().
>
> Nevertheless, making charge like in this patchset
> requires two atomic ops with atomic_xxx and only
> one with spin_lock().

To be very clear: if you care about optimization, cache lines
and lock hold times (to keep contention down) are the important
things.

With spin locks you have to be a little more careful to put them
on the same cache line as your data and to keep hold times
short.  With atomic ops you get that automatically.
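
Something like the following is the shape of it - the lock sits on the same
cache line as the counters it protects (illustrative sketch only, not the
res_counter from patch 1/7):

struct counter_sketch {
        spinlock_t lock;
        unsigned long usage;
        unsigned long limit;
} ____cacheline_aligned_in_smp;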

There is really no significant advantage in either approach.
The number of atomic ops doesn't matter.  You bring in
the cache line and manipulate it.  The expensive part is
acquiring the cache line exclusively.  This is expensive even if
nothing is ever contended, as long as there are many users.

Sorry for the rant, but I just wanted to set the record straight.
spin_locks vs atomic ops is a largely meaningless debate.

Eric
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code [message #17832 is a reply to message #10892] Thu, 15 March 2007 05:44
Balbir Singh
Messages: 491
Registered: August 2006
Senior Member
Nick Piggin wrote:
> Kirill Korotaev wrote:
> 
>>> The approaches I have seen that don't have a struct page pointer, do
>>> intrusive things like try to put hooks everywhere throughout the kernel
>>> where a userspace task can cause an allocation (and of course end up
>>> missing many, so they aren't secure anyway)... and basically just
>>> nasty stuff that will never get merged.
>>
>>
>> User beancounters patch has got through all these...
>> The approach where each charged object has a pointer to the owner 
>> container,
>> who has charged it - is the most easy/clean way to handle
>> all the problems with dynamic context change, races, etc.
>> and 1 pointer in page struct is just 0.1% overehad.
> 
> The pointer in struct page approach is a decent one, which I have
> liked since this whole container effort came up. IIRC Linus and Alan
> also thought that was a reasonable way to go.
> 
> I haven't reviewed the rest of the beancounters patch since looking
> at it quite a few months ago... I probably don't have time for a
> good review at the moment, but I should eventually.
> 

This patch is not really beancounters.

1. It uses the containers framework
2. It is similar to my RSS controller (http://lkml.org/lkml/2007/2/26/8)

I would say that beancounters are changing and evolving.

>>> Struct page overhead really isn't bad. Sure, nobody who doesn't use
>>> containers will want to turn it on, but unless you're using a big PAE
>>> system you're actually unlikely to notice.
>>
>>
>> big PAE doesn't make any difference IMHO
>> (until struct pages are not created for non-present physical memory 
>> areas)
> 
> The issue is just that struct pages use low memory, which is a really
> scarce commodity on PAE. One more pointer in the struct page means
> 64MB less lowmem.
> 
> But PAE is crap anyway. We've already made enough concessions in the
> kernel to support it. I agree: struct page overhead is not really
> significant. The benefits of simplicity seems to outweigh the downside.
> 
>>> But again, I'll say the node-container approach of course does avoid
>>> this nicely (because we already can get the node from the page). So
>>> definitely that approach needs to be discredited before going with this
>>> one.
>>
>>
>> But it lacks some other features:
>> 1. page can't be shared easily with another container
> 
> I think they could be shared. You allocate _new_ pages from your own
> node, but you can definitely use existing pages allocated to other
> nodes.
> 
>> 2. shared page can't be accounted honestly to containers
>>    as fraction=PAGE_SIZE/containers-using-it
> 
> Yes there would be some accounting differences. I think it is hard
> to say exactly what containers are "using" what page anyway, though.
> What do you say about unmapped pages? Kernel allocations? etc.
> 
>> 3. It doesn't help accounting of kernel memory structures.
>>    e.g. in OpenVZ we use exactly the same pointer on the page
>>    to track which container owns it, e.g. pages used for page
>>    tables are accounted this way.
> 
> ?
> page_to_nid(page) ~= container that owns it.
> 
>> 4. I guess container destroy requires destroy of memory zone,
>>    which means write out of dirty data. Which doesn't sound
>>    good for me as well.
> 
> I haven't looked at any implementation, but I think it is fine for
> the zone to stay around.
> 
>> 5. memory reclamation in case of global memory shortage
>>    becomes a tricky/unfair task.
> 
> I don't understand why? You can much more easily target a specific
> container for reclaim with this approach than with others (because
> you have an lru per container).
> 

Yes, but we break the global LRU. With these RSS patches, reclaim not
triggered by containers still uses the global LRU; by using nodes,
we would lose the global LRU.

>> 6. You cannot overcommit. AFAIU, the memory should be granted
>>    to node exclusive usage and cannot be used by by another containers,
>>    even if it is unused. This is not an option for us.
> 
> I'm not sure about that. If you have a larger number of nodes, then
> you could assign more free nodes to a container on demand. But I
> think there would definitely be less flexibility with nodes...
> 
> I don't know... and seeing as I don't really know where the google
> guys are going with it, I won't misrepresent their work any further ;)
> 
> 
>>> Everyone seems to have a plan ;) I don't read the containers list...
>>> does everyone still have *different* plans, or is any sort of consensus
>>> being reached?
>>
>>
>> hope we'll have it soon :)
> 
> Good luck ;)
> 

I think we have made some forward progress on the consensus.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code [message #17843 is a reply to message #17827] Thu, 15 March 2007 05:01
Nick Piggin
Messages: 35
Registered: March 2006
Member
Kirill Korotaev wrote:

>>The approaches I have seen that don't have a struct page pointer, do
>>intrusive things like try to put hooks everywhere throughout the kernel
>>where a userspace task can cause an allocation (and of course end up
>>missing many, so they aren't secure anyway)... and basically just
>>nasty stuff that will never get merged.
> 
> 
> User beancounters patch has got through all these...
> The approach where each charged object has a pointer to the owner container,
> who has charged it - is the most easy/clean way to handle
> all the problems with dynamic context change, races, etc.
> and 1 pointer in page struct is just 0.1% overehad.

The pointer in struct page approach is a decent one, which I have
liked since this whole container effort came up. IIRC Linus and Alan
also thought that was a reasonable way to go.

I haven't reviewed the rest of the beancounters patch since looking
at it quite a few months ago... I probably don't have time for a
good review at the moment, but I should eventually.

>>Struct page overhead really isn't bad. Sure, nobody who doesn't use
>>containers will want to turn it on, but unless you're using a big PAE
>>system you're actually unlikely to notice.
> 
> 
> big PAE doesn't make any difference IMHO
> (until struct pages are not created for non-present physical memory areas)

The issue is just that struct pages use low memory, which is a really
scarce commodity on PAE. One more pointer in the struct page means
64MB less lowmem.

But PAE is crap anyway. We've already made enough concessions in the
kernel to support it. I agree: struct page overhead is not really
significant. The benefits of simplicity seem to outweigh the downside.

>>But again, I'll say the node-container approach of course does avoid
>>this nicely (because we already can get the node from the page). So
>>definitely that approach needs to be discredited before going with this
>>one.
> 
> 
> But it lacks some other features:
> 1. page can't be shared easily with another container

I think they could be shared. You allocate _new_ pages from your own
node, but you can definitely use existing pages allocated to other
nodes.

> 2. shared page can't be accounted honestly to containers
>    as fraction=PAGE_SIZE/containers-using-it

Yes there would be some accounting differences. I think it is hard
to say exactly what containers are "using" what page anyway, though.
What do you say about unmapped pages? Kernel allocations? etc.

> 3. It doesn't help accounting of kernel memory structures.
>    e.g. in OpenVZ we use exactly the same pointer on the page
>    to track which container owns it, e.g. pages used for page
>    tables are accounted this way.

?
page_to_nid(page) ~= container that owns it.
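
i.e. the owner falls out of the existing page-to-node mapping; hypothetical
illustration, node_to_container[] is invented:

static struct container *node_to_container[MAX_NUMNODES];

static inline struct container *page_owning_container(struct page *page)
{
        return node_to_container[page_to_nid(page)];
}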

> 4. I guess container destroy requires destroy of memory zone,
>    which means write out of dirty data. Which doesn't sound
>    good for me as well.

I haven't looked at any implementation, but I think it is fine for
the zone to stay around.

> 5. memory reclamation in case of global memory shortage
>    becomes a tricky/unfair task.

I don't understand why? You can much more easily target a specific
container for reclaim with this approach than with others (because
you have an lru per container).

> 6. You cannot overcommit. AFAIU, the memory should be granted
>    to node exclusive usage and cannot be used by by another containers,
>    even if it is unused. This is not an option for us.

I'm not sure about that. If you have a larger number of nodes, then
you could assign more free nodes to a container on demand. But I
think there would definitely be less flexibility with nodes...

I don't know... and seeing as I don't really know where the google
guys are going with it, I won't misrepresent their work any further ;)


>>Everyone seems to have a plan ;) I don't read the containers list...
>>does everyone still have *different* plans, or is any sort of consensus
>>being reached?
> 
> 
> hope we'll have it soon :)

Good luck ;)

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 2/7] RSS controller core [message #17844 is a reply to message #17807] Fri, 16 March 2007 00:55
ebiederm
Messages: 1354
Registered: February 2006
Senior Member
Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

>> stuff is happening by comparing page->count and page->_mapcount, but it
>> certainly wouldn't be conclusive.  But, does this kind of nonsense even
>> happen in practice?  
>
> "Is it useful for me as a bad guy to make it happen ?"

To create a DOS attack.

- Allocate some memory you know your victim will want in the future,
  (shared libraries and the like).
- Wait until your victim is using the memory you allocated.
- Terminate your memory resource group.
- Victim is pushed over memory limits by your exiting.
- Victim can no longer allocate memory
- Victim dies

It's not quite that easy unless your victim calls mlockall(MCL_FUTURE),
but the potential is clearly there.

Am I missing something?  Or is this fundamental to any first touch scenario?

I just know I have problems with first touch because it is darn hard to
reason about.

Eric
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 2/7] RSS controller core [message #17849 is a reply to message #17844] Fri, 16 March 2007 16:31
Dave Hansen
Messages: 240
Registered: October 2005
Senior Member
On Thu, 2007-03-15 at 18:55 -0600, Eric W. Biederman wrote:
> To create a DOS attack.
> 
> - Allocate some memory you know your victim will want in the future,
>   (shared libraries and the like).
> - Wait until your victim is using the memory you allocated.
> - Terminate your memory resource group.
> - Victim is pushed over memory limits by your exiting.
> - Victim can no longer allocate memory
> - Victim dies
> 
> It's not quite that easy unless your victim calls mlockall(MCL_FUTURE),
> but the potential is clearly there.
> 
> Am I missing something?  Or is this fundamental to any first touch scenario?
> 
> I just know I have problems with first touch because it is darn hard to
> reason about.

I think it's fundamental to any case where two containers share the use
of a page, and either one _can_ be charged for it without receiving a
_full_ charge.

I don't think it's uniquely associated with first-touch schemes.

The software zones approach where there would be a set of "shared" zones
would not have this problem, because any sharing would have to occur on
data on which neither one was being charged.

http://linux-mm.org/SoftwareZones

-- Dave

_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 2/7] RSS controller core [message #17853 is a reply to message #10890] Fri, 16 March 2007 19:46
Dave Hansen
Messages: 240
Registered: October 2005
Senior Member
On Fri, 2007-03-16 at 12:54 -0600, Eric W. Biederman wrote:
> Dave Hansen <hansendc@us.ibm.com> writes:
> > http://linux-mm.org/SoftwareZones

> Looking at your page, and I'm too lazy to figure out how to update it
> I have a couple of comments.

You just need to create an account by clicking the Login button.  It
lets you edit things after that.  But, I'd be happy to put anything in
there you see fit.  

> - Why do limits have to apply to the unmapped page cache?

To me, it is just because it consumes memory.  Unmapped cache is, of
course, much more easily reclaimed than mapped files, but it still
fundamentally causes pressure on the VM.  

To me, a process sitting there doing constant reads of 10 pages has the
same overhead to the VM as a process sitting there with a 10 page file
mmaped, and reading that.

> - Could you mention proper multi process RSS limits.
>   (I.e.  we count the number of pages each group of processes have mapped
>    and limit that).
>   It is the same basic idea as partial page ownership, but instead of
>   page ownership you just count how many pages each group is using and
>   strictly limit that.  There is no page owner ship or partial charges.
>   The overhead is just walking the rmap list at map and unmap time to
>   see if this is the first users in the container.  No additional kernel
>   data structures are needed.

I've tried to capture this.  Let me know what else you think it needs.

http://linux-mm.org/SoftwareZones

-- Dave

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [RFC][PATCH 2/7] RSS controller core [message #17854 is a reply to message #17849] Fri, 16 March 2007 18:54
ebiederm
Messages: 1354
Registered: February 2006
Senior Member
Dave Hansen <hansendc@us.ibm.com> writes:

> On Thu, 2007-03-15 at 18:55 -0600, Eric W. Biederman wrote:
>> To create a DOS attack.
>> 
>> - Allocate some memory you know your victim will want in the future,
>>   (shared libraries and the like).
>> - Wait until your victim is using the memory you allocated.
>> - Terminate your memory resource group.
>> - Victim is pushed over memory limits by your exiting.
>> - Victim can no longer allocate memory
>> - Victim dies
>> 
>> It's not quite that easy unless your victim calls mlockall(MCL_FUTURE),
>> but the potential is clearly there.
>> 
>> Am I missing something?  Or is this fundamental to any first touch scenario?
>> 
>> I just know I have problems with first touch because it is darn hard to
>> reason about.
>
> I think it's fundamental to any case where two containers share the use
> of the page, but either one _can_ be charged but does not receive a
> _full_ charge for it.

Reasonable.

> I don't think it's uniquely associated with first-touch schemes.
>
> The software zones approach where there would be a set of "shared" zones
> would not have this problem, because any sharing would have to occur on
> data on which neither one was being charged.

True.   The "shared" zones approach would simply have the problem that it
would make sharing hard and thus reduce the effectiveness of the page cache.

The "shared" zone approach also would seem to interact in very weird ways
with real NUMA and memory hotplug or process migration.  The fact that we
actually have to care about the real memory size on the machine makes it
look strange to me.

Zones should definitely be penalized in some category for the reduction
in efficiency of the page cache.  It took us decades to learn that the
most efficient page cache was one that could resize and reallocate memory
on demand based on the current usage.  Zones, and possibly anything else
with a concept of page ownership, seem to be ignoring that wisdom.

> http://linux-mm.org/SoftwareZones


Looking at your page (I'm too lazy to figure out how to update it),
I have a couple of comments.

- Why do limits have to apply to the unmapped page cache?

- Could you mention proper multi process RSS limits.
  (I.e.  we count the number of pages each group of processes have mapped
   and limit that).
  It is the same basic idea as partial page ownership, but instead of
  page ownership you just count how many pages each group is using and
  strictly limit that.  There is no page owner ship or partial charges.
  The overhead is just walking the rmap list at map and unmap time to
  see if this is the first users in the container.  No additional kernel
  data structures are needed.
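
A sketch of the map/unmap hooks I have in mind (container_maps_page() stands
in for that rmap walk and is invented; res_counter_charge()/_uncharge() are
named after patch 1/7 but their use here is purely illustrative):

static int container_map_charge(struct container *cont, struct page *page)
{
        if (container_maps_page(cont, page))
                return 0;       /* someone in the group already maps it */

        return res_counter_charge(&cont->rss, PAGE_SIZE);
}

static void container_unmap_uncharge(struct container *cont, struct page *page)
{
        if (!container_maps_page(cont, page))   /* last mapper in group gone */
                res_counter_uncharge(&cont->rss, PAGE_SIZE);
}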

Eric
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [RFC][PATCH 2/7] RSS controller core [message #17878 is a reply to message #17737] Sun, 18 March 2007 16:58
ebiederm
Messages: 1354
Registered: February 2006
Senior Member
Dave Hansen <hansendc@us.ibm.com> writes:

> On Mon, 2007-03-12 at 23:41 +0100, Herbert Poetzl wrote:
>> 
>> let me give a real world example here:
>> 
>>  - typical guest with 600MB disk space
>>  - about 100MB guest specific data (not shared)
>>  - assumed that 80% of the libs/tools are used
>
> I get the general idea here, but I just don't think those numbers are
> very accurate.  My laptop has a bunch of gunk open (xterm, evolution,
> firefox, xchat, etc...).  I ran this command:
>
> lsof | egrep '/(usr/|lib.*\.so)' | awk '{print $9}' | sort | uniq | xargs du
> -Dcs
>
> and got:
>
> 113840  total
>
> On a web/database server that I have (ps aux | wc -l == 128), I just ran
> the same:
>
> 39168   total
>
> That's assuming that all of the libraries are fully read in and
> populated, just by their on-disk sizes. Is that not a reasonable measure
> of the kinds of things that we can expect to be shared in a vserver?  If
> so, it's a long way from 400MB.
>
> Could you try a similar measurement on some of your machines?  Perhaps
> mine are just weird.

Think shell scripts and the like.  From what I have seen I would agree
that it is typical for application code not to dominate application memory
usage.  However, on the flip side it is not uncommon for application code to
dominate disk usage.  Some of us have giant music, video or code databases
that consume a lot of disk space, but in many instances servers don't have
enormous chunks of private files, and even when they do they still share the
files that come from the distribution.

The result of this is that there are a lot of unmapped pages sitting in the page
cache for rarely run executables, cached just in case we need them.

So while Herbert's numbers may be a little off, the general principle that the
entire system does better if you can share the page cache is very real.

That the page cache isn't accounted for here isn't terribly important; we still
get the global benefit.

> I don't doubt this, but doing this two-level page-out thing for
> containers/vservers over their limits is surely something that we should
> consider farther down the road, right?

It is what the current Linux VM already does.  There is removing a page from
processes, and then there is writing it out to disk.  I think the normal
term is second chance replacement.  The idea is that once you remove
a page from being mapped you let it age a little before actually writing
it out, so it can still be paged back in cheaply.  This allows pages in high
demand to avoid being written to disk; all they incur are minor, not major,
fault costs.

> It's important to you, but you're obviously not doing any of the
> mainline coding, right?

Tread carefully here.  Herbert may not be doing a lot of mainline coding
or extremely careful review of potential patches, but he does seem to have
a decent grasp of the basic issues, in addition to a reasonable amount
of experience, so it is worth listening to what he says.

In addition Herbert does seem to be doing some testing of the mainline
code as we get it going.  So he is contributing.

>> > What are the consequences if this isn't done?  Doesn't 
>> > a loaded system eventually have all of its pages used 
>> > anyway, so won't this always be a temporary situation?
>> 
>> let's consider a quite limited guest (or several
>> of them) which have a 'RAM' limit of 64MB and 
>> additional 64MB of 'virtual swap' assigned ...
>> 
>> if they use roughly 96MB (memory footprint) then
>> having this 'fluffy' optimization will keep them
>> running without any effect on the host side, but
>> without, they will continously swap in and out
>> which will affect not only the host, but also the
>> other guests ...

Ugh.  You really want swap > RAM here.  Because there are real
cases when you are swapping when all of your pages in RAM can
be cached in the page cache.  96MB with 64MB RSS and 64MB swap is
almost a sure way to hit your swap page limit and die.

> All workloads that use $limit+1 pages of memory will always pay the
> price, right?  :)

They should.  When you remove an anonymous page from the page tables it
needs to be allocated a swap slot and placed in the swap cache.  Once you
do that it can sit in the page cache like any file-backed page.  So the
container that hits $limit+1 should get the paging pressure and a lot
more minor faults.  However, we still want to write things out to disk
globally and optimize that as we do right now.

Eric
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [RFC][PATCH 2/7] RSS controller core [message #17879 is a reply to message #17853] Sun, 18 March 2007 17:42
ebiederm
Messages: 1354
Registered: February 2006
Senior Member
Dave Hansen <hansendc@us.ibm.com> writes:

> On Fri, 2007-03-16 at 12:54 -0600, Eric W. Biederman wrote:
>> Dave Hansen <hansendc@us.ibm.com> writes:
>
>> - Why do limits have to apply to the unmapped page cache?
>
> To me, it is just because it consumes memory.  Unmapped cache is, of
> couse, much more easily reclaimed than mapped files, but it still
> fundamentally causes pressure on the VM.  
>
> To me, a process sitting there doing constant reads of 10 pages has the
> same overhead to the VM as a process sitting there with a 10 page file
> mmaped, and reading that.

I can see temporarily accounting for pages in use for such a
read/write, and possibly during things such as readahead.

However, I doubt it is enough memory to be significant, and as
such it is probably a waste of time to account for it.

A memory limit is not about accounting for memory pressure, so I think
the reasoning for wanting to account for unmapped pages as a hard
requirement is still suspect.  A memory limit is to prevent one container
from hogging all of the memory in the system, and denying it to other
containers.

The page cache by definition is a global resource that facilitates
global kernel optimizations.  If we kill those optimizations we
are on the wrong track.  By requiring limits there I think we are
very likely to kill our very important global optimizations, and bring
the performance of the entire system down.

>> - Could you mention proper multi process RSS limits.
>>   (I.e.  we count the number of pages each group of processes have mapped
>>    and limit that).
>>   It is the same basic idea as partial page ownership, but instead of
>>   page ownership you just count how many pages each group is using and
>>   strictly limit that.  There is no page owner ship or partial charges.
>>   The overhead is just walking the rmap list at map and unmap time to
>>   see if this is the first users in the container.  No additional kernel
>>   data structures are needed.
>
> I've tried to capture this.  Let me know what else you think it
> needs.

Requirements:
- The current kernel global optimizations are preserved and useful.

  This does mean one container can affect another when the
  optimizations go awry but on average it means much better
  performance.  For many the global optimizations are what make
  the in-kernel approach attractive over paravirtualization.

Very nice to have:
- Limits should be on things user space have control of.
  
  Saying you can only have X bytes of kernel memory for file
  descriptors and the like is very hard to work with.  Saying you
  can have only N file descriptors open is much easier to deal with.

- SMP Scalability.

  The final implementation should have per cpu counters or per task
  reservations so in most instances we don't need to bounce a global
  cache line around to perform the accounting.

Nice to have:

- Perfect precision.

  Having every last byte always accounted for is nice, but a
  little bit of bounded fuzziness in the accounting is acceptable
  if that makes the accounting problem more tractable.

We need several more limits in this discussion to get a full picture,
otherwise we may end up trying to build the all-singing, all-dancing limit.
- A limit on the number of anonymous pages.
  (Pages that are or may be in the swap cache).
- Filesystem per container quotas.  
  (Only applicable in some contexts but you get the idea).
- Inode, file descriptor, and similar limits.
- I/O limits.

Eric

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: Re: [RFC][PATCH 2/7] RSS controller core [message #17880 is a reply to message #17783] Sun, 18 March 2007 22:44
Paul Menage
Messages: 642
Registered: September 2006
Senior Member
On 3/13/07, Dave Hansen <hansendc@us.ibm.com> wrote:
> How do we determine what is shared, and goes into the shared zones?
> Once we've allocated a page, it's too late because we already picked.
> Do we just assume all page cache is shared?  Base it on filesystem,
> mount, ...?  Mount seems the most logical to me, that a sysadmin would
> have to set up a container's fs, anyway, and will likely be doing
> special things to shared data, anyway (r/o bind mounts :).

I played with an approach where you can bind a dentry to a set of
memory zones, and any children of that dentry would inherit the
mempolicy; I was envisaging that most data wouldn't be shared between
different containers/jobs, and that userspace would set up "shared"
zones for big shared regions such as /lib, /usr, /bin, and for
specially-known cases of sharing.
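
A rough sketch of that binding, just to show the shape of it. Stuffing the
policy into d_fsdata is purely for illustration (filesystems own that field
in reality), and locking is ignored:

struct dentry_mem_policy {
        nodemask_t allowed;             /* software zones for this subtree */
};

static nodemask_t *dentry_mem_nodes(struct dentry *dentry)
{
        struct dentry *d;

        for (d = dentry; ; d = d->d_parent) {
                struct dentry_mem_policy *pol = d->d_fsdata;

                if (pol)
                        return &pol->allowed;   /* policy set on this dentry */
                if (IS_ROOT(d))
                        return NULL;            /* none: use the task policy */
        }
}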

> If we really do bind a set of processes strongly to a set of memory on a
> set of nodes, then those really do become its home NUMA nodes.  If the
> CPUs there get overloaded, running it elsewhere will continue to grab
> pages from the home.  Would this basically keep us from ever being able
> to move tasks around a NUMA system?

move_pages() will let you shuffle a task's pages from one node to another
without too much intrusion.

Paul
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: Re: [RFC][PATCH 2/7] RSS controller core [message #17881 is a reply to message #17880] Mon, 19 March 2007 17:41
ebiederm
Messages: 1354
Registered: February 2006
Senior Member
"Paul Menage" <menage@google.com> writes:

> On 3/13/07, Dave Hansen <hansendc@us.ibm.com> wrote:
>> How do we determine what is shared, and goes into the shared zones?
>> Once we've allocated a page, it's too late because we already picked.
>> Do we just assume all page cache is shared?  Base it on filesystem,
>> mount, ...?  Mount seems the most logical to me, that a sysadmin would
>> have to set up a container's fs, anyway, and will likely be doing
>> special things to shared data, anyway (r/o bind mounts :).
>
> I played with an approach where you can bind a dentry to a set of
> memory zones, and any children of that dentry would inherit the
> mempolicy; I was envisaging that most data wouldn't be shared between
> different containers/jobs, and that userspace would set up "shared"
> zones for big shared regions such as /lib, /usr, /bin, and for
> specially-known cases of sharing.

Here is a wacky one.

Suppose there is some NFS server that exports something that most machines
want to mount like company home directories.

Suppose multiple containers mount that NFS server based on local policy.
(If we can allow non-root users to mount filesystems, a slightly more trusted
 guest admin certainly will be able to).

The NFS code as currently written (unless I am confused) will do
everything in its power to share the filesystem cache between the
different mounts (including the dentry tree).

How do we handle big shared areas like that?

Dynamic programming solutions where we discover the areas of sharing
at runtime seem a lot more general than a priori solutions where you
have to predict what will come next.

If a priori planning and knowledge about sharing is the best we can do
it is the best we can do and we will have to live with the limits that
imposes.  Given the inflexibility in use and setup I'm not yet ready
to concede that this is the best we can do.

Eric
Re: [RFC][PATCH 2/7] RSS controller core [message #17887 is a reply to message #17879] Mon, 19 March 2007 15:48
Herbert Poetzl
On Sun, Mar 18, 2007 at 11:42:15AM -0600, Eric W. Biederman wrote:
> Dave Hansen <hansendc@us.ibm.com> writes:
> 
> > On Fri, 2007-03-16 at 12:54 -0600, Eric W. Biederman wrote:
> >> Dave Hansen <hansendc@us.ibm.com> writes:
> >
> >> - Why do limits have to apply to the unmapped page cache?
> >
> > To me, it is just because it consumes memory.  Unmapped cache is, of
> > course, much more easily reclaimed than mapped files, but it still
> > fundamentally causes pressure on the VM.  
> >
> > To me, a process sitting there doing constant reads of 10 pages has the
> > same overhead to the VM as a process sitting there with a 10 page file
> > mmaped, and reading that.
> 
> I can see temporarily accounting for pages in use for such a
> read/write and possibly during things such as read ahead.
> 
> However I doubt it is enough memory to be significant, and as
> such is probably a waste of time accounting for it.
> 
> A memory limit is not about accounting for memory pressure, so I think
> the reasoning for wanting to account for unmapped pages as a hard
> requirement is still suspect. 

> A memory limit is to prevent one container from hogging all of the
> memory in the system, and denying it to other containers.

exactly!

nevertheless, you might want to extend that to swapping
and to the very expensive page in/out operations too

> The page cache by definition is a global resource that facilitates
> global kernel optimizations.  If we kill those optimizations we
> are on the wrong track.  By requiring limits there I think we are
> very likely to kill our very important global optimizations, and bring
> the performance of the entire system down.

that is my major concern for most of the 'straightforward'
virtualizations proposed (see Xen comment)

> >> - Could you mention proper multi process RSS limits.
> >>   (I.e.  we count the number of pages each group of processes have mapped
> >>    and limit that).
> >>   It is the same basic idea as partial page ownership, but instead of
> >>   page ownership you just count how many pages each group is using and
> >>   strictly limit that.  There is no page ownership or partial charges.
> >>   The overhead is just walking the rmap list at map and unmap time to
> >>   see if this is the first user in the container.  No additional kernel
> >>   data structures are needed.
> >
> > I've tried to capture this.  Let me know what else you think it
> > needs.
> 
> Requirements:
> - The current kernel global optimizations are preserved and useful.
> 
>   This does mean one container can affect another when the
>   optimizations go awry but on average it means much better
>   performance.  For many the global optimizations are what make
>   the in-kernel approach attractive over paravirtualization.

total agreement here

> Very nice to have:
> - Limits should be on things user space have control of.
>   
>   Saying you can only have X bytes of kernel memory for file
>   descriptors and the like is very hard to work with.  Saying you
>   can have only N file descriptors open is much easier to deal with.

yep, and IMHO more natural ...

> - SMP Scalability.
> 
>   The final implementation should have per cpu counters or per task
>   reservations so in most instances we don't need to bounce a global
>   cache line around to perform the accounting.

agreed, we want to optimize for small systems
as well as for large ones, and SMP/NUMA is quite
common in the server area (even for small servers)

> Nice to have:
> 
> - Perfect precision.
> 
>   Having every last byte always accounted for is nice but a
>   little bit of bounded fuzziness in the accounting is acceptable
>   if that makes the accounting problem more tractable.

as long as the accounting is consistent, i.e.
you do not lose resources by repetitive operations
inside the guest (or through guest-guest interaction)
as this could be used for DoS and intentional unfairness

> We need several more limits in this discussion to get a full picture,
> otherwise we may try to build the all-singing, all-dancing limit.

> - A limit on the number of anonymous pages.
>   (Pages that are or may be in the swap cache).

> - Filesystem per container quotas.  
>   (Only applicable in some contexts but you get the idea).

with shared files, otherwise an lvm partition does
a good job for that already ...

> - Inode, file descriptor, and similar limits.

> - I/O limits.

I/O and CPU limits are special, as they have a temporal
component, i.e. you are not interested in 10s of CPU time;
instead you want 0.5s/s of CPU (same for I/O)

note: this is probably also true for page in/out

- sockets 
- locks
- dentries

HTH,
Herbert

> Eric
> 
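The "0.5s/s" point can be made concrete with a generic token-bucket sketch: the guest earns budget at a fixed rate as wall time passes and spends it as it consumes CPU (or I/O), so the limit is a rate rather than an absolute amount.  This only illustrates the temporal idea; it is not how any of the proposed controllers implement CPU or I/O limits:

/* Generic token-bucket sketch of a "0.5s/s of CPU" style limit: budget is
 * earned at a fixed rate as wall time passes and spent as the guest runs,
 * so the limit is a rate rather than an absolute amount.  Purely an
 * illustration; not taken from any of the proposed controllers. */
#include <stdio.h>

struct rate_limit {
    double rate;    /* budget gained per second of wall time, e.g. 0.5 */
    double burst;   /* maximum budget that can be saved up             */
    double budget;  /* current budget, in seconds of CPU (or I/O cost) */
};

/* called periodically: top up the budget as wall time passes */
static void replenish(struct rate_limit *rl, double wall_seconds)
{
    rl->budget += rl->rate * wall_seconds;
    if (rl->budget > rl->burst)
        rl->budget = rl->burst;
}

/* charge usage; returns 1 if the guest should now be throttled */
static int charge(struct rate_limit *rl, double used_seconds)
{
    rl->budget -= used_seconds;
    return rl->budget < 0.0;
}

int main(void)
{
    struct rate_limit cpu = { .rate = 0.5, .burst = 1.0, .budget = 1.0 };

    for (int tick = 0; tick < 5; tick++) {
        replenish(&cpu, 1.0);               /* one second of wall time     */
        int throttle = charge(&cpu, 0.8);   /* guest tried to use 0.8s CPU */
        printf("tick %d: budget %.2f%s\n", tick, cpu.budget,
               throttle ? " -> throttle" : "");
    }
    return 0;
}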
controlling mmap()'d vs read/write() pages [message #17892 is a reply to message #17879] Tue, 20 March 2007 16:15
Dave Hansen
On Sun, 2007-03-18 at 11:42 -0600, Eric W. Biederman wrote:
> Dave Hansen <hansendc@us.ibm.com> writes:
> > To me, a process sitting there doing constant reads of 10 pages has the
> > same overhead to the VM as a process sitting there with a 10 page file
> > mmaped, and reading that.
> 
> I can see temporarily accounting for pages in use for such a
> read/write and possibly during things such as read ahead.
> 
> However I doubt it is enough memory to be significant, and as
> such is probably a waste of time accounting for it.
> 
> A memory limit is not about accounting for memory pressure, so I think
> the reasoning for wanting to account for unmapped pages as a hard
> requirement is still suspect.  A memory limit is to prevent one container
> from hogging all of the memory in the system, and denying it to other
> containers.
> 
> The page cache by definition is a global resource that facilitates
> global kernel optimizations.  If we kill those optimizations we
> are on the wrong track.  By requiring limits there I think we are
> very likely to kill our very important global optimizations, and bring
> the performance of the entire system down.

Let's say you have an mmap'd file.  It has zero pages brought in right
now.  You do a write to it.  It is well within the kernel's rights to
let you write one word to an mmap'd file, then unmap it, write it to
disk, and free the page.

To me, mmap() is an interface, not a directive to tell the kernel to
keep things in memory.  The fact that two reads of a byte from an
mmap()'d file tend not to go to disk or even cause a fault for the
second read is because the page is in the page cache.  The fact that two
consecutive read()s of the same disk page tend to not cause two trips to
the disk is because the page is in the page cache.

Anybody who wants to get data in and out of a file can choose to use
either of these interfaces.  A page being brought into the system for
either a read or touch of an mmap()'d area causes the same kind of
memory pressure.

So, I think we have a difference of opinion.  I think it's _all_ about
memory pressure, and you think it is _not_ about accounting for memory
pressure. :)  Perhaps we mean different things, but we appear to
disagree greatly on the surface.

Can we agree that there must be _some_ way to control the amounts of
unmapped page cache?  Whether that's related somehow to the same way we
control RSS or done somehow at the I/O level, there must be some way to
control it.  Agree?

> >> - Could you mention proper multi process RSS limits.
> >>   (I.e.  we count the number of pages each group of processes have mapped
> >>    and limit that).
> >>   It is the same basic idea as partial page ownership, but instead of
> >>   page ownership you just count how many pages each group is using and
> >>   strictly limit that.  There is no page ownership or partial charges.
> >>   The overhead is just walking the rmap list at map and unmap time to
> >>   see if this is the first user in the container.  No additional kernel
> >>   data structures are needed.
> >
> > I've tried to capture this.  Let me know what else you think it
> > needs.
> 
> Requirements:
> - The current kernel global optimizations are preserved and useful.
> 
>   This does mean one container can affect another when the
>   optimizations go awry but on average it means much better
>   performance.  For many the global optimizations are what make
>   the in-kernel approach attractive over paravirtualization.
> 
> Very nice to have:
> - Limits should be on things user space have control of.
...
> - SMP Scalability.

> - Perfect precision.
...

I've tried to capture this:

http://linux-mm.org/SoftwareZones

> We need several more limits in this discussion to get a full picture,
> otherwise we may to try and build the all singing all dancing limit.
> - A limit on the number of anonymous pages.
>   (Pages that are or may be in the swap cache).
> - Filesystem per container quotas.  
>   (Only applicable in some contexts but you get the idea).
> - Inode, file descriptor, and similar limits.
> - I/O limits.

Definitely.  I think we've all agreed that memory is the hard one,
though.  If we can make progress on this one, we're set! :)

-- Dave

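Dave's point that read() and mmap() are two interfaces to the same page cache can be demonstrated from userspace: populate a file's pages with read(), then map the file and ask mincore() which pages are resident.  They are typically already there because read() brought them into the page cache.  The file name below is arbitrary and residency is naturally not guaranteed under memory pressure:

/* Demonstrates that read() and mmap() are two interfaces to the same page
 * cache: the file's pages are populated via read(), then the file is
 * mapped and mincore() typically reports those pages as already resident.
 * The file name is arbitrary and residency is of course not guaranteed
 * under memory pressure. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    size_t len = 4 * psz;
    char *buf = malloc(len);
    if (!buf)
        return 1;
    memset(buf, 'x', len);

    int fd = open("/tmp/pagecache-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0 || write(fd, buf, len) != (ssize_t)len)
        return 1;

    /* read() the file: this fills the page cache */
    lseek(fd, 0, SEEK_SET);
    if (read(fd, buf, len) != (ssize_t)len)
        return 1;

    /* now mmap() the same file and ask which pages are resident */
    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return 1;

    unsigned char vec[4];                    /* one byte per page in the map */
    if (mincore(map, len, vec) == 0)
        for (int i = 0; i < 4; i++)
            printf("page %d resident: %d\n", i, vec[i] & 1);

    munmap(map, len);
    close(fd);
    unlink("/tmp/pagecache-demo");
    free(buf);
    return 0;
}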
Re: controlling mmap()'d vs read/write() pages [message #17898 is a reply to message #17892] Tue, 20 March 2007 21:19
ebiederm
Dave Hansen <hansendc@us.ibm.com> writes:

>
> So, I think we have a difference of opinion.  I think it's _all_ about
> memory pressure, and you think it is _not_ about accounting for memory
> pressure. :)  Perhaps we mean different things, but we appear to
> disagree greatly on the surface.

I think it is about preventing a badly behaved container from having a
significant effect on the rest of the system, and in particular other
containers on the system.

See below.  I think to reach agreement we should start by discussing
the algorithm that we see being used to keep the system function well
and the theory behind that algorithm.  Simply limiting memory is not
enough to understand why it works....

> Can we agree that there must be _some_ way to control the amounts of
> unmapped page cache?  Whether that's related somehow to the same way we
> control RSS or done somehow at the I/O level, there must be some way to
> control it.  Agree?

A lot depends on what we measure and what we try to control.
Currently what we have been measuring are amounts of RAM, and thus
what we are trying to control is the amount of RAM.  If we want to
control memory pressure we need a definition and a way to measure it.
I think there may be potential if we did that but we would still need
a memory limit to keep things like mlock in check.


So starting with a some definitions and theory.
RSS is short for resident set size.  The resident set is how many
pages are currently in memory and not on disk and used by the
application.  This includes the memory in page tables, but can
reasonably be extended to include any memory a process can be shown to
be using.

In theory there is some minimal RSS that you can give an application
at which it will get productive work done.  Below the minimal RSS
the application will spend the majority of real time waiting for
pages to come in from disk, so it can execute the next instruction.
The ultimate worst case here is a read instruction appearing on one
page and its datum on another.  You have to have both pages in memory
at the same time for the read to complete.  If you set the RSS hard
limit to one page the program will be continually restarting, either
because the page it is on is not in memory or the page it is reading
from is not in memory.

What we want to accomplish is to have a system that runs multiple
containers without problems.  As a general memory management policy
we can accomplish this by ensuring each container has at least
its minimal RSS quota of pages.  By watching the paging activity
of a container it is possible to detect when that container has
too few pages and is spending all of its time I/O bound, and thus
has slipped below its minimal RSS.

As such it is possible for the memory management system, if we have
container RSS accounting, to dynamically figure out how much memory
each container needs and to keep everyone above their minimal RSS
most of the time when that is possible.  Basically to do this the
memory management code would need to keep dynamic RSS limits, and
adjust them based upon need.

There is still the case where not all containers can have their
minimal RSS because there is simply not enough memory in the system.

That is where having a hard settable RSS limit comes in.  With this
we communicate to the application and the users beyond which point we
consider their application to be abusing the system.


There is a lot of history with RSS limits showing their limitations
and how they work.  It is fundamentally a dynamic policy instead of
a static set of guarantees which allows for applications with a
diverse set of memory requirements to work in harmony.


One of the very neat things about a hard RSS limit is that if there
are extra resources on the system you can improve overall system
performance by caching pages in the page cache instead of writing them
to disk.

> http://linux-mm.org/SoftwareZones

I will try and take a look in a bit.


Eric
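The dynamic policy Eric outlines (grow a container's RSS target while it is thrashing, shrink it when it has idle headroom, and keep it between some floor and the administrator's hard limit) could look roughly like the sketch below.  Every name and threshold is invented for illustration; none of this is from the RSS controller patches:

/* Rough sketch of the dynamic policy described above: grow a container's
 * soft RSS target while it is taking lots of major faults (thrashing),
 * shrink it when it has idle headroom, and never exceed the hard limit.
 * All names and thresholds are invented for illustration. */
#include <stdio.h>

struct container_stats {
    const char *name;
    long rss_pages;         /* pages currently resident           */
    long soft_limit;        /* dynamic target the kernel adjusts  */
    long hard_limit;        /* administrator's "abuse" threshold  */
    long major_faults_sec;  /* recent major fault rate            */
};

static void adjust_soft_limit(struct container_stats *c)
{
    const long thrash_threshold = 50;   /* faults/sec treated as thrashing  */
    const long step = 128;              /* pages added or reclaimed at once */

    if (c->major_faults_sec > thrash_threshold && c->soft_limit < c->hard_limit)
        c->soft_limit += step;          /* below its minimal RSS: give it more */
    else if (c->major_faults_sec == 0 && c->rss_pages < c->soft_limit - step)
        c->soft_limit -= step;          /* idle headroom: hand pages back */
}

int main(void)
{
    struct container_stats a = { "guest-a", 4000, 4096, 8192, 120 };
    struct container_stats b = { "guest-b", 1000, 4096, 8192, 0 };

    adjust_soft_limit(&a);
    adjust_soft_limit(&b);
    printf("%s soft limit -> %ld pages\n", a.name, a.soft_limit);
    printf("%s soft limit -> %ld pages\n", b.name, b.soft_limit);
    return 0;
}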
Re: [RFC][PATCH 2/7] RSS controller core [message #17904 is a reply to message #17819] Tue, 20 March 2007 18:57
mel
On (14/03/07 13:42), Dave Hansen didst pronounce:
> On Wed, 2007-03-14 at 15:38 +0000, Mel Gorman wrote:
> > On (13/03/07 10:05), Dave Hansen didst pronounce:
> > > How do we determine what is shared, and goes into the shared zones?
> > 
> > Assuming we had a means of creating a zone that was assigned to a container,
> > a second zone for shared data between a set of containers.  For shared data,
> > the time the pages are being allocated is at page fault time. At that point,
> > the faulting VMA is known and you also know if it's MAP_SHARED or not.
> 
> Well, but MAP_SHARED does not necessarily mean shared outside of the
> container, right? 

Well, the data could also be shared outside of the container. I would see
that happening for library text sections for example.

> Somebody wishing to get around resource limits could
> just MAP_SHARED any data they wished to use, and get it into the shared
> area before their initial use, right?
> 

They would only be able to impact other containers in a limited sense.
Specifically, if 5 containers have one shared area, then any process in
those 5 containers could exceed their container limits at the expense of
the shared area.

> How do normal read/write()s fit into this?
> 

A normal read/write, if it's the first reader of a file, would get charged to the
container, not to the shared area. It is less likely that a file that is read()
is expected to be shared, whereas mapping MAP_SHARED is relatively explicit.

> > > There's a conflict between the resize granularity of the zones, and the
> > > storage space their lookup consumes.  We'd want a container to have a
> > > limited ability to fill up memory with stuff like the dcache, so we'd
> > > appear to need to put the dentries inside the software zone.  But, that
> > > gets us to our inability to evict arbitrary dentries. 
> > 
> > Stuff like shrinking dentry caches is already pretty coarse-grained.
> > Last I looked, we couldn't even shrink within a specific node, let alone
> > a zone or a specific dentry. This is a separate problem.
> 
> I shouldn't have used dentries as an example.  I'm just saying that if
> we end up (or can end up with) with a whole ton of these software zones,
> we might have troubles storing them.  I would imagine the issue would
> come immediately from lack of page->flags to address lots of them.
> 

That is an immediate problem. There needs to be a way of mapping an arbitrary
page to a software zone. page_zone() as it is could only resolve the "main"
zone. If additional bits were used in page->flags, there would be very hard
limits on the number of containers that can exist.

If zones were physically contiguous to MAX_ORDER, pageblock flags from the
anti-fragmentation patches could be used to record that a block of pages was in
a container and what the ID is.  If non-contiguous software zones were required,
page->zone could be reintroduced for software zones, to be used when a page
belongs to a container. It's not ideal; the proper way of mapping pages to
software zones might be more obvious once we see where page->zone
would be used.

With either approach, the important thing that occurred to me is to be
sure that pages only came from the same hardware zone. For example, do
not mix HIGHMEM pages with DMA pages because it'll fail miserably. For RSS
accounting, this is not much of a restriction but it does have an impact on
keeping kernel allocations within a container on systems with HighMemory.

> > > After a while,
> > > would containers tend to pin an otherwise empty zone into place?  We
> > > could resize it, but what is the cost of keeping zones that can be
> > > resized down to a small enough size that we don't mind keeping it there?
> > > We could merge those "orphaned" zones back into the shared zone.
> > 
> > Merging "orphaned" zones back into the "main" zone would seem a sensible
> > choice.
> 
> OK, but merging wouldn't be possible if they're not physically
> contiguous.  I guess this could be worked around by just calling it a
> shared zone, no matter where it is physically.
> 

More than likely, yes.

> > > Were there any requirements about physical contiguity? 
> > 
> > For the lookup to software zone to be efficient, it would be easiest to have
> > them as MAX_ORDER_NR_PAGES contiguous. This would avoid having to break the
> > existing assumptions in the buddy allocator about MAX_ORDER_NR_PAGES
> > always being in the same zone.
> 
> I was mostly wondering about zones spanning other zones.  We _do_
> support this today

In practice, overlapping zones never happen today so a few new bugs
based on assumptions about MAX_ORDER_NR_PAGES being aligned in a zone
may crop up.

>, and it might make quite a bit more merging possible.
> 
> > > If we really do bind a set of processes strongly to a set of memory on a
> > > set of nodes, then those really do become its home NUMA nodes.  If the
> > > CPUs there get overloaded, running it elsewhere will continue to grab
> > > pages from the home.  Would this basically keep us from ever being able
> > > to move tasks around a NUMA system?
> > 
> > Moving the tasks around would not be easy. It would require a new zone
> > to be created based on the new NUMA node and all the data migrated. hmm
> 
> I know we _try_ to avoid this these days, but I'm not sure how taking it
> away as an option will affect anything.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
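One hedged sketch of the "contiguous to MAX_ORDER" option Mel mentions: if every software zone is built out of whole MAX_ORDER-sized blocks, a page can be mapped to its container with a small per-block table indexed by pfn >> MAX_ORDER instead of spending scarce page->flags bits, and the number of containers is then limited only by the width of the per-block id.  Sizes and names below are illustrative only:

/* Sketch of the "contiguous to MAX_ORDER" idea: if every software zone is
 * made up of whole MAX_ORDER-sized blocks of pages, a page can be mapped to
 * its container with a small per-block table indexed by pfn >> MAX_ORDER,
 * instead of spending scarce page->flags bits.  Illustrative only. */
#include <stdio.h>

#define MAX_ORDER 10                    /* pages per block = 1 << MAX_ORDER */
#define MAX_PFN   (1UL << 20)           /* pretend 4GB of 4k pages          */
#define NBLOCKS   (MAX_PFN >> MAX_ORDER)

/* one 16-bit id per MAX_ORDER block: 0 = "main" zone, >0 = a software zone */
static unsigned short block_zone[NBLOCKS];

static void assign_block(unsigned long pfn, unsigned short zone_id)
{
    block_zone[pfn >> MAX_ORDER] = zone_id;
}

static unsigned short page_software_zone(unsigned long pfn)
{
    return block_zone[pfn >> MAX_ORDER];
}

int main(void)
{
    /* give one block of pages to software zone 3 (say, container id 3) */
    assign_block(0x2400, 3);

    printf("pfn 0x2400 -> zone %d\n", page_software_zone(0x2400));
    printf("pfn 0x2401 -> zone %d (same block)\n", page_software_zone(0x2401));
    printf("pfn 0x9000 -> zone %d (main zone)\n", page_software_zone(0x9000));
    return 0;
}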
Re: controlling mmap()'d vs read/write() pages [message #17992 is a reply to message #17898] Fri, 23 March 2007 00:51
Herbert Poetzl
On Tue, Mar 20, 2007 at 03:19:16PM -0600, Eric W. Biederman wrote:
> Dave Hansen <hansendc@us.ibm.com> writes:
> 
> [...]
> 
> That is where having a hard settable RSS limit comes in.  With this
> we communicate to the application and the users beyond which point we
> consider their application to be abusing the system.
> 
> There is a lot of history with RSS limits showing their limitations
> and how they work.  It is fundamentally a dynamic policy instead of
> a static set of guarantees which allows for applications with a
> diverse set of memory requirements to work in harmony.
> 
> One of the very neat things about a hard RSS limit is that if there
> are extra resources on the system you can improve overall system
> performance by caching pages in the page cache instead of writing them
> to disk.

that is exactly what we (Linux-VServer) want ...
(sounds good to me, please keep up the good work in
this direction)

there is nothing wrong with hard limits if somebody
really wants them, even if they hurt the system as a
whole, but those limits shouldn't be the default ...

best,
Herbert

> > http://linux-mm.org/SoftwareZones
> 
> I will try and take a look in a bit.
> 
> 
> Eric