[RFC][PATCH 0/7] Resource controllers based on process containers
Re: [RFC][PATCH 2/7] RSS controller core [message #17713 is a reply to message #17702]
Sun, 11 March 2007 14:32
Herbert Poetzl
On Sun, Mar 11, 2007 at 12:08:16PM +0300, Pavel Emelianov wrote:
> Herbert Poetzl wrote:
>> On Tue, Mar 06, 2007 at 02:00:36PM -0800, Andrew Morton wrote:
>>> On Tue, 06 Mar 2007 17:55:29 +0300
>>> Pavel Emelianov <xemul@sw.ru> wrote:
>>>
>>>> +struct rss_container {
>>>> + struct res_counter res;
>>>> + struct list_head page_list;
>>>> + struct container_subsys_state css;
>>>> +};
>>>> +
>>>> +struct page_container {
>>>> + struct page *page;
>>>> + struct rss_container *cnt;
>>>> + struct list_head list;
>>>> +};
>>> ah. This looks good. I'll find a hunk of time to go through this
>>> work and through Paul's patches. It'd be good to get both patchsets
>>> lined up in -mm within a couple of weeks. But..
>>
>> doesn't look so good for me, mainly becaus of the
>> additional per page data and per page processing
>>
>> on 4GB memory, with 100 guests, 50% shared for each
>> guest, this basically means ~1mio pages, 500k shared
>> and 1500k x sizeof(page_container) entries, which
>> roughly boils down to ~25MB of wasted memory ...
>>
>> increase the amount of shared pages and it starts
>> getting worse, but maybe I'm missing something here
>
> You are. Each page has only one page_container associated
> with it despite the number of containers it is shared
> between.
>
>>> We need to decide whether we want to do per-container memory
>>> limitation via these data structures, or whether we do it via
>>> a physical scan of some software zone, possibly based on Mel's
>>> patches.
>>
>> why not do simple page accounting (as done currently
>> in Linux) and use that for the limits, without
>> keeping the reference from container to page?
>
> As I've already answered in my previous letter simple
> limiting w/o per-container reclamation and per-container
> oom killer isn't a good memory management. It doesn't allow
> to handle resource shortage gracefully.
a per-container OOM killer does not require any container
page reference: you know _what_ tasks belong to the
container, and you know their _badness_ from the normal
OOM calculations, so doing them for a container is really
straightforward without having any page 'tagging'

for the reclamation part, please elaborate how that will
differ in a (shared memory) guest from what the kernel
currently does ...
TIA,
Herbert
> This patchset provides more grace way to handle this, but
> full memory management includes accounting of VMA-length
> as well (returning ENOMEM from system call) but we've decided
> to start with RSS.
>
>> best,
>> Herbert
>>
>>> _______________________________________________
>>> Containers mailing list
>>> Containers@lists.osdl.org
>>> https://lists.osdl.org/mailman/listinfo/containers
>> -
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>
Re: [RFC][PATCH 2/7] RSS controller core [message #17717 is a reply to message #17713]
Sun, 11 March 2007 15:04
xemul
Herbert Poetzl wrote:
> On Sun, Mar 11, 2007 at 12:08:16PM +0300, Pavel Emelianov wrote:
>> Herbert Poetzl wrote:
>>> On Tue, Mar 06, 2007 at 02:00:36PM -0800, Andrew Morton wrote:
>>>> On Tue, 06 Mar 2007 17:55:29 +0300
>>>> Pavel Emelianov <xemul@sw.ru> wrote:
>>>>
>>>>> +struct rss_container {
>>>>> + struct res_counter res;
>>>>> + struct list_head page_list;
>>>>> + struct container_subsys_state css;
>>>>> +};
>>>>> +
>>>>> +struct page_container {
>>>>> + struct page *page;
>>>>> + struct rss_container *cnt;
>>>>> + struct list_head list;
>>>>> +};
>>>> ah. This looks good. I'll find a hunk of time to go through this
>>>> work and through Paul's patches. It'd be good to get both patchsets
>>>> lined up in -mm within a couple of weeks. But..
>>> doesn't look so good for me, mainly becaus of the
>>> additional per page data and per page processing
>>>
>>> on 4GB memory, with 100 guests, 50% shared for each
>>> guest, this basically means ~1mio pages, 500k shared
>>> and 1500k x sizeof(page_container) entries, which
>>> roughly boils down to ~25MB of wasted memory ...
>>>
>>> increase the amount of shared pages and it starts
>>> getting worse, but maybe I'm missing something here
>> You are. Each page has only one page_container associated
>> with it despite the number of containers it is shared
>> between.
>>
>>>> We need to decide whether we want to do per-container memory
>>>> limitation via these data structures, or whether we do it via
>>>> a physical scan of some software zone, possibly based on Mel's
>>>> patches.
>>> why not do simple page accounting (as done currently
>>> in Linux) and use that for the limits, without
>>> keeping the reference from container to page?
>> As I've already answered in my previous letter simple
>> limiting w/o per-container reclamation and per-container
>> oom killer isn't a good memory management. It doesn't allow
>> to handle resource shortage gracefully.
>
> per container OOM killer does not require any container
> page reference, you know _what_ tasks belong to the
> container, and you know their _badness_ from the normal
> OOM calculations, so doing them for a container is really
> straight forward without having any page 'tagging'
That's true. If you look at the patches you'll
find out that no code in oom killer uses page 'tag'.
> for the reclamation part, please elaborate how that will
> differ in a (shared memory) guest from what the kernel
> currently does ...
This is all described in the code and in the
discussions we had before.
> TIA,
> Herbert
>
>> This patchset provides more grace way to handle this, but
>> full memory management includes accounting of VMA-length
>> as well (returning ENOMEM from system call) but we've decided
>> to start with RSS.
>>
>>> best,
>>> Herbert
>>>
>>>> _______________________________________________
>>>> Containers mailing list
>>>> Containers@lists.osdl.org
>>>> https://lists.osdl.org/mailman/listinfo/containers
>>> -
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at http://www.tux.org/lkml/
>>>
>
Re: [RFC][PATCH 2/7] RSS controller core [message #17719 is a reply to message #17717]
Mon, 12 March 2007 00:41
Herbert Poetzl
On Sun, Mar 11, 2007 at 06:04:28PM +0300, Pavel Emelianov wrote:
> Herbert Poetzl wrote:
> > On Sun, Mar 11, 2007 at 12:08:16PM +0300, Pavel Emelianov wrote:
> >> Herbert Poetzl wrote:
> >>> On Tue, Mar 06, 2007 at 02:00:36PM -0800, Andrew Morton wrote:
> >>>> On Tue, 06 Mar 2007 17:55:29 +0300
> >>>> Pavel Emelianov <xemul@sw.ru> wrote:
> >>>>
> >>>>> +struct rss_container {
> >>>>> + struct res_counter res;
> >>>>> + struct list_head page_list;
> >>>>> + struct container_subsys_state css;
> >>>>> +};
> >>>>> +
> >>>>> +struct page_container {
> >>>>> + struct page *page;
> >>>>> + struct rss_container *cnt;
> >>>>> + struct list_head list;
> >>>>> +};
> >>>> ah. This looks good. I'll find a hunk of time to go through this
> >>>> work and through Paul's patches. It'd be good to get both patchsets
> >>>> lined up in -mm within a couple of weeks. But..
> >>> doesn't look so good for me, mainly becaus of the
> >>> additional per page data and per page processing
> >>>
> >>> on 4GB memory, with 100 guests, 50% shared for each
> >>> guest, this basically means ~1mio pages, 500k shared
> >>> and 1500k x sizeof(page_container) entries, which
> >>> roughly boils down to ~25MB of wasted memory ...
> >>>
> >>> increase the amount of shared pages and it starts
> >>> getting worse, but maybe I'm missing something here
> >> You are. Each page has only one page_container associated
> >> with it despite the number of containers it is shared
> >> between.
> >>
> >>>> We need to decide whether we want to do per-container memory
> >>>> limitation via these data structures, or whether we do it via
> >>>> a physical scan of some software zone, possibly based on Mel's
> >>>> patches.
> >>> why not do simple page accounting (as done currently
> >>> in Linux) and use that for the limits, without
> >>> keeping the reference from container to page?
> >> As I've already answered in my previous letter simple
> >> limiting w/o per-container reclamation and per-container
> >> oom killer isn't a good memory management. It doesn't allow
> >> to handle resource shortage gracefully.
> >
> > per container OOM killer does not require any container
> > page reference, you know _what_ tasks belong to the
> > container, and you know their _badness_ from the normal
> > OOM calculations, so doing them for a container is really
> > straight forward without having any page 'tagging'
>
> That's true. If you look at the patches you'll
> find out that no code in oom killer uses page 'tag'.
so why do we keep the context -> page reference
at all then?
> > for the reclamation part, please elaborate how that will
> > differ in a (shared memory) guest from what the kernel
> > currently does ...
>
> This is all described in the code and in the
> discussions we had before.
I must have missed some of them; could you please
point me to the relevant threads ...
TIA,
Herbert
> > TIA,
> > Herbert
> >
> >> This patchset provides more grace way to handle this, but
> >> full memory management includes accounting of VMA-length
> >> as well (returning ENOMEM from system call) but we've decided
> >> to start with RSS.
> >>
> >>> best,
> >>> Herbert
> >>>
> >>>> _______________________________________________
> >>>> Containers mailing list
> >>>> Containers@lists.osdl.org
> >>>> https://lists.osdl.org/mailman/listinfo/containers
> >>> -
> >>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >>> the body of a message to majordomo@vger.kernel.org
> >>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>> Please read the FAQ at http://www.tux.org/lkml/
> >>>
> >
Re: [RFC][PATCH 2/7] RSS controller core [message #17720 is a reply to message #11001]
Mon, 12 March 2007 01:00
Herbert Poetzl
On Sun, Mar 11, 2007 at 04:51:11AM -0800, Andrew Morton wrote:
> > On Sun, 11 Mar 2007 15:26:41 +0300 Kirill Korotaev <dev@sw.ru> wrote:
> > Andrew Morton wrote:
> > > On Tue, 06 Mar 2007 17:55:29 +0300
> > > Pavel Emelianov <xemul@sw.ru> wrote:
> > >
> > >
> > >>+struct rss_container {
> > >>+ struct res_counter res;
> > >>+ struct list_head page_list;
> > >>+ struct container_subsys_state css;
> > >>+};
> > >>+
> > >>+struct page_container {
> > >>+ struct page *page;
> > >>+ struct rss_container *cnt;
> > >>+ struct list_head list;
> > >>+};
> > >
> > >
> > > ah. This looks good. I'll find a hunk of time to go through
> > > this work and through Paul's patches. It'd be good to get both
> > > patchsets lined up in -mm within a couple of weeks. But..
> > >
> > > We need to decide whether we want to do per-container memory
> > > limitation via these data structures, or whether we do it via
> > > a physical scan of some software zone, possibly based on Mel's
> > > patches.
> > i.e. a separate memzone for each container?
>
> Yep. Straightforward machine partitioning. An attractive thing is that
> it 100% reuses existing page reclaim, unaltered.
>
> > imho memzone approach is inconvinient for pages sharing and shares
> > accounting. it also makes memory management more strict, forbids
> > overcommiting per-container etc.
>
> umm, who said they were requirements?
well, I guess all existing OS-Level virtualizations
(Linux-VServer, OpenVZ, and FreeVPS) have stated more
than once that _sharing_ of resources is a central
element, and one especially important resource to share
is memory (RAM) ...
if your aim is full partitioning, we do not need to
bother with OS-Level isolation, we can simply use
Paravirtualization and be done ...
> > Maybe you have some ideas how we can decide on this?
>
> We need to work out what the requirements are before we can
> settle on an implementation.
Linux-VServer (and probably OpenVZ):
- shared mappings of 'shared' files (binaries
and libraries) to allow for reduced memory
footprint when N identical guests are running
- virtual 'physical' limit should not cause
swap out when there are still pages left on
the host system (but pages of over limit guests
can be preferred for swapping)
- accounting and limits have to be consistent
and should roughly represent the actual used
memory/swap (modulo optimizations, I can go
into detail here, if necessary)
- OOM handling on a per guest basis, i.e. some
out of memory condition in guest A must not
affect guest B
HTC,
Herbert
> Sigh. Who is running this show? Anyone?
>
> You can actually do a form of overcommittment by allowing multiple
> containers to share one or more of the zones. Whether that is
> sufficient or suitable I don't know. That depends on the requirements,
> and we haven't even discussed those, let alone agreed to them.
>
> _______________________________________________
> Containers mailing list
> Containers@lists.osdl.org
> https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 1/7] Resource counters [message #17721 is a reply to message #17706]
Mon, 12 March 2007 01:16
Herbert Poetzl
On Sun, Mar 11, 2007 at 01:00:15PM -0600, Eric W. Biederman wrote:
> Herbert Poetzl <herbert@13thfloor.at> writes:
>
> >
> > Linux-VServer does the accounting with atomic counters,
> > so that works quite fine, just do the checks at the
> > beginning of whatever resource allocation and the
> > accounting once the resource is acquired ...
>
> Atomic operations versus locks is only a granularity thing.
> You still need the cache line which is the cost on SMP.
>
> Are you using atomic_add_return or atomic_add_unless or
> are you performing you actions in two separate steps
> which is racy? What I have seen indicates you are using
> a racy two separate operation form.
yes, this is the current implementation which
is more than sufficient, but I'm aware of the
potential issues here, and I have an experimental
patch sitting here which removes this race with
the following changes:
- doesn't store the accounted value but
limit - accounted (i.e. the free resource)
- uses atomic_add_return()
- when negative, an error is returned and
the resource amount is added back
changes to the limit have to adjust the 'current'
value too, but that is again simple and atomic
best,
Herbert
PS: atomic_add_unless() didn't exist back then
(at least I think so) but that might be an option
too ...
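A minimal sketch of that scheme, assuming the counter stores the free amount (limit minus accounted); the type and helper names are invented for illustration and are not from any posted patch (atomic_sub_return() is just atomic_add_return() with a negated argument):

#include <linux/errno.h>
#include <asm/atomic.h>

struct free_counter {
	atomic_t free;		/* limit - accounted */
};

/* reserve 'amount'; no window between the check and the charge */
static inline int free_counter_charge(struct free_counter *cnt, int amount)
{
	if (atomic_sub_return(amount, &cnt->free) < 0) {
		/* over the limit: give the reservation back and fail */
		atomic_add(amount, &cnt->free);
		return -ENOMEM;
	}
	return 0;
}

static inline void free_counter_uncharge(struct free_counter *cnt, int amount)
{
	atomic_add(amount, &cnt->free);
}

Raising or lowering the limit then simply means adding the (possibly negative) delta to ->free, as described above.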
> >> If we'll remove failcnt this would look like
> >> while (atomic_cmpxchg(...))
> >> which is also not that good.
> >>
> >> Moreover - in RSS accounting patches I perform page list
> >> manipulations under this lock, so this also saves one atomic op.
> >
> > it still hasn't been shown that this kind of RSS limit
> > doesn't add big time overhead to normal operations
> > (inside and outside of such a resource container)
> >
> > note that the 'usual' memory accounting is much more
> > lightweight and serves similar purposes ...
>
> Perhaps....
>
> Eric
> _______________________________________________
> Containers mailing list
> Containers@lists.osdl.org
> https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 3/7] Data structures changes for RSS accounting [message #17723 is a reply to message #10891]
Mon, 12 March 2007 16:48
Dave Hansen
On Mon, 2007-03-12 at 19:16 +0300, Kirill Korotaev wrote:
> now VE2 maps the same page. You can't determine whether this page is mapped
> to this container or another one w/o page->container pointer.
Hi Kirill,
I thought we can always get from the page to the VMA. rmap provides
this to us via page->mapping and the 'struct address_space' or anon_vma.
Do we agree on that?
We can also get from the vma to the mm very easily, via vma->vm_mm,
right?
We can also get from a task to the container quite easily.
So, the only question becomes whether there is a 1:1 relationship
between mm_structs and containers. Does each mm_struct belong to one
and only one container? Basically, can a threaded process have
different threads in different containers?
It seems that we could bridge the gap pretty easily by either assigning
each mm_struct to a container directly, or putting some kind of
task-to-mm lookup. Perhaps just a list like
mm->tasks_using_this_mm_list.
Not rocket science, right?
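For illustration only, the tail of that pointer chain, assuming the proposed 1:1 mm-to-container relationship and a hypothetical mm->rss_container back pointer (neither is in the posted patches); the page-to-vma step is rmap's job and is omitted here:

#include <linux/mm.h>
#include <linux/sched.h>

struct rss_container;	/* as in patch 2/7 */

/* mm -> container: assumes a hypothetical mm->rss_container field */
static inline struct rss_container *mm_container(struct mm_struct *mm)
{
	return mm->rss_container;
}

/* page -> vma (via rmap) -> mm -> container */
static inline struct rss_container *vma_container(struct vm_area_struct *vma)
{
	return mm_container(vma->vm_mm);
}

If threads of one process could sit in different containers, the mm-to-container step above would no longer be well defined, which is exactly the open question.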
-- Dave
Re: [RFC][PATCH 3/7] Data structures changes for RSS accounting [message #17726 is a reply to message #10891]
Mon, 12 March 2007 17:27
Dave Hansen
On Mon, 2007-03-12 at 20:19 +0300, Pavel Emelianov wrote:
> Dave Hansen wrote:
> > On Mon, 2007-03-12 at 19:16 +0300, Kirill Korotaev wrote:
> >> now VE2 maps the same page. You can't determine whether this page is mapped
> >> to this container or another one w/o page->container pointer.
> >
> > Hi Kirill,
> >
> > I thought we can always get from the page to the VMA. rmap provides
> > this to us via page->mapping and the 'struct address_space' or anon_vma.
> > Do we agree on that?
>
> Not completely. When page is unmapped from the *very last*
> user its *first* toucher may already be dead. So we'll never
> find out who it was.
OK, but this is assuming that we didn't *un*account for the page when
the last user of the "owning" container stopped using the page.
> > We can also get from the vma to the mm very easily, via vma->vm_mm,
> > right?
> >
> > We can also get from a task to the container quite easily.
> >
> > So, the only question becomes whether there is a 1:1 relationship
> > between mm_structs and containers. Does each mm_struct belong to one
>
> No. The question is "how to get a container that touched the
> page first" which is the same as "how to find mm_struct which
> touched the page first". Obviously there's no answer on this
> question unless we hold some direct page->container reference.
> This may be a hash, a direct on-page pointer, or mirrored
> array of pointers.
Or, you keep track of when the last user from the container goes away,
and you effectively account it to another one.
Are there problems with shifting ownership around like this?
-- Dave
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code [message #17727 is a reply to message #10892]
Mon, 12 March 2007 17:33
Dave Hansen
On Mon, 2007-03-12 at 20:07 +0300, Kirill Korotaev wrote:
> > On Mon, 2007-03-12 at 19:23 +0300, Kirill Korotaev wrote:
> >>For these you essentially need per-container page->_mapcount counter,
> >>otherwise you can't detect whether rss group still has the page in question being mapped
> >>in its processes' address spaces or not.
> >
> > What do you mean by this? You can always tell whether a process has a
> > particular page mapped. Could you explain the issue a bit more. I'm
> > not sure I get it.
> When we do charge/uncharge we have to answer on another question:
> "whether *any* task from the *container* has this page mapped", not the
> "whether *this* task has this page mapped".
That's a bit more clear. ;)
OK, just so I make sure I'm getting your argument here. It would be too
expensive to go looking through all of the rmap data for _any_ other
task that might be sharing the charge (in the same container) with the
current task that is doing the unmapping.
The requirements you're presenting so far appear to be:
1. The first user of a page in a container must be charged
2. The second user of a page in a container must not be charged
3. A container using a page must take a diminished charge when
another container is already using the page.
4. Additional fields in data structures (including 'struct page') are
permitted
What have I missed? What are your requirements for performance?
I'm not quite sure how the page->container stuff fits in here, though.
page->container would appear to be strictly assigning one page to one
container, but I know that beancounters can do partial page charges.
Care to fill me in?
-- Dave
Re: [RFC][PATCH 2/7] RSS controller core [message #17730 is a reply to message #17719]
Mon, 12 March 2007 08:31
xemul
[snip]
>>>>>> We need to decide whether we want to do per-container memory
>>>>>> limitation via these data structures, or whether we do it via
>>>>>> a physical scan of some software zone, possibly based on Mel's
>>>>>> patches.
>>>>> why not do simple page accounting (as done currently
>>>>> in Linux) and use that for the limits, without
>>>>> keeping the reference from container to page?
>>>> As I've already answered in my previous letter simple
>>>> limiting w/o per-container reclamation and per-container
>>>> oom killer isn't a good memory management. It doesn't allow
>>>> to handle resource shortage gracefully.
>>> per container OOM killer does not require any container
>>> page reference, you know _what_ tasks belong to the
>>> container, and you know their _badness_ from the normal
>>> OOM calculations, so doing them for a container is really
>>> straight forward without having any page 'tagging'
>> That's true. If you look at the patches you'll
>> find out that no code in oom killer uses page 'tag'.
>
> so what do we keep the context -> page reference
> then at all?
We need this for
1. keeping the page's owner, so the page can be uncharged
   to IT when it goes away. Or do you propose to uncharge it
   to the current (i.e. ANY) container, like you do all across
   VServer accounting, which screws up accounting when
   pages are shared?
2. managing LRU lists for good reclamation. See Balbir's
patches for details.
3. possible future uses - correct sharing accounting,
dirty pages accounting, etc
>>> for the reclamation part, please elaborate how that will
>>> differ in a (shared memory) guest from what the kernel
>>> currently does ...
>> This is all described in the code and in the
>> discussions we had before.
>
> must have missed some of them, please can you
> point me to the relevant threads ...
lkml.org archives and google will help you :)
> TIA,
> Herbert
Re: [RFC][PATCH 2/7] RSS controller core [message #17731 is a reply to message #17720]
Mon, 12 March 2007 18:42
Dave Hansen
How about we drill down on these a bit more.
On Mon, 2007-03-12 at 02:00 +0100, Herbert Poetzl wrote:
> - shared mappings of 'shared' files (binaries
> and libraries) to allow for reduced memory
> footprint when N identical guests are running
So, it sounds like this can be phrased as a requirement like:
"Guests must be able to share pages."
Can you give us an idea why this is so? On a typical vserver system,
how much memory would be lost if guests were not permitted to share
pages like this? How much does this decrease the density of vservers?
> - virtual 'physical' limit should not cause
> swap out when there are still pages left on
> the host system (but pages of over limit guests
> can be preferred for swapping)
Is this a really hard requirement? It seems a bit fluffy to me. An
added bonus if we can do it, but certainly not the most important
requirement in the bunch.
What are the consequences if this isn't done? Doesn't a loaded system
eventually have all of its pages used anyway, so won't this always be a
temporary situation?
This also seems potentially harmful if we aren't able to get pages
*back* that we've given to a guest. Tasks can pin pages in lots of
creative ways.
> - accounting and limits have to be consistent
> and should roughly represent the actual used
> memory/swap (modulo optimizations, I can go
> into detail here, if necessary)
So, consistency is important, but is precision? If we, for instance,
used one of the hashing schemes, we could have some imprecise decisions
made but the system would stay consistent overall.
This requirement also doesn't seem to push us in the direction of having
distinct page owners, or some sharing mechanism, because both would be
consistent.
> - OOM handling on a per guest basis, i.e. some
> out of memory condition in guest A must not
> affect guest B
I'll agree that this one is important and well stated as-is. Any
disagreement on this one?
-- Dave
Re: [RFC][PATCH 2/7] RSS controller core [message #17736 is a reply to message #17648]
Mon, 12 March 2007 09:55
Balbir Singh
> doesn't look so good for me, mainly becaus of the
> additional per page data and per page processing
>
> on 4GB memory, with 100 guests, 50% shared for each
> guest, this basically means ~1mio pages, 500k shared
> and 1500k x sizeof(page_container) entries, which
> roughly boils down to ~25MB of wasted memory ...
>
> increase the amount of shared pages and it starts
> getting worse, but maybe I'm missing something here
>
> > We need to decide whether we want to do per-container memory
> > limitation via these data structures, or whether we do it via a
> > physical scan of some software zone, possibly based on Mel's patches.
>
> why not do simple page accounting (as done currently
> in Linux) and use that for the limits, without
> keeping the reference from container to page?
>
> best,
> Herbert
>
Herbert,
You lost me in the cc list and I almost missed this part of the
thread. Could you please not modify the "cc" list.
Thanks,
Balbir
Re: [RFC][PATCH 2/7] RSS controller core [message #17737 is a reply to message #10890]
Mon, 12 March 2007 23:02
Dave Hansen
On Mon, 2007-03-12 at 23:41 +0100, Herbert Poetzl wrote:
> On Mon, Mar 12, 2007 at 11:42:59AM -0700, Dave Hansen wrote:
> > How about we drill down on these a bit more.
> >
> > On Mon, 2007-03-12 at 02:00 +0100, Herbert Poetzl wrote:
> > > - shared mappings of 'shared' files (binaries
> > > and libraries) to allow for reduced memory
> > > footprint when N identical guests are running
> >
> > So, it sounds like this can be phrased as a requirement like:
> >
> > "Guests must be able to share pages."
> >
> > Can you give us an idea why this is so?
>
> sure, one reason for this is that guests tend to
> be similar (or almost identical) which results
> in quite a lot of 'shared' libraries and executables
> which would otherwise get cached for each guest and
> would also be mapped for each guest separately
>
> > On a typical vserver system,
>
> there is nothing like a typical Linux-VServer system :)
>
> > how much memory would be lost if guests were not permitted
> > to share pages like this?
>
> let me give a real world example here:
>
> - typical guest with 600MB disk space
> - about 100MB guest specific data (not shared)
> - assumed that 80% of the libs/tools are used
I get the general idea here, but I just don't think those numbers are
very accurate. My laptop has a bunch of gunk open (xterm, evolution,
firefox, xchat, etc...). I ran this command:
lsof | egrep '/(usr/|lib.*\.so)' | awk '{print $9}' | sort | uniq | xargs du -Dcs
and got:
113840 total
On a web/database server that I have (ps aux | wc -l == 128), I just ran
the same:
39168 total
That's assuming that all of the libraries are fully read in and
populated, just by their on-disk sizes. Is that not a reasonable measure
of the kinds of things that we can expect to be shared in a vserver? If
so, it's a long way from 400MB.
Could you try a similar measurement on some of your machines? Perhaps
mine are just weird.
> > > - virtual 'physical' limit should not cause
> > > swap out when there are still pages left on
> > > the host system (but pages of over limit guests
> > > can be preferred for swapping)
> >
> > Is this a really hard requirement?
>
> no, not hard, but a reasonable optimization ...
>
> let me note once again, that for full isolation
> you better go with Xen or some other Hypervisor
> because if you make it work like Xen, it will
> become as slow and resource hungry as any other
> paravirtualization solution ...
Believe me, _I_ don't want Xen. :)
> > It seems a bit fluffy to me.
>
> most optimizations might look strange at first
> glance, but when you check what the limitting
> factors for OS-Level virtualizations are, you
> will find that it looks like this:
>
> (in order of decreasing relevance)
>
> - I/O subsystem
> - available memory
> - network performance
> - CPU performance
>
> note: this is for 'typical' guests, not for
> number crunching or special database, or pure
> network bound applications/guests ...
I don't doubt this, but doing this two-level page-out thing for
containers/vservers over their limits is surely something that we should
consider farther down the road, right?
It's important to you, but you're obviously not doing any of the
mainline coding, right?
> > What are the consequences if this isn't done? Doesn't
> > a loaded system eventually have all of its pages used
> > anyway, so won't this always be a temporary situation?
>
> let's consider a quite limited guest (or several
> of them) which have a 'RAM' limit of 64MB and
> additional 64MB of 'virtual swap' assigned ...
>
> if they use roughly 96MB (memory footprint) then
> having this 'fluffy' optimization will keep them
> running without any effect on the host side, but
> without, they will continously swap in and out
> which will affect not only the host, but also the
> other guests ...
All workloads that use $limit+1 pages of memory will always pay the
price, right? :)
-- Dave
Re: [RFC][PATCH 3/7] Data structures changes for RSS accounting [message #17739 is a reply to message #17707]
Mon, 12 March 2007 16:16
dev
Eric W. Biederman wrote:
> Pavel Emelianov <xemul@sw.ru> writes:
>
>
>>Adds needed pointers to mm_struct and page struct,
>>places hooks to core code for mm_struct initialization
>>and hooks in container_init_early() to preinitialize
>>RSS accounting subsystem.
>
>
> An extra pointer in struct page is unlikely to fly.
> Both because it increases the size of a size critical structure,
> and because conceptually it is ridiculous.
as was discussed multiple times (and at OLS):
- it is not critical nowadays to expand struct page a bit when
  accounting is on.
- it can be done w/o extending it, e.g. via mapping page <-> container
  using a hash or some other data structure.
i.e. we can optimize it for size if that is considered necessary.
> If you are limiting the RSS size you are counting the number of pages in
> the page tables. You don't care about the page itself.
>
> With the rmap code it is relatively straight forward to see if this is
> the first time a page has been added to a page table in your rss
> group, or if this is the last reference to a particular page in your
> rss group. The counters should only increment the first time a
> particular page is added to your rss group. The counters should only
> decrement when it is the last reference in your rss subsystem.
You are fundamentally wrong where shared pages are concerned.
Imagine a glibc page shared between 2 containers - VE1 and VE2.
VE1 was the first to map it, so it is accounted to VE1
(the rmap count was increased by it).
Now VE2 maps the same page. You can't determine whether this page is mapped
to this container or another one w/o a page->container pointer.
All the choices you have are:
a) do not account this page, since it is already accounted to some other VE.
b) account this page again to the current container.
(a) is bad, since VE1 can unmap this page first, and the last user will be VE2.
Which means VE1 stays charged for it while VE2 is not. Accounting screws up.
(b) is bad, since:
- the same page is accounted multiple times, which makes it impossible
  to understand how much real memory a container needs/consumes
- and because on container enter the process and its pages
  are essentially moved to another context, while accounting
  cannot be fixed up easily and we essentially have (a).
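As a sketch of how choice (a) goes wrong, this is roughly what first-touch charging looks like when only a page->container pointer is kept (illustrative names only; res_charge()/res_uncharge() stand in for whatever the resource counter API provides, and the field name is not the patch's):

/* charge the page to the first container that maps it */
static void container_charge_page(struct page *page, struct rss_container *cnt)
{
	if (page->rss_container == NULL) {	/* first toucher: VE1 */
		page->rss_container = cnt;
		res_charge(&cnt->res, 1);	/* one page */
	}
	/* later mappers (VE2) are not charged: this is choice (a) */
}

/* uncharge only when the last mapping of the page goes away */
static void container_uncharge_page(struct page *page)
{
	if (page_mapcount(page) == 0) {
		/* VE1 stays charged until VE2, the real last user, unmaps:
		 * the owner keeps paying for a page it may no longer map */
		res_uncharge(&page->rss_container->res, 1);
		page->rss_container = NULL;
	}
}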
> This allow important little cases like glibc to be properly accounted
> for. One of the key features of a rss limit is that the kernel can
> still keep pages that you need in-core, that are accessible with just
> a minor fault. Directly owning pages works directly against that
> principle.
Sorry, I can't understand what you mean. It doesn't work against that principle.
Each container has its own LRU, so if the glibc pages are among the most
often used ones, they won't be thrashed out.
Thanks,
Kirill
Re: [RFC][PATCH 3/7] Data structures changes for RSS accounting [message #17745 is a reply to message #17723]
Mon, 12 March 2007 17:19
xemul
Dave Hansen wrote:
> On Mon, 2007-03-12 at 19:16 +0300, Kirill Korotaev wrote:
>> now VE2 maps the same page. You can't determine whether this page is mapped
>> to this container or another one w/o page->container pointer.
>
> Hi Kirill,
>
> I thought we can always get from the page to the VMA. rmap provides
> this to us via page->mapping and the 'struct address_space' or anon_vma.
> Do we agree on that?
Not completely. When a page is unmapped by its *very last*
user, its *first* toucher may already be dead, so we'll never
find out who it was.
> We can also get from the vma to the mm very easily, via vma->vm_mm,
> right?
>
> We can also get from a task to the container quite easily.
>
> So, the only question becomes whether there is a 1:1 relationship
> between mm_structs and containers. Does each mm_struct belong to one
No. The question is "how to get the container that touched the
page first", which is the same as "how to find the mm_struct that
touched the page first". Obviously there's no answer to this
question unless we hold some direct page->container reference.
This may be a hash, a direct on-page pointer, or a mirrored
array of pointers.
> and only one container? Basically, can a threaded process have
> different threads in different containers?
>
> It seems that we could bridge the gap pretty easily by either assigning
> each mm_struct to a container directly, or putting some kind of
> task-to-mm lookup. Perhaps just a list like
> mm->tasks_using_this_mm_list.
This could work for reclamation: we scan through all the
mm_structs within the container and shrink their pages, but
we can't build an LRU this way.
> Not rocket science, right?
>
> -- Dave
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
Re: [RFC][PATCH 2/7] RSS controller core [message #17761 is a reply to message #17732]
Mon, 12 March 2007 21:11
Herbert Poetzl
On Mon, Mar 12, 2007 at 12:02:01PM +0300, Pavel Emelianov wrote:
> >>> Maybe you have some ideas how we can decide on this?
> >> We need to work out what the requirements are before we can
> >> settle on an implementation.
> >
> > Linux-VServer (and probably OpenVZ):
> >
> > - shared mappings of 'shared' files (binaries
> > and libraries) to allow for reduced memory
> > footprint when N identical guests are running
>
> This is done in current patches.
nice, but the question was about _requirements_
(so your requirements are?)
> > - virtual 'physical' limit should not cause
> > swap out when there are still pages left on
> > the host system (but pages of over limit guests
> > can be preferred for swapping)
>
> So what to do when virtual physical limit is hit?
> OOM-kill current task?
when the RSS limit is hit, but there _are_ enough
pages left on the physical system, there is no
good reason to swap out the page at all
- there is no benefit in doing so (performance-
  wise, that is)
- it actually hurts performance, and could
  become a separate source of DoS
what should happen instead (in an ideal world :)
is that the page is considered swapped out for
the guest (add a guest penalty for the swapout), and
when the page would be swapped in again, the guest
takes a penalty (for the 'virtual' page-in) and
the page is returned to the guest, possibly kicking
out (again virtually) a different page
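A purely illustrative sketch of that idea, assuming per-guest counters and a configurable penalty (none of these names exist in any posted patch or in Linux-VServer as released):

struct guest_mem {
	long rss_pages;		/* pages charged against the guest's RSS limit */
	long virt_swapped;	/* pages 'virtually' swapped out, still resident */
	long swapin_penalty;	/* artificial cost charged on a virtual swap-in */
};

/* guest hits its RSS limit, but the host still has free pages */
static void guest_virtual_swapout(struct guest_mem *g)
{
	g->rss_pages--;		/* page no longer counts against the limit */
	g->virt_swapped++;	/* ... but it stays resident on the host */
}

/* the guest touches the page again */
static void guest_virtual_swapin(struct guest_mem *g)
{
	/* charge the artificial swap-in cost here (e.g. a short delay) */
	g->rss_pages++;		/* may trigger another virtual swap-out */
	g->virt_swapped--;
}

The host never does real I/O in this path; only the guest's accounting and its perceived swap activity change.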
> > - accounting and limits have to be consistent
> > and should roughly represent the actual used
> > memory/swap (modulo optimizations, I can go
> > into detail here, if necessary)
>
> This is true for current implementation for
> booth - this patchset ang OpenVZ beancounters.
>
> If you sum up the physpages values for all containers
> you'll get the exact number of RAM pages used.
hmm, including or excluding the host pages?
> > - OOM handling on a per guest basis, i.e. some
> > out of memory condition in guest A must not
> > affect guest B
>
> This is done in current patches.
> Herbert, did you look at the patches before
> sending this mail or do you just want to
> 'take part' in conversation w/o understanding
> of hat is going on?
again, the question was about requirements, not
your patches, and yes, I had a look at them _and_
the OpenVZ implementations ...
best,
Herbert
PS: hat is going on? :)
> > HTC,
> > Herbert
> >
> >> Sigh. Who is running this show? Anyone?
> >>
> >> You can actually do a form of overcommittment by allowing multiple
> >> containers to share one or more of the zones. Whether that is
> >> sufficient or suitable I don't know. That depends on the requirements,
> >> and we haven't even discussed those, let alone agreed to them.
> >>
> >> _______________________________________________
> >> Containers mailing list
> >> Containers@lists.osdl.org
> >> https://lists.osdl.org/mailman/listinfo/containers
> >
Re: [RFC][PATCH 3/7] Data structures changes for RSS accounting [message #17763 is a reply to message #17723]
Mon, 12 March 2007 17:21
Balbir Singh
On 3/12/07, Dave Hansen <hansendc@us.ibm.com> wrote:
> On Mon, 2007-03-12 at 19:16 +0300, Kirill Korotaev wrote:
> > now VE2 maps the same page. You can't determine whether this page is mapped
> > to this container or another one w/o page->container pointer.
>
> Hi Kirill,
>
> I thought we can always get from the page to the VMA. rmap provides
> this to us via page->mapping and the 'struct address_space' or anon_vma.
> Do we agree on that?
>
> We can also get from the vma to the mm very easily, via vma->vm_mm,
> right?
>
> We can also get from a task to the container quite easily.
>
> So, the only question becomes whether there is a 1:1 relationship
> between mm_structs and containers. Does each mm_struct belong to one
> and only one container? Basically, can a threaded process have
> different threads in different containers?
>
> It seems that we could bridge the gap pretty easily by either assigning
> each mm_struct to a container directly, or putting some kind of
> task-to-mm lookup. Perhaps just a list like
> mm->tasks_using_this_mm_list.
>
> Not rocket science, right?
>
> -- Dave
>
These patches are very similar to what I posted at
http://lwn.net/Articles/223829/
In my patches, the thread group leader owns the mm_struct and all
threads belong to the same container. I did not have a per-container
LRU; walking the global list for reclaim was a bit slow, but otherwise
my patches did not add anything to struct page.
I used rmap information to get to the VMA and then the mm_struct.
Kirill, it is possible to determine all the containers that map the
page. Please see the page_in_container() function of
http://lkml.org/lkml/2007/2/26/7.
I was also thinking of using the page table(s) to identify all pages
belonging to a container, by obtaining all the mm_structs of tasks
belonging to a container. But this approach would not work well for
the page cache controller, when we add that to our memory controller.
Balbir
Re: [RFC][PATCH 2/7] RSS controller core [message #17764 is a reply to message #17731]
Mon, 12 March 2007 22:41
Herbert Poetzl
On Mon, Mar 12, 2007 at 11:42:59AM -0700, Dave Hansen wrote:
> How about we drill down on these a bit more.
>
> On Mon, 2007-03-12 at 02:00 +0100, Herbert Poetzl wrote:
> > - shared mappings of 'shared' files (binaries
> > and libraries) to allow for reduced memory
> > footprint when N identical guests are running
>
> So, it sounds like this can be phrased as a requirement like:
>
> "Guests must be able to share pages."
>
> Can you give us an idea why this is so?
sure, one reason for this is that guests tend to
be similar (or almost identical) which results
in quite a lot of 'shared' libraries and executables
which would otherwise get cached for each guest and
would also be mapped for each guest separately
> On a typical vserver system,
there is nothing like a typical Linux-VServer system :)
> how much memory would be lost if guests were not permitted
> to share pages like this?
let me give a real-world example here:
- typical guest with 600MB disk space
- about 100MB guest specific data (not shared)
- assumed that 80% of the libs/tools are used
gives 400MB of shared read-only data
assuming you are running 100 guests on a host,
that makes ~39GB of virtual memory which will
get paged in and out over and over again ...
.. compared to 400MB of shared pages in memory :)
> How much does this decrease the density of vservers?
well, let's look at the overall memory resource
function with the above assumptions:
with sharing: f(N) = N*80M + 400M
without sharing: g(N) = N*480M
so, as N -> inf, the ratio g/f -> 6 (a factor of six)
which is quite realistic, if you consider that
there are only so many distributions, OTOH, the
factor might become less important when the
guest specific data grows ...
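To make that concrete with the numbers above: for N = 100 guests,
f(100) = 100*80M + 400M = 8400MB (~8.4GB) with sharing, versus
g(100) = 100*480M = 48000MB (48GB) without, i.e. a ratio of
roughly 5.7, approaching 6 as N grows.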
> > - virtual 'physical' limit should not cause
> > swap out when there are still pages left on
> > the host system (but pages of over limit guests
> > can be preferred for swapping)
>
> Is this a really hard requirement?
no, not hard, but a reasonable optimization ...
let me note once again, that for full isolation
you better go with Xen or some other Hypervisor
because if you make it work like Xen, it will
become as slow and resource hungry as any other
paravirtualization solution ...
> It seems a bit fluffy to me.
most optimizations might look strange at first
glance, but when you check what the limiting
factors for OS-Level virtualizations are, you
will find that it looks like this:
(in order of decreasing relevance)
- I/O subsystem
- available memory
- network performance
- CPU performance
note: this is for 'typical' guests, not for
number crunching or special database, or pure
network bound applications/guests ...
> An added bonus if we can do it, but certainly not the
> most important requirement in the bunch.
nope, not the _most_ important one, but it
all summs up :)
> What are the consequences if this isn't done? Doesn't
> a loaded system eventually have all of its pages used
> anyway, so won't this always be a temporary situation?
let's consider a quite limited guest (or several
of them) which have a 'RAM' limit of 64MB and
additional 64MB of 'virtual swap' assigned ...
if they use roughly 96MB (memory footprint) then
having this 'fluffy' optimization will keep them
running without any effect on the host side, but
without, they will continuously swap in and out
which will affect not only the host, but also the
other guests ...
> This also seems potentially harmful if we aren't able
> to get pages *back* that we've given to a guest.
no, the idea is not to keep them unconditionally,
the concept is to allow them to stay, even if the
guest has reached the RSS limit and a 'real' system
would have to swap pages out (or simply drop them)
to get other pages mapped ...
> Tasks can pin pages in lots of creative ways.
sure, this is why we should have proper limits
for that too :)
> > - accounting and limits have to be consistent
> > and should roughly represent the actual used
> > memory/swap (modulo optimizations, I can go
> > into detail here, if necessary)
>
> So, consistency is important, but is precision?
IMHO precision is not that important, of course,
the values should be in the same ballpark ...
> If we, for instance, used one of the hashing schemes,
> we could have some imprecise decisions made but the
> system would stay consistent overall.
it is also important that the lack of precision
cannot be exploited to allocate unreasonable
amounts of resources ...
at least Linux-VServer could live with +/- 10%
(or probably more); as I said, it is mainly used
for preventing DoS or DoR attacks ...
> This requirement also doesn't seem to push us in the
> direction of having distinct page owners, or some
> sharing mechanism, because both would be consistent.
> > - OOM handling on a per guest basis, i.e. some
> > out of memory condition in guest A must not
> > affect guest B
>
> I'll agree that this one is important and well stated
> as-is. Any disagreement on this one?
nope ...
best,
Herbert
> -- Dave
Re: [RFC][PATCH 2/7] RSS controller core [message #17769 is a reply to message #17736]
Mon, 12 March 2007 23:43
Herbert Poetzl
On Mon, Mar 12, 2007 at 03:25:07PM +0530, Balbir Singh wrote:
> > doesn't look so good for me, mainly becaus of the
> > additional per page data and per page processing
> >
> > on 4GB memory, with 100 guests, 50% shared for each
> > guest, this basically means ~1mio pages, 500k shared
> > and 1500k x sizeof(page_container) entries, which
> > roughly boils down to ~25MB of wasted memory ...
> >
> > increase the amount of shared pages and it starts
> > getting worse, but maybe I'm missing something here
> >
> > > We need to decide whether we want to do per-container memory
> > > limitation via these data structures, or whether we do it via
> > > a physical scan of some software zone, possibly based on Mel's
> > > patches.
> >
> > why not do simple page accounting (as done currently
> > in Linux) and use that for the limits, without
> > keeping the reference from container to page?
> >
> > best,
> > Herbert
> >
>
> Herbert,
>
> You lost me in the cc list and I almost missed this part of the
> thread.
hmm, it is very unlikely that this would happen,
for several reasons ... and indeed, checking the
thread in my mailbox shows that akpm dropped you ...
--------------------------------------------------------------------
Subject: [RFC][PATCH 2/7] RSS controller core
From: Pavel Emelianov <xemul@sw.ru>
To: Andrew Morton <akpm@osdl.org>, Paul Menage <menage@google.com>,
Srivatsa Vaddagiri <vatsa@in.ibm.com>,
Balbir Singh <balbir@in.ibm.com>
Cc: containers@lists.osdl.org,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Date: Tue, 06 Mar 2007 17:55:29 +0300
--------------------------------------------------------------------
Subject: Re: [RFC][PATCH 2/7] RSS controller core
From: Andrew Morton <akpm@linux-foundation.org>
To: Pavel Emelianov <xemul@sw.ru>
Cc: Kirill@smtp.osdl.org, Linux@smtp.osdl.org, containers@lists.osdl.org,
Paul Menage <menage@google.com>,
List <linux-kernel@vger.kernel.org>
Date: Tue, 6 Mar 2007 14:00:36 -0800
--------------------------------------------------------------------
that's the one I 'group' replied to ...
> Could you please not modify the "cc" list.
I never modify the cc unless explicitly asked
to do so. I wish others would have it that way
too :)
best,
Herbert
> Thanks,
> Balbir
> _______________________________________________
> Containers mailing list
> Containers@lists.osdl.org
> https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code [message #17770 is a reply to message #17724]
Mon, 12 March 2007 23:54
Herbert Poetzl
On Mon, Mar 12, 2007 at 09:50:08AM -0700, Dave Hansen wrote:
> On Mon, 2007-03-12 at 19:23 +0300, Kirill Korotaev wrote:
> >
> > For these you essentially need per-container page->_mapcount counter,
> > otherwise you can't detect whether rss group still has the page
> > in question being mapped in its processes' address spaces or not.
> What do you mean by this? You can always tell whether a process has a
> particular page mapped. Could you explain the issue a bit more. I'm
> not sure I get it.
OpenVZ wants to account _shared_ pages in a guest
differently than private pages, so that the RSS-
accounted values reflect the RAM actually used instead
of the sum of all processes' RSS pages, which for
sure is more relevant to the administrator, but IMHO
not so terribly important as to justify memory-consuming
structures and sacrificing performance to get it right
YMMV, but maybe we can find a smart solution to the
issue too :)
best,
Herbert
> -- Dave
>
> _______________________________________________
> Containers mailing list
> Containers@lists.osdl.org
> https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 1/7] Resource counters [message #17774 is a reply to message #17721]
Tue, 13 March 2007 09:09
ebiederm
Herbert Poetzl <herbert@13thfloor.at> writes:
> On Sun, Mar 11, 2007 at 01:00:15PM -0600, Eric W. Biederman wrote:
>> Herbert Poetzl <herbert@13thfloor.at> writes:
>>
>> >
>> > Linux-VServer does the accounting with atomic counters,
>> > so that works quite fine, just do the checks at the
>> > beginning of whatever resource allocation and the
>> > accounting once the resource is acquired ...
>>
>> Atomic operations versus locks is only a granularity thing.
>> You still need the cache line which is the cost on SMP.
>>
>> Are you using atomic_add_return or atomic_add_unless or
>> are you performing you actions in two separate steps
>> which is racy? What I have seen indicates you are using
>> a racy two separate operation form.
>
> yes, this is the current implementation which
> is more than sufficient, but I'm aware of the
> potential issues here, and I have an experimental
> patch sitting here which removes this race with
> the following change:
>
> - doesn't store the accounted value but
> limit - accounted (i.e. the free resource)
> - uses atomic_add_return()
> - when negative, an error is returned and
> the resource amount is added back
>
> changes to the limit have to adjust the 'current'
> value too, but that is again simple and atomic
>
> best,
> Herbert
>
> PS: atomic_add_unless() didn't exist back then
> (at least I think so) but that might be an option
> too ...
I think, as far as having this discussion goes, if you can remove that race
people will be more willing to talk about what vserver does.
That said, anything that uses locks or atomic operations (finer-grained locks)
is going to have scaling issues on large boxes because of the cache-line
ping-pong.
So in that sense anything short of per-CPU variables sucks at scale. That said,
I would much rather get a simple, correct version without the complexity of
per-CPU counters before we optimize the counters that much.
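As a rough illustration of the per-CPU idea (plain C, not any existing kernel API): each CPU accumulates into its own cache line, and only a slow path sums them, so the fast path never bounces a shared cache line between CPUs.

#define MAX_CPUS	64
#define CACHE_LINE	64

struct split_counter {
	struct {
		long val;
		char pad[CACHE_LINE - sizeof(long)];	/* avoid false sharing */
	} percpu[MAX_CPUS];
};

/* fast path: touches only the calling CPU's cache line */
static inline void split_counter_add(struct split_counter *c, int cpu, long delta)
{
	c->percpu[cpu].val += delta;
}

/* slow path: approximate total, e.g. for an occasional limit check */
static inline long split_counter_read(struct split_counter *c)
{
	long sum = 0;
	int i;

	for (i = 0; i < MAX_CPUS; i++)
		sum += c->percpu[i].val;
	return sum;
}

The catch is that the total is only approximate while other CPUs are updating, which is exactly the extra complexity a simple, correct first version would rather defer.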
Eric
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code [message #17776 is a reply to message #17770]
Tue, 13 March 2007 09:58
ebiederm
Herbert Poetzl <herbert@13thfloor.at> writes:
> On Mon, Mar 12, 2007 at 09:50:08AM -0700, Dave Hansen wrote:
>> On Mon, 2007-03-12 at 19:23 +0300, Kirill Korotaev wrote:
>> >
>> > For these you essentially need per-container page->_mapcount counter,
>> > otherwise you can't detect whether rss group still has the page
>> > in question being mapped in its processes' address spaces or not.
>
>> What do you mean by this? You can always tell whether a process has a
>> particular page mapped. Could you explain the issue a bit more. I'm
>> not sure I get it.
>
> OpenVZ wants to account _shared_ pages in a guest
> different than separate pages, so that the RSS
> accounted values reflect the actual used RAM instead
> of the sum of all processes RSS' pages, which for
> sure is more relevant to the administrator, but IMHO
> not so terribly important to justify memory consuming
> structures and sacrifice performance to get it right
>
> YMMV, but maybe we can find a smart solution to the
> issue too :)
I will tell you what I want.
I want a shared page cache that has nothing to do with RSS limits.
I want an RSS limit such that, once I know I can run a deterministic
application with a fixed set of inputs inside it, I know it will
always run.
First-touch page ownership does not give me anything useful
for knowing whether I can run my application or not. Because of page
sharing my application might run inside the RSS limit only because
I got lucky and happened to share a lot of pages with another running
application. If the next time I run it that other application isn't
running, my application will fail. That is ridiculous.
I don't want sharing between vservers/VE/containers to affect how many
pages I can have mapped into my processes at once.
Now sharing is sufficiently rare that I'm pretty certain these problems
come up rarely, so maybe they have not shown up in testing
yet. But until I see proof that actually doing the accounting for
sharing properly has intolerable overhead, I want proper accounting,
not hand waving that is only accurate on the third Tuesday of the
month.
Ideally all of this will be followed by smarter rss based swapping.
There are some very cool things that can be done to eliminate machine
overload once you have the ability to track real rss values.
Eric
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 2/7] RSS controller core [message #17777 is a reply to message #17769] |
Tue, 13 March 2007 01:57 |
Balbir Singh
Messages: 491 Registered: August 2006
> hmm, it is very unlikely that this would happen,
> for several reasons ... and indeed, checking the
> thread in my mailbox shows that akpm dropped you ...
>
But, I got Andrew's email.
> --------------------------------------------------------------------
> Subject: [RFC][PATCH 2/7] RSS controller core
> From: Pavel Emelianov <xemul@sw.ru>
> To: Andrew Morton <akpm@osdl.org>, Paul Menage <menage@google.com>,
> Srivatsa Vaddagiri <vatsa@in.ibm.com>,
> Balbir Singh <balbir@in.ibm.com>
> Cc: containers@lists.osdl.org,
> Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
> Date: Tue, 06 Mar 2007 17:55:29 +0300
> --------------------------------------------------------------------
> Subject: Re: [RFC][PATCH 2/7] RSS controller core
> From: Andrew Morton <akpm@linux-foundation.org>
> To: Pavel Emelianov <xemul@sw.ru>
> Cc: Kirill@smtp.osdl.org, Linux@smtp.osdl.org, containers@lists.osdl.org,
> Paul Menage <menage@google.com>,
> List <linux-kernel@vger.kernel.org>
> Date: Tue, 6 Mar 2007 14:00:36 -0800
> --------------------------------------------------------------------
> that's the one I 'group' replied to ...
>
> > Could you please not modify the "cc" list.
>
> I never modify the cc unless explicitly asked
> to do so. I wish others would have it that way
> too :)
>
That's good to know, but my mailer shows
Andrew Morton <akpm@linux-foundation.org>
to Pavel Emelianov <xemul@sw.ru>
cc
Paul Menage <menage@google.com>,
Srivatsa Vaddagiri <vatsa@in.ibm.com>,
Balbir Singh <balbir@in.ibm.com> (see I am <<HERE>>),
devel@openvz.org,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
containers@lists.osdl.org,
Kirill Korotaev <dev@sw.ru>
date Mar 7, 2007 3:30 AM
subject Re: [RFC][PATCH 2/7] RSS controller core
mailed-by vger.kernel.org
On Tue, 06 Mar 2007 17:55:29 +0300
and your reply as
Andrew Morton <akpm@linux-foundation.org>,
Pavel Emelianov <xemul@sw.ru>,
Kirill@smtp.osdl.org,
Linux@smtp.osdl.org,
containers@lists.osdl.org,
Paul Menage <menage@google.com>,
List <linux-kernel@vger.kernel.org>
to Andrew Morton <akpm@linux-foundation.org>
cc
Pavel Emelianov <xemul@sw.ru>,
Kirill@smtp.osdl.org,
Linux@smtp.osdl.org,
containers@lists.osdl.org,
Paul Menage <menage@google.com>,
List <linux-kernel@vger.kernel.org>
date Mar 9, 2007 10:18 PM
subject Re: [RFC][PATCH 2/7] RSS controller core
mailed-by vger.kernel.org
I am not sure what went wrong. Could you please check your mail
client, because it seems to have even changed the email address to
smtp.osdl.org, which bounced when I wrote to you earlier.
> best,
> Herbert
>
Cheers,
Balbir
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code [message #17782 is a reply to message #10892] |
Tue, 13 March 2007 16:01 |
ebiederm
Messages: 1354 Registered: February 2006
Nick Piggin <nickpiggin@yahoo.com.au> writes:
> Eric W. Biederman wrote:
>>
>> First-touch page ownership does not give me anything useful for knowing
>> whether I can run my application or not. Because of page sharing, my
>> application might run inside the rss limit only because I got lucky and
>> happened to share a lot of pages with another running application. The
>> next time I run it, when that other application isn't running, my
>> application will fail. That is ridiculous.
>
> Let's be practical here, what you're asking is basically impossible.
>
> Unless by deterministic you mean that it never enters a non-trivial
> syscall, in which case you just want to know about the maximum RSS of
> the process, which we already account.
Not per process; I want this for a group of processes, and yes, that is
all I want. I just want accounting of the maximum RSS of a group of
processes and then a mechanism to limit that maximum RSS.
>> I don't want sharing between vservers/VE/containers to affect how many
>> pages I can have mapped into my processes at once.
>
> You seem to want total isolation. You could use virtualization?
No. I don't want the meaning of my rss limit to be affected by what
other processes are doing. We have constraints on how many resources
the box actually has. But I don't want accounting so sloppy that
processes outside my group of processes can artificially
lower my rss value, which magically raises my rss limit.
>> Now, sharing is sufficiently rare that I'm pretty certain these problems
>> come up rarely, so maybe they have not shown up in testing yet. But until
>> I see proof that actually doing the accounting for sharing properly has
>> intolerable overhead, I want proper accounting, not this hand-waving that
>> is only accurate on the third Tuesday of the month.
>
> It is basically handwaving anyway. The only approach I've seen with
> a sane (not perfect, but good) way of accounting memory use is this
> one. If you care to define "proper", then we could discuss that.
I will agree that this patchset is probably in the right general ballpark.
But the fact that pages are assigned exactly one owner is pure nonsense.
We can do better. All I am asking is for someone to at least attempt to
actually account for the RSS of a group of processes and get the numbers
right when pages are shared between different groups of processes. We
have the data structures to support this with rmap.
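For concreteness, a rough sketch of what accounting sharing properly could
look like, with a hypothetical per-group mapcount hanging off the page's
accounting data (illustrative names, not what this patchset implements):

#include <linux/slab.h>

struct rss_group {
	long rss_pages;			/* pages this group has mapped */
};

/* One entry per (page, group) pair. */
struct page_group_ref {
	struct rss_group	*group;
	int			mapcount;	/* mappings from this group */
	struct page_group_ref	*next;
};

/* Called under the page's accounting lock when a task in 'g' maps the page. */
static void page_add_group_rmap(struct page_group_ref **list,
				struct rss_group *g)
{
	struct page_group_ref *r;

	for (r = *list; r; r = r->next) {
		if (r->group == g) {
			r->mapcount++;
			return;
		}
	}
	r = kmalloc(sizeof(*r), GFP_ATOMIC);	/* error handling omitted */
	r->group = g;
	r->mapcount = 1;
	r->next = *list;
	*list = r;
	g->rss_pages++;				/* group now shares the page */
}

/* Called under the same lock when a task in 'g' unmaps the page. */
static void page_remove_group_rmap(struct page_group_ref **list,
				   struct rss_group *g)
{
	struct page_group_ref **p, *r;

	for (p = list; (r = *p) != NULL; p = &r->next) {
		if (r->group != g)
			continue;
		if (--r->mapcount == 0) {	/* last mapping from this group */
			*p = r->next;
			g->rss_pages--;
			kfree(r);
		}
		return;
	}
}

The extra list walk and allocation on every map and unmap is exactly the
overhead being argued about here.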
Let me describe the situation where I think the accounting in the
patchset goes totally wonky.
Gcc, as I recall, maps the pages it is compiling with mmap.
If in a single kernel tree I do:
make -jN O=../compile1 &
make -jN O=../compile2 &
but set it up so that the two compiles are in different rss groups, and
I run them concurrently, they will use the same files at the same time,
and most likely, because of the first-touch rss accounting rule, both
compiles will be able to complete even if I have a draconian rss limit.
However, if I take the most draconian rss limit that still allows both
concurrent compiles to finish and then run either of them alone, I won't
be able to compile a single kernel tree.
The reason for the failure with a single tree (in my thought experiment)
is that the rss limit was set below what is actually needed for the code
to work. When we were compiling two kernels that were mapping the same
pages at the same time, we could put the rss limit below the minimum rss
needed for the compile to execute and still have it complete, because
with first touch only one group accounted for the pages and the other
just leeched off the first; as long as both compiles grabbed some of the
pages, they could complete.
Now, I know that in practice most draconian limits will simply result in
the page staying in the page cache but not being mapped into processes in
the group with the draconian limit, or in pages of that group being
pushed out into the swap cache. So actual application failure, even with
a draconian rss limit, is quite unlikely. (I actually really appreciate
this fact.)
However the messed up accounting that doesn't handle sharing between
groups of processes properly really bugs me. Especially when we have
the infrastructure to do it right.
Does that make more sense?
Eric
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 2/7] RSS controller core [message #17783 is a reply to message #11085] |
Tue, 13 March 2007 17:05 |
Dave Hansen
Messages: 240 Registered: October 2005
On Tue, 2007-03-13 at 03:48 -0800, Andrew Morton wrote:
> If we use a physical zone-based containment scheme: fake-numa,
> variable-sized zones, etc then it all becomes moot. You set up a container
> which has 1.5GB of physical memory then toss processes into it. As that
> process set increases in size it will toss out stray pages which shouldn't
> be there, then it will start reclaiming and swapping out its own pages and
> eventually it'll get an oom-killing.
I was just reading through the (comprehensive) thread about this from
last week, so forgive me if I missed some of it. The idea is really
tempting, precisely because I don't think anyone really wants to have to
screw with the reclaim logic.
I'm just brain-dumping here, hoping that somebody has already thought
through some of this stuff. It's not a bitch-fest, I promise. :)
How do we determine what is shared, and goes into the shared zones?
Once we've allocated a page, it's too late because we already picked.
Do we just assume all page cache is shared? Base it on filesystem,
mount, ...? Mount seems the most logical to me, since a sysadmin would
have to set up a container's fs anyway, and will likely be doing special
things to shared data anyway (r/o bind mounts :).
There's a conflict between the resize granularity of the zones, and the
storage space their lookup consumes. We'd want a container to have a
limited ability to fill up memory with stuff like the dcache, so we'd
appear to need to put the dentries inside the software zone. But, that
gets us to our inability to evict arbitrary dentries. After a while,
would containers tend to pin an otherwise empty zone into place? We
could resize it, but what is the cost of keeping around zones that can
be resized down to a size small enough that we don't mind leaving them
in place?
We could merge those "orphaned" zones back into the shared zone. Were
there any requirements about physical contiguity? What about minimum
zone sizes?
If we really do bind a set of processes strongly to a set of memory on a
set of nodes, then those really do become its home NUMA nodes. If the
CPUs there get overloaded, running it elsewhere will continue to grab
pages from the home. Would this basically keep us from ever being able
to move tasks around a NUMA system?
-- Dave
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 3/7] Data structures changes for RSS accounting [message #17784 is a reply to message #17726] |
Tue, 13 March 2007 07:10 |
xemul
Messages: 248 Registered: November 2005
Dave Hansen wrote:
> On Mon, 2007-03-12 at 20:19 +0300, Pavel Emelianov wrote:
>> Dave Hansen wrote:
>>> On Mon, 2007-03-12 at 19:16 +0300, Kirill Korotaev wrote:
>>>> now VE2 maps the same page. You can't determine whether this page is mapped
>>>> to this container or another one w/o page->container pointer.
>>> Hi Kirill,
>>>
>>> I thought we can always get from the page to the VMA. rmap provides
>>> this to us via page->mapping and the 'struct address_space' or anon_vma.
>>> Do we agree on that?
>> Not completely. When page is unmapped from the *very last*
>> user its *first* toucher may already be dead. So we'll never
>> find out who it was.
>
> OK, but this is assuming that we didn't *un*account for the page when
> the last user of the "owning" container stopped using the page.
That's exactly what we agreed on during our discussions:
when a page gets touched it is charged to that container.
When the page gets touched again by a new container it is NOT
charged to the new container, but keeps holding the old one
till it (the page) is completely freed. Nobody worried about the
fact that a single page can hold a container reference for good.
OpenVZ beancounters work the other way (and we proposed that
solution when we first sent the patches): we keep track of
*all* the containers (i.e. beancounters) holding this page.
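A rough sketch of that first-touch rule (illustrative names, not the
actual patch code):

struct rss_group;				/* the accounting container */
extern int  rss_try_charge(struct rss_group *g);/* fails when over limit */
extern void rss_uncharge(struct rss_group *g);

struct page_acct {
	struct rss_group *owner;		/* first toucher, or NULL */
};

/* Called whenever some container maps the page. */
static int page_touch(struct page_acct *pa, struct rss_group *g)
{
	if (pa->owner)		/* already charged to its first toucher */
		return 0;
	if (rss_try_charge(g))
		return -1;	/* over limit: reclaim or fail */
	pa->owner = g;		/* the page pins this group until freed */
	return 0;
}

/* Called only when the page itself is finally freed, not when the last
 * mapper in some other container unmaps it. */
static void page_free_acct(struct page_acct *pa)
{
	if (pa->owner) {
		rss_uncharge(pa->owner);
		pa->owner = NULL;
	}
}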
>>> We can also get from the vma to the mm very easily, via vma->vm_mm,
>>> right?
>>>
>>> We can also get from a task to the container quite easily.
>>>
>>> So, the only question becomes whether there is a 1:1 relationship
>>> between mm_structs and containers. Does each mm_struct belong to one
>> No. The question is "how to get a container that touched the
>> page first" which is the same as "how to find mm_struct which
>> touched the page first". Obviously there's no answer on this
>> question unless we hold some direct page->container reference.
>> This may be a hash, a direct on-page pointer, or mirrored
>> array of pointers.
>
> Or, you keep track of when the last user from the container goes away,
> and you effectively account it to another one.
We can migrate a page to another user, but we decided to implement
that later, once the simple accounting is accepted.
> Are there problems with shifting ownership around like this?
>
> -- Dave
>
>
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 2/7] RSS controller core [message #17786 is a reply to message #17761] |
Tue, 13 March 2007 07:17 |
xemul
Messages: 248 Registered: November 2005
Herbert Poetzl wrote:
> On Mon, Mar 12, 2007 at 12:02:01PM +0300, Pavel Emelianov wrote:
>>>>> Maybe you have some ideas how we can decide on this?
>>>> We need to work out what the requirements are before we can
>>>> settle on an implementation.
>>> Linux-VServer (and probably OpenVZ):
>>>
>>> - shared mappings of 'shared' files (binaries
>>> and libraries) to allow for reduced memory
>>> footprint when N identical guests are running
>> This is done in current patches.
>
> nice, but the question was about _requirements_
> (so your requirements are?)
>
>>> - virtual 'physical' limit should not cause
>>> swap out when there are still pages left on
>>> the host system (but pages of over limit guests
>>> can be preferred for swapping)
>> So what to do when virtual physical limit is hit?
>> OOM-kill current task?
>
> when the RSS limit is hit, but there _are_ enough
> pages left on the physical system, there is no
> good reason to swap out the page at all
>
> - there is no benefit in doing so (performance
> wise, that is)
>
> - it actually hurts performance, and could
> become a separate source for DoS
>
> what should happen instead (in an ideal world :)
> is that the page is considered swapped out for
> the guest (add guest penality for swapout), and
Does the page stay mapped for the container or not?
If yes, then what's the use of limits? The container has mapped more
pages than the limit allows, but all the pages are still in memory.
Sounds weird.
> when the page would be swapped in again, the guest
> takes a penalty (for the 'virtual' page in) and
> the page is returned to the guest, possibly kicking
> out (again virtually) a different page
>
>>> - accounting and limits have to be consistent
>>> and should roughly represent the actual used
>>> memory/swap (modulo optimizations, I can go
>>> into detail here, if necessary)
>> This is true for the current implementation of
>> both - this patchset and OpenVZ beancounters.
>>
>> If you sum up the physpages values for all containers
>> you'll get the exact number of RAM pages used.
>
> hmm, including or excluding the host pages?
Depends on whether you account host pages or not.
>>> - OOM handling on a per guest basis, i.e. some
>>> out of memory condition in guest A must not
>>> affect guest B
>> This is done in current patches.
>
>> Herbert, did you look at the patches before
>> sending this mail or do you just want to
>> 'take part' in conversation w/o understanding
>> of hat is going on?
>
> again, the question was about requirements, not
> your patches, and yes, I had a look at them _and_
> the OpenVZ implementations ...
>
> best,
> Herbert
>
> PS: hat is going on? :)
>
>>> HTC,
>>> Herbert
>>>
>>>> Sigh. Who is running this show? Anyone?
>>>>
>>>> You can actually do a form of overcommittment by allowing multiple
>>>> containers to share one or more of the zones. Whether that is
>>>> sufficient or suitable I don't know. That depends on the requirements,
>>>> and we haven't even discussed those, let alone agreed to them.
>>>>
>>>> _______________________________________________
>>>> Containers mailing list
>>>> Containers@lists.osdl.org
>>>> https://lists.osdl.org/mailman/listinfo/containers
>
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 2/7] RSS controller core [message #17787 is a reply to message #17760] |
Tue, 13 March 2007 17:26 |
Dave Hansen
Messages: 240 Registered: October 2005
On Mon, 2007-03-12 at 22:04 -0800, Andrew Morton wrote:
> So these mmapped pages will contiue to be shared across all guests. The
> problem boils down to "which guest(s) get charged for each shared page".
>
> A simple and obvious and easy-to-implement answer is "the guest which paged
> it in". I think we should firstly explain why that is insufficient.
My first worry was that this approach is unfair to the poor bastard that
happened to get started up first. If we have a bunch of containerized
web servers, the poor guy who starts Apache first will pay the price for
keeping it in memory for everybody else.
That said, I think this is naturally worked around. The guy charged
unfairly will get reclaim started on himself sooner. This will tend to
page out those pages that he was being unfairly charged for. Hopefully,
they will eventually get pretty randomly (eventually evenly) spread
among all users. We just might want to make sure that we don't allow
ptes (or other new references) to be re-established to pages like this
when we're trying to reclaim them. Either that, or force the next
toucher to take ownership of the thing. But, that kind of arbitrary
ownership transfer can't happen if we have rigidly defined boundaries
for the containers.
The other concern is that the memory load on the system doesn't come
from the first user ("the guy who paged it in"). The long-term load
comes from "the guy who keeps using it." The best way to exemplify this
is somebody who read()s a page in, followed by another guy mmap()ing the
same page. The guy who did the read will get charged, and the mmap()er
will get a free ride. We could probably get an idea when this kind of
stuff is happening by comparing page->count and page->_mapcount, but it
certainly wouldn't be conclusive. But, does this kind of nonsense even
happen in practice?
-- Dave
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 1/7] Resource counters [message #17790 is a reply to message #17774] |
Tue, 13 March 2007 09:27 |
xemul
Messages: 248 Registered: November 2005
Eric W. Biederman wrote:
> Herbert Poetzl <herbert@13thfloor.at> writes:
>
>> On Sun, Mar 11, 2007 at 01:00:15PM -0600, Eric W. Biederman wrote:
>>> Herbert Poetzl <herbert@13thfloor.at> writes:
>>>
>>>> Linux-VServer does the accounting with atomic counters,
>>>> so that works quite fine, just do the checks at the
>>>> beginning of whatever resource allocation and the
>>>> accounting once the resource is acquired ...
>>> Atomic operations versus locks is only a granularity thing.
>>> You still need the cache line which is the cost on SMP.
>>>
>>> Are you using atomic_add_return or atomic_add_unless or
>>> are you performing you actions in two separate steps
>>> which is racy? What I have seen indicates you are using
>>> a racy two separate operation form.
>> yes, this is the current implementation which
>> is more than sufficient, but I'm aware of the
>> potential issues here, and I have an experimental
>> patch sitting here which removes this race with
>> the following change:
>>
>> - doesn't store the accounted value but
>> limit - accounted (i.e. the free resource)
>> - uses atomic_add_return()
>> - when negative, an error is returned and
>> the resource amount is added back
>>
>> changes to the limit have to adjust the 'current'
>> value too, but that is again simple and atomic
>>
>> best,
>> Herbert
>>
>> PS: atomic_add_unless() didn't exist back then
>> (at least I think so) but that might be an option
>> too ...
>
> As far as having this discussion goes, if you can remove that race people
> will be more willing to talk about what vserver does.
>
> That said, anything that uses locks or atomic operations (finer-grained
> locks) is going to have scaling issues on large boxes because of
> cache-line ping-pong.
BTW, atomic_add_unless() is essentially a loop!!! Just like spin_lock()
is, so why is one better than the other? spin_lock() can go to schedule()
on preemptive kernels, thus increasing interactivity, while an atomic
loop can't.
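For reference, the generic atomic_add_unless() is indeed roughly a
cmpxchg loop along these lines (a sketch of the generic helper, not any
particular architecture's version):

/* Add 'a' to 'v' unless 'v' == 'u'; returns non-zero if the add happened. */
static inline int atomic_add_unless(atomic_t *v, int a, int u)
{
	int c, old;

	c = atomic_read(v);
	while (c != u && (old = atomic_cmpxchg(v, c, c + a)) != c)
		c = old;	/* lost the race, retry with the new value */
	return c != u;
}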
> So in that sense, anything short of per-cpu variables sucks at scale. That
> said, I would much rather get a simple, correct version without the
> complexity of per-cpu counters before we optimize the counters that much.
>
> Eric
>
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers
Re: [RFC][PATCH 2/7] RSS controller core [message #17791 is a reply to message #10890] |
Tue, 13 March 2007 20:28 |
Dave Hansen
Messages: 240 Registered: October 2005
On Tue, 2007-03-13 at 19:09 +0000, Alan Cox wrote:
> > stuff is happening by comparing page->count and page->_mapcount, but it
> > certainly wouldn't be conclusive. But, does this kind of nonsense even
> > happen in practice?
>
> "Is it useful for me as a bad guy to make it happen ?"
A very fine question. ;)
To exploit this, you'd need to:
1. access common data with another user
2. be patient enough to wait
3. determine when one of those users had actually pulled
a page in from disk, which sys_mincore() can do, right?
I guess that might be a decent reason to not charge the guy who brings
the page in for the page's entire lifetime.
So, unless we can change page ownership after it has been allocated,
anyone accessing shared data can get around resource limits if they are
patient.
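A minimal userspace sketch of step 3 above (a hypothetical probe that
checks whether the first page of a shared file mapping is currently
resident):

#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long psz = sysconf(_SC_PAGESIZE);
	unsigned char vec;
	void *map;
	int fd;

	if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;
	/* Map without touching the data, then ask the kernel about residency. */
	map = mmap(NULL, psz, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	if (mincore(map, psz, &vec) == 0)
		printf("first page is %sresident\n", (vec & 1) ? "" : "not ");
	return 0;
}

Polling something like this repeatedly would tell a patient user when
somebody else has already paid the cost of bringing the page in.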
-- Dave
_______________________________________________
Containers mailing list
Containers@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/containers