[PATCH 0/5] Kernel memory accounting container (v5)
Re: Re: [PATCH 4/5] Setup the control group [message #21086 is a reply to message #21083] Mon, 01 October 2007 16:04
Paul Menage
On 10/1/07, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> Excellent, I prefer the latter as well, but it would mean overheads
> for controllers not using the hierarchy.

I don't think it would have to, with the ideas I've been thinking
about - each task would still have a set of pointers to subsystems
which could be dereferenced just as quickly. The complexity comes in
trying to map a task to its actual cgroup object in a given hierarchy
- this would involve a bit more work on the part of the cgroup
framework, but wouldn't be a fast-path operation.

See my mail last week titled "Thoughts on virtualizing task containers".

> a design such that parents<->children can effectively share resources,
> track them and do so recursively, that would be really nice.

I think the recursive tracking would probably need to be supplied by
the subsystem rather than by the framework. But there's no reason that
multiple subsystems couldn't re-use the same hierarchy code via e.g.
resource counters. So when you initialize a resource counter you'd
tell it about its parent resource counter, and it would handle the
recursion automatically in charge/uncharge.
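
A minimal sketch of that charging scheme, assuming a res_counter-like
structure with a parent pointer (the names are illustrative, not the
actual -mm code, and locking is omitted):

#include <stddef.h>
#include <errno.h>

struct res_counter {
        unsigned long usage;
        unsigned long limit;
        struct res_counter *parent;     /* set at group creation */
};

/*
 * Charge val into this counter and every ancestor; if any level
 * is over its limit, undo the partial charges and fail.
 */
static int res_counter_charge(struct res_counter *cnt, unsigned long val)
{
        struct res_counter *c, *u;

        for (c = cnt; c != NULL; c = c->parent) {
                if (c->usage + val > c->limit)
                        goto undo;
                c->usage += val;
        }
        return 0;

undo:
        for (u = cnt; u != c; u = u->parent)
                u->usage -= val;
        return -ENOMEM;
}

static void res_counter_uncharge(struct res_counter *cnt, unsigned long val)
{
        struct res_counter *c;

        for (c = cnt; c != NULL; c = c->parent)
                c->usage -= val;
}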

Paul
Re: [PATCH 0/5] Kernel memory accounting container (v5) [message #21087 is a reply to message #20718] Mon, 01 October 2007 16:32
Paul Menage
Hi Pavel,

One question about the general design of this - have you tested an
approach where rather than tagging each object within the cache with
the cgroup that allocated it, you instead have (inside the cache code)
a separate cache structure for each cgroup? So the space overheads
would go from having a per-object overhead (one pointer per object?)
to having a "wastage" overhead (on average half a slab per cgroup).
And the time overhead would be the time required to look up the
relevant cache for a cgroup at the start of the allocation operation,
and the relevant cache for an object (from its struct page) at
deallocation, rather than the time required to update the per-object
housekeeping pointer.

Each cache would need to be assigned a unique ID, used as an index
into a per-cgroup lookup table of localized caches. (This could almost
be regarded as a form of kmem_cache namespace).
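
As a rough sketch of that lookup table (all names here are
hypothetical, and a fixed-size array stands in for whatever ID
allocator the real code would use):

#include <stddef.h>

#define MAX_CACHE_IDS   256

struct kmem_cache;      /* opaque; each cache carries a small integer id */

struct cgroup_kmem {
        /* indexed by the cache's id; NULL until this cgroup first
         * allocates from that cache, then points to its localized copy */
        struct kmem_cache *localized[MAX_CACHE_IDS];
};

/* O(1) array lookup at allocation time - no hashing involved */
static struct kmem_cache *
cgroup_cache(struct cgroup_kmem *cg, unsigned int cache_id)
{
        return cache_id < MAX_CACHE_IDS ? cg->localized[cache_id] : NULL;
}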

It seems to me that this alternative approach would have a lower
memory overhead for people who have the kernel memory controller
compiled in but aren't using it, or are only using a few groups.

Paul

On 9/25/07, Pavel Emelyanov <xemul@openvz.org> wrote:
> Changes since v.4:
> * made SLAB_NOTIFY caches mark pages as SlabDebug. That
>   makes the interesting paths simpler (thanks to Christoph);
> * the change above caused appropriate changes in the "turn
>   notifications on" path - all available pages must become
>   SlabDebug and the pages' freelists must be flushed;
> * added two more events - "on" and "off" - to make disabling
>   of kmalloc caches more graceful;
> * turning notifications "off" is marked as "TODO". Right
>   now it's hard without a massive rework of slub.c with
>   respect to full-slab handling.
>
> Changes since v.3:
> * moved alloc/free notification into the slow path and made
>   "notify-able" caches always walk this path;
> * introduced an optimization for the case when there's
>   only one listener for SLUB events (saves more than 10%
>   in performance);
> * ported to the 2.6.23-rc6-mm1 tree.
>
> Changes since v.2:
> * introduced generic notifiers for SLUB. Right now there
>   are only the events needed by accounting, but the set can
>   be extended in the future;
> * moved the controller core into a separate file, so that
>   extending it and/or porting it to SLAB will look more
>   logical;
> * fixed this message :).
>
> Changes since v.1:
> * fixed Paul's comment about subsystem registration;
> * return ERR_PTR from the ->create callback, not NULL;
> * made the container-to-object assignment in an RCU-safe section;
> * made turning accounting on and off work with "1" and "0".
>
> ============================================================
>
> A long time ago we decided to start memory control with the
> user memory container. Now that container is in the -mm tree, and
> I think we can move on to the kmem one.
>
> First of all - why do we need this kind of control? The major
> "pro" is that kernel memory control protects the system
> from DoS attacks by processes that live in a container. As our
> experience shows, many exploits simply do not work in a
> container with limited kernel memory.
>
> I can split the kernel memory container into 4 parts:
>
> 1. kmalloc-ed objects control
> 2. vmalloc-ed objects control
> 3. buddy allocated pages control
> 4. kmem_cache_alloc-ed objects control
>
> The control of the first three types of objects has one peculiarity:
> one needs to explicitly point out which allocations to
> account; this is not configurable at runtime and is still to be discussed.
>
> On the other hand, such objects as anon_vmas, files, sighands,
> vfsmounts, etc. are always created by user request and should
> always be accounted. Fortunately they are allocated from their
> own caches, and thus the whole kmem cache can be made accountable.
>
> This is exactly what this patchset does - it adds the ability
> to account for the total size of kmem-cache-allocated objects
> from specified kmem caches.
>
> This is based on the SLUB allocator, Paul's control groups, and the
> resource counters I made for the RSS controller, which are
> already in the -mm tree.
>
> To play with it, one needs to mount the container file system
> with -o kmem and then mark some caches as accountable via
> /sys/slab/<cache_name>/cache_notify.
>
> As I have already said, kmalloc caches cannot be accounted easily,
> so turning the accounting on for them will fail with -EINVAL.
>
> Turning the accounting off is possible only if the cache has
> no objects. This is because turning accounting off implies
> marking all the slabs in the cache as non-debug, but since
> full slabs in SLUB are (usually) not stored on any list,
> this cannot be done yet; it is on the todo list.
>
> Thanks,
> Pavel
>
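
A concrete illustration of the "to play with it" steps quoted above
(the mount invocation is a guess at the container-filesystem syntax
of that time, and "dentry" is just an example cache name):

        # mount -t container -o kmem none /containers
        # echo 1 > /sys/slab/dentry/cache_notify   (turn accounting on)
        # echo 0 > /sys/slab/dentry/cache_notify   (only while the cache is empty)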
Re: [PATCH 3/5] Switch caches notification dynamically [message #21105 is a reply to message #21058] Mon, 01 October 2007 20:39
Christoph Lameter
On Mon, 1 Oct 2007, Balbir Singh wrote:

> Is this documented somewhere or is this interpreted from looking
> at the code of other file handlers?

Documentation/vm/slub.txt

Re: [PATCH 5/5] Account for the slub objects [message #21106 is a reply to message #21060] Mon, 01 October 2007 20:41
Christoph Lameter
On Mon, 1 Oct 2007, Pavel Emelyanov wrote:

> >> +
> > 
> > Quick check, slub_free_notify() and slab_alloc_notify() are called
> > from serialized contexts, right?
> 
> Yup.

How is it serialized?


Re: [PATCH 5/5] Account for the slub objects [message #21141 is a reply to message #21106] Tue, 02 October 2007 12:44
Pavel Emelianov
Christoph Lameter wrote:
> On Mon, 1 Oct 2007, Pavel Emelyanov wrote:
> 
>>>> +
>>> Quick check, slub_free_notify() and slab_alloc_notify() are called
>>> from serialized contexts, right?
>> Yup.
> 
> How is it serialized?

They are both called from __slab_alloc()/__slab_free() from under
the slab_lock(page).

Thanks,
Pavel

Re: [PATCH 0/5] Kernel memory accounting container (v5) [message #21142 is a reply to message #21087] Tue, 02 October 2007 12:51
Pavel Emelianov
Paul Menage wrote:
> Hi Pavel,
> 
> One question about the general design of this - have you tested an
> approach where rather than tagging each object within the cache with
> the cgroup that allocated it, you instead have (inside the cache code)
> a separate cache structure for each cgroup? So the space overheads
> would go from having a per-object overhead (one pointer per object?)
> to having a "wastage" overhead (on average half a slab per cgroup).
> And the time overhead would be the time required to look up the
> relevant cache for a cgroup at the start of the allocation operation,
> and the relevant cache for an object (from its struct page) at
> deallocation, rather than the time required to update the per-object
> housekeeping pointer.

Such a lookup would require a hash table or something similar. We
already had a bad experience with that (with OpenVZ RSS fractions
accounting, for example). Hash lookups screw up the CPU caches and
hurt performance. See also the comment below.

> Each cache would need to be assigned a unique ID, used as an index
> into a per-cgroup lookup table of localized caches. (This could almost
> be regarded as a form of kmem_cache namespace).
> 
> It seems to me that this alternative approach would have a lower
> memory overhead for people who have the kernel memory controller
> compiled in but aren't using it, or are only using a few groups.

I thought the same some time ago and tried to make per-beancounter kmem
caches. The result was awful - the memory waste was much larger than in
the case of the pointer-per-object approach. Let alone the performance
issues - each kmalloc required a synchronized hash-table lookup, which
was too slow.

If you insist I can try to repeat the experiment, but I'm afraid the result
would be the same.

> Paul
>
Re: [PATCH 5/5] Account for the slub objects [message #21159 is a reply to message #21141] Tue, 02 October 2007 18:04
Christoph Lameter
On Tue, 2 Oct 2007, Pavel Emelyanov wrote:

> Christoph Lameter wrote:
> > On Mon, 1 Oct 2007, Pavel Emelyanov wrote:
> > 
> >>>> +
> >>> Quick check, slub_free_notify() and slab_alloc_notify() are called
> >>> from serialized contexts, right?
> >> Yup.
> > 
> > How is it serialized?
> 
> They are both called from __slab_alloc()/__slab_free() from under
> the slab_lock(page).

This means they are serialized per slab, which means you can guarantee
that multiple of these callbacks are not done at the same time for the
same object. Is that what you need?

Re: [PATCH 5/5] Account for the slub objects [message #21178 is a reply to message #21159] Wed, 03 October 2007 07:29
Pavel Emelianov
Christoph Lameter wrote:
> On Tue, 2 Oct 2007, Pavel Emelyanov wrote:
> 
>> Christoph Lameter wrote:
>>> On Mon, 1 Oct 2007, Pavel Emelyanov wrote:
>>>
>>>>>> +
>>>>> Quick check, slub_free_notify() and slab_alloc_notify() are called
>>>>> from serialized contexts, right?
>>>> Yup.
>>> How is it serialized?
>> They are both called from __slab_alloc()/__slab_free() from under
>> the slab_lock(page).
> 
> This means they are serialized per slab, which means you can guarantee
> that multiple of these callbacks are not done at the same time for the
> same object. Is that what you need?


Yes, I know :) But I do not rely on this lock inside the callbacks. What
I need is to notify about each new object only once; whether some lock is
held or not doesn't matter to the callbacks. In other words, it is just a
coincidence that these callbacks are called under this lock; this was not
done deliberately. Fortunately, the rollback in case the callbacks return
an error is easily done under this lock.
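
A schematic of that call-site shape (this is not the real slub.c;
everything is stubbed down to illustrate that the callback and its
rollback both happen under the per-slab lock):

#include <stddef.h>

struct page { int locked; void *freelist; };

static void slab_lock(struct page *page)   { page->locked = 1; }
static void slab_unlock(struct page *page) { page->locked = 0; }

/* stand-in for the accounting callback: nonzero means "over limit" */
static int slab_alloc_notify(void *object) { (void)object; return 0; }

static void *slow_path_alloc(struct page *page)
{
        void *object;

        slab_lock(page);
        object = page->freelist;
        page->freelist = NULL;          /* pretend we advanced the freelist */
        if (object && slab_alloc_notify(object)) {
                /* charge failed: put the object back while the
                 * same lock is still held - the easy rollback */
                page->freelist = object;
                object = NULL;
        }
        slab_unlock(page);
        return object;
}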

Thanks,
Pavel
Re: [PATCH 0/5] Kernel memory accounting container (v5) [message #21329 is a reply to message #21142] Fri, 05 October 2007 07:11
Paul Menage
On 10/2/07, Pavel Emelyanov <xemul@openvz.org> wrote:
>
> Such a lookup would require a hash table or something similar. We
> already had a bad experience with that (with OpenVZ RSS fractions
> accounting, for example). Hash lookups screw up the CPU caches and
> hurt performance. See also the comment below.

I think you could do it with an array lookup if you assigned an index
to each cache as it was created, and used that as an offset into a
per-cgroup array.

>
> I thought the same some time ago and tried to make per-beancounter kmem
> caches. The result was awful - the memory waste was much larger than in
> the case of the pointer-per-object approach.

OK, fair enough.

Was this with a large number of beancounters? I imagine that with a
small number, the waste might be rather more reasonable.

Paul
Re: [PATCH 0/5] Kernel memory accounting container (v5) [message #21342 is a reply to message #21329] Fri, 05 October 2007 13:17
Pavel Emelianov
Paul Menage wrote:
> On 10/2/07, Pavel Emelyanov <xemul@openvz.org> wrote:
>> Such a lookup would require a hash table or something similar. We
>> already had a bad experience with that (with OpenVZ RSS fractions
>> accounting, for example). Hash lookups screw up the CPU caches and
>> hurt performance. See also the comment below.
> 
> I think you could do it with an array lookup if you assigned an index
> to each cache as it was created, and used that as an offset into a
> per-cgroup array.
> 
>> I thought the same some time ago and tried to make per-beancounter kmem
>> caches. The result was awful - the memory waste was much larger than in
>> the case of the pointer-per-object approach.
> 
> OK, fair enough.
> 
> Was this with a large number of beancounters? I imagine that with a
> small number, the waste might be rather more reasonable.

Yup. I do not remember the exact number, but this model didn't scale
well enough with respect to the number of beancounters.

> Paul
>