Re: [PATCH 0/5] Kernel memory accounting container (v5) [message #21087 is a reply to message #20718] |
Mon, 01 October 2007 16:32 |
Paul Menage
Hi Pavel,
One question about the general design of this - have you tested an
approach where rather than tagging each object within the cache with
the cgroup that allocated it, you instead have (inside the cache code)
a separate cache structure for each cgroup? So the space overheads
would go from having a per-object overhead (one pointer per object?)
to having a "wastage" overhead (on average half a slab per cgroup).
And the time overhead would be the time required to look up the
relevant cache for a cgroup at the start of the allocation operation,
and the relevant cache for an object (from its struct page) at
deallocation, rather than the time required to update the per-object
housekeeping pointer.
Each cache would need to be assigned a unique ID, used as an index
into a per-cgroup lookup table of localized caches. (This could almost
be regarded as a form of kmem_cache namespace).
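For concreteness, a minimal sketch of what such a table might look
like - every identifier below is illustrative, not taken from any
existing patch:

/* Illustrative sketch only -- all identifiers are hypothetical. */
#define MAX_ACCOUNTED_CACHES 256

struct kmem_cgroup {
        /* one localized cache per global cache, created lazily,
         * indexed by the ID the global cache got at creation time */
        struct kmem_cache *local_cache[MAX_ACCOUNTED_CACHES];
};

static void *kmem_cgroup_alloc(struct kmem_cgroup *cg,
                               struct kmem_cache *s, gfp_t flags)
{
        struct kmem_cache *local = cg->local_cache[s->cache_id];

        if (!local)
                local = cg->local_cache[s->cache_id] =
                                create_localized_cache(cg, s);

        /* no per-object pointer: at free time the cgroup is found
         * from the slab page the object sits in */
        return local ? kmem_cache_alloc(local, flags) : NULL;
}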
It seems to me that this alternative approach would have a lower
memory overhead for people who have the kernel memory controller
compiled in but aren't using it, or are only using a few cgroups.
Paul
On 9/25/07, Pavel Emelyanov <xemul@openvz.org> wrote:
> Changes since v.4:
> * make SLAB_NOTIFY caches mark pages as SlabDebug. That
>   makes the interesting paths simpler (thanks to Christoph);
> * the change above required corresponding changes in the "turn
>   notifications on" path - all available pages must become
>   SlabDebug and the pages' freelists must be flushed;
> * added two more events - "on" and "off" - to make disabling
>   of kmalloc caches more graceful;
> * turning notifications "off" is marked as "TODO". Right now
>   it is hard to do without a massive rework of slub.c with
>   respect to full slab handling.
>
> Changes since v.3:
> * moved alloc/free notification into the slow path and made
>   "notify-able" caches always walk this path;
> * introduced an optimization for the case when there is
>   only one listener for SLUB events (saves more than 10%
>   in performance);
> * ported onto the 2.6.23-rc6-mm1 tree.
>
> Changes since v.2:
> * introduced generic notifiers for SLUB. Right now there
>   are only the events needed by accounting, but this set
>   can be extended in the future;
> * moved the controller core into a separate file, so that
>   extending it and/or porting it to SLAB will look more
>   logical;
> * fixed this message :).
>
> Changes since v.1:
> * addressed Paul's comment about subsystem registration;
> * return ERR_PTR from the ->create callback, not NULL;
> * made the container-to-object assignment in an RCU-safe section;
> * made accounting turn on and off with "1" and "0".
>
> ============================================================
>
> A long time ago we decided to start memory control with the
> user memory container. Now this container is in the -mm tree,
> and I think we can proceed with the kmem one.
>
> First of all - why do we need this kind of control? The major
> "pro" is that kernel memory control protects the system
> from DoS attacks by processes that live in a container. As our
> experience shows, many exploits simply do not work in a
> container with limited kernel memory.
>
> I can split the kernel memory container into 4 parts:
>
> 1. kmalloc-ed objects control
> 2. vmalloc-ed objects control
> 3. buddy allocated pages control
> 4. kmem_cache_alloc-ed objects control
>
> The control of the first three types of objects has one peculiarity:
> one needs to explicitly point out which allocations should be
> accounted, so this is not configurable and is still to be discussed.
>
> On the other hand, such objects as anon_vma-s, file-s, sighand-s,
> vfsmounts, etc. are always created by user request and should
> always be accounted. Fortunately, they are allocated from their
> own caches, and thus the whole kmem cache can be made accountable.
>
> This is exactly what this patchset does - it adds the ability
> to account for the total size of kmem-cache-allocated objects
> from specified kmem caches.
>
> This is based on the SLUB allocator, Paul's control groups, and
> the resource counters I made for the RSS controller, which are
> already in the -mm tree.
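A rough sketch of how such per-cache accounting could be wired up
(the alloc/free notify hook names are those discussed later in this
thread; the kmem_container structure, its resource counter field, and
the owner-tracking helpers are assumptions for illustration, not the
actual patch code):

/* Sketch only: an accountable cache charges the allocating task's
 * container for every object and uncharges it on free. */

static int slab_alloc_notify(struct kmem_cache *s, void *obj, gfp_t flags)
{
        struct kmem_container *cnt = task_kmem_container(current);

        /* charge the full object size; fail the allocation on limit hit */
        if (res_counter_charge(&cnt->res, s->size))
                return -ENOMEM;

        set_object_owner(s, obj, cnt);  /* the per-object pointer */
        return 0;
}

static void slub_free_notify(struct kmem_cache *s, void *obj)
{
        struct kmem_container *cnt = get_object_owner(s, obj);

        res_counter_uncharge(&cnt->res, s->size);
}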
>
> To play with it, one needs to mount the container file system
> with -o kmem and then mark some caches as accountable via
> /sys/slab/<cache_name>/cache_notify.
>
> As I have already said, kmalloc caches cannot be accounted easily,
> so turning the accounting on for them will fail with -EINVAL.
>
> Turning the accounting off is possible only if the cache has
> no objects. This is because turning accounting off implies
> marking all the slabs in the cache as non-debug, but since full
> slabs in SLUB are (usually) not kept on any list, this is
> currently impossible to do; it is on the TODO list.
>
> Thanks,
> Pavel
>
Re: [PATCH 5/5] Account for the slub objects [message #21178 is a reply to message #21159] |
Wed, 03 October 2007 07:29 |
Pavel Emelianov
Christoph Lameter wrote:
> On Tue, 2 Oct 2007, Pavel Emelyanov wrote:
>
>> Christoph Lameter wrote:
>>> On Mon, 1 Oct 2007, Pavel Emelyanov wrote:
>>>
>>>>>> +
>>>>> Quick check, slub_free_notify() and slab_alloc_notify() are called
>>>>> from serialized contexts, right?
>>>> Yup.
>>> How is it serialized?
>> They are both called from __slab_alloc()/__slab_free() from under
>> the slab_lock(page).
>
> This means they are serialized per slab. Which means you can guarantee
> that multiple of these callbacks are not done at the same time for the
> same object. Is that what you need?
Yes, I know that :) But I do not rely on this lock inside the callbacks.
What I need is to notify about each new object only once, and whether
some lock is held or not does not matter for the callbacks. In other
words, it is just a coincidence that these callbacks are called under
this lock - this was not done deliberately. Fortunately, the rollback in
case the callbacks return an error is easily done under this lock.
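For the rollback, the slow path only needs something along these lines
(a simplified sketch with hypothetical freelist helpers, not the real
__slab_alloc(), which does much more):

/* Simplified sketch: the notifier runs once per new object, under
 * slab_lock(page), so a failure can be unwound on the spot. */
static void *slow_path_alloc(struct kmem_cache *s, gfp_t gfpflags,
                             struct page *page)
{
        void *object;

        slab_lock(page);
        object = take_from_freelist(page);              /* hypothetical */
        if (object && slab_alloc_notify(s, object, gfpflags)) {
                /* rollback under the still-held lock */
                put_back_on_freelist(page, object);     /* hypothetical */
                object = NULL;
        }
        slab_unlock(page);
        return object;
}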
Thanks,
Pavel
Re: [PATCH 0/5] Kernel memory accounting container (v5) [message #21342 is a reply to message #21329] |
Fri, 05 October 2007 13:17 |
Pavel Emelianov
Paul Menage wrote:
> On 10/2/07, Pavel Emelyanov <xemul@openvz.org> wrote:
>> Such a lookup would require a hashtable or something similar. We have
>> already had bad experience with that (with OpenVZ RSS fractions
>> accounting, for example). Hash lookups screw up the CPU caches and
>> hurt performance. See also the comment below.
>
> I think you could do it with an array lookup if you assigned an index
> to each cache as it was created, and used that as an offset into a
> per-cgroup array.
>
>> I thought the same some time ago and tried to make per-beancounter
>> kmem caches. The result was awful - the memory waste was much larger
>> than with the pointer-per-object approach.
>
> OK, fair enough.
>
> Was this with a large number of bean counters? I imagine that with a
> small number, the waste might be rather more reasonable.
Yup. I do not remember the exact numbers, but this model did not scale
well enough with respect to the number of beancounters.
> Paul
>