OpenVZ Forum


Home » Mailing lists » Devel » [RFC][PATCH] UBC: user resource beancounters
Re: Re: [ckrm-tech] [RFC][PATCH 3/7] UBC: ub context and inheritance [message #5294 is a reply to message #5290] Thu, 17 August 2006 13:21 Go to previous messageGo to next message
xemul is currently offline  xemul
Messages: 248
Registered: November 2005
Senior Member
Srivatsa Vaddagiri wrote:
> On Wed, Aug 16, 2006 at 07:38:44PM +0400, Kirill Korotaev wrote:
>
>> Contains code responsible for setting UB on task,
>> it's inheriting and setting host context in interrupts.
>>
>> Task references three beancounters:
>> 1. exec_ub current context. all resources are
>> charged to this beancounter.
>> 2. task_ub beancounter to which task_struct is
>> charged itself.
>> 3. fork_sub beancounter which is inherited by
>> task's children on fork
>>
>
> Is there a case where exec_ub and fork_sub can differ? I dont see that
> in the patch.
>
Look in context changing in interrupts - "set_exec_ub(&ub0);" is done there.
Re: [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5295 is a reply to message #5237] Thu, 17 August 2006 13:25 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

> If I'm reading this patch right then seems like you are making page
> allocations to fail w/o (for example) trying to purge some pages from
> the page cache belonging to this container. Or is that reclaim going to
> come later?

charged kernel objects can't be _reclaimed_. how do you propose
to reclaim tasks page tables or files or task struct or vma or etc.?

Kirill
Re: [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5296 is a reply to message #5217] Thu, 17 August 2006 13:29 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Dave Hansen wrote:
> On Wed, 2006-08-16 at 19:40 +0400, Kirill Korotaev wrote:
>
>>--- ./include/linux/mm.h.kmemcore 2006-08-16 19:10:38.000000000
>>+0400
>>+++ ./include/linux/mm.h 2006-08-16 19:10:51.000000000 +0400
>>@@ -274,8 +274,14 @@ struct page {
>> unsigned int gfp_mask;
>> unsigned long trace[8];
>> #endif
>>+#ifdef CONFIG_USER_RESOURCE
>>+ union {
>>+ struct user_beancounter *page_ub;
>>+ } bc;
>>+#endif
>> };
>
>
> Is everybody OK with adding this accounting to the 'struct page'? Is
> there any kind of noticeable performance penalty for this? I thought
> that we had this aligned pretty well on cacheline boundaries.
When I discussed this with Hugh Dickins on summit we agreed
that +4 bytes on page struct for kernel using accounting
are ok and almost unavoidable.

it can be stored not on the struct page, but in this
case you need to introduce some kind of hash to lookup ub
quickly from page, which is slower for accounting-enabled kernels.

> How many things actually use this? Can we have the slab ubcs without
> the struct page pointer?
slab doesn't use this pointer on the page.
It is used for pages allocated by buddy
alocator implicitly (e.g. LDT pages, page tables, ...).

Kirill
Re: [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5297 is a reply to message #5239] Thu, 17 August 2006 13:33 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

> On Wed, 2006-08-16 at 11:47 -0700, Dave Hansen wrote:
>
>>On Wed, 2006-08-16 at 19:40 +0400, Kirill Korotaev wrote:
>>
>>>--- ./include/linux/mm.h.kmemcore 2006-08-16 19:10:38.000000000
>>>+0400
>>>+++ ./include/linux/mm.h 2006-08-16 19:10:51.000000000 +0400
>>>@@ -274,8 +274,14 @@ struct page {
>>> unsigned int gfp_mask;
>>> unsigned long trace[8];
>>> #endif
>>>+#ifdef CONFIG_USER_RESOURCE
>>>+ union {
>>>+ struct user_beancounter *page_ub;
>>>+ } bc;
>>>+#endif
>>> };
>>
>>Is everybody OK with adding this accounting to the 'struct page'?
>
>
> My preference would be to have container (I keep on saying container,
> but resource beancounter) pointer embeded in task, mm(not sure),
> address_space and anon_vma structures. This should allow us to track
> user land pages optimally. But for tracking kernel usage on behalf of
> user, we will have to use an additional field (unless we can re-use
> mapping). Please correct me if I'm wrong, though all the kernel
> resources will be allocated/freed in context of a user process. And at
> that time we know if a allocation should succeed or not. So we may
> actually not need to track kernel pages that closely. We are not going
> to run reclaim on any of them anyways.
objects are really allocated in process context
(except for TCP/IP and other softirqs which are done in arbitrary
process context!)
And objects are not always freed in correct context (!).

Note, page_ub is not for _user_ pages. user pages accounting will be added
in next patch set. page_ub is added to track kernel allocations.

Kirill
Re: [RFC][PATCH 7/7] UBC: proc interface [message #5299 is a reply to message #5232] Thu, 17 August 2006 13:41 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

> On Wed, Aug 16, 2006 at 07:44:30PM +0400, Kirill Korotaev wrote:
>
>>Add proc interface (/proc/user_beancounters) allowing to see current
>>state (usage/limits/fails for each UB). Implemented via seq files.
>
>
> Ugh, why /proc? This doesn't have anything to do with processes, just
> users, right? What's wrong with /sys/kernel/ instead?
We can move it, if there are much objections.
It is just here for more than 3 years (AFAIK starting from Alan's UBC)
and would be nice to have for compatibility (at least with existing OpenVZ).
But if it is required -- will do.

> Or /sys/kernel/debug/user_beancounters/ in debugfs as this is just a
> debugging thing, right?
debugfs is usually OFF imho. you don't export meminfo information in debugfs,
correct? user usages are the same imho...

Kirill
Re: [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5300 is a reply to message #5208] Thu, 17 August 2006 13:45 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

> The
>
> + ub->ub_parms[UB_KMEMSIZE].limit = 32 * 1024 * 1024
>
> seems a bit arbitary. 32Mb is variously vast amounts of memory and not
> enough to boot depending if you are booting a PDA or a 4096 core Itanic
> box
this limit is for newly created UBs, host system (ub0) is
_unlimited_ by default.
The idea was to limit the user by default to make system secure.
do you think it is good idea to have unlimited users created by default?
Anyway, after creating UB context normal behaviour would be to set
some limits.

Thanks,
Kirill
Re: [ckrm-tech] [RFC][PATCH] UBC: user resource beancounters [message #5301 is a reply to message #5289] Thu, 17 August 2006 13:53 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

> On Wed, Aug 16, 2006 at 07:24:03PM +0400, Kirill Korotaev wrote:
>
>>As the first step we want to propose for discussion
>>the most complicated parts of resource management:
>>kernel memory and virtual memory.
>
> Do you have any plans to post a CPU controller? Is that tied to UBC
> interface as well?

Not everything at once :) To tell the truth I think CPU controller
is even more complicated than user memory accounting/limiting.

No, fair CPU scheduler is not tied to UBC in any regard.
As we discussed before, it is valuable to have an ability to limit
different resources separately (CPU, disk I/O, memory, etc.).
For example, it can be possible to place some mission critical
kernel threads (like kjournald) in a separate contanier.

This patches are related to kernel memory and nothing more :)

Thanks,
Kirill
Re: [ckrm-tech] [RFC][PATCH 2/7] UBC: core (structures, API) [message #5302 is a reply to message #5291] Thu, 17 August 2006 14:00 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

> On Wed, Aug 16, 2006 at 07:37:26PM +0400, Kirill Korotaev wrote:
>
>>+struct user_beancounter
>>+{
>>+ atomic_t ub_refcount;
>>+ spinlock_t ub_lock;
>>+ uid_t ub_uid;
>>+ struct hlist_node hash;
>>+
>>+ struct user_beancounter *parent;
>
>
> This seems to hint at some heirarchy of ubc? How would that heirarchy be
> used? I cant find anything in the patch which forms this heirarchy
> (basically I dont see any place where beancounter_findcreate() is called
> with non-NULL 2nd arg).
yes, it is possible to use hierarchical beancounters.
kernel memory, user memory and TCP/IP buffers are accounted hierarchicaly.
user interface for this is not provided yet as it would complicate patchset
and increase number of topics for discussion :)

> [snip]
>
>
>>+static void init_beancounter_syslimits(struct user_beancounter *ub)
>>+{
>>+ int k;
>>+
>>+ for (k = 0; k < UB_RESOURCES; k++)
>>+ ub->ub_parms[k].barrier = ub->ub_parms[k].limit;
>
>
> This sets barrier to 0. Is this value of 0 interpreted differently by
> different controllers? One way to interpret it is "dont allocate any
> resource", other way to interpret it is "don't care - give me what you
> can" (which makes sense for stuff like CPU and network bandwidth).
every patch which adds a resource modifies this function and sets
some default limit. Check: [PATCH 5/7] UBC: kernel memory accounting (core)

Thanks,
Kirill
Re: [ckrm-tech] [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5303 is a reply to message #5288] Thu, 17 August 2006 14:03 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Srivatsa Vaddagiri wrote:
> On Wed, Aug 16, 2006 at 07:39:43PM +0400, Kirill Korotaev wrote:
>
>
>>+/*
>>+ * The setbeanlimit syscall
>>+ */
>>+asmlinkage long sys_setublimit(uid_t uid, unsigned long resource,
>>+ unsigned long *limits)
>>+{
>
>
> [snip]
>
>
>>+ spin_lock_irqsave(&ub->ub_lock, flags);
>>+ ub->ub_parms[resource].barrier = new_limits[0];
>>+ ub->ub_parms[resource].limit = new_limits[1];
>
>
> Would it be usefull to notify the "resource" controller about this
> change in limits? For ex: in case of the CPU controller I wrote
> (http://lkml.org/lkml/2006/8/4/9), I was finding it usefull to recv
> notification of changes to these limits, so that internal structures
> (which are kept per-task-group) can be updated.
I think this can be added when needed, no?
See no much reason to add notifications which are not used yet.

Please, keep in mind. This patch set can be extended in infinite number
of ways. But!!! It contains only the required minimal functionality.
When we are to add code requiring to know about limit changes or fails
or whatever we can always extend it accordingly.

Thanks,
Kirill
Re: [ckrm-tech] [RFC][PATCH 2/7] UBC: core (structures, API) [message #5305 is a reply to message #5275] Thu, 17 August 2006 14:32 Go to previous messageGo to next message
Dave Hansen is currently offline  Dave Hansen
Messages: 240
Registered: October 2005
Senior Member
On Thu, 2006-08-17 at 15:45 +0400, Kirill Korotaev wrote:
> We need more complex decrement/locking scheme than krefs
> provide. e.g. in __put_beancounter() we need
> atomic_dec_and_lock_irqsave() semantics for performance optimizations.

Is it possible to put the locking in the destructor? It seems like that
should give similar behavior.

-- Dave
Re: [ckrm-tech] [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5313 is a reply to message #5295] Thu, 17 August 2006 14:38 Go to previous messageGo to next message
Dave Hansen is currently offline  Dave Hansen
Messages: 240
Registered: October 2005
Senior Member
On Thu, 2006-08-17 at 17:27 +0400, Kirill Korotaev wrote:
> charged kernel objects can't be _reclaimed_. how do you propose
> to reclaim tasks page tables or files or task struct or vma or etc.?

Do you have any statistics on which of these objects are the most
troublesome? If it _is_ pagetables, for instance, it is quite
conceivable that we could reclaim them.

This one probably deserves a big, fat comment, though. ;)

-- Dave
Re: [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5315 is a reply to message #5296] Thu, 17 August 2006 14:36 Go to previous messageGo to next message
Dave Hansen is currently offline  Dave Hansen
Messages: 240
Registered: October 2005
Senior Member
On Thu, 2006-08-17 at 17:31 +0400, Kirill Korotaev wrote:
> > How many things actually use this? Can we have the slab ubcs
> without
> > the struct page pointer?
> slab doesn't use this pointer on the page.
> It is used for pages allocated by buddy
> alocator implicitly (e.g. LDT pages, page tables, ...).

Hmmm. There aren't _that_ many of those cases, right? Are there any
that absolutely need raw access to the buddy allocator? I'm pretty sure
that pagetables can be moved over to a slab, as long as we bump up the
alignment.

It does seem a wee bit silly to have the pointer in _all_ of the struct
pages, even the ones for which we will never do any accounting (and even
on kernels that never need it). But, a hashing scheme sounds like a
fine idea.

-- Dave
Re: [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5317 is a reply to message #5285] Thu, 17 August 2006 15:40 Go to previous messageGo to next message
Andrew Morton is currently offline  Andrew Morton
Messages: 127
Registered: December 2005
Senior Member
On Thu, 17 Aug 2006 16:13:30 +0400
Kirill Korotaev <dev@sw.ru> wrote:

> > I was more thinking about (for example) user land physical memory limit
> > for that bean counter. If the limits are going down, then the system
> > call should try to flush out page cache pages or swap out anonymous
> > memory. But you are right that it won't be possible in all cases, like
> > for in kernel memory limits.
> Such kind of memory management is less efficient than the one
> making decisions based on global shortages and global LRU alogrithm.

I also was quite surprised that openvz appears to have no way of
constraining a container's memory usage. "I want to run this bunch of
processes in a 4.5GB container".

> The problem here is that doing swap out takes more expensive disk I/O
> influencing other users.

A well-set-up container would presumably be working against its own
spindle(s). If the operator has gone to all the trouble of isolating a job
from the system's other jobs, he'd be pretty dumb to go and let all the
"isolated" jobs share a stinky-slow resource like a disk.

But yes, swap is a problem. To do this properly we'd need a way of saying
"this container here uses that swap device over there".
Re: [RFC][PATCH 7/7] UBC: proc interface [message #5318 is a reply to message #5299] Thu, 17 August 2006 15:40 Go to previous messageGo to next message
Greg KH is currently offline  Greg KH
Messages: 27
Registered: February 2006
Junior Member
On Thu, Aug 17, 2006 at 05:43:16PM +0400, Kirill Korotaev wrote:
> >On Wed, Aug 16, 2006 at 07:44:30PM +0400, Kirill Korotaev wrote:
> >
> >>Add proc interface (/proc/user_beancounters) allowing to see current
> >>state (usage/limits/fails for each UB). Implemented via seq files.
> >
> >
> >Ugh, why /proc? This doesn't have anything to do with processes, just
> >users, right? What's wrong with /sys/kernel/ instead?
> We can move it, if there are much objections.

I am objecting. /proc is for processes so do not add any new files
there that do not deal with processes.

> >Or /sys/kernel/debug/user_beancounters/ in debugfs as this is just a
> >debugging thing, right?
> debugfs is usually OFF imho.

No, distros enable it.

> you don't export meminfo information in debugfs, correct?

That is because the meminfo is tied to processes, or was added to proc
before debugfs came about.

Then how about just /sys/kernel/ instead and use sysfs? Just remember,
one value per file please.

thanks,

greg k-h
Re: Re: [RFC][PATCH 7/7] UBC: proc interface [message #5321 is a reply to message #5318] Thu, 17 August 2006 16:12 Go to previous messageGo to next message
kir is currently offline  kir
Messages: 1645
Registered: August 2005
Location: Moscow, Russia
Senior Member

Greg KH wrote:
> On Thu, Aug 17, 2006 at 05:43:16PM +0400, Kirill Korotaev wrote:
>
>>> On Wed, Aug 16, 2006 at 07:44:30PM +0400, Kirill Korotaev wrote:
>>>
>>>
>>>> Add proc interface (/proc/user_beancounters) allowing to see current
>>>> state (usage/limits/fails for each UB). Implemented via seq files.
>>>>
>>> Ugh, why /proc? This doesn't have anything to do with processes, just
>>> users, right? What's wrong with /sys/kernel/ instead?
>>>
>> We can move it, if there are much objections.
>>
>
> I am objecting. /proc is for processes so do not add any new files
> there that do not deal with processes.
>
>
>>> Or /sys/kernel/debug/user_beancounters/ in debugfs as this is just a
>>> debugging thing, right?
>>>
>> debugfs is usually OFF imho.
>>
>
> No, distros enable it.
>
>
>> you don't export meminfo information in debugfs, correct?
>>
>
> That is because the meminfo is tied to processes, or was added to proc
> before debugfs came about.
>
> Then how about just /sys/kernel/ instead and use sysfs? Just remember,
> one value per file please.
>
I see two problems with that. But let me first describe the current
/proc/user_beancounters. This is how it looks like from inside a container:

# cat /proc/user_beancounters
Version: 2.5
uid resource held maxheld barrier limit failcnt
123: kmemsize 836919 1005343 2752512 2936012 0
lockedpages 0 0 32 32 0
privvmpages 4587 7289 49152 53575 0
............(more lines like that).........................................


I.e. a container owner can take a glance over the current parameters,
their usage and (the thing that is really important) fail counters. Fail
counter increases each time a parameter hits the limit. This is very
straightforward way for container's owner to see if everything is OK or not.

So, the problems with /sys are:

(1) Gettng such info from 40+ files requires at least some script, while
now cat is just fine.

(2) Do we want to virtualize sysfs and enable /sys for every container?
Note that user_beancounters statistics is really needed for container's
owner to see. At the same time, container's owner should not be able to
modify it -- so we should end up with read/write ubc entries for the
host system and read-only ones for the container.

Taking into account those two issues, current /proc/user_beancounters
might be not that bad.
Re: [ckrm-tech] [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5322 is a reply to message #5303] Thu, 17 August 2006 16:19 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Thu, Aug 17, 2006 at 06:04:59PM +0400, Kirill Korotaev wrote:
> Please, keep in mind. This patch set can be extended in infinite number
> of ways. But!!! It contains only the required minimal functionality.

Sure ..But going by this it should mean that we don't see any code which
hints of heirarchy (->parent)? :)

> When we are to add code requiring to know about limit changes or fails
> or whatever we can always extend it accordingly.

--
Regards,
vatsa
Re: [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5325 is a reply to message #5241] Thu, 17 August 2006 16:36 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Thu, 2006-08-17 at 01:22 +0100, Alan Cox wrote:
> Ar Mer, 2006-08-16 am 12:15 -0700, ysgrifennodd Rohit Seth:
> > resources will be allocated/freed in context of a user process. And at
> > that time we know if a allocation should succeed or not. So we may
> > actually not need to track kernel pages that closely.
>
> Quite the reverse, tracking kernel pages is critical,

Having the knowledge of how many kernel pages are getting used by each
container is indeed very useful. But as long as the context in which
they are created and destroyed is identifiable, there is no need to
really physically tag each page with container id. And for the cases
where we have no context, it will be worth while to see if mapping field
could be used.


> user pages kind of
> balance out for many cases kernel ones don't.
>

It is useful to put limits on some group of applications. So it is
really not only about balancing out.

-rohit
Re: [RFC][PATCH 2/7] UBC: core (structures, API) [message #5328 is a reply to message #5279] Thu, 17 August 2006 16:55 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Thu, 2006-08-17 at 15:53 +0400, Kirill Korotaev wrote:
> Rohit Seth wrote:
> > On Wed, 2006-08-16 at 19:37 +0400, Kirill Korotaev wrote:
> >
> >>Core functionality and interfaces of UBC:
> >>find/create beancounter, initialization,
> >>charge/uncharge of resource, core objects' declarations.
> >>
> >>Basic structures:
> >> ubparm - resource description
> >> user_beancounter - set of resources, id, lock
> >>
> >>Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
> >>Signed-Off-By: Kirill Korotaev <dev@sw.ru>
> >>
> >>---
> >> include/ub/beancounter.h | 157 ++++++++++++++++++
> >> init/main.c | 4
> >> kernel/Makefile | 1
> >> kernel/ub/Makefile | 7
> >> kernel/ub/beancounter.c | 398 +++++++++++++++++++++++++++++++++++++++++++++++
> >> 5 files changed, 567 insertions(+)
> >>
> >>--- /dev/null 2006-07-18 14:52:43.075228448 +0400
> >>+++ ./include/ub/beancounter.h 2006-08-10 14:58:27.000000000 +0400
> >>@@ -0,0 +1,157 @@
> >>+/*
> >>+ * include/ub/beancounter.h
> >>+ *
> >>+ * Copyright (C) 2006 OpenVZ. SWsoft Inc
> >>+ *
> >>+ */
> >>+
> >>+#ifndef _LINUX_BEANCOUNTER_H
> >>+#define _LINUX_BEANCOUNTER_H
> >>+
> >>+/*
> >>+ * Resource list.
> >>+ */
> >>+
> >>+#define UB_RESOURCES 0
> >>+
> >>+struct ubparm {
> >>+ /*
> >>+ * A barrier over which resource allocations are failed gracefully.
> >>+ * e.g. if the amount of consumed memory is over the barrier further
> >>+ * sbrk() or mmap() calls fail, the existing processes are not killed.
> >>+ */
> >>+ unsigned long barrier;
> >>+ /* hard resource limit */
> >>+ unsigned long limit;
> >>+ /* consumed resources */
> >>+ unsigned long held;
> >>+ /* maximum amount of consumed resources through the last period */
> >>+ unsigned long maxheld;
> >>+ /* minimum amount of consumed resources through the last period */
> >>+ unsigned long minheld;
> >>+ /* count of failed charges */
> >>+ unsigned long failcnt;
> >>+};
> >
> >
> > What is the difference between barrier and limit. They both sound like
> > hard limits. No?
> check __charge_beancounter_locked and severity.
> It provides some kind of soft and hard limits.
>

Would be easier to just rename them as soft and hard limits...

> >>+
> >>+/*
> >>+ * Kernel internal part.
> >>+ */
> >>+
> >>+#ifdef __KERNEL__
> >>+
> >>+#include <linux/config.h>
> >>+#include <linux/spinlock.h>
> >>+#include <linux/list.h>
> >>+#include <asm/atomic.h>
> >>+
> >>+/*
> >>+ * UB_MAXVALUE is essentially LONG_MAX declared in a cross-compiling safe form.
> >>+ */
> >>+ /* resources statistics and settings */
> >>+ struct ubparm ub_parms[UB_RESOURCES];
> >>+};
> >>+
> >
> >
> > I presume UB_RESOURCES value is going to change as different resources
> > start getting tracked.
> what's wrong with it?
>

...just that user land will need to be some how informed about that.

> > I think something like configfs should be used for user interface. It
> > automatically presents the right interfaces to user land (based on
> > kernel implementation). And you wouldn't need any changes in glibc etc.
> 1. UBC doesn't require glibc modificatins.

You are right not for setting the limits. But for adding any new
functionality related to containers....as in you just added a new system
call to get the limits.

> 2. if you think a bit more about it, adding UB parameters doesn't
> require user space changes as well.
> 3. it is possible to add any kind of interface for UBC. but do you like the idea
> to grep 200(containers)x20(parameters) files for getting current usages?

How are you doing it currently and how much more efficient it is in
comparison to configfs?

> Do you like the idea to convert numbers to strings and back w/o
> thinking of data types?

IMO, setting up limits and containers (themselves) is not a common
operation. I wouldn't be too worried about loosing those few extra
cycles in setting them up.

-rohit
Re: [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5329 is a reply to message #5295] Thu, 17 August 2006 17:02 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Thu, 2006-08-17 at 17:27 +0400, Kirill Korotaev wrote:
> > If I'm reading this patch right then seems like you are making page
> > allocations to fail w/o (for example) trying to purge some pages from
> > the page cache belonging to this container. Or is that reclaim going to
> > come later?
>
> charged kernel objects can't be _reclaimed_. how do you propose
> to reclaim tasks page tables or files or task struct or vma or etc.?


I agree that kernel objects cann't be reclaimed easily. But what you
are proposing is also not right. Returning failure w/o doing any
reclaim on pages (that are reclaimable) is not useful. And this is why
I asked, is this change going to be part of next set of patches (as
current set of patches are only tracking kernel usage).

-rohit
Re: [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5330 is a reply to message #5284] Thu, 17 August 2006 17:05 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Thu, 2006-08-17 at 16:04 +0400, Kirill Korotaev wrote:
> >>Add the following system calls for UB management:
> >> 1. sys_getluid - get current UB id
> >> 2. sys_setluid - changes exec_ and fork_ UBs on current
> >> 3. sys_setublimit - set limits for resources consumtions
> >>
> >
> >
> > Why not have another system call for getting the current limits?
> will add sys_getublimit().
>
> > But as I said in previous mail, configfs seems like a better choice for
> > user interface. That way user has to go to one place to read/write
> > limits, see the current usage and other stats.
> Check another email about interfaces. I have arguments against it :/
>

...and I'm still not convinced that syscall is the right approach.

> > I think there should be a check here for seeing if the new limits are
> > lower than the current usage of a resource. If so then take appropriate
> > action.
> any idea what exact action to add here?
> Looks like can be added when needed, agree?
>

When you have the support of user memory, then operations like flush the
extra pages belonging to the container to disk seems reasonable.

-rohit
Re: [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5331 is a reply to message #5285] Thu, 17 August 2006 17:08 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Thu, 2006-08-17 at 16:13 +0400, Kirill Korotaev wrote:
> Rohit Seth wrote:
> > On Wed, 2006-08-16 at 20:04 +0100, Alan Cox wrote:
> >
> >>Ar Mer, 2006-08-16 am 11:17 -0700, ysgrifennodd Rohit Seth:
> >>
> >>>I think there should be a check here for seeing if the new limits are
> >>>lower than the current usage of a resource. If so then take appropriate
> >>>action.
> >>
> >>Generally speaking there isn't a sane appropriate action because the
> >>resources can't just be yanked.
> >>
> >
> >
> > I was more thinking about (for example) user land physical memory limit
> > for that bean counter. If the limits are going down, then the system
> > call should try to flush out page cache pages or swap out anonymous
> > memory. But you are right that it won't be possible in all cases, like
> > for in kernel memory limits.
> Such kind of memory management is less efficient than the one
> making decisions based on global shortages and global LRU alogrithm.
>
> The problem here is that doing swap out takes more expensive disk I/O
> influencing other users.
>
> So throttling algorithms if wanted should be optional, not mandatory.
> Lets postpone it and concentrate on the core.
>


I'm really interested in seeing what changes you make in alloc_page when
the container limits are hit.

When a container is throttling then yes it will have some additional
cost to other containers but that is the cost of sharing an underlying
platform.

-rohit
Re: [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5332 is a reply to message #5297] Thu, 17 August 2006 17:13 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Thu, 2006-08-17 at 17:35 +0400, Kirill Korotaev wrote:

> > My preference would be to have container (I keep on saying container,
> > but resource beancounter) pointer embeded in task, mm(not sure),
> > address_space and anon_vma structures. This should allow us to track
> > user land pages optimally. But for tracking kernel usage on behalf of
> > user, we will have to use an additional field (unless we can re-use
> > mapping). Please correct me if I'm wrong, though all the kernel
> > resources will be allocated/freed in context of a user process. And at
> > that time we know if a allocation should succeed or not. So we may
> > actually not need to track kernel pages that closely. We are not going
> > to run reclaim on any of them anyways.
> objects are really allocated in process context
> (except for TCP/IP and other softirqs which are done in arbitrary
> process context!)

Can these pages be tagged using mapping field of the page struct.

> And objects are not always freed in correct context (!).
>
You mean beyond Networking and softirq.

> Note, page_ub is not for _user_ pages. user pages accounting will be added
> in next patch set. page_ub is added to track kernel allocations.
>

But will the page_ub be used for some purpose for user land pages?

-rohit
Re: [RFC][PATCH 2/7] UBC: core (structures, API) [message #5347 is a reply to message #5279] Fri, 18 August 2006 05:31 Go to previous messageGo to next message
Andrew Morton is currently offline  Andrew Morton
Messages: 127
Registered: December 2005
Senior Member
On Thu, 17 Aug 2006 15:53:40 +0400
Kirill Korotaev <dev@sw.ru> wrote:

> >>+struct user_beancounter
> >>+{
> >>+ atomic_t ub_refcount;
> >>+ spinlock_t ub_lock;
> >>+ uid_t ub_uid;
> >
> >
> > Why uid? Will it be possible to club processes belonging to different
> > users to same bean counter.
> oh, its a misname. Should be ub_id. it is ID of user_beancounter
> and has nothing to do with user id.

But it uses a uid_t. That's more than a misnaming?
Re: [ckrm-tech] [RFC][PATCH 2/7] UBC: core (structures, API) [message #5348 is a reply to message #5302] Thu, 17 August 2006 18:59 Go to previous messageGo to next message
Chandra Seetharaman is currently offline  Chandra Seetharaman
Messages: 88
Registered: August 2006
Member
On Thu, 2006-08-17 at 18:02 +0400, Kirill Korotaev wrote:
<snip>
> >>+static void init_beancounter_syslimits(struct user_beancounter *ub)
> >>+{
> >>+ int k;
> >>+
> >>+ for (k = 0; k < UB_RESOURCES; k++)
> >>+ ub->ub_parms[k].barrier = ub->ub_parms[k].limit;
> >
> >
> > This sets barrier to 0. Is this value of 0 interpreted differently by
> > different controllers? One way to interpret it is "dont allocate any
> > resource", other way to interpret it is "don't care - give me what you
> > can" (which makes sense for stuff like CPU and network bandwidth).
> every patch which adds a resource modifies this function and sets
> some default limit. Check: [PATCH 5/7] UBC: kernel memory accounting (core)

The idea of upper layer code changing the lower layer's code doesn't
sound good. May be you can think of defining some interface to do it.

>
> Thanks,
> Kirill
>
>
> ------------------------------------------------------------ -------------
> Using Tomcat but need to do more? Need to support web services, security?
> Get stuff done quickly with pre-integrated technology to make your job easier
> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&b id=263057&dat=121642
> _______________________________________________
> ckrm-tech mailing list
> https://lists.sourceforge.net/lists/listinfo/ckrm-tech
--

------------------------------------------------------------ ----------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
------------------------------------------------------------ ----------
Re: [ckrm-tech] [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5349 is a reply to message #5199] Fri, 18 August 2006 02:31 Go to previous messageGo to next message
Matt Helsley is currently offline  Matt Helsley
Messages: 86
Registered: August 2006
Member
On Wed, 2006-08-16 at 19:39 +0400, Kirill Korotaev wrote:
> Add the following system calls for UB management:
> 1. sys_getluid - get current UB id
> 2. sys_setluid - changes exec_ and fork_ UBs on current
> 3. sys_setublimit - set limits for resources consumtions
>
> Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
> Signed-Off-By: Kirill Korotaev <dev@sw.ru>
>
> ---
> arch/i386/kernel/syscall_table.S | 3
> arch/ia64/kernel/entry.S | 3
> arch/sparc/kernel/systbls.S | 2
> arch/sparc64/kernel/systbls.S | 2
> include/asm-i386/unistd.h | 5 +
> include/asm-ia64/unistd.h | 5 +
> include/asm-powerpc/systbl.h | 3
> include/asm-powerpc/unistd.h | 5 +
> include/asm-sparc/unistd.h | 3
> include/asm-sparc64/unistd.h | 3
> include/asm-x86_64/unistd.h | 8 ++
> kernel/ub/Makefile | 1
> kernel/ub/sys.c | 126 +++++++++++++++++++++++++++++++++++++++
> 13 files changed, 163 insertions(+), 6 deletions(-)
>
> --- ./arch/i386/kernel/syscall_table.S.ubsys 2006-07-10 12:39:10.000000000 +0400
> +++ ./arch/i386/kernel/syscall_table.S 2006-07-31 14:36:59.000000000 +0400
> @@ -317,3 +317,6 @@ ENTRY(sys_call_table)
> .long sys_vmsplice
> .long sys_move_pages
> .long sys_getcpu
> + .long sys_getluid
> + .long sys_setluid
> + .long sys_setublimit /* 320 */
> --- ./arch/ia64/kernel/entry.S.ubsys 2006-07-10 12:39:10.000000000 +0400
> +++ ./arch/ia64/kernel/entry.S 2006-07-31 15:25:36.000000000 +0400
> @@ -1610,5 +1610,8 @@ sys_call_table:
> data8 sys_sync_file_range // 1300
> data8 sys_tee
> data8 sys_vmsplice
> + daat8 sys_getluid
> + data8 sys_setluid
> + data8 sys_setublimit // 1305
>
> .org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls
> --- ./arch/sparc/kernel/systbls.S.arsys 2006-07-10 12:39:10.000000000 +0400
> +++ ./arch/sparc/kernel/systbls.S 2006-08-10 17:07:15.000000000 +0400
> @@ -78,7 +78,7 @@ sys_call_table:
> /*285*/ .long sys_mkdirat, sys_mknodat, sys_fchownat, sys_futimesat, sys_fstatat64
> /*290*/ .long sys_unlinkat, sys_renameat, sys_linkat, sys_symlinkat, sys_readlinkat
> /*295*/ .long sys_fchmodat, sys_faccessat, sys_pselect6, sys_ppoll, sys_unshare
> -/*300*/ .long sys_set_robust_list, sys_get_robust_list
> +/*300*/ .long sys_set_robust_list, sys_get_robust_list, sys_getluid, sys_setluid, sys_setublimit
>
> #ifdef CONFIG_SUNOS_EMUL
> /* Now the SunOS syscall table. */
> --- ./arch/sparc64/kernel/systbls.S.arsys 2006-07-10 12:39:11.000000000 +0400
> +++ ./arch/sparc64/kernel/systbls.S 2006-08-10 17:08:52.000000000 +0400
> @@ -79,7 +79,7 @@ sys_call_table32:
> .word sys_mkdirat, sys_mknodat, sys_fchownat, compat_sys_futimesat, compat_sys_fstatat64
> /*290*/ .word sys_unlinkat, sys_renameat, sys_linkat, sys_symlinkat, sys_readlinkat
> .word sys_fchmodat, sys_faccessat, compat_sys_pselect6, compat_sys_ppoll, sys_unshare
> -/*300*/ .word compat_sys_set_robust_list, compat_sys_get_robust_list
> +/*300*/ .word compat_sys_set_robust_list, compat_sys_get_robust_list, sys_getluid, sys_setluid, sys_setublimit
>
> #endif /* CONFIG_COMPAT */
>
> --- ./include/asm-i386/unistd.h.ubsys 2006-07-10 12:39:19.000000000 +0400
> +++ ./include/asm-i386/unistd.h 2006-07-31 15:56:31.000000000 +0400
> @@ -323,10 +323,13 @@
> #define __NR_vmsplice 316
> #define __NR_move_pages 317
> #define __NR_getcpu 318
> +#define __NR_getluid 319
> +#define __NR_setluid 320
> +#define __NR_setublimit 321
>
> #ifdef __KERNEL__
>
> -#define NR_syscalls 318
> +#define NR_syscalls 322
> #include <linux/err.h>
>
> /*
> --- ./include/asm-ia64/unistd.h.ubsys 2006-07-10 12:39:19.000000000 +0400
> +++ ./include/asm-ia64/unistd.h 2006-07-31 15:57:23.000000000 +0400
> @@ -291,11 +291,14 @@
> #define __NR_sync_file_range 1300
> #define __NR_tee 1301
> #define __NR_vmsplice 1302
> +#define __NR_getluid 1303
> +#define __NR_setluid 1304
> +#define __NR_setublimit 1305
>
> #ifdef __KERNEL__
>
>
> -#define NR_syscalls 279 /* length of syscall table */
> +#define NR_syscalls 282 /* length of syscall table */
>
> #define __ARCH_WANT_SYS_RT_SIGACTION
>
> --- ./include/asm-powerpc/systbl.h.arsys 2006-07-10 12:39:19.000000000 +0400
> +++ ./include/asm-powerpc/systbl.h 2006-08-10 17:05:53.000000000 +0400
> @@ -304,3 +304,6 @@ SYSCALL_SPU(fchmodat)
> SYSCALL_SPU(faccessat)
> COMPAT_SYS_SPU(get_robust_list)
> COMPAT_SYS_SPU(set_robust_list)
> +SYSCALL(sys_getluid)
> +SYSCALL(sys_setluid)
> +SYSCALL(sys_setublimit)
> --- ./include/asm-powerpc/unistd.h.arsys 2006-07-10 12:39:19.000000000 +0400
> +++ ./include/asm-powerpc/unistd.h 2006-08-10 17:06:28.000000000 +0400
> @@ -323,10 +323,13 @@
> #define __NR_faccessat 298
> #define __NR_get_robust_list 299
> #define __NR_set_robust_list 300
> +#define __NR_getluid 301
> +#define __NR_setluid 302
> +#define __NR_setublimit 303
>
> #ifdef __KERNEL__
>
> -#define __NR_syscalls 301
> +#define __NR_syscalls 304
>
> #define __NR__exit __NR_exit
> #define NR_syscalls __NR_syscalls
> --- ./include/asm-sparc/unistd.h.arsys 2006-07-10 12:39:19.000000000 +0400
> +++ ./include/asm-sparc/unistd.h 2006-08-10 17:08:19.000000000 +0400
> @@ -318,6 +318,9 @@
> #define __NR_unshare 299
> #define __NR_set_robust_list 300
> #define __NR_get_robust_list 301
> +#define __NR_getluid 302
> +#define __NR_setluid 303
> +#define __NR_setublimit 304
>
> #ifdef __KERNEL__
> /* WARNING: You MAY NOT add syscall numbers larger than 301, since
> --- ./include/asm-sparc64/unistd.h.arsys 2006-07-10 12:39:19.000000000 +0400
> +++ ./include/asm-sparc64/unistd.h 2006-08-10 17:09:24.000000000 +0400
> @@ -320,6 +320,9 @@
> #define __NR_unshare 299
> #define __NR_set_robust_list 300
> #define __NR_get_robust_list 301
> +#define __NR_getluid 302
> +#define __NR_setluid 303
> +#define __NR_setublimit 304
>
> #ifdef __KERNEL__
> /* WARNING: You MAY NOT add syscall numbers larger than 301, since
> --- ./include/asm-x86_64/unistd.h.ubsys 2006-07-10 12:39:19.000000000 +0400
> +++ ./include/asm-x86_64/unistd.h 2006-07-31 16:00:01.000000000 +0400
> @@ -619,10 +619,16 @@ __SYSCALL(__NR_sync_file_range, sys_sync
> __SYSCALL(__NR_vmsplice, sys_vmsplice)
> #define __NR_move_pages 279
> __SYSCALL(__NR_move_pages, sys_move_pages)
> +#define __NR_getluid 280
> +__SYSCALL(__NR_getluid, sys_getluid)
> +#define __NR_setluid 281
> +__SYSCALL(__NR_setluid, sys_setluid)
> +#define __NR_setublimit 282
> +__SYSCALL(__NR_setublimit, sys_setublimit)
>
> #ifdef __KERNEL__
>
> -#define __NR_syscall_max __NR_move_pages
> +#define __NR_syscall_max __NR_setublimit
> #include <linux/err.h>
>
> #ifndef __NO_STUBS
> --- ./kernel/ub/Makefile.ubsys 2006-07-28 14:08:37.000000000 +0400
> +++ ./kernel/ub/Makefile 2006-08-01 11:08:39.000000000 +0400
> @@ -6,3 +6,4 @@
>
> obj-$(CONFIG_USER_RESOURCE) += beancounter.o
> obj-$(CONFIG_USER_RESOURCE) += misc.o
> +obj-y += sys.o
> --- ./kernel/ub/sys.c.ubsys 2006-07-28 18:52:18.000000000 +0400
> +++ ./kernel/ub/sys.c 2006-08-03 16:14:23.000000000 +0400
> @@ -0,0 +1,126 @@
> +/*
> + * kernel/ub/sys.c
> + *
> + * Copyright (C) 2006 OpenVZ. SWsoft Inc
> + *
> + */
> +
> +#include <linux/config.h>
> +#include <linux/sched.h>
> +#include <asm/uaccess.h>
> +
> +#include <ub/beancounter.h>
> +#include <ub/task.h>
> +
> +#ifndef CONFIG_USER_RESOURCE

Get rid of the #ifdef since this file should only be compiled if
CONFIG_USER_RESOURCE=y anyway.

> +asmlinkage long sys_getluid(void)
> +{
> + return -ENOSYS;
> +}
> +
> +asmlinkage long sys_setluid(uid_t uid)
> +{
> + return -ENOSYS;
> +}
> +
> +asmlinkage long sys_setublimit(uid_t uid, unsigned long resource,
> + unsigned long *limits)
> +{
> + return -ENOSYS;
> +}

Looks to me like you want to add:

cond_syscall(sys_getluid);
...

in kernel/sys_ni.c and then you won't have to worry about making these
empty functions.

> +#else /* CONFIG_USER_RESOURCE */
> +
> +/*
> + * The (rather boring) getluid syscall
> + */
> +asmlinkage long sys_getluid(void)
> +{
> + struct user_beancounter *ub;
> +
> + ub = get_exec_ub();
> + if (ub == NULL)
> + return -EINVAL;
> +
> + return ub->ub_uid;
> +}
> +
> +/*
> + * The setluid syscall
> + */
> +asmlinkage long sys_setluid(uid_t uid)
> +{
> + int error;
> + struct user_beancounter *ub;
> + struct task_beancounter *task_bc;
> +
> + task_bc = &current->task_bc;
> +
> + /* You may not disown a setluid */
> + error = -EINVAL;
> + if (uid == (uid_t)-1)
> + goto out;
> +
> + /* You may only set an ub as root */
> + error = -EPERM;
> + if (!capable(CAP_SETUID))
> + goto out;

With resource groups you don't necessarily have to be root -- just the
owner of the group and task.

Filesystems and appropriate share representations offer a way to give
regular users the ability to manage their resources without requiring
CAP_FOO.

> + /* Ok - set up a beancounter entry for this user */
> + error = -ENOBUFS;
> + ub = beancounter_findcreate(uid, NULL, UB_ALLOC);
> + if (ub == NULL)
> + goto out;
> +
> + /* install bc */
> + put_beancounter(task_bc->exec_ub);
> +
...

Re: [ckrm-tech] [RFC][PATCH 3/7] UBC: ub context and inheritance [message #5350 is a reply to message #5198] Fri, 18 August 2006 02:42 Go to previous messageGo to next message
Matt Helsley is currently offline  Matt Helsley
Messages: 86
Registered: August 2006
Member
On Wed, 2006-08-16 at 19:38 +0400, Kirill Korotaev wrote:
> Contains code responsible for setting UB on task,
> it's inheriting and setting host context in interrupts.
>
> Task references three beancounters:
> 1. exec_ub current context. all resources are
> charged to this beancounter.

nit: 2-3 below seem to contradict "all". If you mean "the rest" then
perhaps you ought to reorder these:

1. task_ub ...
2. fork_sub ...
3. exec_ub Current context. Resources not charged to task_ub
or fork_sub are charged to this beancounter.

> 2. task_ub beancounter to which task_struct is
> charged itself.

Is task_ub frequently the parent beancounter of exec_ub? If it's always
the parent then perhaps the one or more of these _ub fields in the task
struct are not necessary. Also in that case keeping copies of the
"parent" user_beancounter pointers in the task_beancounters would seem
bug-prone -- if the hierarchy of beancounters changes then these would
need to be changed too.

> 3. fork_sub beancounter which is inherited by
> task's children on fork

Is this frequently the same as exec_ub?

> Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
> Signed-Off-By: Kirill Korotaev <dev@sw.ru>
>
> ---
> include/linux/sched.h | 5 +++++
> include/ub/task.h | 42 ++++++++++++++++++++++++++++++++++++++++++
> kernel/fork.c | 21 ++++++++++++++++-----
> kernel/irq/handle.c | 9 +++++++++
> kernel/softirq.c | 8 ++++++++
> kernel/ub/Makefile | 1 +
> kernel/ub/beancounter.c | 4 ++++
> kernel/ub/misc.c | 34 ++++++++++++++++++++++++++++++++++
> 8 files changed, 119 insertions(+), 5 deletions(-)
>
> --- ./include/linux/sched.h.ubfork 2006-07-17 17:01:12.000000000 +0400
> +++ ./include/linux/sched.h 2006-07-31 16:01:54.000000000 +0400
> @@ -81,6 +81,8 @@ struct sched_param {
> #include <linux/timer.h>
> #include <linux/hrtimer.h>
>
> +#include <ub/task.h>
> +
> #include <asm/processor.h>
>
> struct exec_domain;
> @@ -997,6 +999,9 @@ struct task_struct {
> spinlock_t delays_lock;
> struct task_delay_info *delays;
> #endif
> +#ifdef CONFIG_USER_RESOURCE
> + struct task_beancounter task_bc;
> +#endif
> };
>
> static inline pid_t process_group(struct task_struct *tsk)
> --- ./include/ub/task.h.ubfork 2006-07-28 18:53:52.000000000 +0400
> +++ ./include/ub/task.h 2006-08-01 15:26:08.000000000 +0400
> @@ -0,0 +1,42 @@
> +/*
> + * include/ub/task.h
> + *
> + * Copyright (C) 2006 OpenVZ. SWsoft Inc
> + *
> + */
> +
> +#ifndef __UB_TASK_H_
> +#define __UB_TASK_H_
> +
> +#include <linux/config.h>
> +
> +struct user_beancounter;
> +
> +struct task_beancounter {
> + struct user_beancounter *exec_ub;
> + struct user_beancounter *task_ub;
> + struct user_beancounter *fork_sub;
> +};
> +
> +#ifdef CONFIG_USER_RESOURCE
> +#define get_exec_ub() (current->task_bc.exec_ub)
> +#define set_exec_ub(newub) \
> + ({ \
> + struct user_beancounter *old; \
> + struct task_beancounter *tbc; \
> + tbc = &current->task_bc; \
> + old = tbc->exec_ub; \
> + tbc->exec_ub = newub; \
> + old; \
> + })
> +

How about making these static inlines?

> +int ub_task_charge(struct task_struct *parent, struct task_struct *new);
> +void ub_task_uncharge(struct task_struct *tsk);
> +
> +#else /* CONFIG_USER_RESOURCE */
> +#define get_exec_ub() (NULL)
> +#define set_exec_ub(__ub) (NULL)
> +#define ub_task_charge(p, t) (0)
> +#define ub_task_uncharge(t) do { } while (0)
> +#endif /* CONFIG_USER_RESOURCE */
> +#endif /* __UB_TASK_H_ */
> --- ./kernel/irq/handle.c.ubirq 2006-07-10 12:39:20.000000000 +0400
> +++ ./kernel/irq/handle.c 2006-08-01 12:39:34.000000000 +0400
> @@ -16,6 +16,9 @@
> #include <linux/interrupt.h>
> #include <linux/kernel_stat.h>
>
> +#include <ub/beancounter.h>
> +#include <ub/task.h>
> +
> #include "internals.h"
>
> /**
> @@ -166,6 +169,9 @@ fastcall unsigned int __do_IRQ(unsigned
> struct irq_desc *desc = irq_desc + irq;
> struct irqaction *action;
> unsigned int status;
> + struct user_beancounter *ub;
> +
> + ub = set_exec_ub(&ub0);

Perhaps a comment: "/* Don't charge resources gained in interrupts to current */

> kstat_this_cpu.irqs[irq]++;
> if (CHECK_IRQ_PER_CPU(desc->status)) {
> @@ -178,6 +184,8 @@ fastcall unsigned int __do_IRQ(unsigned
> desc->chip->ack(irq);
> action_ret = handle_IRQ_event(irq, regs, desc->action);
> desc->chip->end(irq);
> +
> + (void) set_exec_ub(ub);
> return 1;
> }
>
> @@ -246,6 +254,7 @@ out:
> desc->chip->end(irq);
> spin_unlock(&desc->lock);
>
> + (void) set_exec_ub(ub);


Seems like a WARN_ON() would be appropriate rather than ignoring the
return code.

> return 1;
> }
>
> --- ./kernel/softirq.c.ubirq 2006-07-17 17:01:12.000000000 +0400
> +++ ./kernel/softirq.c 2006-08-01 12:40:44.000000000 +0400
> @@ -18,6 +18,9 @@
> #include <linux/rcupdate.h>
> #include <linux/smp.h>
>
> +#include <ub/beancounter.h>
> +#include <ub/task.h>
> +
> #include <asm/irq.h>
> /*
> - No shared variables, all the data are CPU local.
> @@ -191,6 +194,9 @@ asmlinkage void __do_softirq(void)
> __u32 pending;
> int max_restart = MAX_SOFTIRQ_RESTART;
> int cpu;
> + struct user_beancounter *ub;
> +
> + ub = set_exec_ub(&ub0);

Perhaps add the same comment...

> pending = local_softirq_pending();
> account_system_vtime(current);
> @@ -229,6 +235,8 @@ restart:
>
> account_system_vtime(current);
> _local_bh_enable();
> +
> + (void) set_exec_ub(ub);

.. and the same WARN_ON.

> }
>
> #ifndef __ARCH_HAS_DO_SOFTIRQ
> --- ./kernel/fork.c.ubfork 2006-07-17 17:01:12.000000000 +0400
> +++ ./kernel/fork.c 2006-08-01 12:58:36.000000000 +0400
> @@ -46,6 +46,8 @@
> #include <linux/delayacct.h>
> #include <linux/taskstats_kern.h>
>
> +#include <ub/task.h>
> +
> #include <asm/pgtable.h>
> #include <asm/pgalloc.h>
> #include <asm/uaccess.h>
> @@ -102,6 +104,7 @@ static kmem_cache_t *mm_cachep;
>
> void free_task(struct task_struct *tsk)
> {
> + ub_task_uncharge(tsk);
> free_thread_info(tsk->thread_info);
> rt_mutex_debug_task_free(tsk);
> free_task_struct(tsk);
> @@ -162,18 +165,19 @@ static struct task_struct *dup_task_stru
>
> tsk = alloc_task_struct();
> if (!tsk)
> - return NULL;
> + goto out;
>
> ti = alloc_thread_info(tsk);
> - if (!ti) {
> - free_task_struct(tsk);
> - return NULL;
> - }
> + if (!ti)
> + goto out_tsk;
>
> *tsk = *orig;
> tsk->thread_info = ti;
> setup_thread_stack(tsk, orig);
>
> + if (ub_task_charge(orig, tsk))
> + goto out_ti;
> +
> /* One for us, one for whoever does the "release_task()" (usually parent) */
> atomic_set(&tsk->usage,2);
> atomic_set(&tsk->fs_excl, 0);
> @@ -180,6 +184,13 @@ static struct task_struct *dup_task_stru
> #endif
> tsk->splice_pipe = NULL;
> return tsk;
> +
> +out_ti:
> + free_thread_info(ti);
> +out_tsk:
> + free_task_struct(tsk);
> +out:
> + return NULL;

Ugh. This is starting to look like copy_process(). Any reason you
couldn't move the bean counter bits to copy_process() instead?

> }
>
> #ifdef CONFIG_MMU
> --- ./kernel/ub/Makefile.ubcore 2006-08-03 16:24:56.000000000 +0400
> +++ ./kernel/ub/Makefile 2006-08-01 11:08:39.000000000 +0400
> @@ -5,3 +5,4 @@
> #
>
> obj-$(CONFIG_USER_RESOURCE) += beancounter.o
> +obj-$(CONFIG_USER_RESOURCE) += misc.o
> --- ./kernel/ub/beancounter.c.ubcore 2006-07-28 13:07:44.000000000 +0400
> +++ ./kernel/ub/beancounter.c 2006-08-03 16:14:17.000000000 +0400
> @@ -395,6 +395,10 @@
> spin_lock_init(&ub_hash_lock);
> slot = &ub_hash[ub_hash_fun(ub->ub_uid)];
> hlist_add_head(&ub->hash, slot);
> +
> + current->task_bc.exec_ub = ub;
> + current->task_bc.task_ub = get_beancounter(ub);
> + current->task_bc.fork_sub = get_beancounter(ub);
> }
>
> void __init ub_init_late(void)
> --- ./kernel/ub/misc.c.ubfork 2006-07-31 16:23:44.000000000 +0400
> +++ ./kernel/ub/misc.c 2006-07-31 16:28:47.000000000 +0400
> @@ -0,0 +1,34 @@
> +/*
> + * kernel/ub/misc.c
> + *
> + * Copyright (C) 2006 OpenVZ. SWsoft Inc.
> + *
> + */
> +
> +#include <linux/sched.h>
> +
> +#include <ub/beancounter.h>
> +#include <ub/task.h>
> +
> +int ub_task_charge(struct task_struct *parent, struct task_struct *new)
> +{

parent could be derived from new if you move the charge to copy_process
instead of dup_task_struct.

> + struct task_beancounter *old_bc;
> + struct task_beancounter *new_bc;
> + struct user_beancounter *ub;
> +
> + old_bc = &parent->task_bc;
> + new_bc = &new->task_bc;
> +
> + ub = old_bc->fork_sub;
> + new_bc->exec_ub = get_beancounter(ub);
> + new_bc->task_ub = get_beancounter(ub);
> + new_bc->fork_sub = get_beancounter(ub);
> + return 0;
> +}
> +
> +void ub_task_uncharge(struct task_struct *tsk)
> +{
> + put_beancounter(tsk->task_bc.exec_ub);
> + put_beancounter(t
...

Re: [PATCH 2/7] UBC: core (structures, API) [message #5352 is a reply to message #5347] Fri, 18 August 2006 07:35 Go to previous messageGo to next message
Andrey Savochkin is currently offline  Andrey Savochkin
Messages: 47
Registered: December 2005
Member
On Thu, Aug 17, 2006 at 10:31:37PM -0700, Andrew Morton wrote:
> On Thu, 17 Aug 2006 15:53:40 +0400
> Kirill Korotaev <dev@sw.ru> wrote:
>
> > >>+struct user_beancounter
> > >>+{
> > >>+ atomic_t ub_refcount;
> > >>+ spinlock_t ub_lock;
> > >>+ uid_t ub_uid;
> > >
> > >
> > > Why uid? Will it be possible to club processes belonging to different
> > > users to same bean counter.
> > oh, its a misname. Should be ub_id. it is ID of user_beancounter
> > and has nothing to do with user id.
>
> But it uses a uid_t. That's more than a misnaming?

It used to be uid-related in ancient times when the notion of container
hadn't formed up.
"user" part of user_beancounter name has the same origin :)

Now ub_id_t or something like that would be the most logical type.

Andrey
Re: [PATCH 4/7] UBC: syscalls (user interface) [message #5357 is a reply to message #5317] Fri, 18 August 2006 08:08 Go to previous messageGo to next message
Andrey Savochkin is currently offline  Andrey Savochkin
Messages: 47
Registered: December 2005
Member
Hi,

On Thu, Aug 17, 2006 at 08:40:33AM -0700, Andrew Morton wrote:
> On Thu, 17 Aug 2006 16:13:30 +0400
> Kirill Korotaev <dev@sw.ru> wrote:
>
> > > I was more thinking about (for example) user land physical memory limit
> > > for that bean counter. If the limits are going down, then the system
> > > call should try to flush out page cache pages or swap out anonymous
> > > memory. But you are right that it won't be possible in all cases, like
> > > for in kernel memory limits.
> > Such kind of memory management is less efficient than the one
> > making decisions based on global shortages and global LRU alogrithm.
>
> I also was quite surprised that openvz appears to have no way of
> constraining a container's memory usage. "I want to run this bunch of
> processes in a 4.5GB container".

I'd like to share my view on the subject of memory usage limiting.

The task of limiting a container to 4.5GB of memory bottles down to the
question: what to do when the container starts to use more than assigned
4.5GB of memory?

At this moment there are only 3 viable alternatives.

A) Have separate memory management for each container,
with separate buddy allocator, lru lists, page replacement mechanism.
That implies a considerable overhead, and the main challenge there
is sharing of pages between these separate memory managers.

B) Return errors on extension of mappings, but not on page faults, where
memory is actually consumed.
In this case it makes sense to take into account not only the size of used
memory, but the size of created mappings as well.
This is approximately what "privvmpages" accounting/limiting provides in
UBC.

C) Rely on OOM killer.
This is a fall-back method in UBC, for the case "privvmpages" limits
still leave the possibility to overload the system.

It would be nice, indeed, to invent something new.
The ideal mechanism would
- slow down the container over-using memory, to signal the user that
he is over his limits,
- at the same time this slowdown shouldn't lead to the increase of memory
usage: for example, a simple slowdown of apache web server would lead
to the growth of the number of serving children and consumption of more
memory while showing worse performance,
- and, at the same time, it shouldn't penalize the rest of the system from
the performance point of view...
May be this can be achieved via carefully tuned swapout mechanism together
with disk bandwidth management capable of tracking asynchronous write
requests, may be something else is required.
It's really a big challenge.

Meanwhile, I guess we can only make small steps in improving Linux resource
management features for this moment.

Best regards

Andrey
Re: [ckrm-tech] [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5359 is a reply to message #5315] Fri, 18 August 2006 08:12 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Dave Hansen wrote:
> On Thu, 2006-08-17 at 17:31 +0400, Kirill Korotaev wrote:
>
>>>How many things actually use this? Can we have the slab ubcs
>>
>>without
>>
>>>the struct page pointer?
>>
>>slab doesn't use this pointer on the page.
>>It is used for pages allocated by buddy
>>alocator implicitly (e.g. LDT pages, page tables, ...).
>
>
> Hmmm. There aren't _that_ many of those cases, right? Are there any
> that absolutely need raw access to the buddy allocator? I'm pretty sure
> that pagetables can be moved over to a slab, as long as we bump up the
> alignment.
LDT takes from 1 to 16 pages. and is allocated by vmalloc.
do you propose to replace it with slab which can fail due to memory
fragmentation?

the same applies to fdset, fdarray, ipc ids and iptables entries.

> It does seem a wee bit silly to have the pointer in _all_ of the struct
> pages, even the ones for which we will never do any accounting (and even
> on kernels that never need it). But, a hashing scheme sounds like a
> fine idea.
It seems a silly for you since 2nd patchset accounting user pages
is not here yet. As you can see we added a union into page,
which is shared between kernel memory and user memory accounting.

THERE IS NOT USER ACCOUNTING HERE YET GUYS! :) THIS FIELD WILL BE USED!!!

Thanks,
Kirill
Re: [ckrm-tech] [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5364 is a reply to message #5325] Fri, 18 August 2006 08:43 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Rohit Seth wrote:
> On Thu, 2006-08-17 at 01:22 +0100, Alan Cox wrote:
>
>>Ar Mer, 2006-08-16 am 12:15 -0700, ysgrifennodd Rohit Seth:
>>
>>>resources will be allocated/freed in context of a user process. And at
>>>that time we know if a allocation should succeed or not. So we may
>>>actually not need to track kernel pages that closely.
>>
>>Quite the reverse, tracking kernel pages is critical,
>
>
> Having the knowledge of how many kernel pages are getting used by each
> container is indeed very useful. But as long as the context in which
> they are created and destroyed is identifiable, there is no need to
> really physically tag each page with container id. And for the cases
> where we have no context, it will be worth while to see if mapping field
> could be used.
as I described in another email this field is also reused for tracking
user pages.

Kirill
Re: [ckrm-tech] [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5365 is a reply to message #5332] Fri, 18 August 2006 08:47 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Rohit Seth wrote:
> On Thu, 2006-08-17 at 17:35 +0400, Kirill Korotaev wrote:
>
>
>>>My preference would be to have container (I keep on saying container,
>>>but resource beancounter) pointer embeded in task, mm(not sure),
>>>address_space and anon_vma structures. This should allow us to track
>>>user land pages optimally. But for tracking kernel usage on behalf of
>>>user, we will have to use an additional field (unless we can re-use
>>>mapping). Please correct me if I'm wrong, though all the kernel
>>>resources will be allocated/freed in context of a user process. And at
>>>that time we know if a allocation should succeed or not. So we may
>>>actually not need to track kernel pages that closely. We are not going
>>>to run reclaim on any of them anyways.
>>
>>objects are really allocated in process context
>>(except for TCP/IP and other softirqs which are done in arbitrary
>>process context!)
>
>
> Can these pages be tagged using mapping field of the page struct.
kernel pages can be taged with mapping field.
User pages - not. So we introduce 2 pointers in the unoin:
union {
page_ub // for kernel pages
page_pb // for user pages
}

>
>
>>And objects are not always freed in correct context (!).
>>
>
> You mean beyond Networking and softirq.
>
>
>>Note, page_ub is not for _user_ pages. user pages accounting will be added
>>in next patch set. page_ub is added to track kernel allocations.
>>
>
>
> But will the page_ub be used for some purpose for user land pages?
yes. see above.

Kirill
Re: [ckrm-tech] [PATCH 2/7] UBC: core (structures, API) [message #5370 is a reply to message #5352] Fri, 18 August 2006 08:26 Go to previous messageGo to next message
Matt Helsley is currently offline  Matt Helsley
Messages: 86
Registered: August 2006
Member
On Fri, 2006-08-18 at 11:35 +0400, Andrey Savochkin wrote:
> On Thu, Aug 17, 2006 at 10:31:37PM -0700, Andrew Morton wrote:
> > On Thu, 17 Aug 2006 15:53:40 +0400
> > Kirill Korotaev <dev@sw.ru> wrote:
> >
> > > >>+struct user_beancounter
> > > >>+{
> > > >>+ atomic_t ub_refcount;
> > > >>+ spinlock_t ub_lock;
> > > >>+ uid_t ub_uid;
> > > >
> > > >
> > > > Why uid? Will it be possible to club processes belonging to different
> > > > users to same bean counter.
> > > oh, its a misname. Should be ub_id. it is ID of user_beancounter
> > > and has nothing to do with user id.
> >
> > But it uses a uid_t. That's more than a misnaming?
>
> It used to be uid-related in ancient times when the notion of container
> hadn't formed up.
> "user" part of user_beancounter name has the same origin :)

Is it similarly irrelevant now? If so perhaps a big rename could be used
to make the names clearer (s/user_//, s/ub_/bc_/, ...).

<snip>

Cheers,
-Matt Helsley
Re: [ckrm-tech] [RFC][PATCH 3/7] UBC: ub context and inheritance [message #5371 is a reply to message #5350] Fri, 18 August 2006 09:21 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Matt Helsley wrote:
> On Wed, 2006-08-16 at 19:38 +0400, Kirill Korotaev wrote:
>
>>Contains code responsible for setting UB on task,
>>it's inheriting and setting host context in interrupts.
>>
>>Task references three beancounters:
>> 1. exec_ub current context. all resources are
>> charged to this beancounter.
>
>
> nit: 2-3 below seem to contradict "all". If you mean "the rest" then
> perhaps you ought to reorder these:
>
> 1. task_ub ...
> 2. fork_sub ...
> 3. exec_ub Current context. Resources not charged to task_ub
> or fork_sub are charged to this beancounter.
not sure what you mean.
task_ub - where _task_ _itself_ is charged as an object.
following patches will add charging of "number of tasks" using it.
fork_sub - beancounter which is inherited on fork() (chaning task beancounter).
exec_ub - is current context.


>> 2. task_ub beancounter to which task_struct is
>> charged itself.
>
>
> Is task_ub frequently the parent beancounter of exec_ub? If it's always
> the parent then perhaps the one or more of these _ub fields in the task
> struct are not necessary.
no, task_ub != exec_ub of parent task
when task is created anything can happen: task can change ub, parent can change ub,
task can be reparented. But the UB we charged task to should be known.

> Also in that case keeping copies of the
> "parent" user_beancounter pointers in the task_beancounters would seem
> bug-prone -- if the hierarchy of beancounters changes then these would
> need to be changed too.
>
>
>> 3. fork_sub beancounter which is inherited by
>> task's children on fork
>
>
> Is this frequently the same as exec_ub?
frequently, but not always. exec_ub is changed in softirq for example.
consider exec_ub as 'current' pointer in kernel.

see other comments below

>>Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
>>Signed-Off-By: Kirill Korotaev <dev@sw.ru>
>>
>>---
>> include/linux/sched.h | 5 +++++
>> include/ub/task.h | 42 ++++++++++++++++++++++++++++++++++++++++++
>> kernel/fork.c | 21 ++++++++++++++++-----
>> kernel/irq/handle.c | 9 +++++++++
>> kernel/softirq.c | 8 ++++++++
>> kernel/ub/Makefile | 1 +
>> kernel/ub/beancounter.c | 4 ++++
>> kernel/ub/misc.c | 34 ++++++++++++++++++++++++++++++++++
>> 8 files changed, 119 insertions(+), 5 deletions(-)
>>
>>--- ./include/linux/sched.h.ubfork 2006-07-17 17:01:12.000000000 +0400
>>+++ ./include/linux/sched.h 2006-07-31 16:01:54.000000000 +0400
>>@@ -81,6 +81,8 @@ struct sched_param {
>> #include <linux/timer.h>
>> #include <linux/hrtimer.h>
>>
>>+#include <ub/task.h>
>>+
>> #include <asm/processor.h>
>>
>> struct exec_domain;
>>@@ -997,6 +999,9 @@ struct task_struct {
>> spinlock_t delays_lock;
>> struct task_delay_info *delays;
>> #endif
>>+#ifdef CONFIG_USER_RESOURCE
>>+ struct task_beancounter task_bc;
>>+#endif
>> };
>>
>> static inline pid_t process_group(struct task_struct *tsk)
>>--- ./include/ub/task.h.ubfork 2006-07-28 18:53:52.000000000 +0400
>>+++ ./include/ub/task.h 2006-08-01 15:26:08.000000000 +0400
>>@@ -0,0 +1,42 @@
>>+/*
>>+ * include/ub/task.h
>>+ *
>>+ * Copyright (C) 2006 OpenVZ. SWsoft Inc
>>+ *
>>+ */
>>+
>>+#ifndef __UB_TASK_H_
>>+#define __UB_TASK_H_
>>+
>>+#include <linux/config.h>
>>+
>>+struct user_beancounter;
>>+
>>+struct task_beancounter {
>>+ struct user_beancounter *exec_ub;
>>+ struct user_beancounter *task_ub;
>>+ struct user_beancounter *fork_sub;
>>+};
>>+
>>+#ifdef CONFIG_USER_RESOURCE
>>+#define get_exec_ub() (current->task_bc.exec_ub)
>>+#define set_exec_ub(newub) \
>>+ ({ \
>>+ struct user_beancounter *old; \
>>+ struct task_beancounter *tbc; \
>>+ tbc = &current->task_bc; \
>>+ old = tbc->exec_ub; \
>>+ tbc->exec_ub = newub; \
>>+ old; \
>>+ })
>>+
>
>
> How about making these static inlines?
possible, but this requires including sched.h, which includes this file...
so this one is easier and more separated.

>>+int ub_task_charge(struct task_struct *parent, struct task_struct *new);
>>+void ub_task_uncharge(struct task_struct *tsk);
>>+
>>+#else /* CONFIG_USER_RESOURCE */
>>+#define get_exec_ub() (NULL)
>>+#define set_exec_ub(__ub) (NULL)
>>+#define ub_task_charge(p, t) (0)
>>+#define ub_task_uncharge(t) do { } while (0)
>>+#endif /* CONFIG_USER_RESOURCE */
>>+#endif /* __UB_TASK_H_ */
>>--- ./kernel/irq/handle.c.ubirq 2006-07-10 12:39:20.000000000 +0400
>>+++ ./kernel/irq/handle.c 2006-08-01 12:39:34.000000000 +0400
>>@@ -16,6 +16,9 @@
>> #include <linux/interrupt.h>
>> #include <linux/kernel_stat.h>
>>
>>+#include <ub/beancounter.h>
>>+#include <ub/task.h>
>>+
>> #include "internals.h"
>>
>> /**
>>@@ -166,6 +169,9 @@ fastcall unsigned int __do_IRQ(unsigned
>> struct irq_desc *desc = irq_desc + irq;
>> struct irqaction *action;
>> unsigned int status;
>>+ struct user_beancounter *ub;
>>+
>>+ ub = set_exec_ub(&ub0);
>
>
> Perhaps a comment: "/* Don't charge resources gained in interrupts to current */
ok, will add comment:
/* UBC charges should be done to host system */
>
>
>> kstat_this_cpu.irqs[irq]++;
>> if (CHECK_IRQ_PER_CPU(desc->status)) {
>>@@ -178,6 +184,8 @@ fastcall unsigned int __do_IRQ(unsigned
>> desc->chip->ack(irq);
>> action_ret = handle_IRQ_event(irq, regs, desc->action);
>> desc->chip->end(irq);
>>+
>>+ (void) set_exec_ub(ub);
>> return 1;
>> }
>>
>>@@ -246,6 +254,7 @@ out:
>> desc->chip->end(irq);
>> spin_unlock(&desc->lock);
>>
>>+ (void) set_exec_ub(ub);
>
>
>
> Seems like a WARN_ON() would be appropriate rather than ignoring the
> return code.
BUG_ON(ret != &ub0) ?

maybe introduce a kind of
reset_exec_ub(old_ub, expected_current_ub)
{
ret = set_exec_ub(old_ub);
BUG_ON(ret != expected_current_ub);
}
?


>> return 1;
>> }
>>
>>--- ./kernel/softirq.c.ubirq 2006-07-17 17:01:12.000000000 +0400
>>+++ ./kernel/softirq.c 2006-08-01 12:40:44.000000000 +0400
>>@@ -18,6 +18,9 @@
>> #include <linux/rcupdate.h>
>> #include <linux/smp.h>
>>
>>+#include <ub/beancounter.h>
>>+#include <ub/task.h>
>>+
>> #include <asm/irq.h>
>> /*
>> - No shared variables, all the data are CPU local.
>>@@ -191,6 +194,9 @@ asmlinkage void __do_softirq(void)
>> __u32 pending;
>> int max_restart = MAX_SOFTIRQ_RESTART;
>> int cpu;
>>+ struct user_beancounter *ub;
>>+
>>+ ub = set_exec_ub(&ub0);
>
>
> Perhaps add the same comment...
ok

>
>
>> pending = local_softirq_pending();
>> account_system_vtime(current);
>>@@ -229,6 +235,8 @@ restart:
>>
>> account_system_vtime(current);
>> _local_bh_enable();
>>+
>>+ (void) set_exec_ub(ub);
>
>
> .. and the same WARN_ON.
>
>
>> }
>>
>> #ifndef __ARCH_HAS_DO_SOFTIRQ
>>--- ./kernel/fork.c.ubfork 2006-07-17 17:01:12.000000000 +0400
>>+++ ./kernel/fork.c 2006-08-01 12:58:36.000000000 +0400
>>@@ -46,6 +46,8 @@
>> #include <linux/delayacct.h>
>> #include <linux/taskstats_kern.h>
>>
>>+#include <ub/task.h>
>>+
>> #include <asm/pgtable.h>
>> #include <asm/pgalloc.h>
>> #include <asm/uaccess.h>
>>@@ -102,6 +104,7 @@ static kmem_cache_t *mm_cachep;
>>
>> void free_task(struct task_struct *tsk)
>> {
>>+ ub_task_uncharge(tsk);
>> free_thread_info(tsk->thread_info);
>> rt_mutex_debug_task_free(tsk);
>> free_task_struct(tsk);
>>@@ -162,18 +165,19 @@ static struct task_struct *dup_task_stru
>>
>> tsk = alloc_task_struct();
>> if (!tsk)
>>- return NULL;
>>+ goto out;
>>
>> ti = alloc_thread_info(tsk);
>>- if (!ti) {
>>- free_task_struct(tsk);
>>- return NULL;
>>- }
>>+ if (!ti)
>>+ goto out_tsk;
>>
>> *tsk = *orig;
>> tsk->thread_info = ti;
>> setup_thread_stack(tsk, orig);
>>
>>+ if (ub_task_charge(orig, tsk))
>>+ goto out_ti;
>>+
>> /* One for us, one for whoever does the "release_task()" (usually parent) */
>> atomic_set(&tsk->usage,2);
>> atomic_set(&tsk->fs_excl, 0);
>>@@ -180,6 +184,13 @@ static struct task_struct *dup_task_stru
>> #endif
>> tsk->splice_pipe = NULL;
>> return tsk;
>>+
>>+out_ti:
>>+ free_thread_info(ti);
>>+out_tsk:
>>+ free_task_struct(tsk);
>>+out:
>>+ return NULL;
>
>
> Ugh. This is starting to look like copy_process(). Any reason you
> couldn't move the bean counter bits to copy_process() instead?
This is more logical place since we _will_ charge task here
(next patchset fo
...

Re: [ckrm-tech] [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5372 is a reply to message #5313] Fri, 18 August 2006 09:29 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Dave Hansen wrote:
> On Thu, 2006-08-17 at 17:27 +0400, Kirill Korotaev wrote:
>
>>charged kernel objects can't be _reclaimed_. how do you propose
>>to reclaim tasks page tables or files or task struct or vma or etc.?
>
>
> Do you have any statistics on which of these objects are the most
> troublesome? If it _is_ pagetables, for instance, it is quite
> conceivable that we could reclaim them.
they all are troublesome :/
user can create lots of vmas, w/o page tables.
lots of fdsets, ipcids.
These are not reclaimable.

Also consider the following scenario with reclaimable page tables.
e.g. user hit kmemsize limit due to fat page tables.
kernel reclaims some of the page tables and frees user kenerl memory.
after that user creates some uncreclaimable objects like fdsets or ipcs
and then accesses memory with reclaimed page tables.
Sooner or later we kill user with SIGSEGV from page fault due to
no memory. This is worse then returning ENOMEM from poll() or
mmap() where user allocates kernel objects.

> This one probably deserves a big, fat comment, though. ;)
tell me where to write it and what? :)

Thanks,
Kirill
Re: [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5374 is a reply to message #5329] Fri, 18 August 2006 09:36 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Rohit Seth wrote:
> On Thu, 2006-08-17 at 17:27 +0400, Kirill Korotaev wrote:
>
>>>If I'm reading this patch right then seems like you are making page
>>>allocations to fail w/o (for example) trying to purge some pages from
>>>the page cache belonging to this container. Or is that reclaim going to
>>>come later?
>>
>>charged kernel objects can't be _reclaimed_. how do you propose
>>to reclaim tasks page tables or files or task struct or vma or etc.?
>
>
>
> I agree that kernel objects cann't be reclaimed easily. But what you
> are proposing is also not right. Returning failure w/o doing any
> reclaim on pages (that are reclaimable) is not useful. And this is why
> I asked, is this change going to be part of next set of patches (as
> current set of patches are only tracking kernel usage).
1. reclaiming user resources is not that good idea as it looks to you.
such solutions end up with lots of resources spent on reclaim.
for user memory reclaims mean consumption of expensive disk I/O bandwidth
which reduces overall system throughput and influences other users.

2. kernel memory is mostly not reclaimable. can you reclaim vma structs or ipc ids?
even with page tables it is not that easy.
And the fact is:
- kernel memory consumtion is usually less then user memory,
so it's not worth reclaiming it.
- reclaiming can result in kind of user service deadlocks when you are
unable to handle user requests gracefully anymore.
See my email to Dave with an example.
- our solution _is_ right and works (for >3 years in production already).
If desired it _can_ be extended with reclamation.

Kirill
Re: [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5383 is a reply to message #5317] Fri, 18 August 2006 11:03 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Andrew Morton wrote:
> On Thu, 17 Aug 2006 16:13:30 +0400
> Kirill Korotaev <dev@sw.ru> wrote:
>
>
>>>I was more thinking about (for example) user land physical memory limit
>>>for that bean counter. If the limits are going down, then the system
>>>call should try to flush out page cache pages or swap out anonymous
>>>memory. But you are right that it won't be possible in all cases, like
>>>for in kernel memory limits.
>>
>>Such kind of memory management is less efficient than the one
>>making decisions based on global shortages and global LRU alogrithm.
>
>
> I also was quite surprised that openvz appears to have no way of
> constraining a container's memory usage. "I want to run this bunch of
> processes in a 4.5GB container".
If you mean user memory, then it is possible to set
container limits to 4,5GB. This is what most people care about
and it is not a problem.

Or you mean that you are suprised there are lots of parameters
and there is no a single one limiting the _whole_ memory set of container
memory (sum of kernel memory, user space memory and other resources memory)?

>>The problem here is that doing swap out takes more expensive disk I/O
>>influencing other users.
>
>
> A well-set-up container would presumably be working against its own
> spindle(s). If the operator has gone to all the trouble of isolating a job
> from the system's other jobs, he'd be pretty dumb to go and let all the
> "isolated" jobs share a stinky-slow resource like a disk.
why do you assume that it is always an operator who controls the applications
inside the container?
users can run any application inside and it is systems job to
introduce resource isolation between users, not the operators full-time
job doing monitoring of users.

> But yes, swap is a problem. To do this properly we'd need a way of saying
> "this container here uses that swap device over there".
yep, this is possible with page beancounters as it tracks user pages.
more over, we have an intention of building a system with a single container
memory parameter, but we think this is more user interface question and
still requires all the UBC resources accounting.

Thanks,
Kirill
Re: [RFC][PATCH 2/7] UBC: core (structures, API) [message #5384 is a reply to message #5328] Fri, 18 August 2006 11:13 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Rohit Seth wrote:
> On Thu, 2006-08-17 at 15:53 +0400, Kirill Korotaev wrote:
>
>>Rohit Seth wrote:
>>
>>>On Wed, 2006-08-16 at 19:37 +0400, Kirill Korotaev wrote:
>>>
>>>
>>>>Core functionality and interfaces of UBC:
>>>>find/create beancounter, initialization,
>>>>charge/uncharge of resource, core objects' declarations.
>>>>
>>>>Basic structures:
>>>> ubparm - resource description
>>>> user_beancounter - set of resources, id, lock
>>>>
>>>>Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
>>>>Signed-Off-By: Kirill Korotaev <dev@sw.ru>
>>>>
>>>>---
>>>>include/ub/beancounter.h | 157 ++++++++++++++++++
>>>>init/main.c | 4
>>>>kernel/Makefile | 1
>>>>kernel/ub/Makefile | 7
>>>>kernel/ub/beancounter.c | 398 +++++++++++++++++++++++++++++++++++++++++++++++
>>>>5 files changed, 567 insertions(+)
>>>>
>>>>--- /dev/null 2006-07-18 14:52:43.075228448 +0400
>>>>+++ ./include/ub/beancounter.h 2006-08-10 14:58:27.000000000 +0400
>>>>@@ -0,0 +1,157 @@
>>>>+/*
>>>>+ * include/ub/beancounter.h
>>>>+ *
>>>>+ * Copyright (C) 2006 OpenVZ. SWsoft Inc
>>>>+ *
>>>>+ */
>>>>+
>>>>+#ifndef _LINUX_BEANCOUNTER_H
>>>>+#define _LINUX_BEANCOUNTER_H
>>>>+
>>>>+/*
>>>>+ * Resource list.
>>>>+ */
>>>>+
>>>>+#define UB_RESOURCES 0
>>>>+
>>>>+struct ubparm {
>>>>+ /*
>>>>+ * A barrier over which resource allocations are failed gracefully.
>>>>+ * e.g. if the amount of consumed memory is over the barrier further
>>>>+ * sbrk() or mmap() calls fail, the existing processes are not killed.
>>>>+ */
>>>>+ unsigned long barrier;
>>>>+ /* hard resource limit */
>>>>+ unsigned long limit;
>>>>+ /* consumed resources */
>>>>+ unsigned long held;
>>>>+ /* maximum amount of consumed resources through the last period */
>>>>+ unsigned long maxheld;
>>>>+ /* minimum amount of consumed resources through the last period */
>>>>+ unsigned long minheld;
>>>>+ /* count of failed charges */
>>>>+ unsigned long failcnt;
>>>>+};
>>>
>>>
>>>What is the difference between barrier and limit. They both sound like
>>>hard limits. No?
>>
>>check __charge_beancounter_locked and severity.
>>It provides some kind of soft and hard limits.
>>
>
>
> Would be easier to just rename them as soft and hard limits...
>
>
>>>>+
>>>>+/*
>>>>+ * Kernel internal part.
>>>>+ */
>>>>+
>>>>+#ifdef __KERNEL__
>>>>+
>>>>+#include <linux/config.h>
>>>>+#include <linux/spinlock.h>
>>>>+#include <linux/list.h>
>>>>+#include <asm/atomic.h>
>>>>+
>>>>+/*
>>>>+ * UB_MAXVALUE is essentially LONG_MAX declared in a cross-compiling safe form.
>>>>+ */
>>>>+ /* resources statistics and settings */
>>>>+ struct ubparm ub_parms[UB_RESOURCES];
>>>>+};
>>>>+
>>>
>>>
>>>I presume UB_RESOURCES value is going to change as different resources
>>>start getting tracked.
>>
>>what's wrong with it?
>>
>
>
> ...just that user land will need to be some how informed about that.
the same way user space knows that system call is (not) implemented.
(include unistd.h :))) )

>>>I think something like configfs should be used for user interface. It
>>>automatically presents the right interfaces to user land (based on
>>>kernel implementation). And you wouldn't need any changes in glibc etc.
>>
>>1. UBC doesn't require glibc modificatins.
>
>
> You are right not for setting the limits. But for adding any new
> functionality related to containers....as in you just added a new system
> call to get the limits.
Do you state that glibc describes _all_ the existing system calls with some wrappers?

>>2. if you think a bit more about it, adding UB parameters doesn't
>> require user space changes as well.
>>3. it is possible to add any kind of interface for UBC. but do you like the idea
>> to grep 200(containers)x20(parameters) files for getting current usages?
>
>
> How are you doing it currently and how much more efficient it is in
> comparison to configfs?
currently it is done with a single file read.
you can grep it, sum up resources or do what ever you want from bash.
what is important! you can check whether container hits its limits
with a single command, while with configs you would have to look through
20 files...

IMHO it is convinient to have a text file representing the whole information state
and system call for applications.

>> Do you like the idea to convert numbers to strings and back w/o
>> thinking of data types?
>
>
> IMO, setting up limits and containers (themselves) is not a common
> operation. I wouldn't be too worried about loosing those few extra
> cycles in setting them up.
it is not the question of performance...

Kirill
Re: [ckrm-tech] [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5388 is a reply to message #5349] Fri, 18 August 2006 11:43 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Matt Helsley wrote:

[... snip ...]
>>--- ./kernel/ub/sys.c.ubsys 2006-07-28 18:52:18.000000000 +0400
>>+++ ./kernel/ub/sys.c 2006-08-03 16:14:23.000000000 +0400
>>@@ -0,0 +1,126 @@
>>+/*
>>+ * kernel/ub/sys.c
>>+ *
>>+ * Copyright (C) 2006 OpenVZ. SWsoft Inc
>>+ *
>>+ */
>>+
>>+#include <linux/config.h>
>>+#include <linux/sched.h>
>>+#include <asm/uaccess.h>
>>+
>>+#include <ub/beancounter.h>
>>+#include <ub/task.h>
>>+
>>+#ifndef CONFIG_USER_RESOURCE
>
>
> Get rid of the #ifdef since this file should only be compiled if
> CONFIG_USER_RESOURCE=y anyway.
>
>
>>+asmlinkage long sys_getluid(void)
>>+{
>>+ return -ENOSYS;
>>+}
>>+
>>+asmlinkage long sys_setluid(uid_t uid)
>>+{
>>+ return -ENOSYS;
>>+}
>>+
>>+asmlinkage long sys_setublimit(uid_t uid, unsigned long resource,
>>+ unsigned long *limits)
>>+{
>>+ return -ENOSYS;
>>+}
>
>
> Looks to me like you want to add:
>
> cond_syscall(sys_getluid);
> ...
>
> in kernel/sys_ni.c and then you won't have to worry about making these
> empty functions.
Good note. Thanks, will do it!

>>+#else /* CONFIG_USER_RESOURCE */
>>+
>>+/*
>>+ * The (rather boring) getluid syscall
>>+ */
>>+asmlinkage long sys_getluid(void)
>>+{
>>+ struct user_beancounter *ub;
>>+
>>+ ub = get_exec_ub();
>>+ if (ub == NULL)
>>+ return -EINVAL;
>>+
>>+ return ub->ub_uid;
>>+}
>>+
>>+/*
>>+ * The setluid syscall
>>+ */
>>+asmlinkage long sys_setluid(uid_t uid)
>>+{
>>+ int error;
>>+ struct user_beancounter *ub;
>>+ struct task_beancounter *task_bc;
>>+
>>+ task_bc = &current->task_bc;
>>+
>>+ /* You may not disown a setluid */
>>+ error = -EINVAL;
>>+ if (uid == (uid_t)-1)
>>+ goto out;
>>+
>>+ /* You may only set an ub as root */
>>+ error = -EPERM;
>>+ if (!capable(CAP_SETUID))
>>+ goto out;
>
>
> With resource groups you don't necessarily have to be root -- just the
> owner of the group and task.
the question is - who is the owner of group?
user, user group or who?
Both are bad, since the same user can run inside the container and thus
container will be potentially controllable/breakable from inside.


> Filesystems and appropriate share representations offer a way to give
> regular users the ability to manage their resources without requiring
> CAP_FOO.
not sure what you propose...

we can introduce the following rules:

containers (UB) can be created by process with SETUID cap only.
subcontainers (SUB) can be created by any process.

what do you think?


>>+ /* Ok - set up a beancounter entry for this user */
>>+ error = -ENOBUFS;
>>+ ub = beancounter_findcreate(uid, NULL, UB_ALLOC);
>>+ if (ub == NULL)
>>+ goto out;
>>+
>>+ /* install bc */
>>+ put_beancounter(task_bc->exec_ub);
>>+ task_bc->exec_ub = ub;
>>+ put_beancounter(task_bc->fork_sub);
>>+ task_bc->fork_sub = get_beancounter(ub);
>>+ error = 0;
>>+out:
>>+ return error;
>>+}
>>+
>>+/*
>>+ * The setbeanlimit syscall
>>+ */
>>+asmlinkage long sys_setublimit(uid_t uid, unsigned long resource,
>>+ unsigned long *limits)
>>+{
>>+ int error;
>>+ unsigned long flags;
>>+ struct user_beancounter *ub;
>>+ unsigned long new_limits[2];
>>+
>>+ error = -EPERM;
>>+ if(!capable(CAP_SYS_RESOURCE))
>>+ goto out;
>
>
> Again, a filesystem interface would give us more flexibility when it
> comes to allowing users to manage their resources while still preventing
> them from exceeding limits.
we can have 2 different root users with uid = 0 in 2 different containers.

> I doubt you really want to give owners of a container CAP_SYS_RESOURCE
> and CAP_USER (i.e. total control over resource management) just to allow
> them to manage their subset of the resources.
The origin idea is that administator of the node can manage user
resources only. Users can't, since otherwise they can increase the limits.
But we can allow them to manage sub beancoutners imho...

>>+ error = -EINVAL;
>>+ if (resource >= UB_RESOURCES)
>>+ goto out;
>>+
>>+ error = -EFAULT;
>>+ if (copy_from_user(&new_limits, limits, sizeof(new_limits)))
>>+ goto out;
>>+
>>+ error = -EINVAL;
>>+ if (new_limits[0] > UB_MAXVALUE || new_limits[1] > UB_MAXVALUE)
>>+ goto out;
>>+
>>+ error = -ENOENT;
>>+ ub = beancounter_findcreate(uid, NULL, 0);
>>+ if (ub == NULL)
>>+ goto out;
>>+
>>+ spin_lock_irqsave(&ub->ub_lock, flags);
>>+ ub->ub_parms[resource].barrier = new_limits[0];
>>+ ub->ub_parms[resource].limit = new_limits[1];
>>+ spin_unlock_irqrestore(&ub->ub_lock, flags);
>>+
>>+ put_beancounter(ub);
>>+ error = 0;
>>+out:
>>+ return error;
>>+}
>>+#endif
>>
>> ------------------------------------------------------------ -------------
>>Using Tomcat but need to do more? Need to support web services, security?
>>Get stuff done quickly with pre-integrated technology to make your job easier
>>Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
>> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&b id=263057&dat=121642
>>_______________________________________________
>>ckrm-tech mailing list
>>https://lists.sourceforge.net/lists/listinfo/ckrm-tech
>
>
>
Re: [ckrm-tech] [PATCH 2/7] UBC: core (structures, API) [message #5391 is a reply to message #5370] Fri, 18 August 2006 11:50 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Matt Helsley wrote:
> On Fri, 2006-08-18 at 11:35 +0400, Andrey Savochkin wrote:
>
>>On Thu, Aug 17, 2006 at 10:31:37PM -0700, Andrew Morton wrote:
>>
>>>On Thu, 17 Aug 2006 15:53:40 +0400
>>>Kirill Korotaev <dev@sw.ru> wrote:
>>>
>>>
>>>>>>+struct user_beancounter
>>>>>>+{
>>>>>>+ atomic_t ub_refcount;
>>>>>>+ spinlock_t ub_lock;
>>>>>>+ uid_t ub_uid;
>>>>>
>>>>>
>>>>>Why uid? Will it be possible to club processes belonging to different
>>>>>users to same bean counter.
>>>>
>>>>oh, its a misname. Should be ub_id. it is ID of user_beancounter
>>>>and has nothing to do with user id.
>>>
>>>But it uses a uid_t. That's more than a misnaming?
>>
>>It used to be uid-related in ancient times when the notion of container
>>hadn't formed up.
>>"user" part of user_beancounter name has the same origin :)
>
>
> Is it similarly irrelevant now? If so perhaps a big rename could be used
> to make the names clearer (s/user_//, s/ub_/bc_/, ...).
hm... let's try :)

Kirill
Re: [ckrm-tech] [RFC][PATCH 2/7] UBC: core (structures, API) [message #5394 is a reply to message #5305] Fri, 18 August 2006 12:34 Go to previous messageGo to previous message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Dave Hansen wrote:
> On Thu, 2006-08-17 at 15:45 +0400, Kirill Korotaev wrote:
>
>>We need more complex decrement/locking scheme than krefs
>>provide. e.g. in __put_beancounter() we need
>>atomic_dec_and_lock_irqsave() semantics for performance optimizations.
>
>
> Is it possible to put the locking in the destructor? It seems like that
> should give similar behavior.
objects live in hashes also so you need to distinguish objects being freed
on lookup somehow.

Kirill
Previous Topic: [PATCH] vzlist: Fix "cast from pointer to integer of different size" warnings
Next Topic: [RFC][PATCH 1/2] add user namespace [try #2]
Goto Forum:
  


Current Time: Sun Oct 26 20:55:43 GMT 2025

Total time taken to generate the page: 0.10486 seconds