| Home » Mailing lists » Devel » [RFC][PATCH] UBC: user resource beancounters Goto Forum:
	|  |  
	|  |  
	|  |  
	| 
		
			| Re: [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5297 is a reply to message #5239] | Thu, 17 August 2006 13:33   |  
			| 
				
				
					|  dev Messages: 1693
 Registered: September 2005
 Location: Moscow
 | Senior Member |  
 |  |  
	| > On Wed, 2006-08-16 at 11:47 -0700, Dave Hansen wrote: >
 >>On Wed, 2006-08-16 at 19:40 +0400, Kirill Korotaev wrote:
 >>
 >>>--- ./include/linux/mm.h.kmemcore       2006-08-16 19:10:38.000000000
 >>>+0400
 >>>+++ ./include/linux/mm.h        2006-08-16 19:10:51.000000000 +0400
 >>>@@ -274,8 +274,14 @@ struct page {
 >>>        unsigned int gfp_mask;
 >>>        unsigned long trace[8];
 >>> #endif
 >>>+#ifdef CONFIG_USER_RESOURCE
 >>>+       union {
 >>>+               struct user_beancounter *page_ub;
 >>>+       } bc;
 >>>+#endif
 >>> };
 >>
 >>Is everybody OK with adding this accounting to the 'struct page'?
 >
 >
 > My preference would be to have container (I keep on saying container,
 > but resource beancounter) pointer embeded in task, mm(not sure),
 > address_space and anon_vma structures.  This should allow us to track
 > user land pages optimally.  But for tracking kernel usage on behalf of
 > user, we will have to use an additional field (unless we can re-use
 > mapping).  Please correct me if I'm wrong, though all the kernel
 > resources will be allocated/freed in context of a user process.  And at
 > that time we know if a allocation should succeed or not.  So we may
 > actually not need to track kernel pages that closely.  We are not going
 > to run reclaim on any of them anyways.
 objects are really allocated in process context
 (except for TCP/IP and other softirqs which are done in arbitrary
 process context!)
 And objects are not always freed in correct context (!).
 
 Note, page_ub is not for _user_ pages. user pages accounting will be added
 in next patch set. page_ub is added to track kernel allocations.
 
 Kirill
 |  
	|  |  |  
	| 
		
			| Re: [RFC][PATCH 7/7] UBC: proc interface [message #5299 is a reply to message #5232] | Thu, 17 August 2006 13:41   |  
			| 
				
				
					|  dev Messages: 1693
 Registered: September 2005
 Location: Moscow
 | Senior Member |  
 |  |  
	| > On Wed, Aug 16, 2006 at 07:44:30PM +0400, Kirill Korotaev wrote: >
 >>Add proc interface (/proc/user_beancounters) allowing to see current
 >>state (usage/limits/fails for each UB). Implemented via seq files.
 >
 >
 > Ugh, why /proc?  This doesn't have anything to do with processes, just
 > users, right?  What's wrong with /sys/kernel/ instead?
 We can move it, if there are much objections.
 It is just here for more than 3 years (AFAIK starting from Alan's UBC)
 and would be nice to have for compatibility (at least with existing OpenVZ).
 But if it is required -- will do.
 
 > Or /sys/kernel/debug/user_beancounters/ in debugfs as this is just a
 > debugging thing, right?
 debugfs is usually OFF imho. you don't export meminfo information in debugfs,
 correct? user usages are the same imho...
 
 Kirill
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [ckrm-tech] [RFC][PATCH] UBC: user resource beancounters [message #5301 is a reply to message #5289] | Thu, 17 August 2006 13:53   |  
			| 
				
				
					|  dev Messages: 1693
 Registered: September 2005
 Location: Moscow
 | Senior Member |  
 |  |  
	| > On Wed, Aug 16, 2006 at 07:24:03PM +0400, Kirill Korotaev wrote: >
 >>As the first step we want to propose for discussion
 >>the most complicated parts of resource management:
 >>kernel memory and virtual memory.
 >
 > Do you have any plans to post a CPU controller? Is that tied to UBC
 > interface as well?
 
 Not everything at once :) To tell the truth I think CPU controller
 is even more complicated than user memory accounting/limiting.
 
 No, fair CPU scheduler is not tied to UBC in any regard.
 As we discussed before, it is valuable to have an ability to limit
 different resources separately (CPU, disk I/O, memory, etc.).
 For example, it can be possible to place some mission critical
 kernel threads (like kjournald) in a separate contanier.
 
 This patches are related to kernel memory and nothing more :)
 
 Thanks,
 Kirill
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [ckrm-tech] [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5303 is a reply to message #5288] | Thu, 17 August 2006 14:03   |  
			| 
				
				
					|  dev Messages: 1693
 Registered: September 2005
 Location: Moscow
 | Senior Member |  
 |  |  
	| Srivatsa Vaddagiri wrote: > On Wed, Aug 16, 2006 at 07:39:43PM +0400, Kirill Korotaev wrote:
 >
 >
 >>+/*
 >>+ *	The setbeanlimit syscall
 >>+ */
 >>+asmlinkage long sys_setublimit(uid_t uid, unsigned long resource,
 >>+		unsigned long *limits)
 >>+{
 >
 >
 > [snip]
 >
 >
 >>+	spin_lock_irqsave(&ub->ub_lock, flags);
 >>+	ub->ub_parms[resource].barrier = new_limits[0];
 >>+	ub->ub_parms[resource].limit = new_limits[1];
 >
 >
 > Would it be usefull to notify the "resource" controller about this
 > change in limits? For ex: in case of the CPU controller I wrote
 > (http://lkml.org/lkml/2006/8/4/9), I was finding it usefull to recv
 > notification of changes to these limits, so that internal structures
 > (which are kept per-task-group) can be updated.
 I think this can be added when needed, no?
 See no much reason to add notifications which are not used yet.
 
 Please, keep in mind. This patch set can be extended in infinite number
 of ways. But!!! It contains only the required minimal functionality.
 When we are to add code requiring to know about limit changes or fails
 or whatever we can always extend it accordingly.
 
 Thanks,
 Kirill
 |  
	|  |  |  
	|  |  
	|  |  
	|  |  
	|  |  
	| 
		
			| Re: [RFC][PATCH 7/7] UBC: proc interface [message #5318 is a reply to message #5299] | Thu, 17 August 2006 15:40   |  
			| 
				
				
					|  Greg KH Messages: 27
 Registered: February 2006
 | Junior Member |  |  |  
	| On Thu, Aug 17, 2006 at 05:43:16PM +0400, Kirill Korotaev wrote: > >On Wed, Aug 16, 2006 at 07:44:30PM +0400, Kirill Korotaev wrote:
 > >
 > >>Add proc interface (/proc/user_beancounters) allowing to see current
 > >>state (usage/limits/fails for each UB). Implemented via seq files.
 > >
 > >
 > >Ugh, why /proc?  This doesn't have anything to do with processes, just
 > >users, right?  What's wrong with /sys/kernel/ instead?
 > We can move it, if there are much objections.
 
 I am objecting.  /proc is for processes so do not add any new files
 there that do not deal with processes.
 
 > >Or /sys/kernel/debug/user_beancounters/ in debugfs as this is just a
 > >debugging thing, right?
 > debugfs is usually OFF imho.
 
 No, distros enable it.
 
 > you don't export meminfo information in debugfs, correct?
 
 That is because the meminfo is tied to processes, or was added to proc
 before debugfs came about.
 
 Then how about just /sys/kernel/ instead and use sysfs?  Just remember,
 one value per file please.
 
 thanks,
 
 greg k-h
 |  
	|  |  |  
	| 
		
			| Re:  Re: [RFC][PATCH 7/7] UBC: proc interface [message #5321 is a reply to message #5318] | Thu, 17 August 2006 16:12   |  
			|  |  
	| Greg KH wrote: > On Thu, Aug 17, 2006 at 05:43:16PM +0400, Kirill Korotaev wrote:
 >
 >>> On Wed, Aug 16, 2006 at 07:44:30PM +0400, Kirill Korotaev wrote:
 >>>
 >>>
 >>>> Add proc interface (/proc/user_beancounters) allowing to see current
 >>>> state (usage/limits/fails for each UB). Implemented via seq files.
 >>>>
 >>> Ugh, why /proc?  This doesn't have anything to do with processes, just
 >>> users, right?  What's wrong with /sys/kernel/ instead?
 >>>
 >> We can move it, if there are much objections.
 >>
 >
 > I am objecting.  /proc is for processes so do not add any new files
 > there that do not deal with processes.
 >
 >
 >>> Or /sys/kernel/debug/user_beancounters/ in debugfs as this is just a
 >>> debugging thing, right?
 >>>
 >> debugfs is usually OFF imho.
 >>
 >
 > No, distros enable it.
 >
 >
 >> you don't export meminfo information in debugfs, correct?
 >>
 >
 > That is because the meminfo is tied to processes, or was added to proc
 > before debugfs came about.
 >
 > Then how about just /sys/kernel/ instead and use sysfs?  Just remember,
 > one value per file please.
 >
 I see two problems with that. But let me first describe the current
 /proc/user_beancounters.  This is how it looks like from inside a container:
 
 # cat /proc/user_beancounters
 Version: 2.5
 uid  resource           held    maxheld    barrier      limit    failcnt
 123: kmemsize         836919    1005343    2752512    2936012          0
 lockedpages           0          0         32         32          0
 privvmpages        4587       7289      49152      53575          0
 ............(more lines like that).........................................
 
 
 I.e. a container owner can take a glance over the current parameters,
 their usage and (the thing that is really important) fail counters. Fail
 counter increases each time a parameter hits the limit. This is very
 straightforward way for container's owner to see if everything is OK or not.
 
 So, the problems with /sys are:
 
 (1) Gettng such info from 40+ files requires at least some script, while
 now cat is just fine.
 
 (2) Do we want to virtualize sysfs and enable /sys for every container?
 Note that user_beancounters statistics is really needed for container's
 owner to see. At the same time, container's owner should not be able to
 modify it -- so we should end up with read/write ubc entries for the
 host system and read-only ones for the container.
 
 Taking into account those two issues, current /proc/user_beancounters
 might be not that bad.
 |  
	|  |  |  
	|  |  
	|  |  
	| 
		
			| Re: [RFC][PATCH 2/7] UBC: core (structures, API) [message #5328 is a reply to message #5279] | Thu, 17 August 2006 16:55   |  
			| 
				
				
					|  Rohit Seth Messages: 101
 Registered: August 2006
 | Senior Member |  |  |  
	| On Thu, 2006-08-17 at 15:53 +0400, Kirill Korotaev wrote: > Rohit Seth wrote:
 > > On Wed, 2006-08-16 at 19:37 +0400, Kirill Korotaev wrote:
 > >
 > >>Core functionality and interfaces of UBC:
 > >>find/create beancounter, initialization,
 > >>charge/uncharge of resource, core objects' declarations.
 > >>
 > >>Basic structures:
 > >>  ubparm           - resource description
 > >>  user_beancounter - set of resources, id, lock
 > >>
 > >>Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
 > >>Signed-Off-By: Kirill Korotaev <dev@sw.ru>
 > >>
 > >>---
 > >> include/ub/beancounter.h |  157 ++++++++++++++++++
 > >> init/main.c              |    4
 > >> kernel/Makefile          |    1
 > >> kernel/ub/Makefile       |    7
 > >> kernel/ub/beancounter.c  |  398 +++++++++++++++++++++++++++++++++++++++++++++++
 > >> 5 files changed, 567 insertions(+)
 > >>
 > >>--- /dev/null	2006-07-18 14:52:43.075228448 +0400
 > >>+++ ./include/ub/beancounter.h	2006-08-10 14:58:27.000000000 +0400
 > >>@@ -0,0 +1,157 @@
 > >>+/*
 > >>+ *  include/ub/beancounter.h
 > >>+ *
 > >>+ *  Copyright (C) 2006 OpenVZ. SWsoft Inc
 > >>+ *
 > >>+ */
 > >>+
 > >>+#ifndef _LINUX_BEANCOUNTER_H
 > >>+#define _LINUX_BEANCOUNTER_H
 > >>+
 > >>+/*
 > >>+ *	Resource list.
 > >>+ */
 > >>+
 > >>+#define UB_RESOURCES	0
 > >>+
 > >>+struct ubparm {
 > >>+	/*
 > >>+	 * A barrier over which resource allocations are failed gracefully.
 > >>+	 * e.g. if the amount of consumed memory is over the barrier further
 > >>+	 * sbrk() or mmap() calls fail, the existing processes are not killed.
 > >>+	 */
 > >>+	unsigned long	barrier;
 > >>+	/* hard resource limit */
 > >>+	unsigned long	limit;
 > >>+	/* consumed resources */
 > >>+	unsigned long	held;
 > >>+	/* maximum amount of consumed resources through the last period */
 > >>+	unsigned long	maxheld;
 > >>+	/* minimum amount of consumed resources through the last period */
 > >>+	unsigned long	minheld;
 > >>+	/* count of failed charges */
 > >>+	unsigned long	failcnt;
 > >>+};
 > >
 > >
 > > What is the difference between barrier and limit. They both sound like
 > > hard limits.  No?
 > check __charge_beancounter_locked and severity.
 > It provides some kind of soft and hard limits.
 >
 
 Would be easier to just rename them as soft and hard limits...
 
 > >>+
 > >>+/*
 > >>+ * Kernel internal part.
 > >>+ */
 > >>+
 > >>+#ifdef __KERNEL__
 > >>+
 > >>+#include <linux/config.h>
 > >>+#include <linux/spinlock.h>
 > >>+#include <linux/list.h>
 > >>+#include <asm/atomic.h>
 > >>+
 > >>+/*
 > >>+ * UB_MAXVALUE is essentially LONG_MAX declared in a cross-compiling safe form.
 > >>+ */
 > >>+	/* resources statistics and settings */
 > >>+	struct ubparm		ub_parms[UB_RESOURCES];
 > >>+};
 > >>+
 > >
 > >
 > > I presume UB_RESOURCES value is going to change as different resources
 > > start getting tracked.
 > what's wrong with it?
 >
 
 ...just that user land will need to be some how informed about that.
 
 > > I think something like configfs should be used for user interface.  It
 > > automatically presents the right interfaces to user land (based on
 > > kernel implementation).  And you wouldn't need any changes in glibc etc.
 > 1. UBC doesn't require glibc modificatins.
 
 You are right not for setting the limits.  But for adding any new
 functionality related to containers....as in you just added a new system
 call to get the limits.
 
 > 2. if you think a bit more about it, adding UB parameters doesn't
 >    require user space changes as well.
 > 3. it is possible to add any kind of interface for UBC. but do you like the idea
 >    to grep 200(containers)x20(parameters) files for getting current usages?
 
 How are you doing it currently and how much more efficient it is in
 comparison to configfs?
 
 >    Do you like the idea to convert numbers to strings and back w/o
 >    thinking of data types?
 
 IMO, setting up limits and containers (themselves) is not a common
 operation.    I wouldn't be too worried about loosing those few extra
 cycles in setting them up.
 
 -rohit
 |  
	|  |  |  
	|  |  
	|  |  
	| 
		
			| Re: [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5331 is a reply to message #5285] | Thu, 17 August 2006 17:08   |  
			| 
				
				
					|  Rohit Seth Messages: 101
 Registered: August 2006
 | Senior Member |  |  |  
	| On Thu, 2006-08-17 at 16:13 +0400, Kirill Korotaev wrote: > Rohit Seth wrote:
 > > On Wed, 2006-08-16 at 20:04 +0100, Alan Cox wrote:
 > >
 > >>Ar Mer, 2006-08-16 am 11:17 -0700, ysgrifennodd Rohit Seth:
 > >>
 > >>>I think there should be a check here for seeing if the new limits are
 > >>>lower than the current usage of a resource.  If so then take appropriate
 > >>>action.
 > >>
 > >>Generally speaking there isn't a sane appropriate action because the
 > >>resources can't just be yanked.
 > >>
 > >
 > >
 > > I was more thinking about (for example) user land physical memory limit
 > > for that bean counter.  If the limits are going down, then the system
 > > call should try to flush out page cache pages or swap out anonymous
 > > memory.  But you are right that it won't be possible in all cases, like
 > > for in kernel memory limits.
 > Such kind of memory management is less efficient than the one
 > making decisions based on global shortages and global LRU alogrithm.
 >
 > The problem here is that doing swap out takes more expensive disk I/O
 > influencing other users.
 >
 > So throttling algorithms if wanted should be optional, not mandatory.
 > Lets postpone it and concentrate on the core.
 >
 
 
 I'm really interested in seeing what changes you make in alloc_page when
 the container limits are hit.
 
 When a container is throttling then yes it will have some additional
 cost to other containers but that is the cost of sharing an underlying
 platform.
 
 -rohit
 |  
	|  |  |  
	|  |  
	|  |  
	|  |  
	| 
		
			| Re: [ckrm-tech] [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5349 is a reply to message #5199] | Fri, 18 August 2006 02:31   |  
			| 
				
				
					|  Matt Helsley Messages: 86
 Registered: August 2006
 | Member |  |  |  
	| On Wed, 2006-08-16 at 19:39 +0400, Kirill Korotaev wrote: > Add the following system calls for UB management:
 >   1. sys_getluid    - get current UB id
 >   2. sys_setluid    - changes exec_ and fork_ UBs on current
 >   3. sys_setublimit - set limits for resources consumtions
 >
 > Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
 > Signed-Off-By: Kirill Korotaev <dev@sw.ru>
 >
 > ---
 >  arch/i386/kernel/syscall_table.S |    3
 >  arch/ia64/kernel/entry.S         |    3
 >  arch/sparc/kernel/systbls.S      |    2
 >  arch/sparc64/kernel/systbls.S    |    2
 >  include/asm-i386/unistd.h        |    5 +
 >  include/asm-ia64/unistd.h        |    5 +
 >  include/asm-powerpc/systbl.h     |    3
 >  include/asm-powerpc/unistd.h     |    5 +
 >  include/asm-sparc/unistd.h       |    3
 >  include/asm-sparc64/unistd.h     |    3
 >  include/asm-x86_64/unistd.h      |    8 ++
 >  kernel/ub/Makefile               |    1
 >  kernel/ub/sys.c                  |  126 +++++++++++++++++++++++++++++++++++++++
 >  13 files changed, 163 insertions(+), 6 deletions(-)
 >
 > --- ./arch/i386/kernel/syscall_table.S.ubsys	2006-07-10 12:39:10.000000000 +0400
 > +++ ./arch/i386/kernel/syscall_table.S	2006-07-31 14:36:59.000000000 +0400
 > @@ -317,3 +317,6 @@ ENTRY(sys_call_table)
 >  	.long sys_vmsplice
 >  	.long sys_move_pages
 >  	.long sys_getcpu
 > +	.long sys_getluid
 > +	.long sys_setluid
 > +	.long sys_setublimit		/* 320 */
 > --- ./arch/ia64/kernel/entry.S.ubsys	2006-07-10 12:39:10.000000000 +0400
 > +++ ./arch/ia64/kernel/entry.S	2006-07-31 15:25:36.000000000 +0400
 > @@ -1610,5 +1610,8 @@ sys_call_table:
 >  	data8 sys_sync_file_range		// 1300
 >  	data8 sys_tee
 >  	data8 sys_vmsplice
 > +	daat8 sys_getluid
 > +	data8 sys_setluid
 > +	data8 sys_setublimit			// 1305
 >
 >  	.org sys_call_table + 8*NR_syscalls	// guard against failures to increase NR_syscalls
 > --- ./arch/sparc/kernel/systbls.S.arsys	2006-07-10 12:39:10.000000000 +0400
 > +++ ./arch/sparc/kernel/systbls.S	2006-08-10 17:07:15.000000000 +0400
 > @@ -78,7 +78,7 @@ sys_call_table:
 >  /*285*/	.long sys_mkdirat, sys_mknodat, sys_fchownat, sys_futimesat, sys_fstatat64
 >  /*290*/	.long sys_unlinkat, sys_renameat, sys_linkat, sys_symlinkat, sys_readlinkat
 >  /*295*/	.long sys_fchmodat, sys_faccessat, sys_pselect6, sys_ppoll, sys_unshare
 > -/*300*/	.long sys_set_robust_list, sys_get_robust_list
 > +/*300*/	.long sys_set_robust_list, sys_get_robust_list, sys_getluid, sys_setluid, sys_setublimit
 >
 >  #ifdef CONFIG_SUNOS_EMUL
 >  	/* Now the SunOS syscall table. */
 > --- ./arch/sparc64/kernel/systbls.S.arsys	2006-07-10 12:39:11.000000000 +0400
 > +++ ./arch/sparc64/kernel/systbls.S	2006-08-10 17:08:52.000000000 +0400
 > @@ -79,7 +79,7 @@ sys_call_table32:
 >  	.word sys_mkdirat, sys_mknodat, sys_fchownat, compat_sys_futimesat, compat_sys_fstatat64
 >  /*290*/	.word sys_unlinkat, sys_renameat, sys_linkat, sys_symlinkat, sys_readlinkat
 >  	.word sys_fchmodat, sys_faccessat, compat_sys_pselect6, compat_sys_ppoll, sys_unshare
 > -/*300*/	.word compat_sys_set_robust_list, compat_sys_get_robust_list
 > +/*300*/	.word compat_sys_set_robust_list, compat_sys_get_robust_list, sys_getluid, sys_setluid, sys_setublimit
 >
 >  #endif /* CONFIG_COMPAT */
 >
 > --- ./include/asm-i386/unistd.h.ubsys	2006-07-10 12:39:19.000000000 +0400
 > +++ ./include/asm-i386/unistd.h	2006-07-31 15:56:31.000000000 +0400
 > @@ -323,10 +323,13 @@
 >  #define __NR_vmsplice		316
 >  #define __NR_move_pages		317
 >  #define __NR_getcpu		318
 > +#define __NR_getluid		319
 > +#define __NR_setluid		320
 > +#define __NR_setublimit		321
 >
 >  #ifdef __KERNEL__
 >
 > -#define NR_syscalls 318
 > +#define NR_syscalls 322
 >  #include <linux/err.h>
 >
 >  /*
 > --- ./include/asm-ia64/unistd.h.ubsys	2006-07-10 12:39:19.000000000 +0400
 > +++ ./include/asm-ia64/unistd.h	2006-07-31 15:57:23.000000000 +0400
 > @@ -291,11 +291,14 @@
 >  #define __NR_sync_file_range		1300
 >  #define __NR_tee			1301
 >  #define __NR_vmsplice			1302
 > +#define __NR_getluid			1303
 > +#define __NR_setluid			1304
 > +#define __NR_setublimit			1305
 >
 >  #ifdef __KERNEL__
 >
 >
 > -#define NR_syscalls			279 /* length of syscall table */
 > +#define NR_syscalls			282 /* length of syscall table */
 >
 >  #define __ARCH_WANT_SYS_RT_SIGACTION
 >
 > --- ./include/asm-powerpc/systbl.h.arsys	2006-07-10 12:39:19.000000000 +0400
 > +++ ./include/asm-powerpc/systbl.h	2006-08-10 17:05:53.000000000 +0400
 > @@ -304,3 +304,6 @@ SYSCALL_SPU(fchmodat)
 >  SYSCALL_SPU(faccessat)
 >  COMPAT_SYS_SPU(get_robust_list)
 >  COMPAT_SYS_SPU(set_robust_list)
 > +SYSCALL(sys_getluid)
 > +SYSCALL(sys_setluid)
 > +SYSCALL(sys_setublimit)
 > --- ./include/asm-powerpc/unistd.h.arsys	2006-07-10 12:39:19.000000000 +0400
 > +++ ./include/asm-powerpc/unistd.h	2006-08-10 17:06:28.000000000 +0400
 > @@ -323,10 +323,13 @@
 >  #define __NR_faccessat		298
 >  #define __NR_get_robust_list	299
 >  #define __NR_set_robust_list	300
 > +#define __NR_getluid		301
 > +#define __NR_setluid		302
 > +#define __NR_setublimit		303
 >
 >  #ifdef __KERNEL__
 >
 > -#define __NR_syscalls		301
 > +#define __NR_syscalls		304
 >
 >  #define __NR__exit __NR_exit
 >  #define NR_syscalls	__NR_syscalls
 > --- ./include/asm-sparc/unistd.h.arsys	2006-07-10 12:39:19.000000000 +0400
 > +++ ./include/asm-sparc/unistd.h	2006-08-10 17:08:19.000000000 +0400
 > @@ -318,6 +318,9 @@
 >  #define __NR_unshare		299
 >  #define __NR_set_robust_list	300
 >  #define __NR_get_robust_list	301
 > +#define __NR_getluid		302
 > +#define __NR_setluid		303
 > +#define __NR_setublimit		304
 >
 >  #ifdef __KERNEL__
 >  /* WARNING: You MAY NOT add syscall numbers larger than 301, since
 > --- ./include/asm-sparc64/unistd.h.arsys	2006-07-10 12:39:19.000000000 +0400
 > +++ ./include/asm-sparc64/unistd.h	2006-08-10 17:09:24.000000000 +0400
 > @@ -320,6 +320,9 @@
 >  #define __NR_unshare		299
 >  #define __NR_set_robust_list	300
 >  #define __NR_get_robust_list	301
 > +#define __NR_getluid		302
 > +#define __NR_setluid		303
 > +#define __NR_setublimit		304
 >
 >  #ifdef __KERNEL__
 >  /* WARNING: You MAY NOT add syscall numbers larger than 301, since
 > --- ./include/asm-x86_64/unistd.h.ubsys	2006-07-10 12:39:19.000000000 +0400
 > +++ ./include/asm-x86_64/unistd.h	2006-07-31 16:00:01.000000000 +0400
 > @@ -619,10 +619,16 @@ __SYSCALL(__NR_sync_file_range, sys_sync
 >  __SYSCALL(__NR_vmsplice, sys_vmsplice)
 >  #define __NR_move_pages		279
 >  __SYSCALL(__NR_move_pages, sys_move_pages)
 > +#define __NR_getluid		280
 > +__SYSCALL(__NR_getluid, sys_getluid)
 > +#define __NR_setluid		281
 > +__SYSCALL(__NR_setluid, sys_setluid)
 > +#define __NR_setublimit		282
 > +__SYSCALL(__NR_setublimit, sys_setublimit)
 >
 >  #ifdef __KERNEL__
 >
 > -#define __NR_syscall_max __NR_move_pages
 > +#define __NR_syscall_max __NR_setublimit
 >  #include <linux/err.h>
 >
 >  #ifndef __NO_STUBS
 > --- ./kernel/ub/Makefile.ubsys	2006-07-28 14:08:37.000000000 +0400
 > +++ ./kernel/ub/Makefile	2006-08-01 11:08:39.000000000 +0400
 > @@ -6,3 +6,4 @@
 >
 >  obj-$(CONFIG_USER_RESOURCE) += beancounter.o
 >  obj-$(CONFIG_USER_RESOURCE) += misc.o
 > +obj-y += sys.o
 > --- ./kernel/ub/sys.c.ubsys	2006-07-28 18:52:18.000000000 +0400
 > +++ ./kernel/ub/sys.c	2006-08-03 16:14:23.000000000 +0400
 > @@ -0,0 +1,126 @@
 > +/*
 > + *  kernel/ub/sys.c
 > + *
 > + *  Copyright (C) 2006 OpenVZ. SWsoft Inc
 > + *
 > + */
 > +
 > +#include <linux/config.h>
 > +#include <linux/sched.h>
 > +#include <asm/uaccess.h>
 > +
 > +#include <ub/beancounter.h>
 > +#include <ub/task.h>
 > +
 > +#ifndef CONFIG_USER_RESOURCE
 
 Get rid of the #ifdef since this file should only be compiled if
 CONFIG_USER_RESOURCE=y anyway.
 
 > +asmlinkage long sys_getluid(void)
 > +{
 > +	return -ENOSYS;
 > +}
 > +
 > +asmlinkage long sys_setluid(uid_t uid)
 > +{
 > +	return -ENOSYS;
 > +}
 > +
 > +asmlinkage long sys_setublimit(uid_t uid, unsigned long resource,
 > +		unsigned long *limits)
 > +{
 > +	return -ENOSYS;
 > +}
 
 Looks to me like you want to add:
 
 cond_syscall(sys_getluid);
 ...
 
 in kernel/sys_ni.c and then you won't have to worry about making these
 empty functions.
 
 > +#else /* CONFIG_USER_RESOURCE */
 > +
 > +/*
 > + *	The (rather boring) getluid syscall
 > + */
 > +asmlinkage long sys_getluid(void)
 > +{
 > +	struct user_beancounter *ub;
 > +
 > +	ub = get_exec_ub();
 > +	if (ub == NULL)
 > +		return -EINVAL;
 > +
 > +	return ub->ub_uid;
 > +}
 > +
 > +/*
 > + *	The setluid syscall
 > + */
 > +asmlinkage long sys_setluid(uid_t uid)
 > +{
 > +	int error;
 > +	struct user_beancounter *ub;
 > +	struct task_beancounter *task_bc;
 > +
 > +	task_bc = ¤t->task_bc;
 > +
 > +	/* You may not disown a setluid */
 > +	error = -EINVAL;
 > +	if (uid == (uid_t)-1)
 > +		goto out;
 > +
 > +	/* You may only set an ub as root */
 > +	error = -EPERM;
 > +	if (!capable(CAP_SETUID))
 > +		goto out;
 
 With resource groups you don't necessarily have to be root -- just the
 owner of the group and task.
 
 Filesystems and appropriate share representations offer a way to give
 regular users the ability to manage their resources without requiring
 CAP_FOO.
 
 > +	/* Ok - set up a beancounter entry for this user */
 > +	error = -ENOBUFS;
 > +	ub = beancounter_findcreate(uid, NULL, UB_ALLOC);
 > +	if (ub == NULL)
 > +		goto out;
 > +
 > +	/* install bc */
 > +	put_beancounter(task_bc->exec_ub);
 > +
...
 
 
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [RFC][PATCH 3/7] UBC: ub context and inheritance [message #5350 is a reply to message #5198] | Fri, 18 August 2006 02:42   |  
			| 
				
				
					|  Matt Helsley Messages: 86
 Registered: August 2006
 | Member |  |  |  
	| On Wed, 2006-08-16 at 19:38 +0400, Kirill Korotaev wrote: > Contains code responsible for setting UB on task,
 > it's inheriting and setting host context in interrupts.
 >
 > Task references three beancounters:
 >   1. exec_ub  current context. all resources are
 >               charged to this beancounter.
 
 nit: 2-3 below seem to contradict "all". If you mean "the rest" then
 perhaps you ought to reorder these:
 
 1. task_ub ...
 2. fork_sub ...
 3. exec_ub Current context. Resources not charged to task_ub
 or fork_sub are charged to this beancounter.
 
 >   2. task_ub  beancounter to which task_struct is
 >               charged itself.
 
 Is task_ub frequently the parent beancounter of exec_ub? If it's always
 the parent then perhaps the one or more of these _ub fields in the task
 struct are not necessary. Also in that case keeping copies of the
 "parent" user_beancounter pointers in the task_beancounters would seem
 bug-prone -- if the hierarchy of beancounters changes then these would
 need to be changed too.
 
 >   3. fork_sub beancounter which is inherited by
 >               task's children on fork
 
 Is this frequently the same as exec_ub?
 
 > Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
 > Signed-Off-By: Kirill Korotaev <dev@sw.ru>
 >
 > ---
 >  include/linux/sched.h   |    5 +++++
 >  include/ub/task.h       |   42 ++++++++++++++++++++++++++++++++++++++++++
 >  kernel/fork.c           |   21 ++++++++++++++++-----
 >  kernel/irq/handle.c     |    9 +++++++++
 >  kernel/softirq.c        |    8 ++++++++
 >  kernel/ub/Makefile      |    1 +
 >  kernel/ub/beancounter.c |    4 ++++
 >  kernel/ub/misc.c        |   34 ++++++++++++++++++++++++++++++++++
 >  8 files changed, 119 insertions(+), 5 deletions(-)
 >
 > --- ./include/linux/sched.h.ubfork	2006-07-17 17:01:12.000000000 +0400
 > +++ ./include/linux/sched.h	2006-07-31 16:01:54.000000000 +0400
 > @@ -81,6 +81,8 @@ struct sched_param {
 >  #include <linux/timer.h>
 >  #include <linux/hrtimer.h>
 >
 > +#include <ub/task.h>
 > +
 >  #include <asm/processor.h>
 >
 >  struct exec_domain;
 > @@ -997,6 +999,9 @@ struct task_struct {
 >  	spinlock_t delays_lock;
 >  	struct task_delay_info *delays;
 >  #endif
 > +#ifdef CONFIG_USER_RESOURCE
 > +	struct task_beancounter	task_bc;
 > +#endif
 >  };
 >
 >  static inline pid_t process_group(struct task_struct *tsk)
 > --- ./include/ub/task.h.ubfork	2006-07-28 18:53:52.000000000 +0400
 > +++ ./include/ub/task.h	2006-08-01 15:26:08.000000000 +0400
 > @@ -0,0 +1,42 @@
 > +/*
 > + *  include/ub/task.h
 > + *
 > + *  Copyright (C) 2006 OpenVZ. SWsoft Inc
 > + *
 > + */
 > +
 > +#ifndef __UB_TASK_H_
 > +#define __UB_TASK_H_
 > +
 > +#include <linux/config.h>
 > +
 > +struct user_beancounter;
 > +
 > +struct task_beancounter {
 > +	struct user_beancounter *exec_ub;
 > +	struct user_beancounter *task_ub;
 > +	struct user_beancounter *fork_sub;
 > +};
 > +
 > +#ifdef CONFIG_USER_RESOURCE
 > +#define get_exec_ub()		(current->task_bc.exec_ub)
 > +#define set_exec_ub(newub)			\
 > +	({					\
 > +		 struct user_beancounter *old;	\
 > +		 struct task_beancounter *tbc;	\
 > +		 tbc = ¤t->task_bc;	\
 > +		 old = tbc->exec_ub;		\
 > +		 tbc->exec_ub = newub;		\
 > +		 old;				\
 > +	 })
 > +
 
 How about making these static inlines?
 
 > +int ub_task_charge(struct task_struct *parent, struct task_struct *new);
 > +void ub_task_uncharge(struct task_struct *tsk);
 > +
 > +#else /* CONFIG_USER_RESOURCE */
 > +#define get_exec_ub()		(NULL)
 > +#define set_exec_ub(__ub)	(NULL)
 > +#define ub_task_charge(p, t)	(0)
 > +#define ub_task_uncharge(t)	do { } while (0)
 > +#endif /* CONFIG_USER_RESOURCE */
 > +#endif /* __UB_TASK_H_ */
 > --- ./kernel/irq/handle.c.ubirq	2006-07-10 12:39:20.000000000 +0400
 > +++ ./kernel/irq/handle.c	2006-08-01 12:39:34.000000000 +0400
 > @@ -16,6 +16,9 @@
 >  #include <linux/interrupt.h>
 >  #include <linux/kernel_stat.h>
 >
 > +#include <ub/beancounter.h>
 > +#include <ub/task.h>
 > +
 >  #include "internals.h"
 >
 >  /**
 > @@ -166,6 +169,9 @@ fastcall unsigned int __do_IRQ(unsigned
 >  	struct irq_desc *desc = irq_desc + irq;
 >  	struct irqaction *action;
 >  	unsigned int status;
 > +	struct user_beancounter *ub;
 > +
 > +	ub = set_exec_ub(&ub0);
 
 Perhaps a comment: "/* Don't charge resources gained in interrupts to current */
 
 >  	kstat_this_cpu.irqs[irq]++;
 >  	if (CHECK_IRQ_PER_CPU(desc->status)) {
 > @@ -178,6 +184,8 @@ fastcall unsigned int __do_IRQ(unsigned
 >  			desc->chip->ack(irq);
 >  		action_ret = handle_IRQ_event(irq, regs, desc->action);
 >  		desc->chip->end(irq);
 > +
 > +		(void) set_exec_ub(ub);
 >  		return 1;
 >  	}
 >
 > @@ -246,6 +254,7 @@ out:
 >  	desc->chip->end(irq);
 >  	spin_unlock(&desc->lock);
 >
 > +	(void) set_exec_ub(ub);
 
 
 Seems like a WARN_ON() would be appropriate rather than ignoring the
 return code.
 
 >  	return 1;
 >  }
 >
 > --- ./kernel/softirq.c.ubirq	2006-07-17 17:01:12.000000000 +0400
 > +++ ./kernel/softirq.c	2006-08-01 12:40:44.000000000 +0400
 > @@ -18,6 +18,9 @@
 >  #include <linux/rcupdate.h>
 >  #include <linux/smp.h>
 >
 > +#include <ub/beancounter.h>
 > +#include <ub/task.h>
 > +
 >  #include <asm/irq.h>
 >  /*
 >     - No shared variables, all the data are CPU local.
 > @@ -191,6 +194,9 @@ asmlinkage void __do_softirq(void)
 >  	__u32 pending;
 >  	int max_restart = MAX_SOFTIRQ_RESTART;
 >  	int cpu;
 > +	struct user_beancounter *ub;
 > +
 > +	ub = set_exec_ub(&ub0);
 
 Perhaps add the same comment...
 
 >  	pending = local_softirq_pending();
 >  	account_system_vtime(current);
 > @@ -229,6 +235,8 @@ restart:
 >
 >  	account_system_vtime(current);
 >  	_local_bh_enable();
 > +
 > +	(void) set_exec_ub(ub);
 
 .. and the same WARN_ON.
 
 >  }
 >
 >  #ifndef __ARCH_HAS_DO_SOFTIRQ
 > --- ./kernel/fork.c.ubfork	2006-07-17 17:01:12.000000000 +0400
 > +++ ./kernel/fork.c	2006-08-01 12:58:36.000000000 +0400
 > @@ -46,6 +46,8 @@
 >  #include <linux/delayacct.h>
 >  #include <linux/taskstats_kern.h>
 >
 > +#include <ub/task.h>
 > +
 >  #include <asm/pgtable.h>
 >  #include <asm/pgalloc.h>
 >  #include <asm/uaccess.h>
 > @@ -102,6 +104,7 @@ static kmem_cache_t *mm_cachep;
 >
 >  void free_task(struct task_struct *tsk)
 >  {
 > +	ub_task_uncharge(tsk);
 >  	free_thread_info(tsk->thread_info);
 >  	rt_mutex_debug_task_free(tsk);
 >  	free_task_struct(tsk);
 > @@ -162,18 +165,19 @@ static struct task_struct *dup_task_stru
 >
 >  	tsk = alloc_task_struct();
 >  	if (!tsk)
 > -		return NULL;
 > +		goto out;
 >
 >  	ti = alloc_thread_info(tsk);
 > -	if (!ti) {
 > -		free_task_struct(tsk);
 > -		return NULL;
 > -	}
 > +	if (!ti)
 > +		goto out_tsk;
 >
 >  	*tsk = *orig;
 >  	tsk->thread_info = ti;
 >  	setup_thread_stack(tsk, orig);
 >
 > +	if (ub_task_charge(orig, tsk))
 > +		goto out_ti;
 > +
 >  	/* One for us, one for whoever does the "release_task()" (usually parent) */
 >  	atomic_set(&tsk->usage,2);
 >  	atomic_set(&tsk->fs_excl, 0);
 > @@ -180,6 +184,13 @@ static struct task_struct *dup_task_stru
 >  #endif
 >  	tsk->splice_pipe = NULL;
 >  	return tsk;
 > +
 > +out_ti:
 > +	free_thread_info(ti);
 > +out_tsk:
 > +	free_task_struct(tsk);
 > +out:
 > +	return NULL;
 
 Ugh. This is starting to look like copy_process(). Any reason you
 couldn't move the bean counter bits to copy_process() instead?
 
 >  }
 >
 >  #ifdef CONFIG_MMU
 > --- ./kernel/ub/Makefile.ubcore	2006-08-03 16:24:56.000000000 +0400
 > +++ ./kernel/ub/Makefile	2006-08-01 11:08:39.000000000 +0400
 > @@ -5,3 +5,4 @@
 >  #
 >
 >  obj-$(CONFIG_USER_RESOURCE) += beancounter.o
 > +obj-$(CONFIG_USER_RESOURCE) += misc.o
 > --- ./kernel/ub/beancounter.c.ubcore	2006-07-28 13:07:44.000000000 +0400
 > +++ ./kernel/ub/beancounter.c	2006-08-03 16:14:17.000000000 +0400
 > @@ -395,6 +395,10 @@
 >  	spin_lock_init(&ub_hash_lock);
 >  	slot = &ub_hash[ub_hash_fun(ub->ub_uid)];
 >  	hlist_add_head(&ub->hash, slot);
 > +
 > +	current->task_bc.exec_ub = ub;
 > +	current->task_bc.task_ub = get_beancounter(ub);
 > +	current->task_bc.fork_sub = get_beancounter(ub);
 >  }
 >
 >  void __init ub_init_late(void)
 > --- ./kernel/ub/misc.c.ubfork	2006-07-31 16:23:44.000000000 +0400
 > +++ ./kernel/ub/misc.c	2006-07-31 16:28:47.000000000 +0400
 > @@ -0,0 +1,34 @@
 > +/*
 > + * kernel/ub/misc.c
 > + *
 > + * Copyright (C) 2006 OpenVZ. SWsoft Inc.
 > + *
 > + */
 > +
 > +#include <linux/sched.h>
 > +
 > +#include <ub/beancounter.h>
 > +#include <ub/task.h>
 > +
 > +int ub_task_charge(struct task_struct *parent, struct task_struct *new)
 > +{
 
 parent could be derived from new if you move the charge to copy_process
 instead of dup_task_struct.
 
 > +	struct task_beancounter *old_bc;
 > +	struct task_beancounter *new_bc;
 > +	struct user_beancounter *ub;
 > +
 > +	old_bc = &parent->task_bc;
 > +	new_bc = &new->task_bc;
 > +
 > +	ub = old_bc->fork_sub;
 > +	new_bc->exec_ub = get_beancounter(ub);
 > +	new_bc->task_ub = get_beancounter(ub);
 > +	new_bc->fork_sub = get_beancounter(ub);
 > +	return 0;
 > +}
 > +
 > +void ub_task_uncharge(struct task_struct *tsk)
 > +{
 > +	put_beancounter(tsk->task_bc.exec_ub);
 > +	put_beancounter(tsk->task_bc.task_ub);
 > +	put_beancounter(tsk->task_bc.fork_sub);
 > +}
 >  ------------------------------------------------------------ -------------
 > Using Tomcat but need to do more? Need to support web services, security?
 > Get stuff done quickly with pre-integrated technology to make your job easier
 > Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
 >  http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&b id=263057&dat=121642
 > _______________________________________________
 > ckrm-tech mailing list
 > https://lists.sourceforge.net/lists/listinfo/ckrm-tech
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [PATCH 4/7] UBC: syscalls (user interface) [message #5357 is a reply to message #5317] | Fri, 18 August 2006 08:08   |  
			| 
				
				
					|  Andrey Savochkin Messages: 47
 Registered: December 2005
 | Member |  |  |  
	| Hi, 
 On Thu, Aug 17, 2006 at 08:40:33AM -0700, Andrew Morton wrote:
 > On Thu, 17 Aug 2006 16:13:30 +0400
 > Kirill Korotaev <dev@sw.ru> wrote:
 >
 > > > I was more thinking about (for example) user land physical memory limit
 > > > for that bean counter.  If the limits are going down, then the system
 > > > call should try to flush out page cache pages or swap out anonymous
 > > > memory.  But you are right that it won't be possible in all cases, like
 > > > for in kernel memory limits.
 > > Such kind of memory management is less efficient than the one
 > > making decisions based on global shortages and global LRU alogrithm.
 >
 > I also was quite surprised that openvz appears to have no way of
 > constraining a container's memory usage.  "I want to run this bunch of
 > processes in a 4.5GB container".
 
 I'd like to share my view on the subject of memory usage limiting.
 
 The task of limiting a container to 4.5GB of memory bottles down to the
 question: what to do when the container starts to use more than assigned
 4.5GB of memory?
 
 At this moment there are only 3 viable alternatives.
 
 A) Have separate memory management for each container,
 with separate buddy allocator, lru lists, page replacement mechanism.
 That implies a considerable overhead, and the main challenge there
 is sharing of pages between these separate memory managers.
 
 B) Return errors on extension of mappings, but not on page faults, where
 memory is actually consumed.
 In this case it makes sense to take into account not only the size of used
 memory, but the size of created mappings as well.
 This is approximately what "privvmpages" accounting/limiting provides in
 UBC.
 
 C) Rely on OOM killer.
 This is a fall-back method in UBC, for the case "privvmpages" limits
 still leave the possibility to overload the system.
 
 It would be nice, indeed, to invent something new.
 The ideal mechanism would
 - slow down the container over-using memory, to signal the user that
 he is over his limits,
 - at the same time this slowdown shouldn't lead to the increase of memory
 usage: for example, a simple slowdown of apache web server would lead
 to the growth of the number of serving children and consumption of more
 memory while showing worse performance,
 - and, at the same time, it shouldn't penalize the rest of the system from
 the performance point of view...
 May be this can be achieved via carefully tuned swapout mechanism together
 with disk bandwidth management capable of tracking asynchronous write
 requests, may be something else is required.
 It's really a big challenge.
 
 Meanwhile, I guess we can only make small steps in improving Linux resource
 management features for this moment.
 
 Best regards
 
 Andrey
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [RFC][PATCH 5/7] UBC: kernel memory accounting	(core) [message #5359 is a reply to message #5315] | Fri, 18 August 2006 08:12   |  
			| 
				
				
					|  dev Messages: 1693
 Registered: September 2005
 Location: Moscow
 | Senior Member |  
 |  |  
	| Dave Hansen wrote: > On Thu, 2006-08-17 at 17:31 +0400, Kirill Korotaev wrote:
 >
 >>>How many things actually use this?  Can we have the slab ubcs
 >>
 >>without
 >>
 >>>the struct page pointer?
 >>
 >>slab doesn't use this pointer on the page.
 >>It is used for pages allocated by buddy
 >>alocator implicitly (e.g. LDT pages, page tables, ...).
 >
 >
 > Hmmm.  There aren't _that_ many of those cases, right?  Are there any
 > that absolutely need raw access to the buddy allocator?  I'm pretty sure
 > that pagetables can be moved over to a slab, as long as we bump up the
 > alignment.
 LDT takes from 1 to 16 pages. and is allocated by vmalloc.
 do you propose to replace it with slab which can fail due to memory
 fragmentation?
 
 the same applies to fdset, fdarray, ipc ids and iptables entries.
 
 > It does seem a wee bit silly to have the pointer in _all_ of the struct
 > pages, even the ones for which we will never do any accounting (and even
 > on kernels that never need it).  But, a hashing scheme sounds like a
 > fine idea.
 It seems a silly for you since 2nd patchset accounting user pages
 is not here yet. As you can see we added a union into page,
 which is shared between kernel memory and user memory accounting.
 
 THERE IS NOT USER ACCOUNTING HERE YET GUYS! :) THIS FIELD WILL BE USED!!!
 
 Thanks,
 Kirill
 |  
	|  |  |  
	|  |  
	|  |  
	| 
		
			| Re: [ckrm-tech] [PATCH 2/7] UBC: core (structures, API) [message #5370 is a reply to message #5352] | Fri, 18 August 2006 08:26   |  
			| 
				
				
					|  Matt Helsley Messages: 86
 Registered: August 2006
 | Member |  |  |  
	| On Fri, 2006-08-18 at 11:35 +0400, Andrey Savochkin wrote: > On Thu, Aug 17, 2006 at 10:31:37PM -0700, Andrew Morton wrote:
 > > On Thu, 17 Aug 2006 15:53:40 +0400
 > > Kirill Korotaev <dev@sw.ru> wrote:
 > >
 > > > >>+struct user_beancounter
 > > > >>+{
 > > > >>+	atomic_t		ub_refcount;
 > > > >>+	spinlock_t		ub_lock;
 > > > >>+	uid_t			ub_uid;
 > > > >
 > > > >
 > > > > Why uid?  Will it be possible to club processes belonging to different
 > > > > users to same bean counter.
 > > > oh, its a misname. Should be ub_id. it is ID of user_beancounter
 > > > and has nothing to do with user id.
 > >
 > > But it uses a uid_t.  That's more than a misnaming?
 >
 > It used to be uid-related in ancient times when the notion of container
 > hadn't formed up.
 > "user" part of user_beancounter name has the same origin :)
 
 Is it similarly irrelevant now? If so perhaps a big rename could be used
 to make the names clearer (s/user_//, s/ub_/bc_/, ...).
 
 <snip>
 
 Cheers,
 -Matt Helsley
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [RFC][PATCH 3/7] UBC: ub context and inheritance [message #5371 is a reply to message #5350] | Fri, 18 August 2006 09:21   |  
			| 
				
				
					|  dev Messages: 1693
 Registered: September 2005
 Location: Moscow
 | Senior Member |  
 |  |  
	| Matt Helsley wrote: > On Wed, 2006-08-16 at 19:38 +0400, Kirill Korotaev wrote:
 >
 >>Contains code responsible for setting UB on task,
 >>it's inheriting and setting host context in interrupts.
 >>
 >>Task references three beancounters:
 >>  1. exec_ub  current context. all resources are
 >>              charged to this beancounter.
 >
 >
 > nit: 2-3 below seem to contradict "all". If you mean "the rest" then
 > perhaps you ought to reorder these:
 >
 > 1. task_ub ...
 > 2. fork_sub ...
 > 3. exec_ub Current context. Resources not charged to task_ub
 >            or fork_sub are charged to this beancounter.
 not sure what you mean.
 task_ub - where _task_ _itself_ is charged as an object.
 following patches will add charging of "number of tasks" using it.
 fork_sub - beancounter which is inherited on fork() (chaning task beancounter).
 exec_ub - is current context.
 
 
 >>  2. task_ub  beancounter to which task_struct is
 >>              charged itself.
 >
 >
 > Is task_ub frequently the parent beancounter of exec_ub? If it's always
 > the parent then perhaps the one or more of these _ub fields in the task
 > struct are not necessary.
 no, task_ub != exec_ub of parent task
 when task is created anything can happen: task can change ub, parent can change ub,
 task can be reparented. But the UB we charged task to should be known.
 
 > Also in that case keeping copies of the
 > "parent" user_beancounter pointers in the task_beancounters would seem
 > bug-prone -- if the hierarchy of beancounters changes then these would
 > need to be changed too.
 >
 >
 >>  3. fork_sub beancounter which is inherited by
 >>              task's children on fork
 >
 >
 > Is this frequently the same as exec_ub?
 frequently, but not always. exec_ub is changed in softirq for example.
 consider exec_ub as 'current' pointer in kernel.
 
 see other comments below
 
 >>Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
 >>Signed-Off-By: Kirill Korotaev <dev@sw.ru>
 >>
 >>---
 >> include/linux/sched.h   |    5 +++++
 >> include/ub/task.h       |   42 ++++++++++++++++++++++++++++++++++++++++++
 >> kernel/fork.c           |   21 ++++++++++++++++-----
 >> kernel/irq/handle.c     |    9 +++++++++
 >> kernel/softirq.c        |    8 ++++++++
 >> kernel/ub/Makefile      |    1 +
 >> kernel/ub/beancounter.c |    4 ++++
 >> kernel/ub/misc.c        |   34 ++++++++++++++++++++++++++++++++++
 >> 8 files changed, 119 insertions(+), 5 deletions(-)
 >>
 >>--- ./include/linux/sched.h.ubfork	2006-07-17 17:01:12.000000000 +0400
 >>+++ ./include/linux/sched.h	2006-07-31 16:01:54.000000000 +0400
 >>@@ -81,6 +81,8 @@ struct sched_param {
 >> #include <linux/timer.h>
 >> #include <linux/hrtimer.h>
 >>
 >>+#include <ub/task.h>
 >>+
 >> #include <asm/processor.h>
 >>
 >> struct exec_domain;
 >>@@ -997,6 +999,9 @@ struct task_struct {
 >> 	spinlock_t delays_lock;
 >> 	struct task_delay_info *delays;
 >> #endif
 >>+#ifdef CONFIG_USER_RESOURCE
 >>+	struct task_beancounter	task_bc;
 >>+#endif
 >> };
 >>
 >> static inline pid_t process_group(struct task_struct *tsk)
 >>--- ./include/ub/task.h.ubfork	2006-07-28 18:53:52.000000000 +0400
 >>+++ ./include/ub/task.h	2006-08-01 15:26:08.000000000 +0400
 >>@@ -0,0 +1,42 @@
 >>+/*
 >>+ *  include/ub/task.h
 >>+ *
 >>+ *  Copyright (C) 2006 OpenVZ. SWsoft Inc
 >>+ *
 >>+ */
 >>+
 >>+#ifndef __UB_TASK_H_
 >>+#define __UB_TASK_H_
 >>+
 >>+#include <linux/config.h>
 >>+
 >>+struct user_beancounter;
 >>+
 >>+struct task_beancounter {
 >>+	struct user_beancounter *exec_ub;
 >>+	struct user_beancounter *task_ub;
 >>+	struct user_beancounter *fork_sub;
 >>+};
 >>+
 >>+#ifdef CONFIG_USER_RESOURCE
 >>+#define get_exec_ub()		(current->task_bc.exec_ub)
 >>+#define set_exec_ub(newub)			\
 >>+	({					\
 >>+		 struct user_beancounter *old;	\
 >>+		 struct task_beancounter *tbc;	\
 >>+		 tbc = ¤t->task_bc;	\
 >>+		 old = tbc->exec_ub;		\
 >>+		 tbc->exec_ub = newub;		\
 >>+		 old;				\
 >>+	 })
 >>+
 >
 >
 > How about making these static inlines?
 possible, but this requires including sched.h, which includes this file...
 so this one is easier and more separated.
 
 >>+int ub_task_charge(struct task_struct *parent, struct task_struct *new);
 >>+void ub_task_uncharge(struct task_struct *tsk);
 >>+
 >>+#else /* CONFIG_USER_RESOURCE */
 >>+#define get_exec_ub()		(NULL)
 >>+#define set_exec_ub(__ub)	(NULL)
 >>+#define ub_task_charge(p, t)	(0)
 >>+#define ub_task_uncharge(t)	do { } while (0)
 >>+#endif /* CONFIG_USER_RESOURCE */
 >>+#endif /* __UB_TASK_H_ */
 >>--- ./kernel/irq/handle.c.ubirq	2006-07-10 12:39:20.000000000 +0400
 >>+++ ./kernel/irq/handle.c	2006-08-01 12:39:34.000000000 +0400
 >>@@ -16,6 +16,9 @@
 >> #include <linux/interrupt.h>
 >> #include <linux/kernel_stat.h>
 >>
 >>+#include <ub/beancounter.h>
 >>+#include <ub/task.h>
 >>+
 >> #include "internals.h"
 >>
 >> /**
 >>@@ -166,6 +169,9 @@ fastcall unsigned int __do_IRQ(unsigned
 >> 	struct irq_desc *desc = irq_desc + irq;
 >> 	struct irqaction *action;
 >> 	unsigned int status;
 >>+	struct user_beancounter *ub;
 >>+
 >>+	ub = set_exec_ub(&ub0);
 >
 >
 > Perhaps a comment: "/* Don't charge resources gained in interrupts to current */
 ok, will add comment:
 /* UBC charges should be done to host system */
 >
 >
 >> 	kstat_this_cpu.irqs[irq]++;
 >> 	if (CHECK_IRQ_PER_CPU(desc->status)) {
 >>@@ -178,6 +184,8 @@ fastcall unsigned int __do_IRQ(unsigned
 >> 			desc->chip->ack(irq);
 >> 		action_ret = handle_IRQ_event(irq, regs, desc->action);
 >> 		desc->chip->end(irq);
 >>+
 >>+		(void) set_exec_ub(ub);
 >> 		return 1;
 >> 	}
 >>
 >>@@ -246,6 +254,7 @@ out:
 >> 	desc->chip->end(irq);
 >> 	spin_unlock(&desc->lock);
 >>
 >>+	(void) set_exec_ub(ub);
 >
 >
 >
 > Seems like a WARN_ON() would be appropriate rather than ignoring the
 > return code.
 BUG_ON(ret != &ub0) ?
 
 maybe introduce a kind of
 reset_exec_ub(old_ub, expected_current_ub)
 {
 ret = set_exec_ub(old_ub);
 BUG_ON(ret != expected_current_ub);
 }
 ?
 
 
 >> 	return 1;
 >> }
 >>
 >>--- ./kernel/softirq.c.ubirq	2006-07-17 17:01:12.000000000 +0400
 >>+++ ./kernel/softirq.c	2006-08-01 12:40:44.000000000 +0400
 >>@@ -18,6 +18,9 @@
 >> #include <linux/rcupdate.h>
 >> #include <linux/smp.h>
 >>
 >>+#include <ub/beancounter.h>
 >>+#include <ub/task.h>
 >>+
 >> #include <asm/irq.h>
 >> /*
 >>    - No shared variables, all the data are CPU local.
 >>@@ -191,6 +194,9 @@ asmlinkage void __do_softirq(void)
 >> 	__u32 pending;
 >> 	int max_restart = MAX_SOFTIRQ_RESTART;
 >> 	int cpu;
 >>+	struct user_beancounter *ub;
 >>+
 >>+	ub = set_exec_ub(&ub0);
 >
 >
 > Perhaps add the same comment...
 ok
 
 >
 >
 >> 	pending = local_softirq_pending();
 >> 	account_system_vtime(current);
 >>@@ -229,6 +235,8 @@ restart:
 >>
 >> 	account_system_vtime(current);
 >> 	_local_bh_enable();
 >>+
 >>+	(void) set_exec_ub(ub);
 >
 >
 > .. and the same WARN_ON.
 >
 >
 >> }
 >>
 >> #ifndef __ARCH_HAS_DO_SOFTIRQ
 >>--- ./kernel/fork.c.ubfork	2006-07-17 17:01:12.000000000 +0400
 >>+++ ./kernel/fork.c	2006-08-01 12:58:36.000000000 +0400
 >>@@ -46,6 +46,8 @@
 >> #include <linux/delayacct.h>
 >> #include <linux/taskstats_kern.h>
 >>
 >>+#include <ub/task.h>
 >>+
 >> #include <asm/pgtable.h>
 >> #include <asm/pgalloc.h>
 >> #include <asm/uaccess.h>
 >>@@ -102,6 +104,7 @@ static kmem_cache_t *mm_cachep;
 >>
 >> void free_task(struct task_struct *tsk)
 >> {
 >>+	ub_task_uncharge(tsk);
 >> 	free_thread_info(tsk->thread_info);
 >> 	rt_mutex_debug_task_free(tsk);
 >> 	free_task_struct(tsk);
 >>@@ -162,18 +165,19 @@ static struct task_struct *dup_task_stru
 >>
 >> 	tsk = alloc_task_struct();
 >> 	if (!tsk)
 >>-		return NULL;
 >>+		goto out;
 >>
 >> 	ti = alloc_thread_info(tsk);
 >>-	if (!ti) {
 >>-		free_task_struct(tsk);
 >>-		return NULL;
 >>-	}
 >>+	if (!ti)
 >>+		goto out_tsk;
 >>
 >> 	*tsk = *orig;
 >> 	tsk->thread_info = ti;
 >> 	setup_thread_stack(tsk, orig);
 >>
 >>+	if (ub_task_charge(orig, tsk))
 >>+		goto out_ti;
 >>+
 >> 	/* One for us, one for whoever does the "release_task()" (usually parent) */
 >> 	atomic_set(&tsk->usage,2);
 >> 	atomic_set(&tsk->fs_excl, 0);
 >>@@ -180,6 +184,13 @@ static struct task_struct *dup_task_stru
 >> #endif
 >> 	tsk->splice_pipe = NULL;
 >> 	return tsk;
 >>+
 >>+out_ti:
 >>+	free_thread_info(ti);
 >>+out_tsk:
 >>+	free_task_struct(tsk);
 >>+out:
 >>+	return NULL;
 >
 >
 > Ugh. This is starting to look like copy_process(). Any reason you
 > couldn't move the bean counter bits to copy_process() instead?
 This is more logical place since we _will_ charge task here
 (next patchset for numproc).
 It is logically better to charge objects in places where
 they are allocated. At the same time we inherit tasks ubs here.
 
 >> }
 >>
 >> #ifdef CONFIG_MMU
 >>--- ./kernel/ub/Makefile.ubcore	2006-08-03 16:24:56.000000000 +0400
 >>+++ ./kernel/ub/Makefile	2006-08-01 11:08:39.000000000 +0400
 >>@@ -5,3 +5,4 @@
 >> #
 >>
 >> obj-$(CONFIG_USER_RESOURCE) += beancounter.o
 >>+obj-$(CONFIG_USER_RESOURCE) += misc.o
 >>--- ./kernel/ub/beancounter.c.ubcore	2006-07-28 13:07:44.000000000 +0400
 >>+++ ./kernel/ub/beancounter.c	2006-08-03 16:14:17.000000000 +0400
 >>@@ -395,6 +395,10 @@
 >> 	spin_lock_init(&ub_hash_lock);
 >> 	slot = &ub_hash[ub_hash_fun(ub->ub_uid)];
 >> 	hlist_add_head(&ub->hash, slot);
 >>+
 >>+	current->task_bc.exec_ub = ub;
 >>+	current->task_bc.task_ub = get_beancounter(ub);
 >>+	current->task_bc.fork_sub = get_beancounter(ub);
 >> }
 >>
 >> void __init ub_init_late(void)
 >>--- ./kernel/ub/misc.c.ubfork	2006-07-31 16:23:44.000000000 +0400
 >>+++ ./kernel/ub/misc.c	2006-07-31 16:28:47.000000000 +0400
 >>@@ -0,0 +1,34 @@
 >>+/*
 >>+ * kernel/ub/misc.c
 >>+ *
 >>+ * Copyright (C) 2006 OpenVZ. SWsoft Inc.
 >>+ *
 >>+ */
 >>+
 >>+#include <linux/sched.h>
 >>+
 >>+#include <ub/beancounter.h>
 >>+#include <ub/task.h>
 >>+
 >>+int ub_task_charge(struct task_struct *parent, struct task_struct *new)
 >>+{
 >
 >
 > parent could be derived from new if you move the charge to copy_process
 > instead of dup_task_struct.
 we can split it into:
 ub_charge_task() in dup_task_struct to account _task_ itself.
 ub_copy_process() in copy_process() to inherit and initialize
 exec_ub and fork_sub
 
 what do you think?
 
 >>+	struct task_beancounter *old_bc;
 >>+	struct task_beancounter *new_bc;
 >>+	struct user_beancounter *ub;
 >>+
 >>+	old_bc = &parent->task_bc;
 >>+	new_bc = &new->task_bc;
 >>+
 >>+	ub = old_bc->fork_sub;
 >>+	new_bc->exec_ub = get_beancounter(ub);
 >>+	new_bc->task_ub = get_beancounter(ub);
 >>+	new_bc->fork_sub = get_beancounter(ub);
 >>+	return 0;
 >>+}
 >>+
 >>+void ub_task_uncharge(struct task_struct *tsk)
 >>+{
 >>+	put_beancounter(tsk->task_bc.exec_ub);
 >>+	put_beancounter(tsk->task_bc.task_ub);
 >>+	put_beancounter(tsk->task_bc.fork_sub);
 >>+}
 >> ------------------------------------------------------------ -------------
 >>Using Tomcat but need to do more? Need to support web services, security?
 >>Get stuff done quickly with pre-integrated technology to make your job easier
 >>Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
 >> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&b id=263057&dat=121642
 >>_______________________________________________
 >>ckrm-tech mailing list
 >>https://lists.sourceforge.net/lists/listinfo/ckrm-tech
 >
 >
 >
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [RFC][PATCH 5/7] UBC: kernel memory accounting	(core) [message #5372 is a reply to message #5313] | Fri, 18 August 2006 09:29   |  
			| 
				
				
					|  dev Messages: 1693
 Registered: September 2005
 Location: Moscow
 | Senior Member |  
 |  |  
	| Dave Hansen wrote: > On Thu, 2006-08-17 at 17:27 +0400, Kirill Korotaev wrote:
 >
 >>charged kernel objects can't be _reclaimed_. how do you propose
 >>to reclaim tasks page tables or files or task struct or vma or etc.?
 >
 >
 > Do you have any statistics on which of these objects are the most
 > troublesome?  If it _is_ pagetables, for instance, it is quite
 > conceivable that we could reclaim them.
 they all are troublesome :/
 user can create lots of vmas, w/o page tables.
 lots of fdsets, ipcids.
 These are not reclaimable.
 
 Also consider the following scenario with reclaimable page tables.
 e.g. user hit kmemsize limit due to fat page tables.
 kernel reclaims some of the page tables and frees user kenerl memory.
 after that user creates some uncreclaimable objects like fdsets or ipcs
 and then accesses memory with reclaimed page tables.
 Sooner or later we kill user with SIGSEGV from page fault due to
 no memory. This is worse then returning ENOMEM from poll() or
 mmap() where user allocates kernel objects.
 
 > This one probably deserves a big, fat comment, though. ;)
 tell me where to write it and what? :)
 
 Thanks,
 Kirill
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5383 is a reply to message #5317] | Fri, 18 August 2006 11:03   |  
			| 
				
				
					|  dev Messages: 1693
 Registered: September 2005
 Location: Moscow
 | Senior Member |  
 |  |  
	| Andrew Morton wrote: > On Thu, 17 Aug 2006 16:13:30 +0400
 > Kirill Korotaev <dev@sw.ru> wrote:
 >
 >
 >>>I was more thinking about (for example) user land physical memory limit
 >>>for that bean counter.  If the limits are going down, then the system
 >>>call should try to flush out page cache pages or swap out anonymous
 >>>memory.  But you are right that it won't be possible in all cases, like
 >>>for in kernel memory limits.
 >>
 >>Such kind of memory management is less efficient than the one
 >>making decisions based on global shortages and global LRU alogrithm.
 >
 >
 > I also was quite surprised that openvz appears to have no way of
 > constraining a container's memory usage.  "I want to run this bunch of
 > processes in a 4.5GB container".
 If you mean user memory, then it is possible to set
 container limits to 4,5GB. This is what most people care about
 and it is not a problem.
 
 Or you mean that you are suprised there are lots of parameters
 and there is no a single one limiting the _whole_ memory set of container
 memory (sum of kernel memory, user space memory and other resources memory)?
 
 >>The problem here is that doing swap out takes more expensive disk I/O
 >>influencing other users.
 >
 >
 > A well-set-up container would presumably be working against its own
 > spindle(s).  If the operator has gone to all the trouble of isolating a job
 > from the system's other jobs, he'd be pretty dumb to go and let all the
 > "isolated" jobs share a stinky-slow resource like a disk.
 why do you assume that it is always an operator who controls the applications
 inside the container?
 users can run any application inside and it is systems job to
 introduce resource isolation between users, not the operators full-time
 job doing monitoring of users.
 
 > But yes, swap is a problem.  To do this properly we'd need a way of saying
 > "this container here uses that swap device over there".
 yep, this is possible with page beancounters as it tracks user pages.
 more over, we have an intention of building a system with a single container
 memory parameter, but we think this is more user interface question and
 still requires all the UBC resources accounting.
 
 Thanks,
 Kirill
 |  
	|  |  |  
	| 
		
			| Re: [RFC][PATCH 2/7] UBC: core (structures, API) [message #5384 is a reply to message #5328] | Fri, 18 August 2006 11:13   |  
			| 
				
				
					|  dev Messages: 1693
 Registered: September 2005
 Location: Moscow
 | Senior Member |  
 |  |  
	| Rohit Seth wrote: > On Thu, 2006-08-17 at 15:53 +0400, Kirill Korotaev wrote:
 >
 >>Rohit Seth wrote:
 >>
 >>>On Wed, 2006-08-16 at 19:37 +0400, Kirill Korotaev wrote:
 >>>
 >>>
 >>>>Core functionality and interfaces of UBC:
 >>>>find/create beancounter, initialization,
 >>>>charge/uncharge of resource, core objects' declarations.
 >>>>
 >>>>Basic structures:
 >>>> ubparm           - resource description
 >>>> user_beancounter - set of resources, id, lock
 >>>>
 >>>>Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
 >>>>Signed-Off-By: Kirill Korotaev <dev@sw.ru>
 >>>>
 >>>>---
 >>>>include/ub/beancounter.h |  157 ++++++++++++++++++
 >>>>init/main.c              |    4
 >>>>kernel/Makefile          |    1
 >>>>kernel/ub/Makefile       |    7
 >>>>kernel/ub/beancounter.c  |  398 +++++++++++++++++++++++++++++++++++++++++++++++
 >>>>5 files changed, 567 insertions(+)
 >>>>
 >>>>--- /dev/null	2006-07-18 14:52:43.075228448 +0400
 >>>>+++ ./include/ub/beancounter.h	2006-08-10 14:58:27.000000000 +0400
 >>>>@@ -0,0 +1,157 @@
 >>>>+/*
 >>>>+ *  include/ub/beancounter.h
 >>>>+ *
 >>>>+ *  Copyright (C) 2006 OpenVZ. SWsoft Inc
 >>>>+ *
 >>>>+ */
 >>>>+
 >>>>+#ifndef _LINUX_BEANCOUNTER_H
 >>>>+#define _LINUX_BEANCOUNTER_H
 >>>>+
 >>>>+/*
 >>>>+ *	Resource list.
 >>>>+ */
 >>>>+
 >>>>+#define UB_RESOURCES	0
 >>>>+
 >>>>+struct ubparm {
 >>>>+	/*
 >>>>+	 * A barrier over which resource allocations are failed gracefully.
 >>>>+	 * e.g. if the amount of consumed memory is over the barrier further
 >>>>+	 * sbrk() or mmap() calls fail, the existing processes are not killed.
 >>>>+	 */
 >>>>+	unsigned long	barrier;
 >>>>+	/* hard resource limit */
 >>>>+	unsigned long	limit;
 >>>>+	/* consumed resources */
 >>>>+	unsigned long	held;
 >>>>+	/* maximum amount of consumed resources through the last period */
 >>>>+	unsigned long	maxheld;
 >>>>+	/* minimum amount of consumed resources through the last period */
 >>>>+	unsigned long	minheld;
 >>>>+	/* count of failed charges */
 >>>>+	unsigned long	failcnt;
 >>>>+};
 >>>
 >>>
 >>>What is the difference between barrier and limit. They both sound like
 >>>hard limits.  No?
 >>
 >>check __charge_beancounter_locked and severity.
 >>It provides some kind of soft and hard limits.
 >>
 >
 >
 > Would be easier to just rename them as soft and hard limits...
 >
 >
 >>>>+
 >>>>+/*
 >>>>+ * Kernel internal part.
 >>>>+ */
 >>>>+
 >>>>+#ifdef __KERNEL__
 >>>>+
 >>>>+#include <linux/config.h>
 >>>>+#include <linux/spinlock.h>
 >>>>+#include <linux/list.h>
 >>>>+#include <asm/atomic.h>
 >>>>+
 >>>>+/*
 >>>>+ * UB_MAXVALUE is essentially LONG_MAX declared in a cross-compiling safe form.
 >>>>+ */
 >>>>+	/* resources statistics and settings */
 >>>>+	struct ubparm		ub_parms[UB_RESOURCES];
 >>>>+};
 >>>>+
 >>>
 >>>
 >>>I presume UB_RESOURCES value is going to change as different resources
 >>>start getting tracked.
 >>
 >>what's wrong with it?
 >>
 >
 >
 > ...just that user land will need to be some how informed about that.
 the same way user space knows that system call is (not) implemented.
 (include unistd.h :))) )
 
 >>>I think something like configfs should be used for user interface.  It
 >>>automatically presents the right interfaces to user land (based on
 >>>kernel implementation).  And you wouldn't need any changes in glibc etc.
 >>
 >>1. UBC doesn't require glibc modificatins.
 >
 >
 > You are right not for setting the limits.  But for adding any new
 > functionality related to containers....as in you just added a new system
 > call to get the limits.
 Do you state that glibc describes _all_ the existing system calls with some wrappers?
 
 >>2. if you think a bit more about it, adding UB parameters doesn't
 >>   require user space changes as well.
 >>3. it is possible to add any kind of interface for UBC. but do you like the idea
 >>   to grep 200(containers)x20(parameters) files for getting current usages?
 >
 >
 > How are you doing it currently and how much more efficient it is in
 > comparison to configfs?
 currently it is done with a single file read.
 you can grep it, sum up resources or do what ever you want from bash.
 what is important! you can check whether container hits its limits
 with a single command, while with configs you would have to look through
 20 files...
 
 IMHO it is convinient to have a text file representing the whole information state
 and system call for applications.
 
 >>   Do you like the idea to convert numbers to strings and back w/o
 >>   thinking of data types?
 >
 >
 > IMO, setting up limits and containers (themselves) is not a common
 > operation.    I wouldn't be too worried about loosing those few extra
 > cycles in setting them up.
 it is not the question of performance...
 
 Kirill
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5388 is a reply to message #5349] | Fri, 18 August 2006 11:43   |  
			| 
				
				
					|  dev Messages: 1693
 Registered: September 2005
 Location: Moscow
 | Senior Member |  
 |  |  
	| Matt Helsley wrote: 
 [... snip ...]
 >>--- ./kernel/ub/sys.c.ubsys	2006-07-28 18:52:18.000000000 +0400
 >>+++ ./kernel/ub/sys.c	2006-08-03 16:14:23.000000000 +0400
 >>@@ -0,0 +1,126 @@
 >>+/*
 >>+ *  kernel/ub/sys.c
 >>+ *
 >>+ *  Copyright (C) 2006 OpenVZ. SWsoft Inc
 >>+ *
 >>+ */
 >>+
 >>+#include <linux/config.h>
 >>+#include <linux/sched.h>
 >>+#include <asm/uaccess.h>
 >>+
 >>+#include <ub/beancounter.h>
 >>+#include <ub/task.h>
 >>+
 >>+#ifndef CONFIG_USER_RESOURCE
 >
 >
 > Get rid of the #ifdef since this file should only be compiled if
 > CONFIG_USER_RESOURCE=y anyway.
 >
 >
 >>+asmlinkage long sys_getluid(void)
 >>+{
 >>+	return -ENOSYS;
 >>+}
 >>+
 >>+asmlinkage long sys_setluid(uid_t uid)
 >>+{
 >>+	return -ENOSYS;
 >>+}
 >>+
 >>+asmlinkage long sys_setublimit(uid_t uid, unsigned long resource,
 >>+		unsigned long *limits)
 >>+{
 >>+	return -ENOSYS;
 >>+}
 >
 >
 > Looks to me like you want to add:
 >
 > cond_syscall(sys_getluid);
 > ...
 >
 > in kernel/sys_ni.c and then you won't have to worry about making these
 > empty functions.
 Good note. Thanks, will do it!
 
 >>+#else /* CONFIG_USER_RESOURCE */
 >>+
 >>+/*
 >>+ *	The (rather boring) getluid syscall
 >>+ */
 >>+asmlinkage long sys_getluid(void)
 >>+{
 >>+	struct user_beancounter *ub;
 >>+
 >>+	ub = get_exec_ub();
 >>+	if (ub == NULL)
 >>+		return -EINVAL;
 >>+
 >>+	return ub->ub_uid;
 >>+}
 >>+
 >>+/*
 >>+ *	The setluid syscall
 >>+ */
 >>+asmlinkage long sys_setluid(uid_t uid)
 >>+{
 >>+	int error;
 >>+	struct user_beancounter *ub;
 >>+	struct task_beancounter *task_bc;
 >>+
 >>+	task_bc = ¤t->task_bc;
 >>+
 >>+	/* You may not disown a setluid */
 >>+	error = -EINVAL;
 >>+	if (uid == (uid_t)-1)
 >>+		goto out;
 >>+
 >>+	/* You may only set an ub as root */
 >>+	error = -EPERM;
 >>+	if (!capable(CAP_SETUID))
 >>+		goto out;
 >
 >
 > With resource groups you don't necessarily have to be root -- just the
 > owner of the group and task.
 the question is - who is the owner of group?
 user, user group or who?
 Both are bad, since the same user can run inside the container and thus
 container will be potentially controllable/breakable from inside.
 
 
 > Filesystems and appropriate share representations offer a way to give
 > regular users the ability to manage their resources without requiring
 > CAP_FOO.
 not sure what you propose...
 
 we can introduce the following rules:
 
 containers (UB) can be created by process with SETUID cap only.
 subcontainers (SUB) can be created by any process.
 
 what do you think?
 
 
 >>+	/* Ok - set up a beancounter entry for this user */
 >>+	error = -ENOBUFS;
 >>+	ub = beancounter_findcreate(uid, NULL, UB_ALLOC);
 >>+	if (ub == NULL)
 >>+		goto out;
 >>+
 >>+	/* install bc */
 >>+	put_beancounter(task_bc->exec_ub);
 >>+	task_bc->exec_ub = ub;
 >>+	put_beancounter(task_bc->fork_sub);
 >>+	task_bc->fork_sub = get_beancounter(ub);
 >>+	error = 0;
 >>+out:
 >>+	return error;
 >>+}
 >>+
 >>+/*
 >>+ *	The setbeanlimit syscall
 >>+ */
 >>+asmlinkage long sys_setublimit(uid_t uid, unsigned long resource,
 >>+		unsigned long *limits)
 >>+{
 >>+	int error;
 >>+	unsigned long flags;
 >>+	struct user_beancounter *ub;
 >>+	unsigned long new_limits[2];
 >>+
 >>+	error = -EPERM;
 >>+	if(!capable(CAP_SYS_RESOURCE))
 >>+		goto out;
 >
 >
 > Again, a filesystem interface would give us more flexibility when it
 > comes to allowing users to manage their resources while still preventing
 > them from exceeding limits.
 we can have 2 different root users with uid = 0 in 2 different containers.
 
 > I doubt you really want to give owners of a container CAP_SYS_RESOURCE
 > and CAP_USER (i.e. total control over resource management) just to allow
 > them to manage their subset of the resources.
 The origin idea is that administator of the node can manage user
 resources only. Users can't, since otherwise they can increase the limits.
 But we can allow them to manage sub beancoutners imho...
 
 >>+	error = -EINVAL;
 >>+	if (resource >= UB_RESOURCES)
 >>+		goto out;
 >>+
 >>+	error = -EFAULT;
 >>+	if (copy_from_user(&new_limits, limits, sizeof(new_limits)))
 >>+		goto out;
 >>+
 >>+	error = -EINVAL;
 >>+	if (new_limits[0] > UB_MAXVALUE || new_limits[1] > UB_MAXVALUE)
 >>+		goto out;
 >>+
 >>+	error = -ENOENT;
 >>+	ub = beancounter_findcreate(uid, NULL, 0);
 >>+	if (ub == NULL)
 >>+		goto out;
 >>+
 >>+	spin_lock_irqsave(&ub->ub_lock, flags);
 >>+	ub->ub_parms[resource].barrier = new_limits[0];
 >>+	ub->ub_parms[resource].limit = new_limits[1];
 >>+	spin_unlock_irqrestore(&ub->ub_lock, flags);
 >>+
 >>+	put_beancounter(ub);
 >>+	error = 0;
 >>+out:
 >>+	return error;
 >>+}
 >>+#endif
 >>
 >> ------------------------------------------------------------ -------------
 >>Using Tomcat but need to do more? Need to support web services, security?
 >>Get stuff done quickly with pre-integrated technology to make your job easier
 >>Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
 >> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&b id=263057&dat=121642
 >>_______________________________________________
 >>ckrm-tech mailing list
 >>https://lists.sourceforge.net/lists/listinfo/ckrm-tech
 >
 >
 >
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [PATCH 2/7] UBC: core (structures, API) [message #5391 is a reply to message #5370] | Fri, 18 August 2006 11:50   |  
			| 
				
				
					|  dev Messages: 1693
 Registered: September 2005
 Location: Moscow
 | Senior Member |  
 |  |  
	| Matt Helsley wrote: > On Fri, 2006-08-18 at 11:35 +0400, Andrey Savochkin wrote:
 >
 >>On Thu, Aug 17, 2006 at 10:31:37PM -0700, Andrew Morton wrote:
 >>
 >>>On Thu, 17 Aug 2006 15:53:40 +0400
 >>>Kirill Korotaev <dev@sw.ru> wrote:
 >>>
 >>>
 >>>>>>+struct user_beancounter
 >>>>>>+{
 >>>>>>+	atomic_t		ub_refcount;
 >>>>>>+	spinlock_t		ub_lock;
 >>>>>>+	uid_t			ub_uid;
 >>>>>
 >>>>>
 >>>>>Why uid?  Will it be possible to club processes belonging to different
 >>>>>users to same bean counter.
 >>>>
 >>>>oh, its a misname. Should be ub_id. it is ID of user_beancounter
 >>>>and has nothing to do with user id.
 >>>
 >>>But it uses a uid_t.  That's more than a misnaming?
 >>
 >>It used to be uid-related in ancient times when the notion of container
 >>hadn't formed up.
 >>"user" part of user_beancounter name has the same origin :)
 >
 >
 > Is it similarly irrelevant now? If so perhaps a big rename could be used
 > to make the names clearer (s/user_//, s/ub_/bc_/, ...).
 hm... let's try :)
 
 Kirill
 |  
	|  |  |  
	|  | 
 
 
 Current Time: Fri Oct 31 23:40:40 GMT 2025 
 Total time taken to generate the page: 0.23592 seconds |