	| 
		
			| [PATCH] BC: resource beancounters (v4) (added user memory) [message #5922] | Tue, 05 September 2006 14:59  |  
			| 
				
				
					|  dev Messages: 1693
 Registered: September 2005
 Location: Moscow
 | Senior Member |  
 |  |  
	| Core Resource Beancounters (BC) + kernel/user memory control. 
 BC allows accounting and controlling the consumption
 of kernel resources used by a group of processes.
 
 A draft UBC description can be found on the OpenVZ wiki at
 http://wiki.openvz.org/UBC_parameters
 
 The full BC patch set allows control of:
 - kernel memory. All kernel objects allocatable
 on user demand should be accounted and limited
 for DoS protection.
 E.g. page tables, task structs, vmas etc.
 
 - virtual memory pages. BCs allow limiting
 a container to some amount of memory and
 introduce a 2-level OOM killer that takes the
 container's consumption into account.
 Pages shared between containers are correctly
 charged as fractions (tunable).
 
 - network buffers. These include TCP/IP rcv/snd
 buffers, dgram snd buffers, unix, netlink and
 other buffers.
 
 - minor resources accounted/limited by number:
 tasks, files, flocks, ptys, siginfo, pinned dcache
 mem, sockets, iptentries (for containers with
 virtualized networking)
 
 As the first step we want to propose for discussion
 the most complicated parts of resource management:
 kernel memory and virtual memory.
 The patch set to be sent provides core for BC and
 management of kernel memory only. Virtual memory
 management will be sent in a couple of days.
 
 The patches in this series are:
 diff-atomic-dec-and-lock-irqsave.patch
 introduce atomic_dec_and_lock_irqsave()
 
 diff-bc-kconfig.patch:
 Adds kernel/bc/Kconfig file with UBC options and
 includes it into arch Kconfigs
 
 diff-bc-core.patch:
 Contains core functionality and interfaces of BC:
 find/create beancounter, initialization,
 charge/uncharge of resource, core objects' declarations.
 
 diff-bc-task.patch:
 Contains code responsible for setting the BC on a task,
 inheriting it on fork, and setting the host context in interrupts.
 
 A task references two beancounters:
 1. exec_bc  - current context. all resources are charged
 to this beancounter.
 2. fork_bc  - beancounter which is inherited by
 task's children on fork
 
 diff-bc-syscalls.patch:
 Patch adds system calls for BC management:
 1. sys_get_bcid    - get current BC id
 2. sys_set_bcid    - changes exec_ and fork_ BCs on current
 3. sys_set_bclimit - set limits for resource consumption
 4. sys_get_bcstat  - returns limits/usages/fails for BC
 
 diff-bc-kmem-core.patch:
 Introduces BC_KMEMSIZE resource which accounts kernel
 objects allocated by task's request.
 
 Objects are accounted via struct page and slab objects.
 For the latter, each slab contains a set of pointers to the
 beancounters that the corresponding objects are charged to.
 
 Allocation charge rules:
 1. Pages - if allocation is performed with __GFP_BC flag - page
 is charged to current's exec_bc.
 2. Slabs - kmem_cache may be created with SLAB_BC flag - in this
 case each allocation is charged. Caches used by kmalloc are
 created with SLAB_BC | SLAB_BC_NOCHARGE flags. In this case
 only __GFP_BC allocations are charged.
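 
 The charge rules above can be sketched as two small predicates. This is
 only an illustration: the flag values and the names page_is_charged(),
 slab_object_is_charged() and charge_rules_demo() are made up for the
 example, GFP_BC stands in for __GFP_BC, and treating a cache created
 without SLAB_BC as "never charged" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define SLAB_BC          0x1u	/* cache takes part in BC accounting */
#define SLAB_BC_NOCHARGE 0x2u	/* charge only on explicit request */
#define GFP_BC           0x4u	/* stands in for __GFP_BC */

/* Rule 1: a page is charged only if the allocation carries __GFP_BC. */
static bool page_is_charged(unsigned int gfp_flags)
{
	return (gfp_flags & GFP_BC) != 0;
}

/* Rule 2: slab objects, depending on how the cache was created. */
static bool slab_object_is_charged(unsigned int cache_flags,
				   unsigned int gfp_flags)
{
	if (!(cache_flags & SLAB_BC))
		return false;		/* assumption: cache not accounted */
	if (cache_flags & SLAB_BC_NOCHARGE)
		return (gfp_flags & GFP_BC) != 0;	/* kmalloc-style caches */
	return true;			/* every allocation is charged */
}

/* Usage: a kmalloc-style cache charges only __GFP_BC allocations. */
void charge_rules_demo(void)
{
	unsigned int kmalloc_cache = SLAB_BC | SLAB_BC_NOCHARGE;

	assert(!slab_object_is_charged(kmalloc_cache, 0));
	assert(slab_object_is_charged(kmalloc_cache, GFP_BC));
	assert(page_is_charged(GFP_BC));
}
```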
 
 diff-bc-kmem-charge.patch:
 Adds SLAB_BC and __GFP_BC flags in appropriate places
 to cause charging/limiting of specified resources.
 
 diff-bc-vmlocked-core.patch:
 Introduces new resource BC_LOCKEDPAGES for accounting
 of mlock-ed user pages.
 
 diff-bc-vmlocked-charge.patch:
 Places calls to BC core over the kernel to charge locked memory.
 
 diff-bc-privvm.patch:
 This patch introduces a new resource, BC_PRIVVMPAGES.
 Privvmpages accounting is described in detail at
 http://wiki.openvz.org/User_pages_accounting
 
 diff-bc-vmrss-prep.patch:
 This patch introduces small preparations for vmrss accounting
 to make reviewing simpler.
 
 diff-bc-vmrss-core.patch:
 This is the core of vmrss accounting.
 Pages are accounted in fractions, as described in detail at
 http://wiki.openvz.org/RSS_fractions_accounting
 
 diff-bc-vmrss-charge.patch:
 Calls to vmrss core code over the kernel to do accounting.
 
 
 Summary of changes from v3 patch set:
 
 * Added basic user pages accounting (lockedpages/privvmpages)
 * spell in Kconfig
 * Makefile reworked
 * EXPORT_SYMBOL_GPL
 * union w/o name in struct page
 * bc_task_charge is void now
 * adjust minheld/maxheld splitted
 
 Summary of changes from v2 patch set:
 
 * introduced atomic_dec_and_lock_irqsave()
 * bc_adjust_held_minmax comment
 * added __must_check for bc_*charge* funcs
 * use hash_long() instead of own one
 * bc/Kconfig is sourced from init/Kconfig now
 * introduced bcid_t type with comment from Alan Cox
 * check for barrier <= limit in sys_set_bclimit()
 * removed (bc == NULL) checks
 * replaced memcpy in beancounter_findcreate with assignment
 * moved check 'if (mask & BC_ALLOC)' out of the lock
 * removed unnecessary memset()
 
 Summary of changes from v1 patch set:
 
 * CONFIG_BEANCOUNTERS is 'n' by default
 * fixed Kconfig includes in arches
 * removed hierarchical beancounters to simplify first patchset
 * removed unused 'private' pointer
 * removed unused EXPORTS
 * MAXVALUE redeclared as LONG_MAX
 * beancounter_findcreate clarification
 * renamed UBC -> BC, ub -> bc etc.
 * moved BC inheritance into copy_process
 * introduced reset_exec_bc() with proposed BUG_ON
 * removed task_bc beancounter (not used yet, for numproc)
 * fixed syscalls for sparc
 * added sys_get_bcstat(): return info that was in /proc
 * cond_syscall instead of #ifdefs
 
 Many thanks to Oleg Nesterov, Alan Cox, Matt Helsley and others
 for patch review and comments.
 
 Patch set is applicable to 2.6.18-rc5-mm1
 
 Thanks,
 Kirill
 |  
	|  |  |  
	| 
		
			| [PATCH 1/13] BC: introduce atomic_dec_and_lock_irqsave() [message #5923 is a reply to message #5922] | Tue, 05 September 2006 15:16   |  
	| Oleg Nesterov pointed out to me that a construction like the following (used in the beancounter patches and free_uid()):
 
 local_irq_save(flags);
 if (atomic_dec_and_lock(&refcnt, &lock))
 ...
 
 is not that good for preemptible kernels, since with preemption
 spin_lock() can schedule() to reduce latency. However, it won't schedule
 if interrupts are disabled.
 
 So this patch introduces atomic_dec_and_lock_irqsave() as a logical
 counterpart to atomic_dec_and_lock().
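 
 For readers unfamiliar with the dec-and-lock idiom, here is a user-space
 sketch of its logic using C11 atomics. It is not the kernel code: the
 toy_spinlock type and the names add_unless(), dec_and_lock() and
 dec_and_lock_demo() are invented for the example, and the interrupt
 save/restore part has no user-space equivalent, so it is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spinlock standing in for spinlock_t (illustration only). */
typedef struct { atomic_flag held; } toy_spinlock;

static void toy_lock(toy_spinlock *l)
{
	while (atomic_flag_test_and_set(&l->held))
		;	/* spin until the flag was previously clear */
}

static void toy_unlock(toy_spinlock *l)
{
	atomic_flag_clear(&l->held);
}

/* Add 'a' to *v unless it currently equals 'u'; true if the add happened. */
static bool add_unless(atomic_int *v, int a, int u)
{
	int cur = atomic_load(v);

	while (cur != u) {
		/* On failure 'cur' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &cur, cur + a))
			return true;
	}
	return false;
}

/*
 * Drop one reference. Returns true with 'lock' held only when the count
 * reached zero, so the caller can unlink and free the object.
 */
static bool dec_and_lock(atomic_int *count, toy_spinlock *lock)
{
	/* Fast path: not the last reference, no lock needed. */
	if (add_unless(count, -1, 1))
		return false;

	/* Slow path: maybe the last reference, decide under the lock. */
	toy_lock(lock);
	if (atomic_fetch_sub(count, 1) == 1)
		return true;
	toy_unlock(lock);
	return false;
}

/* Usage: with two references, only the second put gets the lock. */
void dec_and_lock_demo(void)
{
	atomic_int refs = 2;
	toy_spinlock lock = { ATOMIC_FLAG_INIT };

	assert(!dec_and_lock(&refs, &lock));	/* 2 -> 1, lock not taken */
	assert(dec_and_lock(&refs, &lock));	/* 1 -> 0, lock is held */
	toy_unlock(&lock);
}
```

 In the kernel version the slow path takes the lock with
 spin_lock_irqsave(), so interrupts are disabled only while the lock is
 actually held.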
 
 Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
 Signed-Off-By: Kirill Korotaev <dev@sw.ru>
 
 ---
 
 include/linux/spinlock.h |    6 ++++++
 kernel/user.c            |    5 +----
 lib/dec_and_lock.c       |   19 +++++++++++++++++++
 3 files changed, 26 insertions(+), 4 deletions(-)
 
 --- ./include/linux/spinlock.h.dlirq	2006-08-28 10:17:35.000000000 +0400
 +++ ./include/linux/spinlock.h	2006-08-28 11:22:37.000000000 +0400
 @@ -266,6 +266,12 @@ extern int _atomic_dec_and_lock(atomic_t
 #define atomic_dec_and_lock(atomic, lock) \
 __cond_lock(lock, _atomic_dec_and_lock(atomic, lock))
 
 +extern int _atomic_dec_and_lock_irqsave(atomic_t *atomic, spinlock_t *lock,
 +		unsigned long *flagsp);
 +#define atomic_dec_and_lock_irqsave(atomic, lock, flags) \
 +		__cond_lock(lock, \
 +			_atomic_dec_and_lock_irqsave(atomic, lock, &flags))
 +
 /**
 * spin_can_lock - would spin_trylock() succeed?
 * @lock: the spinlock in question.
 --- ./kernel/user.c.dlirq	2006-07-10 12:39:20.000000000 +0400
 +++ ./kernel/user.c	2006-08-28 11:08:56.000000000 +0400
 @@ -108,15 +108,12 @@ void free_uid(struct user_struct *up)
 if (!up)
 return;
 
 -	local_irq_save(flags);
 -	if (atomic_dec_and_lock(&up->__count, &uidhash_lock)) {
 +	if (atomic_dec_and_lock_irqsave(&up->__count, &uidhash_lock, flags)) {
 uid_hash_remove(up);
 spin_unlock_irqrestore(&uidhash_lock, flags);
 key_put(up->uid_keyring);
 key_put(up->session_keyring);
 kmem_cache_free(uid_cachep, up);
 -	} else {
 -		local_irq_restore(flags);
 }
 }
 
 --- ./lib/dec_and_lock.c.dlirq	2006-04-21 11:59:36.000000000 +0400
 +++ ./lib/dec_and_lock.c	2006-08-28 11:22:08.000000000 +0400
 @@ -33,3 +33,22 @@ int _atomic_dec_and_lock(atomic_t *atomi
 }
 
 EXPORT_SYMBOL(_atomic_dec_and_lock);
 +
 +/*
 + * the same, but takes the lock with _irqsave
 + */
 +int _atomic_dec_and_lock_irqsave(atomic_t *atomic, spinlock_t *lock,
 +		unsigned long *flagsp)
 +{
 +#ifdef CONFIG_SMP
 +	if (atomic_add_unless(atomic, -1, 1))
 +		return 0;
 +#endif
 +	spin_lock_irqsave(lock, *flagsp);
 +	if (atomic_dec_and_test(atomic))
 +		return 1;
 +	spin_unlock_irqrestore(lock, *flagsp);
 +	return 0;
 +}
 +
 +EXPORT_SYMBOL(_atomic_dec_and_lock_irqsave);
 |  
	|  |  |  
	|  |  
	| 
		
			| [PATCH 3/13] BC: beancounters core (API) [message #5925 is a reply to message #5922] | Tue, 05 September 2006 15:17   |  
	| Core functionality and interfaces of BC: find/create beancounter, initialization,
 charge/uncharge of resource, core objects' declarations.
 
 Basic structures:
 bc_resource_parm - resource description
 beancounter      - set of resources, id, lock
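 
 To make the roles of barrier, limit and the severity levels concrete,
 here is a user-space sketch of how a charge could be checked against
 bc_resource_parm. The function bodies are an assumption: this is not
 the patch's bc_charge(), and the helper names charge() and uncharge()
 are made up. The sketch assumes BC_BARRIER charges must stay within
 the barrier, BC_LIMIT charges within the limit, and BC_FORCE charges
 always succeed.

```c
#include <assert.h>
#include <errno.h>

struct bc_resource_parm {
	unsigned long barrier;	/* soft threshold */
	unsigned long limit;	/* hard resource limit */
	unsigned long held;	/* consumed resources */
	unsigned long maxheld;	/* maximum amount of consumed resources */
	unsigned long minheld;	/* minimum amount of consumed resources */
	unsigned long failcnt;	/* count of failed charges */
};

enum bc_severity { BC_BARRIER, BC_LIMIT, BC_FORCE };

/* Hypothetical charge check; returns 0 on success, -ENOMEM on refusal. */
static int charge(struct bc_resource_parm *p, unsigned long val,
		  enum bc_severity strict)
{
	unsigned long new_held = p->held + val;

	if ((strict == BC_BARRIER && new_held > p->barrier) ||
	    (strict == BC_LIMIT && new_held > p->limit)) {
		p->failcnt++;		/* refused: only record the failure */
		return -ENOMEM;
	}

	p->held = new_held;
	if (p->maxheld < p->held)	/* same test as bc_adjust_maxheld() */
		p->maxheld = p->held;
	return 0;
}

/* Hypothetical uncharge; the caller must not return more than it holds. */
static void uncharge(struct bc_resource_parm *p, unsigned long val)
{
	assert(val <= p->held);
	p->held -= val;
	if (p->minheld > p->held)	/* same test as bc_adjust_minheld() */
		p->minheld = p->held;
}
```

 With barrier = 10 and limit = 20, a BC_BARRIER charge that would raise
 held from 8 to 12 is refused and bumps failcnt, while the same charge
 with BC_LIMIT succeeds.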
 
 Signed-off-by: Pavel Emelianov <xemul@sw.ru>
 Signed-off-by: Kirill Korotaev <dev@sw.ru>
 
 ---
 
 include/bc/beancounter.h |  155 +++++++++++++++++++++++++++
 include/linux/types.h    |   16 ++
 init/main.c              |    4
 kernel/Makefile          |    1
 kernel/bc/Makefile       |    7 +
 kernel/bc/beancounter.c  |  263 +++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 446 insertions(+)
 
 --- ./include/bc/beancounter.h.bccore	2006-09-05 12:06:35.000000000 +0400
 +++ ./include/bc/beancounter.h	2006-09-05 12:15:57.000000000 +0400
 @@ -0,0 +1,155 @@
 +/*
 + *  include/bc/beancounter.h
 + *
 + *  Copyright (C) 2006 OpenVZ. SWsoft Inc
 + *
 + */
 +
 +#ifndef _LINUX_BEANCOUNTER_H
 +#define _LINUX_BEANCOUNTER_H
 +
 +/*
 + *	Resource list.
 + */
 +
 +#define BC_RESOURCES	0
 +
 +struct bc_resource_parm {
 +	unsigned long barrier;	/* A barrier over which resource allocations
 +				 * are failed gracefully. e.g. if the amount
 +				 * of consumed memory is over the barrier
 +				 * further sbrk() or mmap() calls fail, the
 +				 * existing processes are not killed.
 +				 */
 +	unsigned long limit;	/* hard resource limit */
 +	unsigned long held;	/* consumed resources */
 +	unsigned long maxheld;	/* maximum amount of consumed resources */
 +	unsigned long minheld;	/* minimum amount of consumed resources */
 +	unsigned long failcnt;	/* count of failed charges */
 +};
 +
 +/*
 + * Kernel internal part.
 + */
 +
 +#ifdef __KERNEL__
 +
 +#include <linux/spinlock.h>
 +#include <linux/list.h>
 +#include <asm/atomic.h>
 +
 +#define BC_MAXVALUE	LONG_MAX
 +
 +/*
 + *	Resource management structures
 + * Serialization issues:
 + *   beancounter list management is protected via bc_hash_lock
 + *   task pointers are set only for current task and only once
 + *   refcount is managed atomically
 + *   value and limit comparison and change are protected by per-bc spinlock
 + */
 +
 +struct beancounter {
 +	atomic_t		bc_refcount;
 +	spinlock_t		bc_lock;
 +	bcid_t			bc_id;
 +	struct hlist_node	hash;
 +
 +	/* resources statistics and settings */
 +	struct bc_resource_parm	bc_parms[BC_RESOURCES];
 +};
 +
 +enum bc_severity { BC_BARRIER, BC_LIMIT, BC_FORCE };
 +
 +/* Flags passed to beancounter_findcreate() */
 +#define BC_LOOKUP	0x00
 +#define BC_ALLOC	0x01	/* may allocate new one */
 +#define BC_ALLOC_ATOMIC	0x02	/* when BC_ALLOC is set causes
 +				 * GFP_ATOMIC allocation
 +				 */
 +
 +#ifdef CONFIG_BEANCOUNTERS
 +
 +/*
 + * These functions tune minheld and maxheld values for a given
 + * resource when held value changes
 + */
 +static inline void bc_adjust_maxheld(struct beancounter *bc, int resource)
 +{
 +	struct bc_resource_parm *parm;
 +
 +	parm = &bc->bc_parms[resource];
 +	if (parm->maxheld < parm->held)
 +		parm->maxheld = parm->held;
 +}
 +
 +static inline void bc_adjust_minheld(struct beancounter *bc, int resource)
 +{
 +	struct bc_resource_parm *parm;
 +
 +	parm = &bc->bc_parms[resource];
 +	if (parm->minheld > parm->held)
 +		parm->minheld = parm->held;
 +}
 +
 +int __must_check bc_charge_locked(struct beancounter *bc,
 +		int res, unsigned long val, enum bc_severity strict);
 +int __must_check bc_charge(struct beancounter *bc,
 +		int res, unsigned long val, enum bc_severity strict);
 +
 +void bc_uncharge_locked(struct beancounter *bc, int res, unsigned long val);
 +void bc_uncharge(struct beancounter *bc, int res, unsigned long val);
 +
 +struct beancounter *beancounter_findcreate(bcid_t id, int mask);
 +
 +static inline struct beancounter *get_beancounter(struct beancounter *bc)
 +{
 +	atomic_inc(&bc->bc_refcount);
 +	return bc;
 +}
 +
 +void put_beancounter(struct beancounter *bc);
 +
 +void bc_init_early(void);
 +void bc_init_late(void);
 +void bc_init_proc(void);
 +
 +extern struct beancounter init_bc;
 +extern const char *bc_rnames[];
 +
 +#else /* CONFIG_BEANCOUNTERS */
 +
 +#define beancounter_findcreate(id, f)			(NULL)
 +#define get_beancounter(bc)				(NULL)
 +#define put_beancounter(bc)				do { } while (0)
 +
 +static inline __must_check int bc_charge_locked(struct beancounter *bc,
 +		int res, unsigned long val, enum bc_severity strict)
 +{
 +	return 0;
 +}
 +
 +static inline __must_check int bc_charge(struct beancounter *bc,
 +		int res, unsigned long val, enum bc_severity strict)
 +{
 +	return 0;
 +}
 +
 +static inline void bc_uncharge_locked(struct beancounter *bc, int res,
 +		unsigned long val)
 +{
 +}
 +
 +static inline void bc_uncharge(struct beancounter *bc, int res,
 +		unsigned long val)
 +{
 +}
 +
 +#define bc_init_early()					do { } while (0)
 +#define bc_init_late()					do { } while (0)
 +#define bc_init_proc()					do { } while (0)
 +
 +#endif /* CONFIG_BEANCOUNTERS */
 +#endif /* __KERNEL__ */
 +
 +#endif /* _LINUX_BEANCOUNTER_H */
 --- ./include/linux/types.h.bccore	2006-09-05 11:47:33.000000000 +0400
 +++ ./include/linux/types.h	2006-09-05 12:06:35.000000000 +0400
 @@ -40,6 +40,21 @@ typedef __kernel_gid32_t	gid_t;
 typedef __kernel_uid16_t        uid16_t;
 typedef __kernel_gid16_t        gid16_t;
 
 +/*
 + * Type of beancounter id (CONFIG_BEANCOUNTERS)
 + *
 + * The ancient Unix implementations of this kind of resource management and
 + * security are built around setluid() which sets a uid value that cannot
 + * be changed again and is normally used for security purposes. That
 + * happened to be a uid_t and in simple setups at login uid = luid = euid
 + * would be the norm.
 + *
 + * Thus the Linux one happens to be a uid_t. It could be something else but
 + * for the "container per user" model whatever a container is must be able
 + * to hold all possible uid_t values. Alan Cox.
 + */
 +typedef uid_t    bcid_t;
 +
 #ifdef CONFIG_UID16
 /* This is defined by include/asm-{arch}/posix_types.h */
 typedef __kernel_old_uid_t	old_uid_t;
 @@ -52,6 +67,7 @@ typedef __kernel_old_gid_t	old_gid_t;
 #else
 typedef __kernel_uid_t		uid_t;
 typedef __kernel_gid_t		gid_t;
 +typedef __kernel_uid_t		bcid_t;
 #endif /* __KERNEL__ */
 
 #if defined(__GNUC__) && !defined(__STRICT_ANSI__)
 --- ./init/main.c.bccore	2006-09-05 11:47:33.000000000 +0400
 +++ ./init/main.c	2006-09-05 12:06:35.000000000 +0400
 @@ -50,6 +50,8 @@
 #include <linux/debug_locks.h>
 #include <linux/lockdep.h>
 
 +#include <bc/beancounter.h>
 +
 #include <asm/io.h>
 #include <asm/bugs.h>
 #include <asm/setup.h>
 @@ -493,6 +495,7 @@ asmlinkage void __init start_kernel(void
 early_boot_irqs_off();
 early_init_irq_lock_class();
 
 +	bc_init_early();
 /*
 * Interrupts are still disabled. Do necessary setups, then
 * enable them
 @@ -585,6 +588,7 @@ asmlinkage void __init start_kernel(void
 #endif
 fork_init(num_physpages);
 proc_caches_init();
 +	bc_init_late();
 buffer_init();
 unnamed_dev_init();
 key_init();
 --- ./kernel/Makefile.bccore	2006-09-05 11:47:33.000000000 +0400
 +++ ./kernel/Makefile	2006-09-05 12:09:53.000000000 +0400
 @@ -12,6 +12,7 @@ obj-y     = sched.o fork.o exec_domain.o
 
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
 obj-y += time/
 +obj-$(CONFIG_BEANCOUNTERS) += bc/
 obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
 obj-$(CONFIG_LOCKDEP) += lockdep.o
 ifeq ($(CONFIG_PROC_FS),y)
 --- ./kernel/bc/Makefile.bccore	2006-09-05 12:06:35.000000000 +0400
 +++ ./kernel/bc/Makefile	2006-09-05 12:10:05.000000000 +0400
 @@ -0,0 +1,7 @@
 +#
 +# Beancounters (BC)
 +#
 +# Copyright (C) 2006 OpenVZ. SWsoft Inc
 +#
 +
 +obj-y += beancounter.o
 --- ./kernel/bc/beancounter.c.bccore	2006-09-05 12:06:35.000000000 +0400
 +++ ./kernel/bc/beancounter.c	2006-09-05 12:16:50.000000000 +0400
 @@ -0,0 +1,263 @@
 +/*
 + *  kernel/bc/beancounter.c
 + *
 + *  Copyright (C) 2006 OpenVZ. SWsoft Inc
 + *  Original code by (C) 1998      Alan Cox
 + *                       1998-2000 Andrey Savochkin <saw@saw.sw.com.sg>
 + */
 +
 +#include <linux/slab.h>
 +#include <linux/module.h>
 +#include <linux/hash.h>
 +
 +#include <bc/beancounter.h>
 +
 +static kmem_cache_t *bc_cachep;
 +static struct beancounter default_beancounter;
 +
 +static void init_beancounter_struct(struct beancounter *bc, bcid_t id);
 +
 +struct beancounter init_bc;
 +
 +const char *bc_rnames[] = {
 +};
 +
 +#define BC_HASH_BITS		8
 +#define BC_HASH_SIZE		(1 << BC_HASH_BITS)
 +
 +static struct hlist_head bc_hash[BC_HASH_SIZE];
 +static spinlock_t bc_hash_lock;
 +#define bc_hash_fn(bcid)	(hash_long(bcid, BC_HASH_BITS))
 +
 +/*
 + *	Per resource beancounting. Resources are tied to their bc id.
 + *	The resource structure itself is tagged both to the process and
 + *	the charging resources (a socket doesn't want to have to search for
 + *	things at irq time for example). Reference counters keep things in
 + *	hand.
 + *
 + *	The case where a user creates resource, kills all his processes and
 + *	then starts new ones is correctly handled this way. The refcounters
 + *	will mean the old entry is still around with resource tied to it.
 + */
 +
 +struct beancounter *beancounter_findcreate(bcid_t id, int mask)
 +{
 +	struct beancounter *new_bc, *bc;
 +	unsigned long flags;
 +	struct hlist_head *slot;
 +	struct hlist_node *pos;
 +
 +	slot = &bc_hash[bc_hash_fn(id)];
 +	new_bc = NULL;
 +
 +retry:
 +	spin_lock_irqsave(&bc_hash_lock, flags);
 +	hlist_for_each_entry (bc, pos, slot, hash)
 +		if (bc->bc_id == id)
 +			break;
 +
 +	if (pos != NULL) {
 +		get_beancounter(bc);
 +		spin_unlock_irqrestore(&bc_hash_lock, flags);
 +
 +		if (new_bc != NULL)
 +			kmem_cache_free(bc_cachep, new_bc);
 +		return bc;
 +	}
 +
 +	if (new_bc != NULL)
 +		goto out_install;
 +
 +	spin_unlock_irqrestore(&bc_hash_lock, flags);
 +
 +	if (!(mask & BC_ALLOC))
 +		goto out;
 +
 +	new_bc = kmem_cache_alloc(bc_cachep,
 +			mask & BC_ALLOC_ATOMIC ? GFP_ATOMIC : GFP_KERNEL);
 +	if (new_bc == NULL)
 +		goto out;
 +
 +	*new_bc = default_beancounter;
 +	init_beancounter_struct(new_bc, id);
 +	goto retry;
 +
 +out_install:
 +	hlist_add_head(&new_bc-
...
 
 
 |  
	|  |  |  
	| 
		
			| [PATCH 4/13] BC: context inheriting and changing [message #5926 is a reply to message #5922] | Tue, 05 September 2006 15:19   |  
	| Contains code responsible for setting the BC on a task, inheriting it on fork, and setting the host context in interrupts.
 
 A task references 2 beancounters:
 1. exec_bc: current context. All resources are
 charged to this beancounter.
 2. fork_bc: beancounter which is inherited by
 task's children on fork
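 
 The inheritance and context-switch rules can be modelled in user space
 as follows. This is a simplified sketch: the refcount is a plain int
 rather than atomic_t, and get_bc(), put_bc(), task_charge(),
 task_uncharge() and set_exec() are stand-in names, not the kernel
 functions from this patch.

```c
#include <assert.h>
#include <stddef.h>

struct beancounter {
	int refcount;		/* plain int here; atomic_t in the kernel */
};

struct task_beancounter {
	struct beancounter *exec_bc;	/* resources are charged here */
	struct beancounter *fork_bc;	/* inherited by children on fork */
};

static struct beancounter *get_bc(struct beancounter *bc)
{
	bc->refcount++;
	return bc;
}

static void put_bc(struct beancounter *bc)
{
	assert(bc->refcount > 0);
	bc->refcount--;
}

/* On fork the child starts with both pointers set to parent's fork_bc. */
static void task_charge(const struct task_beancounter *parent,
			struct task_beancounter *child)
{
	struct beancounter *bc = parent->fork_bc;

	child->exec_bc = get_bc(bc);
	child->fork_bc = get_bc(bc);
}

/* When the task is freed, both references are dropped. */
static void task_uncharge(struct task_beancounter *t)
{
	put_bc(t->exec_bc);
	put_bc(t->fork_bc);
}

/* Switch the charging context and return the previous one. */
static struct beancounter *set_exec(struct task_beancounter *t,
				    struct beancounter *new_bc)
{
	struct beancounter *old = t->exec_bc;

	t->exec_bc = new_bc;
	return old;
}
```

 An interrupt handler would call set_exec() with the host beancounter on
 entry and restore the saved pointer on exit, so work done in interrupt
 context is not charged to whichever task happened to be running.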
 
 Signed-off-by: Pavel Emelianov <xemul@sw.ru>
 Signed-off-by: Kirill Korotaev <dev@sw.ru>
 
 ---
 
 include/bc/task.h       |   57 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/sched.h   |    5 ++++
 kernel/bc/Makefile      |    1
 kernel/bc/beancounter.c |    3 ++
 kernel/bc/misc.c        |   31 ++++++++++++++++++++++++++
 kernel/fork.c           |    5 ++++
 kernel/irq/handle.c     |    9 +++++++
 kernel/softirq.c        |    8 ++++++
 8 files changed, 119 insertions(+)
 
 --- ./include/bc/task.h.bctask	2006-09-05 12:24:07.000000000 +0400
 +++ ./include/bc/task.h	2006-09-05 12:38:53.000000000 +0400
 @@ -0,0 +1,57 @@
 +/*
 + *  include/bc/task.h
 + *
 + *  Copyright (C) 2006 OpenVZ. SWsoft Inc
 + *
 + */
 +
 +#ifndef __BC_TASK_H_
 +#define __BC_TASK_H_
 +
 +struct beancounter;
 +
 +struct task_beancounter {
 +	struct beancounter *exec_bc;
 +	struct beancounter *fork_bc;
 +};
 +
 +#ifdef CONFIG_BEANCOUNTERS
 +
 +#define get_exec_bc()	(current->task_bc.exec_bc)
 +
 +#define set_exec_bc(new) ({				\
 +		struct task_beancounter *tbc;		\
 +		struct beancounter *old;		\
 +		tbc = &current->task_bc;		\
 +		old = tbc->exec_bc;			\
 +		tbc->exec_bc = new;			\
 +		old;					\
 +	})
 +
 +#define reset_exec_bc(old, expected) do {		\
 +		struct task_beancounter *tbc;		\
 +		tbc = &current->task_bc;		\
 +		BUG_ON(tbc->exec_bc != expected);	\
 +		tbc->exec_bc = old;			\
 +	} while (0)
 +
 +void bc_task_charge(struct task_struct *parent, struct task_struct *new);
 +void bc_task_uncharge(struct task_struct *tsk);
 +
 +#else
 +
 +#define get_exec_bc()			(NULL)
 +#define set_exec_bc(new)		(NULL)
 +#define reset_exec_bc(new, expected)	do { } while (0)
 +
 +static inline void bc_task_charge(struct task_struct *parent,
 +		struct task_struct *new)
 +{
 +}
 +
 +static inline void bc_task_uncharge(struct task_struct *tsk)
 +{
 +}
 +
 +#endif
 +#endif
 --- ./include/linux/sched.h.bctask	2006-09-05 11:47:33.000000000 +0400
 +++ ./include/linux/sched.h	2006-09-05 12:33:45.000000000 +0400
 @@ -83,6 +83,8 @@ struct sched_param {
 #include <linux/timer.h>
 #include <linux/hrtimer.h>
 
 +#include <bc/task.h>
 +
 #include <asm/processor.h>
 
 struct exec_domain;
 @@ -1041,6 +1043,9 @@ struct task_struct {
 #ifdef	CONFIG_TASK_DELAY_ACCT
 struct task_delay_info *delays;
 #endif
 +#ifdef CONFIG_BEANCOUNTERS
 +	struct task_beancounter	task_bc;
 +#endif
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
 --- ./kernel/bc/Makefile.bctask	2006-09-05 12:10:05.000000000 +0400
 +++ ./kernel/bc/Makefile	2006-09-05 12:24:39.000000000 +0400
 @@ -5,3 +5,4 @@
 #
 
 obj-y += beancounter.o
 +obj-y += misc.o
 --- ./kernel/bc/beancounter.c.bctask	2006-09-05 12:16:50.000000000 +0400
 +++ ./kernel/bc/beancounter.c	2006-09-05 12:24:07.000000000 +0400
 @@ -247,6 +247,9 @@ void __init bc_init_early(void)
 spin_lock_init(&bc_hash_lock);
 slot = &bc_hash[bc_hash_fn(bc->bc_id)];
 hlist_add_head(&bc->hash, slot);
 +
 +	current->task_bc.exec_bc = get_beancounter(bc);
 +	current->task_bc.fork_bc = get_beancounter(bc);
 }
 
 void __init bc_init_late(void)
 --- /dev/null	2006-07-18 14:52:43.075228448 +0400
 +++ ./kernel/bc/misc.c	2006-09-05 12:30:57.000000000 +0400
 @@ -0,0 +1,31 @@
 +/*
 + * kernel/bc/misc.c
 + *
 + * Copyright (C) 2006 OpenVZ. SWsoft Inc.
 + *
 + */
 +
 +#include <linux/sched.h>
 +
 +#include <bc/beancounter.h>
 +#include <bc/task.h>
 +
 +void bc_task_charge(struct task_struct *parent, struct task_struct *new)
 +{
 +	struct task_beancounter *old_bc;
 +	struct task_beancounter *new_bc;
 +	struct beancounter *bc;
 +
 +	old_bc = &parent->task_bc;
 +	new_bc = &new->task_bc;
 +
 +	bc = old_bc->fork_bc;
 +	new_bc->exec_bc = get_beancounter(bc);
 +	new_bc->fork_bc = get_beancounter(bc);
 +}
 +
 +void bc_task_uncharge(struct task_struct *tsk)
 +{
 +	put_beancounter(tsk->task_bc.exec_bc);
 +	put_beancounter(tsk->task_bc.fork_bc);
 +}
 --- ./kernel/fork.c.bctask	2006-09-05 11:47:33.000000000 +0400
 +++ ./kernel/fork.c	2006-09-05 12:30:38.000000000 +0400
 @@ -48,6 +48,8 @@
 #include <linux/delayacct.h>
 #include <linux/taskstats_kern.h>
 
 +#include <bc/task.h>
 +
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
 @@ -104,6 +106,7 @@ static kmem_cache_t *mm_cachep;
 
 void free_task(struct task_struct *tsk)
 {
 +	bc_task_uncharge(tsk);
 free_thread_info(tsk->thread_info);
 rt_mutex_debug_task_free(tsk);
 free_task_struct(tsk);
 @@ -979,6 +982,8 @@ static struct task_struct *copy_process(
 if (!p)
 goto fork_out;
 
 +	bc_task_charge(current, p);
 +
 #ifdef CONFIG_TRACE_IRQFLAGS
 DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
 DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
 --- ./kernel/irq/handle.c.bctask	2006-09-05 11:47:33.000000000 +0400
 +++ ./kernel/irq/handle.c	2006-09-05 12:24:07.000000000 +0400
 @@ -16,6 +16,9 @@
 #include <linux/interrupt.h>
 #include <linux/kernel_stat.h>
 
 +#include <bc/beancounter.h>
 +#include <bc/task.h>
 +
 #include "internals.h"
 
 /**
 @@ -171,6 +174,9 @@ fastcall unsigned int __do_IRQ(unsigned
 struct irq_desc *desc = irq_desc + irq;
 struct irqaction *action;
 unsigned int status;
 +	struct beancounter *bc;
 +
 +	bc = set_exec_bc(&init_bc);
 
 kstat_this_cpu.irqs[irq]++;
 if (CHECK_IRQ_PER_CPU(desc->status)) {
 @@ -183,6 +189,8 @@ fastcall unsigned int __do_IRQ(unsigned
 desc->chip->ack(irq);
 action_ret = handle_IRQ_event(irq, regs, desc->action);
 desc->chip->end(irq);
 +
 +		reset_exec_bc(bc, &init_bc);
 return 1;
 }
 
 @@ -251,6 +259,7 @@ out:
 desc->chip->end(irq);
 spin_unlock(&desc->lock);
 
 +	reset_exec_bc(bc, &init_bc);
 return 1;
 }
 
 --- ./kernel/softirq.c.bctask	2006-09-05 11:47:33.000000000 +0400
 +++ ./kernel/softirq.c	2006-09-05 12:38:42.000000000 +0400
 @@ -18,6 +18,9 @@
 #include <linux/rcupdate.h>
 #include <linux/smp.h>
 
 +#include <bc/beancounter.h>
 +#include <bc/task.h>
 +
 #include <asm/irq.h>
 /*
 - No shared variables, all the data are CPU local.
 @@ -209,6 +212,9 @@ asmlinkage void __do_softirq(void)
 __u32 pending;
 int max_restart = MAX_SOFTIRQ_RESTART;
 int cpu;
 +	struct beancounter *bc;
 +
 +	bc = set_exec_bc(&init_bc);
 
 pending = local_softirq_pending();
 account_system_vtime(current);
 @@ -247,6 +253,8 @@ restart:
 
 account_system_vtime(current);
 _local_bh_enable();
 +
 +	reset_exec_bc(bc, &init_bc);
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ
 |  
	|  |  |  
	| 
		
			| [PATCH 5/13] BC: user interface (syscalls) [message #5927 is a reply to message #5922] | Tue, 05 September 2006 15:21   |  
	| Add the following system calls for BC management:
 1. sys_get_bcid     - get current BC id
 2. sys_set_bcid     - change exec_ and fork_ BCs on current
 3. sys_set_bclimit  - set limits for resource consumption
 4. sys_get_bcstat   - return bc_resource_parm for a resource
 
 Signed-off-by: Pavel Emelianov <xemul@sw.ru>
 Signed-off-by: Kirill Korotaev <dev@sw.ru>
 
 ---
 
 arch/i386/kernel/syscall_table.S |    4 +
 arch/ia64/kernel/entry.S         |    4 +
 arch/sparc/kernel/entry.S        |    2
 arch/sparc/kernel/systbls.S      |    6 +
 arch/sparc64/kernel/entry.S      |    2
 arch/sparc64/kernel/systbls.S    |   10 ++-
 include/asm-i386/unistd.h        |    6 +
 include/asm-ia64/unistd.h        |    6 +
 include/asm-powerpc/systbl.h     |    4 +
 include/asm-powerpc/unistd.h     |    6 +
 include/asm-sparc/unistd.h       |    4 +
 include/asm-sparc64/unistd.h     |    4 +
 include/asm-x86_64/unistd.h      |   10 ++-
 kernel/bc/Makefile               |    1
 kernel/bc/sys.c                  |  120 +++++++++++++++++++++++++++++++++++++++
 kernel/sys_ni.c                  |    6 +
 16 files changed, 186 insertions(+), 9 deletions(-)
 
 --- ./arch/i386/kernel/syscall_table.S.bcsys	2006-09-05 11:47:31.000000000 +0400
 +++ ./arch/i386/kernel/syscall_table.S	2006-09-05 12:47:21.000000000 +0400
 @@ -318,3 +318,7 @@ ENTRY(sys_call_table)
 .long sys_vmsplice
 .long sys_move_pages
 .long sys_getcpu
 +	.long sys_get_bcid
 +	.long sys_set_bcid		/* 320 */
 +	.long sys_set_bclimit
 +	.long sys_get_bcstat
 --- ./arch/ia64/kernel/entry.S.bcsys	2006-09-05 11:47:31.000000000 +0400
 +++ ./arch/ia64/kernel/entry.S	2006-09-05 12:47:21.000000000 +0400
 @@ -1610,5 +1610,9 @@ sys_call_table:
 data8 sys_sync_file_range		// 1300
 data8 sys_tee
 data8 sys_vmsplice
 +	data8 sys_get_bcid
 +	data8 sys_set_bcid
 +	data8 sys_set_bclimit			// 1305
 +	data8 sys_get_bcstat
 
 .org sys_call_table + 8*NR_syscalls	// guard against failures to increase NR_syscalls
 --- ./arch/sparc/kernel/entry.S.bcsys	2006-07-10 12:39:10.000000000 +0400
 +++ ./arch/sparc/kernel/entry.S	2006-09-05 12:47:21.000000000 +0400
 @@ -37,7 +37,7 @@
 
 #define curptr      g6
 
 -#define NR_SYSCALLS 300      /* Each OS is different... */
 +#define NR_SYSCALLS 304      /* Each OS is different... */
 
 /* These are just handy. */
 #define _SV	save	%sp, -STACKFRAME_SZ, %sp
 --- ./arch/sparc/kernel/systbls.S.bcsys	2006-07-10 12:39:10.000000000 +0400
 +++ ./arch/sparc/kernel/systbls.S	2006-09-05 12:47:21.000000000 +0400
 @@ -78,7 +78,8 @@ sys_call_table:
 /*285*/	.long sys_mkdirat, sys_mknodat, sys_fchownat, sys_futimesat, sys_fstatat64
 /*290*/	.long sys_unlinkat, sys_renameat, sys_linkat, sys_symlinkat, sys_readlinkat
 /*295*/	.long sys_fchmodat, sys_faccessat, sys_pselect6, sys_ppoll, sys_unshare
 -/*300*/	.long sys_set_robust_list, sys_get_robust_list
 +/*300*/	.long sys_set_robust_list, sys_get_robust_list, sys_get_bcid, sys_set_bcid, sys_set_bclimit
 +/*305*/	.long sys_get_bcstat
 
 #ifdef CONFIG_SUNOS_EMUL
 /* Now the SunOS syscall table. */
 @@ -192,4 +193,7 @@ sunos_sys_table:
 .long sunos_nosys, sunos_nosys, sunos_nosys
 .long sunos_nosys, sunos_nosys, sunos_nosys
 
 +	.long sunos_nosys, sunos_nosys, sunos_nosys
 +	.long sunos_nosys
 +
 #endif
 --- ./arch/sparc64/kernel/entry.S.bcsys	2006-07-10 12:39:10.000000000 +0400
 +++ ./arch/sparc64/kernel/entry.S	2006-09-05 12:47:21.000000000 +0400
 @@ -25,7 +25,7 @@
 
 #define curptr      g6
 
 -#define NR_SYSCALLS 300      /* Each OS is different... */
 +#define NR_SYSCALLS 304      /* Each OS is different... */
 
 .text
 .align		32
 --- ./arch/sparc64/kernel/systbls.S.bcsys	2006-07-10 12:39:11.000000000 +0400
 +++ ./arch/sparc64/kernel/systbls.S	2006-09-05 12:47:21.000000000 +0400
 @@ -79,7 +79,8 @@ sys_call_table32:
 .word sys_mkdirat, sys_mknodat, sys_fchownat, compat_sys_futimesat, compat_sys_fstatat64
 /*290*/	.word sys_unlinkat, sys_renameat, sys_linkat, sys_symlinkat, sys_readlinkat
 .word sys_fchmodat, sys_faccessat, compat_sys_pselect6, compat_sys_ppoll, sys_unshare
 -/*300*/	.word compat_sys_set_robust_list, compat_sys_get_robust_list
 +/*300*/	.word compat_sys_set_robust_list, compat_sys_get_robust_list, sys_nis_syscall, sys_nis_syscall, sys_nis_syscall
 +	.word sys_nis_syscall
 
 #endif /* CONFIG_COMPAT */
 
 @@ -149,7 +150,9 @@ sys_call_table:
 .word sys_mkdirat, sys_mknodat, sys_fchownat, sys_futimesat, sys_fstatat64
 /*290*/	.word sys_unlinkat, sys_renameat, sys_linkat, sys_symlinkat, sys_readlinkat
 .word sys_fchmodat, sys_faccessat, sys_pselect6, sys_ppoll, sys_unshare
 -/*300*/	.word sys_set_robust_list, sys_get_robust_list
 +/*300*/	.word sys_set_robust_list, sys_get_robust_list, sys_get_bcid, sys_set_bcid, sys_set_bclimit
 +	.word sys_get_bcstat
 +
 
 #if defined(CONFIG_SUNOS_EMUL) || defined(CONFIG_SOLARIS_EMUL) || \
 defined(CONFIG_SOLARIS_EMUL_MODULE)
 @@ -263,4 +266,7 @@ sunos_sys_table:
 .word sunos_nosys, sunos_nosys, sunos_nosys
 .word sunos_nosys, sunos_nosys, sunos_nosys
 .word sunos_nosys, sunos_nosys, sunos_nosys
 +
 +	.word sunos_nosys, sunos_nosys, sunos_nosys
 +	.word sunos_nosys
 #endif
 --- ./include/asm-i386/unistd.h.bcsys	2006-09-05 11:47:33.000000000 +0400
 +++ ./include/asm-i386/unistd.h	2006-09-05 12:48:37.000000000 +0400
 @@ -324,8 +324,12 @@
 #define __NR_vmsplice		316
 #define __NR_move_pages		317
 #define __NR_getcpu		318
 +#define __NR_get_bcid		319
 +#define __NR_set_bcid		320
 +#define __NR_set_bclimit	321
 +#define __NR_get_bcstat		322
 
 -#define NR_syscalls 318
 +#define NR_syscalls 323
 #include <linux/err.h>
 
 /*
 --- ./include/asm-ia64/unistd.h.bcsys	2006-09-05 11:47:33.000000000 +0400
 +++ ./include/asm-ia64/unistd.h	2006-09-05 12:47:21.000000000 +0400
 @@ -291,11 +291,15 @@
 #define __NR_sync_file_range		1300
 #define __NR_tee			1301
 #define __NR_vmsplice			1302
 +#define __NR_get_bcid			1303
 +#define __NR_set_bcid			1304
 +#define __NR_set_bclimit		1305
 +#define __NR_get_bcstat			1306
 
 #ifdef __KERNEL__
 
 
 -#define NR_syscalls			279 /* length of syscall table */
 +#define NR_syscalls			283 /* length of syscall table */
 
 #define __ARCH_WANT_SYS_RT_SIGACTION
 
 --- ./include/asm-powerpc/systbl.h.bcsys	2006-07-10 12:39:19.000000000 +0400
 +++ ./include/asm-powerpc/systbl.h	2006-09-05 12:47:21.000000000 +0400
 @@ -304,3 +304,7 @@ SYSCALL_SPU(fchmodat)
 SYSCALL_SPU(faccessat)
 COMPAT_SYS_SPU(get_robust_list)
 COMPAT_SYS_SPU(set_robust_list)
 +SYSCALL(get_bcid)
 +SYSCALL(set_bcid)
 +SYSCALL(set_bclimit)
 +SYSCALL(get_bcstat)
 --- ./include/asm-powerpc/unistd.h.bcsys	2006-09-05 11:47:33.000000000 +0400
 +++ ./include/asm-powerpc/unistd.h	2006-09-05 12:47:21.000000000 +0400
 @@ -323,10 +323,14 @@
 #define __NR_faccessat		298
 #define __NR_get_robust_list	299
 #define __NR_set_robust_list	300
 +#define __NR_get_bcid		301
 +#define __NR_set_bcid		302
 +#define __NR_set_bclimit	303
 +#define __NR_get_bcstat		304
 
 #ifdef __KERNEL__
 
 -#define __NR_syscalls		301
 +#define __NR_syscalls		305
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
 --- ./include/asm-sparc/unistd.h.bcsys	2006-09-05 11:47:33.000000000 +0400
 +++ ./include/asm-sparc/unistd.h	2006-09-05 12:47:21.000000000 +0400
 @@ -318,6 +318,10 @@
 #define __NR_unshare		299
 #define __NR_set_robust_list	300
 #define __NR_get_robust_list	301
 +#define __NR_get_bcid		302
 +#define __NR_set_bcid		303
 +#define __NR_set_bclimit	304
 +#define __NR_get_bcstat		305
 
 #ifdef __KERNEL__
 /* WARNING: You MAY NOT add syscall numbers larger than 301, since
 --- ./include/asm-sparc64/unistd.h.bcsys	2006-09-05 11:47:33.000000000 +0400
 +++ ./include/asm-sparc64/unistd.h	2006-09-05 12:47:21.000000000 +0400
 @@ -320,6 +320,10 @@
 #define __NR_unshare		299
 #define __NR_set_robust_list	300
 #define __NR_get_robust_list	301
 +#define __NR_get_bcid		302
 +#define __NR_set_bcid		303
 +#define __NR_set_bclimit	304
 +#define __NR_get_bcstat		305
 
 #ifdef __KERNEL__
 /* WARNING: You MAY NOT add syscall numbers larger than 301, since
 --- ./include/asm-x86_64/unistd.h.bcsys	2006-09-05 11:47:33.000000000 +0400
 +++ ./include/asm-x86_64/unistd.h	2006-09-05 12:49:03.000000000 +0400
 @@ -619,8 +619,16 @@ __SYSCALL(__NR_sync_file_range, sys_sync
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
 +#define __NR_get_bcid		280
 +__SYSCALL(__NR_get_bcid, sys_get_bcid)
 +#define __NR_set_bcid		281
 +__SYSCALL(__NR_set_bcid, sys_set_bcid)
 +#define __NR_set_bclimit	282
 +__SYSCALL(__NR_set_bclimit, sys_set_bclimit)
 +#define __NR_get_bcstat		283
 +__SYSCALL(__NR_get_bcstat, sys_get_bcstat)
 
 -#define __NR_syscall_max __NR_move_pages
 +#define __NR_syscall_max __NR_get_bcstat
 #include <linux/err.h>
 
 #ifndef __NO_STUBS
 --- ./kernel/bc/Makefile.bcsys	2006-09-05 12:24:39.000000000 +0400
 +++ ./kernel/bc/Makefile	2006-09-05 12:49:28.000000000 +0400
 @@ -6,3 +6,4 @@
 
 obj-y += beancounter.o
 obj-y += misc.o
 +obj-y += sys.o
 --- /dev/null	2006-07-18 14:52:43.075228448 +0400
 +++ ./kernel/bc/sys.c	2006-09-05 12:47:21.000000000 +0400
 @@ -0,0 +1,120 @@
 +/*
 + *  kernel/bc/sys.c
 + *
 + *  Copyright (C) 2006 OpenVZ. SWsoft Inc
 + *
 + */
 +
 +#include <linux/sched.h>
 +#include <asm/uaccess.h>
 +
 +#include <bc/beancounter.h>
 +#include <bc/task.h>
 +
 +asmlinkage long sys_get_bcid(void)
 +{
 +	struct beancounter *bc;
 +
 +	bc = get_exec_bc();
 +	return bc->bc_id;
 +}
 +
 +asmlinkage long sys_set_bcid(bcid_t id)
 +{
 +	int error;
 +	struct beancounter *bc;
 +	struct task_beancounter *task_bc;
 +
 +	task_bc = &current->task_bc;
 +
 +	/* You may only set a bc as root */
 +	error = -EPERM;
 +	if (!capable(CAP_SETUID))
 +		goto out;
 +
 +	/* Ok - set up a beancounter entry for this user */
 +	error = -ENOMEM;
 +	bc = beancounter_findcreate(id, BC_ALLOC);
 +	if (bc == NULL)
 +		goto out;
 +
 +	/* install bc */
 +	put_beancounter(task_bc->exec_bc);
 +	task_bc->exec_bc = bc;
 +	put_beancounter(task_bc->fork_bc);
 +	task_bc->fork_bc = get_beancounter(bc);
 +	error = 0;
 +out:
 +	return error;
 +}
 +
 +asmlinkage long sys_set_bcl
...
 
 
[PATCH 6/13] BC: kernel memory (core) [message #5928 is a reply to message #5922] | Tue, 05 September 2006 15:21
 Introduce the BC_KMEMSIZE resource, which accounts for kernel objects allocated at a task's request.
 
 A reference to the BC is kept in struct page or alongside the slab object.
 For slabs, each struct slab contains an array of pointers to the BCs
 that the corresponding objects are charged to.
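
The per-slab pointer array can be pictured with a small user-space model. This is only a sketch: `struct slab_hdr`, `bufctl_t`, `mgmt_size()` and `bc_ptr_offset()` are hypothetical stand-ins for illustration (the patch itself works on `struct slab`, `kmem_bufctl_t`, `slab_mgmt_size_noalign()` and `slab_bc_ptrs()`), and the sizes are not those of the real kernel structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; sizes are illustrative only. */
typedef unsigned int bufctl_t;                 /* models kmem_bufctl_t */
struct slab_hdr {                              /* models struct slab */
	void *s_mem;
	unsigned int inuse;
	unsigned int free;
};
struct beancounter;                            /* opaque, only pointers are stored */

#define BC_PTR_SIZE	sizeof(struct beancounter *)
#define ALIGN_UP(x, a)	(((x) + (a) - 1) / (a) * (a))

/* Unaligned management size: header, one bufctl per object, and for
 * accounted caches an aligned array of one BC pointer per object. */
static size_t mgmt_size(size_t nr_objs, int accounted)
{
	size_t size = sizeof(struct slab_hdr) + nr_objs * sizeof(bufctl_t);

	if (accounted)
		size = ALIGN_UP(size, BC_PTR_SIZE) + nr_objs * BC_PTR_SIZE;
	return size;
}

/* Byte offset, from the start of the header, of the BC pointer slot
 * that belongs to object number `index`. */
static size_t bc_ptr_offset(size_t nr_objs, size_t index)
{
	size_t base;

	assert(index < nr_objs);
	base = ALIGN_UP(sizeof(struct slab_hdr) + nr_objs * sizeof(bufctl_t),
			BC_PTR_SIZE);
	return base + index * BC_PTR_SIZE;
}
```

In this model the last pointer slot ends exactly at `mgmt_size(nr_objs, 1)`, so an accounted slab pays for one extra pointer per object plus at most one pointer of alignment padding.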
 
 Allocation charge rules:
 1. Pages - if the allocation is performed with the __GFP_BC flag, the
 page is charged to current's exec_bc.
 2. Slabs - a kmem_cache may be created with the SLAB_BC flag, in which
 case every allocation from it is charged. Caches used by kmalloc are
 created with the SLAB_BC | SLAB_BC_NOCHARGE flags, in which case
 only __GFP_BC allocations are charged.
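
The two rules above reduce to a small decision function. The following user-space sketch restates them; the flag values are copied from this patch, while `GFP_BC` (standing in for `__GFP_BC`), `page_should_charge()` and `slab_should_charge()` are names made up for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as defined in this patch; GFP_BC models __GFP_BC. */
#define SLAB_BC			0x00200000UL	/* cache is accounted with BC */
#define SLAB_BC_NOCHARGE	0x00400000UL	/* charge only on explicit request */
#define GFP_BC			0x80000u

/* Rule 1: a page is charged only when the caller asks for it. */
static bool page_should_charge(unsigned int gfp_flags)
{
	return (gfp_flags & GFP_BC) != 0;
}

/* Rule 2: a slab object is charged when its cache is accounted and
 * either the caller asks for it or the cache charges by default. */
static bool slab_should_charge(unsigned long cache_flags, unsigned int gfp_flags)
{
	if (!(cache_flags & SLAB_BC))
		return false;		/* cache has no room for BC pointers */
	if (gfp_flags & GFP_BC)
		return true;		/* explicit request */
	return !(cache_flags & SLAB_BC_NOCHARGE);
}
```

Note that under this logic a cache created without SLAB_BC is never charged, even if the allocation passes __GFP_BC, because its slabs carry no per-object BC pointers.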
 
 Signed-off-by: Pavel Emelianov <xemul@sw.ru>
 Signed-off-by: Kirill Korotaev <dev@sw.ru>
 
 ---
 
 include/bc/beancounter.h |    4 +
 include/bc/kmem.h        |   46 +++++++++++++++++
 include/linux/gfp.h      |    8 ++-
 include/linux/mm.h       |    4 +
 include/linux/slab.h     |    4 +
 include/linux/vmalloc.h  |    1
 kernel/bc/Makefile       |    1
 kernel/bc/beancounter.c  |    3 +
 kernel/bc/kmem.c         |   85 +++++++++++++++++++++++++++++++++
 mm/mempool.c             |    2
 mm/page_alloc.c          |   11 ++++
 mm/slab.c                |  121 ++++++++++++++++++++++++++++++++++++++---------
 mm/vmalloc.c             |    6 ++
 13 files changed, 271 insertions(+), 25 deletions(-)
 
 --- ./include/bc/beancounter.h.bckmemcore	2006-09-05 12:54:17.000000000 +0400
 +++ ./include/bc/beancounter.h	2006-09-05 12:54:40.000000000 +0400
 @@ -12,7 +12,9 @@
 *	Resource list.
 */
 
 -#define BC_RESOURCES	0
 +#define BC_KMEMSIZE	0
 +
 +#define BC_RESOURCES	1
 
 struct bc_resource_parm {
 unsigned long barrier;	/* A barrier over which resource allocations
 --- /dev/null	2006-07-18 14:52:43.075228448 +0400
 +++ ./include/bc/kmem.h	2006-09-05 12:54:40.000000000 +0400
 @@ -0,0 +1,46 @@
 +/*
 + *  include/bc/kmem.h
 + *
 + *  Copyright (C) 2006 OpenVZ. SWsoft Inc
 + *
 + */
 +
 +#ifndef __BC_KMEM_H_
 +#define __BC_KMEM_H_
 +
 +/*
 + * BC_KMEMSIZE accounting
 + */
 +
 +struct mm_struct;
 +struct page;
 +struct beancounter;
 +
 +#ifdef CONFIG_BEANCOUNTERS
 +int __must_check bc_page_charge(struct page *page, int order, gfp_t flags);
 +void bc_page_uncharge(struct page *page, int order);
 +
 +int __must_check bc_slab_charge(kmem_cache_t *cachep, void *obj, gfp_t flags);
 +void bc_slab_uncharge(kmem_cache_t *cachep, void *obj);
 +#else
 +static inline int __must_check bc_page_charge(struct page *page,
 +		int order, gfp_t flags)
 +{
 +	return 0;
 +}
 +
 +static inline void bc_page_uncharge(struct page *page, int order)
 +{
 +}
 +
 +static inline int __must_check bc_slab_charge(kmem_cache_t *cachep,
 +		void *obj, gfp_t flags)
 +{
 +	return 0;
 +}
 +
 +static inline void bc_slab_uncharge(kmem_cache_t *cachep, void *obj)
 +{
 +}
 +#endif
 +#endif /* __BC_KMEM_H_ */
 --- ./include/linux/gfp.h.bckmemcore	2006-09-05 12:53:55.000000000 +0400
 +++ ./include/linux/gfp.h	2006-09-05 12:54:40.000000000 +0400
 @@ -46,15 +46,18 @@ struct vm_area_struct;
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_THISNODE	((__force gfp_t)0x40000u)/* No fallback, no policies */
 +#define __GFP_BC	 ((__force gfp_t)0x80000u) /* Charge allocation with BC */
 +#define __GFP_BC_LIMIT ((__force gfp_t)0x100000u) /* Charge against BC limit */
 
 -#define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 +#define __GFP_BITS_SHIFT 21	/* Room for 21 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /* if you forget to add the bitmask here kernel will crash, period */
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
 __GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
 __GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
 -			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE)
 +			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE| \
 +			__GFP_BC|__GFP_BC_LIMIT)
 
 /* This equals 0, but use constants in case they ever change */
 #define GFP_NOWAIT	(GFP_ATOMIC & ~__GFP_HIGH)
 @@ -63,6 +66,7 @@ struct vm_area_struct;
 #define GFP_NOIO	(__GFP_WAIT)
 #define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
 #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 +#define GFP_KERNEL_BC	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_BC)
 #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
 #define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
 __GFP_HIGHMEM)
 --- ./include/linux/mm.h.bckmemcore	2006-09-05 12:53:55.000000000 +0400
 +++ ./include/linux/mm.h	2006-09-05 12:55:28.000000000 +0400
 @@ -274,8 +274,12 @@ struct page {
 unsigned int gfp_mask;
 unsigned long trace[8];
 #endif
 +#ifdef CONFIG_BEANCOUNTERS
 +	struct beancounter	*page_bc;
 +#endif
 };
 
 +#define page_bc(page)			((page)->page_bc)
 #define page_private(page)		((page)->private)
 #define set_page_private(page, v)	((page)->private = (v))
 
 --- ./include/linux/slab.h.bckmemcore	2006-09-05 12:53:59.000000000 +0400
 +++ ./include/linux/slab.h	2006-09-05 12:54:40.000000000 +0400
 @@ -46,6 +46,8 @@ typedef struct kmem_cache kmem_cache_t;
 #define SLAB_PANIC		0x00040000UL	/* panic if kmem_cache_create() fails */
 #define SLAB_DESTROY_BY_RCU	0x00080000UL	/* defer freeing pages to RCU */
 #define SLAB_MEM_SPREAD		0x00100000UL	/* Spread some memory over cpuset */
 +#define SLAB_BC		0x00200000UL	/* Account with BC */
 +#define SLAB_BC_NOCHARGE	0x00400000UL	/* Explicit accounting */
 
 /* flags passed to a constructor func */
 #define	SLAB_CTOR_CONSTRUCTOR	0x001UL		/* if not set, then deconstructor */
 @@ -291,6 +293,8 @@ extern kmem_cache_t	*fs_cachep;
 extern kmem_cache_t	*sighand_cachep;
 extern kmem_cache_t	*bio_cachep;
 
 +struct beancounter;
 +struct beancounter **kmem_cache_bcp(kmem_cache_t *cachep, void *obj);
 #endif	/* __KERNEL__ */
 
 #endif	/* _LINUX_SLAB_H */
 --- ./include/linux/vmalloc.h.bckmemcore	2006-09-05 12:53:59.000000000 +0400
 +++ ./include/linux/vmalloc.h	2006-09-05 12:54:40.000000000 +0400
 @@ -36,6 +36,7 @@ struct vm_struct {
 *	Highlevel APIs for driver use
 */
 extern void *vmalloc(unsigned long size);
 +extern void *vmalloc_bc(unsigned long size);
 extern void *vmalloc_user(unsigned long size);
 extern void *vmalloc_node(unsigned long size, int node);
 extern void *vmalloc_exec(unsigned long size);
 --- ./kernel/bc/Makefile.bckmemcore	2006-09-05 12:54:24.000000000 +0400
 +++ ./kernel/bc/Makefile	2006-09-05 12:54:50.000000000 +0400
 @@ -7,3 +7,4 @@
 obj-y += beancounter.o
 obj-y += misc.o
 obj-y += sys.o
 +obj-y += kmem.o
 --- ./kernel/bc/beancounter.c.bckmemcore	2006-09-05 12:54:21.000000000 +0400
 +++ ./kernel/bc/beancounter.c	2006-09-05 12:55:13.000000000 +0400
 @@ -20,6 +20,7 @@ static void init_beancounter_struct(stru
 struct beancounter init_bc;
 
 const char *bc_rnames[] = {
 +	"kmemsize",	/* 0 */
 };
 
 #define BC_HASH_BITS		8
 @@ -230,6 +231,8 @@ static void init_beancounter_syslimits(s
 {
 int k;
 
 +	bc->bc_parms[BC_KMEMSIZE].limit = 32 * 1024 * 1024;
 +
 for (k = 0; k < BC_RESOURCES; k++)
 bc->bc_parms[k].barrier = bc->bc_parms[k].limit;
 }
 --- /dev/null	2006-07-18 14:52:43.075228448 +0400
 +++ ./kernel/bc/kmem.c	2006-09-05 12:54:40.000000000 +0400
 @@ -0,0 +1,85 @@
 +/*
 + *  kernel/bc/kmem.c
 + *
 + *  Copyright (C) 2006 OpenVZ. SWsoft Inc
 + *
 + */
 +
 +#include <linux/sched.h>
 +#include <linux/gfp.h>
 +#include <linux/slab.h>
 +#include <linux/mm.h>
 +
 +#include <bc/beancounter.h>
 +#include <bc/kmem.h>
 +#include <bc/task.h>
 +
 +/*
 + * Slab accounting
 + */
 +
 +int bc_slab_charge(kmem_cache_t *cachep, void *objp, gfp_t flags)
 +{
 +	unsigned int size;
 +	struct beancounter *bc, **slab_bcp;
 +
 +	bc = get_exec_bc();
 +
 +	size = kmem_cache_size(cachep);
 +	if (bc_charge(bc, BC_KMEMSIZE, size,
 +			(flags & __GFP_BC_LIMIT ? BC_LIMIT : BC_BARRIER)))
 +		return -ENOMEM;
 +
 +	slab_bcp = kmem_cache_bcp(cachep, objp);
 +	*slab_bcp = get_beancounter(bc);
 +	return 0;
 +}
 +
 +void bc_slab_uncharge(kmem_cache_t *cachep, void *objp)
 +{
 +	unsigned int size;
 +	struct beancounter *bc, **slab_bcp;
 +
 +	slab_bcp = kmem_cache_bcp(cachep, objp);
 +	if (*slab_bcp == NULL)
 +		return;
 +
 +	bc = *slab_bcp;
 +	size = kmem_cache_size(cachep);
 +	bc_uncharge(bc, BC_KMEMSIZE, size);
 +	put_beancounter(bc);
 +	*slab_bcp = NULL;
 +}
 +
 +/*
 + * Pages accounting
 + */
 +
 +int bc_page_charge(struct page *page, int order, gfp_t flags)
 +{
 +	struct beancounter *bc;
 +
 +	BUG_ON(page_bc(page) != NULL);
 +
 +	bc = get_exec_bc();
 +
 +	if (bc_charge(bc, BC_KMEMSIZE, PAGE_SIZE << order,
 +			(flags & __GFP_BC_LIMIT ? BC_LIMIT : BC_BARRIER)))
 +		return -ENOMEM;
 +
 +	page_bc(page) = get_beancounter(bc);
 +	return 0;
 +}
 +
 +void bc_page_uncharge(struct page *page, int order)
 +{
 +	struct beancounter *bc;
 +
 +	bc = page_bc(page);
 +	if (bc == NULL)
 +		return;
 +
 +	bc_uncharge(bc, BC_KMEMSIZE, PAGE_SIZE << order);
 +	put_beancounter(bc);
 +	page_bc(page) = NULL;
 +}
 --- ./mm/mempool.c.bckmemcore	2006-09-05 12:53:59.000000000 +0400
 +++ ./mm/mempool.c	2006-09-05 12:54:40.000000000 +0400
 @@ -119,6 +119,7 @@ int mempool_resize(mempool_t *pool, int
 unsigned long flags;
 
 BUG_ON(new_min_nr <= 0);
 +	gfp_mask &= ~__GFP_BC;
 
 spin_lock_irqsave(&pool->lock, flags);
 if (new_min_nr <= pool->min_nr) {
 @@ -212,6 +213,7 @@ void * mempool_alloc(mempool_t *pool, gf
 gfp_mask |= __GFP_NOMEMALLOC;	/* don't allocate emergency reserves */
 gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
 gfp_mask |= __GFP_NOWARN;	/* failures are OK */
 +	gfp_mask &= ~__GFP_BC;		/* do not charge */
 
 gfp_temp = gfp_mask & ~(__GFP_WAIT|__GFP_IO);
 
 --- ./mm/page_alloc.c.bckmemcore	2006-09-05 12:53:59.000000000 +0400
 +++ ./mm/page_alloc.c	2006-09-05 12:54:40.000000000 +0400
 @@ -40,6 +40,8 @@
 #include <linux/sort.h>
 #include <linux/pfn.h>
 
 +#include <bc/kmem.h>
 +
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
 #include "internal.h"
 @@ -516,6 +518,8 @@ static void __free_pages_ok(struct page
 if (reserved)
 return;
 
 +	bc_page_uncharge(page, order);
 +
 kernel_map_pages(page, 1 << order, 0);
 local_irq_save(flags);
 __count_vm_events(PGFREE, 1 << order);
 @@ -799,6 +803,8 @@ static void fastcall free_hot_cold_page(
 if (free_pages_check(page))
 return;
 
 +	bc_page_uncharge(page, 0);
 +
 kernel_map_pages(page, 1, 0);
 
 pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
 @@ -1188,6 +1194,11 @@ nopage:
 show_mem();
 }
 got_pg:
 +	if ((gfp_mask & __GFP_BC) &&
 +			bc_page_charge(page, order, gfp_mask)) {
 +		__free_pages(page, order);
 +		page = NULL;
 +	}
 #ifdef CONFIG_PAGE_OWNER
 if (page)
 set_page_owner(page, order, gfp_mask);
 --- ./mm/slab.c.bckmemcore	2006-09-05 12:53:59.000000000 +0400
 +++ ./mm/slab.c	2006-09-05 12:54:40.000000000 +0400
 @@ -108,6 +108,8 @@
 #include	<linux/mutex.h>
 #include	<linux/rtmutex.h>
 
 +#include	<bc/kmem.h>
 +
 #include	<asm/uaccess.h>
 #include	<asm/cacheflush.h>
 #include	<asm/tlbflush.h>
 @@ -175,11 +177,13 @@
 SLAB_CACHE_DMA | \
 SLAB_MUST_HWCACHE_ALIGN | SLAB_STORE_USER | \
 SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | \
 +			 SLAB_BC | SLAB_BC_NOCHARGE | \
 SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD)
 #else
 # define CREATE_MASK	(SLAB_HWCACHE_ALIGN | \
 SLAB_CACHE_DMA | SLAB_MUST_HWCACHE_ALIGN | \
 SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | \
 +			 SLAB_BC | SLAB_BC_NOCHARGE | \
 SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD)
 #endif
 
 @@ -793,9 +797,33 @@ static struct kmem_cache *kmem_find_gene
 return __find_general_cachep(size, gfpflags);
 }
 
 -static size_t slab_mgmt_size(size_t nr_objs, size_t align)
 +static size_t slab_mgmt_size_raw(size_t nr_objs)
 {
 -	return ALIGN(sizeof(struct slab)+nr_objs*sizeof(kmem_bufctl_t), align);
 +	return sizeof(struct slab) + nr_objs * sizeof(kmem_bufctl_t);
 +}
 +
 +#ifdef CONFIG_BEANCOUNTERS
 +#define BC_EXTRASIZE	sizeof(struct beancounter *)
 +static inline size_t slab_mgmt_size_noalign(int flags, size_t nr_objs)
 +{
 +	size_t size;
 +
 +	size = slab_mgmt_size_raw(nr_objs);
 +	if (flags & SLAB_BC)
 +		size = ALIGN(size, BC_EXTRASIZE) + nr_objs * BC_EXTRASIZE;
 +	return size;
 +}
 +#else
 +#define BC_EXTRASIZE	0
 +static inline size_t slab_mgmt_size_noalign(int flags, size_t nr_objs)
 +{
 +	return slab_mgmt_size_raw(nr_objs);
 +}
 +#endif
 +
 +static inline size_t slab_mgmt_size(int flags, size_t nr_objs, size_t align)
 +{
 +	return ALIGN(slab_mgmt_size_noalign(flags, nr_objs), align);
 }
 
 /*
 @@ -840,20 +868,21 @@ static void cache_estimate(unsigned long
 * into account.
 */
 nr_objs = (slab_size - sizeof(struct slab)) /
 -			  (buffer_size + sizeof(kmem_bufctl_t));
 +			  (buffer_size + sizeof(kmem_bufctl_t) +
 +			  (flags & SLAB_BC ? BC_EXTRASIZE : 0));
 
 /*
 * This calculated number will be either the right
 * amount, or one greater than what we want.
 */
 -		if (slab_mgmt_size(nr_objs, align) + nr_objs*buffer_size
 +		if (slab_mgmt_size(flags, nr_objs, align) + nr_objs*buffer_size
 > slab_size)
 nr_objs--;
 
 if (nr_objs > SLAB_LIMIT)
 nr_objs = SLAB_LIMIT;
 
 -		mgmt_size = slab_mgmt_size(nr_objs, align);
 +		mgmt_size = slab_mgmt_size(flags, nr_objs, align);
 }
 *num = nr_objs;
 *left_over = slab_size - nr_objs*buffer_size - mgmt_size;
 @@ -1412,7 +1441,8 @@ void __init kmem_cache_init(void)
 sizes[INDEX_AC].cs_cachep = kmem_cache_create(names[INDEX_AC].name,
 sizes[INDEX_AC].cs_size,
 ARCH_KMALLOC_MINALIGN,
 -					ARCH_KMALLOC_FLAGS|SLAB_PANIC,
 +					ARCH_KMALLOC_FLAGS | SLAB_BC |
 +						SLAB_BC_NOCHARGE | SLAB_PANIC,
 NULL, NULL);
 
 if (INDEX_AC != INDEX_L3) {
 @@ -1420,7 +1450,8 @@ void __init kmem_cache_init(void)
 kmem_cache_create(names[INDEX_L3].name,
 sizes[INDEX_L3].cs_size,
 ARCH_KMALLOC_MINALIGN,
 -				ARCH_KMALLOC_FLAGS|SLAB_PANIC,
 +				ARCH_KMALLOC_FLAGS | SLAB_BC |
 +					SLAB_BC_NOCHARGE | SLAB_PANIC,
 NULL, NULL);
 }
 
 @@ -1438,7 +1469,8 @@ void __init kmem_cache_init(void)
 sizes->cs_cachep = kmem_cache_create(names->name,
 sizes->cs_size,
 ARCH_KMALLOC_MINALIGN,
 -					ARCH_KMALLOC_FLAGS|SLAB_PANIC,
 +					ARCH_KMALLOC_FLAGS | SLAB_BC |
 +						SLAB_BC_NOCHARGE | SLAB_PANIC,
 NULL, NULL);
 }
 
 @@ -1941,7 +1973,8 @@ static size_t calculate_slab_order(struc
 * looping condition in cache_grow().
 */
 offslab_limit = size - sizeof(struct slab);
 -			offslab_limit /= sizeof(kmem_bufctl_t);
 +			offslab_limit /= (sizeof(kmem_bufctl_t) +
 +					(flags & SLAB_BC ? BC_EXTRASIZE : 0));
 
 if (num > offslab_limit)
 break;
 @@ -2249,8 +2282,8 @@ kmem_cache_create (const char *name, siz
 cachep = NULL;
 goto oops;
 }
 -	slab_size = ALIGN(cachep->num * sizeof(kmem_bufctl_t)
 -			  + sizeof(struct slab), align);
 +
 +	slab_size = slab_mgmt_size(flags, cachep->num, align);
 
 /*
 * If the slab has been placed off-slab, and we have enough space then
 @@ -2261,11 +2294,9 @@ kmem_cache_create (const char *name, siz
 left_over -= slab_size;
 }
 
 -	if (flags & CFLGS_OFF_SLAB) {
 +	if (flags & CFLGS_OFF_SLAB)
 /* really off slab. No need for manual alignment */
 -		slab_size =
 -		    cachep->num * sizeof(kmem_bufctl_t) + sizeof(struct slab);
 -	}
 +		slab_size = slab_mgmt_size_noalign(flags, cachep->num);
 
 cachep->colour_off = cache_line_size();
 /* Offset must be a multiple of the alignment. */
 @@ -2509,6 +2540,30 @@ void kmem_cache_destroy(struct kmem_cach
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
 
 +static inline kmem_bufctl_t *slab_bufctl(struct slab *slabp)
 +{
 +	return (kmem_bufctl_t *) (slabp + 1);
 +}
 +
 +#ifdef CONFIG_BEANCOUNTERS
 +static inline struct beancounter **slab_bc_ptrs(kmem_cache_t *cachep,
 +		struct slab *slabp)
 +{
 +	return (struct beancounter **) ALIGN((unsigned long)
 +			(slab_bufctl(slabp) + cachep->num), BC_EXTRASIZE);
 +}
 +
 +struct beancounter **kmem_cache_bcp(kmem_cache_t *cachep, void *objp)
 +{
 +	struct slab *slabp;
 +	struct beancounter **bcs;
 +
 +	slabp = virt_to_slab(objp);
 +	bcs = slab_bc_ptrs(cachep, slabp);
 +	return bcs + obj_to_index(cachep, slabp, objp);
 +}
 +#endif
 +
 /*
 * Get the memory for a slab management obj.
 * For a slab cache when the slab descriptor is off-slab, slab descriptors
 @@ -2529,7 +2584,8 @@ static struct slab *alloc_slabmgmt(struc
 if (OFF_SLAB(cachep)) {
 /* Slab management obj is off-slab. */
 slabp = kmem_cache_alloc_node(cachep->slabp_cache,
 -					      local_flags, nodeid);
 +					      local_flags & (~__GFP_BC),
 +					      nodeid);
 if (!slabp)
 return NULL;
 } else {
 @@ -2540,14 +2596,14 @@ static struct slab *alloc_slabmgmt(struc
 slabp->colouroff = colour_off;
 slabp->s_mem = objp + colour_off;
 slabp->nodeid = nodeid;
 +#ifdef CONFIG_BEANCOUNTERS
 +	if (cachep->flags & SLAB_BC)
 +		memset(slab_bc_ptrs(cachep, slabp), 0,
 +				cachep->num * BC_EXTRASIZE);
 +#endif
 return slabp;
 }
 
 -static inline kmem_bufctl_t *slab_bufctl(struct slab *slabp)
 -{
 -	return (kmem_bufctl_t *) (slabp + 1);
 -}
 -
 static void cache_init_objs(struct kmem_cache *cachep,
 struct slab *slabp, unsigned long ctor_flags)
 {
 @@ -2725,7 +2781,7 @@ static int cache_grow(struct kmem_cache
 * Get mem for the objs.  Attempt to allocate a physical page from
 * 'nodeid'.
 */
 -	objp = kmem_getpages(cachep, flags, nodeid);
 +	objp = kmem_getpages(cachep, flags & (~__GFP_BC), nodeid);
 if (!objp)
 goto failed;
 
 @@ -3073,6 +3129,19 @@ static inline void *____cache_alloc(stru
 return objp;
 }
 
 +static inline int bc_should_charge(kmem_cache_t *cachep, gfp_t flags)
 +{
 +#ifdef CONFIG_BEANCOUNTERS
 +	if (!(cachep->flags & SLAB_BC))
 +		return 0;
 +	if (flags & __GFP_BC)
 +		return 1;
 +	if (!(cachep->flags & SLAB_BC_NOCHARGE))
 +		return 1;
 +#endif
 +	return 0;
 +}
 +
 static __always_inline void *__cache_alloc(struct kmem_cache *cachep,
 gfp_t flags, void *caller)
 {
 @@ -3086,6 +3155,12 @@ static __always_inline void *__cache_all
 local_irq_restore(save_flags);
 objp = cache_alloc_debugcheck_after(cachep, flags, objp,
 caller);
 +
 +	if (objp && bc_should_charge(cachep, flags))
 +		if (bc_slab_charge(cachep, objp, flags)) {
 +			kmem_cache_free(cachep, objp);
 +			objp = NULL;
 +		}
 prefetchw(objp);
 return objp;
 }
 @@ -3283,6 +3358,8 @@ static inline void __cache_free(struct k
 struct array_cache *ac = cpu_cache_get(cachep);
 
 check_irq_off();
 +	if (cachep->flags & SLAB_BC)
 +		bc_slab_uncharge(cachep, objp);
 objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));
 
 if (cache_free_alien(cachep, objp))
 --- ./mm/vmalloc.c.bckmemcore	2006-09-05 12:53:59.000000000 +0400
 +++ ./mm/vmalloc.c	2006-09-05 12:54:40.000000000 +0400
 @@ -520,6 +520,12 @@ void *vmalloc(unsigned long size)
 }
 EXPORT_SYMBOL(vmalloc);
 
 +void *vmalloc_bc(unsigned long size)
 +{
 +	return __vmalloc(size, GFP_KERNEL_BC | __GFP_HIGHMEM, PAGE_KERNEL);
 +}
 +EXPORT_SYMBOL(vmalloc_bc);
 +
 /**
 *	vmalloc_user  -  allocate virtually contiguous memory which has
 *			   been zeroed so it can be mapped to userspace without

[PATCH 7/13] BC: kernel memory (marks) [message #5929 is a reply to message #5922] | Tue, 05 September 2006 15:23
 Mark selected kmem caches with SLAB_BC and selected allocations with __GFP_BC
 so that the corresponding kernel resources are charged and limited.
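
Marked allocations are charged against either the barrier or the limit of BC_KMEMSIZE, depending on whether __GFP_BC_LIMIT is set (page tables in this patch pass it, most other call sites do not). The sketch below is a user-space model of that distinction. It assumes that a charge fails when the new held value would exceed the chosen threshold; the real bc_charge() is defined in the core patch and is not shown here, so `struct res_model`, `model_charge()` and `model_uncharge()` are hypothetical.

```c
#include <assert.h>

/* Hypothetical model of BC_BARRIER / BC_LIMIT severities. */
enum severity { SEV_BARRIER, SEV_LIMIT };

/* Models the barrier/limit pair of one resource plus current usage. */
struct res_model {
	unsigned long barrier;
	unsigned long limit;
	unsigned long held;
};

/* Assumed semantics: refuse the charge if held + val would exceed the
 * threshold selected by the severity; otherwise account it. */
static int model_charge(struct res_model *r, unsigned long val,
			enum severity sev)
{
	unsigned long threshold = (sev == SEV_LIMIT) ? r->limit : r->barrier;

	assert(r->barrier <= r->limit);
	if (val > threshold || r->held > threshold - val)
		return -1;	/* the patch returns -ENOMEM to the caller */
	r->held += val;
	return 0;
}

static void model_uncharge(struct res_model *r, unsigned long val)
{
	assert(val <= r->held);
	r->held -= val;
}
```

Under these assumptions, an allocation charged against the limit can still succeed after ordinary barrier-charged allocations have started to fail, which gives the stricter call sites some headroom.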
 
 Signed-off-by: Pavel Emelianov <xemul@sw.ru>
 Signed-off-by: Kirill Korotaev <dev@sw.ru>
 
 ---
 
 arch/i386/kernel/ldt.c           |    4 ++--
 arch/i386/mm/init.c              |    4 ++--
 arch/i386/mm/pgtable.c           |    6 ++++--
 drivers/char/tty_io.c            |   10 +++++-----
 fs/file.c                        |    8 ++++----
 fs/locks.c                       |    2 +-
 fs/namespace.c                   |    3 ++-
 fs/select.c                      |    7 ++++---
 include/asm-i386/thread_info.h   |    4 ++--
 include/asm-ia64/pgalloc.h       |   24 +++++++++++++++++-------
 include/asm-x86_64/pgalloc.h     |   12 ++++++++----
 include/asm-x86_64/thread_info.h |    5 +++--
 ipc/msgutil.c                    |    4 ++--
 ipc/sem.c                        |    7 ++++---
 ipc/util.c                       |    8 ++++----
 kernel/fork.c                    |   15 ++++++++-------
 kernel/posix-timers.c            |    3 ++-
 kernel/signal.c                  |    2 +-
 kernel/user.c                    |    2 +-
 mm/rmap.c                        |    3 ++-
 mm/shmem.c                       |    3 ++-
 21 files changed, 80 insertions(+), 56 deletions(-)
 
 --- ./arch/i386/kernel/ldt.c.bckmemch	2006-09-05 12:53:51.000000000 +0400
 +++ ./arch/i386/kernel/ldt.c	2006-09-05 12:58:17.000000000 +0400
 @@ -39,9 +39,9 @@ static int alloc_ldt(mm_context_t *pc, i
 oldsize = pc->size;
 mincount = (mincount+511)&(~511);
 if (mincount*LDT_ENTRY_SIZE > PAGE_SIZE)
 -		newldt = vmalloc(mincount*LDT_ENTRY_SIZE);
 +		newldt = vmalloc_bc(mincount*LDT_ENTRY_SIZE);
 else
 -		newldt = kmalloc(mincount*LDT_ENTRY_SIZE, GFP_KERNEL);
 +		newldt = kmalloc(mincount*LDT_ENTRY_SIZE, GFP_KERNEL_BC);
 
 if (!newldt)
 return -ENOMEM;
 --- ./arch/i386/mm/init.c.bckmemch	2006-09-05 12:53:51.000000000 +0400
 +++ ./arch/i386/mm/init.c	2006-09-05 12:58:17.000000000 +0400
 @@ -709,7 +709,7 @@ void __init pgtable_cache_init(void)
 pmd_cache = kmem_cache_create("pmd",
 PTRS_PER_PMD*sizeof(pmd_t),
 PTRS_PER_PMD*sizeof(pmd_t),
 -					0,
 +					SLAB_BC,
 pmd_ctor,
 NULL);
 if (!pmd_cache)
 @@ -718,7 +718,7 @@ void __init pgtable_cache_init(void)
 pgd_cache = kmem_cache_create("pgd",
 PTRS_PER_PGD*sizeof(pgd_t),
 PTRS_PER_PGD*sizeof(pgd_t),
 -				0,
 +				SLAB_BC,
 pgd_ctor,
 PTRS_PER_PMD == 1 ? pgd_dtor : NULL);
 if (!pgd_cache)
 --- ./arch/i386/mm/pgtable.c.bckmemch	2006-09-05 12:53:51.000000000 +0400
 +++ ./arch/i386/mm/pgtable.c	2006-09-05 12:58:17.000000000 +0400
 @@ -186,9 +186,11 @@ struct page *pte_alloc_one(struct mm_str
 struct page *pte;
 
 #ifdef CONFIG_HIGHPTE
 -	pte =  alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO , 0);
 +	pte =  alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO |
 +			__GFP_BC | __GFP_BC_LIMIT, 0);
 #else
 -	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 +	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO|
 +			__GFP_BC | __GFP_BC_LIMIT, 0);
 #endif
 return pte;
 }
 --- ./drivers/char/tty_io.c.bckmemch	2006-09-05 12:53:52.000000000 +0400
 +++ ./drivers/char/tty_io.c	2006-09-05 12:58:17.000000000 +0400
 @@ -165,7 +165,7 @@ static void release_mem(struct tty_struc
 
 static struct tty_struct *alloc_tty_struct(void)
 {
 -	return kzalloc(sizeof(struct tty_struct), GFP_KERNEL);
 +	return kzalloc(sizeof(struct tty_struct), GFP_KERNEL_BC);
 }
 
 static void tty_buffer_free_all(struct tty_struct *);
 @@ -1904,7 +1904,7 @@ static int init_dev(struct tty_driver *d
 
 if (!*tp_loc) {
 tp = (struct termios *) kmalloc(sizeof(struct termios),
 -						GFP_KERNEL);
 +						GFP_KERNEL_BC);
 if (!tp)
 goto free_mem_out;
 *tp = driver->init_termios;
 @@ -1912,7 +1912,7 @@ static int init_dev(struct tty_driver *d
 
 if (!*ltp_loc) {
 ltp = (struct termios *) kmalloc(sizeof(struct termios),
 -						 GFP_KERNEL);
 +						 GFP_KERNEL_BC);
 if (!ltp)
 goto free_mem_out;
 memset(ltp, 0, sizeof(struct termios));
 @@ -1937,7 +1937,7 @@ static int init_dev(struct tty_driver *d
 
 if (!*o_tp_loc) {
 o_tp = (struct termios *)
 -				kmalloc(sizeof(struct termios), GFP_KERNEL);
 +				kmalloc(sizeof(struct termios), GFP_KERNEL_BC);
 if (!o_tp)
 goto free_mem_out;
 *o_tp = driver->other->init_termios;
 @@ -1945,7 +1945,7 @@ static int init_dev(struct tty_driver *d
 
 if (!*o_ltp_loc) {
 o_ltp = (struct termios *)
 -				kmalloc(sizeof(struct termios), GFP_KERNEL);
 +				kmalloc(sizeof(struct termios), GFP_KERNEL_BC);
 if (!o_ltp)
 goto free_mem_out;
 memset(o_ltp, 0, sizeof(struct termios));
 --- ./fs/file.c.bckmemch	2006-09-05 12:53:55.000000000 +0400
 +++ ./fs/file.c	2006-09-05 12:58:17.000000000 +0400
 @@ -44,9 +44,9 @@ struct file ** alloc_fd_array(int num)
 int size = num * sizeof(struct file *);
 
 if (size <= PAGE_SIZE)
 -		new_fds = (struct file **) kmalloc(size, GFP_KERNEL);
 +		new_fds = (struct file **) kmalloc(size, GFP_KERNEL_BC);
 else
 -		new_fds = (struct file **) vmalloc(size);
 +		new_fds = (struct file **) vmalloc_bc(size);
 return new_fds;
 }
 
 @@ -213,9 +213,9 @@ fd_set * alloc_fdset(int num)
 int size = num / 8;
 
 if (size <= PAGE_SIZE)
 -		new_fdset = (fd_set *) kmalloc(size, GFP_KERNEL);
 +		new_fdset = (fd_set *) kmalloc(size, GFP_KERNEL_BC);
 else
 -		new_fdset = (fd_set *) vmalloc(size);
 +		new_fdset = (fd_set *) vmalloc_bc(size);
 return new_fdset;
 }
 
 --- ./fs/locks.c.bckmemch	2006-09-05 12:53:55.000000000 +0400
 +++ ./fs/locks.c	2006-09-05 12:58:17.000000000 +0400
 @@ -2228,7 +2228,7 @@ EXPORT_SYMBOL(lock_may_write);
 static int __init filelock_init(void)
 {
 filelock_cache = kmem_cache_create("file_lock_cache",
 -			sizeof(struct file_lock), 0, SLAB_PANIC,
 +			sizeof(struct file_lock), 0, SLAB_PANIC | SLAB_BC,
 init_once, NULL);
 return 0;
 }
 --- ./fs/namespace.c.bckmemch	2006-09-05 12:53:55.000000000 +0400
 +++ ./fs/namespace.c	2006-09-05 12:58:17.000000000 +0400
 @@ -1812,7 +1812,8 @@ void __init mnt_init(unsigned long mempa
 init_rwsem(&namespace_sem);
 
 mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct vfsmount),
 -			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL, NULL);
 +			0, SLAB_HWCACHE_ALIGN | SLAB_BC | SLAB_PANIC,
 +			NULL, NULL);
 
 mount_hashtable = (struct list_head *)__get_free_page(GFP_ATOMIC);
 
 --- ./fs/select.c.bckmemch	2006-09-05 12:53:55.000000000 +0400
 +++ ./fs/select.c	2006-09-05 12:58:17.000000000 +0400
 @@ -103,7 +103,8 @@ static struct poll_table_entry *poll_get
 if (!table || POLL_TABLE_FULL(table)) {
 struct poll_table_page *new_table;
 
 -		new_table = (struct poll_table_page *) __get_free_page(GFP_KERNEL);
 +		new_table = (struct poll_table_page *)
 +			__get_free_page(GFP_KERNEL_BC);
 if (!new_table) {
 p->error = -ENOMEM;
 __set_current_state(TASK_RUNNING);
 @@ -339,7 +340,7 @@ static int core_sys_select(int n, fd_set
 if (size > sizeof(stack_fds) / 6) {
 /* Not enough space in on-stack array; must use kmalloc */
 ret = -ENOMEM;
 -		bits = kmalloc(6 * size, GFP_KERNEL);
 +		bits = kmalloc(6 * size, GFP_KERNEL_BC);
 if (!bits)
 goto out_nofds;
 }
 @@ -693,7 +694,7 @@ int do_sys_poll(struct pollfd __user *uf
 if (!stack_pp)
 stack_pp = pp = (struct poll_list *)stack_pps;
 else {
 -			pp = kmalloc(size, GFP_KERNEL);
 +			pp = kmalloc(size, GFP_KERNEL_BC);
 if (!pp)
 goto out_fds;
 }
 --- ./include/asm-i386/thread_info.h.bckmemch	2006-07-10 12:39:19.000000000 +0400
 +++ ./include/asm-i386/thread_info.h	2006-09-05 12:58:17.000000000 +0400
 @@ -99,13 +99,13 @@ static inline struct thread_info *curren
 ({							\
 struct thread_info *ret;			\
 \
 -		ret = kmalloc(THREAD_SIZE, GFP_KERNEL);		\
 +		ret = kmalloc(THREAD_SIZE, GFP_KERNEL_BC);	\
 if (ret)					\
 memset(ret, 0, THREAD_SIZE);		\
 ret;						\
 })
 #else
 -#define alloc_thread_info(tsk) kmalloc(THREAD_SIZE, GFP_KERNEL)
 +#define alloc_thread_info(tsk) kmalloc(THREAD_SIZE, GFP_KERNEL_BC)
 #endif
 
 #define free_thread_info(info)	kfree(info)
 --- ./include/asm-ia64/pgalloc.h.bckmemch	2006-07-10 12:39:19.000000000 +0400
 +++ ./include/asm-ia64/pgalloc.h	2006-09-05 12:58:17.000000000 +0400
 @@ -19,6 +19,8 @@
 #include <linux/page-flags.h>
 #include <linux/threads.h>
 
 +#include <bc/kmem.h>
 +
 #include <asm/mmu_context.h>
 
 DECLARE_PER_CPU(unsigned long *, __pgtable_quicklist);
 @@ -37,7 +39,7 @@ static inline long pgtable_quicklist_tot
 return ql_size;
 }
 
 -static inline void *pgtable_quicklist_alloc(void)
 +static inline void *pgtable_quicklist_alloc(int charge)
 {
 unsigned long *ret = NULL;
 
 @@ -45,13 +47,20 @@ static inline void *pgtable_quicklist_al
 
 ret = pgtable_quicklist;
 if (likely(ret != NULL)) {
 +		if (charge && bc_page_charge(virt_to_page(ret),
 +					0, __GFP_BC_LIMIT)) {
 +			ret = NULL;
 +			goto out;
 +		}
 pgtable_quicklist = (unsigned long *)(*ret);
 ret[0] = 0;
 --pgtable_quicklist_size;
 +out:
 preempt_enable();
 } else {
 preempt_enable();
 -		ret = (unsigned long *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
 +		ret = (unsigned long *)__get_free_page(GFP_KERNEL |
 +				__GFP_ZERO | __GFP_BC | __GFP_BC_LIMIT);
 }
 
 return ret;
 @@ -69,6 +78,7 @@ static inline void pgtable_quicklist_fre
 #endif
 
 preempt_disable();
 +	bc_page_uncharge(virt_to_page(pgtable_entry), 0);
 *(unsigned long *)pgtable_entry = (unsigned long)pgtable_quicklist;
 pgtable_quicklist = (unsigned long *)pgtable_entry;
 ++pgtable_quicklist_size;
 @@ -77,7 +87,7 @@ static inline void pgtable_quicklist_fre
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
 -	return pgtable_quicklist_alloc();
 +	return pgtable_quicklist_alloc(1);
 }
 
 static inline void pgd_free(pgd_t * pgd)
 @@ -94,7 +104,7 @@ pgd_populate(struct mm_struct *mm, pgd_t
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
 -	return pgtable_quicklist_alloc();
 +	retur
...
 
 
 |  
	|  |  |  
	| 
		
			| [PATCH 8/13] BC: locked pages (core) [message #5930 is a reply to message #5922] | Tue, 05 September 2006 15:24   |  
			| 
				
				
					|  dev
	| Introduce a new resource, BC_LOCKEDPAGES, which accounts for mlock-ed user pages.
 
 Locked pages must be accounted separately
 because they are unreclaimable.
 
 Pages are charged to the beancounter of the owning mm_struct.
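 
 As a rough illustration of the accounting described above, here is a minimal user-space sketch of charging locked pages against a per-resource barrier and limit. All `*_model` names, the 4 KiB page size and the exact severity checks are assumptions made for the example; they are not taken from the BC core.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Severity names mirror the patch set; the checks below are assumed. */
enum bc_severity_model { M_BC_BARRIER, M_BC_LIMIT, M_BC_FORCE };

struct bc_resource_model {
	unsigned long held;	/* pages currently charged */
	unsigned long barrier;	/* soft threshold */
	unsigned long limit;	/* hard threshold */
	unsigned long failcnt;	/* number of rejected charges */
};

/* Charge 'pages' unless the chosen threshold would be exceeded. */
static int charge_model(struct bc_resource_model *r, unsigned long pages,
		enum bc_severity_model strict)
{
	unsigned long new_held = r->held + pages;

	if ((strict == M_BC_BARRIER && new_held > r->barrier) ||
	    (strict == M_BC_LIMIT && new_held > r->limit)) {
		r->failcnt++;
		return -ENOMEM;
	}
	/* M_BC_FORCE always succeeds and may push 'held' over the limit. */
	r->held = new_held;
	return 0;
}

static void uncharge_model(struct bc_resource_model *r, unsigned long pages)
{
	assert(r->held >= pages);	/* uncharging more than held is a bug */
	r->held -= pages;
}

/* Byte-sized wrappers, shaped like locked_charge()/locked_uncharge(). */
static int locked_charge_model(struct bc_resource_model *r,
		unsigned long bytes)
{
	return charge_model(r, bytes >> MODEL_PAGE_SHIFT, M_BC_BARRIER);
}

static void locked_uncharge_model(struct bc_resource_model *r,
		unsigned long bytes)
{
	uncharge_model(r, bytes >> MODEL_PAGE_SHIFT);
}
```

 With barrier and limit both set to 8 pages (the default this patch uses for BC_LOCKEDPAGES), locking 8 pages succeeds and the ninth page is rejected with -ENOMEM.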
 
 Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
 Signed-Off-By: Kirill Korotaev <dev@sw.ru>
 
 ---
 
 include/bc/beancounter.h |    3 -
 include/bc/vmpages.h     |   95 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/sched.h    |    3 +
 include/linux/shmem_fs.h |    5 ++
 kernel/bc/Makefile       |    1
 kernel/bc/beancounter.c  |    2
 kernel/bc/vmpages.c      |   75 +++++++++++++++++++++++++++++++++++++
 kernel/fork.c            |   11 +++--
 mm/shmem.c               |    4 +
 9 files changed, 195 insertions(+), 4 deletions(-)
 
 --- ./include/bc/beancounter.h.bclockcore	2006-09-05 12:54:40.000000000 +0400
 +++ ./include/bc/beancounter.h	2006-09-05 12:59:27.000000000 +0400
 @@ -13,8 +13,9 @@
 */
 
 #define BC_KMEMSIZE	0
 +#define BC_LOCKEDPAGES	1
 
 -#define BC_RESOURCES	1
 +#define BC_RESOURCES	2
 
 struct bc_resource_parm {
 unsigned long barrier;	/* A barrier over which resource allocations
 --- /dev/null	2006-07-18 14:52:43.075228448 +0400
 +++ ./include/bc/vmpages.h	2006-09-05 13:04:03.000000000 +0400
 @@ -0,0 +1,95 @@
 +/*
 + *  include/bc/vmpages.h
 + *
 + *  Copyright (C) 2006 OpenVZ. SWsoft Inc
 + *
 + */
 +
 +#ifndef __BC_VMPAGES_H_
 +#define __BC_VMPAGES_H_
 +
 +#include <bc/beancounter.h>
 +#include <bc/task.h>
 +
 +struct mm_struct;
 +struct file;
 +struct shmem_inode_info;
 +
 +#ifdef CONFIG_BEANCOUNTERS
 +int __must_check bc_memory_charge(struct mm_struct *mm, unsigned long size,
 +		unsigned long vm_flags, struct file *vm_file, int strict);
 +void bc_memory_uncharge(struct mm_struct *mm, unsigned long size,
 +		unsigned long vm_flags, struct file *vm_file);
 +
 +int __must_check bc_locked_charge(struct mm_struct *mm, unsigned long size);
 +void bc_locked_uncharge(struct mm_struct *mm, unsigned long size);
 +
 +int __must_check bc_locked_shm_charge(struct shmem_inode_info *info,
 +		unsigned long size);
 +void bc_locked_shm_uncharge(struct shmem_inode_info *info,
 +		unsigned long size);
 +
 +/*
 + * mm's beancounter should be the same as the exec one
 + * of the task using this mm. Thus we have two cases of its
 + * initialisation:
 + *  1. new mm is done for fork-ed task
 + *  2. new mm is done for exec-ing task
 + */
 +#define mm_init_bc(mm, t)	do {					\
 +		(mm)->mm_bc = get_beancounter((t)->task_bc.exec_bc);	\
 +	} while (0)
 +#define mm_free_bc(mm)		do {					\
 +		put_beancounter((mm)->mm_bc);				\
 +	} while (0)
 +
 +#define shmi_init_bc(info)	do {					\
 +		(info)->shm_bc = get_beancounter(get_exec_bc());	\
 +	} while (0)
 +#define shmi_free_bc(info)	do {					\
 +		put_beancounter((info)->shm_bc);			\
 +	} while (0)
 +
 +#else /* CONFIG_BEANCOUNTERS */
 +
 +static inline int __must_check bc_memory_charge(struct mm_struct *mm,
 +		unsigned long size, unsigned long vm_flags,
 +		struct file *vm_file, int strict)
 +{
 +	return 0;
 +}
 +
 +static inline void bc_memory_uncharge(struct mm_struct *mm, unsigned long size,
 +		unsigned long vm_flags, struct file *vm_file)
 +{
 +}
 +
 +static inline int __must_check bc_locked_charge(struct mm_struct *mm,
 +		unsigned long size)
 +{
 +	return 0;
 +}
 +
 +static inline void bc_locked_uncharge(struct mm_struct *mm, unsigned long size)
 +{
 +}
 +
 +static inline int __must_check bc_locked_shm_charge(struct shmem_inode_info *i,
 +		unsigned long size)
 +{
 +	return 0;
 +}
 +
 +static inline void bc_locked_shm_uncharge(struct shmem_inode_info *i,
 +		unsigned long size)
 +{
 +}
 +
 +#define mm_init_bc(mm, t)	do { } while (0)
 +#define mm_free_bc(mm)		do { } while (0)
 +#define shmi_init_bc(info)	do { } while (0)
 +#define shmi_free_bc(info)	do { } while (0)
 +
 +#endif /* CONFIG_BEANCOUNTERS */
 +#endif
 +
 --- ./include/linux/sched.h.bclockcore	2006-09-05 12:54:21.000000000 +0400
 +++ ./include/linux/sched.h	2006-09-05 12:59:27.000000000 +0400
 @@ -358,6 +358,9 @@ struct mm_struct {
 /* aio bits */
 rwlock_t		ioctx_list_lock;
 struct kioctx		*ioctx_list;
 +#ifdef CONFIG_BEANCOUNTERS
 +	struct beancounter	*mm_bc;
 +#endif
 };
 
 struct sighand_struct {
 --- ./include/linux/shmem_fs.h.bclockcore	2006-04-21 11:59:36.000000000 +0400
 +++ ./include/linux/shmem_fs.h	2006-09-05 12:59:27.000000000 +0400
 @@ -8,6 +8,8 @@
 
 #define SHMEM_NR_DIRECT 16
 
 +struct beancounter;
 +
 struct shmem_inode_info {
 spinlock_t		lock;
 unsigned long		flags;
 @@ -19,6 +21,9 @@ struct shmem_inode_info {
 swp_entry_t		i_direct[SHMEM_NR_DIRECT]; /* first blocks */
 struct list_head	swaplist;	/* chain of maybes on swap */
 struct inode		vfs_inode;
 +#ifdef CONFIG_BEANCOUNTERS
 +	struct beancounter	*shm_bc;
 +#endif
 };
 
 struct shmem_sb_info {
 --- ./kernel/bc/Makefile.bclockcore	2006-09-05 12:54:50.000000000 +0400
 +++ ./kernel/bc/Makefile	2006-09-05 12:59:37.000000000 +0400
 @@ -8,3 +8,4 @@ obj-y += beancounter.o
 obj-y += misc.o
 obj-y += sys.o
 obj-y += kmem.o
 +obj-y += vmpages.o
 --- ./kernel/bc/beancounter.c.bclockcore	2006-09-05 12:55:13.000000000 +0400
 +++ ./kernel/bc/beancounter.c	2006-09-05 12:59:45.000000000 +0400
 @@ -21,6 +21,7 @@ struct beancounter init_bc;
 
 const char *bc_rnames[] = {
 "kmemsize",	/* 0 */
 +	"lockedpages",
 };
 
 #define BC_HASH_BITS		8
 @@ -232,6 +233,7 @@ static void init_beancounter_syslimits(s
 int k;
 
 bc->bc_parms[BC_KMEMSIZE].limit = 32 * 1024 * 1024;
 +	bc->bc_parms[BC_LOCKEDPAGES].limit = 8;
 
 for (k = 0; k < BC_RESOURCES; k++)
 bc->bc_parms[k].barrier = bc->bc_parms[k].limit;
 --- /dev/null	2006-07-18 14:52:43.075228448 +0400
 +++ ./kernel/bc/vmpages.c	2006-09-05 12:59:27.000000000 +0400
 @@ -0,0 +1,75 @@
 +/*
 + *  kernel/bc/vmpages.c
 + *
 + *  Copyright (C) 2006 OpenVZ. SWsoft Inc
 + *
 + */
 +
 +#include <linux/sched.h>
 +#include <linux/mm.h>
 +#include <linux/shmem_fs.h>
 +
 +#include <bc/beancounter.h>
 +#include <bc/vmpages.h>
 +
 +#include <asm/page.h>
 +
 +int bc_memory_charge(struct mm_struct *mm, unsigned long size,
 +		unsigned long vm_flags, struct file *vm_file, int strict)
 +{
 +	struct beancounter *bc;
 +
 +	bc = mm->mm_bc;
 +	size >>= PAGE_SHIFT;
 +
 +	if (vm_flags & VM_LOCKED)
 +		if (bc_charge(bc, BC_LOCKEDPAGES, size, strict))
 +			return -ENOMEM;
 +	return 0;
 +}
 +
 +void bc_memory_uncharge(struct mm_struct *mm, unsigned long size,
 +		unsigned long vm_flags, struct file *vm_file)
 +{
 +	struct beancounter *bc;
 +
 +	bc = mm->mm_bc;
 +	size >>= PAGE_SHIFT;
 +
 +	if (vm_flags & VM_LOCKED)
 +		bc_uncharge(bc, BC_LOCKEDPAGES, size);
 +}
 +
 +static inline int locked_charge(struct beancounter *bc,
 +		unsigned long size)
 +{
 +	size >>= PAGE_SHIFT;
 +	return bc_charge(bc, BC_LOCKEDPAGES, size, BC_BARRIER);
 +}
 +
 +static inline void locked_uncharge(struct beancounter *bc,
 +		unsigned long size)
 +{
 +	size >>= PAGE_SHIFT;
 +	bc_uncharge(bc, BC_LOCKEDPAGES, size);
 +}
 +
 +int bc_locked_charge(struct mm_struct *mm, unsigned long size)
 +{
 +	return locked_charge(mm->mm_bc, size);
 +}
 +
 +void bc_locked_uncharge(struct mm_struct *mm, unsigned long size)
 +{
 +	locked_uncharge(mm->mm_bc, size);
 +}
 +
 +int bc_locked_shm_charge(struct shmem_inode_info *info, unsigned long size)
 +{
 +	return locked_charge(info->shm_bc, size);
 +}
 +
 +void bc_locked_shm_uncharge(struct shmem_inode_info *info, unsigned long size)
 +{
 +	locked_uncharge(info->shm_bc, size);
 +}
 --- ./kernel/fork.c.bclockcore	2006-09-05 12:58:17.000000000 +0400
 +++ ./kernel/fork.c	2006-09-05 12:59:59.000000000 +0400
 @@ -49,6 +49,7 @@
 #include <linux/taskstats_kern.h>
 
 #include <bc/task.h>
 +#include <bc/vmpages.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
 @@ -322,7 +323,8 @@ static inline void mm_free_pgd(struct mm
 
 #include <linux/init_task.h>
 
 -static struct mm_struct * mm_init(struct mm_struct * mm)
 +static struct mm_struct * mm_init(struct mm_struct * mm,
 +		struct task_struct *tsk)
 {
 atomic_set(&mm->mm_users, 1);
 atomic_set(&mm->mm_count, 1);
 @@ -339,6 +341,7 @@ static struct mm_struct * mm_init(struct
 mm->cached_hole_size = ~0UL;
 
 if (likely(!mm_alloc_pgd(mm))) {
 +		mm_init_bc(mm, tsk);
 mm->def_flags = 0;
 return mm;
 }
 @@ -356,7 +359,7 @@ struct mm_struct * mm_alloc(void)
 mm = allocate_mm();
 if (mm) {
 memset(mm, 0, sizeof(*mm));
 -		mm = mm_init(mm);
 +		mm = mm_init(mm, current);
 }
 return mm;
 }
 @@ -371,6 +374,7 @@ void fastcall __mmdrop(struct mm_struct
 BUG_ON(mm == &init_mm);
 mm_free_pgd(mm);
 destroy_context(mm);
 +	mm_free_bc(mm);
 free_mm(mm);
 }
 
 @@ -477,7 +481,7 @@ static struct mm_struct *dup_mm(struct t
 
 memcpy(mm, oldmm, sizeof(*mm));
 
 -	if (!mm_init(mm))
 +	if (!mm_init(mm, tsk))
 goto fail_nomem;
 
 if (init_new_context(tsk, mm))
 @@ -504,6 +508,7 @@ fail_nocontext:
 * because it calls destroy_context()
 */
 mm_free_pgd(mm);
 +	mm_free_bc(mm);
 free_mm(mm);
 return NULL;
 }
 --- ./mm/shmem.c.bclockcore	2006-09-05 12:58:17.000000000 +0400
 +++ ./mm/shmem.c	2006-09-05 12:59:27.000000000 +0400
 @@ -47,6 +47,8 @@
 #include <linux/migrate.h>
 #include <linux/highmem.h>
 
 +#include <bc/vmpages.h>
 +
 #include <asm/uaccess.h>
 #include <asm/div64.h>
 #include <asm/pgtable.h>
 @@ -698,6 +700,7 @@ static void shmem_delete_inode(struct in
 sbinfo->free_inodes++;
 spin_unlock(&sbinfo->stat_lock);
 }
 +	shmi_free_bc(info);
 clear_inode(inode);
 }
 
 @@ -1359,6 +1362,7 @@ shmem_get_inode(struct super_block *sb,
 info = SHMEM_I(inode);
 memset(info, 0, (char *)inode - (char *)info);
 spin_lock_init(&info->lock);
 +		shmi_init_bc(info);
 INIT_LIST_HEAD(&info->swaplist);
 
 switch (mode & S_IFMT) {
...
 
 
 |  
	|  |  |  
	| 
		
			| [PATCH 9/13] BC: locked pages (charge hooks) [message #5931 is a reply to message #5922] | Tue, 05 September 2006 15:25   |  
			| 
				
				
					|  dev Messages: 1693
	| Add calls to the BC core throughout the kernel to charge locked memory.
 Normally a new locked piece of memory appears in insert_vm_struct(),
 but there are places (do_mmap_pgoff, dup_mmap, etc.) where a new vma
 is not inserted by insert_vm_struct() but is either link_vma-ed or
 merged with another one; these places call the BC code explicitly.
 
 In addition, sys_mlock[all] itself has to be patched to charge/uncharge
 the needed number of pages.
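 
 The hooks follow a "charge first, roll back on failure" pattern. A minimal user-space sketch of that pattern, loosely modelled on the mlock_fixup() changes in this patch, might look as follows. The `locked_counter` type, `lock_range_model()` and the `op_fails` switch (a stand-in for split_vma() failing) are hypothetical names invented for the example.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the BC_LOCKEDPAGES counter of one beancounter. */
struct locked_counter {
	unsigned long held;	/* pages currently charged */
	unsigned long barrier;	/* charges above this are rejected */
};

static int counter_charge(struct locked_counter *c, unsigned long pages)
{
	if (c->held + pages > c->barrier)
		return -ENOMEM;
	c->held += pages;
	return 0;
}

static void counter_uncharge(struct locked_counter *c, unsigned long pages)
{
	assert(c->held >= pages);
	c->held -= pages;
}

/*
 * lock != 0: charge before touching the mapping, undo the charge if the
 *            later step fails.
 * lock == 0: uncharge only after the unlock has succeeded.
 * op_fails simulates a failure of the VMA split step.
 */
static int lock_range_model(struct locked_counter *c, unsigned long pages,
		int lock, int op_fails)
{
	int ret;

	if (lock) {
		ret = counter_charge(c, pages);
		if (ret)
			return ret;
	}

	if (op_fails) {
		ret = -ENOMEM;
		goto out_uncharge;
	}

	if (!lock)
		counter_uncharge(c, pages);
	return 0;

out_uncharge:
	if (lock)
		counter_uncharge(c, pages);
	return ret;
}
```

 The point of the ordering is that a rejected charge leaves the mapping untouched, and a failed operation leaves the counter untouched.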
 
 Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
 Signed-Off-By: Kirill Korotaev <dev@sw.ru>
 
 ---
 
 fs/binfmt_elf.c            |    5 ++-
 include/asm-alpha/mman.h   |    1
 include/asm-generic/mman.h |    1
 include/asm-mips/mman.h    |    1
 include/asm-parisc/mman.h  |    1
 include/linux/mm.h         |    1
 mm/mlock.c                 |   21 +++++++++++++---
 mm/mmap.c                  |   59 ++++++++++++++++++++++++++++++++++++++-------
 mm/mremap.c                |   18 ++++++++++++-
 mm/shmem.c                 |   12 ++++++++-
 10 files changed, 104 insertions(+), 16 deletions(-)
 
 --- ./fs/binfmt_elf.c.bclockcharge	2006-09-05 12:53:54.000000000 +0400
 +++ ./fs/binfmt_elf.c	2006-09-05 13:08:26.000000000 +0400
 @@ -360,7 +360,7 @@ static unsigned long load_elf_interp(str
 eppnt = elf_phdata;
 for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
 if (eppnt->p_type == PT_LOAD) {
 -			int elf_type = MAP_PRIVATE | MAP_DENYWRITE;
 +			int elf_type = MAP_PRIVATE|MAP_DENYWRITE|MAP_EXECPRIO;
 int elf_prot = 0;
 unsigned long vaddr = 0;
 unsigned long k, map_addr;
 @@ -846,7 +846,8 @@ static int load_elf_binary(struct linux_
 if (elf_ppnt->p_flags & PF_X)
 elf_prot |= PROT_EXEC;
 
 -		elf_flags = MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE;
 +		elf_flags = MAP_PRIVATE | MAP_DENYWRITE |
 +			MAP_EXECUTABLE | MAP_EXECPRIO;
 
 vaddr = elf_ppnt->p_vaddr;
 if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
 --- ./include/asm-alpha/mman.h.mapfx	2006-04-21 11:59:35.000000000 +0400
 +++ ./include/asm-alpha/mman.h	2006-09-05 18:13:12.000000000 +0400
 @@ -28,6 +28,7 @@
 #define MAP_NORESERVE	0x10000		/* don't check for reservations */
 #define MAP_POPULATE	0x20000		/* populate (prefault) pagetables */
 #define MAP_NONBLOCK	0x40000		/* do not block on IO */
 +#define MAP_EXECPRIO	0x80000		/* charge against BC limit */
 
 #define MS_ASYNC	1		/* sync memory asynchronously */
 #define MS_SYNC		2		/* synchronous memory sync */
 --- ./include/asm-generic/mman.h.x	2006-04-21 11:59:35.000000000 +0400
 +++ ./include/asm-generic/mman.h	2006-09-05 14:02:04.000000000 +0400
 @@ -19,6 +19,7 @@
 #define MAP_TYPE	0x0f		/* Mask for type of mapping */
 #define MAP_FIXED	0x10		/* Interpret addr exactly */
 #define MAP_ANONYMOUS	0x20		/* don't use a file */
 +#define MAP_EXECPRIO	0x20000		/* charge against BC limit */
 
 #define MS_ASYNC	1		/* sync memory asynchronously */
 #define MS_INVALIDATE	2		/* invalidate the caches */
 --- ./include/asm-mips/mman.h.mapfx	2006-04-21 11:59:36.000000000 +0400
 +++ ./include/asm-mips/mman.h	2006-09-05 18:13:34.000000000 +0400
 @@ -46,6 +46,7 @@
 #define MAP_LOCKED	0x8000		/* pages are locked */
 #define MAP_POPULATE	0x10000		/* populate (prefault) pagetables */
 #define MAP_NONBLOCK	0x20000		/* do not block on IO */
 +#define MAP_EXECPRIO	0x40000		/* charge against BC limit */
 
 /*
 * Flags for msync
 --- ./include/asm-parisc/mman.h.mapfx	2006-04-21 11:59:36.000000000 +0400
 +++ ./include/asm-parisc/mman.h	2006-09-05 18:13:47.000000000 +0400
 @@ -22,6 +22,7 @@
 #define MAP_GROWSDOWN	0x8000		/* stack-like segment */
 #define MAP_POPULATE	0x10000		/* populate (prefault) pagetables */
 #define MAP_NONBLOCK	0x20000		/* do not block on IO */
 +#define MAP_EXECPRIO	0x40000		/* charge against BC limit */
 
 #define MS_SYNC		1		/* synchronous memory sync */
 #define MS_ASYNC	2		/* sync memory asynchronously */
 --- ./include/linux/mm.h.bclockcharge	2006-09-05 12:55:28.000000000 +0400
 +++ ./include/linux/mm.h	2006-09-05 13:06:37.000000000 +0400
 @@ -1103,6 +1103,7 @@ out:
 extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 extern unsigned long do_brk(unsigned long, unsigned long);
 +extern unsigned long __do_brk(unsigned long, unsigned long, int);
 
 /* filemap.c */
 extern unsigned long page_unuse(struct page *);
 --- ./mm/mlock.c.bclockcharge	2006-04-21 11:59:36.000000000 +0400
 +++ ./mm/mlock.c	2006-09-05 13:06:37.000000000 +0400
 @@ -11,6 +11,7 @@
 #include <linux/mempolicy.h>
 #include <linux/syscalls.h>
 
 +#include <bc/vmpages.h>
 
 static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 unsigned long start, unsigned long end, unsigned int newflags)
 @@ -25,6 +26,14 @@ static int mlock_fixup(struct vm_area_st
 goto out;
 }
 
 +	if (newflags & VM_LOCKED) {
 +		ret = bc_locked_charge(mm, end - start);
 +		if (ret < 0) {
 +			*prev = vma;
 +			goto out;
 +		}
 +	}
 +
 pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 *prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
 vma->vm_file, pgoff, vma_policy(vma));
 @@ -38,13 +47,13 @@ static int mlock_fixup(struct vm_area_st
 if (start != vma->vm_start) {
 ret = split_vma(mm, vma, start, 1);
 if (ret)
 -			goto out;
 +			goto out_uncharge;
 }
 
 if (end != vma->vm_end) {
 ret = split_vma(mm, vma, end, 0);
 if (ret)
 -			goto out;
 +			goto out_uncharge;
 }
 
 success:
 @@ -63,13 +72,19 @@ success:
 pages = -pages;
 if (!(newflags & VM_IO))
 ret = make_pages_present(start, end);
 -	}
 +	} else
 +		bc_locked_uncharge(mm, end - start);
 
 vma->vm_mm->locked_vm -= pages;
 out:
 if (ret == -ENOMEM)
 ret = -EAGAIN;
 return ret;
 +
 +out_uncharge:
 +	if (newflags & VM_LOCKED)
 +		bc_locked_uncharge(mm, end - start);
 +	goto out;
 }
 
 static int do_mlock(unsigned long start, size_t len, int on)
 --- ./mm/mmap.c.bclockcharge	2006-09-05 12:53:59.000000000 +0400
 +++ ./mm/mmap.c	2006-09-05 13:07:13.000000000 +0400
 @@ -26,6 +26,8 @@
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
 
 +#include <bc/vmpages.h>
 +
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
 #include <asm/tlb.h>
 @@ -220,6 +222,10 @@ static struct vm_area_struct *remove_vma
 struct vm_area_struct *next = vma->vm_next;
 
 might_sleep();
 +
 +	bc_memory_uncharge(vma->vm_mm, vma->vm_end - vma->vm_start,
 +			vma->vm_flags, vma->vm_file);
 +
 if (vma->vm_ops && vma->vm_ops->close)
 vma->vm_ops->close(vma);
 if (vma->vm_file)
 @@ -267,7 +273,7 @@ asmlinkage unsigned long sys_brk(unsigne
 goto out;
 
 /* Ok, looks good - let it rip. */
 -	if (do_brk(oldbrk, newbrk-oldbrk) != oldbrk)
 +	if (__do_brk(oldbrk, newbrk-oldbrk, BC_BARRIER) != oldbrk)
 goto out;
 set_brk:
 mm->brk = brk;
 @@ -1047,6 +1053,11 @@ munmap_back:
 }
 }
 
 +	error = bc_memory_charge(mm, len, vm_flags, file,
 +			flags & MAP_EXECPRIO ? BC_LIMIT : BC_BARRIER);
 +	if (error)
 +		goto charge_fail;
 +
 /*
 * Can we just expand an old private anonymous mapping?
 * The VM_SHARED test is necessary because shmem_zero_setup
 @@ -1160,6 +1171,8 @@ unmap_and_free_vma:
 free_vma:
 kmem_cache_free(vm_area_cachep, vma);
 unacct_error:
 +	bc_memory_uncharge(mm, len, vm_flags, file);
 +charge_fail:
 if (charged)
 vm_unacct_memory(charged);
 return error;
 @@ -1489,12 +1502,16 @@ static int acct_stack_growth(struct vm_a
 return -ENOMEM;
 }
 
 +	if (bc_memory_charge(mm, grow << PAGE_SHIFT,
 +				vma->vm_flags, vma->vm_file, BC_LIMIT))
 +		goto err_ch;
 +
 /*
 * Overcommit..  This must be the final test, as it will
 * update security statistics.
 */
 if (security_vm_enough_memory(grow))
 -		return -ENOMEM;
 +		goto err_acct;
 
 /* Ok, everything looks good - let it rip */
 mm->total_vm += grow;
 @@ -1502,6 +1519,11 @@ static int acct_stack_growth(struct vm_a
 mm->locked_vm += grow;
 vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow);
 return 0;
 +
 +err_acct:
 +	bc_memory_uncharge(mm, grow << PAGE_SHIFT, vma->vm_flags, vma->vm_file);
 +err_ch:
 +	return -ENOMEM;
 }
 
 #if defined(CONFIG_STACK_GROWSUP) || defined(CONFIG_IA64)
 @@ -1857,7 +1879,7 @@ static inline void verify_mm_writelocked
 *  anonymous maps.  eventually we may be able to do some
 *  brk-specific accounting here.
 */
 -unsigned long do_brk(unsigned long addr, unsigned long len)
 +unsigned long __do_brk(unsigned long addr, unsigned long len, int bc_strict)
 {
 struct mm_struct * mm = current->mm;
 struct vm_area_struct * vma, * prev;
 @@ -1914,6 +1936,9 @@ unsigned long do_brk(unsigned long addr,
 
 flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
 
 +	if (bc_memory_charge(mm, len, flags, NULL, bc_strict))
 +		goto out_unacct;
 +
 /* Can we just expand an old private anonymous mapping? */
 if (vma_merge(mm, prev, addr, addr + len, flags,
 NULL, NULL, pgoff, NULL))
 @@ -1923,10 +1948,8 @@ unsigned long do_brk(unsigned long addr,
 * create a vma struct for an anonymous mapping
 */
 vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
 -	if (!vma) {
 -		vm_unacct_memory(len >> PAGE_SHIFT);
 -		return -ENOMEM;
 -	}
 +	if (!vma)
 +		goto out_uncharge;
 
 vma->vm_mm = mm;
 vma->vm_start = addr;
 @@ -1943,6 +1966,17 @@ out:
 make_pages_present(addr, addr + len);
 }
 return addr;
 +
 +out_uncharge:
 +	bc_memory_uncharge(mm, len, flags, NULL);
 +out_unacct:
 +	vm_unacct_memory(len >> PAGE_SHIFT);
 +	return -ENOMEM;
 +}
 +
 +unsigned long do_brk(unsigned long addr, unsigned long len)
 +{
 +	return __do_brk(addr, len, BC_LIMIT);
 }
 
 EXPORT_SYMBOL(do_brk);
 @@ -2005,9 +2039,18 @@ int insert_vm_struct(struct mm_struct *
 return -ENOMEM;
 if ((vma->vm_flags & VM_ACCOUNT) &&
 security_vm_enough_memory(vma_pages(vma)))
 -		return -ENOMEM;
 +		goto err_acct;
 +	if (bc_memory_charge(mm, vma->vm_end - vma->vm_start,
 +				vma->vm_flags, vma->vm_file, BC_LIMIT))
 +		goto err_charge;
 vma_link(mm, vma, prev, rb_link,
...
 
 
 |  
	|  |  |  
	| 
		
			| [PATCH 10/13] BC: privvm pages [message #5932 is a reply to message #5922] | Tue, 05 September 2006 15:26   |  
			| 
				
				
					|  dev Messages: 1693
	| This patch introduces a new resource, BC_PRIVVMPAGES. It is an upper estimate of the physical memory currently in use.
 
 There are different approaches to controlling user pages:
 a) account all the mappings on mmap/brk and reject as
 soon as the sum of VMA lengths reaches the barrier.
 
 This approach is very bad, as applications always map
 more than they really use, very often MUCH more.
 
 b) account only the memory that is really used and reject as
 soon as RSS reaches the limit.
 
 This approach is not good either, as user space pages are
 allocated in the page fault handler and the only way to reject
 an allocation there is to kill the task.
 
 Compared to the previous scenario this is much worse, as the
 application won't even be able to terminate gracefully.
 
 c) account a part of the memory on mmap/brk and reject there,
 and account the rest of the memory in page fault handlers
 without any rejects.
 This type of accounting is used in UBC.
 
 d) account physical memory and behave like a standalone
 kernel: reclaim user memory when running out of it.
 
 This type of memory control is to be introduced later
 as an addition to c). UBC provides all the needed
 statistics for this (physical memory, swap pages, etc.)
 
 Privvmpages accounting is described in detail at
 http://wiki.openvz.org/User_pages_accounting
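 
 One possible reading of approach c) is sketched below in user-space C: private mappings are charged when they are created and can be rejected there, while everything else is only counted at fault time and never rejected. The structure, the function names and the split between "private" and "other" pages are assumptions made for illustration; the actual rules are the ones on the wiki page above.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-container counters for the sketch. */
struct privvm_model {
	unsigned long privvm_held;	/* pages promised at mmap/brk time */
	unsigned long privvm_barrier;	/* mmap/brk is rejected above this */
	unsigned long fault_held;	/* pages counted at page fault time */
};

/* Called when a mapping is created: the only place that can say no. */
static int map_account(struct privvm_model *m, unsigned long pages,
		int is_private)
{
	if (!is_private)
		return 0;	/* accounted later, at fault time */
	if (m->privvm_held + pages > m->privvm_barrier)
		return -ENOMEM;
	m->privvm_held += pages;
	return 0;
}

/* Called from the fault path: counts, but never rejects. */
static void fault_account(struct privvm_model *m, unsigned long pages)
{
	m->fault_held += pages;
}

/* Called when a private mapping goes away. */
static void unmap_account(struct privvm_model *m, unsigned long pages)
{
	assert(m->privvm_held >= pages);
	m->privvm_held -= pages;
}
```

 The application sees a failure only from mmap/brk, where it can still handle -ENOMEM gracefully, which is the advantage claimed over approach b).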
 
 A note about sys_mprotect: as it can change the mapping state from
 BC_VM_PRIVATE to !BC_VM_PRIVATE and vice versa, the appropriate number of
 pages is (un)charged in mprotect_fixup.
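 
 The recharge decision can be tried out in user space. The sketch below restates the BC_VM_PRIVATE test and the three-way result of bc_privvm_recharge() from this patch as plain functions; the flag values are copied from the kernel's VM_WRITE/VM_SHARED definitions, the `m_`/`M_` prefixed names are made up for the example, and `privvm_pages_delta()` is a hypothetical helper that is not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

#define M_VM_WRITE	0x2UL
#define M_VM_SHARED	0x8UL

enum { M_BC_NOCHARGE, M_BC_UNCHARGE, M_BC_CHARGE };

/* A mapping counts as private if it is writable and anonymous or not shared. */
static int m_vm_private(unsigned long flags, const void *file)
{
	if (!(flags & M_VM_WRITE))
		return 0;
	return file == NULL || !(flags & M_VM_SHARED);
}

/* Decide what an mprotect-style flag change means for privvmpages. */
static int m_privvm_recharge(unsigned long old_flags, unsigned long new_flags,
		const void *file)
{
	int priv_old = m_vm_private(old_flags, file);
	int priv_new = m_vm_private(new_flags, file);

	if (priv_old == priv_new)
		return M_BC_NOCHARGE;
	return priv_new ? M_BC_CHARGE : M_BC_UNCHARGE;
}

/* Hypothetical helper: signed change in charged pages for a decision. */
static long privvm_pages_delta(int decision, unsigned long pages)
{
	assert(decision >= M_BC_NOCHARGE && decision <= M_BC_CHARGE);
	if (decision == M_BC_CHARGE)
		return (long)pages;
	if (decision == M_BC_UNCHARGE)
		return -(long)pages;
	return 0;
}
```

 For example, making a read-only anonymous mapping writable yields a charge, while adding write permission to a shared file mapping yields no change.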
 
 Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
 Signed-Off-By: Kirill Korotaev <dev@sw.ru>
 
 ---
 
 include/bc/beancounter.h |    3 +-
 include/bc/vmpages.h     |   44 +++++++++++++++++++++++++++++++++++++++
 kernel/bc/beancounter.c  |    2 +
 kernel/bc/vmpages.c      |   53 ++++++++++++++++++++++++++++++++++++++++++++---
 kernel/fork.c            |    9 +++++++
 mm/mprotect.c            |   17 ++++++++++++++-
 mm/shmem.c               |    7 ++++++
 7 files changed, 129 insertions(+), 6 deletions(-)
 
 --- ./include/bc/beancounter.h.bcprivvm	2006-09-05 12:59:27.000000000 +0400
 +++ ./include/bc/beancounter.h	2006-09-05 13:17:50.000000000 +0400
 @@ -14,8 +14,9 @@
 
 #define BC_KMEMSIZE	0
 #define BC_LOCKEDPAGES	1
 +#define BC_PRIVVMPAGES	2
 
 -#define BC_RESOURCES	2
 +#define BC_RESOURCES	3
 
 struct bc_resource_parm {
 unsigned long barrier;	/* A barrier over which resource allocations
 --- ./include/bc/vmpages.h.bcprivvm	2006-09-05 13:04:03.000000000 +0400
 +++ ./include/bc/vmpages.h	2006-09-05 13:38:07.000000000 +0400
 @@ -8,6 +8,8 @@
 #ifndef __BC_VMPAGES_H_
 #define __BC_VMPAGES_H_
 
 +#include <linux/mm.h>
 +
 #include <bc/beancounter.h>
 #include <bc/task.h>
 
 @@ -15,12 +17,37 @@ struct mm_struct;
 struct file;
 struct shmem_inode_info;
 
 +/*
 + * sys_mprotect() can change mapping state from private to
 + * shared and vice versa. Thus recharging is needed, but
 + * with the following rules:
 + * 1. No state change   : nothing to be done at all;
 + * 2. shared -> private : need to charge before operation starts
 + *                        and roll back on error path;
 + * 3. private -> shared : need to uncharge after successful state
 + *                        change. Uncharging first and charging back
 + *                        on error path isn't good as charge will have
 + *                        to be BC_FORCE and thus can potentially create
 + *                        an overcharged privvmpages.
 + */
 +#define BC_NOCHARGE	0
 +#define BC_UNCHARGE	1 /* private -> shared */
 +#define BC_CHARGE	2 /* shared -> private */
 +
 +#define BC_VM_PRIVATE(flags, file) ( ((flags) & VM_WRITE) ? \
 +			( (file) == NULL || !((flags) & VM_SHARED) ) : 0 )
 +
 #ifdef CONFIG_BEANCOUNTERS
 int __must_check bc_memory_charge(struct mm_struct *mm, unsigned long size,
 unsigned long vm_flags, struct file *vm_file, int strict);
 void bc_memory_uncharge(struct mm_struct *mm, unsigned long size,
 unsigned long vm_flags, struct file *vm_file);
 
 +int __must_check bc_privvm_recharge(unsigned long old_flags,
 +		unsigned long new_flags, struct file *vm_file);
 +int __must_check bc_privvm_charge(struct mm_struct *mm, unsigned long size);
 +void bc_privvm_uncharge(struct mm_struct *mm, unsigned long size);
 +
 int __must_check bc_locked_charge(struct mm_struct *mm, unsigned long size);
 void bc_locked_uncharge(struct mm_struct *mm, unsigned long size);
 
 @@ -64,6 +91,23 @@ static inline void bc_memory_uncharge(st
 {
 }
 
 +static inline int __must_check bc_privvm_recharge(unsigned long old_flags,
 +		unsigned long new_flags, struct file *vm_file)
 +{
 +	return BC_NOCHARGE;
 +}
 +
 +static inline int __must_check bc_privvm_charge(struct mm_struct *mm,
 +		unsigned long size)
 +{
 +	return 0;
 +}
 +
 +static inline void bc_privvm_uncharge(struct mm_struct *mm,
 +		unsigned long size)
 +{
 +}
 +
 static inline int __must_check bc_locked_charge(struct mm_struct *mm,
 unsigned long size)
 {
 --- ./kernel/bc/beancounter.c.bcprivvm	2006-09-05 12:59:45.000000000 +0400
 +++ ./kernel/bc/beancounter.c	2006-09-05 13:17:50.000000000 +0400
 @@ -22,6 +22,7 @@ struct beancounter init_bc;
 const char *bc_rnames[] = {
 "kmemsize",	/* 0 */
 "lockedpages",
 +	"privvmpages",
 };
 
 #define BC_HASH_BITS		8
 @@ -234,6 +235,7 @@ static void init_beancounter_syslimits(s
 
 bc->bc_parms[BC_KMEMSIZE].limit = 32 * 1024 * 1024;
 bc->bc_parms[BC_LOCKEDPAGES].limit = 8;
 +	bc->bc_parms[BC_PRIVVMPAGES].limit = BC_MAXVALUE;
 
 for (k = 0; k < BC_RESOURCES; k++)
 bc->bc_parms[k].barrier = bc->bc_parms[k].limit;
 --- ./kernel/bc/vmpages.c.bcprivvm	2006-09-05 12:59:27.000000000 +0400
 +++ ./kernel/bc/vmpages.c	2006-09-05 13:28:16.000000000 +0400
 @@ -18,26 +18,73 @@ int bc_memory_charge(struct mm_struct *m
 unsigned long vm_flags, struct file *vm_file, int strict)
 {
 struct beancounter *bc;
 +	unsigned long flags;
 
 bc = mm->mm_bc;
 size >>= PAGE_SHIFT;
 
 +	spin_lock_irqsave(&bc->bc_lock, flags);
 if (vm_flags & VM_LOCKED)
 -		if (bc_charge(bc, BC_LOCKEDPAGES, size, strict))
 -			return -ENOMEM;
 +		if (bc_charge_locked(bc, BC_LOCKEDPAGES, size, strict))
 +			goto err_locked;
 +	if (BC_VM_PRIVATE(vm_flags, vm_file))
 +		if (bc_charge_locked(bc, BC_PRIVVMPAGES, size, strict))
 +			goto err_privvm;
 +	spin_unlock_irqrestore(&bc->bc_lock, flags);
 return 0;
 +
 +err_privvm:
 +	bc_uncharge_locked(bc, BC_LOCKEDPAGES, size);
 +err_locked:
 +	spin_unlock_irqrestore(&bc->bc_lock, flags);
 +	return -ENOMEM;
 }
 
 void bc_memory_uncharge(struct mm_struct *mm, unsigned long size,
 unsigned long vm_flags, struct file *vm_file)
 {
 struct beancounter *bc;
 +	unsigned long flags;
 
 bc = mm->mm_bc;
 size >>= PAGE_SHIFT;
 
 +	spin_lock_irqsave(&bc->bc_lock, flags);
 if (vm_flags & VM_LOCKED)
 -		bc_uncharge(bc, BC_LOCKEDPAGES, size);
 +		bc_uncharge_locked(bc, BC_LOCKEDPAGES, size);
 +	if (BC_VM_PRIVATE(vm_flags, vm_file))
 +		bc_uncharge_locked(bc, BC_PRIVVMPAGES, size);
 +	spin_unlock_irqrestore(&bc->bc_lock, flags);
 +}
 +
 +int bc_privvm_recharge(unsigned long vm_flags_old, unsigned long vm_flags_new,
 +		struct file *vm_file)
 +{
 +	int priv_old, priv_new;
 +
 +	priv_old = (BC_VM_PRIVATE(vm_flags_old, vm_file) ? 1 : 0);
 +	priv_new = (BC_VM_PRIVATE(vm_flags_new, vm_file) ? 1 : 0);
 +
 +	if (priv_old == priv_new)
 +		return BC_NOCHARGE;
 +
 +	return priv_new ? BC_CHARGE : BC_UNCHARGE;
 +}
 +
 +int bc_privvm_charge(struct mm_struct *mm, unsigned long size)
 +{
 +	struct beancounter *bc;
 +
 +	bc = mm->mm_bc;
 +	return bc_charge(bc, BC_PRIVVMPAGES, size >> PAGE_SHIFT, BC_BARRIER);
 +}
 +
 +void bc_privvm_uncharge(struct mm_struct *mm, unsigned long size)
 +{
 +	struct beancounter *bc;
 +
 +	bc = mm->mm_bc;
 +	bc_uncharge(bc, BC_PRIVVMPAGES, size >> PAGE_SHIFT);
 }
 
 static inline int locked_charge(struct beancounter *bc,
 --- ./kernel/fork.c.bcprivvm	2006-09-05 13:17:15.000000000 +0400
 +++ ./kernel/fork.c	2006-09-05 13:23:27.000000000 +0400
 @@ -236,9 +236,13 @@ static inline int dup_mmap(struct mm_str
 goto fail_nomem;
 charge = len;
 }
 +		if (bc_memory_charge(mm, mpnt->vm_end - mpnt->vm_start,
 +					mpnt->vm_flags & ~VM_LOCKED,
 +					mpnt->vm_file, BC_LIMIT) < 0)
 +			goto fail_nomem;
 tmp = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
 if (!tmp)
 -			goto fail_nomem;
 +			goto fail_alloc;
 *tmp = *mpnt;
 pol = mpol_copy(vma_policy(mpnt));
 retval = PTR_ERR(pol);
 @@ -292,6 +296,9 @@ out:
 return retval;
 fail_nomem_policy:
 kmem_cache_free(vm_area_cachep, tmp);
 +fail_alloc:
 +	bc_memory_uncharge(mm, mpnt->vm_end - mpnt->vm_start,
 +			mpnt->vm_flags & ~VM_LOCKED, mpnt->vm_file);
 fail_nomem:
 retval = -ENOMEM;
 vm_unacct_memory(charge);
 --- ./mm/mprotect.c.bcprivvm	2006-09-05 12:53:59.000000000 +0400
 +++ ./mm/mprotect.c	2006-09-05 13:27:40.000000000 +0400
 @@ -21,6 +21,7 @@
 #include <linux/syscalls.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
 +#include <bc/vmpages.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
 @@ -139,12 +140,19 @@ mprotect_fixup(struct vm_area_struct *vm
 pgoff_t pgoff;
 int error;
 int dirty_accountable = 0;
 +	int recharge;
 
 if (newflags == oldflags) {
 *pprev = vma;
 return 0;
 }
 
 +	recharge = bc_privvm_recharge(oldflags, newflags, vma->vm_file);
 +	if (recharge == BC_CHARGE) {
 +		if (bc_privvm_charge(mm, end - start))
 +			return -ENOMEM;
 +	}
 +
 /*
 * If we make a private mapping writable we increase our commit;
 * but (without finer accounting) cannot reduce our commit if we
 @@ -157,8 +165,9 @@ mprotect_fixup(struct vm_area_struct *vm
 if (newflags & VM_WRITE) {
 if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_SHARED))) {
 charged = nrpages;
 +			error = -ENOMEM;
 if (security_vm_enough_memory(charged))
 -				return -ENOMEM;
 +				goto fail_acct;
 newflags |= VM_ACCOUNT;
 }
 }
 @@ -205,12 +213,18 @@ success:
 hugetlb_change_protection(vma, 
...
 
 
 |  
	|  |  |  
	| 
		
			| [PATCH 11/13] BC: vmrss (preparations) [message #5933 is a reply to message #5922] | Tue, 05 September 2006 15:28   |  
			| 
				
				
					|  dev Messages: 1693
	| This patch does a few simple things:
 - introduces a bc_magic field in the beancounter to make sure the
 union in struct page is used correctly in the next patches
 - adds nr_beancounters
 - adds the unused_privvmpages variable (a counter of privvm pages
 which are not mapped into the VM address space and thus can
 potentially be allocated later)
 
 This is needed by vmrss accounting and is split out to make patch
 review simpler.
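 
 The unused_privvmpages bookkeeping can be pictured with a small user-space
 model. This is only a sketch under stated assumptions, not the kernel code:
 struct model_bc, model_charge() and model_uncharge() are hypothetical names,
 locking is omitted, and the plain limit check stands in for
 bc_charge_locked().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical user-space stand-in for the privvmpages part of a beancounter. */
struct model_bc {
	unsigned long held;	/* value reported as privvmpages held */
	unsigned long limit;	/* stand-in for the resource limit */
	unsigned long unused;	/* pages charged but not yet mapped */
};

/* Charge sz pages; returns 0 on success, -1 if the limit would be exceeded. */
static int model_charge(struct model_bc *bc, unsigned long sz)
{
	assert(bc != NULL);
	if (bc->held + sz > bc->limit)
		return -1;
	bc->held += sz;
	bc->unused += sz;
	return 0;
}

/* Uncharge sz pages, clamping to what is actually accounted as unused. */
static void model_uncharge(struct model_bc *bc, unsigned long sz)
{
	assert(bc != NULL);
	if (sz > bc->unused) {
		fprintf(stderr, "over-uncharge: val %lu held %lu\n",
			sz, bc->unused);
		sz = bc->unused;
	}
	bc->unused -= sz;
	bc->held = bc->unused;	/* in this model held simply mirrors unused */
}
```

 The clamp in model_uncharge() mirrors the over-uncharge warning in
 privvm_uncharge(): the counter is never allowed to wrap below zero.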
 
 Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
 Signed-Off-By: Kirill Korotaev <dev@sw.ru>
 
 ---
 
 include/bc/beancounter.h |   13 +++++++++++++
 include/bc/vmpages.h     |    2 ++
 kernel/bc/beancounter.c  |    5 +++++
 kernel/bc/kmem.c         |    1 +
 kernel/bc/vmpages.c      |   44 ++++++++++++++++++++++++++++++++++++++++----
 5 files changed, 61 insertions(+), 4 deletions(-)
 
 --- ./include/bc/beancounter.h.bcvmrssprep	2006-09-05 13:17:50.000000000 +0400
 +++ ./include/bc/beancounter.h	2006-09-05 13:44:33.000000000 +0400
 @@ -45,6 +45,13 @@ struct bc_resource_parm {
 #define BC_MAXVALUE	LONG_MAX
 
 /*
 + * This magic is used to distinguish a user beancounter from a page
 + * beancounter in struct page. page_bc and page_pb are placed in a union,
 + * and the magic ensures that we don't use a pb as a bc in bc_page_uncharge().
 + */
 +#define BC_MAGIC                0x62756275UL
 +
 +/*
 *	Resource management structures
 * Serialization issues:
 *   beancounter list management is protected via bc_hash_lock
 @@ -54,11 +61,13 @@ struct bc_resource_parm {
 */
 
 struct beancounter {
 +	unsigned long		bc_magic;
 atomic_t		bc_refcount;
 spinlock_t		bc_lock;
 bcid_t			bc_id;
 struct hlist_node	hash;
 
 +	unsigned long		unused_privvmpages;
 /* resources statistics and settings */
 struct bc_resource_parm	bc_parms[BC_RESOURCES];
 };
 @@ -74,6 +83,8 @@ enum bc_severity { BC_BARRIER, BC_LIMIT,
 
 #ifdef CONFIG_BEANCOUNTERS
 
 +extern unsigned int nr_beancounters;
 +
 /*
 * These functions tune minheld and maxheld values for a given
 * resource when held value changes
 @@ -137,6 +137,8 @@ extern const char *bc_rnames[];
 
 #else /* CONFIG_BEANCOUNTERS */
 
 +#define nr_beancounters 0
 +
 #define beancounter_findcreate(id, f)			(NULL)
 #define get_beancounter(bc)				(NULL)
 #define put_beancounter(bc)				do { } while (0)
 --- ./include/bc/vmpages.h.bcvmrssprep	2006-09-05 13:38:07.000000000 +0400
 +++ ./include/bc/vmpages.h	2006-09-05 13:40:21.000000000 +0400
 @@ -77,6 +77,8 @@ void bc_locked_shm_uncharge(struct shmem
 put_beancounter((info)->shm_bc);			\
 } while (0)
 
 +void bc_update_privvmpages(struct beancounter *bc);
 +
 #else /* CONFIG_BEANCOUNTERS */
 
 static inline int __must_check bc_memory_charge(struct mm_struct *mm,
 --- ./kernel/bc/beancounter.c.bcvmrssprep	2006-09-05 13:17:50.000000000 +0400
 +++ ./kernel/bc/beancounter.c	2006-09-05 13:44:53.000000000 +0400
 @@ -19,6 +19,8 @@ static void init_beancounter_struct(stru
 
 struct beancounter init_bc;
 
 +unsigned int nr_beancounters;
 +
 const char *bc_rnames[] = {
 "kmemsize",	/* 0 */
 "lockedpages",
 @@ -88,6 +90,7 @@ retry:
 
 out_install:
 hlist_add_head(&new_bc->hash, slot);
 +	nr_beancounters++;
 spin_unlock_irqrestore(&bc_hash_lock, flags);
 out:
 return new_bc;
 @@ -110,6 +113,7 @@ void put_beancounter(struct beancounter
 bc->bc_parms[i].held, bc_rnames[i]);
 
 hlist_del(&bc->hash);
 +	nr_beancounters--;
 spin_unlock_irqrestore(&bc_hash_lock, flags);
 
 kmem_cache_free(bc_cachep, bc);
 @@ -214,6 +218,7 @@ EXPORT_SYMBOL_GPL(bc_uncharge);
 
 static void init_beancounter_struct(struct beancounter *bc, bcid_t id)
 {
 +	bc->bc_magic = BC_MAGIC;
 atomic_set(&bc->bc_refcount, 1);
 spin_lock_init(&bc->bc_lock);
 bc->bc_id = id;
 --- ./kernel/bc/kmem.c.bcvmrssprep	2006-09-05 12:54:40.000000000 +0400
 +++ ./kernel/bc/kmem.c	2006-09-05 13:40:21.000000000 +0400
 @@ -79,6 +79,7 @@ void bc_page_uncharge(struct page *page,
 if (bc == NULL)
 return;
 
 +	BUG_ON(bc->bc_magic != BC_MAGIC);
 bc_uncharge(bc, BC_KMEMSIZE, PAGE_SIZE << order);
 put_beancounter(bc);
 page_bc(page) = NULL;
 --- ./kernel/bc/vmpages.c.bcvmrssprep	2006-09-05 13:28:16.000000000 +0400
 +++ ./kernel/bc/vmpages.c	2006-09-05 13:45:34.000000000 +0400
 @@ -14,6 +14,34 @@
 
 #include <asm/page.h>
 
 +void bc_update_privvmpages(struct beancounter *bc)
 +{
 +	bc->bc_parms[BC_PRIVVMPAGES].held = bc->unused_privvmpages;
 +	bc_adjust_minheld(bc, BC_PRIVVMPAGES);
 +	bc_adjust_maxheld(bc, BC_PRIVVMPAGES);
 +}
 +
 +static inline int privvm_charge(struct beancounter *bc, unsigned long sz,
 +		int strict)
 +{
 +	if (bc_charge_locked(bc, BC_PRIVVMPAGES, sz, strict))
 +		return -ENOMEM;
 +
 +	bc->unused_privvmpages += sz;
 +	return 0;
 +}
 +
 +static inline void privvm_uncharge(struct beancounter *bc, unsigned long sz)
 +{
 +	if (unlikely(bc->unused_privvmpages < sz)) {
 +		printk("BC: overuncharging %d unused pages: val %lu held %lu\n",
 +				bc->bc_id, sz, bc->unused_privvmpages);
 +		sz = bc->unused_privvmpages;
 +	}
 +	bc->unused_privvmpages -= sz;
 +	bc_update_privvmpages(bc);
 +}
 +
 int bc_memory_charge(struct mm_struct *mm, unsigned long size,
 unsigned long vm_flags, struct file *vm_file, int strict)
 {
 @@ -28,7 +56,7 @@ int bc_memory_charge(struct mm_struct *m
 if (bc_charge_locked(bc, BC_LOCKEDPAGES, size, strict))
 goto err_locked;
 if (BC_VM_PRIVATE(vm_flags, vm_file))
 -		if (bc_charge_locked(bc, BC_PRIVVMPAGES, size, strict))
 +		if (privvm_charge(bc, size, strict))
 goto err_privvm;
 spin_unlock_irqrestore(&bc->bc_lock, flags);
 return 0;
 @@ -53,7 +81,7 @@ void bc_memory_uncharge(struct mm_struct
 if (vm_flags & VM_LOCKED)
 bc_uncharge_locked(bc, BC_LOCKEDPAGES, size);
 if (BC_VM_PRIVATE(vm_flags, vm_file))
 -		bc_uncharge_locked(bc, BC_PRIVVMPAGES, size);
 +		privvm_uncharge(bc, size);
 spin_unlock_irqrestore(&bc->bc_lock, flags);
 }
 
 @@ -73,18 +101,26 @@ int bc_privvm_recharge(unsigned long vm_
 
 int bc_privvm_charge(struct mm_struct *mm, unsigned long size)
 {
 +	int ret;
 struct beancounter *bc;
 +	unsigned long flags;
 
 bc = mm->mm_bc;
 -	bc_charge(bc, BC_PRIVVMPAGES, size >> PAGE_SHIFT);
 +	spin_lock_irqsave(&bc->bc_lock, flags);
 +	ret = privvm_charge(bc, size >> PAGE_SHIFT, BC_BARRIER);
 +	spin_unlock_irqrestore(&bc->bc_lock, flags);
 +	return ret;
 }
 
 void bc_privvm_uncharge(struct mm_struct *mm, unsigned long size)
 {
 struct beancounter *bc;
 +	unsigned long flags;
 
 bc = mm->mm_bc;
 -	bc_uncharge(bc, BC_PRIVVMPAGES, size >> PAGE_SHIFT);
 +	spin_lock_irqsave(&bc->bc_lock, flags);
 +	privvm_uncharge(bc, size >> PAGE_SHIFT);
 +	spin_unlock_irqrestore(&bc->bc_lock, flags);
 }
 
 static inline int locked_charge(struct beancounter *bc,
 |  
	|  |  |  
	| 
		
			| [PATCH 12/13] BC: vmrss (core) [message #5934 is a reply to message #5922] | Tue, 05 September 2006 15:28   |  
	| This is the core of vmrss accounting.
 The main object introduced is page_beancounter.
 It ties together a page and the BCs which use the page.
 This allows fractions of memory shared between BCs to be
 accounted correctly (http://wiki.openvz.org/RSS_fractions_accounting).
 
 Accounting API:
 1. bc_alloc_rss_counter() allocates a tie between a page and a BC.
 2. bc_free_rss_counter() frees it.
 
 (1) and (2) must be done each time a page is about
 to be added to someone's rss.
 
 3. When a page is touched by a BC (i.e. by any task whose mm belongs to
 the BC), the page is bc_vmrss_page_add()-ed to that BC. Touching a page
 leads to subtracting it from unused_privvmpages and adding it to rss_pages.
 4. When a page is unmapped from a BC, it is bc_vmrss_page_del()-ed from it.
 
 5. When a task forks, all its mapped pages must be bc_vmrss_page_dup()-ed,
 i.e. the page beancounter reference counter must be increased.
 
 6. Some pages (former PG_reserved) must be added to rss, but without
 holding a reference on them. These pages are bc_vmrss_page_add_noref()-ed.
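 
 The calling convention behind steps 1 to 3 (preallocate a tie, let the add
 step consume it only if a new tie is needed, then free whatever is left) can
 be sketched in user space. The names below (struct model_pb, model_alloc(),
 model_page_add(), model_free_list(), live_pbs) are hypothetical and only
 model the pointer handling; they are not the kernel API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a preallocated page-to-BC tie. */
struct model_pb {
	struct model_pb *next;
};

static int live_pbs;	/* number of ties currently allocated */

/* Step 1: allocate a tie before taking any locks. */
static struct model_pb *model_alloc(void)
{
	struct model_pb *p = malloc(sizeof(*p));

	if (p != NULL) {
		p->next = NULL;
		live_pbs++;
	}
	return p;
}

/* Step 2: free a (possibly empty) list of unconsumed ties. */
static void model_free_list(struct model_pb *pb)
{
	while (pb != NULL) {
		struct model_pb *next = pb->next;

		free(pb);
		live_pbs--;
		pb = next;
	}
}

/*
 * Step 3: if the BC already has a tie to the page, only a reference would be
 * taken and the preallocated element stays with the caller. Otherwise the
 * first element is detached from *ppb and becomes the new tie.
 */
static struct model_pb *model_page_add(struct model_pb **ppb, int already_tied)
{
	struct model_pb *p;

	if (already_tied)
		return NULL;
	p = *ppb;
	assert(p != NULL);
	*ppb = p->next;
	p->next = NULL;
	return p;
}
```

 Passing a pointer to the list lets the caller free the leftover
 unconditionally afterwards, whether or not the tie was consumed.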
 
 Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
 Signed-Off-By: Kirill Korotaev <dev@sw.ru>
 
 ---
 
 include/bc/beancounter.h |    3
 include/bc/vmpages.h     |    4
 include/bc/vmrss.h       |   72 ++++++
 include/linux/mm.h       |    6
 include/linux/shmem_fs.h |    2
 init/main.c              |    2
 kernel/bc/Kconfig        |    9
 kernel/bc/Makefile       |    1
 kernel/bc/beancounter.c  |    9
 kernel/bc/vmpages.c      |    7
 kernel/bc/vmrss.c        |  508 +++++++++++++++++++++++++++++++++++++++++++++++
 mm/shmem.c               |    6
 12 files changed, 627 insertions(+), 2 deletions(-)
 
 --- ./include/bc/beancounter.h.bcrsscore	2006-09-05 13:44:33.000000000 +0400
 +++ ./include/bc/beancounter.h	2006-09-05 13:50:29.000000000 +0400
 @@ -68,6 +68,9 @@ struct beancounter {
 struct hlist_node	hash;
 
 unsigned long		unused_privvmpages;
 +#ifdef CONFIG_BEANCOUNTERS_RSS
 +	unsigned long long	rss_pages;
 +#endif
 /* resources statistics and settings */
 struct bc_resource_parm	bc_parms[BC_RESOURCES];
 };
 --- ./include/bc/vmpages.h.bcrsscore	2006-09-05 13:40:21.000000000 +0400
 +++ ./include/bc/vmpages.h	2006-09-05 13:46:35.000000000 +0400
 @@ -77,6 +77,8 @@ void bc_locked_shm_uncharge(struct shmem
 put_beancounter((info)->shm_bc);			\
 } while (0)
 
 +#define mm_same_bc(mm1, mm2)	((mm1)->mm_bc == (mm2)->mm_bc)
 +
 void bc_update_privvmpages(struct beancounter *bc);
 
 #else /* CONFIG_BEANCOUNTERS */
 @@ -136,6 +138,8 @@ static inline void bc_locked_shm_uncharg
 #define shmi_init_bc(info)	do { } while (0)
 #define shmi_free_bc(info)	do { } while (0)
 
 +#define mm_same_bc(mm1, mm2)	(1)
 +
 #endif /* CONFIG_BEANCOUNTERS */
 #endif
 
 --- /dev/null	2006-07-18 14:52:43.075228448 +0400
 +++ ./include/bc/vmrss.h	2006-09-05 13:50:25.000000000 +0400
 @@ -0,0 +1,72 @@
 +/*
 + *  include/bc/vmrss.h
 + *
 + *  Copyright (C) 2006 OpenVZ. SWsoft Inc
 + *
 + */
 +
 +#ifndef __BC_VMRSS_H_
 +#define __BC_VMRSS_H_
 +
 +struct page_beancounter;
 +
 +struct page;
 +struct mm_struct;
 +struct vm_area_struct;
 +
 +/* values that represent a page's 'weight' in bc rss accounting */
 +#define PB_PAGE_WEIGHT_SHIFT 24
 +#define PB_PAGE_WEIGHT (1 << PB_PAGE_WEIGHT_SHIFT)
 +/* page obtains one more reference within beancounter */
 +#define PB_COPY_SAME	((struct page_beancounter *)-1)
 +
 +#ifdef CONFIG_BEANCOUNTERS_RSS
 +
 +struct page_beancounter * __must_check bc_alloc_rss_counter(void);
 +struct page_beancounter * __must_check bc_alloc_rss_counter_list(long num,
 +		struct page_beancounter *list);
 +
 +void bc_free_rss_counter(struct page_beancounter *rc);
 +
 +void bc_vmrss_page_add(struct page *pg, struct mm_struct *mm,
 +		struct vm_area_struct *vma, struct page_beancounter **ppb);
 +void bc_vmrss_page_del(struct page *pg, struct mm_struct *mm,
 +		struct vm_area_struct *vma);
 +void bc_vmrss_page_dup(struct page *pg, struct mm_struct *mm,
 +		struct vm_area_struct *vma, struct page_beancounter **ppb);
 +void bc_vmrss_page_add_noref(struct page *pg, struct mm_struct *mm,
 +		struct vm_area_struct *vma);
 +
 +unsigned long mm_rss_pages(struct mm_struct *mm, unsigned long start,
 +		unsigned long end);
 +
 +void bc_init_rss(void);
 +
 +#else /* CONFIG_BEANCOUNTERS_RSS */
 +
 +static inline struct page_beancounter * __must_check bc_alloc_rss_counter(void)
 +{
 +	return NULL;
 +}
 +
 +static inline struct page_beancounter * __must_check bc_alloc_rss_counter_list(
 +		long num, struct page_beancounter *list)
 +{
 +	return NULL;
 +}
 +
 +static inline void bc_free_rss_counter(struct page_beancounter *rc)
 +{
 +}
 +
 +#define bc_vmrss_page_add(pg, mm, vma, pb)	do { } while (0)
 +#define bc_vmrss_page_del(pg, mm, vma)		do { } while (0)
 +#define bc_vmrss_page_dup(pg, mm, vma, pb)	do { } while (0)
 +#define bc_vmrss_page_add_noref(pg, mm, vma)	do { } while (0)
 +#define mm_rss_pages(mm, start, end)	(0)
 +
 +#define bc_init_rss()			do { } while (0)
 +
 +#endif /* CONFIG_BEANCOUNTERS_RSS */
 +
 +#endif
 --- ./include/linux/mm.h.bcrsscore	2006-09-05 13:06:37.000000000 +0400
 +++ ./include/linux/mm.h	2006-09-05 13:47:12.000000000 +0400
 @@ -275,11 +275,15 @@ struct page {
 unsigned long trace[8];
 #endif
 #ifdef CONFIG_BEANCOUNTERS
 -	struct beancounter	*page_bc;
 +	union {
 +		struct beancounter	*page_bc;
 +		struct page_beancounter	*page_pb;
 +	};
 #endif
 };
 
 #define page_bc(page)			((page)->page_bc)
 +#define page_pb(page)			((page)->page_pb)
 #define page_private(page)		((page)->private)
 #define set_page_private(page, v)	((page)->private = (v))
 
 --- ./include/linux/shmem_fs.h.bcrsscore	2006-09-05 12:59:27.000000000 +0400
 +++ ./include/linux/shmem_fs.h	2006-09-05 13:50:19.000000000 +0400
 @@ -41,4 +41,6 @@ static inline struct shmem_inode_info *S
 return container_of(inode, struct shmem_inode_info, vfs_inode);
 }
 
 +int is_shmem_mapping(struct address_space *mapping);
 +
 #endif
 --- ./init/main.c.bcrsscore	2006-09-05 12:54:17.000000000 +0400
 +++ ./init/main.c	2006-09-05 13:46:35.000000000 +0400
 @@ -51,6 +51,7 @@
 #include <linux/lockdep.h>
 
 #include <bc/beancounter.h>
 +#include <bc/vmrss.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
 @@ -608,6 +609,7 @@ asmlinkage void __init start_kernel(void
 check_bugs();
 
 acpi_early_init(); /* before LAPIC and SMP init */
 +	bc_init_rss();
 
 /* Do the rest non-__init'ed, we're now alive */
 rest_init();
 --- ./kernel/bc/Kconfig.bcrsscore	2006-09-05 12:54:14.000000000 +0400
 +++ ./kernel/bc/Kconfig	2006-09-05 13:50:35.000000000 +0400
 @@ -22,4 +22,13 @@ config BEANCOUNTERS
 per-process basis.  Per-process accounting doesn't prevent malicious
 users from spawning a lot of resource-consuming processes.
 
 +config BEANCOUNTERS_RSS
 +	bool "Account physical memory usage"
 +	default y
 +	depends on BEANCOUNTERS
 +	help
 +	  This allows estimating per-beancounter physical memory usage.
 +	  The implemented algorithm accounts shared pages of memory as well,
 +	  dividing them among the beancounters which use the page.
 +
 endmenu
 --- ./kernel/bc/Makefile.bcrsscore	2006-09-05 12:59:37.000000000 +0400
 +++ ./kernel/bc/Makefile	2006-09-05 13:50:48.000000000 +0400
 @@ -9,3 +9,4 @@ obj-y += misc.o
 obj-y += sys.o
 obj-y += kmem.o
 obj-y += vmpages.o
 +obj-$(CONFIG_BEANCOUNTERS_RSS) += vmrss.o
 --- ./kernel/bc/beancounter.c.bcrsscore	2006-09-05 13:44:53.000000000 +0400
 +++ ./kernel/bc/beancounter.c	2006-09-05 13:49:38.000000000 +0400
 @@ -11,6 +11,7 @@
 #include <linux/hash.h>
 
 #include <bc/beancounter.h>
 +#include <bc/vmrss.h>
 
 static kmem_cache_t *bc_cachep;
 static struct beancounter default_beancounter;
 @@ -112,6 +113,14 @@ void put_beancounter(struct beancounter
 printk("BC: %d has %lu of %s held on put", bc->bc_id,
 bc->bc_parms[i].held, bc_rnames[i]);
 
 +	if (bc->unused_privvmpages != 0)
 +		printk("BC: %d has %lu of unused pages held on put", bc->bc_id,
 +			bc->unused_privvmpages);
 +#ifdef CONFIG_BEANCOUNTERS_RSS
 +	if (bc->rss_pages != 0)
 +		printk("BC: %d has %llu of rss pages held on put", bc->bc_id,
 +			bc->rss_pages);
 +#endif
 hlist_del(&bc->hash);
 nr_beancounters--;
 spin_unlock_irqrestore(&bc_hash_lock, flags);
 --- ./kernel/bc/vmpages.c.bcrsscore	2006-09-05 13:45:34.000000000 +0400
 +++ ./kernel/bc/vmpages.c	2006-09-05 13:48:50.000000000 +0400
 @@ -11,12 +11,17 @@
 
 #include <bc/beancounter.h>
 #include <bc/vmpages.h>
 +#include <bc/vmrss.h>
 
 #include <asm/page.h>
 
 void bc_update_privvmpages(struct beancounter *bc)
 {
 -	bc->bc_parms[BC_PRIVVMPAGES].held = bc->unused_privvmpages;
 +	bc->bc_parms[BC_PRIVVMPAGES].held = bc->unused_privvmpages
 +#ifdef CONFIG_BEANCOUNTERS_RSS
 +		+ (bc->rss_pages >> PB_PAGE_WEIGHT_SHIFT)
 +#endif
 +		;
 bc_adjust_minheld(bc, BC_PRIVVMPAGES);
 bc_adjust_maxheld(bc, BC_PRIVVMPAGES);
 }
 --- /dev/null	2006-07-18 14:52:43.075228448 +0400
 +++ ./kernel/bc/vmrss.c	2006-09-05 13:51:21.000000000 +0400
 @@ -0,0 +1,508 @@
 +/*
 + *  kernel/bc/vmrss.c
 + *
 + *  Copyright (C) 2006 OpenVZ. SWsoft Inc
 + *
 + */
 +
 +#include <linux/sched.h>
 +#include <linux/mm.h>
 +#include <linux/list.h>
 +#include <linux/slab.h>
 +#include <linux/vmalloc.h>
 +#include <linux/shmem_fs.h>
 +#include <linux/highmem.h>
 +
 +#include <bc/beancounter.h>
 +#include <bc/vmpages.h>
 +#include <bc/vmrss.h>
 +
 +#include <asm/pgtable.h>
 +
 +/*
 + * Core object of accounting.
 + * page_beancounter (or rss_counter) ties together a page and a bc.
 + * A page has an associated circular list of such pbs. When a page is
 + * shared between bcs, its size is split between all of
 + * them in 1/2^n parts.
 + *
 + * E.g. three bcs will share page like 1/2:1/4:1/4
 + * adding one more reference would produce such a change:
 + * 1/2(bc1) : 1/4(bc2) : 1/4(bc3) ->
 + * (1/4(bc1) + 1/4(bc1)) : 1/4(bc2) : 1/4(bc3) ->
 + * 1/4(bc2) : 1/4(bc3) : 1/4(bc4) : 1/4(bc1)
 + */
 +
 +#define PB_MAGIC	0x62700001UL
 +
 +struct page_beancounter {
 +	unsigned long magic;
 +	struct page *page;
 +	struct beancounter *bc;
 +	struct page_beancounter *next_hash;
 +	unsigned refcount;
 +	struct list_head page_list;
 +};
 +
 +#define PB_REFC_BITS 24
 +
 +#define pb_shift(p)	((p)->refcount >> PB_REFC_BITS)
 +#define pb_shift_inc(p)	do { ((p)->refcount += (1 << PB_REFC_BITS)); } while (0)
 +#define pb_shift_dec(p)	do { ((p)->refcount -= (1 << PB_REFC_BITS)); } while (0)
 +
 +#define pb_count(p)	((p)->refcount & ((1 << PB_REFC_BITS) - 1))
 +#define pb_get(p)	do { ((p)->refcount++); } while (0)
 +#define pb_put(p)	do { ((p)->refcount--); } while (0)
 +
 +#define pb_refcount_init(p, shift) do {					\
 +		(p)->refcount = ((shift) << PB_REFC_BITS) + (1);	\
 +	} while (0)
 +
 +static spinlock_t pb_lock = SPIN_LOCK_UNLOCKED;
 +static struct page_beancounter **pb_hash_table;
 +static unsigned int pb_hash_mask;
 +
 +static inline int pb_hash(struct beancounter *bc, struct page *page)
 +{
 +	return (page_to_pfn(page) + (bc->bc_id << 10)) & pb_hash_mask;
 +}
 +
 +static kmem_cache_t *pb_cachep;
 +#define alloc_pb()	kmem_cache_alloc(pb_cachep, GFP_KERNEL)
 +#define free_pb(p)	kmem_cache_free(pb_cachep, p)
 +
 +#define next_page_pb(p) list_entry((p)->page_list.next,	\
 +		struct page_beancounter, page_list)
 +#define prev_page_pb(p) list_entry((p)->page_list.prev,	\
 +		struct page_beancounter, page_list)
 +
 +/*
 + * Allocates a new page_beancounter struct and
 + * initialises required fields.
 + * pb->next_hash is set to NULL as this field is used
 + * in two ways:
 + * 1. When pb is in hash - it points to the next one in
 + *    the current hash chain;
 + * 2. When pb is not in hash yet - it points to the next pb
 + *    in list just allocated.
 + */
 +struct page_beancounter *bc_alloc_rss_counter(void)
 +{
 +	struct page_beancounter *pb;
 +
 +	pb = alloc_pb();
 +	if (pb == NULL)
 +		return ERR_PTR(-ENOMEM);
 +
 +	pb->magic = PB_MAGIC;
 +	pb->next_hash = NULL;
 +	return pb;
 +}
 +
 +/*
 + * This function ensures that @list has at least @num elements.
 + * Otherwise needed elements are allocated and new list is
 + * returned. On error old list is freed.
 + *
 + * num == BC_ALLOC_ALL means that the list must contain as many
 + * elements as there are BCs in the hash now.
 + */
 +struct page_beancounter *bc_alloc_rss_counter_list(long num,
 +		struct page_beancounter *list)
 +{
 +	struct page_beancounter *pb;
 +
 +	for (pb = list; pb != NULL && num != 0; pb = pb->next_hash, num--);
 +
 +	/* need to allocate num more elements */
 +	while (num > 0) {
 +		pb = alloc_pb();
 +		if (pb == NULL)
 +			goto err;
 +
 +		pb->magic = PB_MAGIC;
 +		pb->next_hash = list;
 +		list = pb;
 +		num--;
 +	}
 +
 +	return list;
 +
 +err:
 +	bc_free_rss_counter(list);
 +	return ERR_PTR(-ENOMEM);
 +}
 +
 +/*
 + * Free the list of page_beancounter-s
 + */
 +void bc_free_rss_counter(struct page_beancounter *pb)
 +{
 +	struct page_beancounter *tmp;
 +
 +	while (pb) {
 +		tmp = pb->next_hash;
 +		free_pb(pb);
 +		pb = tmp;
 +	}
 +}
 +
 +/*
 + * Helpers to update rss_pages and unused_privvmpages on BC
 + */
 +static void mod_rss_pages(struct beancounter *bc, int val,
 +		struct vm_area_struct *vma, int unused)
 +{
 +	unsigned long flags;
 +
 +	spin_lock_irqsave(&bc->bc_lock, flags);
 +	if (vma && BC_VM_PRIVATE(vma->vm_flags, vma->vm_file)) {
 +		if (unused < 0 && unlikely(bc->unused_privvmpages < -unused)) {
 +			printk("BC: overuncharging %d unused pages: "
 +					"val %i, held %lu\n",
 +					bc->bc_id, unused,
 +					bc->unused_privvmpages);
 +			unused = -bc->unused_privvmpages;
 +		}
 +		bc->unused_privvmpages += unused;
 +	}
 +	bc->rss_pages += val;
 +	bc_update_privvmpages(bc);
 +	spin_unlock_irqrestore(&bc->bc_lock, flags);
 +}
 +
 +#define __inc_rss_pages(bc, val)	mod_rss_pages(bc, val, NULL, 0)
 +#define __dec_rss_pages(bc, val)	mod_rss_pages(bc, -(val), NULL, 0)
 +#define inc_rss_pages(bc, val, vma)	mod_rss_pages(bc, val, vma, -1)
 +#define dec_rss_pages(bc, val, vma)	mod_rss_pages(bc, -(val), vma, 1)
 +
 +/*
 + * Routines to manipulate page-to-bc references (page_beancounter)
 + * Reference may be added, removed or duplicated (see descriptions below)
 + */
 +
 +static int __pb_dup_ref(struct page *pg, struct beancounter *bc, int hash)
 +{
 +	struct page_beancounter *p;
 +
 +	for (p = pb_hash_table[hash];
 +			p != NULL && (p->page != pg || p->bc != bc);
 +			p = p->next_hash);
 +	if (p == NULL)
 +		return -1;
 +
 +	pb_get(p);
 +	return 0;
 +}
 +
 +static int __pb_add_ref(struct page *pg, struct beancounter *bc,
 +		int hash, struct page_beancounter **ppb)
 +{
 +	struct page_beancounter *head, *p;
 +	int shift, ret;
 +
 +	p = *ppb;
 +	*ppb = p->next_hash;
 +
 +	p->page = pg;
 +	p->bc = get_beancounter(bc);
 +	p->next_hash = pb_hash_table[hash];
 +	pb_hash_table[hash] = p;
 +
 +	head = page_pb(pg);
 +	if (head != NULL) {
 +		BUG_ON(head->magic != PB_MAGIC);
 +		/*
 +		 * Move the first element to the end of the list.
 +		 * List head (pb_head) is set to the next entry.
 +		 * Note that this code works even if head is the only element
 +		 * on the list (because it's cyclic).
 +		 */
 +		page_pb(pg) = next_page_pb(head);
 +		pb_shift_inc(head);
 +		shift = pb_shift(head);
 +		/*
 +		 * Update user beancounter, the share of head has been changed.
 +		 * Note that the shift counter is taken after increment.
 +		 */
 +		__dec_rss_pages(head->bc, PB_PAGE_WEIGHT >> shift);
 +		/*
 +		 * Add the new page beancounter to the end of the list.
 +		 */
 +		list_add_tail(&p->page_list, &page_pb(pg)->page_list);
 +	} else {
 +		page_pb(pg) = p;
 +		shift = 0;
 +		INIT_LIST_HEAD(&p->page_list);
 +	}
 +
 +	pb_refcount_init(p, shift);
 +	ret = PB_PAGE_WEIGHT >> shift;
 +	return ret;
 +}
 +
 +static int __pb_remove_ref(struct page *page, struct beancounter *bc)
 +{
 +	int hash, ret;
 +	struct page_beancounter *p, **q;
 +	int shift, shiftt;
 +
 +	ret = 0;
 +
 +	hash = pb_hash(bc, page);
 +
 +	BUG_ON(page_pb(page) != NULL && page_pb(page)->magic != PB_MAGIC);
 +	for (q = pb_hash_table + hash, p = *q;
 +			p != NULL && (p->page != page || p->bc != bc);
 +			q = &p->next_hash, p = *q);
 +	if (p == NULL)
 +		goto out;
 +
 +	pb_put(p);
 +	if (pb_count(p) > 0)
 +		goto out;
 +
 +	/* remove from the hash list */
 +	*q = p->next_hash;
 +
 +	shift = pb_shift(p);
 +	ret = PB_PAGE_WEIGHT >> shift;
 +
 +	if (page_pb(page) == p) {
 +		if (list_empty(&p->page_list)) {
 +			page_pb(page) = NULL;
 +			put_beancounter(bc);
 +			free_pb(p);
 +			goto out;
 +		}
 +		page_pb(page) = next_page_pb(p);
 +	}
 +
 +	list_del(&p->page_list);
 +	put_beancounter(bc);
 +	free_pb(p);
 +
 +	/*
 +	 * Now balance the list.
 +	 * Move the tail and adjust its shift counter.
 +	 */
 +	p = prev_page_pb(page_pb(page));
 +	shiftt = pb_shift(p);
 +	pb_shift_dec(p);
 +	page_pb(page) = p;
 +	__inc_rss_pages(p->bc, PB_PAGE_WEIGHT >> shiftt);
 +
 +	/*
 +	 * If the shift counter of the moved beancounter is different from the
 +	 * removed one's, repeat the procedure for one more tail beancounter
 +	 */
 +	if (shiftt > shift) {
 +		p = prev_page_pb(page_pb(page));
 +		pb_shift_dec(p);
 +		page_pb(page) = p;
 +		__inc_rss_pages(p->bc, PB_PAGE_WEIGHT >> shiftt);
 +	}
 +out:
 +	return ret;
 +}
 +
 +/*
 + * bc_vmrss_page_add: Called when page is added to resident set
 + *   of any mm. In this case the page is subtracted from unused_privvmpages
 + *   (if it is BC_VM_PRIVATE one) and a reference to BC must be set
 + *   with page_beancounter.
 + *
 + * bc_vmrss_page_del: The reverse operation - page is removed from
 + *   resident set and must become unused.
 + *
 + * bc_vmrss_page_dup: This is called on dup_mmap() when all pages
 + *   become shared between two mm structs. This case has one feature:
 + *   some pages (see below) may lack a reference to BC, so setting
 + *   new reference is not needed, but update of unused_privvmpages
 + *   is required.
 + *
 + * bc_vmrss_page_add_noref: This is called for (former) reserved pages
 + *   like ZERO_PAGE() or some pages set up with insert_page(). These
 + *   pages must not have reference to any BC, but must be accounted in
 + *   rss.
 + */
 +
 +void bc_vmrss_page_add(struct page *pg, struct mm_struct *mm,
 +		struct vm_area_struct *vma, struct page_beancounter **ppb)
 +{
 +	struct beancounter *bc;
 +	int hash, ret;
 +
 +	if (!PageAnon(pg) && is_shmem_mapping(pg->mapping))
 +		return;
 +
 +	bc = mm->mm_bc;
 +	hash = pb_hash(bc, pg);
 +
 +	ret = 0;
 +	spin_lock(&pb_lock);
 +	if (__pb_dup_ref(pg, bc, hash))
 +		ret = __pb_add_ref(pg, bc, hash, ppb);
 +	spin_unlock(&pb_lock);
 +
 +	inc_rss_pages(bc, ret, vma);
 +}
 +
 +void bc_vmrss_page_del(struct page *pg, struct mm_struct *mm,
 +		struct vm_area_struct *vma)
 +{
 +	struct beancounter *bc;
 +	int ret;
 +
 +	if (!PageAnon(pg) && is_shmem_mapping(pg->mapping))
 +		return;
 +
 +	bc = mm->mm_bc;
 +
 +	spin_lock(&pb_lock);
 +	ret = __pb_remove_ref(pg, bc);
 +	spin_unlock(&pb_lock);
 +
 +	dec_rss_pages(bc, ret, vma);
 +}
 +
 +void bc_vmrss_page_dup(struct page *pg, struct mm_struct *mm,
 +		struct vm_area_struct *vma, struct page_beancounter **ppb)
 +{
 +	struct beancounter *bc;
 +	int hash, ret;
 +
 +	if (!PageAnon(pg) && is_shmem_mapping(pg->mapping))
 +		return;
 +
 +	bc = mm->mm_bc;
 +	hash = pb_hash(bc, pg);
 +
 +	ret = 0;
 +	spin_lock(&pb_lock);
 +	if (page_pb(pg) == NULL)
 +		/*
 +		 * pages like ZERO_PAGE must not be accounted in pbc
 +		 * so on fork we just skip them
 +		 */
 +		goto out_unlock;
 +
 +	if (*ppb == PB_COPY_SAME) {
 +		if (__pb_dup_ref(pg, bc, hash))
 +			WARN_ON(1);
 +	} else
 +		ret = __pb_add_ref(pg, bc, hash, ppb);
 +out_unlock:
 +	spin_unlock(&pb_lock);
 +
 +	inc_rss_pages(bc, ret, vma);
 +}
 +
 +void bc_vmrss_page_add_noref(struct page *pg, struct mm_struct *mm,
 +		struct vm_area_struct *vma)
 +{
 +	inc_rss_pages(mm->mm_bc, 0, vma);
 +}
 +
 +/*
 + * Calculate the number of currently resident pages for
 + * given mm_struct in a given range (addr - end).
 + * This is needed for mprotect_fixup() as by the time
 + * it is called some pages can be resident and thus
 + * not accounted in bc->unused_privvmpages. Such pages
 + * must not be uncharged (as they already are).
 + */
 +
 +static unsigned long pages_in_pte_range(struct mm_struct *mm, pmd_t *pmd,
 +				unsigned long addr, unsigned long end,
 +				unsigned long *pages)
 +{
 +	pte_t *pte;
 +	spinlock_t *ptl;
 +
 +	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 +	do {
 +		pte_t ptent = *pte;
 +		if (!pte_none(ptent) && pte_present(ptent))
 +			(*pages)++;
 +	} while (pte++, addr += PAGE_SIZE, addr != end);
 +	pte_unmap_unlock(pte - 1, ptl);
 +	return addr;
 +}
 +
 +static inline unsigned long pages_in_pmd_range(struct mm_struct *mm, pud_t *pud,
 +				unsigned long addr, unsigned long end,
 +				unsigned long *pages)
 +{
 +	pmd_t *pmd;
 +	unsigned long next;
 +
 +	pmd = pmd_offset(pud, addr);
 +	do {
 +		next = pmd_addr_end(addr, end);
 +		if (pmd_none_or_clear_bad(pmd))
 +			continue;
 +
 +		next = pages_in_pte_range(mm, pmd, addr, next, pages);
 +	} while (pmd++, addr = next, addr != end);
 +	return addr;
 +}
 +
 +static inline unsigned long pages_in_pud_range(struct mm_struct *mm, pgd_t *pgd,
 +				unsigned long addr, unsigned long end,
 +				unsigned long *pages)
 +{
 +	pud_t *pud;
 +	unsigned long next;
 +
 +	pud = pud_offset(pgd, addr);
 +	do {
 +		next = pud_addr_end(addr, end);
 +		if (pud_none_or_clear_bad(pud))
 +			continue;
 +
 +		next = pages_in_pmd_range(mm, pud, addr, next, pages);
 +	} while (pud++, addr = next, addr != end);
 +	return addr;
 +}
 +
 +unsigned long mm_rss_pages(struct mm_struct *mm,
 +		unsigned long addr, unsigned long end)
 +{
 +	pgd_t *pgd;
 +	unsigned long next;
 +	unsigned long pages;
 +
 +	BUG_ON(addr >= end);
 +
 +	pages = 0;
 +	pgd = pgd_offset(mm, addr);
 +	do {
 +		next = pgd_addr_end(addr, end);
 +		if (pgd_none_or_clear_bad(pgd))
 +			continue;
 +
 +		next = pages_in_pud_range(mm, pgd, addr, next, &pages);
 +	} while (pgd++, addr = next, addr != end);
 +	return pages;
 +}
 +
 +void __init bc_init_rss(void)
 +{
 +	unsigned long hash_size;
 +
 +	pb_cachep = kmem_cache_create("page_beancounter",
 +			sizeof(struct page_beancounter), 0,
 +			SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL, NULL);
 +
 +	hash_size = num_physpages >> 2;
 +	for (pb_hash_mask = 1;
 +			(hash_size & pb_hash_mask) != hash_size;
 +			pb_hash_mask = (pb_hash_mask << 1) + 1);
 +
 +	hash_size = pb_hash_mask + 1;
 +	printk(KERN_INFO "BC: Page beancounter hash is %lu entries.\n",
 +			hash_size);
 +	pb_hash_table = vmalloc(hash_size * sizeof(struct page_beancounter *));
 +	memset(pb_hash_table, 0, hash_size * sizeof(struct page_beancounter *));
 +}
 --- ./mm/shmem.c.bcrsscore	2006-09-05 13:39:26.000000000 +0400
 +++ ./mm/shmem.c	2006-09-05 13:46:35.000000000 +0400
 @@ -2236,6 +2236,12 @@ static struct vm_operations_struct shmem
 #endif
 };
 
 +#ifdef CONFIG_BEANCOUNTERS_RSS
 +int is_shmem_mapping(struct address_space *mapping)
 +{
 +	return (mapping != NULL && mapping->a_ops == &shmem_aops);
 +}
 +#endif
 
 static int shmem_get_sb(struct file_system_type *fs_type,
 int flags, const char *dev_name, void *data, struct vfsmount *mnt)
 |  
	|  |  |  
	| 
		
			| [PATCH 13/13] BC: vmrss (charges) [message #5935 is a reply to message #5922] | Tue, 05 September 2006 15:29   |  
	| Introduce calls to the BC code throughout the kernel to add accounting of physical pages and privvmpages.
 
 Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
 Signed-Off-By: Kirill Korotaev <dev@sw.ru>
 
 ---
 
 fs/exec.c          |   11 ++++
 include/linux/mm.h |    3 -
 kernel/fork.c      |    2
 mm/filemap_xip.c   |    2
 mm/fremap.c        |   11 ++++
 mm/memory.c        |  141 +++++++++++++++++++++++++++++++++++++++++------------
 mm/migrate.c       |    3 +
 mm/mprotect.c      |   12 +++-
 mm/rmap.c          |    4 +
 mm/swapfile.c      |   47 ++++++++++++-----
 10 files changed, 186 insertions(+), 50 deletions(-)
 
 --- ./fs/exec.c.bcrssch	2006-09-05 12:53:55.000000000 +0400
 +++ ./fs/exec.c	2006-09-05 13:51:55.000000000 +0400
 @@ -50,6 +50,8 @@
 #include <linux/cn_proc.h>
 #include <linux/audit.h>
 
 +#include <bc/vmrss.h>
 +
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
 
 @@ -308,6 +310,11 @@ void install_arg_page(struct vm_area_str
 struct mm_struct *mm = vma->vm_mm;
 pte_t * pte;
 spinlock_t *ptl;
 +	struct page_beancounter *pb;
 +
 +	pb = bc_alloc_rss_counter();
 +	if (IS_ERR(pb))
 +		goto out_nopb;
 
 if (unlikely(anon_vma_prepare(vma)))
 goto out;
 @@ -325,11 +332,15 @@ void install_arg_page(struct vm_area_str
 set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(
 page, vma->vm_page_prot))));
 page_add_new_anon_rmap(page, vma, address);
 +	bc_vmrss_page_add(page, mm, vma, &pb);
 pte_unmap_unlock(pte, ptl);
 
 /* no need for flush_tlb */
 +	bc_free_rss_counter(pb);
 return;
 out:
 +	bc_free_rss_counter(pb);
 +out_nopb:
 __free_page(page);
 force_sig(SIGKILL, current);
 }
 --- ./include/linux/mm.h.bcrssch	2006-09-05 13:47:12.000000000 +0400
 +++ ./include/linux/mm.h	2006-09-05 13:51:55.000000000 +0400
 @@ -753,7 +753,8 @@ void free_pgd_range(struct mmu_gather **
 void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
 unsigned long floor, unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 -			struct vm_area_struct *vma);
 +			struct vm_area_struct *vma,
 +			struct vm_area_struct *dst_vma);
 int zeromap_page_range(struct vm_area_struct *vma, unsigned long from,
 unsigned long size, pgprot_t prot);
 void unmap_mapping_range(struct address_space *mapping,
 --- ./kernel/fork.c.bcrssch	2006-09-05 13:23:27.000000000 +0400
 +++ ./kernel/fork.c	2006-09-05 13:51:55.000000000 +0400
 @@ -280,7 +280,7 @@ static inline int dup_mmap(struct mm_str
 rb_parent = &tmp->vm_rb;
 
 mm->map_count++;
 -		retval = copy_page_range(mm, oldmm, mpnt);
 +		retval = copy_page_range(mm, oldmm, mpnt, tmp);
 
 if (tmp->vm_ops && tmp->vm_ops->open)
 tmp->vm_ops->open(tmp);
 --- ./mm/filemap_xip.c.bcrssch	2006-07-10 12:39:20.000000000 +0400
 +++ ./mm/filemap_xip.c	2006-09-05 13:51:55.000000000 +0400
 @@ -13,6 +13,7 @@
 #include <linux/module.h>
 #include <linux/uio.h>
 #include <linux/rmap.h>
 +#include <bc/vmrss.h>
 #include <asm/tlbflush.h>
 #include "filemap.h"
 
 @@ -189,6 +190,7 @@ __xip_unmap (struct address_space * mapp
 /* Nuke the page table entry. */
 flush_cache_page(vma, address, pte_pfn(*pte));
 pteval = ptep_clear_flush(vma, address, pte);
 +			bc_vmrss_page_del(page, mm, vma);
 page_remove_rmap(page);
 dec_mm_counter(mm, file_rss);
 BUG_ON(pte_dirty(pteval));
 --- ./mm/fremap.c.bcrssch	2006-09-05 12:53:59.000000000 +0400
 +++ ./mm/fremap.c	2006-09-05 13:51:55.000000000 +0400
 @@ -16,6 +16,8 @@
 #include <linux/module.h>
 #include <linux/syscalls.h>
 
 +#include <bc/vmrss.h>
 +
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
 @@ -33,6 +35,7 @@ static int zap_pte(struct mm_struct *mm,
 if (page) {
 if (pte_dirty(pte))
 set_page_dirty(page);
 +			bc_vmrss_page_del(page, mm, vma);
 page_remove_rmap(page);
 page_cache_release(page);
 }
 @@ -57,6 +60,11 @@ int install_page(struct mm_struct *mm, s
 pte_t *pte;
 pte_t pte_val;
 spinlock_t *ptl;
 +	struct page_beancounter *pb;
 +
 +	pb = bc_alloc_rss_counter();
 +	if (IS_ERR(pb))
 +		goto out_nopb;
 
 pte = get_locked_pte(mm, addr, &ptl);
 if (!pte)
 @@ -82,12 +90,15 @@ int install_page(struct mm_struct *mm, s
 pte_val = mk_pte(page, prot);
 set_pte_at(mm, addr, pte, pte_val);
 page_add_file_rmap(page);
 +	bc_vmrss_page_add(page, mm, vma, &pb);
 update_mmu_cache(vma, addr, pte_val);
 lazy_mmu_prot_update(pte_val);
 err = 0;
 unlock:
 pte_unmap_unlock(pte, ptl);
 out:
 +	bc_free_rss_counter(pb);
 +out_nopb:
 return err;
 }
 EXPORT_SYMBOL(install_page);
 --- ./mm/memory.c.bcrssch	2006-09-05 12:53:59.000000000 +0400
 +++ ./mm/memory.c	2006-09-05 13:51:55.000000000 +0400
 @@ -51,6 +51,9 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 
 +#include <bc/vmpages.h>
 +#include <bc/vmrss.h>
 +
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
 #include <asm/tlb.h>
 @@ -427,7 +430,9 @@ struct page *vm_normal_page(struct vm_ar
 static inline void
 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
 -		unsigned long addr, int *rss)
 +		unsigned long addr, int *rss,
 +		struct vm_area_struct *dst_vma,
 +		struct page_beancounter **ppb)
 {
 unsigned long vm_flags = vma->vm_flags;
 pte_t pte = *src_pte;
 @@ -481,6 +486,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 page = vm_normal_page(vma, addr, pte);
 if (page) {
 get_page(page);
 +		bc_vmrss_page_dup(page, dst_mm, dst_vma, ppb);
 page_dup_rmap(page);
 rss[!!PageAnon(page)]++;
 }
 @@ -489,20 +495,32 @@ out_set_pte:
 set_pte_at(dst_mm, addr, dst_pte, pte);
 }
 
 +#define pte_ptrs(a)     (PTRS_PER_PTE - ((a >> PAGE_SHIFT)&(PTRS_PER_PTE - 1)))
 +
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
 -		unsigned long addr, unsigned long end)
 +		unsigned long addr, unsigned long end,
 +		struct vm_area_struct *dst_vma)
 {
 pte_t *src_pte, *dst_pte;
 spinlock_t *src_ptl, *dst_ptl;
 int progress = 0;
 -	int rss[2];
 +	int rss[2], err;
 +	struct page_beancounter *pb;
 
 +	err = -ENOMEM;
 +	pb = (mm_same_bc(dst_mm, src_mm) ? PB_COPY_SAME : NULL);
 again:
 +	if (pb != PB_COPY_SAME) {
 +		pb = bc_alloc_rss_counter_list(pte_ptrs(addr), pb);
 +		if (IS_ERR(pb))
 +			goto out;
 +	}
 +
 rss[1] = rss[0] = 0;
 dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
 if (!dst_pte)
 -		return -ENOMEM;
 +		goto out;
 src_pte = pte_offset_map_nested(src_pmd, addr);
 src_ptl = pte_lockptr(src_mm, src_pmd);
 spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 @@ -524,7 +542,8 @@ again:
 progress++;
 continue;
 }
 -		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
 +		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss,
 +				dst_vma, &pb);
 progress += 8;
 } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
 @@ -536,12 +555,18 @@ again:
 cond_resched();
 if (addr != end)
 goto again;
 -	return 0;
 +
 +	err = 0;
 +out:
 +	if (pb != PB_COPY_SAME)
 +		bc_free_rss_counter(pb);
 +	return err;
 }
 
 static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
 -		unsigned long addr, unsigned long end)
 +		unsigned long addr, unsigned long end,
 +		struct vm_area_struct *dst_vma)
 {
 pmd_t *src_pmd, *dst_pmd;
 unsigned long next;
 @@ -555,7 +580,7 @@ static inline int copy_pmd_range(struct
 if (pmd_none_or_clear_bad(src_pmd))
 continue;
 if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
 -						vma, addr, next))
 +						vma, addr, next, dst_vma))
 return -ENOMEM;
 } while (dst_pmd++, src_pmd++, addr = next, addr != end);
 return 0;
 @@ -563,7 +588,8 @@ static inline int copy_pmd_range(struct
 
 static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
 -		unsigned long addr, unsigned long end)
 +		unsigned long addr, unsigned long end,
 +		struct vm_area_struct *dst_vma)
 {
 pud_t *src_pud, *dst_pud;
 unsigned long next;
 @@ -577,14 +603,14 @@ static inline int copy_pud_range(struct
 if (pud_none_or_clear_bad(src_pud))
 continue;
 if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
 -						vma, addr, next))
 +						vma, addr, next, dst_vma))
 return -ENOMEM;
 } while (dst_pud++, src_pud++, addr = next, addr != end);
 return 0;
 }
 
 int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 -		struct vm_area_struct *vma)
 +		struct vm_area_struct *vma, struct vm_area_struct *dst_vma)
 {
 pgd_t *src_pgd, *dst_pgd;
 unsigned long next;
 @@ -612,7 +638,7 @@ int copy_page_range(struct mm_struct *ds
 if (pgd_none_or_clear_bad(src_pgd))
 continue;
 if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
 -						vma, addr, next))
 +						vma, addr, next, dst_vma))
 return -ENOMEM;
 } while (dst_pgd++, src_pgd++, addr = next, addr != end);
 return 0;
 @@ -681,6 +707,7 @@ static unsigned long zap_pte_range(struc
 mark_page_accessed(page);
 file_rss--;
 }
 +			bc_vmrss_page_del(page, mm, vma);
 page_remove_rmap(page);
 tlb_remove_page(tlb, page);
 continue;
 @@ -1104,8 +1131,9 @@ int get_user_pages(struct task_struct *t
 }
 EXPORT_SYMBOL(get_user_pages);
 
 -static int zeromap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 -			unsigned long addr, unsigned long end, pgprot_t prot)
 +static int zeromap_pte_range(struct mm_struct *mm,
 +		struct vm_area_struct *vma, pmd_t *pmd,
 +		unsigned long addr, unsigned long end, pgprot_t prot)
 {
 pte_t *pte;
 spinlock_t *ptl;
 @@ -1118,6 +1146,7 @@ static int zeromap_pte_range(struct mm_s
 struct page *page = ZERO_PAGE(addr);
 pte_t zero_pte = pte_wrprotect(mk_pte(page, prot));
 page_cache_get(page);
 +		bc_vmrss_page_add_noref(page,
...
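
The hunks for install_arg_page() and install_page() above share one pattern: the page beancounter is allocated before the PTE lock is taken, handed over by pointer-to-pointer while the lock is held, and any leftover is freed on the way out. A minimal user-space sketch of that pattern follows. It is not the patch's implementation: struct pb_model, the model_* helpers, and the 4 KiB page / 512-entry table constants are assumptions made for illustration, and allocation failure is reported as NULL here rather than the error pointer the patch checks with IS_ERR(). The macro repeats the arithmetic of pte_ptrs(), with the argument parenthesized.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct page_beancounter; not the patch's layout. */
struct pb_model {
	int in_use;
};

/* Model constants (assumptions): 4 KiB pages, 512 PTEs per page table. */
#define MODEL_PAGE_SHIFT	12
#define MODEL_PTRS_PER_PTE	512UL

/* Number of PTE slots from address a to the end of its page table. */
#define model_pte_ptrs(a) \
	(MODEL_PTRS_PER_PTE - (((a) >> MODEL_PAGE_SHIFT) & (MODEL_PTRS_PER_PTE - 1)))

/* Step 1: allocate outside the lock, where failure can still be handled. */
static struct pb_model *model_alloc_rss_counter(void)
{
	return calloc(1, sizeof(struct pb_model));
}

/*
 * Step 2: under the lock, take ownership of the preallocated counter.
 * *ppb is cleared so that the caller's later free becomes a no-op.
 */
static struct pb_model *model_page_add(struct pb_model **ppb)
{
	struct pb_model *pb = *ppb;

	assert(pb != NULL);
	pb->in_use = 1;
	*ppb = NULL;
	return pb;
}

/* Step 3: free whatever was not consumed; NULL is accepted. */
static void model_free_rss_counter(struct pb_model *pb)
{
	free(pb);
}
```

A caller would run the three steps in order and can call model_free_rss_counter() unconditionally on every exit path, which is what lets the success and error paths in the patch share their cleanup.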
 
 
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [PATCH] BC: resource beancounters (v4) (added user memory) [message #5945 is a reply to message #5922] | Tue, 05 September 2006 17:46   |  
			| 
				
				
					|  Dave Hansen Messages: 240
 Registered: October 2005
 | Senior Member |  |  |  
	| On Tue, 2006-09-05 at 19:02 +0400, Kirill Korotaev wrote:
 > Core Resource Beancounters (BC) + kernel/user memory control.
 >
 > BC allows to account and control consumption
 > of kernel resources used by group of processes.
 
 Hi Kirill,
 
 I've honestly lost track of these discussions along the way, so I hope
 you don't mind summarizing a bit.
 
 Do these patches help with accounting for anything other than memory?
 Will we need new user/kernel interfaces for cpu, i/o bandwidth, etc...?
 
 Have you given any thought to the possibility that a task might need to
 move between accounting contexts?  That has certainly been a
 "requirement" pushed on to CKRM for a long time, and the need goes
 something like this:
 
 1. A system runs a web server, which services several virtual domains
 2. that web server receives a request for foo.com
 3. the web server switches into foo.com's accounting context
 4. the web server reads things from disk, allocates some memory, and
 makes a database request.
 5. the database receives the request, and switches into foo.com's
 accounting context, and charges foo.com for its resource use
 etc...
 
 So, the goal is to run _one_ copy of an application on a system, but
 account for its resources in a much more fine-grained way than at the
 application level.
 
 I think we can probably use beancounters for this, if we do not worry
 about migrating _existing_ charges when we change accounting context.
 Does that make sense?
 
 -- Dave
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [PATCH] BC: resource beancounters (v4) (added user memory) [message #5946 is a reply to message #5945] | Tue, 05 September 2006 18:28   |  
			| 
				
				
					|  Balbir Singh Messages: 491
 Registered: August 2006
 | Senior Member |  |  |  
	| Dave Hansen wrote:
 > On Tue, 2006-09-05 at 19:02 +0400, Kirill Korotaev wrote:
 >> Core Resource Beancounters (BC) + kernel/user memory control.
 >>
 >> BC allows to account and control consumption
 >> of kernel resources used by group of processes.
 >
 > Hi Kirill,
 >
 > I've honestly lost track of these discussions along the way, so I hope
 > you don't mind summarizing a bit.
 >
 > Do these patches help with accounting for anything other than memory?
 > Will we need new user/kernel interfaces for cpu, i/o bandwidth, etc...?
 >
 > Have you given any thought to the possibility that a task might need to
 > move between accounting contexts?  That has certainly been a
 > "requirement" pushed on to CKRM for a long time, and the need goes
 > something like this:
 >
 > 1. A system runs a web server, which services several virtual domains
 > 2. that web server receives a request for foo.com
 > 3. the web server switches into foo.com's accounting context
 > 4. the web server reads things from disk, allocates some memory, and
 >    makes a database request.
 > 5. the database receives the request, and switches into foo.com's
 >    accounting context, and charges foo.com for its resource use
 > etc...
 >
 > So, the goal is to run _one_ copy of an application on a system, but
 > account for its resources in a much more fine-grained way than at the
 > application level.
 >
 > I think we can probably use beancounters for this, if we do not worry
 > about migrating _existing_ charges when we change accounting context.
 > Does that make sense?
 >
 > -- Dave
 
 This is stated much better than I managed. Thanks!
 
 --
 
 Balbir Singh,
 Linux Technology Center,
 IBM Software Labs
 |  
	|  |  |  
	| 
		
			| Re: [PATCH 11/13] BC: vmrss (preparations) [message #5952 is a reply to message #5933] | Tue, 05 September 2006 22:09   |  
			| 
				
				
					|  Cedric Le Goater Messages: 443
 Registered: February 2006
 | Senior Member |  |  |  
	| Kirill Korotaev wrote: 
 <snip>
 
 > --- ./include/bc/beancounter.h.bcvmrssprep    2006-09-05
 > 13:17:50.000000000 +0400
 > +++ ./include/bc/beancounter.h    2006-09-05 13:44:33.000000000 +0400
 > @@ -45,6 +45,13 @@ struct bc_resource_parm {
 > #define BC_MAXVALUE    LONG_MAX
 >
 > /*
 > + * This magic is used to distinguish user beancounter and pages beancounter
 > + * in struct page. page_ub and page_bc are placed in union and MAGIC
 > + * ensures us that we don't use pbc as ubc in bc_page_uncharge().
 > + */
 > +#define BC_MAGIC                0x62756275UL
 > +
 > +/*
 >  *    Resource management structures
 >  * Serialization issues:
 >  *   beancounter list management is protected via bc_hash_lock
 > @@ -54,11 +61,13 @@ struct bc_resource_parm {
 >  */
 >
 > struct beancounter {
 > +    unsigned long        bc_magic;
 >     atomic_t        bc_refcount;
 >     spinlock_t        bc_lock;
 >     bcid_t            bc_id;
 >     struct hlist_node    hash;
 >
 > +    unsigned long        unused_privvmpages;
 >     /* resources statistics and settings */
 >     struct bc_resource_parm    bc_parms[BC_RESOURCES];
 > };
 > @@ -74,6 +83,8 @@ enum bc_severity { BC_BARRIER, BC_LIMIT,
 >
 > #ifdef CONFIG_BEANCOUNTERS
 >
 > +extern unsigned int nr_beancounters = 1;
 > +
 
 my gcc doesn't like this one ...
 
 regards,
 
 C.
 
 Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
 
 ---
 include/bc/beancounter.h |    2 +-
 kernel/bc/beancounter.c  |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
 
 Index: 2.6.18-rc5-mm1/include/bc/beancounter.h
 ===================================================================
 --- 2.6.18-rc5-mm1.orig/include/bc/beancounter.h
 +++ 2.6.18-rc5-mm1/include/bc/beancounter.h
 @@ -86,7 +86,7 @@ enum bc_severity { BC_BARRIER, BC_LIMIT,
 
 #ifdef CONFIG_BEANCOUNTERS
 
 -extern unsigned int nr_beancounters = 1;
 +extern unsigned int nr_beancounters;
 
 /*
 * These functions tune minheld and maxheld values for a given
 Index: 2.6.18-rc5-mm1/kernel/bc/beancounter.c
 ===================================================================
 --- 2.6.18-rc5-mm1.orig/kernel/bc/beancounter.c
 +++ 2.6.18-rc5-mm1/kernel/bc/beancounter.c
 @@ -20,7 +20,7 @@ static void init_beancounter_struct(stru
 
 struct beancounter init_bc;
 
 -unsigned int nr_beancounters;
 +unsigned int nr_beancounters = 1;
 
 const char *bc_rnames[] = {
 "kmemsize",	/* 0 */
 |  
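
The fix above separates the declaration from the definition. In C, an `extern` declaration that carries an initializer is itself a definition, so placing it in a header defines the variable in every file that includes it, and gcc warns that the variable is "initialized and declared 'extern'". The sketch below shows the corrected split in a single translation unit. The comments mark which line would live in which file, and count_after_create() is a hypothetical helper that exists only so the example has something to call.

```c
#include <assert.h>

/* Header part (include/bc/beancounter.h): declaration only, no storage. */
extern unsigned int nr_beancounters;

/* Source part (kernel/bc/beancounter.c): the single definition and its
 * initial value. */
unsigned int nr_beancounters = 1;

/* Hypothetical helper, not part of the patch. */
static unsigned int count_after_create(void)
{
	return ++nr_beancounters;
}
```

Keeping the initializer next to the one definition means that the starting value of 1 is stated exactly once, no matter how many files include the header.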
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [PATCH] BC: resource beancounters (v4) (added user memory) [message #5954 is a reply to message #5945] | Wed, 06 September 2006 00:17   |  
			| 
				
				
					|  Rohit Seth Messages: 101
 Registered: August 2006
 | Senior Member |  |  |  
	| On Tue, 2006-09-05 at 10:46 -0700, Dave Hansen wrote:
 > On Tue, 2006-09-05 at 19:02 +0400, Kirill Korotaev wrote:
 > > Core Resource Beancounters (BC) + kernel/user memory control.
 > >
 > > BC allows to account and control consumption
 > > of kernel resources used by group of processes.
 >
 > Hi Kirill,
 >
 > I've honestly lost track of these discussions along the way, so I hope
 > you don't mind summarizing a bit.
 >
 > Do these patches help with accounting for anything other than memory?
 > Will we need new user/kernel interfaces for cpu, i/o bandwidth, etc...?
 >
 > Have you given any thought to the possibility that a task might need to
 > move between accounting contexts?  That has certainly been a
 > "requirement" pushed on to CKRM for a long time, and the need goes
 > something like this:
 >
 > 1. A system runs a web server, which services several virtual domains
 > 2. that web server receives a request for foo.com
 > 3. the web server switches into foo.com's accounting context
 > 4. the web server reads things from disk, allocates some memory, and
 >    makes a database request.
 > 5. the database receives the request, and switches into foo.com's
 >    accounting context, and charges foo.com for its resource use
 > etc...
 >
 
 I'm wondering why not have different processes serve different
 domains on the same physical server, particularly when they have
 different databases to work on.  Is the memory you save by having
 a single copy so valuable that you are willing to serialize the
 whole operation?  What happens if, while the request for foo.com is
 being worked on, another request arrives for foo_bar.com?  Does it
 need to wait for the foo.com request to finish before it can be
 served?
 
 > So, the goal is to run _one_ copy of an application on a system, but
 > account for its resources in a much more fine-grained way than at the
 > application level.
 >
 
 What is that fine-grained way?  If it is not process based, can it be
 associated with a file system location?
 
 -rohit
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [PATCH 5/13] BC: user interface (syscalls) [message #5996 is a reply to message #5927] | Wed, 06 September 2006 13:45   |  
			| 
				
				
					|  Balbir Singh Messages: 491
 Registered: August 2006
 | Senior Member |  |  |  
	| Kirill Korotaev wrote:
 > Add the following system calls for BC management:
 >  1. sys_get_bcid     - get current BC id
 >  2. sys_set_bcid     - change exec_ and fork_ BCs on current
 >  3. sys_set_bclimit  - set limits for resource consumption
 >  4. sys_get_bcstat   - return bc_resource_parm on resource
 >
 > Signed-off-by: Pavel Emelianov <xemul@sw.ru>
 > Signed-off-by: Kirill Korotaev <dev@sw.ru>
 >
 > --- ./include/asm-powerpc/systbl.h.bcsys	2006-07-10 12:39:19.000000000 +0400
 > +++ ./include/asm-powerpc/systbl.h	2006-09-05 12:47:21.000000000 +0400
 > @@ -304,3 +304,7 @@ SYSCALL_SPU(fchmodat)
 >  SYSCALL_SPU(faccessat)
 >  COMPAT_SYS_SPU(get_robust_list)
 >  COMPAT_SYS_SPU(set_robust_list)
 > +SYSCALL(sys_get_bcid)
 > +SYSCALL(sys_set_bcid)
 > +SYSCALL(sys_set_bclimit)
 > +SYSCALL(sys_get_bcstat)
 
 
 Fix a build error on powerpc, caught by Vaidyanathan Srinivasan while
 compiling. Entries in the powerpc syscall table do not need the sys_ prefix.
 
 Signed-off-by: Balbir Singh <balbir@in.ibm.com>
 Signed-off-by: Vaidyanathan Srinivasan <svaidy@in.ibm.com>
 ---
 
 include/asm-powerpc/systbl.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)
 
 diff -puN include/asm-powerpc/systbl.h~fix-powerpc-build include/asm-powerpc/systbl.h
 --- linux-2.6.18-rc5/include/asm-powerpc/systbl.h~fix-powerpc-build	2006-09-06 19:03:18.000000000 +0530
 +++ linux-2.6.18-rc5-balbir/include/asm-powerpc/systbl.h	2006-09-06 19:03:38.000000000 +0530
 @@ -304,7 +304,7 @@ SYSCALL_SPU(fchmodat)
 SYSCALL_SPU(faccessat)
 COMPAT_SYS_SPU(get_robust_list)
 COMPAT_SYS_SPU(set_robust_list)
 -SYSCALL(sys_get_bcid)
 -SYSCALL(sys_set_bcid)
 -SYSCALL(sys_set_bclimit)
 -SYSCALL(sys_get_bcstat)
 +SYSCALL(get_bcid)
 +SYSCALL(set_bcid)
 +SYSCALL(set_bclimit)
 +SYSCALL(get_bcstat)
 _
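
Why the extra prefix breaks the build can be sketched with a small preprocessor model. The assumption here, not confirmed anywhere in this thread, is that the table macro pastes `sys_` onto its argument itself; MODEL_SYSCALL and the two string constants are hypothetical names, and the real macro emits assembler table entries rather than strings.

```c
#include <assert.h>
#include <string.h>

/* Two-level stringification so the pasted token is what gets quoted. */
#define MODEL_STR(x)		#x
#define MODEL_XSTR(x)		MODEL_STR(x)

/* Model of a table macro that adds the sys_ prefix on its own (assumption). */
#define MODEL_SYSCALL(func)	MODEL_XSTR(sys_##func)

/* Entry as written in the fix: resolves to the real handler name. */
static const char *const model_fixed_entry = MODEL_SYSCALL(get_bcid);

/* Entry as written in the original patch: the prefix is applied twice. */
static const char *const model_broken_entry = MODEL_SYSCALL(sys_get_bcid);
```

Under that assumption the original entries would refer to symbols such as sys_sys_get_bcid, which do not exist, so the link fails.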
 
 --
 
 Balbir Singh,
 Linux Technology Center,
 IBM Software Labs
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [PATCH] BC: resource beancounters (v4) (added user memory) [message #5997 is a reply to message #5945] | Wed, 06 September 2006 13:54   |  
			| 
				
				
					|  dev Messages: 1693
 Registered: September 2005
 Location: Moscow
 | Senior Member |  
 |  |  
	| > On Tue, 2006-09-05 at 19:02 +0400, Kirill Korotaev wrote:
 >
 >>Core Resource Beancounters (BC) + kernel/user memory control.
 >>
 >>BC allows to account and control consumption
 >>of kernel resources used by group of processes.
 >
 >
 > Hi Kirill,
 >
 > I've honestly lost track of these discussions along the way, so I hope
 > you don't mind summarizing a bit.
 I think we need to create a wiki page to summarize it once and for all:
 http://wiki.openvz.org/UBC_discussion
 
 > Do these patches help with accounting for anything other than memory?
 This patch set does not, but the complete one does:
 * numfile
 * numptys
 * numsocks (TCP, other, etc.)
 * numtasks
 * numflocks
 ...
 This list of resources was chosen to make sure that no DoS from the container
 is possible.
 The list is easily extensible, and if a resource is of no interest then
 its limits can be set to unlimited.
 
 > Will we need new user/kernel interfaces for cpu, i/o bandwidth, etc...?
 No, no new interfaces are required.

 BUT: let me remind you of the talks at OKS/OLS and the previous UBC discussions.
 It was noted that having separate interfaces for CPU, I/O bandwidth
 and memory may be worthwhile. BTW, I/O bandwidth already has a separate
 interface :/
 
 > Have you given any thought to the possibility that a task might need to
 > move between accounting contexts?  That has certainly been a
 > "requirement" pushed on to CKRM for a long time, and the need goes
 > something like this:
 Yes, we thought about this, and it is no more problematic for BC
 than for CKRM. See my explanation below.
 
 > 1. A system runs a web server, which services several virtual domains
 > 2. that web server receives a request for foo.com
 > 3. the web server switches into foo.com's accounting context
 > 4. the web server reads things from disk, allocates some memory, and
 >    makes a database request.
 > 5. the database receives the request, and switches into foo.com's
 >    accounting context, and charges foo.com for its resource use
 > etc...
 The question is whether the web server is multithreaded or not.
 If it is not, there is no problem here: you can change the current
 context and new resources will be charged accordingly.

 And the current BC code is _able_ to handle it with _minor_ changes.
 (One just needs to save the bc not on the mm struct but on the vma struct,
 and change mm->bc on set_bc_id()).
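
The single-threaded case described above can be sketched as a toy model: switching the accounting context only changes where future charges go, and charges already made stay where they are. Every name below (struct bc_model, struct task_model, model_set_bc, model_charge) is hypothetical and simplified to one resource; none of it is the BC patch's API.

```c
#include <assert.h>

/* Hypothetical beancounter: an id and the amount currently held. */
struct bc_model {
	unsigned int id;
	unsigned long held;
};

/* Hypothetical task: new charges go to whatever context is current. */
struct task_model {
	struct bc_model *exec_bc;
};

/* Switch accounting context; existing charges are not migrated. */
static void model_set_bc(struct task_model *t, struct bc_model *bc)
{
	t->exec_bc = bc;
}

/* Charge a number of pages to the task's current context. */
static void model_charge(struct task_model *t, unsigned long pages)
{
	t->exec_bc->held += pages;
}
```

A web server task would call model_set_bc() with foo.com's beancounter on receiving a request, do its work, and switch back afterwards. Whatever it charged before the switch remains on the server's own beancounter.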
 
 However, no one has explained so far (can someone from the CKRM team, please?)
 what to do with threads. Consider the following example.

 1. A threaded web server spawns a child to serve a client.
 2. The child thread touches some pages and they are charged to the child BC
 (which differs from the parent's one).
 3. The child exits, but since its mm is shared with the parent, these pages
 stay mapped and charged to the child BC.

 So the question is: what to do with these pages?
 - should we recharge them to another BC?
 - leave them charged?
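
The two options in the question above can be made concrete with a toy model of a shared mm in which each mapped page remembers the beancounter it was charged to. All names here (struct thr_bc, struct thr_mm, the thr_* functions) are hypothetical, and the recharge policy shown is only one possible answer, not something the patch set implements.

```c
#include <assert.h>
#include <stddef.h>

#define THR_NPAGES 4

/* Hypothetical beancounter holding a page count. */
struct thr_bc {
	unsigned long held;
};

/* A shared mm: owner[i] is the beancounter charged for page i, or NULL. */
struct thr_mm {
	struct thr_bc *owner[THR_NPAGES];
};

/* A thread running under bc touches page idx for the first time. */
static void thr_touch(struct thr_mm *mm, size_t idx, struct thr_bc *bc)
{
	if (mm->owner[idx] == NULL) {
		mm->owner[idx] = bc;
		bc->held++;
	}
}

/* Option "leave them charged": thread exit does not touch the shared mm. */
static void thr_exit_leave(struct thr_mm *mm, struct thr_bc *bc)
{
	(void)mm;
	(void)bc;
}

/* Option "recharge": move every page charged to 'from' over to 'to'. */
static void thr_exit_recharge(struct thr_mm *mm, struct thr_bc *from,
			      struct thr_bc *to)
{
	size_t i;

	for (i = 0; i < THR_NPAGES; i++) {
		if (mm->owner[i] == from) {
			mm->owner[i] = to;
			from->held--;
			to->held++;
		}
	}
}
```

With the first option the child's beancounter keeps a nonzero held count after the thread is gone. With the second, someone has to choose the target beancounter, which is the policy question raised in the message.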
 
 > So, the goal is to run _one_ copy of an application on a system, but
 > account for its resources in a much more fine-grained way than at the
 > application level.
 Yes.
 
 > I think we can probably use beancounters for this, if we do not worry
 > about migrating _existing_ charges when we change accounting context.
 > Does that make sense?
 Exactly, that's what I'm saying. We can use beancounters for this
 if charges stay with their creator.
 
 Thanks,
 Kirill
 |  
	|  |  |  
	| 
		
			| Re: [PATCH 11/13] BC: vmrss (preparations) [message #5998 is a reply to message #5952] | Wed, 06 September 2006 13:56   |  
			| 
				
				
					|  dev Messages: 1693
 Registered: September 2005
 Location: Moscow
 | Senior Member |  
 |  |  
	| Thanks a lot!!! 
 > Kirill Korotaev wrote:
 >
 > <snip>
 >
 >>--- ./include/bc/beancounter.h.bcvmrssprep    2006-09-05
 >>13:17:50.000000000 +0400
 >>+++ ./include/bc/beancounter.h    2006-09-05 13:44:33.000000000 +0400
 >>@@ -45,6 +45,13 @@ struct bc_resource_parm {
 >>#define BC_MAXVALUE    LONG_MAX
 >>
 >>/*
 >>+ * This magic is used to distinguish user beancounter and pages beancounter
 >>+ * in struct page. page_ub and page_bc are placed in union and MAGIC
 >>+ * ensures us that we don't use pbc as ubc in bc_page_uncharge().
 >>+ */
 >>+#define BC_MAGIC                0x62756275UL
 >>+
 >>+/*
 >> *    Resource management structures
 >> * Serialization issues:
 >> *   beancounter list management is protected via bc_hash_lock
 >>@@ -54,11 +61,13 @@ struct bc_resource_parm {
 >> */
 >>
 >>struct beancounter {
 >>+    unsigned long        bc_magic;
 >>    atomic_t        bc_refcount;
 >>    spinlock_t        bc_lock;
 >>    bcid_t            bc_id;
 >>    struct hlist_node    hash;
 >>
 >>+    unsigned long        unused_privvmpages;
 >>    /* resources statistics and settings */
 >>    struct bc_resource_parm    bc_parms[BC_RESOURCES];
 >>};
 >>@@ -74,6 +83,8 @@ enum bc_severity { BC_BARRIER, BC_LIMIT,
 >>
 >>#ifdef CONFIG_BEANCOUNTERS
 >>
 >>+extern unsigned int nr_beancounters = 1;
 >>+
 >
 >
 > my gcc doesn't like this one ...
 >
 > regards,
 >
 > C.
 >
 > Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
 >
 > ---
 >  include/bc/beancounter.h |    2 +-
 >  kernel/bc/beancounter.c  |    2 +-
 >  2 files changed, 2 insertions(+), 2 deletions(-)
 >
 > Index: 2.6.18-rc5-mm1/include/bc/beancounter.h
 > ===================================================================
 > --- 2.6.18-rc5-mm1.orig/include/bc/beancounter.h
 > +++ 2.6.18-rc5-mm1/include/bc/beancounter.h
 > @@ -86,7 +86,7 @@ enum bc_severity { BC_BARRIER, BC_LIMIT,
 >
 >  #ifdef CONFIG_BEANCOUNTERS
 >
 > -extern unsigned int nr_beancounters = 1;
 > +extern unsigned int nr_beancounters;
 >
 >  /*
 >   * These functions tune minheld and maxheld values for a given
 > Index: 2.6.18-rc5-mm1/kernel/bc/beancounter.c
 > ===================================================================
 > --- 2.6.18-rc5-mm1.orig/kernel/bc/beancounter.c
 > +++ 2.6.18-rc5-mm1/kernel/bc/beancounter.c
 > @@ -20,7 +20,7 @@ static void init_beancounter_struct(stru
 >
 >  struct beancounter init_bc;
 >
 > -unsigned int nr_beancounters;
 > +unsigned int nr_beancounters = 1;
 >
 >  const char *bc_rnames[] = {
 >  	"kmemsize",	/* 0 */
 >
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [PATCH] BC: resource beancounters (v4) (added user memory) [message #6014 is a reply to message #5992] | Wed, 06 September 2006 19:17   |  
			| 
				
				
					|  Balbir Singh Messages: 491
 Registered: August 2006
 | Senior Member |  |  |  
	| Kirill Korotaev wrote:
 > Balbir Singh wrote:
 >> Kirill Korotaev wrote:
 >>
 >>> Core Resource Beancounters (BC) + kernel/user memory control.
 >>>
 >>> BC allows to account and control consumption
 >>> of kernel resources used by group of processes.
 >>>
 >>> Draft UBC description on OpenVZ wiki can be found at
 >>> http://wiki.openvz.org/UBC_parameters
 >>>
 >>> The full BC patch set allows to control:
 >>> - kernel memory. All the kernel objects allocatable
 >>> on user demand should be accounted and limited
 >>> for DoS protection.
 >>> E.g. page tables, task structs, vmas etc.
 >>
 >> One of the key requirements of resource management for us is to be able to
 >> migrate tasks across resource groups. Since bean counters do not associate
 >> a list of tasks associated with them, I do not see how this can be done
 >> with the existing bean counters.
 > It was discussed multiple times already.
 > The key problem here is the objects which do not _belong_ to tasks.
 > e.g. IPC objects. They exist in global namespace and can't be reaccounted.
 > At least no one proposed the policy to reaccount.
 > And please note, IPCs are not the only such objects.
 >
 > But I guess your comment mostly concerns user pages, yeah?
 
 Yes.
 
 > In this case reaccounting can be easily done using page beancounters
 > which are introduced in this patch set.
 > So if it is a requirement, then lets cooperate and create such functionality.
 >
 
 Sure, let's cooperate and talk.
 
 > So for now I see 2 main requirements from people:
 > - memory reclamation
 > - tasks moving across beancounters
 >
 
 There are some less urgent ones too, such as support for guarantees. I think
 these can be worked out as we make progress.
 
 > I agree with these requirements and lets move into this direction.
 > But moving so far can't be done without accepting:
 > 1. core functionality
 > 2. accounting
 >
 
 Some of the core functionality might be a limiting factor for the requirements.
 Let's agree on the requirements first (I think that would be a great step
 forward) and then build the core functionality with these requirements in mind.
 
 > Thanks,
 > Kirill
 >
 --
 
 Balbir Singh,
 Linux Technology Center,
 IBM Software Labs
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [PATCH] BC: resource beancounters (v4) (added user memory) [message #6023 is a reply to message #5992] | Wed, 06 September 2006 21:47   |  
			| 
				
				
					|  Chandra Seetharaman Messages: 88
 Registered: August 2006
 | Member |  |  |  
	| On Wed, 2006-09-06 at 17:06 +0400, Kirill Korotaev wrote:
 > Balbir Singh wrote:
 > > Kirill Korotaev wrote:
 > >
 > >> Core Resource Beancounters (BC) + kernel/user memory control.
 > >>
 > >> BC allows to account and control consumption
 > >> of kernel resources used by group of processes.
 > >>
 > >> Draft UBC description on OpenVZ wiki can be found at
 > >> http://wiki.openvz.org/UBC_parameters
 > >>
 > >> The full BC patch set allows to control:
 > >> - kernel memory. All the kernel objects allocatable
 > >> on user demand should be accounted and limited
 > >> for DoS protection.
 > >> E.g. page tables, task structs, vmas etc.
 > >
 > >
 > > One of the key requirements of resource management for us is to be able to
 > > migrate tasks across resource groups. Since bean counters do not associate
 > > a list of tasks associated with them, I do not see how this can be done
 > > with the existing bean counters.
 > It was discussed multiple times already.
 > The key problem here is the objects which do not _belong_ to tasks.
 > e.g. IPC objects. They exist in global namespace and can't be reaccounted.
 > At least no one proposed the policy to reaccount.
 > And please note, IPCs are not the only such objects.
 
 From an implementation point of view, I do not see it being any different
 from how it can be done under UBC.

 AFAICS, beancounters are associated with tasks, not with those "objects".
 Those "objects" get their bc through some association with a task. The
 same can be done in the other case as well.

 If my understanding is wrong, please tell me how one can associate such
 an "object" with a bc.
 
 >
 > But I guess your comment mostly concerns user pages, yeah?
 > In this case reaccounting can be easily done using page beancounters
 > which are introduced in this patch set.
 > So if it is a requirement, then lets cooperate and create such functionality.
 
 Hmm... that is what I thought I was doing when replying on these
 threads. Maybe I should have waited for this "call for co-operation"
 before jumping in :)
 
 >
 > So for now I see 2 main requirements from people:
 > - memory reclamation
 > - tasks moving across beancounters
 
 Please consider the requirements I listed before
 http://marc.theaimsgroup.com/?l=ckrm-tech&m=115593001810616&w=2
 
 >
 > I agree with these requirements and lets move into this direction.
 > But moving so far can't be done without accepting:
 > 1. core functionality
 > 2. accounting
 
 I agree that discussion needs to happen on the core functionality and
 interface.
 >
 > Thanks,
 > Kirill
 >
 >
 --
 
 ----------------------------------------------------------------------
 Chandra Seetharaman               | Be careful what you choose....
 - sekharan@us.ibm.com   |      .......you may get it.
 ----------------------------------------------------------------------
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [PATCH] BC: resource beancounters (v4) (added user memory) [message #6024 is a reply to message #5997] | Wed, 06 September 2006 21:54   |  
			| 
				
				
					|  Chandra Seetharaman Messages: 88
 Registered: August 2006
 | Member |  |  |  
	| On Wed, 2006-09-06 at 17:57 +0400, Kirill Korotaev wrote:
 > > On Tue, 2006-09-05 at 19:02 +0400, Kirill Korotaev wrote:
 > >
 > >>Core Resource Beancounters (BC) + kernel/user memory control.
 > >>
 > >>BC allows to account and control consumption
 > >>of kernel resources used by group of processes.
 > >
 > >
 > > Hi Kirill,
 > >
 > > I've honestly lost track of these discussions along the way, so I hope
 > > you don't mind summarizing a bit.
 > I think we need to create wiki to summarize it once and forever.
 > http://wiki.openvz.org/UBC_discussion
 >
 > > Do these patches help with accounting for anything other than memory?
 > this patch set - no, but the complete one - does:
 > * numfile
 > * numptys
 > * numsocks (TCP, other, etc.)
 > * numtasks
 > * numflocks
 > ...
 > This list of resources was chosen to make sure that no DoS from the container
 > is possible.
 > The list is easily extensible, and if a resource is of no interest then
 > its limit can be set to unlimited.
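
 For illustration, the "account by number, and set the limit to unlimited if you do not care" idea can be sketched in user-space C. This is only a sketch under assumptions: `struct bc_resource`, `bc_charge()`, `bc_uncharge()` and `BC_UNLIMITED` are hypothetical names, not code from the patch set, and locking and per-container lookup are left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical sentinel: a limit that can never be reached. */
#define BC_UNLIMITED ULONG_MAX

/* One countable resource (e.g. numfile or numtasks); illustrative only. */
struct bc_resource {
    unsigned long held;    /* units currently charged */
    unsigned long limit;   /* maximum allowed, or BC_UNLIMITED */
    unsigned long failcnt; /* number of refused charges */
};

/* Try to charge `amount` units; refuse if that would exceed the limit. */
static bool bc_charge(struct bc_resource *res, unsigned long amount)
{
    /* Compare against the remaining headroom so the sum cannot overflow. */
    if (res->held > res->limit || amount > res->limit - res->held) {
        res->failcnt++;
        return false;
    }
    res->held += amount;
    return true;
}

/* Give back `amount` units that were charged earlier. */
static void bc_uncharge(struct bc_resource *res, unsigned long amount)
{
    assert(res->held >= amount);
    res->held -= amount;
}
```

 With a limit of 2, a third `bc_charge(&res, 1)` is refused and only bumps `failcnt`; with `BC_UNLIMITED` the resource is still counted in `held` but a charge is never refused in practice.
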
 >
 > > Will we need new user/kernel interfaces for cpu, i/o bandwidth, etc...?
 > no. no new interfaces are required.
 
 Good to know that.
 
 Does your CPU controller support guarantees?

 Do you have an I/O controller?
 
 >
 > BUT: let me remind you of the talks at OKS/OLS and the previous UBC discussions.
 > It was noted that having separate interfaces for CPU, I/O bandwidth
 
 But it will be a lot simpler for the user to configure and use them if they
 are together. We should discuss this as well.
 
 > and memory may be worthwhile. BTW, I/O bandwidth already has a separate
 > interface :/
 >
 > > Have you given any thought to the possibility that a task might need to
 > > move between accounting contexts?  That has certainly been a
 > > "requirement" pushed on to CKRM for a long time, and the need goes
 > > something like this:
 > Yes we thought about this and this is no more problematic for BC
 > than for CKRM. See my explanation below.
 >
 > > 1. A system runs a web server, which services several virtual domains
 > > 2. that web server receives a request for foo.com
 > > 3. the web server switches into foo.com's accounting context
 > > 4. the web server reads things from disk, allocates some memory, and
 > >    makes a database request.
 > > 5. the database receives the request, and switches into foo.com's
 > >    accounting context, and charges foo.com for its resource use
 > > etc...
 > The question is whether the web server is multithreaded or not...
 > If it is not, then there is no problem here: you can change the current
 > context and new resources will be charged accordingly.
 >
 > And the current BC code is _able_ to handle it with _minor_ changes.
 > (One just needs to save the bc not in the mm struct but in the vma struct,
 > and change mm->bc in set_bc_id().)
 >
 > However, no one has explained so far what to do with threads
 > (can someone from the CKRM team do so, please?). Consider the following example.
 >
 > 1. Threaded web server spawns a child to serve a client.
 > 2. child thread touches some pages and they are charged to child BC
 >    (which differs from parent's one)
 > 3. child exits, but since its mm is shared with parent, these pages
 >    stay mapped and charged to child BC.
 >
 > So the question is:  what to do with these pages?
 > - should we recharge them to another BC?
 > - leave them charged?
 
 Leave them charged. A page will be charged to the appropriate UBC when it is
 touched again.
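
 To make the "leave them charged" option and the web-server scenario quoted above concrete, here is a rough user-space sketch in C. Everything in it is an assumption made for illustration: `struct bc`, `struct task`, `struct page`, `set_context()`, `touch_page()` and `release_page()` are hypothetical and do not come from the BC patches. A page is charged to whatever context the task is in at first touch, switching context does not migrate existing charges, and a page that has been released is charged to the then-current context when it is touched again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical beancounter: only counts charged pages. */
struct bc {
    unsigned long pages;
};

/* Hypothetical task: `cur` is the accounting context for new charges. */
struct task {
    struct bc *cur;
};

/* Hypothetical page: remembers which beancounter was charged for it. */
struct page {
    struct bc *owner;
};

/* Switch the accounting context; existing charges are not migrated. */
static void set_context(struct task *t, struct bc *bc)
{
    t->cur = bc;
}

/* The first touch charges the current context; later touches change nothing. */
static void touch_page(struct task *t, struct page *p)
{
    if (p->owner != NULL)
        return;
    p->owner = t->cur;
    t->cur->pages++;
}

/* Releasing a page uncharges whichever beancounter was charged for it. */
static void release_page(struct page *p)
{
    if (p->owner == NULL)
        return;
    assert(p->owner->pages > 0);
    p->owner->pages--;
    p->owner = NULL;
}
```

 In this sketch a server that touches a page while in foo.com's context and then switches to bar.com's context keeps that page charged to foo.com; only pages first touched after the switch are charged to bar.com.
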
 
 >
 > > So, the goal is to run _one_ copy of an application on a system, but
 > > account for its resources in a much more fine-grained way than at the
 > > application level.
 > Yes.
 >
 > > I think we can probably use beancounters for this, if we do not worry
 > > about migrating _existing_ charges when we change accounting context.
 > > Does that make sense?
 > Exactly, that's what I'm saying. We can use beancounters for this
 > if charges are kept with their creator.
 >
 > Thanks,
 > Kirill
 >
 --
 
 ----------------------------------------------------------------------
 Chandra Seetharaman               | Be careful what you choose....
 - sekharan@us.ibm.com   |      .......you may get it.
 ----------------------------------------------------------------------
 |  
	|  |  |  
	|  |  
	|  | 