OpenVZ Forum


Home » Mailing lists » Devel » [RFC][PATCH] UBC: user resource beancounters
[RFC][PATCH] UBC: user resource beancounters [message #5192] Wed, 16 August 2006 15:23 Go to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

The following patch set presents base of
User Resource Beancounters (UBC).
UBC allows to account and control consumption
of kernel resources used by group of processes.

The full UBC patch set allows to control:
- kernel memory. All the kernel objects allocatable
on user demand should be accounted and limited
for DoS protection.
E.g. page tables, task structs, vmas etc.

- virtual memory pages. UBC allows to
limit a container to some amount of memory and
introduces 2-level OOM killer taking into account
container's consumption.
pages shared between containers are correctly
charged as fractions (tunable).

- network buffers. These includes TCP/IP rcv/snd
buffers, dgram snd buffers, unix, netlinks and
other buffers.

- minor resources accounted/limited by number:
tasks, files, flocks, ptys, siginfo, pinned dcache
mem, sockets, iptentries (for containers with
virtualized networking)

As the first step we want to propose for discussion
the most complicated parts of resource management:
kernel memory and virtual memory.
The patch set to be sent provides core for UBC and
management of kernel memory only. Virtual memory
management will be sent in a couple of days.

The patches in these series are:
diff-ubc-kconfig.patch:
Adds kernel/ub/Kconfig file with UBC options and
includes it into arch Kconfigs

diff-ubc-core.patch:
Contains core functionality and interfaces of UBC:
find/create beancounter, initialization,
charge/uncharge of resource, core objects' declarations.

diff-ubc-task.patch:
Contains code responsible for setting UB on task,
it's inheriting and setting host context in interrupts.

Task contains three beancounters:
1. exec_ub - current context. all resources are charged
to this beancounter.
2. task_ub - beancounter to which task_struct is charged
itself.
3. fork_sub - beancounter which is inherited by
task's children on fork

diff-ubc-syscalls.patch:
Patch adds system calls for UB management:
1. sys_getluid - get current UB id
2. sys_setluid - changes exec_ and fork_ UBs on current
3. sys_setublimit - set limits for resources consumtions

diff-ubc-kmem-core.patch:
Introduces UB_KMEMSIZE resource which accounts kernel
objects allocated by task's request.

Objects are accounted via struct page and slab objects.
For the latter ones each slab contains a set of pointers
corresponding object is charged to.

Allocation charge rules:
1. Pages - if allocation is performed with __GFP_UBC flag - page
is charged to current's exec_ub.
2. Slabs - kmem_cache may be created with SLAB_UBC flag - in this
case each allocation is charged. Caches used by kmalloc are
created with SLAB_UBC | SLAB_UBC_NOCHARGE flags. In this case
only __GFP_UBC allocations are charged.

diff-ubc-kmem-charge.patch:
Adds SLAB_UBC and __GFP_UBC flags in appropriate places
to cause charging/limiting of specified resources.

diff-ubc-proc.patch:
Adds two proc entries user_beancounters and user_beancounters_sub
allowing to see current state (usage/limits/fails for each UB).
Implemented via seq files.

Patch set is applicable to 2.6.18-rc4-mm1

Thanks,
Kirill
[RFC][PATCH 1/7] UBC: kconfig [message #5195 is a reply to message #5192] Wed, 16 August 2006 15:34 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Add kernel/ub/Kconfig file with UBC options and
includes it into arch Kconfigs

Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
Signed-Off-By: Kirill Korotaev <dev@sw.ru>

---
arch/i386/Kconfig | 2 ++
arch/ia64/Kconfig | 2 ++
arch/powerpc/Kconfig | 2 ++
arch/ppc/Kconfig | 2 ++
arch/sparc/Kconfig | 2 ++
arch/sparc64/Kconfig | 2 ++
arch/x86_64/Kconfig | 2 ++
kernel/ub/Kconfig | 25 +++++++++++++++++++++++++
8 files changed, 39 insertions(+)

--- ./arch/i386/Kconfig.ubkm 2006-07-10 12:39:10.000000000 +0400
+++ ./arch/i386/Kconfig 2006-07-28 14:10:41.000000000 +0400
@@ -1146,6 +1146,8 @@ source "crypto/Kconfig"

source "lib/Kconfig"

+source "kernel/ub/Kconfig"
+
#
# Use the generic interrupt handling code in kernel/irq/:
#
--- ./arch/ia64/Kconfig.ubkm 2006-07-10 12:39:10.000000000 +0400
+++ ./arch/ia64/Kconfig 2006-07-28 14:10:56.000000000 +0400
@@ -481,6 +481,8 @@ source "fs/Kconfig"

source "lib/Kconfig"

+source "kernel/ub/Kconfig"
+
#
# Use the generic interrupt handling code in kernel/irq/:
#
--- ./arch/powerpc/Kconfig.arkcfg 2006-08-07 14:07:12.000000000 +0400
+++ ./arch/powerpc/Kconfig 2006-08-10 17:55:58.000000000 +0400
@@ -1038,6 +1038,8 @@ source "arch/powerpc/platforms/iseries/K

source "lib/Kconfig"

+source "ub/Kconfig"
+
menu "Instrumentation Support"
depends on EXPERIMENTAL

--- ./arch/ppc/Kconfig.arkcfg 2006-07-10 12:39:10.000000000 +0400
+++ ./arch/ppc/Kconfig 2006-08-10 17:56:13.000000000 +0400
@@ -1414,6 +1414,8 @@ endmenu

source "lib/Kconfig"

+source "ub/Kconfig"
+
source "arch/powerpc/oprofile/Kconfig"

source "arch/ppc/Kconfig.debug"
--- ./arch/sparc/Kconfig.arkcfg 2006-04-21 11:59:32.000000000 +0400
+++ ./arch/sparc/Kconfig 2006-08-10 17:56:24.000000000 +0400
@@ -296,3 +296,5 @@ source "security/Kconfig"
source "crypto/Kconfig"

source "lib/Kconfig"
+
+source "ub/Kconfig"
--- ./arch/sparc64/Kconfig.arkcfg 2006-07-17 17:01:11.000000000 +0400
+++ ./arch/sparc64/Kconfig 2006-08-10 17:56:36.000000000 +0400
@@ -432,3 +432,5 @@ source "security/Kconfig"
source "crypto/Kconfig"

source "lib/Kconfig"
+
+source "lib/Kconfig"
--- ./arch/x86_64/Kconfig.ubkm 2006-07-10 12:39:11.000000000 +0400
+++ ./arch/x86_64/Kconfig 2006-07-28 14:10:49.000000000 +0400
@@ -655,3 +655,5 @@ source "security/Kconfig"
source "crypto/Kconfig"

source "lib/Kconfig"
+
+source "kernel/ub/Kconfig"
--- ./kernel/ub/Kconfig.ubkm 2006-07-28 13:07:38.000000000 +0400
+++ ./kernel/ub/Kconfig 2006-07-28 13:09:51.000000000 +0400
@@ -0,0 +1,25 @@
+#
+# User resources part (UBC)
+#
+# Copyright (C) 2006 OpenVZ. SWsoft Inc
+
+menu "User resources"
+
+config USER_RESOURCE
+ bool "Enable user resource accounting"
+ default y
+ help
+ This patch provides accounting and allows to configure
+ limits for user's consumption of exhaustible system resources.
+ The most important resource controlled by this patch is unswappable
+ memory (either mlock'ed or used by internal kernel structures and
+ buffers). The main goal of this patch is to protect processes
+ from running short of important resources because of an accidental
+ misbehavior of processes or malicious activity aiming to ``kill''
+ the system. It's worth to mention that resource limits configured
+ by setrlimit(2) do not give an acceptable level of protection
+ because they cover only small fraction of resources and work on a
+ per-process basis. Per-process accounting doesn't prevent malicious
+ users from spawning a lot of resource-consuming processes.
+
+endmenu
[RFC][PATCH 2/7] UBC: core (structures, API) [message #5196 is a reply to message #5192] Wed, 16 August 2006 15:35 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Core functionality and interfaces of UBC:
find/create beancounter, initialization,
charge/uncharge of resource, core objects' declarations.

Basic structures:
ubparm - resource description
user_beancounter - set of resources, id, lock

Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
Signed-Off-By: Kirill Korotaev <dev@sw.ru>

---
include/ub/beancounter.h | 157 ++++++++++++++++++
init/main.c | 4
kernel/Makefile | 1
kernel/ub/Makefile | 7
kernel/ub/beancounter.c | 398 +++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 567 insertions(+)

--- /dev/null 2006-07-18 14:52:43.075228448 +0400
+++ ./include/ub/beancounter.h 2006-08-10 14:58:27.000000000 +0400
@@ -0,0 +1,157 @@
+/*
+ * include/ub/beancounter.h
+ *
+ * Copyright (C) 2006 OpenVZ. SWsoft Inc
+ *
+ */
+
+#ifndef _LINUX_BEANCOUNTER_H
+#define _LINUX_BEANCOUNTER_H
+
+/*
+ * Resource list.
+ */
+
+#define UB_RESOURCES 0
+
+struct ubparm {
+ /*
+ * A barrier over which resource allocations are failed gracefully.
+ * e.g. if the amount of consumed memory is over the barrier further
+ * sbrk() or mmap() calls fail, the existing processes are not killed.
+ */
+ unsigned long barrier;
+ /* hard resource limit */
+ unsigned long limit;
+ /* consumed resources */
+ unsigned long held;
+ /* maximum amount of consumed resources through the last period */
+ unsigned long maxheld;
+ /* minimum amount of consumed resources through the last period */
+ unsigned long minheld;
+ /* count of failed charges */
+ unsigned long failcnt;
+};
+
+/*
+ * Kernel internal part.
+ */
+
+#ifdef __KERNEL__
+
+#include <linux/config.h>
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <asm/atomic.h>
+
+/*
+ * UB_MAXVALUE is essentially LONG_MAX declared in a cross-compiling safe form.
+ */
+#define UB_MAXVALUE ( (1UL << (sizeof(unsigned long)*8-1)) - 1)
+
+
+/*
+ * Resource management structures
+ * Serialization issues:
+ * beancounter list management is protected via ub_hash_lock
+ * task pointers are set only for current task and only once
+ * refcount is managed atomically
+ * value and limit comparison and change are protected by per-ub spinlock
+ */
+
+struct user_beancounter
+{
+ atomic_t ub_refcount;
+ spinlock_t ub_lock;
+ uid_t ub_uid;
+ struct hlist_node hash;
+
+ struct user_beancounter *parent;
+ void *private_data;
+
+ /* resources statistics and settings */
+ struct ubparm ub_parms[UB_RESOURCES];
+};
+
+enum severity { UB_BARRIER, UB_LIMIT, UB_FORCE };
+
+/* Flags passed to beancounter_findcreate() */
+#define UB_LOOKUP_SUB 0x01 /* Lookup subbeancounter */
+#define UB_ALLOC 0x02 /* May allocate new one */
+#define UB_ALLOC_ATOMIC 0x04 /* Allocate with GFP_ATOMIC */
+
+#define UB_HASH_SIZE 256
+
+#ifdef CONFIG_USER_RESOURCE
+extern struct hlist_head ub_hash[];
+extern spinlock_t ub_hash_lock;
+
+static inline void ub_adjust_held_minmax(struct user_beancounter *ub,
+ int resource)
+{
+ if (ub->ub_parms[resource].maxheld < ub->ub_parms[resource].held)
+ ub->ub_parms[resource].maxheld = ub->ub_parms[resource].held;
+ if (ub->ub_parms[resource].minheld > ub->ub_parms[resource].held)
+ ub->ub_parms[resource].minheld = ub->ub_parms[resource].held;
+}
+
+void ub_print_resource_warning(struct user_beancounter *ub, int res,
+ char *str, unsigned long val, unsigned long held);
+void ub_print_uid(struct user_beancounter *ub, char *str, int size);
+
+int __charge_beancounter_locked(struct user_beancounter *ub,
+ int resource, unsigned long val, enum severity strict);
+void charge_beancounter_notop(struct user_beancounter *ub,
+ int resource, unsigned long val);
+int charge_beancounter(struct user_beancounter *ub,
+ int resource, unsigned long val, enum severity strict);
+
+void __uncharge_beancounter_locked(struct user_beancounter *ub,
+ int resource, unsigned long val);
+void uncharge_beancounter_notop(struct user_beancounter *ub,
+ int resource, unsigned long val);
+void uncharge_beancounter(struct user_beancounter *ub,
+ int resource, unsigned long val);
+
+struct user_beancounter *beancounter_findcreate(uid_t uid,
+ struct user_beancounter *parent, int flags);
+
+static inline struct user_beancounter *get_beancounter(
+ struct user_beancounter *ub)
+{
+ atomic_inc(&ub->ub_refcount);
+ return ub;
+}
+
+void __put_beancounter(struct user_beancounter *ub);
+static inline void put_beancounter(struct user_beancounter *ub)
+{
+ __put_beancounter(ub);
+}
+
+void ub_init_early(void);
+void ub_init_late(void);
+void ub_init_proc(void);
+
+extern struct user_beancounter ub0;
+extern const char *ub_rnames[];
+
+#else /* CONFIG_USER_RESOURCE */
+
+#define beancounter_findcreate(id, p, f) (NULL)
+#define get_beancounter(ub) (NULL)
+#define put_beancounter(ub) do { } while (0)
+#define __charge_beancounter_locked(ub, r, v, s) (0)
+#define charge_beancounter(ub, r, v, s) (0)
+#define charge_beancounter_notop(ub, r, v) do { } while (0)
+#define __uncharge_beancounter_locked(ub, r, v) do { } while (0)
+#define uncharge_beancounter(ub, r, v) do { } while (0)
+#define uncharge_beancounter_notop(ub, r, v) do { } while (0)
+#define ub_init_early() do { } while (0)
+#define ub_init_late() do { } while (0)
+#define ub_init_proc() do { } while (0)
+
+#endif /* CONFIG_USER_RESOURCE */
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_BEANCOUNTER_H */
--- ./init/main.c.ubcore 2006-08-10 14:55:47.000000000 +0400
+++ ./init/main.c 2006-08-10 14:57:01.000000000 +0400
@@ -52,6 +52,8 @@
#include <linux/debug_locks.h>
#include <linux/lockdep.h>

+#include <ub/beancounter.h>
+
#include <asm/io.h>
#include <asm/bugs.h>
#include <asm/setup.h>
@@ -470,6 +472,7 @@ asmlinkage void __init start_kernel(void
early_boot_irqs_off();
early_init_irq_lock_class();

+ ub_init_early();
/*
* Interrupts are still disabled. Do necessary setups, then
* enable them
@@ -563,6 +566,7 @@ asmlinkage void __init start_kernel(void
#endif
fork_init(num_physpages);
proc_caches_init();
+ ub_init_late();
buffer_init();
unnamed_dev_init();
key_init();
--- ./kernel/Makefile.ubcore 2006-08-10 14:55:47.000000000 +0400
+++ ./kernel/Makefile 2006-08-10 14:57:01.000000000 +0400
@@ -12,6 +12,7 @@ obj-y = sched.o fork.o exec_domain.o

obj-$(CONFIG_STACKTRACE) += stacktrace.o
obj-y += time/
+obj-y += ub/
obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
obj-$(CONFIG_LOCKDEP) += lockdep.o
ifeq ($(CONFIG_PROC_FS),y)
--- /dev/null 2006-07-18 14:52:43.075228448 +0400
+++ ./kernel/ub/Makefile 2006-08-10 14:57:01.000000000 +0400
@@ -0,0 +1,7 @@
+#
+# User resources part (UBC)
+#
+# Copyright (C) 2006 OpenVZ. SWsoft Inc
+#
+
+obj-$(CONFIG_USER_RESOURCE) += beancounter.o
--- /dev/null 2006-07-18 14:52:43.075228448 +0400
+++ ./kernel/ub/beancounter.c 2006-08-10 15:09:34.000000000 +0400
@@ -0,0 +1,398 @@
+/*
+ * kernel/ub/beancounter.c
+ *
+ * Copyright (C) 2006 OpenVZ. SWsoft Inc
+ * Original code by (C) 1998 Alan Cox
+ * 1998-2000 Andrey Savochkin <saw@saw.sw.com.sg>
+ */
+
+#include <linux/slab.h>
+#include <linux/module.h>
+
+#include <ub/beancounter.h>
+
+static kmem_cache_t *ub_cachep;
+static struct user_beancounter default_beancounter;
+static struct user_beancounter default_subbeancounter;
+
+static void init_beancounter_struct(struct user_beancounter *ub, uid_t id);
+
+struct user_beancounter ub0;
+
+const char *ub_rnames[] = {
+};
+
+#define ub_hash_fun(x) ((((x) >> 8) ^ (x)) & (UB_HASH_SIZE - 1))
+#define ub_subhash_fun(p, id) ub_hash_fun((p)->ub_uid + (id) * 17)
+
+struct hlist_head ub_hash[UB_HASH_SIZE];
+spinlock_t ub_hash_lock;
+
+EXPORT_SYMBOL(ub_hash);
+EXPORT_SYMBOL(ub_hash_lock);
+
+/*
+ * Per user resource beancounting. Resources are tied to their luid.
+ * The resource structure itself is tagged both to the process and
+ * the charging resources (a socket doesn't want to have to search for
+ * things at irq time for example). Reference counters keep things in
+ * hand.
+ *
+ * The case where a user creates resource, kills all his processes and
+ * then starts new ones is correctly handled this way. The refcounters
+ * will mean the old entry is still around with resource tied to it.
+ */
+
+struct user_beancounter *beancounter_findcreate(uid_t uid,
+ struct user_beancounter *p, int mask)
+{
+ struct user_beancounter *new_ub, *ub, *tmpl_ub;
+ unsigned long flags;
+ struct hlist_head *slot;
+ struct hlist_node *pos;
+
+ if (mask & UB_LOOKUP_SUB) {
+ WARN_ON(p == NULL);
+ tmpl_ub = &default_subbeancounter;
+ slot = &ub_hash[ub_subhash_fun(p, uid)];
+ } else {
+ WARN_ON(p != NULL);
+ tmpl_ub = &default_beancounter;
+ slot = &ub_hash[ub_hash_fun(uid)];
+ }
+ new_ub = NULL;
+
+retry:
+ spin_lock_irqsave(&ub_hash_lock, flags);
+ hlist_for_each_entry (ub, pos, slot, hash)
+ if (ub->ub_uid == uid && ub->parent == p)
+ break;
+
+ if (pos != NULL) {
+ get_beancounter(ub);
+ spin_unlock_irqrestore(&ub_hash_lock, flags);
+
+ if (new_ub != NULL) {
+ put_beancounter(new_ub->parent);
+ kmem_cache_free(ub_cachep, new_ub);
+ }
+ return ub;
+ }
+
+ if (!(mask & UB_ALLOC))
+ goto out_unlock;
+
+ if (new_ub != NULL)
+ goto out_install;
+
+ if (mask & UB_ALLOC_ATOMIC) {
+ new_ub = kmem_cache_alloc(ub_cachep, GFP_ATOMIC);
+ if (new_ub == NULL)
+ goto out_unlock;
+
+ memcpy(new_ub, tmpl_ub, sizeof(*new_ub));
+ init_beancounter_struct(new_ub, uid);
+ if (p)
+ new_ub->parent = get_beancounter(p);
+ goto out_install;
+ }
+
+ spin_unlock_irqrestore(&ub_hash_lock, flags);
+
+ new_ub = kmem_cache_alloc(ub_cachep, GFP_KERNEL);
+ if (new_ub == NULL)
+ goto out;
+
+ memcpy(new_ub, tmpl_ub, sizeof(*new_ub));
+ init_beancounter_struct(new_ub, uid);
+ if (p)
+ new_ub->parent = get_beancounter(p);
+ goto retry;
+
+out_in
...

[RFC][PATCH 3/7] UBC: ub context and inheritance [message #5198 is a reply to message #5192] Wed, 16 August 2006 15:36 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Contains code responsible for setting UB on task,
it's inheriting and setting host context in interrupts.

Task references three beancounters:
1. exec_ub current context. all resources are
charged to this beancounter.
2. task_ub beancounter to which task_struct is
charged itself.
3. fork_sub beancounter which is inherited by
task's children on fork

Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
Signed-Off-By: Kirill Korotaev <dev@sw.ru>

---
include/linux/sched.h | 5 +++++
include/ub/task.h | 42 ++++++++++++++++++++++++++++++++++++++++++
kernel/fork.c | 21 ++++++++++++++++-----
kernel/irq/handle.c | 9 +++++++++
kernel/softirq.c | 8 ++++++++
kernel/ub/Makefile | 1 +
kernel/ub/beancounter.c | 4 ++++
kernel/ub/misc.c | 34 ++++++++++++++++++++++++++++++++++
8 files changed, 119 insertions(+), 5 deletions(-)

--- ./include/linux/sched.h.ubfork 2006-07-17 17:01:12.000000000 +0400
+++ ./include/linux/sched.h 2006-07-31 16:01:54.000000000 +0400
@@ -81,6 +81,8 @@ struct sched_param {
#include <linux/timer.h>
#include <linux/hrtimer.h>

+#include <ub/task.h>
+
#include <asm/processor.h>

struct exec_domain;
@@ -997,6 +999,9 @@ struct task_struct {
spinlock_t delays_lock;
struct task_delay_info *delays;
#endif
+#ifdef CONFIG_USER_RESOURCE
+ struct task_beancounter task_bc;
+#endif
};

static inline pid_t process_group(struct task_struct *tsk)
--- ./include/ub/task.h.ubfork 2006-07-28 18:53:52.000000000 +0400
+++ ./include/ub/task.h 2006-08-01 15:26:08.000000000 +0400
@@ -0,0 +1,42 @@
+/*
+ * include/ub/task.h
+ *
+ * Copyright (C) 2006 OpenVZ. SWsoft Inc
+ *
+ */
+
+#ifndef __UB_TASK_H_
+#define __UB_TASK_H_
+
+#include <linux/config.h>
+
+struct user_beancounter;
+
+struct task_beancounter {
+ struct user_beancounter *exec_ub;
+ struct user_beancounter *task_ub;
+ struct user_beancounter *fork_sub;
+};
+
+#ifdef CONFIG_USER_RESOURCE
+#define get_exec_ub() (current->task_bc.exec_ub)
+#define set_exec_ub(newub) \
+ ({ \
+ struct user_beancounter *old; \
+ struct task_beancounter *tbc; \
+ tbc = &current->task_bc; \
+ old = tbc->exec_ub; \
+ tbc->exec_ub = newub; \
+ old; \
+ })
+
+int ub_task_charge(struct task_struct *parent, struct task_struct *new);
+void ub_task_uncharge(struct task_struct *tsk);
+
+#else /* CONFIG_USER_RESOURCE */
+#define get_exec_ub() (NULL)
+#define set_exec_ub(__ub) (NULL)
+#define ub_task_charge(p, t) (0)
+#define ub_task_uncharge(t) do { } while (0)
+#endif /* CONFIG_USER_RESOURCE */
+#endif /* __UB_TASK_H_ */
--- ./kernel/irq/handle.c.ubirq 2006-07-10 12:39:20.000000000 +0400
+++ ./kernel/irq/handle.c 2006-08-01 12:39:34.000000000 +0400
@@ -16,6 +16,9 @@
#include <linux/interrupt.h>
#include <linux/kernel_stat.h>

+#include <ub/beancounter.h>
+#include <ub/task.h>
+
#include "internals.h"

/**
@@ -166,6 +169,9 @@ fastcall unsigned int __do_IRQ(unsigned
struct irq_desc *desc = irq_desc + irq;
struct irqaction *action;
unsigned int status;
+ struct user_beancounter *ub;
+
+ ub = set_exec_ub(&ub0);

kstat_this_cpu.irqs[irq]++;
if (CHECK_IRQ_PER_CPU(desc->status)) {
@@ -178,6 +184,8 @@ fastcall unsigned int __do_IRQ(unsigned
desc->chip->ack(irq);
action_ret = handle_IRQ_event(irq, regs, desc->action);
desc->chip->end(irq);
+
+ (void) set_exec_ub(ub);
return 1;
}

@@ -246,6 +254,7 @@ out:
desc->chip->end(irq);
spin_unlock(&desc->lock);

+ (void) set_exec_ub(ub);
return 1;
}

--- ./kernel/softirq.c.ubirq 2006-07-17 17:01:12.000000000 +0400
+++ ./kernel/softirq.c 2006-08-01 12:40:44.000000000 +0400
@@ -18,6 +18,9 @@
#include <linux/rcupdate.h>
#include <linux/smp.h>

+#include <ub/beancounter.h>
+#include <ub/task.h>
+
#include <asm/irq.h>
/*
- No shared variables, all the data are CPU local.
@@ -191,6 +194,9 @@ asmlinkage void __do_softirq(void)
__u32 pending;
int max_restart = MAX_SOFTIRQ_RESTART;
int cpu;
+ struct user_beancounter *ub;
+
+ ub = set_exec_ub(&ub0);

pending = local_softirq_pending();
account_system_vtime(current);
@@ -229,6 +235,8 @@ restart:

account_system_vtime(current);
_local_bh_enable();
+
+ (void) set_exec_ub(ub);
}

#ifndef __ARCH_HAS_DO_SOFTIRQ
--- ./kernel/fork.c.ubfork 2006-07-17 17:01:12.000000000 +0400
+++ ./kernel/fork.c 2006-08-01 12:58:36.000000000 +0400
@@ -46,6 +46,8 @@
#include <linux/delayacct.h>
#include <linux/taskstats_kern.h>

+#include <ub/task.h>
+
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -102,6 +104,7 @@ static kmem_cache_t *mm_cachep;

void free_task(struct task_struct *tsk)
{
+ ub_task_uncharge(tsk);
free_thread_info(tsk->thread_info);
rt_mutex_debug_task_free(tsk);
free_task_struct(tsk);
@@ -162,18 +165,19 @@ static struct task_struct *dup_task_stru

tsk = alloc_task_struct();
if (!tsk)
- return NULL;
+ goto out;

ti = alloc_thread_info(tsk);
- if (!ti) {
- free_task_struct(tsk);
- return NULL;
- }
+ if (!ti)
+ goto out_tsk;

*tsk = *orig;
tsk->thread_info = ti;
setup_thread_stack(tsk, orig);

+ if (ub_task_charge(orig, tsk))
+ goto out_ti;
+
/* One for us, one for whoever does the "release_task()" (usually parent) */
atomic_set(&tsk->usage,2);
atomic_set(&tsk->fs_excl, 0);
@@ -180,6 +184,13 @@ static struct task_struct *dup_task_stru
#endif
tsk->splice_pipe = NULL;
return tsk;
+
+out_ti:
+ free_thread_info(ti);
+out_tsk:
+ free_task_struct(tsk);
+out:
+ return NULL;
}

#ifdef CONFIG_MMU
--- ./kernel/ub/Makefile.ubcore 2006-08-03 16:24:56.000000000 +0400
+++ ./kernel/ub/Makefile 2006-08-01 11:08:39.000000000 +0400
@@ -5,3 +5,4 @@
#

obj-$(CONFIG_USER_RESOURCE) += beancounter.o
+obj-$(CONFIG_USER_RESOURCE) += misc.o
--- ./kernel/ub/beancounter.c.ubcore 2006-07-28 13:07:44.000000000 +0400
+++ ./kernel/ub/beancounter.c 2006-08-03 16:14:17.000000000 +0400
@@ -395,6 +395,10 @@
spin_lock_init(&ub_hash_lock);
slot = &ub_hash[ub_hash_fun(ub->ub_uid)];
hlist_add_head(&ub->hash, slot);
+
+ current->task_bc.exec_ub = ub;
+ current->task_bc.task_ub = get_beancounter(ub);
+ current->task_bc.fork_sub = get_beancounter(ub);
}

void __init ub_init_late(void)
--- ./kernel/ub/misc.c.ubfork 2006-07-31 16:23:44.000000000 +0400
+++ ./kernel/ub/misc.c 2006-07-31 16:28:47.000000000 +0400
@@ -0,0 +1,34 @@
+/*
+ * kernel/ub/misc.c
+ *
+ * Copyright (C) 2006 OpenVZ. SWsoft Inc.
+ *
+ */
+
+#include <linux/sched.h>
+
+#include <ub/beancounter.h>
+#include <ub/task.h>
+
+int ub_task_charge(struct task_struct *parent, struct task_struct *new)
+{
+ struct task_beancounter *old_bc;
+ struct task_beancounter *new_bc;
+ struct user_beancounter *ub;
+
+ old_bc = &parent->task_bc;
+ new_bc = &new->task_bc;
+
+ ub = old_bc->fork_sub;
+ new_bc->exec_ub = get_beancounter(ub);
+ new_bc->task_ub = get_beancounter(ub);
+ new_bc->fork_sub = get_beancounter(ub);
+ return 0;
+}
+
+void ub_task_uncharge(struct task_struct *tsk)
+{
+ put_beancounter(tsk->task_bc.exec_ub);
+ put_beancounter(tsk->task_bc.task_ub);
+ put_beancounter(tsk->task_bc.fork_sub);
+}
[RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5199 is a reply to message #5192] Wed, 16 August 2006 15:37 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Add the following system calls for UB management:
1. sys_getluid - get current UB id
2. sys_setluid - changes exec_ and fork_ UBs on current
3. sys_setublimit - set limits for resources consumtions

Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
Signed-Off-By: Kirill Korotaev <dev@sw.ru>

---
arch/i386/kernel/syscall_table.S | 3
arch/ia64/kernel/entry.S | 3
arch/sparc/kernel/systbls.S | 2
arch/sparc64/kernel/systbls.S | 2
include/asm-i386/unistd.h | 5 +
include/asm-ia64/unistd.h | 5 +
include/asm-powerpc/systbl.h | 3
include/asm-powerpc/unistd.h | 5 +
include/asm-sparc/unistd.h | 3
include/asm-sparc64/unistd.h | 3
include/asm-x86_64/unistd.h | 8 ++
kernel/ub/Makefile | 1
kernel/ub/sys.c | 126 +++++++++++++++++++++++++++++++++++++++
13 files changed, 163 insertions(+), 6 deletions(-)

--- ./arch/i386/kernel/syscall_table.S.ubsys 2006-07-10 12:39:10.000000000 +0400
+++ ./arch/i386/kernel/syscall_table.S 2006-07-31 14:36:59.000000000 +0400
@@ -317,3 +317,6 @@ ENTRY(sys_call_table)
.long sys_vmsplice
.long sys_move_pages
.long sys_getcpu
+ .long sys_getluid
+ .long sys_setluid
+ .long sys_setublimit /* 320 */
--- ./arch/ia64/kernel/entry.S.ubsys 2006-07-10 12:39:10.000000000 +0400
+++ ./arch/ia64/kernel/entry.S 2006-07-31 15:25:36.000000000 +0400
@@ -1610,5 +1610,8 @@ sys_call_table:
data8 sys_sync_file_range // 1300
data8 sys_tee
data8 sys_vmsplice
+ daat8 sys_getluid
+ data8 sys_setluid
+ data8 sys_setublimit // 1305

.org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls
--- ./arch/sparc/kernel/systbls.S.arsys 2006-07-10 12:39:10.000000000 +0400
+++ ./arch/sparc/kernel/systbls.S 2006-08-10 17:07:15.000000000 +0400
@@ -78,7 +78,7 @@ sys_call_table:
/*285*/ .long sys_mkdirat, sys_mknodat, sys_fchownat, sys_futimesat, sys_fstatat64
/*290*/ .long sys_unlinkat, sys_renameat, sys_linkat, sys_symlinkat, sys_readlinkat
/*295*/ .long sys_fchmodat, sys_faccessat, sys_pselect6, sys_ppoll, sys_unshare
-/*300*/ .long sys_set_robust_list, sys_get_robust_list
+/*300*/ .long sys_set_robust_list, sys_get_robust_list, sys_getluid, sys_setluid, sys_setublimit

#ifdef CONFIG_SUNOS_EMUL
/* Now the SunOS syscall table. */
--- ./arch/sparc64/kernel/systbls.S.arsys 2006-07-10 12:39:11.000000000 +0400
+++ ./arch/sparc64/kernel/systbls.S 2006-08-10 17:08:52.000000000 +0400
@@ -79,7 +79,7 @@ sys_call_table32:
.word sys_mkdirat, sys_mknodat, sys_fchownat, compat_sys_futimesat, compat_sys_fstatat64
/*290*/ .word sys_unlinkat, sys_renameat, sys_linkat, sys_symlinkat, sys_readlinkat
.word sys_fchmodat, sys_faccessat, compat_sys_pselect6, compat_sys_ppoll, sys_unshare
-/*300*/ .word compat_sys_set_robust_list, compat_sys_get_robust_list
+/*300*/ .word compat_sys_set_robust_list, compat_sys_get_robust_list, sys_getluid, sys_setluid, sys_setublimit

#endif /* CONFIG_COMPAT */

--- ./include/asm-i386/unistd.h.ubsys 2006-07-10 12:39:19.000000000 +0400
+++ ./include/asm-i386/unistd.h 2006-07-31 15:56:31.000000000 +0400
@@ -323,10 +323,13 @@
#define __NR_vmsplice 316
#define __NR_move_pages 317
#define __NR_getcpu 318
+#define __NR_getluid 319
+#define __NR_setluid 320
+#define __NR_setublimit 321

#ifdef __KERNEL__

-#define NR_syscalls 318
+#define NR_syscalls 322
#include <linux/err.h>

/*
--- ./include/asm-ia64/unistd.h.ubsys 2006-07-10 12:39:19.000000000 +0400
+++ ./include/asm-ia64/unistd.h 2006-07-31 15:57:23.000000000 +0400
@@ -291,11 +291,14 @@
#define __NR_sync_file_range 1300
#define __NR_tee 1301
#define __NR_vmsplice 1302
+#define __NR_getluid 1303
+#define __NR_setluid 1304
+#define __NR_setublimit 1305

#ifdef __KERNEL__


-#define NR_syscalls 279 /* length of syscall table */
+#define NR_syscalls 282 /* length of syscall table */

#define __ARCH_WANT_SYS_RT_SIGACTION

--- ./include/asm-powerpc/systbl.h.arsys 2006-07-10 12:39:19.000000000 +0400
+++ ./include/asm-powerpc/systbl.h 2006-08-10 17:05:53.000000000 +0400
@@ -304,3 +304,6 @@ SYSCALL_SPU(fchmodat)
SYSCALL_SPU(faccessat)
COMPAT_SYS_SPU(get_robust_list)
COMPAT_SYS_SPU(set_robust_list)
+SYSCALL(sys_getluid)
+SYSCALL(sys_setluid)
+SYSCALL(sys_setublimit)
--- ./include/asm-powerpc/unistd.h.arsys 2006-07-10 12:39:19.000000000 +0400
+++ ./include/asm-powerpc/unistd.h 2006-08-10 17:06:28.000000000 +0400
@@ -323,10 +323,13 @@
#define __NR_faccessat 298
#define __NR_get_robust_list 299
#define __NR_set_robust_list 300
+#define __NR_getluid 301
+#define __NR_setluid 302
+#define __NR_setublimit 303

#ifdef __KERNEL__

-#define __NR_syscalls 301
+#define __NR_syscalls 304

#define __NR__exit __NR_exit
#define NR_syscalls __NR_syscalls
--- ./include/asm-sparc/unistd.h.arsys 2006-07-10 12:39:19.000000000 +0400
+++ ./include/asm-sparc/unistd.h 2006-08-10 17:08:19.000000000 +0400
@@ -318,6 +318,9 @@
#define __NR_unshare 299
#define __NR_set_robust_list 300
#define __NR_get_robust_list 301
+#define __NR_getluid 302
+#define __NR_setluid 303
+#define __NR_setublimit 304

#ifdef __KERNEL__
/* WARNING: You MAY NOT add syscall numbers larger than 301, since
--- ./include/asm-sparc64/unistd.h.arsys 2006-07-10 12:39:19.000000000 +0400
+++ ./include/asm-sparc64/unistd.h 2006-08-10 17:09:24.000000000 +0400
@@ -320,6 +320,9 @@
#define __NR_unshare 299
#define __NR_set_robust_list 300
#define __NR_get_robust_list 301
+#define __NR_getluid 302
+#define __NR_setluid 303
+#define __NR_setublimit 304

#ifdef __KERNEL__
/* WARNING: You MAY NOT add syscall numbers larger than 301, since
--- ./include/asm-x86_64/unistd.h.ubsys 2006-07-10 12:39:19.000000000 +0400
+++ ./include/asm-x86_64/unistd.h 2006-07-31 16:00:01.000000000 +0400
@@ -619,10 +619,16 @@ __SYSCALL(__NR_sync_file_range, sys_sync
__SYSCALL(__NR_vmsplice, sys_vmsplice)
#define __NR_move_pages 279
__SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_getluid 280
+__SYSCALL(__NR_getluid, sys_getluid)
+#define __NR_setluid 281
+__SYSCALL(__NR_setluid, sys_setluid)
+#define __NR_setublimit 282
+__SYSCALL(__NR_setublimit, sys_setublimit)

#ifdef __KERNEL__

-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_setublimit
#include <linux/err.h>

#ifndef __NO_STUBS
--- ./kernel/ub/Makefile.ubsys 2006-07-28 14:08:37.000000000 +0400
+++ ./kernel/ub/Makefile 2006-08-01 11:08:39.000000000 +0400
@@ -6,3 +6,4 @@

obj-$(CONFIG_USER_RESOURCE) += beancounter.o
obj-$(CONFIG_USER_RESOURCE) += misc.o
+obj-y += sys.o
--- ./kernel/ub/sys.c.ubsys 2006-07-28 18:52:18.000000000 +0400
+++ ./kernel/ub/sys.c 2006-08-03 16:14:23.000000000 +0400
@@ -0,0 +1,126 @@
+/*
+ * kernel/ub/sys.c
+ *
+ * Copyright (C) 2006 OpenVZ. SWsoft Inc
+ *
+ */
+
+#include <linux/config.h>
+#include <linux/sched.h>
+#include <asm/uaccess.h>
+
+#include <ub/beancounter.h>
+#include <ub/task.h>
+
+#ifndef CONFIG_USER_RESOURCE
+asmlinkage long sys_getluid(void)
+{
+ return -ENOSYS;
+}
+
+asmlinkage long sys_setluid(uid_t uid)
+{
+ return -ENOSYS;
+}
+
+asmlinkage long sys_setublimit(uid_t uid, unsigned long resource,
+ unsigned long *limits)
+{
+ return -ENOSYS;
+}
+#else /* CONFIG_USER_RESOURCE */
+
+/*
+ * The (rather boring) getluid syscall
+ */
+asmlinkage long sys_getluid(void)
+{
+ struct user_beancounter *ub;
+
+ ub = get_exec_ub();
+ if (ub == NULL)
+ return -EINVAL;
+
+ return ub->ub_uid;
+}
+
+/*
+ * The setluid syscall
+ */
+asmlinkage long sys_setluid(uid_t uid)
+{
+ int error;
+ struct user_beancounter *ub;
+ struct task_beancounter *task_bc;
+
+ task_bc = &current->task_bc;
+
+ /* You may not disown a setluid */
+ error = -EINVAL;
+ if (uid == (uid_t)-1)
+ goto out;
+
+ /* You may only set an ub as root */
+ error = -EPERM;
+ if (!capable(CAP_SETUID))
+ goto out;
+
+ /* Ok - set up a beancounter entry for this user */
+ error = -ENOBUFS;
+ ub = beancounter_findcreate(uid, NULL, UB_ALLOC);
+ if (ub == NULL)
+ goto out;
+
+ /* install bc */
+ put_beancounter(task_bc->exec_ub);
+ task_bc->exec_ub = ub;
+ put_beancounter(task_bc->fork_sub);
+ task_bc->fork_sub = get_beancounter(ub);
+ error = 0;
+out:
+ return error;
+}
+
+/*
+ * The setbeanlimit syscall
+ */
+asmlinkage long sys_setublimit(uid_t uid, unsigned long resource,
+ unsigned long *limits)
+{
+ int error;
+ unsigned long flags;
+ struct user_beancounter *ub;
+ unsigned long new_limits[2];
+
+ error = -EPERM;
+ if(!capable(CAP_SYS_RESOURCE))
+ goto out;
+
+ error = -EINVAL;
+ if (resource >= UB_RESOURCES)
+ goto out;
+
+ error = -EFAULT;
+ if (copy_from_user(&new_limits, limits, sizeof(new_limits)))
+ goto out;
+
+ error = -EINVAL;
+ if (new_limits[0] > UB_MAXVALUE || new_limits[1] > UB_MAXVALUE)
+ goto out;
+
+ error = -ENOENT;
+ ub = beancounter_findcreate(uid, NULL, 0);
+ if (ub == NULL)
+ goto out;
+
+ spin_lock_irqsave(&ub->ub_lock, flags);
+ ub->ub_parms[resource].barrier = new_limits[0];
+ ub->ub_parms[resource].limit = new_limits[1];
+ spin_unlock_irqrestore(&ub->ub_lock, flags);
+
+ put_beancounter(ub);
+ error = 0;
+out:
+ return error;
+}
+#endif
...

[RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5200 is a reply to message #5192] Wed, 16 August 2006 15:39 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Introduce UB_KMEMSIZE resource which accounts kernel
objects allocated by task's request.

Reference to UB is kept on struct page or slab object.
For slabs each struct slab contains a set of pointers
corresponding objects are charged to.

Allocation charge rules:
1. Pages - if allocation is performed with __GFP_UBC flag - page
is charged to current's exec_ub.
2. Slabs - kmem_cache may be created with SLAB_UBC flag - in this
case each allocation is charged. Caches used by kmalloc are
created with SLAB_UBC | SLAB_UBC_NOCHARGE flags. In this case
only __GFP_UBC allocations are charged.

Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
Signed-Off-By: Kirill Korotaev <dev@sw.ru>

---
include/linux/gfp.h | 8 ++-
include/linux/mm.h | 6 ++
include/linux/slab.h | 4 +
include/linux/vmalloc.h | 1
include/ub/beancounter.h | 4 +
include/ub/kmem.h | 33 ++++++++++++
kernel/ub/Makefile | 1
kernel/ub/beancounter.c | 3 +
kernel/ub/kmem.c | 89 ++++++++++++++++++++++++++++++++++
mm/mempool.c | 2
mm/page_alloc.c | 11 ++++
mm/slab.c | 121 ++++++++++++++++++++++++++++++++++++++---------
mm/vmalloc.c | 6 ++
13 files changed, 264 insertions(+), 25 deletions(-)

--- ./include/linux/gfp.h.kmemcore 2006-08-16 19:10:38.000000000 +0400
+++ ./include/linux/gfp.h 2006-08-16 19:12:56.000000000 +0400
@@ -46,15 +46,18 @@ struct vm_area_struct;
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_UBC ((__force gfp_t)0x80000u) /* Charge allocation with UB */
+#define __GFP_UBC_LIMIT ((__force gfp_t)0x100000u) /* Charge against UB limit */

-#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 21 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

/* if you forget to add the bitmask here kernel will crash, period */
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE)
+ __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE| \
+ __GFP_UBC|__GFP_UBC_LIMIT)

/* This equals 0, but use constants in case they ever change */
#define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
@@ -63,6 +66,7 @@ struct vm_area_struct;
#define GFP_NOIO (__GFP_WAIT)
#define GFP_NOFS (__GFP_WAIT | __GFP_IO)
#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
+#define GFP_KERNEL_UBC (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_UBC)
#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
__GFP_HIGHMEM)
--- ./include/linux/mm.h.kmemcore 2006-08-16 19:10:38.000000000 +0400
+++ ./include/linux/mm.h 2006-08-16 19:10:51.000000000 +0400
@@ -274,8 +274,14 @@ struct page {
unsigned int gfp_mask;
unsigned long trace[8];
#endif
+#ifdef CONFIG_USER_RESOURCE
+ union {
+ struct user_beancounter *page_ub;
+ } bc;
+#endif
};

+#define page_ub(page) ((page)->bc.page_ub)
#define page_private(page) ((page)->private)
#define set_page_private(page, v) ((page)->private = (v))

--- ./include/linux/slab.h.kmemcore 2006-08-16 19:10:38.000000000 +0400
+++ ./include/linux/slab.h 2006-08-16 19:10:51.000000000 +0400
@@ -46,6 +46,8 @@ typedef struct kmem_cache kmem_cache_t;
#define SLAB_PANIC 0x00040000UL /* panic if kmem_cache_create() fails */
#define SLAB_DESTROY_BY_RCU 0x00080000UL /* defer freeing pages to RCU */
#define SLAB_MEM_SPREAD 0x00100000UL /* Spread some memory over cpuset */
+#define SLAB_UBC 0x00200000UL /* Account with UB */
+#define SLAB_UBC_NOCHARGE 0x00400000UL /* Explicit accounting */

/* flags passed to a constructor func */
#define SLAB_CTOR_CONSTRUCTOR 0x001UL /* if not set, then deconstructor */
@@ -293,6 +295,8 @@ extern kmem_cache_t *bio_cachep;

extern atomic_t slab_reclaim_pages;

+struct user_beancounter;
+struct user_beancounter **kmem_cache_ubp(kmem_cache_t *cachep, void *obj);
#endif /* __KERNEL__ */

#endif /* _LINUX_SLAB_H */
--- ./include/linux/vmalloc.h.kmemcore 2006-08-16 19:10:38.000000000 +0400
+++ ./include/linux/vmalloc.h 2006-08-16 19:10:51.000000000 +0400
@@ -36,6 +36,7 @@ struct vm_struct {
* Highlevel APIs for driver use
*/
extern void *vmalloc(unsigned long size);
+extern void *vmalloc_ub(unsigned long size);
extern void *vmalloc_user(unsigned long size);
extern void *vmalloc_node(unsigned long size, int node);
extern void *vmalloc_exec(unsigned long size);
--- ./include/ub/beancounter.h.kmemcore 2006-08-16 19:10:38.000000000 +0400
+++ ./include/ub/beancounter.h 2006-08-16 19:10:51.000000000 +0400
@@ -12,7 +12,9 @@
* Resource list.
*/

-#define UB_RESOURCES 0
+#define UB_KMEMSIZE 0
+
+#define UB_RESOURCES 1

struct ubparm {
/*
--- ./include/ub/kmem.h.kmemcore 2006-08-16 19:10:38.000000000 +0400
+++ ./include/ub/kmem.h 2006-08-16 19:10:51.000000000 +0400
@@ -0,0 +1,33 @@
+/*
+ * include/ub/kmem.h
+ *
+ * Copyright (C) 2006 OpenVZ. SWsoft Inc
+ *
+ */
+
+#ifndef __UB_KMEM_H_
+#define __UB_KMEM_H_
+
+#include <linux/config.h>
+
+/*
+ * UB_KMEMSIZE accounting
+ */
+
+struct mm_struct;
+struct page;
+struct user_beancounter;
+
+#ifdef CONFIG_USER_RESOURCE
+int ub_page_charge(struct page *page, int order, gfp_t flags);
+void ub_page_uncharge(struct page *page, int order);
+
+int ub_slab_charge(kmem_cache_t *cachep, void *obj, gfp_t flags);
+void ub_slab_uncharge(kmem_cache_t *cachep, void *obj);
+#else
+#define ub_page_charge(pg, o, mask) (0)
+#define ub_page_uncharge(pg, o) do { } while (0)
+#define ub_slab_charge(cachep, o) (0)
+#define ub_slab_uncharge(cachep, o) do { } while (0)
+#endif
+#endif /* __UB_SLAB_H_ */
--- ./kernel/ub/Makefile.kmemcore 2006-08-16 19:10:38.000000000 +0400
+++ ./kernel/ub/Makefile 2006-08-16 19:10:51.000000000 +0400
@@ -7,3 +7,4 @@
obj-$(CONFIG_USER_RESOURCE) += beancounter.o
obj-$(CONFIG_USER_RESOURCE) += misc.o
obj-y += sys.o
+obj-$(CONFIG_USER_RESOURCE) += kmem.o
--- ./kernel/ub/beancounter.c.kmemcore 2006-08-16 19:10:38.000000000 +0400
+++ ./kernel/ub/beancounter.c 2006-08-16 19:10:51.000000000 +0400
@@ -20,6 +20,7 @@ static void init_beancounter_struct(stru
struct user_beancounter ub0;

const char *ub_rnames[] = {
+ "kmemsize", /* 0 */
};

#define ub_hash_fun(x) ((((x) >> 8) ^ (x)) & (UB_HASH_SIZE - 1))
@@ -356,6 +357,8 @@ static void init_beancounter_syslimits(s
{
int k;

+ ub->ub_parms[UB_KMEMSIZE].limit = 32 * 1024 * 1024;
+
for (k = 0; k < UB_RESOURCES; k++)
ub->ub_parms[k].barrier = ub->ub_parms[k].limit;
}
--- ./kernel/ub/kmem.c.kmemcore 2006-08-16 19:10:38.000000000 +0400
+++ ./kernel/ub/kmem.c 2006-08-16 19:10:51.000000000 +0400
@@ -0,0 +1,89 @@
+/*
+ * kernel/ub/kmem.c
+ *
+ * Copyright (C) 2006 OpenVZ. SWsoft Inc
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/gfp.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+
+#include <ub/beancounter.h>
+#include <ub/kmem.h>
+#include <ub/task.h>
+
+/*
+ * Slab accounting
+ */
+
+int ub_slab_charge(kmem_cache_t *cachep, void *objp, gfp_t flags)
+{
+ unsigned int size;
+ struct user_beancounter *ub, **slab_ubp;
+
+ ub = get_exec_ub();
+ if (ub == NULL)
+ return 0;
+
+ size = kmem_cache_size(cachep);
+ if (charge_beancounter(ub, UB_KMEMSIZE, size,
+ (flags & __GFP_UBC_LIMIT ? UB_LIMIT : UB_BARRIER)))
+ return -ENOMEM;
+
+ slab_ubp = kmem_cache_ubp(cachep, objp);
+ *slab_ubp = get_beancounter(ub);
+ return 0;
+}
+
+void ub_slab_uncharge(kmem_cache_t *cachep, void *objp)
+{
+ unsigned int size;
+ struct user_beancounter *ub, **slab_ubp;
+
+ slab_ubp = kmem_cache_ubp(cachep, objp);
+ if (*slab_ubp == NULL)
+ return;
+
+ ub = *slab_ubp;
+ size = kmem_cache_size(cachep);
+ uncharge_beancounter(ub, UB_KMEMSIZE, size);
+ put_beancounter(ub);
+ *slab_ubp = NULL;
+}
+
+/*
+ * Pages accounting
+ */
+
+int ub_page_charge(struct page *page, int order, gfp_t flags)
+{
+ struct user_beancounter *ub;
+
+ BUG_ON(page_ub(page) != NULL);
+
+ ub = get_exec_ub();
+ if (ub == NULL)
+ return 0;
+
+ if (charge_beancounter(ub, UB_KMEMSIZE, PAGE_SIZE << order,
+ (flags & __GFP_UBC_LIMIT ? UB_LIMIT : UB_BARRIER)))
+ return -ENOMEM;
+
+ page_ub(page) = get_beancounter(ub);
+ return 0;
+}
+
+void ub_page_uncharge(struct page *page, int order)
+{
+ struct user_beancounter *ub;
+
+ ub = page_ub(page);
+ if (ub == NULL)
+ return;
+
+ uncharge_beancounter(ub, UB_KMEMSIZE, PAGE_SIZE << order);
+ put_beancounter(ub);
+ page_ub(page) = NULL;
+}
--- ./mm/mempool.c.kmemcore 2006-08-16 19:10:38.000000000 +0400
+++ ./mm/mempool.c 2006-08-16 19:10:51.000000000 +0400
@@ -119,6 +119,7 @@ int mempool_resize(mempool_t *pool, int
unsigned long flags;

BUG_ON(new_min_nr <= 0);
+ gfp_mask &= ~__GFP_UBC;

spin_lock_irqsave(&pool->lock, flags);
if (new_min_nr <= pool->min_nr) {
@@ -212,6 +213,7 @@ void * mempool_alloc(mempool_t *pool, gf
gfp_mask |= __GFP_NOMEMALLOC; /* don't allocate emergency reserves */
gfp_mask |= __GFP_NORETRY; /* don't loop in __alloc_pages */
gfp_mask |= __GFP_NOWARN; /* failures are OK */
+ gfp_mask &= ~__GFP_UBC; /* do not charge */

gfp_temp = gfp_mask & ~(__GFP_WAIT|__GFP_IO);

--- ./mm/page_alloc.c.kmemcore 2006-08-16 19:10:38.000000000 +0400
+++ ./mm/page_alloc.c 2006-08-16 19:10:51.000000000 +0400
@@ -38,6 +38,8 @@
#include <linux/mempolicy.h>
#include <linux/stop_machine.h>

+#include <ub/kmem.h>
+
#include <asm/tlbflush.h>
#include <asm/div64.h>
#include "internal.h"
...

[RFC][PATCH 6/7] UBC: kernel memory acconting (mark objects) [message #5201 is a reply to message #5192] Wed, 16 August 2006 15:40 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Mark some kmem caches with SLAB_UBC and some allocations with __GFP_UBC
to cause charging/limiting of appropriate kernel resources.

Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
Signed-Off-By: Kirill Korotaev <dev@sw.ru>

---
arch/i386/kernel/ldt.c | 4 ++--
arch/i386/mm/init.c | 4 ++--
arch/i386/mm/pgtable.c | 6 ++++--
drivers/char/tty_io.c | 10 +++++-----
fs/file.c | 8 ++++----
fs/locks.c | 2 +-
fs/namespace.c | 3 ++-
fs/select.c | 7 ++++---
include/asm-i386/thread_info.h | 4 ++--
include/asm-ia64/pgalloc.h | 24 +++++++++++++++++-------
include/asm-x86_64/pgalloc.h | 12 ++++++++----
include/asm-x86_64/thread_info.h | 5 +++--
ipc/msgutil.c | 4 ++--
ipc/sem.c | 7 ++++---
ipc/util.c | 8 ++++----
kernel/fork.c | 15 ++++++++-------
kernel/posix-timers.c | 3 ++-
kernel/signal.c | 2 +-
kernel/user.c | 2 +-
mm/rmap.c | 3 ++-
mm/shmem.c | 3 ++-
21 files changed, 80 insertions(+), 56 deletions(-)

--- ./arch/i386/kernel/ldt.c.ubslabs 2006-04-21 11:59:31.000000000 +0400
+++ ./arch/i386/kernel/ldt.c 2006-08-01 13:22:30.000000000 +0400
@@ -39,9 +39,9 @@ static int alloc_ldt(mm_context_t *pc, i
oldsize = pc->size;
mincount = (mincount+511)&(~511);
if (mincount*LDT_ENTRY_SIZE > PAGE_SIZE)
- newldt = vmalloc(mincount*LDT_ENTRY_SIZE);
+ newldt = vmalloc_ub(mincount*LDT_ENTRY_SIZE);
else
- newldt = kmalloc(mincount*LDT_ENTRY_SIZE, GFP_KERNEL);
+ newldt = kmalloc(mincount*LDT_ENTRY_SIZE, GFP_KERNEL_UBC);

if (!newldt)
return -ENOMEM;
--- ./arch/i386/mm/init.c.ubslabs 2006-07-10 12:39:10.000000000 +0400
+++ ./arch/i386/mm/init.c 2006-08-01 13:17:07.000000000 +0400
@@ -680,7 +680,7 @@ void __init pgtable_cache_init(void)
pmd_cache = kmem_cache_create("pmd",
PTRS_PER_PMD*sizeof(pmd_t),
PTRS_PER_PMD*sizeof(pmd_t),
- 0,
+ SLAB_UBC,
pmd_ctor,
NULL);
if (!pmd_cache)
@@ -689,7 +689,7 @@ void __init pgtable_cache_init(void)
pgd_cache = kmem_cache_create("pgd",
PTRS_PER_PGD*sizeof(pgd_t),
PTRS_PER_PGD*sizeof(pgd_t),
- 0,
+ SLAB_UBC,
pgd_ctor,
PTRS_PER_PMD == 1 ? pgd_dtor : NULL);
if (!pgd_cache)
--- ./arch/i386/mm/pgtable.c.ubslabs 2006-07-10 12:39:10.000000000 +0400
+++ ./arch/i386/mm/pgtable.c 2006-08-01 13:27:35.000000000 +0400
@@ -158,9 +158,11 @@ struct page *pte_alloc_one(struct mm_str
struct page *pte;

#ifdef CONFIG_HIGHPTE
- pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO , 0);
+ pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO |
+ __GFP_UBC | __GFP_UBC_LIMIT, 0);
#else
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO|
+ __GFP_UBC | __GFP_UBC_LIMIT, 0);
#endif
return pte;
}
--- ./drivers/char/tty_io.c.ubslabs 2006-07-10 12:39:11.000000000 +0400
+++ ./drivers/char/tty_io.c 2006-08-01 15:21:21.000000000 +0400
@@ -158,7 +158,7 @@ static struct tty_struct *alloc_tty_stru
{
struct tty_struct *tty;

- tty = kmalloc(sizeof(struct tty_struct), GFP_KERNEL);
+ tty = kmalloc(sizeof(struct tty_struct), GFP_KERNEL_UBC);
if (tty)
memset(tty, 0, sizeof(struct tty_struct));
return tty;
@@ -1495,7 +1495,7 @@ static int init_dev(struct tty_driver *d

if (!*tp_loc) {
tp = (struct termios *) kmalloc(sizeof(struct termios),
- GFP_KERNEL);
+ GFP_KERNEL_UBC);
if (!tp)
goto free_mem_out;
*tp = driver->init_termios;
@@ -1503,7 +1503,7 @@ static int init_dev(struct tty_driver *d

if (!*ltp_loc) {
ltp = (struct termios *) kmalloc(sizeof(struct termios),
- GFP_KERNEL);
+ GFP_KERNEL_UBC);
if (!ltp)
goto free_mem_out;
memset(ltp, 0, sizeof(struct termios));
@@ -1528,7 +1528,7 @@ static int init_dev(struct tty_driver *d

if (!*o_tp_loc) {
o_tp = (struct termios *)
- kmalloc(sizeof(struct termios), GFP_KERNEL);
+ kmalloc(sizeof(struct termios), GFP_KERNEL_UBC);
if (!o_tp)
goto free_mem_out;
*o_tp = driver->other->init_termios;
@@ -1536,7 +1536,7 @@ static int init_dev(struct tty_driver *d

if (!*o_ltp_loc) {
o_ltp = (struct termios *)
- kmalloc(sizeof(struct termios), GFP_KERNEL);
+ kmalloc(sizeof(struct termios), GFP_KERNEL_UBC);
if (!o_ltp)
goto free_mem_out;
memset(o_ltp, 0, sizeof(struct termios));
--- ./fs/file.c.ubslabs 2006-07-17 17:01:12.000000000 +0400
+++ ./fs/file.c 2006-08-01 15:18:03.000000000 +0400
@@ -44,9 +44,9 @@ struct file ** alloc_fd_array(int num)
int size = num * sizeof(struct file *);

if (size <= PAGE_SIZE)
- new_fds = (struct file **) kmalloc(size, GFP_KERNEL);
+ new_fds = (struct file **) kmalloc(size, GFP_KERNEL_UBC);
else
- new_fds = (struct file **) vmalloc(size);
+ new_fds = (struct file **) vmalloc_ub(size);
return new_fds;
}

@@ -213,9 +213,9 @@ fd_set * alloc_fdset(int num)
int size = num / 8;

if (size <= PAGE_SIZE)
- new_fdset = (fd_set *) kmalloc(size, GFP_KERNEL);
+ new_fdset = (fd_set *) kmalloc(size, GFP_KERNEL_UBC);
else
- new_fdset = (fd_set *) vmalloc(size);
+ new_fdset = (fd_set *) vmalloc_ub(size);
return new_fdset;
}

--- ./fs/locks.c.ubslabs 2006-07-10 12:39:16.000000000 +0400
+++ ./fs/locks.c 2006-08-01 12:46:47.000000000 +0400
@@ -2226,7 +2226,7 @@ EXPORT_SYMBOL(lock_may_write);
static int __init filelock_init(void)
{
filelock_cache = kmem_cache_create("file_lock_cache",
- sizeof(struct file_lock), 0, SLAB_PANIC,
+ sizeof(struct file_lock), 0, SLAB_PANIC | SLAB_UBC,
init_once, NULL);
return 0;
}
--- ./fs/namespace.c.ubslabs 2006-07-10 12:39:16.000000000 +0400
+++ ./fs/namespace.c 2006-08-01 12:47:12.000000000 +0400
@@ -1825,7 +1825,8 @@ void __init mnt_init(unsigned long mempa
init_rwsem(&namespace_sem);

mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct vfsmount),
- 0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL, NULL);
+ 0, SLAB_HWCACHE_ALIGN | SLAB_UBC | SLAB_PANIC,
+ NULL, NULL);

mount_hashtable = (struct list_head *)__get_free_page(GFP_ATOMIC);

--- ./fs/select.c.ubslabs 2006-07-10 12:39:17.000000000 +0400
+++ ./fs/select.c 2006-08-01 15:17:01.000000000 +0400
@@ -103,7 +103,8 @@ static struct poll_table_entry *poll_get
if (!table || POLL_TABLE_FULL(table)) {
struct poll_table_page *new_table;

- new_table = (struct poll_table_page *) __get_free_page(GFP_KERNEL);
+ new_table = (struct poll_table_page *)
+ __get_free_page(GFP_KERNEL_UBC);
if (!new_table) {
p->error = -ENOMEM;
__set_current_state(TASK_RUNNING);
@@ -339,7 +340,7 @@ static int core_sys_select(int n, fd_set
if (size > sizeof(stack_fds) / 6) {
/* Not enough space in on-stack array; must use kmalloc */
ret = -ENOMEM;
- bits = kmalloc(6 * size, GFP_KERNEL);
+ bits = kmalloc(6 * size, GFP_KERNEL_UBC);
if (!bits)
goto out_nofds;
}
@@ -693,7 +694,7 @@ int do_sys_poll(struct pollfd __user *uf
if (!stack_pp)
stack_pp = pp = (struct poll_list *)stack_pps;
else {
- pp = kmalloc(size, GFP_KERNEL);
+ pp = kmalloc(size, GFP_KERNEL_UBC);
if (!pp)
goto out_fds;
}
--- ./include/asm-i386/thread_info.h.ubslabs 2006-07-10 12:39:19.000000000 +0400
+++ ./include/asm-i386/thread_info.h 2006-08-01 15:19:50.000000000 +0400
@@ -99,13 +99,13 @@ static inline struct thread_info *curren
({ \
struct thread_info *ret; \
\
- ret = kmalloc(THREAD_SIZE, GFP_KERNEL); \
+ ret = kmalloc(THREAD_SIZE, GFP_KERNEL_UBC); \
if (ret) \
memset(ret, 0, THREAD_SIZE); \
ret; \
})
#else
-#define alloc_thread_info(tsk) kmalloc(THREAD_SIZE, GFP_KERNEL)
+#define alloc_thread_info(tsk) kmalloc(THREAD_SIZE, GFP_KERNEL_UBC)
#endif

#define free_thread_info(info) kfree(info)
--- ./include/asm-ia64/pgalloc.h.ubslabs 2006-07-10 12:39:19.000000000 +0400
+++ ./include/asm-ia64/pgalloc.h 2006-08-01 13:35:49.000000000 +0400
@@ -19,6 +19,8 @@
#include <linux/page-flags.h>
#include <linux/threads.h>

+#include <ub/kmem.h>
+
#include <asm/mmu_context.h>

DECLARE_PER_CPU(unsigned long *, __pgtable_quicklist);
@@ -37,7 +39,7 @@ static inline long pgtable_quicklist_tot
return ql_size;
}

-static inline void *pgtable_quicklist_alloc(void)
+static inline void *pgtable_quicklist_alloc(int charge)
{
unsigned long *ret = NULL;

@@ -45,13 +47,20 @@ static inline void *pgtable_quicklist_al

ret = pgtable_quicklist;
if (likely(ret != NULL)) {
+ if (charge && ub_page_charge(virt_to_page(ret),
+ 0, __GFP_UBC_LIMIT)) {
+ ret = NULL;
+ goto out;
+ }
pgtable_quicklist = (unsigned long *)(*ret);
ret[0] = 0;
--pgtable_quicklist_size;
+out:
preempt_enable();
} else {
preempt_enable();
- ret = (unsigned long *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+ ret = (unsigned long *)__get_free_page(GFP_KERNEL |
+ __GFP_ZERO | __GFP_UBC | __GFP_UBC_LIMIT);
}

return ret;
@@ -69,6 +78,7 @@ static inline void pgtable_quicklist_fre
#endif

preempt_disable();
+ ub_page_uncharge(virt_to_page(pgtable_entry), 0);
*(unsigned long *)pgtable_entry = (unsigned long)pgtable_quicklist;
pgtable_quicklist = (unsigned long *)pgtable_entry;
++pgtable_quicklist_size;
@@ -77,7 +87,7 @@ static inline void pgtable_quicklist_fre

static inline pgd_t *pgd_alloc(struct mm_struct *mm)
{
- return pgtable_quicklist_alloc();
+ return pgtable_quicklist_alloc(1);
}

static inline void pgd_free(pgd_t * pgd)
@@ -94,7 +104,7 @@ pgd_populate(struct mm_struct *mm, pgd_t

static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
{
- return pgtable_quicklist_alloc();
+ retur
...

[RFC][PATCH 7/7] UBC: proc interface [message #5202 is a reply to message #5192] Wed, 16 August 2006 15:42 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Add proc interface (/proc/user_beancounters) allowing to see current
state (usage/limits/fails for each UB). Implemented via seq files.

Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
Signed-Off-By: Kirill Korotaev <dev@sw.ru>

---
init/main.c | 1
kernel/ub/Makefile | 1
kernel/ub/proc.c | 205 +++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 207 insertions(+)

--- ./init/main.c.ubproc 2006-07-31 18:40:20.000000000 +0400
+++ ./init/main.c 2006-08-03 16:02:19.000000000 +0400
@@ -578,6 +578,7 @@ asmlinkage void __init start_kernel(void
page_writeback_init();
#ifdef CONFIG_PROC_FS
proc_root_init();
+ ub_init_proc();
#endif
cpuset_init();
taskstats_init_early();
--- ./kernel/ub/Makefile.ubproc 2006-07-31 17:49:05.000000000 +0400
+++ ./kernel/ub/Makefile 2006-08-01 11:08:39.000000000 +0400
@@ -4,3 +4,4 @@ obj-$(CONFIG_USER_RESOURCE) += beancount
obj-$(CONFIG_USER_RESOURCE) += misc.o
obj-y += sys.o
obj-$(CONFIG_USER_RESOURCE) += kmem.o
+obj-$(CONFIG_USER_RESOURCE) += proc.o
--- ./kernel/ub/proc.c.ubproc 2006-08-01 10:22:09.000000000 +0400
+++ ./kernel/ub/proc.c 2006-08-03 15:50:35.000000000 +0400
@@ -0,0 +1,205 @@
+/*
+ * kernel/ub/proc.c
+ *
+ * Copyright (C) 2006 OpenVZ. SWsoft Inc.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+
+#include <ub/beancounter.h>
+
+#ifdef CONFIG_PROC_FS
+
+#if BITS_PER_LONG == 32
+static const char *head_fmt = "%10s %-12s %10s %10s %10s %10s %10s\n";
+static const char *res_fmt = "%10s %-12s %10lu %10lu %10lu %10lu %10lu\n";
+#else
+static const char *head_fmt = "%10s %-12s %20s %20s %20s %20s %20s\n";
+static const char *res_fmt = "%10s %-12s %20lu %20lu %20lu %20lu %20lu\n";
+#endif
+
+static void ub_show_header(struct seq_file *f)
+{
+ seq_printf(f, head_fmt, "uid", "resource",
+ "held", "maxheld", "barrier", "limit", "failcnt");
+}
+
+static void ub_show_res(struct seq_file *f, struct user_beancounter *ub, int r)
+{
+ char ub_uid[64];
+
+ if (r == 0)
+ ub_print_uid(ub, ub_uid, sizeof(ub_uid));
+ else
+ strcpy(ub_uid, "");
+
+ seq_printf(f, res_fmt, ub_uid, ub_rnames[r],
+ ub->ub_parms[r].held,
+ ub->ub_parms[r].maxheld,
+ ub->ub_parms[r].barrier,
+ ub->ub_parms[r].limit,
+ ub->ub_parms[r].failcnt);
+}
+
+static struct ub_seq_struct {
+ unsigned long flags;
+ int slot;
+ struct user_beancounter *ub;
+} ub_seq_ctx;
+
+static int ub_show(struct seq_file *f, void *v)
+{
+ int res;
+
+ for (res = 0; res < UB_RESOURCES; res++)
+ ub_show_res(f, ub_seq_ctx.ub, res);
+ return 0;
+}
+
+static void *ub_start_ctx(struct seq_file *f, unsigned long p, int sub)
+{
+ struct user_beancounter *ub;
+ struct hlist_node *pos;
+ unsigned long flags;
+ int slot;
+
+ if (p == 0)
+ ub_show_header(f);
+
+ spin_lock_irqsave(&ub_hash_lock, flags);
+ ub_seq_ctx.flags = flags;
+
+ for (slot = 0; slot < UB_HASH_SIZE; slot++)
+ hlist_for_each_entry (ub, pos, &ub_hash[slot], hash) {
+ if (!sub && ub->parent != NULL)
+ continue;
+
+ if (p-- == 0) {
+ ub_seq_ctx.ub = ub;
+ ub_seq_ctx.slot = slot;
+ return &ub_seq_ctx;
+ }
+ }
+
+ return NULL;
+}
+
+static void *ub_next_ctx(struct seq_file *f, loff_t *ppos, int sub)
+{
+ struct user_beancounter *ub;
+ struct hlist_node *pos;
+ int slot;
+
+ ub = ub_seq_ctx.ub;
+
+ pos = &ub->hash;
+ hlist_for_each_entry_continue (ub, pos, hash) {
+ if (!sub && ub->parent != NULL)
+ continue;
+
+ ub_seq_ctx.ub = ub;
+ (*ppos)++;
+ return &ub_seq_ctx;
+ }
+
+ for (slot = ub_seq_ctx.slot + 1; slot < UB_HASH_SIZE; slot++)
+ hlist_for_each_entry (ub, pos, &ub_hash[slot], hash) {
+ if (!sub && ub->parent != NULL)
+ continue;
+
+ ub_seq_ctx.ub = ub;
+ ub_seq_ctx.slot = slot;
+ (*ppos)++;
+ return &ub_seq_ctx;
+ }
+
+ return NULL;
+}
+
+static void *ub_start(struct seq_file *f, loff_t *ppos)
+{
+ return ub_start_ctx(f, *ppos, 0);
+}
+
+static void *ub_sub_start(struct seq_file *f, loff_t *ppos)
+{
+ return ub_start_ctx(f, *ppos, 1);
+}
+
+static void *ub_next(struct seq_file *f, void *v, loff_t *pos)
+{
+ return ub_next_ctx(f, pos, 0);
+}
+
+static void *ub_sub_next(struct seq_file *f, void *v, loff_t *pos)
+{
+ return ub_next_ctx(f, pos, 1);
+}
+
+static void ub_stop(struct seq_file *f, void *v)
+{
+ unsigned long flags;
+
+ flags = ub_seq_ctx.flags;
+ spin_unlock_irqrestore(&ub_hash_lock, flags);
+}
+
+static struct seq_operations ub_seq_ops = {
+ .start = ub_start,
+ .next = ub_next,
+ .stop = ub_stop,
+ .show = ub_show
+};
+
+static int ub_open(struct inode *inode, struct file *filp)
+{
+ return seq_open(filp, &ub_seq_ops);
+}
+
+static struct file_operations ub_file_operations = {
+ .open = ub_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+static struct seq_operations ub_sub_seq_ops = {
+ .start = ub_sub_start,
+ .next = ub_sub_next,
+ .stop = ub_stop,
+ .show = ub_show
+};
+
+static int ub_sub_open(struct inode *inode, struct file *filp)
+{
+ return seq_open(filp, &ub_sub_seq_ops);
+}
+
+static struct file_operations ub_sub_file_operations = {
+ .open = ub_sub_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+void __init ub_init_proc(void)
+{
+ struct proc_dir_entry *entry;
+
+ entry = create_proc_entry("user_beancounters", S_IRUGO, NULL);
+ if (entry)
+ entry->proc_fops = &ub_file_operations;
+ else
+ panic("Can't create /proc/user_beancounters\n");
+
+ entry = create_proc_entry("user_beancounters_sub", S_IRUGO, NULL);
+ if (entry)
+ entry->proc_fops = &ub_sub_file_operations;
+ else
+ panic("Can't create /proc/user_beancounters_sub\n");
+}
+#endif
Re: [RFC][PATCH 3/7] UBC: ub context and inheritance [message #5206 is a reply to message #5198] Wed, 16 August 2006 16:31 Go to previous messageGo to next message
Alan Cox is currently offline  Alan Cox
Messages: 48
Registered: May 2006
Member
Ar Mer, 2006-08-16 am 19:38 +0400, ysgrifennodd Kirill Korotaev:
> Contains code responsible for setting UB on task,
> it's inheriting and setting host context in interrupts.
>
> Task references three beancounters:
> 1. exec_ub current context. all resources are
> charged to this beancounter.
> 2. task_ub beancounter to which task_struct is
> charged itself.
> 3. fork_sub beancounter which is inherited by
> task's children on fork
>
> Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
> Signed-Off-By: Kirill Korotaev <dev@sw.ru>

Acked-by: Alan Cox <alan@redhat.com>
Re: [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5207 is a reply to message #5199] Wed, 16 August 2006 16:32 Go to previous messageGo to next message
Alan Cox is currently offline  Alan Cox
Messages: 48
Registered: May 2006
Member
Ar Mer, 2006-08-16 am 19:39 +0400, ysgrifennodd Kirill Korotaev:
> Add the following system calls for UB management:
> 1. sys_getluid - get current UB id
> 2. sys_setluid - changes exec_ and fork_ UBs on current
> 3. sys_setublimit - set limits for resources consumtions
>
> Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
> Signed-Off-By: Kirill Korotaev <dev@sw.ru>

Acked-by: Alan Cox <alan@redhat.com>
Re: [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5208 is a reply to message #5200] Wed, 16 August 2006 16:35 Go to previous messageGo to next message
Alan Cox is currently offline  Alan Cox
Messages: 48
Registered: May 2006
Member
The

+ ub->ub_parms[UB_KMEMSIZE].limit = 32 * 1024 * 1024

seems a bit arbitary. 32Mb is variously vast amounts of memory and not
enough to boot depending if you are booting a PDA or a 4096 core Itanic
box
Re: [RFC][PATCH 6/7] UBC: kernel memory acconting (mark objects) [message #5209 is a reply to message #5201] Wed, 16 August 2006 16:36 Go to previous messageGo to next message
Alan Cox is currently offline  Alan Cox
Messages: 48
Registered: May 2006
Member
Ar Mer, 2006-08-16 am 19:42 +0400, ysgrifennodd Kirill Korotaev:
> Mark some kmem caches with SLAB_UBC and some allocations with __GFP_UBC
> to cause charging/limiting of appropriate kernel resources.
>
> Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
> Signed-Off-By: Kirill Korotaev <dev@sw.ru>

Acked-by: Alan Cox <alan@redhat.com>

(although it will clash slightly with the diffs I just sent Andrew)
Re: [RFC][PATCH 2/7] UBC: core (structures, API) [message #5210 is a reply to message #5196] Wed, 16 August 2006 16:38 Go to previous messageGo to next message
Alan Cox is currently offline  Alan Cox
Messages: 48
Registered: May 2006
Member
Ar Mer, 2006-08-16 am 19:37 +0400, ysgrifennodd Kirill Korotaev:
> + * UB_MAXVALUE is essentially LONG_MAX declared in a cross-compiling safe form.
> + */
> +#define UB_MAXVALUE ( (1UL << (sizeof(unsigned long)*8-1)) - 1)
> +

Whats wrong with using the kernels LONG_MAX ?
Re: [RFC][PATCH 2/7] UBC: core (structures, API) [message #5214 is a reply to message #5196] Wed, 16 August 2006 18:18 Go to previous messageGo to next message
Andrew Morton is currently offline  Andrew Morton
Messages: 127
Registered: December 2005
Senior Member
On Wed, 16 Aug 2006 11:11:08 -0700
Rohit Seth <rohitseth@google.com> wrote:

> > +struct user_beancounter
> > +{
> > + atomic_t ub_refcount;
> > + spinlock_t ub_lock;
> > + uid_t ub_uid;
>
> Why uid? Will it be possible to club processes belonging to different
> users to same bean counter.

hm. I'd have expected to see a `struct user_struct *' here, not a uid_t.
Re: [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5215 is a reply to message #5199] Wed, 16 August 2006 18:44 Go to previous messageGo to next message
Alan Cox is currently offline  Alan Cox
Messages: 48
Registered: May 2006
Member
Ar Mer, 2006-08-16 am 11:17 -0700, ysgrifennodd Rohit Seth:
> I think there should be a check here for seeing if the new limits are
> lower than the current usage of a resource. If so then take appropriate
> action.

Generally speaking there isn't a sane appropriate action because the
resources can't just be yanked.
Re: [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5217 is a reply to message #5200] Wed, 16 August 2006 18:47 Go to previous messageGo to next message
Dave Hansen is currently offline  Dave Hansen
Messages: 240
Registered: October 2005
Senior Member
On Wed, 2006-08-16 at 19:40 +0400, Kirill Korotaev wrote:
> --- ./include/linux/mm.h.kmemcore 2006-08-16 19:10:38.000000000
> +0400
> +++ ./include/linux/mm.h 2006-08-16 19:10:51.000000000 +0400
> @@ -274,8 +274,14 @@ struct page {
> unsigned int gfp_mask;
> unsigned long trace[8];
> #endif
> +#ifdef CONFIG_USER_RESOURCE
> + union {
> + struct user_beancounter *page_ub;
> + } bc;
> +#endif
> };

Is everybody OK with adding this accounting to the 'struct page'? Is
there any kind of noticeable performance penalty for this? I thought
that we had this aligned pretty well on cacheline boundaries.

How many things actually use this? Can we have the slab ubcs without
the struct page pointer?

-- Dave
Re: [RFC][PATCH] UBC: user resource beancounters [message #5218 is a reply to message #5192] Wed, 16 August 2006 19:06 Go to previous messageGo to next message
Alan Cox is currently offline  Alan Cox
Messages: 48
Registered: May 2006
Member
Ar Mer, 2006-08-16 am 11:53 -0700, ysgrifennodd Rohit Seth:
> > pages shared between containers are correctly
> > charged as fractions (tunable).
> >
>
> I wouldn't be too worried about doing fractions. Make it unfair and
> charge it to either the container who first instantiated the file or the
> container who faulted on that page first.

Thats no good if you can arrange who gets charged, it becomes possible
to accumulate the advantages and break the constraints intended.

> Though the part that seems important is to be able to define a directory
> in fs and say all pages belonging to files underneath that directory are
> going to be put in specific container.

Thats an extremely crude use of beancounters. You can do far more useful
things with them and namespaces (and even at times without namespaces)
such as preventing one web site breaking another.
Re: [RFC][PATCH 7/7] UBC: proc interface [message #5232 is a reply to message #5202] Wed, 16 August 2006 17:13 Go to previous messageGo to next message
Greg KH is currently offline  Greg KH
Messages: 27
Registered: February 2006
Junior Member
On Wed, Aug 16, 2006 at 07:44:30PM +0400, Kirill Korotaev wrote:
> Add proc interface (/proc/user_beancounters) allowing to see current
> state (usage/limits/fails for each UB). Implemented via seq files.

Ugh, why /proc? This doesn't have anything to do with processes, just
users, right? What's wrong with /sys/kernel/ instead?

Or /sys/kernel/debug/user_beancounters/ in debugfs as this is just a
debugging thing, right?

thanks,

greg k-h
Re: [RFC][PATCH 2/7] UBC: core (structures, API) [message #5233 is a reply to message #5196] Wed, 16 August 2006 17:15 Go to previous messageGo to next message
Greg KH is currently offline  Greg KH
Messages: 27
Registered: February 2006
Junior Member
On Wed, Aug 16, 2006 at 07:37:26PM +0400, Kirill Korotaev wrote:
> +struct user_beancounter
> +{
> + atomic_t ub_refcount;

Why not use a struct kref here instead of rolling your own reference
counting logic?

thanks,

greg k-h
Re: [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5234 is a reply to message #5199] Wed, 16 August 2006 17:17 Go to previous messageGo to next message
Greg KH is currently offline  Greg KH
Messages: 27
Registered: February 2006
Junior Member
On Wed, Aug 16, 2006 at 07:39:43PM +0400, Kirill Korotaev wrote:
> --- ./include/asm-sparc/unistd.h.arsys 2006-07-10 12:39:19.000000000 +0400
> +++ ./include/asm-sparc/unistd.h 2006-08-10 17:08:19.000000000 +0400
> @@ -318,6 +318,9 @@
> #define __NR_unshare 299
> #define __NR_set_robust_list 300
> #define __NR_get_robust_list 301
> +#define __NR_getluid 302
> +#define __NR_setluid 303
> +#define __NR_setublimit 304

Hm, you seem to be ignoring this:

>
> #ifdef __KERNEL__
> /* WARNING: You MAY NOT add syscall numbers larger than 301, since

Same thing for sparc64:

> --- ./include/asm-sparc64/unistd.h.arsys 2006-07-10
> 12:39:19.000000000 +0400
> +++ ./include/asm-sparc64/unistd.h 2006-08-10 17:09:24.000000000 +0400
> @@ -320,6 +320,9 @@
> #define __NR_unshare 299
> #define __NR_set_robust_list 300
> #define __NR_get_robust_list 301
> +#define __NR_getluid 302
> +#define __NR_setluid 303
> +#define __NR_setublimit 304
>
> #ifdef __KERNEL__
> /* WARNING: You MAY NOT add syscall numbers larger than 301, since

You might want to read those comments...

thanks,

greg k-h
Re: [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5235 is a reply to message #5199] Wed, 16 August 2006 18:17 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Wed, 2006-08-16 at 19:39 +0400, Kirill Korotaev wrote:
> Add the following system calls for UB management:
> 1. sys_getluid - get current UB id
> 2. sys_setluid - changes exec_ and fork_ UBs on current
> 3. sys_setublimit - set limits for resources consumtions
>

Why not have another system call for getting the current limits?

But as I said in previous mail, configfs seems like a better choice for
user interface. That way user has to go to one place to read/write
limits, see the current usage and other stats.

> Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
> Signed-Off-By: Kirill Korotaev <dev@sw.ru>

...<snip>...
> +
> +/*
> + * The setbeanlimit syscall
> + */
> +asmlinkage long sys_setublimit(uid_t uid, unsigned long resource,
> + unsigned long *limits)
> +{

> + ub = beancounter_findcreate(uid, NULL, 0);
> + if (ub == NULL)
> + goto out;
> +
> + spin_lock_irqsave(&ub->ub_lock, flags);
> + ub->ub_parms[resource].barrier = new_limits[0];
> + ub->ub_parms[resource].limit = new_limits[1];
> + spin_unlock_irqrestore(&ub->ub_lock, flags);
> +

I think there should be a check here for seeing if the new limits are
lower than the current usage of a resource. If so then take appropriate
action.

-rohit
Re: [RFC][PATCH 2/7] UBC: core (structures, API) [message #5236 is a reply to message #5196] Wed, 16 August 2006 18:11 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Wed, 2006-08-16 at 19:37 +0400, Kirill Korotaev wrote:
> Core functionality and interfaces of UBC:
> find/create beancounter, initialization,
> charge/uncharge of resource, core objects' declarations.
>
> Basic structures:
> ubparm - resource description
> user_beancounter - set of resources, id, lock
>
> Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
> Signed-Off-By: Kirill Korotaev <dev@sw.ru>
>
> ---
> include/ub/beancounter.h | 157 ++++++++++++++++++
> init/main.c | 4
> kernel/Makefile | 1
> kernel/ub/Makefile | 7
> kernel/ub/beancounter.c | 398 +++++++++++++++++++++++++++++++++++++++++++++++
> 5 files changed, 567 insertions(+)
>
> --- /dev/null 2006-07-18 14:52:43.075228448 +0400
> +++ ./include/ub/beancounter.h 2006-08-10 14:58:27.000000000 +0400
> @@ -0,0 +1,157 @@
> +/*
> + * include/ub/beancounter.h
> + *
> + * Copyright (C) 2006 OpenVZ. SWsoft Inc
> + *
> + */
> +
> +#ifndef _LINUX_BEANCOUNTER_H
> +#define _LINUX_BEANCOUNTER_H
> +
> +/*
> + * Resource list.
> + */
> +
> +#define UB_RESOURCES 0
> +
> +struct ubparm {
> + /*
> + * A barrier over which resource allocations are failed gracefully.
> + * e.g. if the amount of consumed memory is over the barrier further
> + * sbrk() or mmap() calls fail, the existing processes are not killed.
> + */
> + unsigned long barrier;
> + /* hard resource limit */
> + unsigned long limit;
> + /* consumed resources */
> + unsigned long held;
> + /* maximum amount of consumed resources through the last period */
> + unsigned long maxheld;
> + /* minimum amount of consumed resources through the last period */
> + unsigned long minheld;
> + /* count of failed charges */
> + unsigned long failcnt;
> +};

What is the difference between barrier and limit. They both sound like
hard limits. No?

> +
> +/*
> + * Kernel internal part.
> + */
> +
> +#ifdef __KERNEL__
> +
> +#include <linux/config.h>
> +#include <linux/spinlock.h>
> +#include <linux/list.h>
> +#include <asm/atomic.h>
> +
> +/*
> + * UB_MAXVALUE is essentially LONG_MAX declared in a cross-compiling safe form.
> + */
> +#define UB_MAXVALUE ( (1UL << (sizeof(unsigned long)*8-1)) - 1)
> +
> +
> +/*
> + * Resource management structures
> + * Serialization issues:
> + * beancounter list management is protected via ub_hash_lock
> + * task pointers are set only for current task and only once
> + * refcount is managed atomically
> + * value and limit comparison and change are protected by per-ub spinlock
> + */
> +
> +struct user_beancounter
> +{
> + atomic_t ub_refcount;
> + spinlock_t ub_lock;
> + uid_t ub_uid;

Why uid? Will it be possible to club processes belonging to different
users to same bean counter.

> + struct hlist_node hash;
> +
> + struct user_beancounter *parent;
> + void *private_data;
> +

What are the above two fields used for?

> + /* resources statistics and settings */
> + struct ubparm ub_parms[UB_RESOURCES];
> +};
> +

I presume UB_RESOURCES value is going to change as different resources
start getting tracked.

I think something like configfs should be used for user interface. It
automatically presents the right interfaces to user land (based on
kernel implementation). And you wouldn't need any changes in glibc etc.


-rohit
Re: [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5237 is a reply to message #5200] Wed, 16 August 2006 18:24 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Wed, 2006-08-16 at 19:40 +0400, Kirill Korotaev wrote:
> Introduce UB_KMEMSIZE resource which accounts kernel
> objects allocated by task's request.
>
> Reference to UB is kept on struct page or slab object.
> For slabs each struct slab contains a set of pointers
> corresponding objects are charged to.
>
> Allocation charge rules:
> 1. Pages - if allocation is performed with __GFP_UBC flag - page
> is charged to current's exec_ub.
> 2. Slabs - kmem_cache may be created with SLAB_UBC flag - in this
> case each allocation is charged. Caches used by kmalloc are
> created with SLAB_UBC | SLAB_UBC_NOCHARGE flags. In this case
> only __GFP_UBC allocations are charged.

<snip>

> --- ./mm/page_alloc.c.kmemcore 2006-08-16 19:10:38.000000000 +0400
> +++ ./mm/page_alloc.c 2006-08-16 19:10:51.000000000 +0400
> @@ -38,6 +38,8 @@
> #include <linux/mempolicy.h>
> #include <linux/stop_machine.h>
>
> +#include <ub/kmem.h>
> +
> #include <asm/tlbflush.h>
> #include <asm/div64.h>
> #include "internal.h"
> @@ -484,6 +486,8 @@ static void __free_pages_ok(struct page
> if (reserved)
> return;
>
> + ub_page_uncharge(page, order);
> +
> kernel_map_pages(page, 1 << order, 0);
> local_irq_save(flags);
> __count_vm_events(PGFREE, 1 << order);
> @@ -764,6 +768,8 @@ static void fastcall free_hot_cold_page(
> if (free_pages_check(page))
> return;
>
> + ub_page_uncharge(page, 0);
> +
> kernel_map_pages(page, 1, 0);
>
> pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
> @@ -1153,6 +1159,11 @@ nopage:
> show_mem();
> }
> got_pg:
> + if ((gfp_mask & __GFP_UBC) &&
> + ub_page_charge(page, order, gfp_mask)) {
> + __free_pages(page, order);
> + page = NULL;
> + }
> #ifdef CONFIG_PAGE_OWNER
> if (page)
> set_page_owner(page, order, gfp_mask);

If I'm reading this patch right then seems like you are making page
allocations to fail w/o (for example) trying to purge some pages from
the page cache belonging to this container. Or is that reclaim going to
come later?

-rohit
Re: [RFC][PATCH] UBC: user resource beancounters [message #5238 is a reply to message #5192] Wed, 16 August 2006 18:53 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Wed, 2006-08-16 at 19:24 +0400, Kirill Korotaev wrote:
> The following patch set presents base of
> User Resource Beancounters (UBC).
> UBC allows to account and control consumption
> of kernel resources used by group of processes.
>
> The full UBC patch set allows to control:
> - kernel memory. All the kernel objects allocatable
> on user demand should be accounted and limited
> for DoS protection.
> E.g. page tables, task structs, vmas etc.
>

Good.

> - virtual memory pages. UBC allows to
> limit a container to some amount of memory and
> introduces 2-level OOM killer taking into account
> container's consumption.
> pages shared between containers are correctly
> charged as fractions (tunable).
>

I wouldn't be too worried about doing fractions. Make it unfair and
charge it to either the container who first instantiated the file or the
container who faulted on that page first.

Though the part that seems important is to be able to define a directory
in fs and say all pages belonging to files underneath that directory are
going to be put in specific container. Just like you are having
resource beans associated with sockets, have address_space or inode also
associated with resource beans. (And it should be possible to have a
container/resource bean without any active process but set of
address_space mappings with its own limits and current usage).

-rohit
Re: [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5239 is a reply to message #5217] Wed, 16 August 2006 19:15 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Wed, 2006-08-16 at 11:47 -0700, Dave Hansen wrote:
> On Wed, 2006-08-16 at 19:40 +0400, Kirill Korotaev wrote:
> > --- ./include/linux/mm.h.kmemcore 2006-08-16 19:10:38.000000000
> > +0400
> > +++ ./include/linux/mm.h 2006-08-16 19:10:51.000000000 +0400
> > @@ -274,8 +274,14 @@ struct page {
> > unsigned int gfp_mask;
> > unsigned long trace[8];
> > #endif
> > +#ifdef CONFIG_USER_RESOURCE
> > + union {
> > + struct user_beancounter *page_ub;
> > + } bc;
> > +#endif
> > };
>
> Is everybody OK with adding this accounting to the 'struct page'?

My preference would be to have container (I keep on saying container,
but resource beancounter) pointer embeded in task, mm(not sure),
address_space and anon_vma structures. This should allow us to track
user land pages optimally. But for tracking kernel usage on behalf of
user, we will have to use an additional field (unless we can re-use
mapping). Please correct me if I'm wrong, though all the kernel
resources will be allocated/freed in context of a user process. And at
that time we know if a allocation should succeed or not. So we may
actually not need to track kernel pages that closely. We are not going
to run reclaim on any of them anyways.

-rohit
Re: [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5240 is a reply to message #5215] Wed, 16 August 2006 19:22 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Wed, 2006-08-16 at 20:04 +0100, Alan Cox wrote:
> Ar Mer, 2006-08-16 am 11:17 -0700, ysgrifennodd Rohit Seth:
> > I think there should be a check here for seeing if the new limits are
> > lower than the current usage of a resource. If so then take appropriate
> > action.
>
> Generally speaking there isn't a sane appropriate action because the
> resources can't just be yanked.
>

I was more thinking about (for example) user land physical memory limit
for that bean counter. If the limits are going down, then the system
call should try to flush out page cache pages or swap out anonymous
memory. But you are right that it won't be possible in all cases, like
for in kernel memory limits.

-rohit
Re: [RFC][PATCH 5/7] UBC: kernel memory accounting (core) [message #5241 is a reply to message #5239] Thu, 17 August 2006 00:02 Go to previous messageGo to next message
Alan Cox is currently offline  Alan Cox
Messages: 48
Registered: May 2006
Member
Ar Mer, 2006-08-16 am 12:15 -0700, ysgrifennodd Rohit Seth:
> resources will be allocated/freed in context of a user process. And at
> that time we know if a allocation should succeed or not. So we may
> actually not need to track kernel pages that closely.

Quite the reverse, tracking kernel pages is critical, user pages kind of
balance out for many cases kernel ones don't.
Re: [ckrm-tech] [RFC][PATCH] UBC: user resource beancounters [message #5253 is a reply to message #5192] Thu, 17 August 2006 00:15 Go to previous messageGo to next message
Chandra Seetharaman is currently offline  Chandra Seetharaman
Messages: 88
Registered: August 2006
Member
Kirill,

Thanks for posting the patches to ckrm-tech. I 'll look into it and post
my comments tomorrow.

Some documentation (or pointer to the documentation) on how to use this
feature and high level design would really help.

How does the hierarchy work ? (May be reading the code would clear it
up :).

few comments below..
On Wed, 2006-08-16 at 19:24 +0400, Kirill Korotaev wrote:
<snip>
> The patches in these series are:
> diff-ubc-kconfig.patch:
> Adds kernel/ub/Kconfig file with UBC options and
> includes it into arch Kconfigs

Since the core functionality is arch independent, why not have the
Kconfig stuff in some generic place like init/Kconfig ?

>
> diff-ubc-core.patch:
> Contains core functionality and interfaces of UBC:
> find/create beancounter, initialization,
> charge/uncharge of resource, core objects' declarations.
>
> diff-ubc-task.patch:
> Contains code responsible for setting UB on task,
> it's inheriting and setting host context in interrupts.
>
> Task contains three beancounters:
> 1. exec_ub - current context. all resources are charged
> to this beancounter.
> 2. task_ub - beancounter to which task_struct is charged
> itself.
> 3. fork_sub - beancounter which is inherited by
> task's children on fork

wondering why we need three of these ?

>
> diff-ubc-syscalls.patch:
> Patch adds system calls for UB management:
> 1. sys_getluid - get current UB id
> 2. sys_setluid - changes exec_ and fork_ UBs on current
> 3. sys_setublimit - set limits for resources consumtions

I agree with Rohit that configfs based interface would be more easy to
use (you will not get into the system call number issue that Greg has
pointed too).
>
> diff-ubc-kmem-core.patch:
> Introduces UB_KMEMSIZE resource which accounts kernel
> objects allocated by task's request.
>
> Objects are accounted via struct page and slab objects.
> For the latter ones each slab contains a set of pointers
> corresponding object is charged to.
>
> Allocation charge rules:
> 1. Pages - if allocation is performed with __GFP_UBC flag - page
> is charged to current's exec_ub.
> 2. Slabs - kmem_cache may be created with SLAB_UBC flag - in this
> case each allocation is charged. Caches used by kmalloc are
> created with SLAB_UBC | SLAB_UBC_NOCHARGE flags. In this case
> only __GFP_UBC allocations are charged.
>
> diff-ubc-kmem-charge.patch:
> Adds SLAB_UBC and __GFP_UBC flags in appropriate places
> to cause charging/limiting of specified resources.
>
> diff-ubc-proc.patch:
> Adds two proc entries user_beancounters and user_beancounters_sub
> allowing to see current state (usage/limits/fails for each UB).
> Implemented via seq files.

again, configfs would be easier.

>
> Patch set is applicable to 2.6.18-rc4-mm1
>
> Thanks,
> Kirill
>
>
> ------------------------------------------------------------ -------------
> Using Tomcat but need to do more? Need to support web services, security?
> Get stuff done quickly with pre-integrated technology to make your job easier
> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&b id=263057&dat=121642
> _______________________________________________
> ckrm-tech mailing list
> https://lists.sourceforge.net/lists/listinfo/ckrm-tech
--

------------------------------------------------------------ ----------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
------------------------------------------------------------ ----------
Re: [RFC][PATCH 2/7] UBC: core (structures, API) [message #5274 is a reply to message #5210] Thu, 17 August 2006 11:40 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

> Ar Mer, 2006-08-16 am 19:37 +0400, ysgrifennodd Kirill Korotaev:
>
>>+ * UB_MAXVALUE is essentially LONG_MAX declared in a cross-compiling safe form.
>>+ */
>>+#define UB_MAXVALUE ( (1UL << (sizeof(unsigned long)*8-1)) - 1)
>>+
>
>
> Whats wrong with using the kernels LONG_MAX ?
just historical code line which introduces UB_MAXVALUE independant of
cross-compiler/headers etc.

Will replace it.

Kirill
Re: [RFC][PATCH 2/7] UBC: core (structures, API) [message #5275 is a reply to message #5233] Thu, 17 August 2006 11:43 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

>>+struct user_beancounter
>>+{
>>+ atomic_t ub_refcount;
>
>
> Why not use a struct kref here instead of rolling your own reference
> counting logic?

We need more complex decrement/locking scheme than krefs
provide. e.g. in __put_beancounter() we need
atomic_dec_and_lock_irqsave() semantics for performance optimizations.

Kirill
Re: [RFC][PATCH 2/7] UBC: core (structures, API) [message #5279 is a reply to message #5236] Thu, 17 August 2006 11:52 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Rohit Seth wrote:
> On Wed, 2006-08-16 at 19:37 +0400, Kirill Korotaev wrote:
>
>>Core functionality and interfaces of UBC:
>>find/create beancounter, initialization,
>>charge/uncharge of resource, core objects' declarations.
>>
>>Basic structures:
>> ubparm - resource description
>> user_beancounter - set of resources, id, lock
>>
>>Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
>>Signed-Off-By: Kirill Korotaev <dev@sw.ru>
>>
>>---
>> include/ub/beancounter.h | 157 ++++++++++++++++++
>> init/main.c | 4
>> kernel/Makefile | 1
>> kernel/ub/Makefile | 7
>> kernel/ub/beancounter.c | 398 +++++++++++++++++++++++++++++++++++++++++++++++
>> 5 files changed, 567 insertions(+)
>>
>>--- /dev/null 2006-07-18 14:52:43.075228448 +0400
>>+++ ./include/ub/beancounter.h 2006-08-10 14:58:27.000000000 +0400
>>@@ -0,0 +1,157 @@
>>+/*
>>+ * include/ub/beancounter.h
>>+ *
>>+ * Copyright (C) 2006 OpenVZ. SWsoft Inc
>>+ *
>>+ */
>>+
>>+#ifndef _LINUX_BEANCOUNTER_H
>>+#define _LINUX_BEANCOUNTER_H
>>+
>>+/*
>>+ * Resource list.
>>+ */
>>+
>>+#define UB_RESOURCES 0
>>+
>>+struct ubparm {
>>+ /*
>>+ * A barrier over which resource allocations are failed gracefully.
>>+ * e.g. if the amount of consumed memory is over the barrier further
>>+ * sbrk() or mmap() calls fail, the existing processes are not killed.
>>+ */
>>+ unsigned long barrier;
>>+ /* hard resource limit */
>>+ unsigned long limit;
>>+ /* consumed resources */
>>+ unsigned long held;
>>+ /* maximum amount of consumed resources through the last period */
>>+ unsigned long maxheld;
>>+ /* minimum amount of consumed resources through the last period */
>>+ unsigned long minheld;
>>+ /* count of failed charges */
>>+ unsigned long failcnt;
>>+};
>
>
> What is the difference between barrier and limit. They both sound like
> hard limits. No?
check __charge_beancounter_locked and severity.
It provides some kind of soft and hard limits.

>>+
>>+/*
>>+ * Kernel internal part.
>>+ */
>>+
>>+#ifdef __KERNEL__
>>+
>>+#include <linux/config.h>
>>+#include <linux/spinlock.h>
>>+#include <linux/list.h>
>>+#include <asm/atomic.h>
>>+
>>+/*
>>+ * UB_MAXVALUE is essentially LONG_MAX declared in a cross-compiling safe form.
>>+ */
>>+#define UB_MAXVALUE ( (1UL << (sizeof(unsigned long)*8-1)) - 1)
>>+
>>+
>>+/*
>>+ * Resource management structures
>>+ * Serialization issues:
>>+ * beancounter list management is protected via ub_hash_lock
>>+ * task pointers are set only for current task and only once
>>+ * refcount is managed atomically
>>+ * value and limit comparison and change are protected by per-ub spinlock
>>+ */
>>+
>>+struct user_beancounter
>>+{
>>+ atomic_t ub_refcount;
>>+ spinlock_t ub_lock;
>>+ uid_t ub_uid;
>
>
> Why uid? Will it be possible to club processes belonging to different
> users to same bean counter.
oh, its a misname. Should be ub_id. it is ID of user_beancounter
and has nothing to do with user id.

>>+ struct hlist_node hash;
>>+
>>+ struct user_beancounter *parent;
>>+ void *private_data;
>>+
>
>
> What are the above two fields used for?
the first one is for hierarchical UBs,
see beancounter_findcreate with UB_LOOKUP_SUB.
private_data is probably not used yet :)

>>+ /* resources statistics and settings */
>>+ struct ubparm ub_parms[UB_RESOURCES];
>>+};
>>+
>
>
> I presume UB_RESOURCES value is going to change as different resources
> start getting tracked.
what's wrong with it?

> I think something like configfs should be used for user interface. It
> automatically presents the right interfaces to user land (based on
> kernel implementation). And you wouldn't need any changes in glibc etc.
1. UBC doesn't require glibc modificatins.
2. if you think a bit more about it, adding UB parameters doesn't
require user space changes as well.
3. it is possible to add any kind of interface for UBC. but do you like the idea
to grep 200(containers)x20(parameters) files for getting current usages?
Do you like the idea to convert numbers to strings and back w/o
thinking of data types?

Thanks,
Kirill
Re: [RFC][PATCH 2/7] UBC: core (structures, API) [message #5280 is a reply to message #5214] Thu, 17 August 2006 11:52 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Andrew Morton wrote:
> On Wed, 16 Aug 2006 11:11:08 -0700
> Rohit Seth <rohitseth@google.com> wrote:
>
>
>>>+struct user_beancounter
>>>+{
>>>+ atomic_t ub_refcount;
>>>+ spinlock_t ub_lock;
>>>+ uid_t ub_uid;
>>
>>Why uid? Will it be possible to club processes belonging to different
>>users to same bean counter.
>
>
> hm. I'd have expected to see a `struct user_struct *' here, not a uid_t.

Sorry, misused name. should be ub_id. not related to user_struct or user.

Thanks,
Kirill
Re: [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5282 is a reply to message #5234] Thu, 17 August 2006 12:00 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Greg KH wrote:
> On Wed, Aug 16, 2006 at 07:39:43PM +0400, Kirill Korotaev wrote:
>
>>--- ./include/asm-sparc/unistd.h.arsys 2006-07-10 12:39:19.000000000 +0400
>>+++ ./include/asm-sparc/unistd.h 2006-08-10 17:08:19.000000000 +0400
>>@@ -318,6 +318,9 @@
>>#define __NR_unshare 299
>>#define __NR_set_robust_list 300
>>#define __NR_get_robust_list 301
>>+#define __NR_getluid 302
>>+#define __NR_setluid 303
>>+#define __NR_setublimit 304
>
>
> Hm, you seem to be ignoring this:
>
>
>>#ifdef __KERNEL__
>>/* WARNING: You MAY NOT add syscall numbers larger than 301, since
>
>
> Same thing for sparc64:
[...skipped...]

Oh, will fix NR_SYSCALLS in entry.S and the comment in unistd.h. Thanks for catching this!

Thanks,
Kirill
Re: [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5284 is a reply to message #5235] Thu, 17 August 2006 12:03 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

>>Add the following system calls for UB management:
>> 1. sys_getluid - get current UB id
>> 2. sys_setluid - changes exec_ and fork_ UBs on current
>> 3. sys_setublimit - set limits for resources consumtions
>>
>
>
> Why not have another system call for getting the current limits?
will add sys_getublimit().

> But as I said in previous mail, configfs seems like a better choice for
> user interface. That way user has to go to one place to read/write
> limits, see the current usage and other stats.
Check another email about interfaces. I have arguments against it :/

>>Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
>>Signed-Off-By: Kirill Korotaev <dev@sw.ru>
>
>
> ...<snip>...
>
>>+
>>+/*
>>+ * The setbeanlimit syscall
>>+ */
>>+asmlinkage long sys_setublimit(uid_t uid, unsigned long resource,
>>+ unsigned long *limits)
>>+{
>
>
>>+ ub = beancounter_findcreate(uid, NULL, 0);
>>+ if (ub == NULL)
>>+ goto out;
>>+
>>+ spin_lock_irqsave(&ub->ub_lock, flags);
>>+ ub->ub_parms[resource].barrier = new_limits[0];
>>+ ub->ub_parms[resource].limit = new_limits[1];
>>+ spin_unlock_irqrestore(&ub->ub_lock, flags);
>>+
>
>
> I think there should be a check here for seeing if the new limits are
> lower than the current usage of a resource. If so then take appropriate
> action.
any idea what exact action to add here?
Looks like can be added when needed, agree?

Thanks,
Kirill
Re: [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5285 is a reply to message #5240] Thu, 17 August 2006 12:11 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Rohit Seth wrote:
> On Wed, 2006-08-16 at 20:04 +0100, Alan Cox wrote:
>
>>Ar Mer, 2006-08-16 am 11:17 -0700, ysgrifennodd Rohit Seth:
>>
>>>I think there should be a check here for seeing if the new limits are
>>>lower than the current usage of a resource. If so then take appropriate
>>>action.
>>
>>Generally speaking there isn't a sane appropriate action because the
>>resources can't just be yanked.
>>
>
>
> I was more thinking about (for example) user land physical memory limit
> for that bean counter. If the limits are going down, then the system
> call should try to flush out page cache pages or swap out anonymous
> memory. But you are right that it won't be possible in all cases, like
> for in kernel memory limits.
Such kind of memory management is less efficient than the one
making decisions based on global shortages and global LRU alogrithm.

The problem here is that doing swap out takes more expensive disk I/O
influencing other users.

So throttling algorithms if wanted should be optional, not mandatory.
Lets postpone it and concentrate on the core.

Thanks,
Kirill
Re: [ckrm-tech] [RFC][PATCH 4/7] UBC: syscalls (user interface) [message #5288 is a reply to message #5199] Thu, 17 August 2006 11:09 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Wed, Aug 16, 2006 at 07:39:43PM +0400, Kirill Korotaev wrote:

> +/*
> + * The setbeanlimit syscall
> + */
> +asmlinkage long sys_setublimit(uid_t uid, unsigned long resource,
> + unsigned long *limits)
> +{

[snip]

> + spin_lock_irqsave(&ub->ub_lock, flags);
> + ub->ub_parms[resource].barrier = new_limits[0];
> + ub->ub_parms[resource].limit = new_limits[1];

Would it be usefull to notify the "resource" controller about this
change in limits? For ex: in case of the CPU controller I wrote
(http://lkml.org/lkml/2006/8/4/9), I was finding it usefull to recv
notification of changes to these limits, so that internal structures
(which are kept per-task-group) can be updated.


--
Regards,
vatsa
Re: [ckrm-tech] [RFC][PATCH] UBC: user resource beancounters [message #5289 is a reply to message #5192] Thu, 17 August 2006 11:02 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Wed, Aug 16, 2006 at 07:24:03PM +0400, Kirill Korotaev wrote:
> As the first step we want to propose for discussion
> the most complicated parts of resource management:
> kernel memory and virtual memory.

Do you have any plans to post a CPU controller? Is that tied to UBC
interface as well?

--
Regards,
vatsa
Re: [ckrm-tech] [RFC][PATCH 3/7] UBC: ub context and inheritance [message #5290 is a reply to message #5198] Thu, 17 August 2006 11:09 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Wed, Aug 16, 2006 at 07:38:44PM +0400, Kirill Korotaev wrote:
> Contains code responsible for setting UB on task,
> it's inheriting and setting host context in interrupts.
>
> Task references three beancounters:
> 1. exec_ub current context. all resources are
> charged to this beancounter.
> 2. task_ub beancounter to which task_struct is
> charged itself.
> 3. fork_sub beancounter which is inherited by
> task's children on fork

Is there a case where exec_ub and fork_sub can differ? I dont see that
in the patch.

--
Regards,
vatsa
Re: [ckrm-tech] [RFC][PATCH 2/7] UBC: core (structures, API) [message #5291 is a reply to message #5196] Thu, 17 August 2006 11:09 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Wed, Aug 16, 2006 at 07:37:26PM +0400, Kirill Korotaev wrote:
> +struct user_beancounter
> +{
> + atomic_t ub_refcount;
> + spinlock_t ub_lock;
> + uid_t ub_uid;
> + struct hlist_node hash;
> +
> + struct user_beancounter *parent;

This seems to hint at some heirarchy of ubc? How would that heirarchy be
used? I cant find anything in the patch which forms this heirarchy
(basically I dont see any place where beancounter_findcreate() is called
with non-NULL 2nd arg).

[snip]

> +static void init_beancounter_syslimits(struct user_beancounter *ub)
> +{
> + int k;
> +
> + for (k = 0; k < UB_RESOURCES; k++)
> + ub->ub_parms[k].barrier = ub->ub_parms[k].limit;

This sets barrier to 0. Is this value of 0 interpreted differently by
different controllers? One way to interpret it is "dont allocate any
resource", other way to interpret it is "don't care - give me what you
can" (which makes sense for stuff like CPU and network bandwidth).



--
Regards,
vatsa
Re: [RFC][PATCH 2/7] UBC: core (structures, API) [message #5293 is a reply to message #5275] Thu, 17 August 2006 12:14 Go to previous messageGo to previous message
Greg KH is currently offline  Greg KH
Messages: 27
Registered: February 2006
Junior Member
On Thu, Aug 17, 2006 at 03:45:56PM +0400, Kirill Korotaev wrote:
> >>+struct user_beancounter
> >>+{
> >>+ atomic_t ub_refcount;
> >
> >
> >Why not use a struct kref here instead of rolling your own reference
> >counting logic?
>
> We need more complex decrement/locking scheme than krefs
> provide. e.g. in __put_beancounter() we need
> atomic_dec_and_lock_irqsave() semantics for performance optimizations.

Ah, ok, missed that. Nevermind then :)

thanks,

greg k-h
Previous Topic: [PATCH] vzlist: Fix "cast from pointer to integer of different size" warnings
Next Topic: [RFC][PATCH 1/2] add user namespace [try #2]
Goto Forum:
  


Current Time: Fri Oct 24 21:57:52 GMT 2025

Total time taken to generate the page: 0.09908 seconds