OpenVZ Forum


Home » Mailing lists » Devel » [RFC] [PATCH 0/3] Add group fairness to CFS
[RFC] [PATCH 0/3] Add group fairness to CFS [message #18632] Wed, 23 May 2007 16:48 Go to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
Here's an attempt to extend CFS (v13) to be fair at a group level, rather than
just at task level. The patch is in a very premature state (passes
simple tests, smp load balance not supported yet) at this point. I am sending 
it out early to know if this is a good direction to proceed.

Salient points which needs discussion:

1. This patch reuses CFS core to achieve fairness at group level also.

   To make this possible, CFS core has been abstracted to deal with generic 
   schedulable "entities" (tasks, users etc).

2. The per-cpu rb-tree has been split to be per-group per-cpu.

   schedule() now becomes two step on every cpu : pick a group first (from
   group rb-tree) and a task within that group next (from that group's task
   rb-tree)

3. Grouping mechanism - I have used 'uid' as the basis of grouping for
   timebeing (since that grouping concept is already in mainline today).
   The patch can be adapted to a more generic process grouping mechanism
   (like http://lkml.org/lkml/2007/4/27/146) later.

Some results below, obtained on a 4way (with HT) Intel Xeon box. All 
number are reflective of single CPU performance (tests were forced to 
run on single cpu since load balance is not yet supported).


			         uid "vatsa"	           uid "guest"
		             (make -s -j4 bzImage)    (make -s -j20 bzImage)

2.6.22-rc1		          772.02 sec		497.42 sec (real)
2.6.22-rc1+cfs-v13 	          780.62 sec		478.35 sec (real)
2.6.22-rc1+cfs-v13+this patch     776.36 sec		776.68 sec (real)

[ An exclusive cpuset containing only one CPU was created and the
compilation jobs of both users were run simultaneously in this cpuset ]

I also disabled CONFIG_FAIR_USER_SCHED and compared the results with
cfs-v13:

					uid "vatsa"
					make -s -j4 bzImage

2.6.22-rc1+cfs-v13			395.57 sec (real)
2.6.22-rc1+cfs-v13+this_patch		388.54 sec (real)

There is no regression I can see (rather some improvement, which I can't
understand atm). I will run more tests later to check this regression aspect.

Request your comments on the future direction to proceed!


-- 
Regards,
vatsa
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
[RFC] [PATCH 1/3] task_cpu(p) needs to be correct always [message #18633 is a reply to message #18632] Wed, 23 May 2007 16:51 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
We rely very much on task_cpu(p) to be correct at all times, so that we can
correctly find the task_grp_rq from which the task has to be removed or added
to.

There is however one place in the scheduler where this assumption of
task_cpu(p) being correct is broken. This patch fixes that piece of code.

(Thanks to Balbir Singh for pointing this out to me)

Signed-off-by : Srivatsa Vaddagiri <vatsa@in.ibm.com>

---
 kernel/sched.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Index: linux-2.6.21-rc7/kernel/sched.c
===================================================================
--- linux-2.6.21-rc7.orig/kernel/sched.c	2007-05-23 20:46:41.000000000 +0530
+++ linux-2.6.21-rc7/kernel/sched.c	2007-05-23 20:48:26.000000000 +0530
@@ -4571,7 +4571,7 @@
 static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 {
 	struct rq *rq_dest, *rq_src;
-	int ret = 0;
+	int ret = 0, on_rq;
 
 	if (unlikely(cpu_is_offline(dest_cpu)))
 		return ret;
@@ -4587,9 +4587,11 @@
 	if (!cpu_isset(dest_cpu, p->cpus_allowed))
 		goto out;
 
-	set_task_cpu(p, dest_cpu);
-	if (p->on_rq) {
+	on_rq = p->on_rq;
+	if (on_rq)
 		deactivate_task(rq_src, p, 0);
+	set_task_cpu(p, dest_cpu);
+	if (on_rq) {
 		activate_task(rq_dest, p, 0);
 		check_preempt_curr(rq_dest, p);
 	}

-- 
Regards,
vatsa
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
[RFC] [PATCH 2/3] Introduce two new structures - struct lrq and sched_entity [message #18634 is a reply to message #18632] Wed, 23 May 2007 16:54 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
This patch groups together fields used by CFS (for SCHED_NORMAL tasks) 
in task_struct and runqueue into separate structures so that they can be 
reused in a later patch.

'struct sched_entity' represents the attributes used by CFS for every
schedulable entity (task in this case).

'struct lrq' represents the runqueue used to store schedulable entities (tasks 
in this case) and to maintain various clocks (ex: fair clock for tasks).

This patch also modifies rest of kernel to reflect these new structures.

Intended effect of this patch is zero on overall functionality of
scheduler.

Signed-off-by : Srivatsa Vaddagiri <vatsa@in.ibm.com>

---
 fs/proc/array.c           |    2 
 include/linux/sched.h     |   44 ++++----
 kernel/exit.c             |    2 
 kernel/posix-cpu-timers.c |   19 +--
 kernel/sched.c            |  195 +++++++++++++++++++------------------
 kernel/sched_debug.c      |   85 ++++++++--------
 kernel/sched_fair.c       |  238 +++++++++++++++++++++++-----------------------
 7 files changed, 301 insertions(+), 284 deletions(-)

Index: linux-2.6.21-rc7/fs/proc/array.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/proc/array.c	2007-05-23 20:46:40.000000000 +0530
+++ linux-2.6.21-rc7/fs/proc/array.c	2007-05-23 20:48:34.000000000 +0530
@@ -412,7 +412,7 @@
 	 * Use CFS's precise accounting, if available:
 	 */
 	if (!has_rt_policy(task)) {
-		utime = nsec_to_clock_t(task->sum_exec_runtime);
+		utime = nsec_to_clock_t(task->se.sum_exec_runtime);
 		stime = 0;
 	}
 
Index: linux-2.6.21-rc7/include/linux/sched.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/sched.h	2007-05-23 20:46:40.000000000 +0530
+++ linux-2.6.21-rc7/include/linux/sched.h	2007-05-23 20:48:34.000000000 +0530
@@ -838,6 +838,29 @@
 	void (*task_new) (struct rq *rq, struct task_struct *p);
 };
 
+/* CFS scheduling entity (task, user etc) statistics fields: */
+struct sched_entity {
+	int load_weight;	/* for niceness load balancing purposes */
+	int on_rq;
+	struct rb_node run_node;
+	u64 wait_start_fair;
+	u64 wait_start;
+	u64 exec_start;
+	u64 sleep_start, sleep_start_fair;
+	u64 block_start;
+	u64 sleep_max;
+	u64 block_max;
+	u64 exec_max;
+	u64 wait_max;
+	u64 last_ran;
+
+	s64 wait_runtime;
+	u64 sum_exec_runtime;
+	s64 fair_key;
+	s64 sum_wait_runtime, sum_sleep_runtime;
+	unsigned long wait_runtime_overruns, wait_runtime_underruns;
+};
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -852,34 +875,15 @@
 	int oncpu;
 #endif
 #endif
-	int load_weight;	/* for niceness load balancing purposes */
 
 	int prio, static_prio, normal_prio;
-	int on_rq;
 	struct list_head run_list;
-	struct rb_node run_node;
+	struct sched_entity se;
 
 	unsigned short ioprio;
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	unsigned int btrace_seq;
 #endif
-	/* CFS scheduling class statistics fields: */
-	u64 wait_start_fair;
-	u64 wait_start;
-	u64 exec_start;
-	u64 sleep_start, sleep_start_fair;
-	u64 block_start;
-	u64 sleep_max;
-	u64 block_max;
-	u64 exec_max;
-	u64 wait_max;
-	u64 last_ran;
-
-	s64 wait_runtime;
-	u64 sum_exec_runtime;
-	s64 fair_key;
-	s64 sum_wait_runtime, sum_sleep_runtime;
-	unsigned long wait_runtime_overruns, wait_runtime_underruns;
 
 	unsigned int policy;
 	cpumask_t cpus_allowed;
Index: linux-2.6.21-rc7/kernel/exit.c
===================================================================
--- linux-2.6.21-rc7.orig/kernel/exit.c	2007-05-23 20:46:40.000000000 +0530
+++ linux-2.6.21-rc7/kernel/exit.c	2007-05-23 20:48:34.000000000 +0530
@@ -124,7 +124,7 @@
 		sig->nivcsw += tsk->nivcsw;
 		sig->inblock += task_io_get_inblock(tsk);
 		sig->oublock += task_io_get_oublock(tsk);
-		sig->sum_sched_runtime += tsk->sum_exec_runtime;
+		sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
 		sig = NULL; /* Marker for below. */
 	}
 
Index: linux-2.6.21-rc7/kernel/posix-cpu-timers.c
===================================================================
--- linux-2.6.21-rc7.orig/kernel/posix-cpu-timers.c	2007-05-23 20:46:40.000000000 +0530
+++ linux-2.6.21-rc7/kernel/posix-cpu-timers.c	2007-05-23 20:48:34.000000000 +0530
@@ -161,7 +161,8 @@
 }
 static inline unsigned long long sched_ns(struct task_struct *p)
 {
-	return (p == current) ? current_sched_runtime(p) : p->sum_exec_runtime;
+	return (p == current) ? current_sched_runtime(p) :
+				 p->se.sum_exec_runtime;
 }
 
 int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec *tp)
@@ -249,7 +250,7 @@
 		cpu->sched = p->signal->sum_sched_runtime;
 		/* Add in each other live thread.  */
 		while ((t = next_thread(t)) != p) {
-			cpu->sched += t->sum_exec_runtime;
+			cpu->sched += t->se.sum_exec_runtime;
 		}
 		cpu->sched += sched_ns(p);
 		break;
@@ -467,7 +468,7 @@
 void posix_cpu_timers_exit(struct task_struct *tsk)
 {
 	cleanup_timers(tsk->cpu_timers,
-		       tsk->utime, tsk->stime, tsk->sum_exec_runtime);
+		       tsk->utime, tsk->stime, tsk->se.sum_exec_runtime);
 
 }
 void posix_cpu_timers_exit_group(struct task_struct *tsk)
@@ -475,7 +476,7 @@
 	cleanup_timers(tsk->signal->cpu_timers,
 		       cputime_add(tsk->utime, tsk->signal->utime),
 		       cputime_add(tsk->stime, tsk->signal->stime),
-		       tsk->sum_exec_runtime + tsk->signal->sum_sched_runtime);
+		     tsk->se.sum_exec_runtime + tsk->signal->sum_sched_runtime);
 }
 
 
@@ -536,7 +537,7 @@
 		nsleft = max_t(unsigned long long, nsleft, 1);
 		do {
 			if (likely(!(t->flags & PF_EXITING))) {
-				ns = t->sum_exec_runtime + nsleft;
+				ns = t->se.sum_exec_runtime + nsleft;
 				if (t->it_sched_expires == 0 ||
 				    t->it_sched_expires > ns) {
 					t->it_sched_expires = ns;
@@ -1004,7 +1005,7 @@
 		struct cpu_timer_list *t = list_first_entry(timers,
 						      struct cpu_timer_list,
 						      entry);
-		if (!--maxfire || tsk->sum_exec_runtime < t->expires.sched) {
+		if (!--maxfire || tsk->se.sum_exec_runtime < t->expires.sched) {
 			tsk->it_sched_expires = t->expires.sched;
 			break;
 		}
@@ -1049,7 +1050,7 @@
 	do {
 		utime = cputime_add(utime, t->utime);
 		stime = cputime_add(stime, t->stime);
-		sum_sched_runtime += t->sum_exec_runtime;
+		sum_sched_runtime += t->se.sum_exec_runtime;
 		t = next_thread(t);
 	} while (t != tsk);
 	ptime = cputime_add(utime, stime);
@@ -1208,7 +1209,7 @@
 				t->it_virt_expires = ticks;
 			}
 
-			sched = t->sum_exec_runtime + sched_left;
+			sched = t->se.sum_exec_runtime + sched_left;
 			if (sched_expires && (t->it_sched_expires == 0 ||
 					      t->it_sched_expires > sched)) {
 				t->it_sched_expires = sched;
@@ -1300,7 +1301,7 @@
 
 	if (UNEXPIRED(prof) && UNEXPIRED(virt) &&
 	    (tsk->it_sched_expires == 0 ||
-	     tsk->sum_exec_runtime < tsk->it_sched_expires))
+	     tsk->se.sum_exec_runtime < tsk->it_sched_expires))
 		return;
 
 #undef	UNEXPIRED
Index: linux-2.6.21-rc7/kernel/sched.c
===================================================================
--- linux-2.6.21-rc7.orig/kernel/sched.c	2007-05-23 20:48:26.000000000 +0530
+++ linux-2.6.21-rc7/kernel/sched.c	2007-05-23 20:48:34.000000000 +0530
@@ -114,6 +114,23 @@
 	struct list_head queue[MAX_RT_PRIO];
 };
 
+/* CFS-related fields in a runqueue */
+struct lrq {
+	unsigned long raw_weighted_load;
+	#define CPU_LOAD_IDX_MAX 5
+	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
+	unsigned long nr_load_updates;
+
+	u64 fair_clock, prev_fair_clock;
+	u64 exec_clock, prev_exec_clock;
+	s64 wait_runtime;
+	unsigned long wait_runtime_overruns, wait_runtime_underruns;
+
+	struct rb_root tasks_timeline;
+	struct rb_node *rb_leftmost;
+	struct rb_node *rb_load_balance_curr;
+};
+
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -129,16 +146,13 @@
 	 * remote CPUs use both these fields when doing load calculation.
 	 */
 	long nr_running;
-	unsigned long raw_weighted_load;
-	#define CPU_LOAD_IDX_MAX 5
-	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
+	struct lrq lrq;
 
 	unsigned char idle_at_tick;
 #ifdef CONFIG_NO_HZ
 	unsigned char in_nohz_recently;
 #endif
 	u64 nr_switches;
-	unsigned long nr_load_updates;
 
 	/*
 	 * This is part of a global counter where only the total sum
@@ -154,10 +168,6 @@
 
 	u64 clock, prev_clock_raw;
 	s64 clock_max_delta;
-	u64 fair_clock, prev_fair_clock;
-	u64 exec_clock, prev_exec_clock;
-	s64 wait_runtime;
-	unsigned long wait_runtime_overruns, wait_runtime_underruns;
 
 	unsigned int clock_warps;
 	unsigned int clock_unstable_events;
@@ -168,10 +178,6 @@
 	int rt_load_balance_idx;
 	struct list_head *rt_load_balance_head, *rt_load_balance_curr;
 
-	struct rb_root tasks_timeline;
-	struct rb_node *rb_leftmost;
-	struct rb_node *rb_load_balance_curr;
-
 	atomic_t nr_iowait;
 
 #ifdef CONFIG_SMP
@@ -573,13 +579,13 @@
 static inline void
 inc_raw_weighted_load(struct rq *rq, const struct task_struct *p)
 {
-	rq->raw_weighted_load += p->load_weight;
+	rq->lrq.raw_weighted_load += p->se.load_weight;
 }
 
 static inline void
 dec_raw_weighted_load(struct rq *rq, const struct task_struct *p)
 {
-	rq->raw_weighted_load -= p->load_weight;
+	rq->lrq.raw_weighted_load -= p->se.load_weight;
 }
 
 static inline void inc_nr_running(struct task_struct *p, struct rq *rq)
@@ -605,22 +611,22 @@
 
 static void set_load_weight(struct task_struct *p)
 {
-	task_rq(p)->wait_runtime -= p->wait_runtime;
-	p->wait_runtime = 0;
+	task_rq(p)->lrq.wait_runtime -= p->se.wait_runtime;
+	p->se.wait_runtime = 0;
 
 	if (has_rt_policy(p)) {
-		p->load_weight = prio_to_weight[0] * 2;
+		p->se.load_weight = prio_to_weight[0] * 2;
 		return;
 	}
 	/*
 	 * SCHED_BATCH tasks get minimal weight:
 	 */
 	if (p->policy == SCHED_BATCH) {
-		p->lo
...

[RFC] [PATCH 3/3] Generalize CFS core and provide per-user fairness [message #18635 is a reply to message #18632] Wed, 23 May 2007 16:56 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
This patch reuses CFS core to provide inter task-group fairness. For
demonstration purpose, the patch also extends CFS to provide per-user
fairness. The patch is very experimental atm and in particular, SMP LOAD
BALANCE IS DISABLED to keep the patch simple. I think group-based smp
load
balance is more trickier and I intend to look at it next.

Although user id is chosen as the basis of grouping tasks, the patch can
be adapted to work with other task grouping mechanisms (like :
http://lkml.org/lkml/2007/4/27/146).

Signed-off-by : Srivatsa Vaddagiri <vatsa@in.ibm.com>


---
 arch/i386/Kconfig     |    6 
 arch/x86_64/Kconfig   |    6 
 include/linux/sched.h |   16 
 kernel/sched.c        |  256 +++++++++++----
 kernel/sched_debug.c  |    4 
 kernel/sched_fair.c   |  820 ++++++++++++++++++++++++++++++++------------------
 kernel/sched_rt.c     |    2 
 kernel/user.c         |    5 
 8 files changed, 753 insertions(+), 362 deletions(-)

Index: linux-2.6.21-rc7/arch/i386/Kconfig
===================================================================
--- linux-2.6.21-rc7.orig/arch/i386/Kconfig	2007-05-23 20:46:38.000000000 +0530
+++ linux-2.6.21-rc7/arch/i386/Kconfig	2007-05-23 20:48:39.000000000 +0530
@@ -307,6 +307,12 @@
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.
 
+config FAIR_USER_SCHED
+	bool "Fair user scheduler"
+	default n
+	help
+		Fair user scheduler
+
 source "kernel/Kconfig.preempt"
 
 config X86_UP_APIC
Index: linux-2.6.21-rc7/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.21-rc7.orig/arch/x86_64/Kconfig	2007-05-23 20:46:38.000000000 +0530
+++ linux-2.6.21-rc7/arch/x86_64/Kconfig	2007-05-23 20:48:39.000000000 +0530
@@ -330,6 +330,12 @@
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.
 
+config FAIR_USER_SCHED
+	bool "Fair user scheduler"
+	default n
+	help
+		Fair user scheduler
+
 source "kernel/Kconfig.preempt"
 
 config NUMA
Index: linux-2.6.21-rc7/include/linux/sched.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/sched.h	2007-05-23 20:48:34.000000000 +0530
+++ linux-2.6.21-rc7/include/linux/sched.h	2007-05-23 20:48:39.000000000 +0530
@@ -551,6 +551,16 @@
 #define is_rt_policy(p)		((p) != SCHED_NORMAL && (p) != SCHED_BATCH)
 #define has_rt_policy(p)	unlikely(is_rt_policy((p)->policy))
 
+#ifdef CONFIG_FAIR_USER_SCHED
+int sched_alloc_user(struct user_struct *user);
+void sched_free_user(struct user_struct *user);
+void sched_move_task(struct user_struct *old);
+#else
+static inline int sched_alloc_user(struct user_struct *user) { return 0; }
+static inline void sched_free_user(struct user_struct *user) { }
+static inline void sched_move_task(struct user_struct *user) { }
+#endif
+
 /*
  * Some day this will be a full-fledged user tracking system..
  */
@@ -575,6 +585,10 @@
 	/* Hash table maintenance information */
 	struct list_head uidhash_list;
 	uid_t uid;
+#ifdef CONFIG_FAIR_USER_SCHED
+	struct sched_entity *se;        /* per-cpu sched_entity */
+	struct lrq *lrq;        /* per-cpu runqueue for this user */
+#endif
 };
 
 extern struct user_struct *find_user(uid_t);
@@ -859,6 +873,8 @@
 	s64 fair_key;
 	s64 sum_wait_runtime, sum_sleep_runtime;
 	unsigned long wait_runtime_overruns, wait_runtime_underruns;
+	struct sched_entity *parent;
+	struct lrq *my_q;       /* The queue owned by this entity */
 };
 
 struct task_struct {
Index: linux-2.6.21-rc7/kernel/sched.c
===================================================================
--- linux-2.6.21-rc7.orig/kernel/sched.c	2007-05-23 20:48:34.000000000 +0530
+++ linux-2.6.21-rc7/kernel/sched.c	2007-05-23 20:48:39.000000000 +0530
@@ -129,6 +129,14 @@
 	struct rb_root tasks_timeline;
 	struct rb_node *rb_leftmost;
 	struct rb_node *rb_load_balance_curr;
+	struct sched_entity *curr;
+	unsigned int *sched_granularity;	/* &sysctl_sched_granularity */
+	struct rq *rq;
+	unsigned long nice_0_load;
+#ifdef CONFIG_FAIR_USER_SCHED
+	struct list_head lrq_list;
+	struct rcu_head rcu;
+#endif
 };
 
 /*
@@ -164,6 +172,7 @@
 
 	struct task_struct *curr, *idle;
 	unsigned long next_balance;
+	unsigned long rt_load;
 	struct mm_struct *prev_mm;
 
 	u64 clock, prev_clock_raw;
@@ -214,6 +223,32 @@
 	struct lock_class_key rq_lock_key;
 };
 
+#define NICE_0_LOAD	SCHED_LOAD_SCALE
+#define NICE_0_SHIFT	SCHED_LOAD_SHIFT
+
+#ifdef CONFIG_FAIR_USER_SCHED
+static struct sched_entity root_user_se[NR_CPUS];
+static struct lrq root_user_lrq[NR_CPUS];
+
+static inline void init_se(struct sched_entity *se, struct lrq *lrq)
+{
+	se->my_q = lrq;
+	se->load_weight = NICE_0_LOAD;
+}
+
+static inline void init_lrq(struct lrq *lrq, struct rq *rq)
+{
+	lrq->rq = rq;
+	lrq->fair_clock = 1;
+	lrq->tasks_timeline = RB_ROOT;
+	lrq->nice_0_load = NICE_0_LOAD;
+	lrq->sched_granularity = &sysctl_sched_granularity;
+	INIT_LIST_HEAD(&lrq->lrq_list);
+	list_add_rcu(&lrq->lrq_list, &rq->lrq.lrq_list);
+}
+
+#endif
+
 static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
 static DEFINE_MUTEX(sched_hotcpu_mutex);
 
@@ -555,9 +590,6 @@
 #define RTPRIO_TO_LOAD_WEIGHT(rp) \
 	(PRIO_TO_LOAD_WEIGHT(MAX_RT_PRIO) + LOAD_WEIGHT(rp))
 
-#define NICE_0_LOAD	SCHED_LOAD_SCALE
-#define NICE_0_SHIFT	SCHED_LOAD_SHIFT
-
 /*
  * Nice levels are multiplicative, with a gentle 10% change for every
  * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
@@ -576,16 +608,22 @@
 /*  10 */   110,    87,    70,    56,    45,    36,    29,    23,    18,    15,
 };
 
+extern struct sched_class rt_sched_class;
+
 static inline void
 inc_raw_weighted_load(struct rq *rq, const struct task_struct *p)
 {
-	rq->lrq.raw_weighted_load += p->se.load_weight;
+	/* Hack - needs better handling */
+	if (p->sched_class == &rt_sched_class)
+		rq->rt_load += p->se.load_weight;
 }
 
 static inline void
 dec_raw_weighted_load(struct rq *rq, const struct task_struct *p)
 {
-	rq->lrq.raw_weighted_load -= p->se.load_weight;
+	/* Hack - needs better handling */
+	if (p->sched_class == &rt_sched_class)
+		rq->rt_load -= p->se.load_weight;
 }
 
 static inline void inc_nr_running(struct task_struct *p, struct rq *rq)
@@ -728,10 +766,32 @@
 	return cpu_curr(task_cpu(p)) == p;
 }
 
+#ifdef CONFIG_FAIR_USER_SCHED
+
+#define for_each_lrq(rq, lrq) \
+	for (lrq = container_of((rq)->lrq.lrq_list.next, struct lrq, lrq_list);\
+	     prefetch(rcu_dereference(lrq->lrq_list.next)), lrq != &(rq)->lrq;\
+	     lrq = container_of(lrq->lrq_list.next, struct lrq, lrq_list))
+
+#else
+
+#define for_each_lrq(rq, lrq) \
+		for (lrq = &rq->lrq; lrq != NULL; lrq = NULL)
+
+#endif
+
 /* Used instead of source_load when we know the type == 0 */
 unsigned long weighted_cpuload(const int cpu)
 {
-	return cpu_rq(cpu)->lrq.raw_weighted_load;
+	struct lrq *lrq;
+	unsigned long weight = 0;
+
+	for_each_lrq(cpu_rq(cpu), lrq)
+		weight += lrq->raw_weighted_load;
+
+	weight += cpu_rq(cpu)->rt_load;
+
+	return weight;
 }
 
 #ifdef CONFIG_SMP
@@ -761,6 +821,10 @@
 	if (p->se.sleep_start_fair)
 		p->se.sleep_start_fair -= fair_clock_offset;
 
+#ifdef CONFIG_FAIR_USER_SCHED
+	p->se.parent = &p->user->se[new_cpu];
+#endif
+
 	task_thread_info(p)->cpu = new_cpu;
 
 }
@@ -863,12 +927,18 @@
  */
 static inline unsigned long source_load(int cpu, int type)
 {
-	struct rq *rq = cpu_rq(cpu);
+	unsigned long rwl, cpl = 0;
+	struct lrq *lrq;
+
+	rwl = weighted_cpuload(cpu);
 
 	if (type == 0)
-		return rq->lrq.raw_weighted_load;
+		return rwl;
+
+	for_each_lrq(cpu_rq(cpu), lrq)
+		cpl += lrq->cpu_load[type-1];
 
-	return min(rq->lrq.cpu_load[type-1], rq->lrq.raw_weighted_load);
+	return min(cpl, rwl);
 }
 
 /*
@@ -877,12 +947,18 @@
  */
 static inline unsigned long target_load(int cpu, int type)
 {
-	struct rq *rq = cpu_rq(cpu);
+	unsigned long rwl, cpl = 0;
+	struct lrq *lrq;
+
+	rwl = weighted_cpuload(cpu);
 
 	if (type == 0)
-		return rq->lrq.raw_weighted_load;
+		return rwl;
+
+	for_each_lrq(cpu_rq(cpu), lrq)
+		cpl += lrq->cpu_load[type-1];
 
-	return max(rq->lrq.cpu_load[type-1], rq->lrq.raw_weighted_load);
+	return max(cpl, rwl);
 }
 
 /*
@@ -893,7 +969,7 @@
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long n = rq->nr_running;
 
-	return n ? rq->lrq.raw_weighted_load / n : SCHED_LOAD_SCALE;
+	return n ? weighted_cpuload(cpu) / n : SCHED_LOAD_SCALE;
 }
 
 /*
@@ -1583,59 +1659,6 @@
 	return running + uninterruptible;
 }
 
-static void update_load_fair(struct rq *this_rq)
-{
-	unsigned long this_load, fair_delta, exec_delta, idle_delta;
-	unsigned int i, scale;
-	s64 fair_delta64, exec_delta64;
-	unsigned long tmp;
-	u64 tmp64;
-
-	this_rq->lrq.nr_load_updates++;
-	if (!(sysctl_sched_load_smoothing & 64)) {
-		this_load = this_rq->lrq.raw_weighted_load;
-		goto do_avg;
-	}
-
-	fair_delta64 = this_rq->lrq.fair_clock -
-			 this_rq->lrq.prev_fair_clock + 1;
-	this_rq->lrq.prev_fair_clock = this_rq->lrq.fair_clock;
-
-	exec_delta64 = this_rq->lrq.exec_clock -
-			 this_rq->lrq.prev_exec_clock + 1;
-	this_rq->lrq.prev_exec_clock = this_rq->lrq.exec_clock;
-
-	if (fair_delta64 > (s64)LONG_MAX)
-		fair_delta64 = (s64)LONG_MAX;
-	fair_delta = (unsigned long)fair_delta64;
-
-	if (exec_delta64 > (s64)LONG_MAX)
-		exec_delta64 = (s64)LONG_MAX;
-	exec_delta = (unsigned long)exec_delta64;
-	if (exec_delta > TICK_NSEC)
-		exec_delta = TICK_NSEC;
-
-	idle_delta = TICK_NSEC - exec_delta;
-
-	tmp = (SCHED_LOAD_SCALE * exec_delta) / fair_delta;
-	tmp64 = (u64)tmp * (u64)exec_delta;
-	do_div(tmp64, TICK_NSEC);
-	this_load = (unsigned long)tmp64;
-
-do_avg:
-	/* Update our load: */
-	for (i = 0, scale = 1; i < CPU_LOAD_IDX_MA
...

Re: [RFC] [PATCH 0/3] Add group fairness to CFS [message #18653 is a reply to message #18632] Thu, 24 May 2007 03:15 Go to previous messageGo to next message
Peter Williams is currently offline  Peter Williams
Messages: 7
Registered: May 2007
Junior Member
Srivatsa Vaddagiri wrote:
> Here's an attempt to extend CFS (v13) to be fair at a group level, rather than
> just at task level. The patch is in a very premature state (passes
> simple tests, smp load balance not supported yet) at this point. I am sending 
> it out early to know if this is a good direction to proceed.
> 
> Salient points which needs discussion:
> 
> 1. This patch reuses CFS core to achieve fairness at group level also.
> 
>    To make this possible, CFS core has been abstracted to deal with generic 
>    schedulable "entities" (tasks, users etc).
> 
> 2. The per-cpu rb-tree has been split to be per-group per-cpu.
> 
>    schedule() now becomes two step on every cpu : pick a group first (from
>    group rb-tree) and a task within that group next (from that group's task
>    rb-tree)
> 
> 3. Grouping mechanism - I have used 'uid' as the basis of grouping for
>    timebeing (since that grouping concept is already in mainline today).
>    The patch can be adapted to a more generic process grouping mechanism
>    (like http://lkml.org/lkml/2007/4/27/146) later.
> 
> Some results below, obtained on a 4way (with HT) Intel Xeon box. All 
> number are reflective of single CPU performance (tests were forced to 
> run on single cpu since load balance is not yet supported).
> 
> 
> 			         uid "vatsa"	           uid "guest"
> 		             (make -s -j4 bzImage)    (make -s -j20 bzImage)
> 
> 2.6.22-rc1		          772.02 sec		497.42 sec (real)
> 2.6.22-rc1+cfs-v13 	          780.62 sec		478.35 sec (real)
> 2.6.22-rc1+cfs-v13+this patch     776.36 sec		776.68 sec (real)

This would seem to indicate that being fair between groups isn't always 
a good thing.  With 2.6.22-rc1 and 2.6.22-rc1+cfs-v13 "guest" gets his 
build done in about 2/3 the time of "vatsa" without seriously 
inconveniencing "vatsa".  All making scheduling fair between the groups 
has done is penalize "guest" without significantly improving matters for 
"vatsa" (he gains a mere 4 seconds out of 780).

BUT I imagine that this is an artefact caused by the use of HT 
technology and that if the test were run on a computer without HT the 
results would be more impressive.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [RFC] [PATCH 0/3] Add group fairness to CFS [message #18663 is a reply to message #18632] Fri, 25 May 2007 17:14 Go to previous messageGo to next message
tong.n.li is currently offline  tong.n.li
Messages: 1
Registered: May 2007
Junior Member
On Fri, 2007-05-25 at 21:44 +0530, Srivatsa Vaddagiri wrote:
> > 
> > That assumes per-user scheduling groups; other configurations would
> > make it one step for each level of hierarchy. It may be possible to
> > reduce those steps to only state transitions that change weightings
> > and incremental updates of task weightings. By and large, one needs
> > the groups to determine task weightings as opposed to hierarchically
> > scheduling, so there are alternative ways of going about this, ones
> > that would even make load balancing easier.
> 
> Yeah I agree that providing hierarchical group-fairness at the cost of single 
> (or fewer) scheduling levels would be a nice thing to target for,
> although I don't know of any good way to do it. Do you have any ideas
> here? Doing group fairness in a single level, using a common rb-tree for tasks 
> from all groups is very difficult IMHO. We need atleast two levels.
> 
> One possibility is that we recognize deeper hierarchies only in user-space,
> but flatten this view from kernel perspective i.e some user space tool
> will have to distributed the weights accordingly in this flattened view
> to the kernel.

Nice work, Vatsa. When I wrote the DWRR algorithm, I flattened the
hierarchies into one level, so maybe that approach can be applied to
your code as well. What I did is to maintain task and task group weights
and reservations separately from the scheduler, while the scheduler only
sees one system-wide weight per task and does not concern about which
group a task is in. The key here is the system-wide weight of each task
should represent an equivalent share to the share represented by the
group hierarchies. To do this, the scheduler looks up the task and group
weights/reservations it maintains, and dynamically computes the
system-wide weight *only* when it need a weight for a given task while
scheduling. The on-demand weight computation makes sure the cost is
small (constant time). The computation itself can be seen from an
example: assume we have a group of two tasks and the group's total share
is represented by a weight of 10. Inside the group, let's say the two
tasks, P1 and P2, have weights 1 and 2. Then the system-wide weight for
P1 is 10/3 and the weight for P2 is 20/3. In essence, this flattens
weights into one level without changing the shares they represent.

  tong
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [RFC] [PATCH 0/3] Add group fairness to CFS [message #18664 is a reply to message #18632] Fri, 25 May 2007 08:29 Go to previous messageGo to next message
Ingo Molnar is currently offline  Ingo Molnar
Messages: 51
Registered: December 2005
Member
* Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote:

> Can you repeat your tests with this patch pls? With the patch applied, 
> I am now getting the same split between nice 0 and nice 10 task as 
> CFS-v13 provides (90:10 as reported by top )
> 
>  5418 guest     20   0  2464  304  236 R   90  0.0   5:41.40 3 hog
>  5419 guest     30  10  2460  304  236 R   10  0.0   0:43.62 3 nice10hog

btw., what are you thoughts about SMP?

it's a natural extension of your current code. I think the best approach 
would be to add a level of 'virtual CPU' objects above struct user. (how 
to set the attributes of those objects is open - possibly combine it 
with cpusets?)

That way the scheduler would first pick a "virtual CPU" to schedule, and 
then pick a user from that virtual CPU, and then a task from the user. 

To make group accounting scalable, the accounting object attached to the 
user struct should/must be per-cpu (per-vcpu) too. That way we'd have a 
clean hierarchy like:

  CPU #0 => VCPU A [ 40% ] + VCPU B [ 60% ]
  CPU #1 => VCPU C [ 30% ] + VCPU D [ 70% ]

  VCPU A => USER X [ 10% ] + USER Y [ 90% ]
  VCPU B => USER X [ 10% ] + USER Y [ 90% ]
  VCPU C => USER X [ 10% ] + USER Y [ 90% ]
  VCPU D => USER X [ 10% ] + USER Y [ 90% ]

the scheduler first picks a vcpu, then a user from a vcpu. (the actual 
external structure of the hierarchy should be opaque to the scheduler 
core, naturally, so that we can use other hierarchies too)

whenever the scheduler does accounting, it knows where in the hierarchy 
it is and updates all higher level entries too. This means that the 
accounting object for USER X is replicated for each VCPU it participates 
in.

SMP balancing is straightforward: it would fundamentally iterate through 
the same hierarchy and would attempt to keep all levels balanced - i 
abstracted away its iterators already.

Hm? 

	Ingo
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [RFC] [PATCH 0/3] Add group fairness to CFS [message #18670 is a reply to message #18632] Fri, 25 May 2007 09:30 Go to previous messageGo to next message
Guillaume Chazarain is currently offline  Guillaume Chazarain
Messages: 3
Registered: May 2007
Junior Member
Srivatsa Vaddagiri a écrit :
> Can you repeat your tests with this patch pls? With the patch applied, I am
> now getting the same split between nice 0 and nice 10 task as CFS-v13
> provides (90:10 as reported by top )

Yep, this fixes the problem for me too.

Thanks.

-- 
Guillaume
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [RFC] [PATCH 0/3] Add group fairness to CFS [message #18671 is a reply to message #18632] Fri, 25 May 2007 11:11 Go to previous messageGo to next message
Ingo Molnar is currently offline  Ingo Molnar
Messages: 51
Registered: December 2005
Member
* Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote:

> On Fri, May 25, 2007 at 10:29:51AM +0200, Ingo Molnar wrote:
> > btw., what are you thoughts about SMP?
> 
> I was planning on reusing smpnice concepts here, with the difference 
> that we balance group weights across CPU in addition to total weight 
> of CPUs.

ok, that would be (much) simpler that any explicit vcpu approach. Do you 
plan to add this next?

	Ingo
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [RFC] [PATCH 0/3] Add group fairness to CFS [message #18675 is a reply to message #18632] Fri, 25 May 2007 12:05 Go to previous messageGo to next message
Ingo Molnar is currently offline  Ingo Molnar
Messages: 51
Registered: December 2005
Member
* Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote:

> On Fri, May 25, 2007 at 01:11:40PM +0200, Ingo Molnar wrote:
> > > I was planning on reusing smpnice concepts here, with the difference 
> > > that we balance group weights across CPU in addition to total weight 
> > > of CPUs.
> > 
> > ok, that would be (much) simpler that any explicit vcpu approach. Do you 
> > plan to add this next?
> 
> sure, after i first cleanup current group changes a bit. Hopefully I 
> can send you first version of smp support by early next week.

great. Btw., could you please keep the "up to this point there should be 
no behavioral change in CFS" fundamental splitup of your patches - that 
way i can look at the core changes (and possibly apply them) without 
having to consider the uid based changes (which do change behavior and 
which need more upstream buy-in).

	Ingo
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [RFC] [PATCH 0/3] Add group fairness to CFS [message #18681 is a reply to message #18664] Fri, 25 May 2007 10:56 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Fri, May 25, 2007 at 10:29:51AM +0200, Ingo Molnar wrote:
> btw., what are you thoughts about SMP?

I was planning on reusing smpnice concepts here, with the difference
that we balance group weights across CPU in addition to total weight of
CPUs. 

For ex, assuming weight of each task is 10

CPU0 => USER X (Wt 100) + USER Y (Wt 200) => total weight 300
CPU1 => USER X (Wt 30)  + USER Y (Wt 80)  => total weight 110

So first we notice that CPU0 and CPU1 are imbalanced by weight 190 and
target at reducing this imbalance by half i.e CPU1 has to pull total
weight of 95 (190/2) from CPU0. However while pulling weights, we apply
the same imbalance/2 rule at group level also. For ex: we cannot pull more than
70/2 = 35 from USER X on CPU0  or more than 120/2 = 60 from USER Y on CPU0.

Using this rule, after balance, the two CPUs may look like:

CPU0 => USER X (Wt 70) + USER Y (Wt 140) => total weight 210
CPU1 => USER X (Wt 60) + USER Y (Wt 140) => total weight 200

I had tried this approach earlier (in
https://lists.linux-foundation.org/pipermail/containers/2007-April/004580.html)
and had obtained decent results. It also required minimal changes to
smpnice.

Compared to this, what better degree of control/flexibilty does virtual
cpu approach give?

> it's a natural extension of your current code. I think the best approach 
> would be to add a level of 'virtual CPU' objects above struct user. (how 
> to set the attributes of those objects is open - possibly combine it 
> with cpusets?)

are these virtual CPUs visible to users (ex: does smp_processor_id()
return virtual cpu id rather than physical id and does DEFINE_PER_CPU
create per-cpu data for virtual CPUs rather than physical cpus)?

> That way the scheduler would first pick a "virtual CPU" to schedule, 

are virtual cpus pinned to their physical cpu or can they bounce around?
i.e can CPU #0 schedule VCPU D (in your example below)? If bouncing is
allowed, I am not sure whether that is a good thing for performance. How
do we minimize this performance cost?

> and then pick a user from that virtual CPU, and then a task from the user. 
> 
> To make group accounting scalable, the accounting object attached to the 
> user struct should/must be per-cpu (per-vcpu) too. That way we'd have a 
> clean hierarchy like:
> 
>   CPU #0 => VCPU A [ 40% ] + VCPU B [ 60% ]
>   CPU #1 => VCPU C [ 30% ] + VCPU D [ 70% ]
> 
>   VCPU A => USER X [ 10% ] + USER Y [ 90% ]
>   VCPU B => USER X [ 10% ] + USER Y [ 90% ]
>   VCPU C => USER X [ 10% ] + USER Y [ 90% ]
>   VCPU D => USER X [ 10% ] + USER Y [ 90% ]
> 
> the scheduler first picks a vcpu, then a user from a vcpu. (the actual 
> external structure of the hierarchy should be opaque to the scheduler 
> core, naturally, so that we can use other hierarchies too)
> 
> whenever the scheduler does accounting, it knows where in the hierarchy 
> it is and updates all higher level entries too. This means that the 
> accounting object for USER X is replicated for each VCPU it participates 
> in.
> 
> SMP balancing is straightforward: it would fundamentally iterate through 
> the same hierarchy and would attempt to keep all levels balanced - i 
> abstracted away its iterators already
> 
> Hm? 

-- 
Regards,
vatsa
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [RFC] [PATCH 0/3] Add group fairness to CFS [message #18684 is a reply to message #18671] Fri, 25 May 2007 11:28 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Fri, May 25, 2007 at 01:11:40PM +0200, Ingo Molnar wrote:
> > I was planning on reusing smpnice concepts here, with the difference 
> > that we balance group weights across CPU in addition to total weight 
> > of CPUs.
> 
> ok, that would be (much) simpler that any explicit vcpu approach. Do you 
> plan to add this next?

sure, after i first cleanup current group changes a bit. Hopefully I can
send you first version of smp support by early next week.

-- 
Regards,
vatsa
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [RFC] [PATCH 0/3] Add group fairness to CFS [message #18685 is a reply to message #18664] Fri, 25 May 2007 13:05 Go to previous messageGo to next message
dev is currently offline  dev
Messages: 1693
Registered: September 2005
Location: Moscow
Senior Member

Ingo Molnar wrote:
> * Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote:
> 
> 
>>Can you repeat your tests with this patch pls? With the patch applied, 
>>I am now getting the same split between nice 0 and nice 10 task as 
>>CFS-v13 provides (90:10 as reported by top )
>>
>> 5418 guest     20   0  2464  304  236 R   90  0.0   5:41.40 3 hog
>> 5419 guest     30  10  2460  304  236 R   10  0.0   0:43.62 3 nice10hog
> 
> 
> btw., what are you thoughts about SMP?
> 
> it's a natural extension of your current code. I think the best approach 
> would be to add a level of 'virtual CPU' objects above struct user. (how 
> to set the attributes of those objects is open - possibly combine it 
> with cpusets?)

> That way the scheduler would first pick a "virtual CPU" to schedule, and 
> then pick a user from that virtual CPU, and then a task from the user. 

don't you mean the vice versa:
first use to scheduler, then VCPU (which is essentially a runqueue or rbtree),
then a task from VCPU?

this is the approach we use in OpenVZ and if you don't mind
I would propose to go this way for fair-scheduling in mainstream.
It has it's own advantages and disatvantages.

This is not the easy way to go and I can outline the problems/disadvantages
which appear on this way:
- tasks which bind to CPU mask will bind to virtual CPUs.
  no problem with user tasks, but some kernel threads
  use this to do CPU-related management (like cpufreq).
  This can be fixed using SMP IPI actually.
- VCPUs should no change PCPUs very frequently,
  otherwise there is some overhead. Solvable.

Advantages:
- High precision and fairness.
- Allows to use different group scheduling algorithms
  on top of VCPU concept.
  OpenVZ uses fairscheduler with CPU limiting feature allowing
  to set maximum CPU time given to a group of tasks.

> To make group accounting scalable, the accounting object attached to the 
> user struct should/must be per-cpu (per-vcpu) too. That way we'd have a 
> clean hierarchy like:
> 
>   CPU #0 => VCPU A [ 40% ] + VCPU B [ 60% ]
>   CPU #1 => VCPU C [ 30% ] + VCPU D [ 70% ]

how did you select these 40%:60% and 30%:70% split?

>   VCPU A => USER X [ 10% ] + USER Y [ 90% ]
>   VCPU B => USER X [ 10% ] + USER Y [ 90% ]
>   VCPU C => USER X [ 10% ] + USER Y [ 90% ]
>   VCPU D => USER X [ 10% ] + USER Y [ 90% ]
> 
> the scheduler first picks a vcpu, then a user from a vcpu. (the actual 
> external structure of the hierarchy should be opaque to the scheduler 
> core, naturally, so that we can use other hierarchies too)
> 
> whenever the scheduler does accounting, it knows where in the hierarchy 
> it is and updates all higher level entries too. This means that the 
> accounting object for USER X is replicated for each VCPU it participates 
> in.

So if 2 VCPUs running on 2 physical CPUs do accounting the have to update the same
user X accounting information which is not per-[v]cpu?

> SMP balancing is straightforward: it would fundamentally iterate through 
> the same hierarchy and would attempt to keep all levels balanced - i 
> abstracted away its iterators already.

Thanks,
Kirill
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [RFC] [PATCH 0/3] Add group fairness to CFS [message #18686 is a reply to message #18675] Fri, 25 May 2007 12:41 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Fri, May 25, 2007 at 02:05:36PM +0200, Ingo Molnar wrote:
> great. Btw., could you please keep the "up to this point there should be 
> no behavioral change in CFS" fundamental splitup of your patches -

sure ..basically the changes required in CFS core is the introduction of two
structures - struct sched_entity and struct lrq and generalization of
cfs routines to operate on these structures rather than on
task_struct/rq structures directly.

In the first split, I will ensure that this generalization is applied
only to tasks (which represents reorganization of core with no
behavioral/functional change in scheduler) and in a subsequent split/patch I 
will apply the generalization to uids also (which will add group fairness 
aspect to scheduler), as you require.

Thanks for your feedback so far!

> that way i can look at the core changes (and possibly apply them) without 
> having to consider the uid based changes (which do change behavior and 
> which need more upstream buy-in).

-- 
Regards,
vatsa
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS [message #18688 is a reply to message #18685] Fri, 25 May 2007 15:34 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Fri, May 25, 2007 at 05:05:16PM +0400, Kirill Korotaev wrote:
> > That way the scheduler would first pick a "virtual CPU" to schedule, and 
> > then pick a user from that virtual CPU, and then a task from the user. 
> 
> don't you mean the vice versa:
> first use to scheduler, then VCPU (which is essentially a runqueue or rbtree),
> then a task from VCPU?
> 
> this is the approach we use in OpenVZ [...]

So is this how it looks in OpenVZ?

	CONTAINER1 => VCPU0 + VCPU1 
	CONTAINER2 => VCPU2 + VCPU3

PCPU0 picks a container first, a vcpu next and then a task in it
PCPU1 also picks a container first, a vcpu next and then a task in it.

Few questions:

1. Are VCPU runqueues (on which tasks are present) global queues?
  
   That is, let's say that both PCPU0 and PCPU1 pick CONTAINER1 to schedule
   (first level) at the same time and next (let's say) they pick same vcpu
   VCPU0 to schedule (second level). Will the two pcpu's now have to be 
   serialized for scanning task to schedule next (third level) within VCPU0 
   using a spinlock?  Won't that shootup scheduling costs (esp on large
   systems), compared to (local scheduling + balance across cpus once in a 
   while, the way its done today)?

   Or do you required that two pcpus don't schedule the same vcpu at the
   same time (the way hypervisors normally work)? Even then I would
   imagine a fair level of contention to be present in second step (pick
   a virtual cpu from a container's list of vcpus).

2. How would this load balance at virtual cpu level and sched domain based
   load balancing interact?

   The current sched domain based balancing code has many HT/MC/SMT related 
   optimizations, which ensure that tasks are spread across physical 
   threads/cores/packages in a most efficient manner - so as to utilize 
   hardware bandwidth to the maximum. You would now need to introduce
   those optimizations essentially at schedule() time ..? Don't know
   if that is a wise thing to do.

3. How do you determine the number of VCPUs per container? Is there any
   relation for number of virtual cpus exposed per user/container and
   the number of available cpus? For ex: in case of user-driven
   scheduling, we would want all users to see the same number of cpus
   (which is the number available in the system).

4. VCPU ids (namespace) - is it different for different containers?

   For ex: can id's of vcpus belonging to different containers (say VCPU0 and 
   VCPU2), as seen by users thr' vgetcpu/smp_processor_id() that is, be same?
   If so, then potentially two threads belonging to different users may find
   that they are running -truly simultaneously- on /same/ cpu 0 (one on
   VCPU0/PCPU0 and another on VCPU2/PCPU1) which normally isn't possible!

   This may be ok for containers, with non-overlapping cpu id namespace,
   but when applied to group scheduling for, say, users, which require a
   global cpu id namespace, wondering how that would be addressed ..


> and if you don't mind I would propose to go this way for fair-scheduling in 
> mainstream.
> It has it's own advantages and disatvantages.
> 
> This is not the easy way to go and I can outline the problems/disadvantages
> which appear on this way:
> - tasks which bind to CPU mask will bind to virtual CPUs.
>   no problem with user tasks, [...]

Why is this not a problem for user tasks? Tasks which bind to different
CPUs for performance reason now can find that they are running on same
(physical) CPU unknowingly.

> but some kernel threads
>   use this to do CPU-related management (like cpufreq).
>   This can be fixed using SMP IPI actually.
> - VCPUs should no change PCPUs very frequently,
>   otherwise there is some overhead. Solvable.
> 
> Advantages:
> - High precision and fairness.

I just don't know if this benefit of high degree of fairness is worth the 
complexity it introduces. Besides having some data which shows how much better 
is is with respect to fairness/overhead when compared with other approaches 
(like smpnice) would help I guess. I will however let experts like Ingo make 
the final call here  :)

-- 
Regards,
vatsa
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS [message #18695 is a reply to message #18663] Mon, 28 May 2007 16:39 Go to previous messageGo to next message
Srivatsa Vaddagiri is currently offline  Srivatsa Vaddagiri
Messages: 241
Registered: August 2006
Senior Member
On Fri, May 25, 2007 at 10:14:58AM -0700, Li, Tong N wrote:
> Nice work, Vatsa. When I wrote the DWRR algorithm, I flattened the
> hierarchies into one level, so maybe that approach can be applied to
> your code as well. What I did is to maintain task and task group weights
> and reservations separately from the scheduler, while the scheduler only
> sees one system-wide weight per task and does not concern about which
> group a task is in. The key here is the system-wide weight of each task
> should represent an equivalent share to the share represented by the
> group hierarchies. To do this, the scheduler looks up the task and group
> weights/reservations it maintains, and dynamically computes the
> system-wide weight *only* when it need a weight for a given task while
> scheduling. The on-demand weight computation makes sure the cost is
> small (constant time). The computation itself can be seen from an
> example: assume we have a group of two tasks and the group's total share
> is represented by a weight of 10. Inside the group, let's say the two
> tasks, P1 and P2, have weights 1 and 2. Then the system-wide weight for
> P1 is 10/3 and the weight for P2 is 20/3. In essence, this flattens
> weights into one level without changing the shares they represent.

What do these task weights control? Timeslice primarily? If so, I am not
sure how well it can co-exist with cfs then (unless you are planning to
replace cfs with a equally good interactive/fair scheduler :)

I would be very interested if this weight calculation can be used for
smpnice based load balancing purposes too ..

-- 
Regards,
vatsa
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS [message #18708 is a reply to message #18695] Wed, 30 May 2007 00:14 Go to previous messageGo to next message
billh is currently offline  billh
Messages: 1
Registered: May 2007
Junior Member
On Mon, May 28, 2007 at 10:09:19PM +0530, Srivatsa Vaddagiri wrote:
> On Fri, May 25, 2007 at 10:14:58AM -0700, Li, Tong N wrote:
> > is represented by a weight of 10. Inside the group, let's say the two
> > tasks, P1 and P2, have weights 1 and 2. Then the system-wide weight for
> > P1 is 10/3 and the weight for P2 is 20/3. In essence, this flattens
> > weights into one level without changing the shares they represent.
> 
> What do these task weights control? Timeslice primarily? If so, I am not
> sure how well it can co-exist with cfs then (unless you are planning to
> replace cfs with a equally good interactive/fair scheduler :)

It's called SD. From Con Kolivas that got it right the first time around :)

> I would be very interested if this weight calculation can be used for
> smpnice based load balancing purposes too ..

bill

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS [message #18712 is a reply to message #18695] Wed, 30 May 2007 02:51 Go to previous message
William Lee Irwin III is currently offline  William Lee Irwin III
Messages: 20
Registered: April 2007
Junior Member
On Mon, May 28, 2007 at 10:09:19PM +0530, Srivatsa Vaddagiri wrote:
> What do these task weights control? Timeslice primarily? If so, I am not
> sure how well it can co-exist with cfs then (unless you are planning to
> replace cfs with a equally good interactive/fair scheduler :)
> I would be very interested if this weight calculation can be used for
> smpnice based load balancing purposes too ..

Task weights represent shares of CPU bandwidth. If task i has weight w_i
then its share of CPU bandwidth is intended to be w_i/sum_i w_i.

"Load weight" seems to be used more in the scheduler source.


-- wli
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
Previous Topic: Re: Pid namespaces approaches testing results
Next Topic: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS
Goto Forum:
  


Current Time: Sat Oct 25 10:18:13 GMT 2025

Total time taken to generate the page: 0.12946 seconds