Re: [RFC] Transactional CGroup task attachment [message #31890 is a reply to message #31800]
From: Matt Helsley
Date: Sat, 12 July 2008 00:03
On Wed, 2008-07-09 at 23:46 -0700, Paul Menage wrote:
> This is an initial design for a transactional task attachment
> framework for cgroups. There are probably some potential simplifications
> that I've missed, particularly in the area of locking. Comments
> appreciated.
> 
> The Problem
> ===========
> 
> Currently cgroups task movements are done in three phases
> 
> 1) all subsystems in the destination cgroup get the chance to veto the
> movement (via their can_attach() callback)
> 2) task->cgroups is updated (while holding task->alloc_lock)
> 3) all subsystems get an attach() callback to let them perform any
> housekeeping updates
> 
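(For reference, a rough sketch of that three-phase flow, loosely paraphrased from the current attach path; variable names are illustrative and error handling is trimmed:)

    /* Phase 1: each subsystem bound to the hierarchy may veto the move. */
    for_each_subsys(cgrp->root, ss) {
        if (ss->can_attach) {
            retval = ss->can_attach(ss, cgrp, tsk);
            if (retval)
                return retval;
        }
    }

    /* Phase 2: switch the task to the new css_set. */
    task_lock(tsk);
    rcu_assign_pointer(tsk->cgroups, newcg);
    task_unlock(tsk);

    /* Phase 3: let each subsystem do its housekeeping updates. */
    for_each_subsys(cgrp->root, ss) {
        if (ss->attach)
            ss->attach(ss, cgrp, oldcgrp, tsk);
    }
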
> The problems with this include:
> 
> - There's no way to ensure that the result of can_attach() remains
> valid up until the attach() callback, unless any invalidating
> operations call cgroup_lock() to synchronize with the change. This is
> fine for something like cpusets, where invalidating operations are
> rare slow events like the user removing all cpus from a cpuset, or cpu
> hotplug triggering removal of a cpuset's last cpu. It's not so good
> for the virtual address space controller where the can_attach() check
> might be that the res_counter has enough space, and an invalidating
> operation might be another task in the cgroup allocating another page
> of virtual address space.
> 
> - It doesn't handle the case of the proposed "cgroup.procs" file which
> can move multiple threads into a cgroup in one operation; the
> can_attach() and attach() calls should be able to atomically allow all
> or none of the threads to move.
> 
> - it can create races around the time of the movement regarding which
> cgroup a resource charge/uncharge should be assigned to (e.g.
> between steps 2 and 3 new resource usage will be charged to the
> destination cgroup, but step 3 might involve migrating a charge equal
> to the task's resource usage from the old cgroup to the new, resulting
> in over/under-charges).

I agree with Balbir: this is a good description of the problems.

> 
> Conceptual solution
> ===================
> 
> In ideal terms, a solution for this problem would meet the following
> requirements:
> 
> - support movement of an arbitrary set of threads between an arbitrary
> set of cgroups
>
> - allow arbitrarily complex locking from the subsystems involved so
> that they can synchronize against concurrent charges, etc
> - allow rollback at any point in the process
> 
> But in practice that would probably be way more complex than we'd want
> in the kernel. We don't want to encourage excessively-complex locking
> from subsystems, and we don't need to support arbitrary task
> movements.
> 
> 
> (Hopefully!) Practical solution
> =============
> 
> So here's a more practical solution, which hopefully catches the
> important parts of the requirements without being quite so complex.
> 
> The restrictions are:
> 
> - only supporting movement to one destination cgroup (in the same
> hierarchy, of course); if an entire process is being moved, then
> potentially its threads could be coming from different source cgroups
> - a subsystem may optionally fail such an attach if it can't handle
> the synchronization this would entail.
> 
> - supporting moving either one thread, one entire thread group or (for
> the future) "all threads". This supports the existing "tasks" file,
> the proposed "procs" file and also allows scope for things like adding
> a subsystem to an existing hierarchy.
> 
> - locking/checking performed in two phases - one to support sleeping
> locks, and one to support spinlocks. This is to support both
> subsystems that use mutexes to protect their data, and subsystems that
> use spinlocks
> 
> - no locks allowed to be shared between multiple subsystems during the
> transaction, with the single exception of the mmap_sem of the
> thread/process being moved. This is because multiple subsystems use
> the mmap_sem for synchronization, and are quite likely to be mounted
> in the same hierarchy. The alternative would be to introduce a
> down_read_unfair() operation that would skip ahead of waiting writers,
> to safely allow a single thread to recursively lock mm->mmap_sem.
> 
> First we define the state for the transaction:
> 
> struct cgroup_attach_state {

nit: How about naming it cgroup_attach_request or
cgroup_attach_request_state? I suggest this because it's not really
"state" that's kept beyond the prepare-then-(commit|abort) sequence.

>   // The thread or process being moved, NULL if moving (potentially) all threads
>   struct task_struct *task;
>   enum {
>     CGROUP_ATTACH_THREAD,
>     CGROUP_ATTACH_PROCESS,
>     CGROUP_ATTACH_ALL
>   } mode;
>
> 
>   // The destination cgroup
>   struct cgroup *dest;
> 
>   // The source cgroup for "task" (child threads *may* have different
> groups; subsystem must handle this if it needs to)
>   struct cgroup *src;
>
>   // Private state for the attach operation per-subsys. Subsystems are
> completely responsible for managing this
>   void *subsys_state[CGROUP_SUBSYS_COUNT];
> 
>   // "Recursive lock count" for task->mm->mmap_sem (needed if we don't
> introduce down_read_unfair())
>   int mmap_sem_lock_count;
> };
> 
> New cgroup subsystem callbacks (all optional):
> 
> -----
> 
> int prepare_attach_sleep(struct cgroup_attach_state *state);
> 
> Called during the first preparation phase for each subsystem. The
> subsystem may perform any sleeping behaviour, including waiting for
> mutexes and doing sleeping memory allocations, but may not disable
> interrupts or take any spinlocks. Return a -ve error on failure or 0
> on success. If it returns failure, then no further callbacks will be
> made for this attach; if it returns success then exactly one of
> abort_attach_sleep() or commit_attach() is guaranteed to be called in
> the future.
> 
> No two subsystems may take the same lock as part of their
> prepare_attach_sleep() callback. A special case is made for mmap_sem:
> if this callback needs to down_read(&state->task->mm->mmap_sem) it should
> only do so if state->mmap_sem_lock_count++ == 0. (A helper function
> will be provided for this.) The callback should not
> down_write(&state->task->mm->mmap_sem).

What about the task->alloc_lock? Might that need to be taken by multiple
subsystems? See my next comment.

> Called with cgroup_mutex (which prevents any other task movement
> between cgroups) held, plus any mutexes/semaphores taken by earlier
> subsystems' callbacks.
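
(A sketch of what a first-phase callback might look like under this proposal. Everything named "foo" is a hypothetical charge-tracking subsystem invented for illustration, and cgroup_attach_down_read_mm() stands in for the mmap_sem helper mentioned above:)

    /* Possible shape of the suggested helper: take mmap_sem only if no
     * earlier subsystem in this attach already holds it. */
    static void cgroup_attach_down_read_mm(struct cgroup_attach_state *state)
    {
        if (state->mmap_sem_lock_count++ == 0)
            down_read(&state->task->mm->mmap_sem);
    }

    /* Hypothetical per-attach private data for the "foo" subsystem. */
    struct foo_attach_data {
        unsigned long precharge;    /* usage reserved against the destination */
    };

    static int foo_prepare_attach_sleep(struct cgroup_attach_state *state)
    {
        struct foo_attach_data *fad = kmalloc(sizeof(*fad), GFP_KERNEL);

        if (!fad)
            return -ENOMEM;
        fad->precharge = 0;
        /* foo_subsys_id: this subsystem's index into subsys_state[] */
        state->subsys_state[foo_subsys_id] = fad;

        /* Keep the task's address space stable while we size the charge. */
        cgroup_attach_down_read_mm(state);
        return 0;
    }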
> 
> -----
> 
> int prepare_attach_nosleep(struct cgroup_attach_state *state);
> 
> Called during the second preparation phase (assuming no subsystem
> failed in the first phase). The subsystem may not sleep in any way,
> but may disable interrupts or take spinlocks. Return a -ve error on
> failure or 0 on success. If it returns failure, then
> abort_attach_sleep() will be called; if it returns success then either
> abort_attach_nosleep() followed by abort_attach_sleep() will be
> called, or commit_attach() will be called
> 
> Called with cgroup_mutex and the task's alloc_lock held (plus any
> mutexes/semaphores taken by subsystems in the prepare_attach_sleep()
> phase, and any spinlocks taken by earlier subsystems in this phase).
> If state->mode == CGROUP_ATTACH_PROCESS then the alloc_lock of every
> thread in task's thread_group is held. (Is this a really bad idea?
> Maybe we should call this without any task->alloc_lock held?)

Holding task->alloc_lock here would avoid the case where multiple
subsystems need to take it (assuming that case exists, of course).
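
(Continuing the hypothetical "foo" subsystem, a sketch of its second-phase callback: pre-reserve the task's usage against the destination under foo's own spinlock, which then stays held until commit or abort so the check cannot be invalidated by concurrent charges. foo_cg(), foo_task_usage() and the foo_cgroup fields are invented for the example, and only the single-thread case is shown:)

    /* Hypothetical per-cgroup state for "foo". */
    struct foo_cgroup {
        spinlock_t lock;
        unsigned long usage;    /* currently charged */
        unsigned long limit;    /* configured maximum */
    };

    static int foo_prepare_attach_nosleep(struct cgroup_attach_state *state)
    {
        struct foo_attach_data *fad = state->subsys_state[foo_subsys_id];
        struct foo_cgroup *dst = foo_cg(state->dest);
        unsigned long charge = foo_task_usage(state->task);

        spin_lock(&dst->lock);
        if (dst->usage + charge > dst->limit) {
            spin_unlock(&dst->lock);
            return -ENOSPC;
        }
        /* Reserve now; dst->lock stays held until commit/abort so no
         * concurrent charge can invalidate this check. */
        dst->usage += charge;
        fad->precharge = charge;
        return 0;
    }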

> ----
> 
> void abort_attach_sleep(struct cgroup_attach_state *state);
> 
> Called following a successful return from prepare_attach_sleep().
> Indicates that the attach operation was aborted and the subsystem
> should unwind any state changes made and locks taken by
> prepare_attach_sleep().
> 
> Called with the same locks as prepare_attach_sleep().
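
(For the hypothetical "foo" subsystem, the matching unwind of its first phase; cgroup_attach_up_read_mm() is the assumed counterpart of the lock-count helper sketched above:)

    static void cgroup_attach_up_read_mm(struct cgroup_attach_state *state)
    {
        if (--state->mmap_sem_lock_count == 0)
            up_read(&state->task->mm->mmap_sem);
    }

    static void foo_abort_attach_sleep(struct cgroup_attach_state *state)
    {
        /* Undo everything foo_prepare_attach_sleep() did. */
        cgroup_attach_up_read_mm(state);
        kfree(state->subsys_state[foo_subsys_id]);
    }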
> 
> ----
> 
> void abort_attach_nosleep(struct cgroup_attach_state *state);
> 
> Called following a successful return from prepare_attach_nosleep().
> Indicates that the attach operation was aborted and the subsystem
> should unwind any state changes made and locks taken by
> prepare_attach_nosleep().
> 
> Called with the same locks as prepare_attach_nosleep();
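
(And the corresponding unwind of "foo"'s second phase, again only a sketch using the invented names from above:)

    static void foo_abort_attach_nosleep(struct cgroup_attach_state *state)
    {
        struct foo_attach_data *fad = state->subsys_state[foo_subsys_id];
        struct foo_cgroup *dst = foo_cg(state->dest);

        /* Drop the reservation made in foo_prepare_attach_nosleep() and
         * release the spinlock taken there; abort_attach_sleep() runs
         * afterwards and frees fad / drops mmap_sem. */
        dst->usage -= fad->precharge;
        spin_unlock(&dst->lock);
    }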
> 
> ----
> 
> void commit_attach(struct cgroup_attach_state *state);
> 
> Called following a successful return from prepare_attach_sleep() for a
> subsystem that has no prepare_attach_nosleep(), or following a
> successful return from prepare_attach_nosleep(). Indicates that the
> attach operation is going ahead; any partially-committed state should
> be finalized and any locks taken should be released. No further
> callbacks will be made for this attach.
> 
> This is called immediately after updating task->cgroups (and threads
> if necessary) to point to the new cgroup set.
> 
> Called with the same locks held as prepare_attach_nosleep()
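
(Finally, a sketch of "foo"'s commit: the destination already holds the reservation, so only the source side needs uncharging before foo releases its locks and private data. Single-thread case only; the names are the same invented ones as above:)

    static void foo_commit_attach(struct cgroup_attach_state *state)
    {
        struct foo_attach_data *fad = state->subsys_state[foo_subsys_id];
        struct foo_cgroup *dst = foo_cg(state->dest);
        struct foo_cgroup *src = foo_cg(state->src);

        /* task->cgroups already points at the destination; migrate the
         * charge by uncharging the source (the destination was charged
         * in foo_prepare_attach_nosleep()). */
        spin_lock(&src->lock);
        src->usage -= fad->precharge;
        spin_unlock(&src->lock);

        spin_unlock(&dst->lock);          /* taken in prepare_attach_nosleep() */
        cgroup_attach_up_read_mm(state);  /* taken in prepare_attach_sleep() */
        kfree(fad);
    }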

Rather than describing what might be called later for each API entry
separately it might be simpler to prefix the whole API/protocol
description with something like:
======
A successful return from prepare_X will cause abort_X to be called if
any of the preparatory calls fail (where X is either sleep or nosleep).

A successful return from prepare_X will cause commit to be called if all
...
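
(Putting the protocol together, roughly the sequencing the core attach code might use under this proposal. This is only a sketch: cgroup_mutex is assumed held by the caller, the multi-thread modes and the partial-failure bookkeeping are compressed into comments, and for_each_subsys() is used loosely:)

    static int cgroup_attach_transaction(struct cgroup_attach_state *state,
                                         struct css_set *newcg)
    {
        struct cgroup_subsys *ss;
        int ret;

        /* Phase 1: sleeping preparation, in subsystem order. */
        for_each_subsys(state->dest->root, ss) {
            if (ss->prepare_attach_sleep) {
                ret = ss->prepare_attach_sleep(state);
                if (ret)
                    goto abort_sleep;
            }
        }

        /* Phase 2: non-sleeping preparation, with the task's alloc_lock held. */
        task_lock(state->task);
        for_each_subsys(state->dest->root, ss) {
            if (ss->prepare_attach_nosleep) {
                ret = ss->prepare_attach_nosleep(state);
                if (ret)
                    goto abort_nosleep;
            }
        }

        /* The move itself, then commit everywhere. */
        rcu_assign_pointer(state->task->cgroups, newcg);
        for_each_subsys(state->dest->root, ss) {
            if (ss->commit_attach)
                ss->commit_attach(state);
        }
        task_unlock(state->task);
        return 0;

    abort_nosleep:
        /* abort_attach_nosleep() for the subsystems whose phase 2 succeeded */
        task_unlock(state->task);
    abort_sleep:
        /* abort_attach_sleep() for the subsystems whose phase 1 succeeded */
        return ret;
    }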

 