| Forum: Devel |
|---|
| Topic: Re: [RFD] reboot / shutdown of a container |
|---|
| Re: [RFD] reboot / shutdown of a container [message #41943] |
Thu, 13 January 2011 16:50 |
Bruno Prémont |
|
|
On Thu, 13 January 2011 Daniel Lezcano <daniel.lezcano@free.fr> wrote:
> On 01/13/2011 09:09 PM, Bruno Prémont wrote:
> > On Thu, 13 January 2011 Daniel Lezcano <daniel.lezcano@free.fr> wrote:
> >> In the container implementation, we are facing the problem of a process
> >> calling the sys_reboot syscall, which of course makes the host
> >> power off or reboot.
> >>
> >> If we drop the cap_sys_reboot capability, sys_reboot fails and the
> >> container reaches a shutdown state but the init process stays there, so
> >> the container becomes stuck waiting indefinitely for process '1' to exit.
> >>
> >> To make shutdown / reboot of the container work, the current
> >> implementation watches, from a process outside of the container, the
> >> <rootfs>/var/run/utmp file and checks the runlevel each time the file
> >> changes. When the 'reboot' or 'shutdown' level is detected, we wait until
> >> a single process remains in the container and then we kill it.
> >>
> >> That works, but it is not efficient with a large number of
> >> containers as we have to watch a lot of utmp files. In addition,
> >> the /var/run directory must *not* be mounted as tmpfs in the distro.
> >> Unfortunately, that is the default setup on most distros and is becoming
> >> more common. That implies the rootfs init scripts must be modified
> >> for the container when we put its rootfs in place, and as /var/run is
> >> supposed to be a tmpfs, most applications do not clean up the
> >> directory, so we need to add extra services to wipe out the files.
> >>
> >> More problems arise when we upgrade the distro inside the
> >> container, because all the setup we made at creation time is lost.
> >> The upgrade overwrites the scripts, the fstab and so on.
> >>
> >> We did what was possible to solve the problem from userspace, but we
> >> always hit a limit because there are different implementations of the
> >> 'init' process, and the init scripts differ from one distro to another
> >> and from one version to another.
> >>
> >> We think this problem can only be solved from the kernel.
> >>
> >> The idea was to send a SIGPWR signal to the parent of pid '1' of the
> >> pid namespace when sys_reboot is called. Of course that won't occur
> >> for the init pid namespace.
> > Wouldn't sending SIGKILL to the pid '1' process of the originating PID
> > namespace be sufficient? (That would trigger a SIGCHLD for the parent
> > process in the outer PID namespace.)
>
> This is already the case. The question is: when do we send this signal?
> We have to wait for the container system shutdown before killing it.
I meant that sys_reboot() would kill the namespace's init if it's not
called from the boot namespace.
See below.
> > (as far as I remember the PID namespace is killed when its 'init' exits,
> > if this is not the case all other processes in the given namespace would
> > have to be killed as well)
>
> Yes, absolutely but this is not the point, reaping the container is not
> a problem.
>
> What we are trying to achieve is to properly shut down the container from
> inside (from outside will be possible too with the setns syscall).
>
> Assume the process '1234' creates a new process in a new namespace set
> and waits for it.
>
> The new process '1' will exec /sbin/init and the system will boot up.
> But when the system is shut down or rebooted, after the down scripts are
> executed, kill -15 -1 will be invoked, killing all the processes
> except process '1' and the caller. The caller will then call
> 'sys_reboot' and exit. Hence we still have the init process idle and its
> parent '1234' waiting for it to die.
This call to sys_reboot() would kill "new process '1'" instead of trying to
operate on the HW box.
This also has the advantage that a container would not require an informed
parent "monitoring" it from outside (though it would not be restarted even if
requested without such informed outside parent).
> If we are able to receive the information in the process '1234' that "the
> sys_reboot was called in the child pid namespace", we can then kill
> our child pid. If this information is raised via a signal sent by the
> kernel with the proper information in the siginfo_t (eg. si_code
> contains "LINUX_REBOOT_CMD_RESTART", "LINUX_REBOOT_CMD_HALT", ... ), the
> solution will be generic for all the shutdown/reboot of any kind of
> container and init version.
Could this be passed for a SIGCHLD? (When the namespace is reaped, and received
by 1234 from the above example, assuming sys_reboot() kills the "new process '1'".)
Looks like yes, but with the need to define new values for si_code (reusing
LINUX_REBOOT_CMD_* would certainly clash, no matter which signal is chosen).
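To make the SIGCHLD idea concrete, here is a minimal Python sketch of what
the outside parent ('1234' in the example) can already read from a child's
siginfo today. It is only an illustration under assumptions: the child merely
stands in for a container init, it kills itself with SIGKILL to imitate the
proposed sys_reboot() behaviour, and the helper names are made up. No new
si_code values are defined here.

```python
import os
import signal


def wait_for_child_siginfo(child_action):
    """Fork a child, run child_action in it, and return (si_code, si_status).

    The child is only a stand-in for a container's init; nothing here
    creates a PID namespace or calls reboot().
    """
    pid = os.fork()
    if pid == 0:
        # Child: never return into the caller's code.
        try:
            child_action()
        finally:
            os._exit(0)
    # Parent: os.waitid() exposes the same fields that a SIGCHLD handler
    # would find in siginfo_t for this child.
    info = os.waitid(os.P_PID, pid, os.WEXITED)
    return info.si_code, info.si_status


def imitate_killed_init():
    # Hypothetical stand-in for "sys_reboot() kills the namespace's init".
    os.kill(os.getpid(), signal.SIGKILL)


if __name__ == "__main__":
    code, status = wait_for_child_siginfo(imitate_killed_init)
    print("si_code:", code, "si_status:", status)
```

With existing values the parent can tell "exited" from "killed by signal N",
but not restart from halt; that is where the new si_code values (or another
channel) would be needed.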
> > Only issue is how to differentiate the various reboot() modes (restart,
> > power-off/halt) from outside, though that one also exists with the SIGPWR
> > signal.
Bruno
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
|
|
| | Topic: Re: [RFD] reboot / shutdown of a container |
|---|
| Re: [RFD] reboot / shutdown of a container [message #41942] |
Thu, 13 January 2011 15:09 |
Bruno Prémont |
|
|
On Thu, 13 January 2011 Daniel Lezcano <daniel.lezcano@free.fr> wrote:
> In the container implementation, we are facing the problem of a process
> calling the sys_reboot syscall, which of course makes the host
> power off or reboot.
>
> If we drop the cap_sys_reboot capability, sys_reboot fails and the
> container reaches a shutdown state but the init process stays there, so
> the container becomes stuck waiting indefinitely for process '1' to exit.
>
> To make shutdown / reboot of the container work, the current
> implementation watches, from a process outside of the container, the
> <rootfs>/var/run/utmp file and checks the runlevel each time the file
> changes. When the 'reboot' or 'shutdown' level is detected, we wait until
> a single process remains in the container and then we kill it.
>
> That works, but it is not efficient with a large number of
> containers as we have to watch a lot of utmp files. In addition,
> the /var/run directory must *not* be mounted as tmpfs in the distro.
> Unfortunately, that is the default setup on most distros and is becoming
> more common. That implies the rootfs init scripts must be modified
> for the container when we put its rootfs in place, and as /var/run is
> supposed to be a tmpfs, most applications do not clean up the
> directory, so we need to add extra services to wipe out the files.
>
> More problems arise when we upgrade the distro inside the
> container, because all the setup we made at creation time is lost.
> The upgrade overwrites the scripts, the fstab and so on.
>
> We did what was possible to solve the problem from userspace, but we
> always hit a limit because there are different implementations of the
> 'init' process, and the init scripts differ from one distro to another
> and from one version to another.
>
> We think this problem can only be solved from the kernel.
>
> The idea was to send a SIGPWR signal to the parent of pid '1' of the
> pid namespace when sys_reboot is called. Of course that won't occur
> for the init pid namespace.
Wouldn't sending SIGKILL to the pid '1' process of the originating PID
namespace be sufficient? (That would trigger a SIGCHLD for the parent
process in the outer PID namespace.)
(as far as I remember the PID namespace is killed when its 'init' exits,
if this is not the case all other processes in the given namespace would
have to be killed as well)
Only issue is how to differentiate the various reboot() modes (restart,
power-off/halt) from outside, though that one also exists with the SIGPWR
signal.
Bruno
> Does it make sense ?
>
> Any idea is very welcome :)
>
> -- Daniel
|
|
| | Topic: Re: [PATCH] Teach cifs about network namespaces (take 2) |
|---|
| Re: [PATCH] Teach cifs about network namespaces (take 2) [message #41996] |
Thu, 13 January 2011 13:52 |
Rob Landley |
|
|
On 01/11/2011 03:30 PM, Jeff Layton wrote:
> On Tue, 11 Jan 2011 12:04:54 -0600
> Rob Landley <rlandley@parallels.com> wrote:
>
>> From: Rob Landley <rlandley@parallels.com>
>>
>> Teach cifs about network namespaces, so mounting uses addresses/routing
>> visible from the container rather than from init context.
>>
>> Signed-off-by: Rob Landley <rlandley@parallels.com>
>> ---
>>
>> Updated with Matt's feedback and to apply to current linus-git.
>>
>> fs/cifs/cifsglob.h | 37 +++++++++++++++++++++++++++++++++++++
>> fs/cifs/connect.c | 14 ++++++++++++--
>> 2 files changed, 49 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
>> index 606ca8b..8175d31 100644
>> --- a/fs/cifs/cifsglob.h
>> +++ b/fs/cifs/cifsglob.h
>> @@ -165,6 +165,9 @@ struct TCP_Server_Info {
>> struct socket *ssocket;
>> struct sockaddr_storage dstaddr;
>> struct sockaddr_storage srcaddr; /* locally bind to this IP */
>> +#ifdef CONFIG_NET_NS
>> + struct net *net;
>> +#endif
>> wait_queue_head_t response_q;
>> wait_queue_head_t request_q; /* if more than maxmpx to srvr must block*/
>> struct list_head pending_mid_q;
>> @@ -224,6 +227,40 @@ struct TCP_Server_Info {
>> };
>>
>
> I've got a patch queued that rearranges some fields in TCP_Server_Info
> according to pahole's recommendations. You may want to base this patch
> on that.
I confirmed that where it is just misses being affected by your patch
(offset but no fuzz), and it follows struct sockaddr_storage, which
include/linux/socket.h A) pads to 128 bytes and B) gives an alignment
compiler directive just to be sure.
So it seems reasonable to leave it where it is for the moment.
Rob
|
|
| | Topic: Re: Two newbie questions on containers |
|---|
| Re: Two newbie questions on containers [message #41995] |
Wed, 12 January 2011 12:35 |
Rob Landley |
|
|
On 01/11/2011 04:08 PM, Timur Tabi wrote:
> Hi,
>
> I'm in the process of learning about Linux containers, including cgroups, and
> the learning curve seems pretty steep to me.
Join the club. I need to write up documentation on what I'm learning...
> So I have a couple newbie
> questions for all of you. Any detailed answers are greatly appreciated.
Container support consists of a number of things. The cgroups
filesystem is one, the namespace flags to clone (all the ones starting
with CLONE_NEW*) are another, various synthetic filesystems like devpts
have "-o newinstance". And of course it's all built on top of chroot.
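For reference, a small Python sketch of the CLONE_NEW* flags mentioned above.
The numeric values are the ones defined in the kernel's include/linux/sched.h;
the helper names container_flags and describe_flags are made up for this
illustration. The sketch only builds and decodes a flag mask, it does not call
clone() and needs no privileges.

```python
# Namespace flags accepted by clone(2), as defined in the kernel's
# include/linux/sched.h.
CLONE_NEW_FLAGS = {
    "CLONE_NEWNS": 0x00020000,    # mount points
    "CLONE_NEWUTS": 0x04000000,   # hostname and domain name
    "CLONE_NEWIPC": 0x08000000,   # System V IPC
    "CLONE_NEWUSER": 0x10000000,  # user and group IDs
    "CLONE_NEWPID": 0x20000000,   # process IDs
    "CLONE_NEWNET": 0x40000000,   # network stack
}


def container_flags(*names):
    """OR together the named flags (hypothetical helper)."""
    mask = 0
    for name in names:
        mask |= CLONE_NEW_FLAGS[name]
    return mask


def describe_flags(mask):
    """Return the sorted names of the namespace flags set in mask."""
    return sorted(n for n, bit in CLONE_NEW_FLAGS.items() if mask & bit)


if __name__ == "__main__":
    mask = container_flags("CLONE_NEWPID", "CLONE_NEWNET", "CLONE_NEWNS")
    print(hex(mask), describe_flags(mask))
```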
The LXC userspace tool attempts to tie all of these together into
something coherent. They have their own mailing list, off of lxc.sf.net.
I wrote up my ignorance at:
http://landley.livejournal.com/47024.html
http://landley.livejournal.com/47205.html
> 1) For the PowerPC architecture, is there anything that is "missing"? I can't
> really tell how much of cgroups and lxc is architecture-specific, and there
> appears to be PowerPC support for both already. I'd like to know if this is
> another one of those areas, like KVM, where x86 is fully implemented and PowerPC
> support is lagging.
Containers support is basically chroot on steroids. It attempts to build
_up_ from chroot to provide efficient fully isolated virtual systems,
the same way paravirtualization is attempting to strip down
virtualization to reinvent the microkernel. Both approaches have their
fundamental limits: containers are never going to boot Windows and
paravirtualization is unlikely to scale much better than Multics or "The
Hurd".
But a big advantage of containers is it's about as portable as chroot.
It hasn't received a lot of testing on other targets, but there's no
fundamental reason it shouldn't work just fine.
This might help:
http://lxc.sourceforge.net/index.php/about/kernel-namespaces/
> 2) Given a random device driver, like a driver for a serial port, is there an
> opportunity for the driver to be enhanced to support cgroups or lxc?
Define "support".
There are two main categories of device containerization:
1) Selective visibility, so you can move a physical device into a
container and have it _only_ show up there, and not be visible elsewhere.
2) Synthetic devices, such as /dev/console in a container that a host
LXC instance can attach to and be at the other end of. (TUN/TAP
ethernet interfaces are another example.) These generally exist to let
the container talk to the outside world with host-controllable routing.
However I'm assured that "fully transparent" containers are not a goal.
So having /dev/console be a pty if necessary may be "good enough" for a
given deployment.
(Also, note that since a container inherits a filesystem via chroot
(with whatever shared subtree mount splices the host cares to set up
before chrooting), it has less need for block device access than usual.)
> Doing a
> simple search of the kernel source code, I don't really see any drivers making
> calls into any cgroup code, so I don't understand how to restrict device access
> to a specific container or cgroup.
I'm banging on the CONFIG_NET_NS stuff a bit, although I'm sure there's
plenty of bugs for all. :)
Note that there are longstanding out-of-tree containerization solutions
(most notably openvz) which are implemented differently (new syscalls,
an approach that got vetoed) than the containers support that made it
into the kernel (Google's submission, based on something SGI did).
Those out of tree things do stuff that Linus's tree still doesn't.
They're porting stuff to the new way of doing things and submitting it
upstream, but there's still a lot of shoveling left to do. So you may
have heard of capabilities that simply aren't in mainline yet.
Rob
|
|
| | Topic: [PATCH cgroups] Remove deprecated subsystem from examples. |
|---|
| [PATCH cgroups] Remove deprecated subsystem from examples. [message #41983] |
Tue, 14 December 2010 21:28 |
Trevor Woerner |
|
|
From: Trevor Woerner <twoerner@gmail.com>
The 'ns' cgroup is considered deprecated. Change the cgroup subsystem
used in the examples of the cgroup documentation from 'ns' to 'blkio'.
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Trevor Woerner <twoerner@gmail.com>
---
Documentation/cgroups/cgroups.txt | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 190018b..44b8b7a 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -355,13 +355,13 @@ subsystems, type:
To change the set of subsystems bound to a mounted hierarchy, just
remount with different options:
-# mount -o remount,cpuset,ns hier1 /dev/cgroup
+# mount -o remount,cpuset,blkio hier1 /dev/cgroup
-Now memory is removed from the hierarchy and ns is added.
+Now memory is removed from the hierarchy and blkio is added.
-Note this will add ns to the hierarchy but won't remove memory or
+Note this will add blkio to the hierarchy but won't remove memory or
cpuset, because the new options are appended to the old ones:
-# mount -o remount,ns /dev/cgroup
+# mount -o remount,blkio /dev/cgroup
To Specify a hierarchy's release_agent:
# mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
--
1.7.1
|
|
| | Topic: Re: [Lxc-users] regular lxc development call? |
|---|
| Re: [Lxc-users] regular lxc development call? [message #41929] |
Mon, 13 December 2010 13:12 |
Walter Stanish |
|
|
Apologies that I am travelling in north Africa at the moment - somewhat
sudden change of schedule - and will have highly sporadic availability until
late January. Nevertheless please keep me Cc'd on any developments for
subsequent calls.
W
On 13/12/2010 7:05 PM, "Stéphane Graber" <stgraber@ubuntu.com> wrote:
On Tue, 2010-11-30 at 03:06 +0000, Serge E. Hallyn wrote:
> Quoting Daniel Lezcano (daniel.lezcano@f...
I'd like to attend that call, Skype ID: stgraber
Depending on how many people are going to attend and where they're from,
I might be able to provide a conf number.
I asked my company (Revolution Linux) and we can use our 1-800 number
for the call. I can also invite people from other countries as long as
they are on landline.
9:30am central is a bit early for me as I tend to arrive at the office
around 10am central (9am eastern).
I'm usually around from 9am eastern to 11:30am and 12:30pm to 5:30pm.
Monday being usually quite busy so would like to avoid if possible :)
I guess it might be useful to have a list somewhere (wiki ?) of people
who'd like to attend with availabilities and timezone.
--
Stéphane Graber
Ubuntu developer
http://www.ubuntu.com
_______________________________________________
Lxc-users mailing list
Lxc-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-users
|
|
| | Topic: Re: [Lxc-users] regular lxc development call? |
|---|
| Re: [Lxc-users] regular lxc development call? [message #41982] |
Mon, 13 December 2010 13:03 |
Stéphane Graber |
|
|
On Tue, 2010-11-30 at 03:06 +0000, Serge E. Hallyn wrote:
> Quoting Daniel Lezcano (daniel.lezcano@free.fr):
> > On 11/29/2010 03:53 PM, Serge E. Hallyn wrote:
> > > Hi,
> > >
> > > at UDS-N we had a session on 'fine-tuning containers'. The focus was
> > > things we can do in the next few months to improve containers. The
> > > meeting proceedings can be found at
> > > https://wiki.ubuntu.com/UDSProceedings/N/CloudInfrastructure#Make%20LXC%20ready%20for%20production
> > >
> > > We have a few work items written down at
> > > https://blueprints.edge.launchpad.net/ubuntu/+spec/cloud-server-n-containers-finetune
> > > The list is flexible fwiw, but we thought it might help to have a regular
> > > call, perhaps every other week, to discuss work items, their design,
> > > and their progress. For some features like reboot/shutdown, I think
> > > design still needs discussion. For other things, it's more important
> > > that we just discuss who's doing what and what's been done.
> > >
> > > Is there interest in having such a call?
> > >
> >
> > Yep, IMO it is a good idea.
> >
> > > I suspect most of the containers work now is purely volunteer driven,
> > > so a free venue seems worthwhile. Should we do this over skype? IRC?
> > > Does someone want to set up a conference number?
> > >
> >
> > I don't have a conf number, if anyone has one that will be great,
> > otherwise I am fine with skype or irc.
>
> Looks like we'll be starting small anyway, so let's just try skype. Anyone
> interested in joining, please send me your skype id.
>
> What is a good time? I'll just toss thursday at 9:30am US Central time
> (15:30 UTC) out there.
>
> -serge
I'd like to attend that call, Skype ID: stgraber
Depending on how many people are going to attend and where they're from,
I might be able to provide a conf number.
I asked my company (Revolution Linux) and we can use our 1-800 number
for the call. I can also invite people from other countries as long as
they are on landline.
9:30am central is a bit early for me as I tend to arrive at the office
around 10am central (9am eastern).
I'm usually around from 9am eastern to 11:30am and 12:30pm to 5:30pm.
Monday being usually quite busy so would like to avoid if possible :)
I guess it might be useful to have a list somewhere (wiki ?) of people
who'd like to attend with availabilities and timezone.
--
Stéphane Graber
Ubuntu developer
http://www.ubuntu.com
|
|
| | Topic: RE: trying to build simple checkpoint/restart recipes |
|---|
| RE: trying to build simple checkpoint/restart recipes [message #41980] |
Wed, 08 December 2010 16:10 |
Rob Landley |
|
|
> > The restoration of the mounts is not scriptable however. It involves
> > parsing the mountinfo file and coordinating the mounts with those done by
> > lxc itself during lxc-restart. I honestly haven't looked at that closely
>
> I'd be fine with requiring some bit of hand-parsing. But right, even
> once we get a list of the mounts to be restored, I don't know of any
> good way to get those mounts re-created at the right time.
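As a starting point for the hand-parsing mentioned above, here is a hedged
Python sketch that reads lines in the /proc/<pid>/mountinfo format documented
in proc(5) and orders them so that parent mounts come before their children.
The names Mount, parse_mountinfo_line and parents_first are made up for this
illustration; the sketch does not perform any mounts and says nothing about
how lxc-restart would coordinate them.

```python
from typing import List, NamedTuple


class Mount(NamedTuple):
    mount_id: int
    parent_id: int
    root: str
    mount_point: str
    options: str
    fstype: str
    source: str


def parse_mountinfo_line(line: str) -> Mount:
    """Parse one mountinfo line.

    Fields before the " - " separator: mount ID, parent ID, major:minor,
    root, mount point, mount options, then optional fields. After it:
    filesystem type, mount source, super options. Spaces inside paths
    are escaped as \\040, so splitting on whitespace is safe.
    """
    left, _, right = line.partition(" - ")
    head = left.split()
    tail = right.split()
    return Mount(int(head[0]), int(head[1]), head[3], head[4], head[5],
                 tail[0], tail[1])


def parents_first(mounts: List[Mount]) -> List[Mount]:
    """Order mounts so each one appears after its parent mount."""
    ids = {m.mount_id for m in mounts}
    children = {}
    roots = []
    for m in mounts:
        if m.parent_id not in ids or m.parent_id == m.mount_id:
            roots.append(m)  # parent lies outside this list
        else:
            children.setdefault(m.parent_id, []).append(m)
    ordered = []
    stack = list(reversed(roots))
    while stack:
        m = stack.pop()
        ordered.append(m)
        stack.extend(reversed(children.get(m.mount_id, [])))
    return ordered
```

Getting the list in a usable order is the easy part; re-creating the mounts at
the right time during restart is the open question raised above.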
Mount code is one of my old stomping grounds from back when I wrote
the busybox mount and switch_root commands and had to learn more
implementation details about it than I ever wanted to know. :)
I never could find a proper mount spec, and kept meaning to write one,
but I blathered about some of the less obvious details here:
http://www.mail-archive.com/busybox@busybox.net/msg07013.html
There are four top level categories of filesystem: Block backed, ram backed,
pipe backed (network and fuse and so on), and synthetic (sysfs, procfs,
devtmpfs...). And that's not counting bind mounts (which are internal
to the VFS and not really a filesystem), and loopback devices (which are
sort of the _opposite_ of a filesystem)...
> I suppose I could hack lxc-restart to do it. But I'm sort of hoping we
> can get something less hacked and more true to the 'real' upstream
> code.
Which upstream code?
> So do you know of anyone who's been working on re-creation of mounts
> in the kernel? If not, what have you been doing, hand-scripting
> all container creation, checkpoint, and restart?
I express interest in this topic.
Rob
|
|
| | Topic: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch |
|---|
| Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch [message #41927] |
Sun, 28 November 2010 23:09 |
Gene Cooperman |
|
|
Hi Oren,
On Thu, Nov 25, 2010 at 11:04:16AM -0500, Oren Laadan wrote:
> On Tue, 23 Nov 2010, Kapil Arya wrote:
>
> > OL> Even if it did - the question is not how to deal with "glue"
> > OL> (you demonstrated quite well how to do that with DMTCP), but
> > OL> how should the basic, core c/r functionality work - which is
> > OL> below, and orthogonal to the "glue".
> >
> > There seems to be an implicit assumption that it is easy to separate the DMTCP
> > "glue code" from the DMTCP C/R engine as separate modules. DMTCP is modular but
> > it splits the problems into modules along a different line than Linux C/R. We
> > look forward to the joint experiment in which we would try to combine DMTCP
> > with Linux C/R. This will help answer the question in our mind.
>
> I apologize for being blunt - but this is probably an issue specific to
> DMTCP's engineering...
>
I completely agree with you, Oren. DMTCP was never designed to be split
into a userland and in-kernel replacement. We will want to re-factor
DMTCP to make this happen.
I'm sorry if my e-mail came off as confrontational. That was not my
intention. I was just looking forward to an interesting intellectual
experiment --- how to go about combining DMTCP and Linux C/R. I was
trying to guess ahead of time where there are interesting challenges, and
my hope is that we will find a way to solve them together.
Best wishes,
- Gene
|
|
| | Topic: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch |
|---|
| Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch [message #41973] |
Tue, 23 November 2010 22:50 |
Kapil Arya |
|
|
(Our first comment below actually replies to an earlier post by Oren. It seemed
simpler to combine our comments.)
> > d. screen and other full-screen text programs. These are not the only
> > examples of difficult interactions with the rest of the world.
>
> This actually never required a userspace "component" with Zap or linux-cr (to
> the best of my knowledge).
We would guess that Zap would not be able to support screen without a user
space component. The bug occurs when screen is configured to have a status line
at the bottom. We would be interested if you want to try it and let us know the
results.
=============================================
> > > category       linux-cr                      userspace
> > > ------------------------------------------------------------------------
> > > PERFORMANCE    has _zero_ runtime overhead   visible overhead due to
> > >                                              syscall interposition and
> > >                                              state tracking even w/o
> > >                                              checkpoints;
> >
> > In our experiments so far, the overhead of system calls has been
> > unmeasurable. We never wrap read() or write(), in order to keep overhead
> > low. We also never wrap pthread synchronization primitives such as locks,
> > for the same reason. The other system calls are used much less often, and
> > so the overhead has been too small to measure in our experiments.
>
> Syscall interception will have a visible effect on applications that use those
> syscalls. You may not observe overhead with HPC ones, but do you have
> numbers on server apps? Apps that use fork/clone and pipes extensively?
> Thread benchmarks, etc.? Compare that to the absolute zero overhead of linux-cr.
It's true that we haven't taken serious data on overhead with server apps. Is
there a particular server app that you are thinking of as an example? I would
expect fork/clone and pipes to be invoked infrequently in server apps and
not to add measurably to CPU time. In most server apps such as MySQL, it is
common to maintain a pool of threads for reuse rather than to repeatedly call
clone for a new thread. This is done to ensure that the overhead of the clone
calls is not significant. I would expect a similar policy for fork and pipes.
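The thread-pool pattern described above can be sketched in a few lines of
Python. This is a generic illustration, not code from MySQL or DMTCP, and it
measures nothing: it only shows that many requests are served by a small,
fixed set of threads, so thread creation (clone) happens a handful of times
rather than once per request. The function name run_with_pool is made up.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_with_pool(num_requests, pool_size):
    """Serve num_requests on a fixed pool; return (results, threads_used)."""
    seen_threads = set()
    lock = threading.Lock()

    def handle(request_id):
        # Record which worker thread served this request.
        with lock:
            seen_threads.add(threading.get_ident())
        return request_id * 2

    # Worker threads are created at most pool_size times and then reused.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        results = list(pool.map(handle, range(num_requests)))
    return results, len(seen_threads)


if __name__ == "__main__":
    results, threads_used = run_with_pool(1000, 8)
    print(len(results), "requests served by", threads_used, "threads")
```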
<snip>
> > > OPERATION      applications run unmodified   to do c/r, needs 'controller'
> > >                                              task (launch and manage
> > >                                              _entire_ execution) - point
> > >                                              of failure. Restricts how a
> > >                                              system is used.
> >
> > We'd like to clarify what may be some misconceptions. The DMTCP controller
> > does not launch or manage any tasks. The DMTCP controller is stateless,
> > and is only there to provide a barrier, namespace server, and single point
> > of contact to relay ckpt/restart commands. Recall that the DMTCP
> > controller handles processes across hosts --- not just on a single host.
>
> The controller is another point of failure. I already pointed out that the
> (controlled) application crashes when your controller dies, and you mentioned
> it's a bug that should be fixed. But then there will always be a risk for
> another, and another ... You also mentioned that if the controller dies,
> then the app should continue to run, but will not be checkpointable anymore
> (IIUC).
>
> The point is, that the controller is another point of failure, and makes the
> execution/checkpoint intrusive. It also adds security and user-management
> issues as you'll need one (or more ?) controller per user (right now, it's
> one for all, no ?). and so on.
Just to clarify, DMTCP uses one coordinator for each checkpointable
computation. A single user may be running multiple computations with one
coordinator for each computation. We don't actually use the word controller
in DMTCP terminology because the coordinator is stateless and so is
coordinating but not controlling other processes.
> Plus, because the restarted apps get their virtualized IDs from the
> controller, then they can't now "see" existing/new processes that may get the
> "same" pids (virtualization is not in the kernel).
This appears to be a misconception. The wrappers within the user process
maintain the pid-translation table for that process. The translation table is
the translation between the original pid given by the kernel and the current
pid set by the kernel on restart. This is handled locally and does not involve
the coordinator.
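A minimal Python sketch of such a per-process translation table follows. It
is an illustration under assumptions, not DMTCP's implementation: the class
and method names are hypothetical, and unknown pids are simply passed through
unchanged.

```python
class PidTable:
    """Maps original (checkpoint-time) pids to current (post-restart) pids."""

    def __init__(self):
        self._orig_to_cur = {}
        self._cur_to_orig = {}

    def record(self, original, current):
        """Remember that the process first known as `original` is now `current`."""
        stale = self._orig_to_cur.get(original)
        if stale is not None:
            # Drop the reverse entry left over from an earlier restart.
            self._cur_to_orig.pop(stale, None)
        self._orig_to_cur[original] = current
        self._cur_to_orig[current] = original

    def to_current(self, original):
        """Translate a pid the application holds into one the kernel knows."""
        return self._orig_to_cur.get(original, original)

    def to_original(self, current):
        """Translate a pid reported by the kernel into the one the app expects."""
        return self._cur_to_orig.get(current, current)


if __name__ == "__main__":
    table = PidTable()
    table.record(original=1234, current=5678)
    print(table.to_current(1234), table.to_original(5678))
```

A wrapper around a call such as kill() would pass its argument through
to_current() before entering the kernel, and a wrapper around getpid() would
pass the result through to_original(), all without contacting the coordinator.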
In the case of a fork there could be a pid-clash (the original pid generated
for a new process conflicts with someone else's original pid). However, DMTCP
handles this by checking within the fork wrapper for a pid-clash. In the rare
case of a pid-clash, the child process exits and the parent forks again. The
same applies to clone and to any pid clash at restart time.
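The retry logic can be sketched as follows in Python. This is a hedged
illustration of the idea only: fork_avoiding and its parameters are
hypothetical names, and the fork, getpid and reap arguments exist purely so
the parent-side logic can be exercised without creating processes.

```python
import os


def fork_avoiding(reserved_pids, fork=os.fork, getpid=os.getpid,
                  reap=lambda pid: os.waitpid(pid, 0), max_attempts=100):
    """Fork until the child's pid does not clash with reserved_pids.

    Returns 0 in the surviving child and the child's pid in the parent,
    like os.fork(). reserved_pids stands for the original pids already
    in use by the checkpointed computation.
    """
    for _ in range(max_attempts):
        pid = fork()
        if pid == 0:
            # Child side: a clashing child exits so the parent can retry.
            if getpid() in reserved_pids:
                os._exit(0)
            return 0
        if pid not in reserved_pids:
            return pid
        # Parent side: reap the clashing child, then fork again.
        reap(pid)
    raise RuntimeError("no clash-free pid after %d attempts" % max_attempts)
```

The max_attempts bound is an addition of this sketch so that it cannot loop
forever; the original description does not mention one.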
> >    Also, in any computation involving multiple processes, _every_ process
> >    of the computation is a point of failure. If any process of the
> >    computation dies, then the simple application strategy is to give up
> >    and revert to an earlier checkpoint. There are techniques by which an
> >    app or DMTCP can recreate certain failed processes. DMTCP doesn't
> >    currently recreate a dead controller (no demand for it), but it's not
> >    hard to do technically.
>
> The point is that you _add_ a point of failure: you make the "checkpoint"
> operation a possible reason for the application to crash. In contrast, in
> linux-cr the checkpoint is idempotent - harmless because it does not make
> the applications execute. Instead, it merely observes their state.
We were speaking above of the case when the process dies during a
computation. We were not referring to checkpoint time.
<snip>
We would like to add our own comment/question. To set the context we quote an
earlier post:
OL> Even if it did - the question is not how to deal with "glue"
OL> (you demonstrated quite well how to do that with DMTCP), but
OL> how should the basic, core c/r functionality work - which is
OL> below, and orthogonal to the "glue".
There seems to be an implicit assumption that it is easy to separate the DMTCP
"glue code" from the DMTCP C/R engine as separate modules. DMTCP is modular but
it splits the problems into modules along a different line than Linux C/R. We
look forward to the joint experiment in which we would try to combine DMTCP
with Linux C/R. This will help answer the question in our mind.
In order to explore the issue, let's imagine that we have a successful merge of
DMTCP and Linux C/R. The following are some user-space glue issues. It's not
obvious to us how the merged software will handle these issues.
1. Sockets -- DMTCP handles all sockets in a common manner through a single
module. Sockets are checkpointed independently of whether they are local or
remote. In a merger of DMTCP and Linux C/R, what does Linux C/R do when it sees
remote sockets? Or should DMTCP take down all remote sockets before
checkpointing? If DMTCP has to do this, it would be less efficient than the
current design which keeps the remote sockets connections alive during
checkpoint.
2. XLib and X11-server -- Consider checkpointing a single X11 app without the
X11-server and without VNC. This is something we intend to add to DMTCP in the
next few months. We have already mapped out the design in our minds. An X11
application includes the Xlib library. The data of an X11 window is, by
default, contained in the X11 library -- not in the X11-server. The application
communicates with the X11-server using socket connections, which would be
considered a leak by Linux C/R. At restart time, DMTCP will ask the
X11-server to create a bare window and then make the appropriate Xlib call to
repaint the window based on the data stored in the Xlib library.
For checkpoint/resume, the window stays up and does not have to be repainted.
How will the combined DMTCP/Linux C/R work? Will DMTCP have to take
down the window prior to Linux C/R and paint a new window at resume time?
Doesn't this add inefficiency?
3. Checkpointing a single process (e.g. a bash shell) talking to an xterm via
a pty -- We assume that from the viewpoint of Linux C/R a pty is a leak since
there is a second process operating the master end of the pty. In this case we
are guessing that Linux C/R would checkpoint and restart without the guarantees
of reliability. We are guessing that Linux C/R would not save and restore the
pty; instead it would be the responsibility of DMTCP to restore the current
settings of the pty (e.g. packet mode vs. regular mode). Is our understanding
correct? Would this work?
Thanks,
Gene and Kapil
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
...
|
|
| | Topic: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch |
|---|
| Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch [message #41972] |
Thu, 18 November 2010 15:13 |
Jose R. Santos Messages: 1 Registered: November 2010 |
Junior Member |
From: *parallels.com
|
|
On Thu, 18 Nov 2010 10:48:34 +0100
Tejun Heo <tj@kernel.org> wrote:
> Hello, Pavel.
>
> On 11/18/2010 10:13 AM, Pavel Emelyanov wrote:
> >>> By this do you mean the very idea of having CR support in the
> >>> kernel? Or our design of it in the kernel?
> >>
> >> The former, I'm afraid.
> >
> > Can you elaborate on this please?
>
> I think I already did that several times in this thread but here's an
> attempt at summary.
Yet the arguments seem to be vague enough not to be convincing to the
people working on the code.
> * It adds a bunch of pseudo ABI when most of the same information is
> available via already established ABI.
Can you elaborate on this? What established ABI are you proposing we
use here? Hopefully we can turn this into a more technical discussion.
> * In a way which can only ever be used and tested by CR. If possible,
So what if it can only be tested with CR as long as we can make CR work
on a variety of environments? Scalability changes for _really_ large
SMP boxes can only be reliably tested by people with such equipment. We are
not imposing any such restriction and this code can be tested on a very
wide range of setups.
> kernel should provide generic mechanisms which can be used to
> implement features in userland. One of the reasons why we'd like to
> export small basic building blocks instead of full end-to-end
> solutions from the kernel is that we don't know how things will
> change in the future. In-kernel CR puts too much in the kernel in a
> way too inflexible manner.
>
> * It essentially adds a separate complete set of entry/exit points for
> a lot of things, which makes things more error prone and increases
> maintenance overhead across the board.
I partially agree with you here. There will be maintenance overhead
every time you add code to the kernel that _may_ make changes in the
future more complicated. This is true for _any_ code that is added to the
core kernel. Now in my experience such maintenance burden is most
disruptive when the code being added creates a lot of new state that
needs to be tracked in multiple places unrelated to CR (in this case).
Our argument is that the CR code is not creating new state that will
cause painful future changes to the kernel. If you have specific
examples that you are concerned with, great. Let's discuss those.
Are we promising zero maintenance cost? No. But guess what, neither do most
features that make it into the kernel.
Now, if we change the argument around... What would be the maintenance
cost of keeping this outside the kernel? I would argue that it is much
higher and would use SystemTap as the first example that comes to mind.
> * And, most of all, there are userland implementation and
> virtualization, making the benefit to overhead ratio completely off.
Can we keep virtualization out of this? Every time someone mentions
virtualization as a solution, it makes me feel like these people just
don't understand the problem we are trying to solve. It is just not
practical to create a new VM for every application you want to CR.
These are two different tools to attack two different problems.
> Userland implementation _already_ achieves most of what's necessary
> for the most important use case of HPC without any special help from
What are these _most_ important cases of HPC that you are referring to?
Can we do a lot of these cases from userspace? Sure, but why are the
ones that can't be done from userspace any less important? If nobody
cared about those, we would not be having this conversation.
> the kernel. The only reasonable thing to do is taking a good look
> at it and finding ways to improve it.
The userspace vs in-kernel discussion has been done before as multiple
people have already said in this thread. Show me a version of userspace
CR that can correctly do all that an in-kernel implementation is capable
of.
> Thanks.
>
--
Jose R. Santos
|
|
| | Topic: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch |
|---|
| Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch [message #41908] |
Mon, 08 November 2010 13:37 |
Gene Cooperman Messages: 5 Registered: November 2010 |
Junior Member |
From: *parallels.com
|
|
Thanks for the careful response, Oren. For others who read this,
one could interpret Oren's rapid post as criticizing the work of
Andres Lagar Cavilla. I'm sure that this was not Oren's intention.
Please read below for a brief clarification of the novelty of SnowFlock.
Anyway, I really look forward to the phone discussion. I've also
enjoyed our interchange, for giving me an opportunity to explain more about
the DMTCP design. Thank you.
Best wishes,
- Gene
On Mon, Nov 08, 2010 at 01:14:12PM -0500, Oren Laadan wrote:
> Hi,
>
> Ok, I'll bite the bullet for now - to be continued...
>
> Just one important clarification:
>
> >>Linux-cr can do live migration - e.g. VDI, move the desktop - in
> >>which case skype's sockets' network stacks are reconstructed,
> >>transparently to both skype (local apps) and the peer (remote apps).
> >>Then, at the destination host, skype continues to work.
> >
> >That's a really cool thing to do, and it's definitely not part of what
> >DMTCP does. It might be possible to do userland live migration,
> >but it's definitely not part of our current scope. But if we're talking
> >about live migration, have you also looked at the work of
> >Andres Lagar Cavilla on SnowFlock?
> > http://andres.lagarcavilla.com/publications/LagarCavillaEurosys09.pdf
> >He does live migration of entire virtual machines, again with very
> >small delay. Of course, the issue for any type of live migration is that
> >if the rate of dirtying pages is very high (e.g. HPC), then there is
> >still a delay or slow response, due to page faults to a remote host.
>
> VMware, Xen and KVM already do live migration. However, VMs
> are a separate beast.
I absolutely agree with your point that live migration of
applications is a different beast, and technically very novel.
Since I know Andres Lagar Cavilla personally, I also feel obligated
to comment why SnowFlock truly is novel in the VM space. First, as Andres
writes:
"SnowFlock is an open-source project [SnowFlock] built on the Xen 3.0.3
VMM [Barham 2003]."
In the abstract, Andres points out one of the major points of novelty:
"To evaluate SnowFlock, we focus on the demanding
scenario of services requiring on-the-fly creation of hundreds
of parallel workers in order to solve computationally-intensive
queries in seconds."
We must be careful that we don't destroy someone's reputation without
a careful study of their work.
> We are concerned about _application_ level c/r and migration
> (complete containers or individual applications). Many proven
> techniques from the VM world apply to our context too (in your
> example, post-copy migration).
>
> Oren.
|
|
| | Topic: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch |
|---|
| Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch [message #41907] |
Sun, 07 November 2010 18:31 |
Gene Cooperman Messages: 5 Registered: November 2010 |
Junior Member |
From: *parallels.com
|
|
On Sun, Nov 07, 2010 at 04:44:20PM -0500, Oren Laadan wrote:
> [cc'ing linux containers mailing list]
>
> On 11/06/2010 04:40 PM, Gene Cooperman wrote:
>
> >8. What happens if the DMTCP coordinator (checkpoint control process) dies?
> > [ The same thing that happens if a user process dies. We kill the whole
> > computation, and restart. At restart, we use a new coordinator.
> > Coordinators are stateless. ]
>
> My experience is different:
>
> I downloaded dmtcp and followed the quick-start guide:
> (1) "dmtcp_coordinator" on one terminal
> (2) "dmtcp_checkpoint bash" on another terminal
>
> Then I:
> (3) pkill -9 dmtcp_coordinator
> ... oops - 'bash' died.
>
> I didn't even try to take a checkpoint :(
You're right. I just reproduced your example. But please remember that
we're working in a design space where if any process of a computation
dies, then we kill the computation and restart. It doesn't matter to us
if it's a user process or the DMTCP coordinator that died. I do think
this is getting too detailed for the LKML list, but since you bring it
up, here is the analysis. The user bash process exits with:
[31331] ERROR at dmtcpmessagetypes.cpp:62 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
_magicBits =
Message: read invalid message, _magicBits mismatch. Did DMTCP coordinator die uncleanly?
This means that when the DMTCP coordinator died, it sent a message to the
checkpoint thread within the user process. The message was ill-formed.
The current DMTCP code says that if a checkpoint thread receives an
ill-formed message from the coordinator, then it should die. It's not
hard to change the protocol between DMTCP coordinator and checkpoint
thread of the user process into a more robust protocol with RETRY, further
ACK, etc. We haven't done this. Right now, the user simply restarts from
the last checkpoint. If one process of a computation has been compromised
(either DMTCP coordinator or user process), then the whole computation
has been compromised. I think in a previous version of DMTCP, the policy
was to allow the computation to continue when the coordinator dies.
Policies change.
But I think you're missing the larger point. We've developed DMTCP
over six years, largely with programmers who are much less experienced
than the kernel developers. Yet DMTCP works reliably for many users.
I consider this a credit to the DMTCP design. The Linux C/R design
is also excellent.
Can we get back to questions of design, using the implementations as
reference implementations? If you don't object, I'll also skip replying
to the other post, since I think we're getting too detailed. I'm having
trouble keeping up with the posts. :-) An offline discussion will
give us time to look more carefully at these issues, and draw more
careful conclusions.
Thanks,
- Gene
|
|
| | Topic: Re: Need help to debug freeze on kernel side (somehow related to lxc) |
|---|
| Re: Need help to debug freeze on kernel side (somehow related to lxc) [message #41906] |
Fri, 05 November 2010 08:41 |
pva Messages: 2 Registered: October 2010 |
Junior Member |
From: *parallels.com
|
|
Thank you Matt, for your help!
I've changed the subject a bit to make it clearer that the lxc freezer itself
is not involved (I ran the checks you provided, just to be sure). Now the
server has frozen again and I had some time to gather a bit of information.
Yet I'm still unsure what to do about this freeze.
By freeze I mean that `ps aux` output freezes at some point and I'm
unable to kill it with ctrl+C. strace showed that it hangs on
reading the /proc/3780/cmdline file (the environ file is unreadable too). The
exe symlink points to /usr/sbin/sshd and this time I was unable to ssh,
while previously it was possible (so different processes end up in the
same situation from time to time). This process does not belong to
any cgroup (it's in the / cgroup). kill/kill -9 3780 did nothing. I've tried to
gather more proc information from /proc/3780 (in attachment); there
is also kern.log with some sysrq information (memory info, kernel dump and
similar). Could you help me see what other information could be of
interest here? How do I find out where sshd hung and why? I
thought /proc/3780/syscall could help here, but I could not work out what
this file contains, and the numbers in it are not addresses of functions in
System.map (or grep was unable to find them). Any suggestions, please?
With best regards,
--
Peter.
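On the /proc/<pid>/syscall question above: as far as proc(5) describes it, the first field is a system call number (to be looked up in the architecture's syscall table), followed by argument registers and then the stack pointer and program counter, which would explain why none of the values match function addresses in System.map. The file holds "running" when the task is not blocked, and "-1" plus stack pointer and program counter when it is blocked outside a system call. A small sketch of classifying such a line follows; `parse_syscall_line` and the enum names are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

enum syscall_state {
    SC_RUNNING,            /* file contains "running" */
    SC_BLOCKED_NO_SYSCALL, /* first field is -1 */
    SC_IN_SYSCALL,         /* first field is a syscall number */
    SC_UNPARSED            /* anything else */
};

/*
 * Classify one line as read from /proc/<pid>/syscall.
 * On SC_IN_SYSCALL, *nr receives the system call number.
 */
enum syscall_state parse_syscall_line(const char *line, long *nr)
{
    long n;

    if (strncmp(line, "running", 7) == 0)
        return SC_RUNNING;
    if (sscanf(line, "%ld", &n) != 1)
        return SC_UNPARSED;
    if (n == -1)
        return SC_BLOCKED_NO_SYSCALL;
    *nr = n;
    return SC_IN_SYSCALL;
}
```

The remaining fields are hexadecimal register values; they are deliberately ignored here, since only the first field is needed to tell which call the task is sitting in.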
On Sat, 30/10/2010 at 17:36 -0700, Matt Helsley wrote:
> On Fri, Oct 29, 2010 at 04:27:40PM +0400, Пётр Волков wrote:
> > Hi. We are using lxc to separate different services into containers: for
> > this discussion, we have apache+php, mysql, nginx containers to serve
> > our web application. After an upgrade (I think from kernel 2.6.32 to
> > something newer; now we are using 2.6.35, but tried 34 too) we've
> > experienced the following issue: at some point nginx starts to show us a
> > "504 Gateway Time-out" error and, while it is possible to ssh to the
> > server, `ps aux` hangs (with no way to stop it), it is impossible to restart
> > the apache container (it hangs on stop) and the only way to fix this is to
> > restart the server using sysrq or the power button. At the same time there is
> > nothing in the logs. I suspect apache starts to eat lots of memory and the
> > oom killer somehow freezes the container, but I don't have any proof. What
>
> The OOM killer does not freeze tasks. Now if the tasks were already
> frozen and if the OOM killer selected them then I can see how that
> would be a problem. However, again I doubt that's what's happening here
> for several reasons.
>
> 1. lxc doesn't arbitrarily freeze tasks -- unless you were checkpointing
> or freezing the task yourself (or using a custom script to do
> so), the tasks in the container's cgroup should not be frozen.
>
> 2. If the task(s) are frozen then by definition they are not allocating
> memory. At best they're pinning the memory they've already
> allocated before being frozen. [ The tasks will respond to
> kill signals when thawed. ]
>
> > could you suggest for debugging this issue? What sysrq information could be
> > useful here?
>
> [ Cc'ing lxc-users@lists.sf.net for lxc-specific debugging ideas/advice. ]
>
> Here's some info on collecting and diagnosing the state of the freezer
> so that hopefully we can eliminate your concerns about it being involved
> and confirm what I've said above:
>
> If you want to figure out if the cgroup freezer is involved at all
> debugging it requires that you be in the "host". Find out which
> process ids are your apache/nginx/etc processes. Then look at their
> cgroups in /proc/<pid>/cgroup. Keep in mind that the "/" in those
> paths isn't the same as "/" -- it's the directory the cgroup
> subsystems are mounted at (see /proc/mounts to figure out where).
> You want the line that says "freezer".
>
> Look at the cgroups mount point with the freezer subsystem in the
> cgroup(s) of these processes (it'll say "freezer" in the mount options).
> Confirm that your pids are listed in the cgroup by looking at the tasks
> file.
>
> If the freezer.state file of those cgroups contains the word "THAWED"
> then the problem lies elsewhere. If the freezer.state says "FREEZING"
> or "FROZEN" however then you'll want to look at the state of the
> processes. Some or all should be in the "D" state while "FREEZING".
> All should be in "D" state while "FROZEN".
>
> "FREEZING" is an intermediate state however so it's not possible to
> determine if there's a bug based purely on the info collected so far.
> The best you can do with "FREEZING" is try and write "FROZEN" into
> freezer.state one or more times and see if it 'eventually' succeeds
> -- say within 10 seconds or 20 attempts, whichever takes longer.
> If it doesn't then you need to strace the processes and see if any
> are stuck in a syscall -- vfork perhaps. You can also try writing
> "THAWED". If it doesn't thaw on the first try then there's a bug.
>
> Whenever you write a new state to freezer.state you should read the
> file again to find out whether the state change took place. Some
> transitions are handled lazily and only take place when you ask for
> the state by reading it.
>
> That's the way to figure out if the freezer is involved and, if so,
> where it's stuck.
>
> Cheers,
> -Matt Helsley
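The checks described above can be sketched as two small helpers: one decides whether a /proc/<pid>/cgroup line (format "id:controllers:path") lists the freezer controller, and one classifies the text read from freezer.state. The function and enum names are hypothetical; reading the files and retrying writes is left to the caller.

```c
#include <stdbool.h>
#include <string.h>

enum freezer_state { FS_THAWED, FS_FREEZING, FS_FROZEN, FS_UNKNOWN };

/* True if a /proc/<pid>/cgroup line ("id:controllers:path") lists `ctrl`. */
bool cgroup_line_has_controller(const char *line, const char *ctrl)
{
    const char *start = strchr(line, ':');
    if (!start)
        return false;
    start++;                              /* first controller name */
    const char *end = strchr(start, ':'); /* end of controller list */
    if (!end)
        return false;

    size_t want = strlen(ctrl);
    while (start < end) {
        const char *comma = memchr(start, ',', (size_t)(end - start));
        const char *tok_end = comma ? comma : end;
        /* Compare whole tokens so "cpu" does not match "cpuacct". */
        if ((size_t)(tok_end - start) == want &&
            strncmp(start, ctrl, want) == 0)
            return true;
        start = tok_end + 1;
    }
    return false;
}

/* Classify the contents of a freezer.state file. */
enum freezer_state parse_freezer_state(const char *text)
{
    if (strncmp(text, "THAWED", 6) == 0)
        return FS_THAWED;
    if (strncmp(text, "FREEZING", 8) == 0)
        return FS_FREEZING;
    if (strncmp(text, "FROZEN", 6) == 0)
        return FS_FROZEN;
    return FS_UNKNOWN;
}
```

A caller following the advice above would re-read freezer.state after every write and treat FS_FREEZING as "not decided yet" rather than as a failure.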
1:blkio,freezer,devices,memory,cpuacct,cpu,ns,debug,cpuset:/
7d0bc56000-7d0bcca000 r-xp 00000000 fe:00 221 /usr/sbin/sshd
7d0bec9000-7d0becb000 r--p 00073000 fe:00 221 /usr/sbin/sshd
7d0becb000-7d0becc000 rw-p 00075000 fe:00 221 /usr/sbin/sshd
7d0becc000-7d0befd000 rw-p 00000000 00:00 0 [heap]
312a1beb000-312a1bf6000 r-xp 00000000 08:03 7686 /lib64/libnss_files-2.12.1.so (deleted)
312a1bf6000-312a1df6000 ---p 0000b000 08:03 7686 /lib64/libnss_files-2.12.1.so (deleted)
312a1df6000-312a1df7000 r--p 0000b000 08:03 7686 /lib64/libnss_files-2.12.1.so (deleted)
312a1df7000-312a1df8000 rw-p 0000c000 08:03 7686 /lib64/libnss_files-2.12.1.so (deleted)
312a1df8000-312a1e02000 r-xp 00000000 08:03 7684 /lib64/libnss_nis-2.12.1.so (deleted)
312a1e02000-312a2001000 ---p 0000a000 08:03 7684 /lib64/libnss_nis-2.12.1.so (deleted)
312a2001000-312a2002000 r--p 00009000 08:03 7684 /lib64/libnss_nis-2.12.1.so (deleted)
312a2002000-312a2003000 rw-p 0000a000 08:03 7684 /lib64/libnss_nis-2.12.1.so (deleted)
312a2003000-312a2018000 r-xp 00000000 08:03 7687 /lib64/libnsl-2.12.1.so (deleted)
312a2018000-312a2217000 ---p 00015000 08:03 7687 /lib64/libnsl-2.12.1.so (deleted)
312a2217000-312a2218000 r--p 00014000 08:03 7687 /lib64/libnsl-2.12.1.so (deleted)
312a2218000-312a2219000 rw-p 00015000 08:03 7687 /lib64/libnsl-2.12.1.so (deleted)
312a2219000-312a221b000 rw-p 00000000 00:00 0
312a221b000-312a2222000 r-xp 00000000 08:03 7591 /lib64/libnss_compat-2.12.1.so (deleted)
312a2222000-312a2421000 ---p 00007000 08:03 7591 /lib64/libnss_compat-2.12.1.so (deleted)
312a2421000-312a2422000 r--p 00006000 08:03 7591 /lib64/libnss_compat-2.12.1.so (deleted)
312a2422000-312a2423000 rw-p 00007000 08:03 7591 /lib64/libnss_compat-2.12.1.so (deleted)
312a2423000-312a2425000 r-xp 00000000 08:03 7682 /lib64/libdl-2.12.1.so (deleted)
312a2425000-312a2625000 ---p 00002000 08:03 7682 /lib64/libdl-2.12.1.so (deleted)
312a2625000-312a2626000 r--p 00002000 08:03 7682 /lib64/libdl-2.12.1.so (deleted)
312a2626000-312a2627000 rw-p 00003000 08:03 7682 /lib64/libdl-2.12.1.so (deleted)
312a2627000-312a2784000 r-xp 00000000 08:03 7689 /lib64/libc-2.12.1.so (deleted)
312a2784000-312a2983000 ---p 0015d000 08:03 7689 /lib64/libc-2.12.1.so (deleted)
312a2983000-312a2987000 r--p 0015c000 08:03 7689 /lib64/libc-2.12.1.so (deleted)
312a2987000-312a2988000 rw-p 00160000 08:03 7689 /lib64/libc-2.12.1.so (deleted)
312a2988000-312a298d000 rw-p 00000000 00:00 0
312a298d000-312a2995000 r-xp 00000000 08:03 7473 /lib64/libcrypt-2.12.1.so (deleted)
312a2995000-312a2b94000 ---p 00008000 08:03 7473 /lib64/libcrypt-2.12.1.so (deleted)
312a2b94000-312a2b95000 r--p 00007000 08:03 7473 /lib64/libcrypt-2.12.1.so (deleted)
312a2b95000-312a2b96000 rw-p 00008000 08:03 7473 /lib64/libcrypt-2.12.1.so (deleted)
312a2b96000-312a2bc4000 rw-p 00000000 00:00 0
312a2bc4000-312a2bc6000 r-xp 00000000 08:03 7483 /lib64/libutil-2.12.1.so (deleted)
312a2bc6000-312a2dc5000 ---p 00002000 08:03 7483 /lib64/libutil-2.12.1.so (deleted)
312a2dc5000-312a2dc6000 r--p 00001000 08:03 7483 /lib64/libutil-2.12.1.so (deleted)
312a2dc6000-312a2dc7000 rw-p 00002000 08:03 7483 /lib64/libutil-2.12.1.so (deleted)
312a2dc7000-312a2ddf000 r-xp 00000000 08:03 86 /lib64/libz.so.1.2.5
312a2ddf000-312a2fde000 ---p 00018000 08:03 86 /lib64/libz.so.1.2.5
312a2fde000-312a2fdf000 r--p 00017000 08:03 86 /lib64/libz.so.1.2.5
312a2fdf000-312a2fe0000 r
...
Attachment: 3780-cgroup
(Size: 0.06KB, Downloaded 58 times)
Attachment: 3780-maps
(Size: 6.14KB, Downloaded 66 times)
Attachment: 3780-sched
(Size: 2.37KB, Downloaded 52 times)
Attachment: 3780-schedstat
(Size: 0.01KB, Downloaded 87 times)
Attachment: 3780-stat
(Size: 0.21KB, Downloaded 91 times)
Attachment: 3780-statm
(Size: 0.02KB, Downloaded 83 times)
Attachment: 3780-status
(Size: 0.72KB, Downloaded 70 times)
Attachment: 3780-syscall
(Size: 0.03KB, Downloaded 85 times)
Attachment: kern.log
(Size: 129.49KB, Downloaded 68 times)
|
|
| | Topic: Re: [PATCH] cgroup: prefer [kv]zalloc over [kv]malloc+memset in memory controller code. |
|---|
| Re: [PATCH] cgroup: prefer [kv]zalloc over [kv]malloc+memset in memory controller code. [message #41902] |
Wed, 03 November 2010 10:15 |
Wu Fengguang Messages: 5 Registered: October 2010 |
Junior Member |
From: *parallels.com
|
|
On Mon, Nov 01, 2010 at 08:59:13PM +0100, Jesper Juhl wrote:
> @@ -4169,13 +4169,11 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> */
> if (!node_state(node, N_NORMAL_MEMORY))
> tmp = -1;
> - pn = kmalloc_node(sizeof(*pn), GFP_KERNEL, tmp);
> + pn = kmalloc_node(sizeof(*pn), GFP_KERNEL|__GFP_ZERO, tmp);
Use the simpler kzalloc_node()? It's introduced here:
commit 979b0fea2d9ae5d57237a368d571cbc84655fba6
Author: Jeff Layton <jlayton@redhat.com>
Date: Thu Jun 5 22:47:00 2008 -0700
vm: add kzalloc_node() inline
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Thanks,
Fengguang
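For readers outside the kernel, the relationship between passing a zeroing flag and using a zeroing wrapper can be illustrated with a user-space analogue. This is only an analogy, not kernel code: `xmalloc_flags`, `xzalloc` and `XF_ZERO` are hypothetical stand-ins for kmalloc_node(), kzalloc_node() and __GFP_ZERO.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag standing in for __GFP_ZERO. */
#define XF_ZERO 0x1u

/* Allocate `size` bytes; zero them when XF_ZERO is set in `flags`. */
void *xmalloc_flags(size_t size, unsigned flags)
{
    void *p = malloc(size);

    if (p && (flags & XF_ZERO))
        memset(p, 0, size);
    return p;
}

/* Thin wrapper: the caller no longer has to remember the zeroing flag. */
static inline void *xzalloc(size_t size, unsigned flags)
{
    return xmalloc_flags(size, flags | XF_ZERO);
}
```

The wrapper adds no behavior of its own; it only makes the intent visible at the call site, which is the readability argument for preferring it.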
|
|
| | Topic: Re: [PATCH] cgroup: prefer [kv]zalloc over [kv]malloc+memset in memory controller code. |
|---|
| Re: [PATCH] cgroup: prefer [kv]zalloc over [kv]malloc+memset in memory controller code. [message #41901] |
Tue, 02 November 2010 08:24 |
Johannes Weiner Messages: 9 Registered: November 2010 |
Junior Member |
From: *parallels.com
|
|
On Mon, Nov 01, 2010 at 08:59:13PM +0100, Jesper Juhl wrote:
> On Mon, 1 Nov 2010, Johannes Weiner wrote:
>
> > On Mon, Nov 01, 2010 at 08:40:56PM +0100, Jesper Juhl wrote:
> > > In mem_cgroup_alloc() we currently do either kmalloc() or vmalloc() then
> > > followed by memset() to zero the memory. This can be more efficiently
> > > achieved by using kzalloc() and vzalloc().
> > >
> > > Signed-off-by: Jesper Juhl <jj@chaosbits.net>
> >
> > Looks good to me, but there is also the memset after kmalloc in
> > alloc_mem_cgroup_per_zone_info(). Can you switch that over as well
> > in this patch? You can pass __GFP_ZERO to kmalloc_node() for zeroing.
>
> Sure thing.
>
> Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Thanks.
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
|
|
| | Topic: [PATCH] cgroup: prefer [kv]zalloc over [kv]malloc+memset in memory controller code. |
|---|
| [PATCH] cgroup: prefer [kv]zalloc over [kv]malloc+memset in memory controller code. [message #41962] |
Mon, 01 November 2010 15:40 |
Jesper Juhl Messages: 7 Registered: October 2010 |
Junior Member |
From: *parallels.com
|
|
Hi (please CC me on replies),
Apologies to those who receive this multiple times. I screwed up the To:
field in my original mail :-(
In mem_cgroup_alloc() we currently do either kmalloc() or vmalloc() then
followed by memset() to zero the memory. This can be more efficiently
achieved by using kzalloc() and vzalloc().
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
---
memcontrol.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9a99cfa..90da698 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4199,14 +4199,13 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
/* Can be very big if MAX_NUMNODES is very big */
if (size < PAGE_SIZE)
- mem = kmalloc(size, GFP_KERNEL);
+ mem = kzalloc(size, GFP_KERNEL);
else
- mem = vmalloc(size);
+ mem = vzalloc(size);
if (!mem)
return NULL;
- memset(mem, 0, size);
mem->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
if (!mem->stat) {
if (size < PAGE_SIZE)
--
Jesper Juhl <jj@chaosbits.net> http://www.chaosbits.net/
Plain text mails only, please http://www.expita.com/nomime.html
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
|
|
| | Topic: [PATCH] cgroup: prefer [kv]zalloc over [kv]malloc+memset in memory controller code. |
|---|
| [PATCH] cgroup: prefer [kv]zalloc over [kv]malloc+memset in memory controller code. [message #41961] |
Mon, 01 November 2010 15:35 |
Jesper Juhl Messages: 7 Registered: October 2010 |
Junior Member |
From: *parallels.com
|
|
Hi (please CC me on replies),
In mem_cgroup_alloc() we currently do either kmalloc() or vmalloc() then
followed by memset() to zero the memory. This can be more efficiently
achieved by using kzalloc() and vzalloc().
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
---
memcontrol.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9a99cfa..90da698 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4199,14 +4199,13 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
/* Can be very big if MAX_NUMNODES is very big */
if (size < PAGE_SIZE)
- mem = kmalloc(size, GFP_KERNEL);
+ mem = kzalloc(size, GFP_KERNEL);
else
- mem = vmalloc(size);
+ mem = vzalloc(size);
if (!mem)
return NULL;
- memset(mem, 0, size);
mem->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
if (!mem->stat) {
if (size < PAGE_SIZE)
--
Jesper Juhl <jj@chaosbits.net> http://www.chaosbits.net/
Plain text mails only, please http://www.expita.com/nomime.html
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
|
|
| | Topic: Re: [PATCH v4 11/11] memcg: check memcg dirty limits in page writeback |
|---|
| Re: [PATCH v4 11/11] memcg: check memcg dirty limits in page writeback [message #41899] |
Sun, 31 October 2010 16:03 |
Wu Fengguang Messages: 5 Registered: October 2010 |
Junior Member |
From: *parallels.com
|
|
On Sat, Oct 30, 2010 at 12:06:33AM +0800, Greg Thelen wrote:
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
>
> > On Fri, 29 Oct 2010 00:09:14 -0700
> > Greg Thelen <gthelen@google.com> wrote:
> >
> >> If the current process is in a non-root memcg, then
> >> balance_dirty_pages() will consider the memcg dirty limits
> >> as well as the system-wide limits. This allows different
> >> cgroups to have distinct dirty limits which trigger direct
> >> and background writeback at different levels.
> >>
> >> Signed-off-by: Andrea Righi <arighi@develer.com>
> >> Signed-off-by: Greg Thelen <gthelen@google.com>
> >
> > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
The "check both memcg&global dirty limit" looks much more sane than
the V3 implementation. Although it still has misbehaviors in some
cases, it's generally a good new feature to have.
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
> > Ideally, I think some comments in the code for "why we need double-check system's
> > dirty limit and memcg's dirty limit" will be appreciated.
>
> I will add to the balance_dirty_pages() comment. It will read:
> /*
> * balance_dirty_pages() must be called by processes which are generating dirty
> * data. It looks at the number of dirty pages in the machine and will force
> * the caller to perform writeback if the system is over `vm_dirty_ratio'.
~~~~~~~~~~~~~~~~~ ~~~~
To be exact, it tries to throttle the dirty speed so that
vm_dirty_ratio is not exceeded. In fact balance_dirty_pages() starts
throttling the dirtier slightly below vm_dirty_ratio.
> * If we're over `background_thresh' then the writeback threads are woken to
> * perform some writeout. The current task may have per-memcg dirty
> * limits, which are also checked.
> */
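The "check both limits" rule discussed in this message can be written down as a toy predicate. This is a simplified sketch, not the patch's code: `struct dirty_counts` and `over_dirty_limits` are hypothetical, and the real balance_dirty_pages() begins throttling slightly below the threshold, as noted above, which this sketch does not model.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pair of counters: pages currently dirty and the limit. */
struct dirty_counts {
    unsigned long dirty;
    unsigned long limit;
};

/*
 * A task is over its dirty limits when either the system-wide limit or
 * its own memcg limit is exceeded. `memcg` is NULL for a task in the
 * root cgroup, in which case only the global limit applies.
 */
bool over_dirty_limits(const struct dirty_counts *global,
                       const struct dirty_counts *memcg)
{
    if (global->dirty > global->limit)
        return true;
    return memcg != NULL && memcg->dirty > memcg->limit;
}
```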
|
|
| | Topic: Re: [PATCH v4 02/11] memcg: document cgroup dirty memory interfaces |
|---|
| Re: [PATCH v4 02/11] memcg: document cgroup dirty memory interfaces [message #41898] |
Fri, 29 October 2010 23:02 |
Wu Fengguang Messages: 5 Registered: October 2010 |
Junior Member |
From: *parallels.com
|
|
On Sat, Oct 30, 2010 at 05:35:50AM +0800, Greg Thelen wrote:
> >> +A cgroup may contain more dirty memory than its dirty limit. This is possible
> >> +because of the principle that the first cgroup to touch a page is charged for
> >> +it. Subsequent page counting events (dirty, writeback, nfs_unstable) are also
> >> +counted to the originally charged cgroup.
> >> +
> >> +Example: If page is allocated by a cgroup A task, then the page is charged to
> >> +cgroup A. If the page is later dirtied by a task in cgroup B, then the cgroup A
> >> +dirty count will be incremented. If cgroup A is over its dirty limit but cgroup
> >> +B is not, then dirtying a cgroup A page from a cgroup B task may push cgroup A
> >> +over its dirty limit without throttling the dirtying cgroup B task.
> >
> > It's good to document the above "misbehavior". But why not throttle
> > the dirtying cgroup B task? Is it simply not implemented, or does it make
> > no sense to do so at all?
>
> Ideally cgroup B would be throttled. Note, even with this misbehavior,
> the system dirty limit will keep cgroup B from exceeding system-wide
> limits.
Yeah. And I'm OK with the current behavior, since
1) it does not impact the global limits
2) the common memcg usage (the workload you care about) does not seem to
share pages between memcgs much
So I'm OK with improving it in the future when the need arises.
> The challenge here is that the current system increments dirty
> counters using account_page_dirtied(), which does not immediately check
> against dirty limits. Later balance_dirty_pages() checks to see if any
> limits were exceeded, but only after a batch of pages may have been
> dirtied. The task may have written many pages in many different memcg.
> So checking all possible memcg that may have been written in the mapping
> may be a large set. I do not like this approach.
Me too.
> memcontrol.c can easily detect when memcg other than the current task's
> memcg is charged for a dirty page. It does not record this today, but
> it could. When such a foreign page dirty event occurs the associated
> memcg could be linked into the dirtying address_space so that
> balance_dirty_pages() could check the limits of all foreign memcg. In
> the common case I think the task is dirtying pages that have been
> charged to the task's cgroup, so the address_space's foreign_memcg list
> would be empty. But when such foreign memcg are dirtied
> balance_dirty_pages() would have access to references to all memcg that
> need dirty limits checking. This approach might work. Comments?
It still introduces the complexity of maintaining the foreign memcg <=>
task mutual links.
Another approach may be to add a parameter "struct page *page" to
balance_dirty_pages(). Then balance_dirty_pages() can check the memcg
that is associated with the _current_ dirtied page. It may not catch
all foreign memcgs, but should work fine with good probability
without introducing a new data structure.
Thanks,
Fengguang
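A rough user-space sketch of the per-page check suggested above may help the discussion. It is not kernel code: struct group, struct page_info and should_throttle() are hypothetical names invented for illustration, and the real balance_dirty_pages() does far more than this. The sketch only shows the decision: throttle when the task's own group is over its limit, or when the group charged for the page being dirtied is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg: a dirty page count and its limit. */
struct group {
	long nr_dirty;
	long dirty_limit;
};

/* Hypothetical stand-in for a page: it only remembers who was charged. */
struct page_info {
	struct group *owner;
};

static bool over_limit(const struct group *g)
{
	return g != NULL && g->nr_dirty > g->dirty_limit;
}

/*
 * Decide whether the dirtying task should be throttled. The task's own
 * group is always checked; the group charged for the page is checked too
 * when it differs from the task's group (the "foreign" case).
 */
static bool should_throttle(const struct group *task_group,
			    const struct page_info *page)
{
	assert(task_group != NULL);

	if (over_limit(task_group))
		return true;
	if (page != NULL && page->owner != task_group)
		return over_limit(page->owner);
	return false;
}
```

With cgroup A over its limit and cgroup B under it, a cgroup B task dirtying a page charged to A would be throttled here, while the same task dirtying its own pages would not. Only the page passed in is examined, so other foreign groups dirtied in the same batch are missed, as noted above.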
|
|
| | Topic: Re: [PATCH 0/5][v5][cr] Checkpoint/restart file locks |
|---|
| Re: [PATCH 0/5][v5][cr] Checkpoint/restart file locks [message #41954] |
Fri, 29 October 2010 10:31 |
Lin Ming Messages: 1 Registered: October 2010 |
Junior Member |
From: *sh.intel.com
|
|
On Fri, Oct 29, 2010 at 2:16 PM, Sukadev Bhattiprolu
<sukadev@linux.vnet.ibm.com> wrote:
> Checkpoint/restart file locks.
>
> Changelog[v5]:
>         - This patchset only checkpoints/restores file locks. C/R of
>           file-owner and file-leases will be addressed in follow-on patches.
>           C/R of file-owner information must deal with nested-containers
>           and will need a way to C/R struct pids. C/R of file-leases depends
>           on C/R of file-owner information.
>
>
> Sukadev Bhattiprolu (5):
>   Move file_lock macros into linux/fs.h
>   Define flock_set()
>   Define flock64_set()
>   Checkpoint/restore file-locks
>   Document design of C/R of file-locks and leases
>
>  Documentation/checkpoint/file-locks |   52 ++++++
>  fs/checkpoint.c                     |  318 +++++++++++++++++++++++++++++++++--
>  fs/locks.c                          |   89 ++++++----
>  include/linux/checkpoint_hdr.h      |   17 ++
>  include/linux/fs.h                  |   10 +
>  5 files changed, 433 insertions(+), 53 deletions(-)
>  create mode 100644 Documentation/checkpoint/file-locks
Hi,
Which tree are these patches against?
I can't apply them to either Linus' tree (18cb657c) or the
vfs-2.6.git/for-linus branch (a4cdbd8b).
Lin Ming
|
|
| | Topic: Need help to debug container's freeze |
|---|
| Need help to debug container's freeze [message #41894] |
Fri, 29 October 2010 08:27 |
pva Messages: 2 Registered: October 2010 |
Junior Member |
From: *parallels.com
|
|
Hi. We are using lxc to separate different services into containers: for
this discussion, we have apache+php, mysql and nginx containers serving
our web application. After an upgrade (I think from kernel 2.6.32 to
something newer; we are now using 2.6.35, but tried 2.6.34 too) we have
experienced the following issue: at some point nginx starts to show us a
"504 Gateway Time-out" error. While it is still possible to ssh into the
server, `ps aux` hangs (with no way to stop it), it is impossible to
restart the apache container (it hangs on stop), and the only way to fix
this is to restart the server using sysrq or the power button. At the same
time there is nothing in the logs. I suspect apache starts to eat lots of
memory and the oom killer somehow freezes the container, but I don't have
any proof. What could you suggest to debug this issue? What sysrq
information could be useful here?
--
Peter.
|
|
| | Topic: Re: [PATCH v4 06/11] memcg: add dirty page accounting infrastructure |
|---|
| Re: [PATCH v4 06/11] memcg: add dirty page accounting infrastructure [message #41897] |
Fri, 29 October 2010 07:13 |
Wu Fengguang Messages: 5 Registered: October 2010 |
Junior Member |
From: *parallels.com
|
|
On Fri, Oct 29, 2010 at 03:09:09PM +0800, Greg Thelen wrote:
> +
> + case MEMCG_NR_FILE_DIRTY:
> + /* Use Test{Set,Clear} to only un/charge the memcg once. */
> + if (val > 0) {
> + if (TestSetPageCgroupFileDirty(pc))
> + val = 0;
> + } else {
> + if (!TestClearPageCgroupFileDirty(pc))
> + val = 0;
> + }
I'm wondering why TestSet/TestClear and even the cgroup page flags for
dirty/writeback/unstable pages are necessary at all (it would help to
document the reasons in the changelog, if there are any). For example,
VFS will call TestSetPageDirty() before calling
mem_cgroup_inc_page_stat(MEMCG_NR_FILE_DIRTY), so there should be no
chance of false double counting.
Thanks,
Fengguang
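For readers following the question, here is a small user-space sketch of what the quoted hunk does with the test-and-set pattern. It is an illustration only: struct page_state, struct group_stat and update_dirty_stat() are hypothetical names, and the flag helpers below are plain, non-atomic stand-ins for the TestSet/TestClear page_cgroup operations.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a page_cgroup flag and a memcg counter. */
struct page_state {
	bool file_dirty;
};

struct group_stat {
	long nr_file_dirty;
};

/* Set the flag and return its previous value (not atomic in this sketch). */
static bool test_set_dirty(struct page_state *pg)
{
	bool old = pg->file_dirty;

	pg->file_dirty = true;
	return old;
}

/* Clear the flag and return its previous value (not atomic in this sketch). */
static bool test_clear_dirty(struct page_state *pg)
{
	bool old = pg->file_dirty;

	pg->file_dirty = false;
	return old;
}

/* val is +1 to charge and -1 to uncharge; repeated calls count only once. */
static void update_dirty_stat(struct group_stat *st, struct page_state *pg,
			      int val)
{
	assert(val == 1 || val == -1);

	if (val > 0) {
		if (test_set_dirty(pg))
			val = 0;	/* already charged, do not count twice */
	} else {
		if (!test_clear_dirty(pg))
			val = 0;	/* never charged, nothing to undo */
	}
	st->nr_file_dirty += val;
}
```

Charging the same page twice leaves the counter at 1, and uncharging it twice brings it back to 0 rather than -1. Whether this guard is redundant given the TestSetPageDirty() call in VFS is exactly the open question above; the sketch does not answer it.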
|
|
| | Topic: Re: [PATCH v4 02/11] memcg: document cgroup dirty memory interfaces |
|---|
| Re: [PATCH v4 02/11] memcg: document cgroup dirty memory interfaces [message #41896] |
Fri, 29 October 2010 07:03 |
Wu Fengguang Messages: 5 Registered: October 2010 |
Junior Member |
From: *parallels.com
|
|
Hi Greg,
On Fri, Oct 29, 2010 at 03:09:05PM +0800, Greg Thelen wrote:
> Document cgroup dirty memory interfaces and statistics.
>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> ---
> +Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
> +page cache used by a cgroup. So, in case of multiple cgroup writers, they will
> +not be able to consume more than their designated share of dirty pages and will
> +be forced to perform write-out if they cross that limit.
It's more pertinent to say "will be throttled", as "perform write-out"
is some implementation behavior that will change soon.
> +- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
> + in the cgroup at which a process generating dirty pages will start itself
> + writing out dirty data. Suffix (k, K, m, M, g, or G) can be used to indicate
> + that value is kilo, mega or gigabytes.
The suffix feature is handy, thanks! It makes sense to also add this
for the global interfaces, perhaps in a standalone patch.
> +A cgroup may contain more dirty memory than its dirty limit. This is possible
> +because of the principle that the first cgroup to touch a page is charged for
> +it. Subsequent page counting events (dirty, writeback, nfs_unstable) are also
> +counted to the originally charged cgroup.
> +
> +Example: If page is allocated by a cgroup A task, then the page is charged to
> +cgroup A. If the page is later dirtied by a task in cgroup B, then the cgroup A
> +dirty count will be incremented. If cgroup A is over its dirty limit but cgroup
> +B is not, then dirtying a cgroup A page from a cgroup B task may push cgroup A
> +over its dirty limit without throttling the dirtying cgroup B task.
It's good to document the above "misbehavior". But why not throttle
the dirtying cgroup B task? Is it simply not implemented, or does it
make no sense to do so at all?
Thanks,
Fengguang
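As an illustration of the suffix handling discussed above, here is a minimal user-space sketch. It assumes binary multiples (k = 1024 bytes), in the spirit of the kernel's memparse() helper; parse_size() is a hypothetical name, not an interface from the patch, and overflow is deliberately not checked.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse "<digits>[kKmMgG]" into a byte count.
 * Returns 0 on success and -1 on malformed input.
 */
static int parse_size(const char *s, uint64_t *bytes)
{
	char *end;
	unsigned long long val;

	assert(s != NULL && bytes != NULL);

	/* strtoull() would accept whitespace and signs; require a digit. */
	if (*s < '0' || *s > '9')
		return -1;

	val = strtoull(s, &end, 10);

	switch (*end) {
	case 'g': case 'G':
		val <<= 10;
		/* fall through */
	case 'm': case 'M':
		val <<= 10;
		/* fall through */
	case 'k': case 'K':
		val <<= 10;
		end++;
		break;
	default:
		break;
	}
	if (*end != '\0')
		return -1;	/* trailing characters after the suffix */

	*bytes = val;
	return 0;
}
```

With this sketch, "10M" parses to 10485760 bytes and "12x" is rejected. The fall-through switch applies one shift by 10 bits per unit step, so each larger suffix reuses the smaller ones.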
|
|
| | Topic: [PATCH] [openvz] printk: Handle global log buffer reallocation |
|---|
| [PATCH] [openvz] printk: Handle global log buffer reallocation [message #41953] |
Tue, 19 October 2010 04:11 |
maximilian attems Messages: 1 Registered: October 2010 |
Junior Member |
From: *parallels.com
|
|
From: Ben Hutchings <ben@decadent.org.uk>
Subject: [PATCH] [openvz] printk: Handle global log buffer reallocation
Date: Sun, 17 Oct 2010 02:24:28 +0100
Currently an increase in log_buf_len results in disaster, as
ve0.log_buf is left pointing to the old log buffer.
Update ve0.log_buf when the global log buffer is reallocated. Also
acquire logbuf_lock before reading ve_log_buf_len, to avoid a race
with reallocation.
Reported-and-tested-by: Tim Small <tim@seoss.co.uk>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: maximilian attems <max@stro.at>
---
The patch below fixes http://bugs.debian.org/600299
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -198,6 +198,9 @@
spin_lock_irqsave(&logbuf_lock, flags);
log_buf_len = size;
log_buf = new_log_buf;
+#ifdef CONFIG_VE
+ ve0.log_buf = log_buf;
+#endif
offset = start = min(con_start, log_start);
dest_idx = 0;
@@ -354,9 +357,9 @@
if (ve_log_buf == NULL)
goto out;
count = len;
+ spin_lock_irq(&logbuf_lock);
if (count > ve_log_buf_len)
count = ve_log_buf_len;
- spin_lock_irq(&logbuf_lock);
if (count > ve_logged_chars)
count = ve_logged_chars;
if (do_clear)
|
|
| | Topic: [PATCH -mm 3/3] i/o accounting and control |
|---|
| [PATCH -mm 3/3] i/o accounting and control [message #32144] |
Tue, 22 July 2008 16:58 |
Andrea Righi Messages: 65 Registered: May 2008 |
Member |
From: openvz.org
|
|
Apply the io-throttle controller to the appropriate kernel functions. Both
accounting and throttling are performed by cgroup_io_throttle().
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
block/blk-core.c | 9 +++++++++
fs/aio.c | 31 ++++++++++++++++++++++++++++++-
fs/buffer.c | 20 +++++++++++++++++---
fs/direct-io.c | 4 ++++
include/linux/sched.h | 3 +++
kernel/fork.c | 3 +++
mm/filemap.c | 18 +++++++++++++++++-
mm/page-writeback.c | 30 +++++++++++++++++++++++++++---
mm/readahead.c | 5 +++++
9 files changed, 115 insertions(+), 8 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 4c222ba..431294f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -26,6 +26,7 @@
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/interrupt.h>
#include <linux/cpu.h>
#include <linux/blktrace_api.h>
@@ -1482,7 +1483,15 @@ void submit_bio(int rw, struct bio *bio)
if (rw & WRITE) {
count_vm_events(PGPGOUT, count);
} else {
+ struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+
task_io_account_read(bio->bi_size);
+ /*
+ * Do not throttle page requests that need to be
+ * urgently reclaimed.
+ */
+ cgroup_io_throttle(bio->bi_bdev, bio->bi_size,
+ !(PageReclaim(page) || PageSwapCache(page)));
count_vm_events(PGPGIN, count);
}
diff --git a/fs/aio.c b/fs/aio.c
index 0051fd9..1f3abb3 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -22,6 +22,7 @@
#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/file.h>
+#include <linux/blk-io-throttle.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/slab.h>
@@ -1558,6 +1559,8 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
{
struct kiocb *req;
struct file *file;
+ struct block_device *bdev;
+ struct inode *inode;
ssize_t ret;
/* enforce forwards compatibility on users */
@@ -1580,10 +1583,26 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
if (unlikely(!file))
return -EBADF;
+ /*
+ * Pre-account AIO activity: we over-account *all* the bytes here;
+ * bytes read from the page cache and bytes written in already dirtied
+ * pages (that do not generate real i/o on block devices) will be
+ * subtracted later, following the path of aio_run_iocb().
+ */
+ inode = file->f_mapping->host;
+ bdev = inode->i_sb->s_bdev;
+ ret = cgroup_io_throttle(bdev, iocb->aio_nbytes, 0);
+ if (unlikely(ret)) {
+ fput(file);
+ ret = -EAGAIN;
+ goto out_cgroup_io_throttle;
+ }
+
req = aio_get_req(ctx); /* returns with 2 references to req */
if (unlikely(!req)) {
fput(file);
- return -EAGAIN;
+ ret = -EAGAIN;
+ goto out_cgroup_io_throttle;
}
req->ki_filp = file;
if (iocb->aio_flags & IOCB_FLAG_RESFD) {
@@ -1622,12 +1641,14 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
goto out_put_req;
spin_lock_irq(&ctx->ctx_lock);
+ set_in_aio();
aio_run_iocb(req);
if (!list_empty(&ctx->run_list)) {
/* drain the run list */
while (__aio_run_iocbs(ctx))
;
}
+ unset_in_aio();
spin_unlock_irq(&ctx->ctx_lock);
aio_put_req(req); /* drop extra ref to req */
return 0;
@@ -1635,6 +1656,8 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
out_put_req:
aio_put_req(req); /* drop extra ref to req */
aio_put_req(req); /* drop i/o ref to req */
+out_cgroup_io_throttle:
+ cgroup_io_throttle(bdev, -iocb->aio_nbytes, 0);
return ret;
}
@@ -1746,6 +1769,12 @@ asmlinkage long sys_io_cancel(aio_context_t ctx_id, struct iocb __user *iocb,
ret = -EAGAIN;
kiocb = lookup_kiocb(ctx, iocb, key);
if (kiocb && kiocb->ki_cancel) {
+ struct block_device *bdev;
+ struct inode *inode = kiocb->ki_filp->f_mapping->host;
+
+ bdev = inode->i_sb->s_bdev;
+ cgroup_io_throttle(bdev, -kiocb->ki_nbytes, 0);
+
cancel = kiocb->ki_cancel;
kiocb->ki_users ++;
kiocbSetCancelled(kiocb);
diff --git a/fs/buffer.c b/fs/buffer.c
index 4ffb5bb..6d4bf2c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -35,6 +35,7 @@
#include <linux/suspend.h>
#include <linux/buffer_head.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/bio.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
@@ -708,11 +709,14 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
static int __set_page_dirty(struct page *page,
struct address_space *mapping, int warn)
{
+ ssize_t cgroup_io_acct = 0;
+ int ret = 0;
+
if (unlikely(!mapping))
return !TestSetPageDirty(page);
if (TestSetPageDirty(page))
- return 0;
+ goto out;
spin_lock_irq(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
@@ -723,14 +727,24 @@ static int __set_page_dirty(struct page *page,
__inc_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
+ cgroup_io_acct = PAGE_CACHE_SIZE;
}
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
spin_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
- return 1;
+ ret = 1;
+out:
+ if (is_in_aio() && !cgroup_io_acct)
+ cgroup_io_acct = -PAGE_CACHE_SIZE;
+ if (cgroup_io_acct) {
+ struct block_device *bdev = (mapping->host &&
+ mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
+ cgroup_io_throttle(bdev, cgroup_io_acct, 0);
+ }
+ return ret;
}
/*
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 9606ee8..f5dcb91 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -35,6 +35,7 @@
#include <linux/buffer_head.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
+#include <linux/blk-io-throttle.h>
#include <asm/atomic.h>
/*
@@ -660,6 +661,9 @@ submit_page_section(struct dio *dio, struct page *page,
/*
* Read accounting is performed in submit_bio()
*/
+ struct block_device *bdev = dio->bio ?
+ dio->bio->bi_bdev : NULL;
+ cgroup_io_throttle(bdev, len, 1);
task_io_account_write(len);
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ba43675..9d4c755 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1250,6 +1250,9 @@ struct task_struct {
u64 rchar, wchar, syscr, syscw;
#endif
struct task_io_accounting ioac;
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ atomic_t in_aio;
+#endif
#if defined(CONFIG_TASK_XACCT)
u64 acct_rss_mem1; /* accumulated rss usage */
u64 acct_vm_mem1; /* accumulated virtual memory usage */
diff --git a/kernel/fork.c b/kernel/fork.c
index aed1ff7..f8cf5da 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1029,6 +1029,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
task_io_accounting_init(p);
acct_clear_integrals(p);
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ atomic_set(&p->in_aio, 0);
+#endif
p->it_virt_expires = cputime_zero;
p->it_prof_expires = cputime_zero;
p->it_sched_expires = 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index 7567d86..bb80789 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -13,6 +13,7 @@
#include <linux/slab.h>
#include <linux/compiler.h>
#include <linux/fs.h>
+#include <linux/blk-io-throttle.h>
#include <linux/uaccess.h>
#include <linux/aio.h>
#include <linux/capability.h>
@@ -1011,6 +1012,7 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
pgoff_t prev_index;
unsigned long offset; /* offset into pagecache page */
unsigned int prev_offset;
+ int was_page_ok = 0;
int error;
index = *ppos >> PAGE_CACHE_SHIFT;
@@ -1023,7 +1025,8 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
struct page *page;
pgoff_t end_index;
loff_t isize;
- unsigned long nr, ret;
+ ssize_t nr;
+ unsigned long ret;
cond_resched();
find_page:
@@ -1051,6 +1054,8 @@ find_page:
desc, offset))
goto page_not_up_to_date_locked;
unlock_page(page);
+ } else {
+ was_page_ok = 1;
}
page_ok:
/*
@@ -1080,6 +1085,17 @@ page_ok:
}
nr = nr - offset;
+ /*
+ * De-account i/o in case of AIO read from the page cache.
+ * AIO accounting was performed in io_submit_one().
+ */
+ if (is_in_aio() && was_page_ok) {
+ struct block_device *bdev = (inode &&
+ inode->i_sb->s_bdev) ?
+ inode->i_sb->s_bdev : NULL;
+ cgroup_io_throttle(bdev, -nr, 0);
+ }
+
/* If users can be writing to this page using arbitrary
* virtual addresses, take care about potential aliasing
* before reading the page on the kernel side.
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 29b1d1e..c6207de 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
#include <linux/init.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/blkdev.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
@@ -430,6 +431,9 @@ static void balance_dirty_pages(struct address_space *mapping)
unsigned long write_chunk = sync_writeback_pages();
struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct block_device *bdev = (mapping->host &&
+ mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
for (;;) {
struct writeback_control wbc = {
@@ -512,6 +516,14 @@ static void balance_dirty_pages(struct address_space *mapping)
return; /* pdflush is already workin
...
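The AIO hunks above follow a "charge everything up front, give back what did not hit the device" pattern. A minimal user-space sketch of that bookkeeping, under the assumption that a single signed byte counter is enough to show the idea; struct io_account, aio_submit() and aio_chunk_done() are hypothetical names and none of the throttling itself is modelled:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-cgroup byte counter. */
struct io_account {
	long long bytes;
};

/* Apply a signed delta; a negative delta de-accounts earlier bytes. */
static void account_io(struct io_account *acct, long long delta)
{
	acct->bytes += delta;
	assert(acct->bytes >= 0);	/* never give back more than was charged */
}

/* At submission time the whole request is charged (over-accounted). */
static void aio_submit(struct io_account *acct, long long nbytes)
{
	account_io(acct, nbytes);
}

/*
 * When a chunk is served without real block i/o (page cache hit, or a
 * write into an already dirty page), its bytes are subtracted again.
 */
static void aio_chunk_done(struct io_account *acct, long long chunk,
			   bool no_device_io)
{
	if (no_device_io)
		account_io(acct, -chunk);
}
```

For an 8192-byte request where one 4096-byte page comes from the page cache, the counter ends at 4096, which is the amount that actually reached the device.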
|
|
| | Topic: [PATCH -mm 2/3] i/o bandwidth controller infrastructure |
|---|
| [PATCH -mm 2/3] i/o bandwidth controller infrastructure [message #32143] |
Tue, 22 July 2008 16:58 |
Andrea Righi Messages: 65 Registered: May 2008 |
Member |
From: openvz.org
|
|
This is the core io-throttle kernel infrastructure. It creates the basic
interfaces to cgroups and implements the I/O measurement and throttling
functions.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
block/Makefile | 2 +
block/blk-io-throttle.c | 668 +++++++++++++++++++++++++++++++++++++++
include/linux/blk-io-throttle.h | 41 +++
include/linux/cgroup_subsys.h | 6 +
init/Kconfig | 10 +
5 files changed, 727 insertions(+), 0 deletions(-)
create mode 100644 block/blk-io-throttle.c
create mode 100644 include/linux/blk-io-throttle.h
diff --git a/block/Makefile b/block/Makefile
index 208000b..b3afc86 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -13,6 +13,8 @@ obj-$(CONFIG_IOSCHED_AS) += as-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
+obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o
+
obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
new file mode 100644
index 0000000..6b0aa45
--- /dev/null
+++ b/block/blk-io-throttle.c
@@ -0,0 +1,668 @@
+/*
+ * blk-io-throttle.c
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2008 Andrea Righi <righi.andrea@gmail.com>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/err.h>
+#include <linux/sched.h>
+#include <linux/genhd.h>
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/hardirq.h>
+#include <linux/list.h>
+#include <linux/seq_file.h>
+#include <linux/spinlock.h>
+#include <linux/uaccess.h>
+#include <linux/blk-io-throttle.h>
+
+#define IOTHROTTLE_BANDWIDTH 0
+#define IOTHROTTLE_IOPS 1
+
+/* The various types of throttling algorithms */
+enum iothrottle_strategy {
+ IOTHROTTLE_LEAKY_BUCKET = 0,
+ IOTHROTTLE_TOKEN_BUCKET = 1,
+};
+
+/**
+ * struct iothrottle_node - throttling rule of a single block device
+ * @node: list of per block device throttling rules
+ * @dev: block device number, used as key in the list
+ * @iorate: max i/o bandwidth (in bytes/s)
+ * @strategy: throttling strategy
+ * @timestamp: timestamp of the last i/o request for bandwidth limiting
+ * (in jiffies)
+ * @stat: i/o activity counter (leaky bucket only)
+ * @bucket_size: bucket size in bytes (token bucket only)
+ * @token: token counter (token bucket only)
+ * @iops: max i/o operations per second
+ * @iops_stat: i/o operations counter (leaky bucket policy)
+ * @iops_timestamp: timestamp of the last i/o request for iops/sec limiting
+ * (in jiffies)
+ *
+ * Define a i/o throttling rule for a single block device.
+ *
+ * NOTE: limiting rules always refer to dev_t; if a block device is unplugged
+ * the limiting rules defined for that device persist and they are still valid
+ * if a new device is plugged and it uses the same dev_t number.
+ */
+struct iothrottle_node {
+ struct list_head node;
+ dev_t dev;
+
+ u64 iorate;
+ enum iothrottle_strategy strategy;
+ unsigned long timestamp;
+ atomic_long_t stat;
+ s64 bucket_size;
+ atomic_long_t token;
+
+ u64 iops;
+ atomic_long_t iops_stat;
+ unsigned long iops_timestamp;
+};
+
+/**
+ * struct iothrottle - throttling rules for a cgroup
+ * @css: pointer to the cgroup state
+ * @lock: spinlock used to protect write operations in the list
+ * @list: list of iothrottle_node elements
+ *
+ * Define multiple per-block device i/o throttling rules.
+ * Note: the list of the throttling rules is protected by RCU locking.
+ */
+struct iothrottle {
+ struct cgroup_subsys_state css;
+ spinlock_t lock;
+ struct list_head list;
+};
+
+static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cgrp)
+{
+ return container_of(cgroup_subsys_state(cgrp, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
+{
+ return container_of(task_subsys_state(task, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+/*
+ * Note: called with rcu_read_lock() or iot->lock held.
+ */
+static struct iothrottle_node *
+iothrottle_search_node(const struct iothrottle *iot, dev_t dev)
+{
+ struct iothrottle_node *n;
+
+ list_for_each_entry_rcu(n, &iot->list, node)
+ if (n->dev == dev)
+ return n;
+ return NULL;
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline void iothrottle_insert_node(struct iothrottle *iot,
+ struct iothrottle_node *n)
+{
+ list_add_rcu(&n->node, &iot->list);
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline void
+iothrottle_replace_node(struct iothrottle *iot, struct iothrottle_node *old,
+ struct iothrottle_node *new)
+{
+ list_replace_rcu(&old->node, &new->node);
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline void
+iothrottle_delete_node(struct iothrottle *iot, struct iothrottle_node *n)
+{
+ list_del_rcu(&n->node);
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *
+iothrottle_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct iothrottle *iot;
+
+ iot = kmalloc(sizeof(*iot), GFP_KERNEL);
+ if (unlikely(!iot))
+ return ERR_PTR(-ENOMEM);
+
+ INIT_LIST_HEAD(&iot->list);
+ spin_lock_init(&iot->lock);
+
+ return &iot->css;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct iothrottle_node *n, *p;
+ struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
+
+ /*
+ * don't worry about locking here, at this point there must be not any
+ * reference to the list.
+ */
+ list_for_each_entry_safe(n, p, &iot->list, node)
+ kfree(n);
+ kfree(iot);
+}
+
+static int iothrottle_read(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
+ struct iothrottle_node *n;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(n, &iot->list, node) {
+ unsigned long delta;
+
+ BUG_ON(!n->dev);
+ switch (cft->private) {
+ case IOTHROTTLE_BANDWIDTH:
+ if (!n->iorate)
+ continue;
+ delta = jiffies_to_msecs((long)jiffies -
+ (long)n->timestamp);
+ seq_printf(m, "%u %u %llu %u %li %lli %li %lu\n",
+ MAJOR(n->dev), MINOR(n->dev),
+ n->iorate, n->strategy,
+ atomic_long_read(&n->stat),
+ n->bucket_size, atomic_long_read(&n->token),
+ delta);
+ break;
+ case IOTHROTTLE_IOPS:
+ if (!n->iops)
+ continue;
+ delta = jiffies_to_msecs((long)jiffies -
+ (long)n->iops_timestamp);
+ seq_printf(m, "%u %u %llu %li %lu\n",
+ MAJOR(n->dev), MINOR(n->dev),
+ n->iops, atomic_long_read(&n->iops_stat),
+ delta);
+ break;
+ }
+ }
+ rcu_read_unlock();
+ return 0;
+}
+
+static dev_t devname2dev_t(const char *buf)
+{
+ struct block_device *bdev;
+ dev_t dev = 0;
+ struct gendisk *disk;
+ int part;
+
+ /* use a lookup to validate the block device */
+ bdev = lookup_bdev(buf);
+ if (IS_ERR(bdev))
+ return 0;
+
+ /* only entire devices are allowed, not single partitions */
+ disk = get_gendisk(bdev->bd_dev, &part);
+ if (disk && !part) {
+ BUG_ON(!bdev->bd_inode);
+ dev = bdev->bd_inode->i_rdev;
+ }
+ bdput(bdev);
+
+ return dev;
+}
+
+/*
+ * The userspace input string must use one of the following syntaxes:
+ *
+ * blockio.bandwidth
+ * ~~~~~~~~~~~~~~~~~
+ * dev:0 <- delete a bandwidth limiting rule
+ * dev:bw-limit:0 <- set a leaky bucket throttling rule
+ * dev:bw-limit:1:bucket-size <- set a token bucket throttling rule
+ *
+ * blockio.iops
+ * ~~~~~~~~~~~~
+ * dev:0 <- delete a iops/sec throttling rule
+ * dev:iops <- set an iops/sec throttling rule
+ */
+static int iothrottle_parse_args(char *buf, size_t nbytes, int filetype,
+ dev_t *dev, u64 *iops, u64 *iorate,
+ enum iothrottle_strategy *strategy,
+ s64 *bucket_size)
+{
+ char *p;
+ int count = 0;
+ char *s[4];
+ unsigned long strategy_val;
+ int ret;
+
+ memset(s, 0, sizeof(s));
+ *dev = 0;
+ *iops = 0;
+ *iorate = 0;
+ *strategy = 0;
+ *bucket_size = 0;
+
+ /* split the colon-delimited input string into its elements */
+ while (count < ARRAY_SIZE(s)) {
+ p = strsep(&buf, ":");
+ if (!p)
+ break;
+ if (!*p)
+ continue;
+ s[count++] = p;
+ }
+
+ switch (filetype) {
+ case IOTHROTTLE_BANDWIDTH:
+ /* i/o bandwidth limit */
+ if (!s[1])
+ return -EINVAL;
+ ret = strict_strtoull(s[1], 10, iorate);
+ if (ret < 0)
+ return ret;
+ if (!*iorate)
+ goto out;
+ *iorate = ALIGN(*iorate, 1024);
+ /* throttling strategy */
+ if (!s[2])
+ return -EINVAL;
+ ret = strict_strtoul(s[2], 10, &strategy_val);
+ if (ret < 0)
+ return ret;
+ *strategy = (enum iothrottle_strategy)strategy_val;
+ switch (*strategy) {
+ case IOTHROTTLE_LEAKY_BUCKET:
+ goto out;
+ case IOTH
...
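The patch is cut off before the throttling functions, so as a reading aid here is a user-space sketch of a generic token bucket, using the quantities named in struct iothrottle_node (@iorate, @bucket_size, @token). It is not the code from the patch: struct token_bucket, tb_refill() and tb_request() are hypothetical, time is passed in as milliseconds instead of jiffies, and there is no locking or atomics.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical token bucket state; one token stands for one byte. */
struct token_bucket {
	int64_t rate;		/* tokens added per second (bytes/s) */
	int64_t bucket_size;	/* maximum number of tokens held */
	int64_t tokens;		/* tokens currently available */
	int64_t last_ms;	/* time of the last refill, in ms */
};

/* Add the tokens earned since the last refill, capped at bucket_size. */
static void tb_refill(struct token_bucket *tb, int64_t now_ms)
{
	int64_t elapsed = now_ms - tb->last_ms;

	assert(elapsed >= 0);
	/* Integer division drops fractional tokens; fine for a sketch. */
	tb->tokens += tb->rate * elapsed / 1000;
	if (tb->tokens > tb->bucket_size)
		tb->tokens = tb->bucket_size;
	tb->last_ms = now_ms;
}

/*
 * Ask to send nbytes at time now_ms. Returns 0 if the request conforms
 * (tokens are consumed), otherwise the number of milliseconds to wait
 * before enough tokens will have accumulated. A request larger than
 * bucket_size can never conform in this simple version.
 */
static int64_t tb_request(struct token_bucket *tb, int64_t nbytes,
			  int64_t now_ms)
{
	int64_t missing;

	assert(tb->rate > 0 && nbytes >= 0);
	tb_refill(tb, now_ms);
	if (tb->tokens >= nbytes) {
		tb->tokens -= nbytes;
		return 0;
	}
	missing = nbytes - tb->tokens;
	return (missing * 1000 + tb->rate - 1) / tb->rate;	/* round up */
}
```

With rate = 1000 and bucket_size = 2000 on a full bucket, a 1500-byte request passes immediately, a following 1000-byte request is told to wait 500 ms, and after that wait it passes. The cap in tb_refill() is what bounds the burst size.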
|
|
| | Topic: [PATCH -mm 1/3] i/o bandwidth controller documentation |
|---|
| [PATCH -mm 1/3] i/o bandwidth controller documentation [message #32142] |
Tue, 22 July 2008 16:58 |
Andrea Righi Messages: 65 Registered: May 2008 |
Member |
From: openvz.org
|
|
Documentation of the block device I/O bandwidth controller: description, usage,
advantages and design.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
Documentation/controllers/io-throttle.txt | 328 +++++++++++++++++++++++++++++
1 files changed, 328 insertions(+), 0 deletions(-)
create mode 100644 Documentation/controllers/io-throttle.txt
diff --git a/Documentation/controllers/io-throttle.txt b/Documentation/controllers/io-throttle.txt
new file mode 100644
index 0000000..f6b8bb9
--- /dev/null
+++ b/Documentation/controllers/io-throttle.txt
@@ -0,0 +1,328 @@
+
+ Block device I/O bandwidth controller
+
+----------------------------------------------------------------------
+1. DESCRIPTION
+
+This controller limits the I/O bandwidth of specific block devices for
+specific process containers (cgroups) by imposing additional delays on I/O
+requests from those processes that exceed the limits defined in the control
+group filesystem.
+
+Bandwidth limiting rules offer better control over QoS with respect to priority
+or weight-based solutions that only give information about applications'
+relative performance requirements. Nevertheless, priority based solutions are
+affected by performance bursts, when only low-priority requests are submitted
+to a general purpose resource dispatcher.
+
+The goal of the I/O bandwidth controller is to improve performance
+predictability and provide performance isolation of different control groups
+sharing the same block devices.
+
+NOTE #1: If you're looking for a way to improve the overall throughput of the
+system, you should probably use a different solution.
+
+NOTE #2: The current implementation does not guarantee minimum bandwidth
+levels; QoS is implemented only by slowing down I/O "traffic" that exceeds the
+limits specified by the user. Minimum I/O rate thresholds are supposed to be
+guaranteed if the user configures a proper I/O bandwidth partitioning of the
+block devices shared among the different cgroups (theoretically, if the sum of
+all the single limits defined for a block device doesn't exceed the total I/O
+bandwidth of that device).
+
+----------------------------------------------------------------------
+2. USER INTERFACE
+
+A new I/O bandwidth limitation rule is described using the file
+blockio.bandwidth.
+
+The same file can be used to set multiple rules for different block devices
+relative to the same cgroup.
+
+2.1. Configure I/O bandwidth limiting rules
+
+The syntax to configure a limiting rule is the following:
+
+# /bin/echo DEV:BW:STRATEGY:BUCKET_SIZE > CGROUP/blockio.bandwidth
+
+- DEV is the name of the device the limiting rule is applied to.
+
+- BW is the maximum I/O bandwidth on DEV allowed by CGROUP; bandwidth must be
+ expressed in bytes/s. A generic I/O bandwidth limiting rule for a block
+ device DEV can be removed setting the BW value to 0.
+
+- STRATEGY is the throttling strategy used to throttle the applications' I/O
+ requests from/to device DEV. At the moment two different strategies can be
+ used:
+
+ 0 = leaky bucket: the controller accepts at most B bytes (B = BW * time);
+ further I/O requests are delayed scheduling a timeout for
+ the tasks that made those requests.
+
+ Different I/O flow
+ | | |
+ | v |
+ | v
+ v
+ .......
+ \ /
+ \ / leaky-bucket
+ ---
+ |||
+ vvv
+ Smoothed I/O flow
+
+ 1 = token bucket: BW tokens are added to the bucket every second; the bucket
+ can hold at most BUCKET_SIZE tokens; I/O requests are
+ accepted if there are available tokens in the bucket; when
+ a request of N bytes arrives N tokens are removed from the
+ bucket; if fewer than N tokens are available the request is
+ delayed until a sufficient number of tokens is available in
+ the bucket.
+
+ Tokens (I/O rate)
+ o
+ o
+ o
+ ....... <--.
+ \ / | Bucket size (burst limit)
+ \ooo/ |
+ --- <--'
+ |ooo
+ Incoming --->|---> Conforming
+ I/O |oo I/O
+ requests -->|--> requests
+ |
+ ---->|
+
+ Leaky bucket is more precise than token bucket in respecting the bandwidth
+ limits, because bursty workloads are always smoothed. Token bucket, instead,
+ allows a small degree of irregularity in the I/O flows (burst limit) and, for
+ this reason, is more efficient (bursty workloads are not smoothed
+ when there are sufficient tokens in the bucket).
+
+- BUCKET_SIZE is used only with token bucket (STRATEGY == 1) and defines the
+ size of the bucket in bytes.
+
+- CGROUP is the name of the limited process container.
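
The two throttling strategies described above can be modelled in userspace as
follows. This is only a sketch under stated assumptions: the names
(token_bucket, tb_request, lb_delay_ms) are hypothetical, time is passed in
explicitly in milliseconds, and a request that finds too few tokens is charged
immediately and told how long to sleep, which is one possible way to "delay
until enough tokens are available". It is not the kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model of the token bucket strategy (STRATEGY == 1). */
struct token_bucket {
	int64_t rate;        /* BW: tokens (bytes) added per second */
	int64_t bucket_size; /* BUCKET_SIZE: maximum number of tokens held */
	int64_t tokens;      /* current fill; negative means "in debt" */
	int64_t last_ms;     /* time of the last update, in milliseconds */
};

static void tb_init(struct token_bucket *tb, int64_t rate,
		    int64_t bucket_size, int64_t now_ms)
{
	assert(rate > 0 && bucket_size >= 0);
	tb->rate = rate;
	tb->bucket_size = bucket_size;
	tb->tokens = bucket_size; /* assumption: start with a full bucket */
	tb->last_ms = now_ms;
}

/*
 * Charge a request of nbytes at time now_ms.
 * Returns the number of milliseconds the caller should sleep
 * (0 means the request is conforming).
 */
static int64_t tb_request(struct token_bucket *tb, int64_t nbytes,
			  int64_t now_ms)
{
	/* refill according to the elapsed time, capped at the bucket size */
	tb->tokens += tb->rate * (now_ms - tb->last_ms) / 1000;
	if (tb->tokens > tb->bucket_size)
		tb->tokens = tb->bucket_size;
	tb->last_ms = now_ms;

	tb->tokens -= nbytes;
	if (tb->tokens >= 0)
		return 0;
	/* time needed to pay back the debt at 'rate' tokens per second */
	return (-tb->tokens * 1000 + tb->rate - 1) / tb->rate;
}

/*
 * Leaky bucket model (STRATEGY == 0): the bytes accumulated over the
 * elapsed window must not exceed rate * time. Returns the sleep time in
 * milliseconds needed to bring the flow back under the limit.
 */
static int64_t lb_delay_ms(int64_t rate, int64_t bytes_accumulated,
			   int64_t elapsed_ms)
{
	/* time the accumulated bytes should have taken at the limit rate */
	int64_t needed_ms = (bytes_accumulated * 1000 + rate - 1) / rate;

	return needed_ms > elapsed_ms ? needed_ms - elapsed_ms : 0;
}
```

In this model a burst smaller than the bucket size passes through the token
bucket without delay, while the leaky bucket delays any flow that runs ahead
of BW * time.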
+
+Also the following syntaxes are allowed:
+
+- remove an I/O bandwidth limiting rule
+# /bin/echo DEV:0 > CGROUP/blockio.bandwidth
+
+- configure a limiting rule using leaky bucket throttling (ignore bucket size):
+# /bin/echo DEV:BW:0 > CGROUP/blockio.bandwidth
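
For programs that configure rules without going through a shell, the three
accepted syntaxes can be generated with a small helper. The sketch below is
an illustration only: format_bw_rule() and write_bw_rule() are hypothetical
names, and the path of the blockio.bandwidth file is supplied by the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum strategy { LEAKY_BUCKET = 0, TOKEN_BUCKET = 1 };

/*
 * Build a rule string into buf using the shortest accepted syntax:
 *   DEV:0                    remove the rule (bw == 0)
 *   DEV:BW:0                 leaky bucket (bucket size ignored)
 *   DEV:BW:1:BUCKET_SIZE     token bucket
 * Returns 0 on success, -1 if buf is too small.
 */
static int format_bw_rule(char *buf, size_t len, const char *dev,
			  unsigned long long bw, enum strategy strategy,
			  unsigned long long bucket_size)
{
	int n;

	assert(buf && dev && strlen(dev) > 0);
	if (bw == 0)
		n = snprintf(buf, len, "%s:0", dev);
	else if (strategy == LEAKY_BUCKET)
		n = snprintf(buf, len, "%s:%llu:0", dev, bw);
	else
		n = snprintf(buf, len, "%s:%llu:1:%llu", dev, bw, bucket_size);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write a rule to the given blockio.bandwidth file; 0 on success, -1 on error. */
static int write_bw_rule(const char *path, const char *rule)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(rule, f) >= 0;
	return (fclose(f) == 0 && ok) ? 0 : -1;
}
```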
+
+2.2. Show I/O bandwidth limiting rules
+
+All the defined rules and statistics for a specific cgroup can be shown by
+reading the file blockio.bandwidth. The following syntax is used:
+
+$ cat CGROUP/blockio.bandwidth
+MAJOR MINOR BW STRATEGY LEAKY_STAT BUCKET_SIZE BUCKET_FILL TIME_DELTA
+
+- MAJOR is the major device number of DEV (defined above)
+
+- MINOR is the minor device number of DEV (defined above)
+
+- BW, STRATEGY and BUCKET_SIZE are the same parameters defined above
+
+- LEAKY_STAT is the amount of bytes currently allowed by the I/O bandwidth
+ controller (only used with leaky bucket strategy - STRATEGY == 0)
+
+- BUCKET_FILL represents the amount of tokens present in the bucket (only used
+ with token bucket strategy - STRATEGY == 1)
+
+- TIME_DELTA can be one of the following:
+ - the amount of jiffies elapsed from the last I/O request (token bucket)
+ - the amount of jiffies during which the bytes given by LEAKY_STAT have been
+ accumulated (leaky bucket)
+
+Multiple per-block device rules are reported in multiple rows
+(DEVi, i = 1 .. n):
+
+$ cat CGROUP/blockio.bandwidth
+MAJOR1 MINOR1 BW1 STRATEGY1 LEAKY_STAT1 BUCKET_SIZE1 BUCKET_FILL1 TIME_DELTA1
+MAJOR2 MINOR2 BW2 STRATEGY2 LEAKY_STAT2 BUCKET_SIZE2 BUCKET_FILL2 TIME_DELTA2
+...
+MAJORn MINORn BWn STRATEGYn LEAKY_STATn BUCKET_SIZEn BUCKET_FILLn TIME_DELTAn
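
A monitoring tool could decode each row of this output with sscanf(). The
following is a minimal sketch, assuming the eight whitespace-separated fields
listed above; the struct and function names are hypothetical, and the signed
types are chosen because BUCKET_FILL can be reported as a negative number.

```c
#include <assert.h>
#include <stdio.h>

/* One row of CGROUP/blockio.bandwidth (hypothetical userspace view). */
struct bw_rule_stat {
	unsigned int major, minor;
	unsigned long long bw;
	unsigned int strategy;
	long long leaky_stat;
	long long bucket_size;
	long long bucket_fill;
	unsigned long long time_delta;
};

/* Returns 0 on success, -1 if the line does not contain all eight fields. */
static int parse_bw_row(const char *line, struct bw_rule_stat *st)
{
	int n;

	assert(line && st);
	n = sscanf(line, "%u %u %llu %u %lld %lld %lld %llu",
		   &st->major, &st->minor, &st->bw, &st->strategy,
		   &st->leaky_stat, &st->bucket_size, &st->bucket_fill,
		   &st->time_delta);
	return n == 8 ? 0 : -1;
}
```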
+
+2.3 Configure I/O operations/sec limiting rules
+
+The syntax to limit I/O operations/sec is the following:
+
+# /bin/echo DEV:IOPS > CGROUP/blockio.iops
+
+- DEV is the name of the device the limiting rule is applied to.
+
+- IOPS is the maximum number of I/O operations per second that can be
+ issued on device DEV by the tasks belonging to cgroup CGROUP. The I/O
+ operations limit can be removed by setting IOPS to 0.
+
+2.4 Show I/O operations/sec limiting rules
+
+The I/O operations limits and statistics for a specific cgroup can be shown
+by reading the file blockio.iops. The following syntax is used:
+
+$ cat CGROUP/blockio.iops
+MAJOR MINOR IOPS IOPS_STAT TIME_DELTA
+
+- MAJOR is the major device number of DEV (defined above)
+
+- MINOR is the minor device number of DEV (defined above)
+
+- IOPS is the I/O operations/sec limit
+
+- IOPS_STAT is the number of I/O operations accumulated by the iops counter
+
+- TIME_DELTA is the amount of jiffies during which the number of I/O
+ operations reported in IOPS_STAT has been accumulated
+
+Multiple per-block device rules are reported in multiple rows
+(DEVi, i = 1 .. n):
+
+$ cat CGROUP/blockio.iops
+MAJOR1 MINOR1 IOPS1 IOPS_STAT1 TIME_DELTA1
+MAJOR2 MINOR2 IOPS2 IOPS_STAT2 TIME_DELTA2
+...
+MAJORn MINORn IOPSn IOPS_STATn TIME_DELTAn
+
+2.5. Examples
+
+* Mount the cgroup filesystem (blockio subsystem):
+ # mkdir /mnt/cgroup
+ # mount -t cgroup -oblockio blockio /mnt/cgroup
+
+* Instantiate the new cgroup "foo":
+ # mkdir /mnt/cgroup/foo
+ --> the cgroup foo has been created
+
+* Add the current shell process to the cgroup "foo":
+ # /bin/echo $$ > /mnt/cgroup/foo/tasks
+ --> the current shell has been added to the cgroup "foo"
+
+* Give maximum 1MiB/s of I/O bandwidth on /dev/sda for the cgroup "foo", using
+ leaky bucket throttling strategy:
+ # /bin/echo /dev/sda:$((1024 * 1024)):0:0 > \
+ > /mnt/cgroup/foo/blockio.bandwidth
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda
+
+* Give maximum 8MiB/s of I/O bandwidth on /dev/sdb for the cgroup "foo", using
+ token bucket throttling strategy, bucket size = 8MiB:
+ # /bin/echo /dev/sdb:$((8 * 1024 * 1024)):1:$((8 * 1024 * 1024)) > \
+ > /mnt/cgroup/foo/blockio.bandwidth
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda (controlled by leaky bucket throttling)
+ and 8MiB/s on /dev/sdb (controlled by token bucket throttling)
+
+* Run a benchmark doing I/O on /dev/sda and /dev/sdb; I/O limits and usage
+ defined for cgroup "foo" can be shown as follows:
+ # cat /mnt/cgroup/foo/blockio.bandwidth
+ 8 16 8388608 1 0 8388608 -522560 48
+ 8 0 1048576 0 737280 0 0 216
+
+* Extend the maximum I/O bandwidth for the cgroup "foo" to 16MiB/s on /dev/sda:
+ # /bin/echo /dev/sda:$((16 * 1024 * 1024)):0:0 > \
+ > /mnt/cgroup/foo/blockio.bandwidth
+ # cat /mnt/cgroup/foo/blockio.bandwidth
+ 8 16 8388608 1 0 8388608 -84432 206436
+ 8 0 16777216 0 0 0 0 15212
+
+* Remove limiting rule on /dev/sdb for cgroup "foo":
+ # /bin/echo /dev/sdb:0:0:0 > /mnt/cgroup/foo/blockio.bandwidth
+ # cat /mnt/cgroup/foo/bl
...
|
|
| | Topic: [PATCH -mm 0/3] cgroup: block device i/o bandwidth controller (v7) |
|---|
| [PATCH -mm 0/3] cgroup: block device i/o bandwidth controller (v7) [message #32141] |
Tue, 22 July 2008 16:58 |
Andrea Righi Messages: 65 Registered: May 2008 |
Member |
From: openvz.org
|
|
The objective of the i/o bandwidth controller is to improve i/o performance
predictability of different cgroups sharing the same block devices.
Compared to other priority/weight-based solutions, the approach used by this
controller is to explicitly choke applications' requests that directly (or
indirectly) generate i/o activity in the system.
The direct bandwidth limiting method has the advantage of improving the
performance predictability at the cost of reducing, in general, the overall
performance of the system (in terms of throughput).
Detailed information about the design, its goals and usage is given in the
documentation.
Tested against 2.6.26-rc8-mm1.
The all-in-one patch (and previous versions) can be found at:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/
Changelog: (v6 -> v7)
- added i/o operations per second throttling
- fixed a build bug in x86 (undefined reference to `__udivdi3')
- updated documentation
Following are some results of a simple test I did to check the effectiveness of
the new iops throttling functionality (for Subrata: I'll post an update for the
io-throttle testcase in LTP ASAP).
testcase overview
=================
- cgroup #1: process P1 periodically reads a 5.5MB file and prints in stdout
the time needed to read the file
- cgroup #2: a process P2 is started; P2 runs a lot of parallel md5sums of all
the files under /usr (recursively)
We want to improve P1 responsiveness and better predict P1 performance,
regardless of the other i/o activities in the system, so we're going to measure
the times printed by P1 in stdout to evaluate the effectiveness of each
tested solution for our particular requirement.
different configurations
========================
#1: no limiting at all
#2: plain CFQ priorities (P1 runs at real-time prio class 0, P2 runs at idle prio)
#3: iops throttling (P1 = unlimited, P2 = 50 iops/sec)
#4: bandwidth throttling (P1 = unlimited, P2 = 512KiB/s)
#5: bandwidth + iops throttling (P1 = unlimited, P2 = 512KiB/s and 50 iops/sec)
#6: aggressive bandwidth + iops throttling (P1 = unlimited, P2 = 128KiB/s and 10 iops/sec)
results (P1 response times)
===========================
#1 #2 #3 #4 #5 #6
----------------------------------------------------------
4.69724 4.68447 4.80822 4.37353 4.40609 4.37175
4.71427 4.45847 4.40524 4.35441 4.37228 4.35842
4.73120 4.46849 4.39400 4.36893 4.47388 4.36529
4.83120 4.47956 4.37878 4.44221 4.36823 4.37942
4.68060 4.49554 4.43058 4.40074 4.46004 4.37354
____________________ P2 starts here! _____________________
62.83110 7.06834 6.54557 7.10171 7.21964 5.35958
59.04400 6.92486 10.30330 5.38122 5.76458 4.89837
37.23380 7.11255 9.16971 8.32928 5.37017 5.51931
32.28180 7.26239 8.91513 6.27551 5.03347 4.79848
28.74150 7.19909 8.38274 5.00802 5.50771 4.72832
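
To compare the configurations at a glance, the five samples taken after P2
starts can be averaged per column. The snippet below is just a convenience
sketch: the numbers are copied from the table above, while the array and
function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NCONF    6 /* configurations #1 .. #6 */
#define NSAMPLES 5 /* P1 response times measured after P2 starts */

static const double after_p2[NSAMPLES][NCONF] = {
	{ 62.83110, 7.06834,  6.54557, 7.10171, 7.21964, 5.35958 },
	{ 59.04400, 6.92486, 10.30330, 5.38122, 5.76458, 4.89837 },
	{ 37.23380, 7.11255,  9.16971, 8.32928, 5.37017, 5.51931 },
	{ 32.28180, 7.26239,  8.91513, 6.27551, 5.03347, 4.79848 },
	{ 28.74150, 7.19909,  8.38274, 5.00802, 5.50771, 4.72832 },
};

/* Arithmetic mean of one column (configuration) of the table. */
static double column_mean(const double rows[][NCONF], size_t nrows, size_t col)
{
	double sum = 0.0;
	size_t i;

	assert(nrows > 0 && col < NCONF);
	for (i = 0; i < nrows; i++)
		sum += rows[i][col];
	return sum / (double)nrows;
}
```

With these samples the mean for configuration #1 (no limiting) is about 44.03,
while for configuration #6 (aggressive bandwidth + iops throttling) it is
about 5.06.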
-Andrea
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
|
|
| | Topic: HASH FS |
|---|
| HASH FS [message #32109] |
Mon, 21 July 2008 10:49 |
Jun OKAJIMA Messages: 30 Registered: March 2006 |
Member |
From: openvz.org
|
|
( This mail is sent to both OpenVZ and vserver ML)
Hello folks.
This is a new FS for virtual servers.
Basically, this is a "vhashify/vzcache" FS.
Just try it, please. I need your feedback.
URL:
http://www.digitalinfra.co.jp/20080720/hashfs.20080720.html
Best regards,
--- Okajima, Jun. Tokyo, Japan.
To Mr. Herbert Poetzl :
See this URL:
http://www.mail-archive.com/vserver@list.linux-vserver.org/msg04247.html
>> BTW, I also am planning a new file system for Vserver.
>well, let's hear about it then ...
This is it.
|
|
| | Topic: Re: [patch 1/1] [TCP] fix kernel panic with listening_get_next |
|---|
| Re: [patch 1/1] [TCP] fix kernel panic with listening_get_next [message #32087] |
Sat, 19 July 2008 03:16 |
davem Messages: 463 Registered: February 2006 |
Senior Member |
From: openvz.org
|
|
From: Daniel Lezcano <dlezcano@fr.ibm.com>
Date: Sat, 19 Jul 2008 08:31:58 +0200
> [TCP] fix kernel panic with listening_get_next
Applied, thanks a lot Daniel.
|
|
| | Topic: Checkpoint/Restart mini-summit agenda |
|---|
| Checkpoint/Restart mini-summit agenda [message #32070] |
Fri, 18 July 2008 05:19 |
Daniel Lezcano Messages: 417 Registered: June 2006 |
Senior Member |
From: openvz.org
|
|
The mini-summit agenda has been updated at:
http://wiki.openvz.org/Containers/Mini-summit_2008
Thanks.
Unless otherwise indicated above:
Compagnie IBM France
Registered office: Tour Descartes, 2, avenue Gambetta, La Défense 5, 92400 Courbevoie
RCS Nanterre 552 118 465
Legal form: S.A.S.
Share capital: 542.737.118 €
SIREN/SIRET: 552 118 465 02430
|
|
| | Topic: [ccr@linuxsymposium.org: LS Mini Summit: Schedule Update & Info] |
|---|
| [ccr@linuxsymposium.org: LS Mini Summit: Schedule Update & Info] [message #31980] |
Wed, 16 July 2008 15:18 |
serue Messages: 750 Registered: February 2006 |
Senior Member |
From: openvz.org
|
|
The containers mini-summit is happening at the Novotel Hotel.
Note that our breaks don't correspond to the break times given by
facilities. I intend to ignore that - we can break for 5 mins to
collect coffee whenever - but if someone wants to update the times on
our agenda to better match these breaks that's fine.
I did however move the first two segments back by 30 minutes so that
we start at 8:30am. Please note the time change!
-serge
----- Forwarded message from "C. Craig Ross" <ccr@linuxsymposium.org> -----
Date: Tue, 15 Jul 2008 17:01:19 -0400
From: "C. Craig Ross" <ccr@linuxsymposium.org>
To: "Serge E. Hallyn" <serue@us.ibm.com>,
"Adams, Aland (OSLO R&D)" <aland.adams@hp.com>,
"Brown, Len" <len.brown@intel.com>,
"John W. Linville" <linville@tuxdriver.com>,
Paul Moore <paul.moore@hp.com>
Subject: LS Mini Summit: Schedule Update & Info
Hi,
Here is the latest update.
Location
------------
Les Suites Hotel
1. Virtualization (Garden Suite Room) - Theatre Style Seating (35)
2. Linux Power Management (Rideau Suite) - U-Shape (20)
3. Linux Wireless LAN (Byward Suite) - U-Shape (25)
Novotel Hotel
1. Containers (Albion A) - U-Shape (35)
2. SE Linux (Albion B) - Theatre (45)
Schedule
-------------
08h30 - 16h30
With breaks at 10h00 - 10h30 and 15h30 - 16h00 and lunch at 12h30 - 13h30.
If the breaks don't fit *exactly* with your schedule it should be ok since
things like coffee/snacks shouldn't go back if you are 15 minutes early/late.
Registration
-----------------
Please send us your registration lists by Monday, July 21st (at the latest).
We are still trying to decide if we will be holding registration on Tuesday
or if we are just going to confirm your list against our registration list so
that your attendees don't have to walk to the Congress Centre then walk back.
We will provide specifics about room locations/access before the weekend.
Contact me directly if you have any questions but I think we're getting
there. :)
Cheers,
C.
On Tue, Jul 15, 2008 at 10:37 AM, C. Craig Ross <ccr@linuxsymposium.org>
wrote:
> Hi,
>
> Here are our attendance estimates, please confirm...
>
> Virtualization - 20
> Containers - 20
> SE Linux - 40
> Linux Power Management - 20
> Linux Wireless LAN - ... :-)
>
> We will be hosting the Mini Summits at Les Suites Hotel
> and most likely the Novotel Hotel (we will confirm soon).
> The rooms will be set up in theatre style seating
> unless you aren't doing formal presentations at which point
> a hollow square shape might be more beneficial.
>
> We also require confirmation for those of you who require
> projectors and screens or just flip charts and paper boards.
>
> Coffee breaks and lunches will be provided for all Mini Summits
> as they have generously been sponsored by HP. :-)
>
> Thank you.
>
> Cheers,
>
> C.
>
> P.S. You can just reply directly to me.
>
----- End forwarded message -----
|
|
| | Topic: [PATCH -mm 3/3] i/o accounting and control |
|---|
| [PATCH -mm 3/3] i/o accounting and control [message #31949] |
Tue, 15 July 2008 16:40 |
Andrea Righi Messages: 65 Registered: May 2008 |
Member |
From: openvz.org
|
|
Apply the io-throttle controller to the appropriate kernel functions. Both
accounting and throttling functionalities are performed by
cgroup_io_throttle().
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
block/blk-core.c | 9 +++++++++
fs/aio.c | 31 ++++++++++++++++++++++++++++++-
fs/buffer.c | 20 +++++++++++++++++---
fs/direct-io.c | 4 ++++
include/linux/sched.h | 3 +++
kernel/fork.c | 3 +++
mm/filemap.c | 18 +++++++++++++++++-
mm/page-writeback.c | 30 +++++++++++++++++++++++++++---
mm/readahead.c | 5 +++++
9 files changed, 115 insertions(+), 8 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 4c222ba..431294f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -26,6 +26,7 @@
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/interrupt.h>
#include <linux/cpu.h>
#include <linux/blktrace_api.h>
@@ -1482,7 +1483,15 @@ void submit_bio(int rw, struct bio *bio)
if (rw & WRITE) {
count_vm_events(PGPGOUT, count);
} else {
+ struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+
task_io_account_read(bio->bi_size);
+ /*
+ * Do not throttle page requests that need to be
+ * urgently reclaimed.
+ */
+ cgroup_io_throttle(bio->bi_bdev, bio->bi_size,
+ !(PageReclaim(page) || PageSwapCache(page)));
count_vm_events(PGPGIN, count);
}
diff --git a/fs/aio.c b/fs/aio.c
index 0051fd9..1f3abb3 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -22,6 +22,7 @@
#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/file.h>
+#include <linux/blk-io-throttle.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/slab.h>
@@ -1558,6 +1559,8 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
{
struct kiocb *req;
struct file *file;
+ struct block_device *bdev;
+ struct inode *inode;
ssize_t ret;
/* enforce forwards compatibility on users */
@@ -1580,10 +1583,26 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
if (unlikely(!file))
return -EBADF;
+ /*
+ * Pre-account AIO activity: we over-account *all* the bytes here;
+ * bytes read from the page cache and bytes written in already dirtied
+ * pages (that do not generate real i/o on block devices) will be
+ * subtracted later, following the path of aio_run_iocb().
+ */
+ inode = file->f_mapping->host;
+ bdev = inode->i_sb->s_bdev;
+ ret = cgroup_io_throttle(bdev, iocb->aio_nbytes, 0);
+ if (unlikely(ret)) {
+ fput(file);
+ ret = -EAGAIN;
+ goto out_cgroup_io_throttle;
+ }
+
req = aio_get_req(ctx); /* returns with 2 references to req */
if (unlikely(!req)) {
fput(file);
- return -EAGAIN;
+ ret = -EAGAIN;
+ goto out_cgroup_io_throttle;
}
req->ki_filp = file;
if (iocb->aio_flags & IOCB_FLAG_RESFD) {
@@ -1622,12 +1641,14 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
goto out_put_req;
spin_lock_irq(&ctx->ctx_lock);
+ set_in_aio();
aio_run_iocb(req);
if (!list_empty(&ctx->run_list)) {
/* drain the run list */
while (__aio_run_iocbs(ctx))
;
}
+ unset_in_aio();
spin_unlock_irq(&ctx->ctx_lock);
aio_put_req(req); /* drop extra ref to req */
return 0;
@@ -1635,6 +1656,8 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
out_put_req:
aio_put_req(req); /* drop extra ref to req */
aio_put_req(req); /* drop i/o ref to req */
+out_cgroup_io_throttle:
+ cgroup_io_throttle(bdev, -iocb->aio_nbytes, 0);
return ret;
}
@@ -1746,6 +1769,12 @@ asmlinkage long sys_io_cancel(aio_context_t ctx_id, struct iocb __user *iocb,
ret = -EAGAIN;
kiocb = lookup_kiocb(ctx, iocb, key);
if (kiocb && kiocb->ki_cancel) {
+ struct block_device *bdev;
+ struct inode *inode = kiocb->ki_filp->f_mapping->host;
+
+ bdev = inode->i_sb->s_bdev;
+ cgroup_io_throttle(bdev, -kiocb->ki_nbytes, 0);
+
cancel = kiocb->ki_cancel;
kiocb->ki_users ++;
kiocbSetCancelled(kiocb);
diff --git a/fs/buffer.c b/fs/buffer.c
index 4ffb5bb..89808b1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -35,6 +35,7 @@
#include <linux/suspend.h>
#include <linux/buffer_head.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/bio.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
@@ -708,11 +709,14 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
static int __set_page_dirty(struct page *page,
struct address_space *mapping, int warn)
{
+ ssize_t cgroup_io_acct = 0;
+ int ret = 0;
+
if (unlikely(!mapping))
return !TestSetPageDirty(page);
if (TestSetPageDirty(page))
- return 0;
+ goto out;
spin_lock_irq(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
@@ -723,14 +727,24 @@ static int __set_page_dirty(struct page *page,
__inc_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
+ cgroup_io_acct = PAGE_CACHE_SIZE;
}
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
spin_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
- return 1;
+ ret = 1;
+out:
+ if (is_in_aio() && !cgroup_io_acct)
+ cgroup_io_acct = -PAGE_CACHE_SIZE;
+ if (cgroup_io_acct) {
+ struct block_device *bdev = (mapping->host &&
+ mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
+ cgroup_io_throttle(bdev, cgroup_io_acct, 0);
+ }
+ return ret;
}
/*
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 9606ee8..f5dcb91 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -35,6 +35,7 @@
#include <linux/buffer_head.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
+#include <linux/blk-io-throttle.h>
#include <asm/atomic.h>
/*
@@ -660,6 +661,9 @@ submit_page_section(struct dio *dio, struct page *page,
/*
* Read accounting is performed in submit_bio()
*/
+ struct block_device *bdev = dio->bio ?
+ dio->bio->bi_bdev : NULL;
+ cgroup_io_throttle(bdev, len, 1);
task_io_account_write(len);
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ba43675..9d4c755 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1250,6 +1250,9 @@ struct task_struct {
u64 rchar, wchar, syscr, syscw;
#endif
struct task_io_accounting ioac;
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ atomic_t in_aio;
+#endif
#if defined(CONFIG_TASK_XACCT)
u64 acct_rss_mem1; /* accumulated rss usage */
u64 acct_vm_mem1; /* accumulated virtual memory usage */
diff --git a/kernel/fork.c b/kernel/fork.c
index aed1ff7..f8cf5da 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1029,6 +1029,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
task_io_accounting_init(p);
acct_clear_integrals(p);
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ atomic_set(&p->in_aio, 0);
+#endif
p->it_virt_expires = cputime_zero;
p->it_prof_expires = cputime_zero;
p->it_sched_expires = 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index 7567d86..bb80789 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -13,6 +13,7 @@
#include <linux/slab.h>
#include <linux/compiler.h>
#include <linux/fs.h>
+#include <linux/blk-io-throttle.h>
#include <linux/uaccess.h>
#include <linux/aio.h>
#include <linux/capability.h>
@@ -1011,6 +1012,7 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
pgoff_t prev_index;
unsigned long offset; /* offset into pagecache page */
unsigned int prev_offset;
+ int was_page_ok = 0;
int error;
index = *ppos >> PAGE_CACHE_SHIFT;
@@ -1023,7 +1025,8 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
struct page *page;
pgoff_t end_index;
loff_t isize;
- unsigned long nr, ret;
+ ssize_t nr;
+ unsigned long ret;
cond_resched();
find_page:
@@ -1051,6 +1054,8 @@ find_page:
desc, offset))
goto page_not_up_to_date_locked;
unlock_page(page);
+ } else {
+ was_page_ok = 1;
}
page_ok:
/*
@@ -1080,6 +1085,17 @@ page_ok:
}
nr = nr - offset;
+ /*
+ * De-account i/o in case of AIO read from the page cache.
+ * AIO accounting was performed in io_submit_one().
+ */
+ if (is_in_aio() && was_page_ok) {
+ struct block_device *bdev = (inode &&
+ inode->i_sb->s_bdev) ?
+ inode->i_sb->s_bdev : NULL;
+ cgroup_io_throttle(bdev, -nr, 0);
+ }
+
/* If users can be writing to this page using arbitrary
* virtual addresses, take care about potential aliasing
* before reading the page on the kernel side.
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 29b1d1e..c6207de 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
#include <linux/init.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/blkdev.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
@@ -430,6 +431,9 @@ static void balance_dirty_pages(struct address_space *mapping)
unsigned long write_chunk = sync_writeback_pages();
struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct block_device *bdev = (mapping->host &&
+ mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
for (;;) {
struct writeback_control wbc = {
@@ -512,6 +516,14 @@ static void balance_dirty_pages(struct address_space *mapping)
return; /* pdflush is already working th
...
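
The AIO handling in this patch follows a "pre-account, then de-account"
pattern: io_submit_one() charges all the requested bytes up front, and later
code paths call cgroup_io_throttle() with a negative size for the parts that
did not generate real i/o. The toy model below only illustrates that
bookkeeping; io_account, account() and aio_read_model() are hypothetical
names, and a fixed 4096-byte page is assumed.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096L /* assumption: 4 KiB pages */

/* Hypothetical per-cgroup byte counter. */
struct io_account {
	long bytes;
};

/* Positive values account i/o, negative values de-account it. */
static void account(struct io_account *acct, long bytes)
{
	acct->bytes += bytes;
}

/*
 * Model of an AIO read of nbytes where cached_pages pages are served from
 * the page cache: everything is over-accounted first, then the cached part
 * is subtracted. Returns the bytes left accounted afterwards.
 */
static long aio_read_model(struct io_account *acct, long nbytes,
			   long cached_pages)
{
	assert(nbytes >= 0 && cached_pages >= 0);
	account(acct, nbytes);                          /* pre-account */
	account(acct, -cached_pages * MODEL_PAGE_SIZE); /* de-account hits */
	return acct->bytes;
}
```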
|
|
| | Topic: [PATCH -mm 2/3] i/o bandwidth controller infrastructure |
|---|
| [PATCH -mm 2/3] i/o bandwidth controller infrastructure [message #31951] |
Tue, 15 July 2008 16:40 |
Andrea Righi Messages: 65 Registered: May 2008 |
Member |
From: openvz.org
|
|
This is the core io-throttle kernel infrastructure. It creates the basic
interfaces to cgroups and implements the I/O measurement and throttling
functions.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
block/Makefile | 2 +
block/blk-io-throttle.c | 552 +++++++++++++++++++++++++++++++++++++++
include/linux/blk-io-throttle.h | 41 +++
include/linux/cgroup_subsys.h | 6 +
init/Kconfig | 10 +
5 files changed, 611 insertions(+), 0 deletions(-)
create mode 100644 block/blk-io-throttle.c
create mode 100644 include/linux/blk-io-throttle.h
diff --git a/block/Makefile b/block/Makefile
index 208000b..b3afc86 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -13,6 +13,8 @@ obj-$(CONFIG_IOSCHED_AS) += as-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
+obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o
+
obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
new file mode 100644
index 0000000..f541e86
--- /dev/null
+++ b/block/blk-io-throttle.c
@@ -0,0 +1,552 @@
+/*
+ * blk-io-throttle.c
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2008 Andrea Righi <righi.andrea@gmail.com>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/err.h>
+#include <linux/sched.h>
+#include <linux/genhd.h>
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/hardirq.h>
+#include <linux/list.h>
+#include <linux/seq_file.h>
+#include <linux/spinlock.h>
+#include <linux/uaccess.h>
+#include <linux/blk-io-throttle.h>
+
+/* The various types of throttling algorithms */
+enum iothrottle_strategy {
+ IOTHROTTLE_LEAKY_BUCKET = 0,
+ IOTHROTTLE_TOKEN_BUCKET = 1,
+};
+
+/**
+ * struct iothrottle_node - throttling rule of a single block device
+ * @node: list of per block device throttling rules
+ * @dev: block device number, used as key in the list
+ * @iorate: max i/o bandwidth (in bytes/s)
+ * @strategy: throttling strategy
+ * @timestamp: timestamp of the last I/O request (in jiffies)
+ * @stat: i/o activity counter (leaky bucket only)
+ * @bucket_size: bucket size in bytes (token bucket only)
+ * @token: token counter (token bucket only)
+ *
+ * Define an i/o throttling rule for a single block device.
+ *
+ * NOTE: limiting rules always refer to dev_t; if a block device is unplugged
+ * the limiting rules defined for that device persist and they are still valid
+ * if a new device is plugged and it uses the same dev_t number.
+ */
+struct iothrottle_node {
+ struct list_head node;
+ dev_t dev;
+ u64 iorate;
+ enum iothrottle_strategy strategy;
+ unsigned long timestamp;
+ atomic_long_t stat;
+ s64 bucket_size;
+ atomic_long_t token;
+};
+
+/**
+ * struct iothrottle - throttling rules for a cgroup
+ * @css: pointer to the cgroup state
+ * @lock: spinlock used to protect write operations in the list
+ * @list: list of iothrottle_node elements
+ *
+ * Define multiple per-block device i/o throttling rules.
+ * Note: the list of the throttling rules is protected by RCU locking.
+ */
+struct iothrottle {
+ struct cgroup_subsys_state css;
+ spinlock_t lock;
+ struct list_head list;
+};
+
+static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cgrp)
+{
+ return container_of(cgroup_subsys_state(cgrp, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
+{
+ return container_of(task_subsys_state(task, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+/*
+ * Note: called with rcu_read_lock() or iot->lock held.
+ */
+static struct iothrottle_node *
+iothrottle_search_node(const struct iothrottle *iot, dev_t dev)
+{
+ struct iothrottle_node *n;
+
+ list_for_each_entry_rcu(n, &iot->list, node)
+ if (n->dev == dev)
+ return n;
+ return NULL;
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline void iothrottle_insert_node(struct iothrottle *iot,
+ struct iothrottle_node *n)
+{
+ list_add_rcu(&n->node, &iot->list);
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline void
+iothrottle_replace_node(struct iothrottle *iot, struct iothrottle_node *old,
+ struct iothrottle_node *new)
+{
+ list_replace_rcu(&old->node, &new->node);
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static struct iothrottle_node *
+iothrottle_delete_node(struct iothrottle *iot, dev_t dev)
+{
+ struct iothrottle_node *n;
+
+ list_for_each_entry(n, &iot->list, node)
+ if (n->dev == dev) {
+ list_del_rcu(&n->node);
+ return n;
+ }
+ return NULL;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *
+iothrottle_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct iothrottle *iot;
+
+ iot = kmalloc(sizeof(*iot), GFP_KERNEL);
+ if (unlikely(!iot))
+ return ERR_PTR(-ENOMEM);
+
+ INIT_LIST_HEAD(&iot->list);
+ spin_lock_init(&iot->lock);
+
+ return &iot->css;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct iothrottle_node *n, *p;
+ struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
+
+ /*
+ * don't worry about locking here; at this point there must not be any
+ * reference to the list.
+ */
+ list_for_each_entry_safe(n, p, &iot->list, node)
+ kfree(n);
+ kfree(iot);
+}
+
+static int iothrottle_read(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
+ struct iothrottle_node *n;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(n, &iot->list, node) {
+ unsigned long delta;
+
+ BUG_ON(!n->dev);
+ delta = jiffies_to_msecs((long)jiffies - (long)n->timestamp);
+ seq_printf(m, "%u %u %llu %u %li %lli %li %lu\n",
+ MAJOR(n->dev), MINOR(n->dev), n->iorate,
+ n->strategy, atomic_long_read(&n->stat),
+ n->bucket_size, atomic_long_read(&n->token),
+ delta);
+ }
+ rcu_read_unlock();
+ return 0;
+}
+
+static dev_t devname2dev_t(const char *buf)
+{
+ struct block_device *bdev;
+ dev_t dev = 0;
+ struct gendisk *disk;
+ int part;
+
+ /* use a lookup to validate the block device */
+ bdev = lookup_bdev(buf);
+ if (IS_ERR(bdev))
+ return 0;
+
+ /* only entire devices are allowed, not single partitions */
+ disk = get_gendisk(bdev->bd_dev, &part);
+ if (disk && !part) {
+ BUG_ON(!bdev->bd_inode);
+ dev = bdev->bd_inode->i_rdev;
+ }
+ bdput(bdev);
+
+ return dev;
+}
+
+/*
+ * The userspace input string must use one of the following syntaxes:
+ *
+ * dev:0 <- delete a limiting rule
+ * dev:bw-limit:0 <- leaky bucket throttling rule
+ * dev:bw-limit:1:bucket-size <- token bucket throttling rule
+ */
+static int iothrottle_parse_args(char *buf, size_t nbytes, dev_t *dev,
+ u64 *iorate,
+ enum iothrottle_strategy *strategy,
+ s64 *bucket_size)
+{
+ char *p;
+ int count = 0;
+ char *s[4];
+ unsigned long strategy_val;
+ int ret;
+
+ /* split the colon-delimited input string into its elements */
+ memset(s, 0, sizeof(s));
+ while (count < ARRAY_SIZE(s)) {
+ p = strsep(&buf, ":");
+ if (!p)
+ break;
+ if (!*p)
+ continue;
+ s[count++] = p;
+ }
+
+ /* i/o bandwidth limit */
+ if (!s[1])
+ return -EINVAL;
+ ret = strict_strtoull(s[1], 10, iorate);
+ if (ret < 0)
+ return ret;
+ if (!*iorate) {
+ /*
+ * we're deleting a limiting rule, so just ignore the other
+ * parameters
+ */
+ *strategy = 0;
+ *bucket_size = 0;
+ goto out;
+ }
+ *iorate = ALIGN(*iorate, 1024);
+
+ /* throttling strategy */
+ if (!s[2])
+ return -EINVAL;
+ ret = strict_strtoul(s[2], 10, &strategy_val);
+ if (ret < 0)
+ return ret;
+ *strategy = (enum iothrottle_strategy)strategy_val;
+ switch (*strategy) {
+ case IOTHROTTLE_LEAKY_BUCKET:
+ /* leaky bucket ignores bucket size */
+ *bucket_size = 0;
+ goto out;
+ case IOTHROTTLE_TOKEN_BUCKET:
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ /* bucket size */
+ if (!s[3])
+ return -EINVAL;
+ ret = strict_strtoll(s[3], 10, bucket_size);
+ if (ret < 0)
+ return ret;
+ if (*bucket_size < 0)
+ return -EINVAL;
+ *bucket_size = ALIGN(*bucket_size, 1024);
+out:
+ /* block device number */
+ *dev = devname2dev_t(s[0]);
+ return *dev ? 0 : -EINVAL;
+}
+
+static int iothrottle_write(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct iothrottle *iot;
+ struct iothrottle_node *n, *newn = NULL;
+ dev_t dev;
+ u64 iorate;
+ enum iothrottle_strategy strategy;
+ s64 bucket_size;
+ char *buf;
+ size_t nbytes = strlen(buffer);
+ int ret = 0;
+
+ buf = kmalloc(nbytes + 1, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+ memcpy(buf, buffer, nbytes + 1);
+
+ ret = iothrottle_parse_args(buf, nbytes, &dev, &iorat
...
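
The colon-delimited rule syntax handled by iothrottle_parse_args() above can be tried outside the kernel. What follows is only a minimal user-space sketch under stated assumptions, not the patch's code: parse_rule(), struct rule, parse_u64() and align_1k() are hypothetical names, the device name is copied without the lookup_bdev() validation the kernel performs, and values too large to round up safely are simply rejected.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum rule_strategy { RULE_LEAKY_BUCKET = 0, RULE_TOKEN_BUCKET = 1 };

struct rule {
	char dev[64];			/* device name (not validated here) */
	uint64_t iorate;		/* bytes/s, 0 means "delete the rule" */
	enum rule_strategy strategy;
	int64_t bucket_size;		/* bytes, token bucket only */
};

/* Round up to a multiple of 1024, like ALIGN(x, 1024) in the kernel code. */
static uint64_t align_1k(uint64_t v)
{
	return (v + 1023) & ~(uint64_t)1023;
}

/* Parse an unsigned decimal field; the whole field must be consumed. */
static int parse_u64(const char *s, uint64_t *out)
{
	char *end;
	unsigned long long v;

	if (!s || *s < '0' || *s > '9')
		return -EINVAL;
	errno = 0;
	v = strtoull(s, &end, 10);
	if (errno || *end)
		return -EINVAL;
	*out = v;
	return 0;
}

static int parse_rule(const char *input, struct rule *r)
{
	char buf[128];
	char *field[4] = { NULL, NULL, NULL, NULL };
	char *p, *colon;
	uint64_t v;
	int count = 0;

	if (strlen(input) >= sizeof(buf))
		return -EINVAL;
	strcpy(buf, input);

	/* Split on ':' and skip empty fields, as the kernel parser does. */
	for (p = buf; p && count < 4; p = colon ? colon + 1 : NULL) {
		colon = strchr(p, ':');
		if (colon)
			*colon = '\0';
		if (*p)
			field[count++] = p;
	}

	if (!field[0] || strlen(field[0]) >= sizeof(r->dev))
		return -EINVAL;
	strcpy(r->dev, field[0]);
	r->strategy = RULE_LEAKY_BUCKET;
	r->bucket_size = 0;

	if (parse_u64(field[1], &r->iorate) ||
	    r->iorate > (uint64_t)INT64_MAX - 1023)
		return -EINVAL;
	if (r->iorate == 0)
		return 0;		/* dev:0 deletes the rule */
	r->iorate = align_1k(r->iorate);

	if (parse_u64(field[2], &v))
		return -EINVAL;
	if (v == RULE_LEAKY_BUCKET)
		return 0;		/* bucket size is ignored */
	if (v != RULE_TOKEN_BUCKET)
		return -EINVAL;
	r->strategy = RULE_TOKEN_BUCKET;

	if (parse_u64(field[3], &v) || v > (uint64_t)INT64_MAX - 1023)
		return -EINVAL;
	r->bucket_size = (int64_t)align_1k(v);
	return 0;
}
```

As in the kernel parser, the bandwidth and the bucket size are rounded up to a multiple of 1024, so a request for 1000 bytes/s is stored as 1024.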
| Topic: [PATCH -mm 1/3] i/o bandwidth controller documentation |
|---|
| [PATCH -mm 1/3] i/o bandwidth controller documentation [message #31950] |
Tue, 15 July 2008 16:40 |
Andrea Righi Messages: 65 Registered: May 2008 |
Member |
From: openvz.org
|
|
Documentation of the block device I/O bandwidth controller: description, usage,
advantages and design.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
Documentation/controllers/io-throttle.txt | 282 +++++++++++++++++++++++++++++
1 files changed, 282 insertions(+), 0 deletions(-)
create mode 100644 Documentation/controllers/io-throttle.txt
diff --git a/Documentation/controllers/io-throttle.txt b/Documentation/controllers/io-throttle.txt
new file mode 100644
index 0000000..9c53f6c
--- /dev/null
+++ b/Documentation/controllers/io-throttle.txt
@@ -0,0 +1,282 @@
+
+ Block device I/O bandwidth controller
+
+1. Description
+
+This controller limits the I/O bandwidth of specific block devices for
+specific process containers (cgroups) by imposing additional delays on the I/O
+requests of processes that exceed the limits defined in the control group
+filesystem.
+
+Bandwidth limiting rules offer better control over QoS than priority- or
+weight-based solutions, which only give information about applications'
+relative performance requirements. Moreover, priority-based solutions are
+affected by performance bursts when only low-priority requests are submitted
+to a general purpose resource dispatcher.
+
+The goal of the I/O bandwidth controller is to improve performance
+predictability and provide performance isolation of different control groups
+sharing the same block devices.
+
+NOTE #1: If you're looking for a way to improve the overall throughput of the
+system, you should probably use a different solution.
+
+NOTE #2: The current implementation does not guarantee minimum bandwidth
+levels; QoS is implemented only by slowing down I/O "traffic" that exceeds the
+limits specified by the user. Minimum I/O rate thresholds are expected to be
+met if the user configures a proper I/O bandwidth partitioning of the block
+devices shared among the different cgroups (in theory, if the sum of all the
+limits defined for a block device doesn't exceed the total I/O bandwidth of
+that device).
+
+2. User Interface
+
+A new I/O bandwidth limitation rule is described using the file
+blockio.bandwidth.
+
+The same file can be used to set multiple rules for different block devices
+relative to the same cgroup.
+
+2.1. Configure I/O limiting rules
+
+The syntax to configure a limiting rule is the following:
+
+# /bin/echo DEV:BW:STRATEGY:BUCKET_SIZE > CGROUP/blockio.bandwidth
+
+- DEV is the name of the device the limiting rule is applied to.
+
+- BW is the maximum I/O bandwidth on DEV allowed for CGROUP; bandwidth must
+ be expressed in bytes/s. An I/O bandwidth limiting rule for a block device
+ DEV can be removed by setting the BW value to 0.
+
+- STRATEGY is the throttling strategy used to throttle the applications' I/O
+ requests from/to device DEV. At the moment two different strategies can be
+ used:
+
+ 0 = leaky bucket: the controller accepts at most B bytes (B = BW * time);
+ further I/O requests are delayed by scheduling a timeout
+ for the tasks that made those requests.
+
+            Different I/O flow
+               | | |
+               | v |
+               |   v
+               v
+              .......
+              \     /
+               \   /  leaky-bucket
+                ---
+                |||
+                vvv
+            Smoothed I/O flow
+
+ 1 = token bucket: BW tokens are added to the bucket every second; the bucket
+ can hold at most BUCKET_SIZE tokens; I/O requests are
+ accepted if there are available tokens in the bucket; when
+ a request of N bytes arrives, N tokens are removed from the
+ bucket; if fewer than N tokens are available, the request is
+ delayed until a sufficient number of tokens is available in
+ the bucket.
+
+            Tokens (I/O rate)
+                o
+                o
+                o
+              ....... <--.
+              \     /    | Bucket size (burst limit)
+               \ooo/     |
+                ---   <--'
+                 |ooo
+    Incoming --->|---> Conforming
+    I/O          |oo   I/O
+    requests  -->|-->  requests
+                 |
+            ---->|
+
+ Leaky bucket is more precise than token bucket in respecting the bandwidth
+ limits, because bursty workloads are always smoothed. Token bucket, instead,
+ allows a small degree of irregularity in the I/O flows (burst limit) and is
+ therefore more efficient (bursty workloads are not smoothed when there are
+ sufficient tokens in the bucket).
+
+- BUCKET_SIZE is used only with token bucket (STRATEGY == 1) and defines the
+ size of the bucket in bytes.
+
+- CGROUP is the name of the limited process container.
+
+The following syntaxes are also allowed:
+
+- remove an I/O bandwidth limiting rule
+# /bin/echo DEV:0 > CGROUP/blockio.bandwidth
+
+- configure a limiting rule using leaky bucket throttling (ignore bucket size):
+# /bin/echo DEV:BW:0 > CGROUP/blockio.bandwidth
+
+2.2. Show I/O limiting rules
+
+All the defined rules and statistics for a specific cgroup can be shown by
+reading the file blockio.bandwidth. The output has the following format:
+
+$ cat CGROUP/blockio.bandwidth
+MAJOR MINOR BW STRATEGY LEAKY_STAT BUCKET_SIZE BUCKET_FILL TIME_DELTA
+
+- MAJOR is the major device number of DEV (defined above)
+
+- MINOR is the minor device number of DEV (defined above)
+
+- BW, STRATEGY and BUCKET_SIZE are the same parameters defined above
+
+- LEAKY_STAT is the number of bytes currently allowed by the I/O bandwidth
+ controller (only used with leaky bucket strategy - STRATEGY == 0)
+
+- BUCKET_FILL represents the number of tokens present in the bucket (only used
+ with token bucket strategy - STRATEGY == 1)
+
+- TIME_DELTA can be one of the following:
+ - the number of jiffies elapsed since the last I/O request (token bucket)
+ - the number of jiffies during which the bytes given by LEAKY_STAT have been
+ accumulated (leaky bucket)
+
+Multiple per-block device rules are reported in multiple rows
+(DEVi, i = 1 .. n):
+
+$ cat CGROUP/blockio.bandwidth
+MAJOR1 MINOR1 BW1 STRATEGY1 LEAKY_STAT1 BUCKET_SIZE1 BUCKET_FILL1 TIME_DELTA1
+MAJOR2 MINOR2 BW2 STRATEGY2 LEAKY_STAT2 BUCKET_SIZE2 BUCKET_FILL2 TIME_DELTA2
+...
+MAJORn MINORn BWn STRATEGYn LEAKY_STATn BUCKET_SIZEn BUCKET_FILLn TIME_DELTAn
+
+2.3. Examples
+
+* Mount the cgroup filesystem (blockio subsystem):
+ # mkdir /mnt/cgroup
+ # mount -t cgroup -oblockio blockio /mnt/cgroup
+
+* Instantiate the new cgroup "foo":
+ # mkdir /mnt/cgroup/foo
+ --> the cgroup foo has been created
+
+* Add the current shell process to the cgroup "foo":
+ # /bin/echo $$ > /mnt/cgroup/foo/tasks
+ --> the current shell has been added to the cgroup "foo"
+
+* Give maximum 1MiB/s of I/O bandwidth on /dev/sda for the cgroup "foo", using
+ leaky bucket throttling strategy:
+ # /bin/echo /dev/sda:$((1024 * 1024)):0:0 > \
+ > /mnt/cgroup/foo/blockio.bandwidth
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda
+
+* Give maximum 8MiB/s of I/O bandwidth on /dev/sdb for the cgroup "foo", using
+ token bucket throttling strategy, bucket size = 8MiB:
+ # /bin/echo /dev/sdb:$((8 * 1024 * 1024)):1:$((8 * 1024 * 1024)) > \
+ > /mnt/cgroup/foo/blockio.bandwidth
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda (controlled by leaky bucket throttling)
+ and 8MiB/s on /dev/sdb (controlled by token bucket throttling)
+
+* Run a benchmark doing I/O on /dev/sda and /dev/sdb; I/O limits and usage
+ defined for cgroup "foo" can be shown as follows:
+ # cat /mnt/cgroup/foo/blockio.bandwidth
+ 8 16 8388608 1 0 8388608 -522560 48
+ 8 0 1048576 0 737280 0 0 216
+
+* Extend the maximum I/O bandwidth for the cgroup "foo" to 16MiB/s on /dev/sda:
+ # /bin/echo /dev/sda:$((16 * 1024 * 1024)):0:0 > \
+ > /mnt/cgroup/foo/blockio.bandwidth
+ # cat /mnt/cgroup/foo/blockio.bandwidth
+ 8 16 8388608 1 0 8388608 -84432 206436
+ 8 0 16777216 0 0 0 0 15212
+
+* Remove limiting rule on /dev/sdb for cgroup "foo":
+ # /bin/echo /dev/sdb:0:0:0 > /mnt/cgroup/foo/blockio.bandwidth
+ # cat /mnt/cgroup/foo/blockio.bandwidth
+ 8 0 16777216 0 0 0 0 110388
+
+3. Advantages of providing this feature
+
+* Allow I/O traffic shaping for block devices shared among different cgroups
+* Improve I/O performance predictability on block devices shared between
+ different cgroups
+* Limiting rules do not depend on the particular I/O scheduler (anticipatory,
+ deadline, CFQ, noop) and/or the type of the underlying block devices
+* The bandwidth limitations are enforced for both synchronous and
+ asynchronous operations, including I/O passing through the page cache or
+ buffers and not only direct I/O (see below for details)
+* It is possible to implement a simple user-space application to dynamically
+ adjust the I/O workload of different process containers at run-time,
+ according to the particular users' requirements and applications' performance
+ constraints
+* It is even possible to implement event-based performance throttling
+ mechanisms; for example the same user-space application could actively
+ throttle the I/O bandwidth to reduce power consumption when the battery of a
+ mobile device is running low (power throttling) or when the temperature of a
+ hardware component is too high (thermal throttling)
+
+4. Design
+
+The I/O throttling is performed by imposing an explicit timeout, via
+schedule_timeout_killable(), on the processes that exceed the I/O bandwidth
+dedicated to the cgroup they belong to. I/O accounting happens per cgroup.
+
+It just works as expected for read operations: the
...
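
The token bucket strategy (STRATEGY == 1) described in the documentation above can be sketched in user space as follows. This is an illustration under stated assumptions, not the controller's code: struct token_bucket and token_bucket_request() are hypothetical names, time is passed in as milliseconds, and the request is charged immediately so the fill level may go negative (the sample output in section 2.3 does show a negative BUCKET_FILL, but whether the kernel code charges in exactly this way is an assumption).

```c
#include <stdint.h>

struct token_bucket {
	int64_t rate;		/* BW: tokens (bytes) added per second */
	int64_t bucket_size;	/* BUCKET_SIZE: maximum tokens held */
	int64_t tokens;		/* current fill; negative means a deficit */
	int64_t last_ms;	/* time of the previous request, in ms */
};

/*
 * Account a request of `bytes` at time `now_ms`.
 * Returns 0 if the request conforms, otherwise the number of milliseconds
 * the caller should sleep until the deficit has been refilled.
 */
static int64_t token_bucket_request(struct token_bucket *tb, int64_t now_ms,
				    int64_t bytes)
{
	int64_t elapsed = now_ms - tb->last_ms;

	/* Refill at `rate` tokens per second, capped at the bucket size. */
	tb->last_ms = now_ms;
	tb->tokens += tb->rate * elapsed / 1000;
	if (tb->tokens > tb->bucket_size)
		tb->tokens = tb->bucket_size;

	/* Remove one token per byte requested. */
	tb->tokens -= bytes;
	if (tb->tokens >= 0)
		return 0;

	/* Time needed to refill the deficit, rounded up to a whole ms. */
	return (-tb->tokens * 1000 + tb->rate - 1) / tb->rate;
}
```

With rate = 1000 and a full bucket of 2000 tokens, a first request of 1500 bytes passes at once, while a second request of 1500 bytes at the same instant leaves a deficit of 1000 tokens and is told to wait 1000 ms. The integer division in the refill step drops fractions of a token, which a real implementation would need to handle.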
| Topic: [PATCH -mm 0/3] cgroup: block device i/o bandwidth controller (v6) |
|---|
| [PATCH -mm 0/3] cgroup: block device i/o bandwidth controller (v6) [message #31948] |
Tue, 15 July 2008 16:40 |
Andrea Righi Messages: 65 Registered: May 2008 |
Member |
From: openvz.org
|
|
The objective of the i/o bandwidth controller is to improve i/o performance
predictability of different cgroups sharing the same block devices.
Compared with other priority/weight-based solutions, the approach used by this
controller is to explicitly choke applications' requests that directly (or
indirectly) generate i/o activity in the system.
The direct bandwidth limiting method has the advantage of improving the
performance predictability at the cost of reducing, in general, the overall
performance of the system (in terms of throughput).
Detailed information about the design, its goals and usage is given in the
documentation.
Tested against 2.6.26-rc8-mm1.
The all-in-one patch (and previous versions) can be found at:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/
Changelog: (v5 -> v6)
- do not make kernel threads sleep
- do not throttle i/o for pages that need to be urgently reclaimed in
submit_bio(READ, ...) (i.e. tasks such as pdflush and kswapd when
performing writeout)
- minor fixes and improvements (thanks to Li Zefan's review)
- fixed a small typo in the documentation (reported by Marco Innocenti)
TODO:
- see documentation
-Andrea
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
| Topic: [PATCH -mm 3/3] i/o accounting and control |
|---|
| [PATCH -mm 3/3] i/o accounting and control [message #31899] |
Sat, 12 July 2008 07:31 |
Andrea Righi Messages: 65 Registered: May 2008 |
Member |
From: openvz.org
|
|
Apply the io-throttle controller to the appropriate kernel functions. Both
accounting and throttling are performed by cgroup_io_throttle().
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
block/blk-core.c | 2 ++
fs/aio.c | 31 ++++++++++++++++++++++++++++++-
fs/buffer.c | 20 +++++++++++++++++---
fs/direct-io.c | 4 ++++
include/linux/sched.h | 3 +++
kernel/fork.c | 3 +++
mm/filemap.c | 18 +++++++++++++++++-
mm/page-writeback.c | 30 +++++++++++++++++++++++++++---
mm/readahead.c | 5 +++++
9 files changed, 108 insertions(+), 8 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 4c222ba..bffce33 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -26,6 +26,7 @@
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/interrupt.h>
#include <linux/cpu.h>
#include <linux/blktrace_api.h>
@@ -1483,6 +1484,7 @@ void submit_bio(int rw, struct bio *bio)
count_vm_events(PGPGOUT, count);
} else {
task_io_account_read(bio->bi_size);
+ cgroup_io_throttle(bio->bi_bdev, bio->bi_size, 1);
count_vm_events(PGPGIN, count);
}
diff --git a/fs/aio.c b/fs/aio.c
index 0051fd9..1f3abb3 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -22,6 +22,7 @@
#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/file.h>
+#include <linux/blk-io-throttle.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/slab.h>
@@ -1558,6 +1559,8 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
{
struct kiocb *req;
struct file *file;
+ struct block_device *bdev;
+ struct inode *inode;
ssize_t ret;
/* enforce forwards compatibility on users */
@@ -1580,10 +1583,26 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
if (unlikely(!file))
return -EBADF;
+ /*
+ * Pre-account AIO activity: we over-account *all* the bytes here;
+ * bytes read from the page cache and bytes written in already dirtied
+ * pages (that do not generate real i/o on block devices) will be
+ * subtracted later, following the path of aio_run_iocb().
+ */
+ inode = file->f_mapping->host;
+ bdev = inode->i_sb->s_bdev;
+ ret = cgroup_io_throttle(bdev, iocb->aio_nbytes, 0);
+ if (unlikely(ret)) {
+ fput(file);
+ ret = -EAGAIN;
+ goto out_cgroup_io_throttle;
+ }
+
req = aio_get_req(ctx); /* returns with 2 references to req */
if (unlikely(!req)) {
fput(file);
- return -EAGAIN;
+ ret = -EAGAIN;
+ goto out_cgroup_io_throttle;
}
req->ki_filp = file;
if (iocb->aio_flags & IOCB_FLAG_RESFD) {
@@ -1622,12 +1641,14 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
goto out_put_req;
spin_lock_irq(&ctx->ctx_lock);
+ set_in_aio();
aio_run_iocb(req);
if (!list_empty(&ctx->run_list)) {
/* drain the run list */
while (__aio_run_iocbs(ctx))
;
}
+ unset_in_aio();
spin_unlock_irq(&ctx->ctx_lock);
aio_put_req(req); /* drop extra ref to req */
return 0;
@@ -1635,6 +1656,8 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
out_put_req:
aio_put_req(req); /* drop extra ref to req */
aio_put_req(req); /* drop i/o ref to req */
+out_cgroup_io_throttle:
+ cgroup_io_throttle(bdev, -iocb->aio_nbytes, 0);
return ret;
}
@@ -1746,6 +1769,12 @@ asmlinkage long sys_io_cancel(aio_context_t ctx_id, struct iocb __user *iocb,
ret = -EAGAIN;
kiocb = lookup_kiocb(ctx, iocb, key);
if (kiocb && kiocb->ki_cancel) {
+ struct block_device *bdev;
+ struct inode *inode = kiocb->ki_filp->f_mapping->host;
+
+ bdev = inode->i_sb->s_bdev;
+ cgroup_io_throttle(bdev, -kiocb->ki_nbytes, 0);
+
cancel = kiocb->ki_cancel;
kiocb->ki_users ++;
kiocbSetCancelled(kiocb);
diff --git a/fs/buffer.c b/fs/buffer.c
index 4ffb5bb..89808b1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -35,6 +35,7 @@
#include <linux/suspend.h>
#include <linux/buffer_head.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/bio.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
@@ -708,11 +709,14 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
static int __set_page_dirty(struct page *page,
struct address_space *mapping, int warn)
{
+ ssize_t cgroup_io_acct = 0;
+ int ret = 0;
+
if (unlikely(!mapping))
return !TestSetPageDirty(page);
if (TestSetPageDirty(page))
- return 0;
+ goto out;
spin_lock_irq(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
@@ -723,14 +727,24 @@ static int __set_page_dirty(struct page *page,
__inc_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
+ cgroup_io_acct = PAGE_CACHE_SIZE;
}
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
spin_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
- return 1;
+ ret = 1;
+out:
+ if (is_in_aio() && !cgroup_io_acct)
+ cgroup_io_acct = -PAGE_CACHE_SIZE;
+ if (cgroup_io_acct) {
+ struct block_device *bdev = (mapping->host &&
+ mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
+ cgroup_io_throttle(bdev, cgroup_io_acct, 0);
+ }
+ return ret;
}
/*
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 9606ee8..f5dcb91 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -35,6 +35,7 @@
#include <linux/buffer_head.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
+#include <linux/blk-io-throttle.h>
#include <asm/atomic.h>
/*
@@ -660,6 +661,9 @@ submit_page_section(struct dio *dio, struct page *page,
/*
* Read accounting is performed in submit_bio()
*/
+ struct block_device *bdev = dio->bio ?
+ dio->bio->bi_bdev : NULL;
+ cgroup_io_throttle(bdev, len, 1);
task_io_account_write(len);
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ba43675..9d4c755 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1250,6 +1250,9 @@ struct task_struct {
u64 rchar, wchar, syscr, syscw;
#endif
struct task_io_accounting ioac;
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ atomic_t in_aio;
+#endif
#if defined(CONFIG_TASK_XACCT)
u64 acct_rss_mem1; /* accumulated rss usage */
u64 acct_vm_mem1; /* accumulated virtual memory usage */
diff --git a/kernel/fork.c b/kernel/fork.c
index aed1ff7..f8cf5da 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1029,6 +1029,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
task_io_accounting_init(p);
acct_clear_integrals(p);
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ atomic_set(&p->in_aio, 0);
+#endif
p->it_virt_expires = cputime_zero;
p->it_prof_expires = cputime_zero;
p->it_sched_expires = 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index 7567d86..bb80789 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -13,6 +13,7 @@
#include <linux/slab.h>
#include <linux/compiler.h>
#include <linux/fs.h>
+#include <linux/blk-io-throttle.h>
#include <linux/uaccess.h>
#include <linux/aio.h>
#include <linux/capability.h>
@@ -1011,6 +1012,7 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
pgoff_t prev_index;
unsigned long offset; /* offset into pagecache page */
unsigned int prev_offset;
+ int was_page_ok = 0;
int error;
index = *ppos >> PAGE_CACHE_SHIFT;
@@ -1023,7 +1025,8 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
struct page *page;
pgoff_t end_index;
loff_t isize;
- unsigned long nr, ret;
+ ssize_t nr;
+ unsigned long ret;
cond_resched();
find_page:
@@ -1051,6 +1054,8 @@ find_page:
desc, offset))
goto page_not_up_to_date_locked;
unlock_page(page);
+ } else {
+ was_page_ok = 1;
}
page_ok:
/*
@@ -1080,6 +1085,17 @@ page_ok:
}
nr = nr - offset;
+ /*
+ * De-account i/o in case of AIO read from the page cache.
+ * AIO accounting was performed in io_submit_one().
+ */
+ if (is_in_aio() && was_page_ok) {
+ struct block_device *bdev = (inode &&
+ inode->i_sb->s_bdev) ?
+ inode->i_sb->s_bdev : NULL;
+ cgroup_io_throttle(bdev, -nr, 0);
+ }
+
/* If users can be writing to this page using arbitrary
* virtual addresses, take care about potential aliasing
* before reading the page on the kernel side.
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 29b1d1e..c6207de 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
#include <linux/init.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/blkdev.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
@@ -430,6 +431,9 @@ static void balance_dirty_pages(struct address_space *mapping)
unsigned long write_chunk = sync_writeback_pages();
struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct block_device *bdev = (mapping->host &&
+ mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
for (;;) {
struct writeback_control wbc = {
@@ -512,6 +516,14 @@ static void balance_dirty_pages(struct address_space *mapping)
return; /* pdflush is already working this queue */
/*
+ * Apply the cgroup i/o throttling limitations. The accounting of write
+ * activity in page cache is performed in __set_page_dirty(), but since
+ * we cannot sleep there, 0 bytes are accounted here and the functi
...
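
The AIO handling in the fs/aio.c and mm/filemap.c hunks above follows a "charge everything up front, refund later" pattern: io_submit_one() accounts all the requested bytes, and the bytes that turn out not to generate real i/o are subtracted afterwards by passing a negative size. A minimal user-space sketch of that bookkeeping is shown below; struct io_account, io_charge() and aio_read_account() are hypothetical names used only for illustration and do not appear in the patch.

```c
#include <stdint.h>

/* Per-cgroup byte counter (a stand-in for the controller's statistics). */
struct io_account {
	int64_t bytes;
};

/* Charge `bytes` to the account; a negative value refunds a prior charge. */
static void io_charge(struct io_account *acct, int64_t bytes)
{
	acct->bytes += bytes;
}

/*
 * Account an AIO read of `nbytes`, of which `cached_bytes` are later found
 * in the page cache. Returns the bytes that remain charged, that is, the
 * bytes expected to hit the block device.
 */
static int64_t aio_read_account(struct io_account *acct, int64_t nbytes,
				int64_t cached_bytes)
{
	int64_t before = acct->bytes;

	io_charge(acct, nbytes);	/* pre-account at submit time */
	io_charge(acct, -cached_bytes);	/* de-account page cache hits */
	return acct->bytes - before;
}
```

Over-accounting first means the limit is never exceeded by work that has not been charged yet, at the price of briefly overstating the cgroup's usage.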
| Topic: [PATCH -mm 1/3] i/o bandwidth controller documentation |
|---|
| [PATCH -mm 1/3] i/o bandwidth controller documentation [message #31897] |
Sat, 12 July 2008 07:31 |
Andrea Righi Messages: 65 Registered: May 2008 |
Member |
From: openvz.org
|
|
Documentation of the block device I/O bandwidth controller: description, usage,
advantages and design.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
Documentation/controllers/io-throttle.txt | 282 +++++++++++++++++++++++++++++
1 files changed, 282 insertions(+), 0 deletions(-)
create mode 100644 Documentation/controllers/io-throttle.txt
diff --git a/Documentation/controllers/io-throttle.txt b/Documentation/controllers/io-throttle.txt
new file mode 100644
index 0000000..ab33633
--- /dev/null
+++ b/Documentation/controllers/io-throttle.txt
@@ -0,0 +1,282 @@
+
+ Block device I/O bandwidth controller
+
+1. Description
+
+This controller limits the I/O bandwidth of specific block devices for
+specific process containers (cgroups) by imposing additional delays on the I/O
+requests of processes that exceed the limits defined in the control group
+filesystem.
+
+Bandwidth limiting rules offer better control over QoS than priority- or
+weight-based solutions, which only give information about applications'
+relative performance requirements. Moreover, priority-based solutions are
+affected by performance bursts when only low-priority requests are submitted
+to a general purpose resource dispatcher.
+
+The goal of the I/O bandwidth controller is to improve performance
+predictability and provide performance isolation of different control groups
+sharing the same block devices.
+
+NOTE #1: If you're looking for a way to improve the overall throughput of the
+system, you should probably use a different solution.
+
+NOTE #2: The current implementation does not guarantee minimum bandwidth
+levels; QoS is implemented only by slowing down I/O "traffic" that exceeds the
+limits specified by the user. Minimum I/O rate thresholds are expected to be
+met if the user configures a proper I/O bandwidth partitioning of the block
+devices shared among the different cgroups (in theory, if the sum of all the
+limits defined for a block device doesn't exceed the total I/O bandwidth of
+that device).
+
+2. User Interface
+
+A new I/O bandwidth limitation rule is described using the file
+blockio.bandwidth.
+
+The same file can be used to set multiple rules for different block devices
+relative to the same cgroup.
+
+2.1. Configure I/O limiting rules
+
+The syntax to configure a limiting rule is the following:
+
+# /bin/echo DEV:BW:STRATEGY:BUCKET_SIZE > CGROUP/blockio.bandwidth
+
+- DEV is the name of the device the limiting rule is applied to.
+
+- BW is the maximum I/O bandwidth on DEV allowed for CGROUP; bandwidth must
+ be expressed in bytes/s. An I/O bandwidth limiting rule for a block device
+ DEV can be removed by setting the BW value to 0.
+
+- STRATEGY is the throttling strategy used to throttle the applications' I/O
+ requests from/to device DEV. At the moment two different strategies can be
+ used:
+
+ 0 = leaky bucket: the controller accepts at most B bytes (B = BW * time);
+ further I/O requests are delayed by scheduling a timeout
+ for the tasks that made those requests.
+
+            Different I/O flow
+               | | |
+               | v |
+               |   v
+               v
+              .......
+              \     /
+               \   /  leaky-bucket
+                ---
+                |||
+                vvv
+            Smoothed I/O flow
+
+ 1 = token bucket: BW tokens are added to the bucket every second; the bucket
+ can hold at most BUCKET_SIZE tokens; I/O requests are
+ accepted if there are available tokens in the bucket; when
+ a request of N bytes arrives, N tokens are removed from the
+ bucket; if fewer than N tokens are available, the request is
+ delayed until a sufficient number of tokens is available in
+ the bucket.
+
+            Tokens (I/O rate)
+                o
+                o
+                o
+              ....... <--.
+              \     /    | Bucket size (burst limit)
+               \ooo/     |
+                ---   <--'
+                 |ooo
+    Incoming --->|---> Conforming
+    I/O          |oo   I/O
+    requests  -->|-->  requests
+                 |
+            ---->|
+
+ Leaky bucket is more precise than token bucket in respecting the bandwidth
+ limits, because bursty workloads are always smoothed. Token bucket, instead,
+ allows a small degree of irregularity in the I/O flows (burst limit) and is
+ therefore more efficient (bursty workloads are not smoothed when there are
+ sufficient tokens in the bucket).
+
+- BUCKET_SIZE is used only with token bucket (STRATEGY == 1) and defines the
+ size of the bucket in bytes.
+
+- CGROUP is the name of the limited process container.
+
+The following syntaxes are also allowed:
+
+- remove an I/O bandwidth limiting rule
+# /bin/echo DEV:0 > CGROUP/blockio.bandwidth
+
+- configure a limiting rule using leaky bucket throttling (ignore bucket size):
+# /bin/echo DEV:BW:0 > CGROUP/blockio.bandwidth
+
+2.2. Show I/O limiting rules
+
+All the defined rules and statistics for a specific cgroup can be shown by
+reading the file blockio.bandwidth. The output has the following format:
+
+$ cat CGROUP/blockio.bandwidth
+MAJOR MINOR BW STRATEGY LEAKY_STAT BUCKET_SIZE BUCKET_FILL TIME_DELTA
+
+- MAJOR is the major device number of DEV (defined above)
+
+- MINOR is the minor device number of DEV (defined above)
+
+- BW, STRATEGY and BUCKET_SIZE are the same parameters defined above
+
+- LEAKY_STAT is the number of bytes currently allowed by the I/O bandwidth
+ controller (only used with leaky bucket strategy - STRATEGY == 0)
+
+- BUCKET_FILL represents the number of tokens present in the bucket (only used
+ with token bucket strategy - STRATEGY == 1)
+
+- TIME_DELTA can be one of the following:
+ - the number of jiffies elapsed since the last I/O request (token bucket)
+ - the number of jiffies during which the bytes given by LEAKY_STAT have been
+ accumulated (leaky bucket)
+
+Multiple per-block device rules are reported in multiple rows
+(DEVi, i = 1 .. n):
+
+$ cat CGROUP/blockio.bandwidth
+MAJOR1 MINOR1 BW1 STRATEGY1 LEAKY_STAT1 BUCKET_SIZE1 BUCKET_FILL1 TIME_DELTA1
+MAJOR2 MINOR2 BW2 STRATEGY2 LEAKY_STAT2 BUCKET_SIZE2 BUCKET_FILL2 TIME_DELTA2
+...
+MAJORn MINORn BWn STRATEGYn LEAKY_STATn BUCKET_SIZEn BUCKET_FILLn TIME_DELTAn
+
+2.3. Examples
+
+* Mount the cgroup filesystem (blockio subsystem):
+ # mkdir /mnt/cgroup
+ # mount -t cgroup -oblockio blockio /mnt/cgroup
+
+* Instantiate the new cgroup "foo":
+ # mkdir /mnt/cgroup/foo
+ --> the cgroup foo has been created
+
+* Add the current shell process to the cgroup "foo":
+ # /bin/echo $$ > /mnt/cgroup/foo/tasks
+ --> the current shell has been added to the cgroup "foo"
+
+* Give maximum 1MiB/s of I/O bandwidth on /dev/sda for the cgroup "foo", using
+ leaky bucket throttling strategy:
+ # /bin/echo /dev/sda:$((1024 * 1024)):0:0 > \
+ > /mnt/cgroup/foo/blockio.bandwidth
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda
+
+* Give maximum 8MiB/s of I/O bandwidth on /dev/sdb for the cgroup "foo", using
+ token bucket throttling strategy, bucket size = 8MB:
+ # /bin/echo /dev/sdb:$((8 * 1024 * 1024)):1:$((8 * 1024 * 1024)) > \
+ > /mnt/cgroup/foo/blockio.bandwidth
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda (controlled by leaky bucket throttling)
+ and 8MiB/s on /dev/sdb (controlled by token bucket throttling)
+
+* Run a benchmark doing I/O on /dev/sda and /dev/sdb; I/O limits and usage
+ defined for cgroup "foo" can be shown as follows:
+ # cat /mnt/cgroup/foo/blockio.bandwidth
+ 8 16 8388608 1 0 8388608 -522560 48
+ 8 0 1048576 0 737280 0 0 216
+
+* Extend the maximum I/O bandwidth for the cgroup "foo" to 16MiB/s on /dev/sda:
+ # /bin/echo /dev/sda:$((16 * 1024 * 1024)):0:0 > \
+ > /mnt/cgroup/foo/blockio.bandwidth
+ # cat /mnt/cgroup/foo/blockio.bandwidth
+ 8 16 8388608 1 0 8388608 -84432 206436
+ 8 0 16777216 0 0 0 0 15212
+
+* Remove limiting rule on /dev/sdb for cgroup "foo":
+ # /bin/echo /dev/sdb:0:0:0 > /mnt/cgroup/foo/blockio.bandwidth
+ # cat /mnt/cgroup/foo/blockio.bandwidth
+ 8 0 16777216 0 0 0 0 110388
+
+3. Advantages of providing this feature
+
+* Allow I/O traffic shaping for block devices shared among different cgroups
+* Improve I/O performance predictability on block devices shared between
+ different cgroups
+* Limiting rules do not depend on the particular I/O scheduler (anticipatory,
+ deadline, CFQ, noop) and/or the type of the underlying block devices
+* The bandwidth limitations are enforced for both synchronous and
+ asynchronous operations, including I/O passing through the page cache or
+ buffers and not only direct I/O (see below for details)
+* It is possible to implement a simple user-space application to dynamically
+ adjust the I/O workload of different process containers at run-time,
+ according to the particular users' requirements and applications' performance
+ constraints
+* It is even possible to implement event-based performance throttling
+ mechanisms; for example the same user-space application could actively
+ throttle the I/O bandwidth to reduce power consumption when the battery of a
+ mobile device is running low (power throttling) or when the temperature of a
+ hardware component is too high (thermal throttling)
+
+4. Design
+
+The I/O throttling is performed by imposing an explicit timeout, via
+schedule_timeout_killable(), on the processes that exceed the I/O bandwidth
+dedicated to the cgroup they belong to. I/O accounting happens per cgroup.
+
+It just works as expected for read operations: the
...
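
The leaky bucket strategy (STRATEGY == 0) from the documentation above, where at most B = BW * time bytes are accepted and further requests are delayed, can be sketched in user space like this. It is only an illustration under stated assumptions: struct leaky_bucket and leaky_bucket_request() are hypothetical names, time is passed in as milliseconds, and restarting the accounting window whenever the cgroup is under its limit is a simplification chosen for this sketch rather than something the patch text shown here confirms.

```c
#include <stdint.h>

struct leaky_bucket {
	int64_t rate;		/* BW: allowed bytes per second */
	int64_t stat;		/* bytes accounted in the current window */
	int64_t start_ms;	/* start of the current window, in ms */
};

/*
 * Account a request of `bytes` at time `now_ms`.
 * Returns 0 if the accumulated bytes fit within rate * elapsed time,
 * otherwise the number of milliseconds the caller should sleep.
 */
static int64_t leaky_bucket_request(struct leaky_bucket *lb, int64_t now_ms,
				    int64_t bytes)
{
	int64_t elapsed = now_ms - lb->start_ms;
	int64_t needed;

	lb->stat += bytes;

	/* Time that `stat` bytes take at `rate` bytes/s, rounded up to ms. */
	needed = (lb->stat * 1000 + lb->rate - 1) / lb->rate;
	if (needed <= elapsed) {
		/* Under the limit: start a new window, keep no burst credit. */
		lb->stat = 0;
		lb->start_ms = now_ms;
		return 0;
	}
	return needed - elapsed;
}
```

Unlike the token bucket, idle time earns no credit here: after a long pause the window simply restarts, so a burst is still spread out at the configured rate.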
| Topic: [PATCH -mm 0/3] cgroup: block device i/o bandwidth controller (v5) |
|---|
| [PATCH -mm 0/3] cgroup: block device i/o bandwidth controller (v5) [message #31896] |
Sat, 12 July 2008 07:31 |
Andrea Righi Messages: 65 Registered: May 2008 |
Member |
From: openvz.org
|
|
The objective of the i/o bandwidth controller is to improve i/o performance
predictability of different cgroups sharing the same block devices.
Compared with other priority/weight-based solutions, the approach used by this
controller is to explicitly choke applications' requests that directly (or
indirectly) generate i/o activity in the system.
The direct bandwidth limiting method has the advantage of improving the
performance predictability at the cost of reducing, in general, the overall
performance of the system (in terms of throughput).
Detailed information about the design, its goals and usage is given in the
documentation.
Tested against 2.6.26-rc8-mm1.
The all-in-one patch (and previous versions) can be found at:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/
Thanks to Li Zefan for reviewing.
Changelog: (v4 -> v5)
- rebase to 2.6.26-rc8-mm1
- handle AIO appropriately: return -EAGAIN from sys_io_submit(), instead of
putting tasks doing AIO to sleep
- userspace=>kernel interface now accepts the following syntaxes:
* dev:0 <- to delete a limiting rule
* dev:bw-limit:0 <- define a leaky bucket throttling rule
* dev:bw-limit:1:bucket-size <- define a token bucket throttling rule
- use .write_string and .read_seq_string to simplify iothrottle_read() and
iothrottle_write() functions
- use an enum to enumerate the various throttling algorithms
TODO:
- see documentation
-Andrea