OpenVZ Forum


Home » Mailing lists » Devel » [PATCH 2/3] i/o bandwidth controller infrastructure
Re: [PATCH 2/3] i/o bandwidth controller infrastructure [message #31376 is a reply to message #31181] Thu, 26 June 2008 00:29 Go to previous messageGo to previous message
akpm is currently offline  akpm
Messages: 224
Registered: March 2007
Senior Member
On Fri, 20 Jun 2008 12:05:34 +0200
Andrea Righi <righi.andrea@gmail.com> wrote:

> This is the core io-throttle kernel infrastructure. It creates the basic
> interfaces to cgroups and implements the I/O measurement and throttling
> functions.
> 
> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
> ---
>  block/Makefile                  |    2 +
>  block/blk-io-throttle.c         |  393 +++++++++++++++++++++++++++++++++++++++
>  include/linux/blk-io-throttle.h |   12 ++
>  include/linux/cgroup_subsys.h   |    6 +
>  init/Kconfig                    |   10 +
>  5 files changed, 423 insertions(+), 0 deletions(-)
>  create mode 100644 block/blk-io-throttle.c
>  create mode 100644 include/linux/blk-io-throttle.h
> 
> diff --git a/block/Makefile b/block/Makefile
> index 5a43c7d..8dec69b 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -14,3 +14,5 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
>  
>  obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
>  obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
> +
> +obj-$(CONFIG_CGROUP_IO_THROTTLE)	+= blk-io-throttle.o
> diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
> new file mode 100644
> index 0000000..4ec02bb
> --- /dev/null
> +++ b/block/blk-io-throttle.c
> @@ -0,0 +1,393 @@
> +/*
> + * blk-io-throttle.c
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 021110-1307, USA.
> + *
> + * Copyright (C) 2008 Andrea Righi <righi.andrea@gmail.com>
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/cgroup.h>
> +#include <linux/slab.h>
> +#include <linux/gfp.h>
> +#include <linux/err.h>
> +#include <linux/sched.h>
> +#include <linux/fs.h>
> +#include <linux/jiffies.h>
> +#include <linux/hardirq.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/uaccess.h>
> +#include <linux/vmalloc.h>
> +#include <linux/blk-io-throttle.h>
> +
> +#define ONE_SEC 1000000L /* # of microseconds in a second */

Remove this, use USEC_PER_SEC throughout.

> +#define KBS(x) ((x) * ONE_SEC >> 10)

Convert to lower-case-named C function, please.

> +
> +struct iothrottle_node {
> +	struct list_head node;
> +	dev_t dev;
> +	unsigned long iorate;
> +	unsigned long timestamp;
> +	atomic_long_t stat;
> +};

Please document each field in structures.  This is usually more useful
and important than documenting the code whcih manipulates those fields.

It is important that the units of fields such as iorate and timestamp
and stamp be documented.

> +struct iothrottle {
> +	struct cgroup_subsys_state css;
> +	/* protects the list below, not the single elements */
> +	spinlock_t lock;
> +	struct list_head list;
> +};

Looking elsewhere in the code it appears that some RCU-based locking is
performed.  That should be documented somewhere.  Fully.  At the
definition site of the data whcih is RCU-protceted would be a good
site.


> +static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cont)
> +{
> +	return container_of(cgroup_subsys_state(cont, iothrottle_subsys_id),
> +			    struct iothrottle, css);
> +}
> +
> +static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
> +{
> +	return container_of(task_subsys_state(task, iothrottle_subsys_id),
> +			    struct iothrottle, css);
> +}
> +
> +static inline struct iothrottle_node *iothrottle_search_node(
> +					const struct iothrottle *iot,
> +					dev_t dev)
> +{
> +	struct iothrottle_node *n;
> +
> +	list_for_each_entry_rcu(n, &iot->list, node)
> +		if (n->dev == dev)
> +			return n;
> +	return NULL;
> +}

This will be too large for inlining.

This function presumably has caller-provided locking requirements? 
They should be documented here.

> +static inline void iothrottle_insert_node(struct iothrottle *iot,
> +					 struct iothrottle_node *n)
> +{
> +	list_add_rcu(&n->node, &iot->list);
> +}
> +
> +static inline struct iothrottle_node *iothrottle_replace_node(
> +					struct iothrottle *iot,
> +					struct iothrottle_node *old,
> +					struct iothrottle_node *new)
> +{
> +	list_replace_rcu(&old->node, &new->node);
> +	return old;
> +}

Dittoes.

> +static inline struct iothrottle_node *iothrottle_delete_node(
> +					struct iothrottle *iot,
> +					dev_t dev)
> +{
> +	struct iothrottle_node *n;
> +
> +	list_for_each_entry(n, &iot->list, node)
> +		if (n->dev == dev) {
> +			list_del_rcu(&n->node);
> +			return n;
> +		}
> +	return NULL;
> +}

Too large for inlining.

Was list_for_each_entry_rcu() needed?

Does this function have any caller-provided locking requirements?


> +/*
> + * Note: called from kernel/cgroup.c with cgroup_lock() held.
> + */
> +static struct cgroup_subsys_state *iothrottle_create(
> +			struct cgroup_subsys *ss, struct cgroup *cont)

static struct cgroup_subsys_state *iothrottle_create(struct cgroup_subsys *ss,
							struct cgroup *cont)

would be more typical code layout (here and elsewhere)

static struct cgroup_subsys_state *
iothrottle_create(struct cgroup_subsys *ss, struct cgroup *cont)

is another way.

> +{
> +	struct iothrottle *iot;
> +
> +	iot = kmalloc(sizeof(*iot), GFP_KERNEL);
> +	if (unlikely(!iot))
> +		return ERR_PTR(-ENOMEM);
> +
> +	INIT_LIST_HEAD(&iot->list);
> +	spin_lock_init(&iot->lock);
> +
> +	return &iot->css;
> +}
> +
> +/*
> + * Note: called from kernel/cgroup.c with cgroup_lock() held.
> + */
> +static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
> +{
> +	struct iothrottle_node *n, *p;
> +	struct iothrottle *iot = cgroup_to_iothrottle(cont);
> +
> +	/*
> +	 * don't worry about locking here, at this point there must be not any
> +	 * reference to the list.
> +	 */
> +	list_for_each_entry_safe(n, p, &iot->list, node)
> +		kfree(n);
> +	kfree(iot);
> +}
> +
> +static ssize_t iothrottle_read(struct cgroup *cont,
> +			       struct cftype *cft,
> +			       struct file *file,
> +			       char __user *userbuf,
> +			       size_t nbytes,
> +			       loff_t *ppos)
> +{
> +	struct iothrottle *iot;
> +	char *buffer;
> +	int s = 0;
> +	struct iothrottle_node *n;
> +	ssize_t ret;
> +
> +	buffer = kmalloc(nbytes + 1, GFP_KERNEL);
> +	if (!buffer)
> +		return -ENOMEM;
> +
> +	cgroup_lock();
> +	if (cgroup_is_removed(cont)) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +
> +	iot = cgroup_to_iothrottle(cont);
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(n, &iot->list, node) {
> +		unsigned long delta, rate;
> +
> +		BUG_ON(!n->dev);
> +		delta = jiffies_to_usecs((long)jiffies - (long)n->timestamp);
> +		rate = delta ? KBS(atomic_long_read(&n->stat) / delta) : 0;
> +		s += scnprintf(buffer + s, nbytes - s,
> +			       "=== device (%u,%u) ===\n"
> +			       "  bandwidth limit: %lu KiB/sec\n"
> +			       "current i/o usage: %lu KiB/sec\n",
> +			       MAJOR(n->dev), MINOR(n->dev),
> +			       n->iorate, rate);
> +	}
> +	rcu_read_unlock();
> +	ret = simple_read_from_buffer(userbuf, nbytes, ppos, buffer, s);
> +out:
> +	cgroup_unlock();
> +	kfree(buffer);
> +	return ret;
> +}

This is a kernel->userspace interface.  It is part of the kernel ABI. 
We will need to support in a back-compatible fashion for ever.  Hence
it is important.  The entire proposed kernel<->userspace interface
should be completely described in the changelog or the documentation so
that we can understand and review what you are proposing.

> +static inline dev_t devname2dev_t(const char *buf)
> +{
> +	struct block_device *bdev;
> +	dev_t ret;
> +
> +	bdev = lookup_bdev(buf);
> +	if (IS_ERR(bdev))
> +		return 0;
> +
> +	BUG_ON(!bdev->bd_inode);
> +	ret = bdev->bd_inode->i_rdev;
> +	bdput(bdev);
> +
> +	return ret;
> +}

Too large to inline.  I get tired of telling people this.  Please just
remove all the inlining from all the patches.  Then go back and
selectively inline any functions which really do need to be inlined
(overall reduction in generated .text is a good heuristic).

How can this function not be racy?  We're returning a dev_t which
refers to a device upon which we have no reference.  A better design
might be to rework the who9le thing to operate on a `struct
block_device *' upon whcih this code holds a reference, rather than
using bare dev_t.

I _guess_ it's OK doing an in-kernel filesystem lookup here.  But did
you consider just passing the dev_t in from userspace?  It's just a
stat().

Does all this code treat /dev/sda1 as a separate device from /dev/sda2?
 If so, that wou
...

 
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Previous Topic: mini-summit and ols
Next Topic: Re: Obtaining the latest kernel to test out the latest on the containers
Goto Forum:
  


Current Time: Sat Aug 02 10:26:31 GMT 2025

Total time taken to generate the page: 0.62983 seconds