OpenVZ Forum


Home » Mailing lists » Devel » [patch00/05]: Containers(V2)- Introduction
[patch00/05]: Containers(V2)- Introduction [message #6523] Wed, 20 September 2006 02:16 Go to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
Containers:

Commodity HW is becoming more powerful. This is giving opportunity to
run different workloads on the same platform for better HW resource
utilization. To run different workloads efficiently on the same
platform, it is critical that we have a notion of limits for each
workload in Linux kernel. Current cpuset feature in Linux kernel
provides grouping of CPU and memory support to some extent (for NUMA
machines).

For example, a user can run a batch job like backup inside containers.
This job if run unconstrained could step over most of the memory present
in system thus impacting other workloads running on the system at that
time. But when the same job is run inside containers then the backup
job is run within container limits.

We use the term container to indicate a structure against which we track
and charge utilization of system resources like memory, tasks etc for a
workload. Containers will allow system admins to customize the
underlying platform for different applications based on their
performance and HW resource utilization needs. Containers contain
enough infrastructure to allow optimal resource utilization without
bogging down rest of the kernel. A system admin should be able to
create, manage and free containers easily.

At the same time, changes in kernel are minimized so as this support can
be easily integrated with mainline kernel.

The user interface for containers is through configfs. Appropriate file
system privileges are required to do operations on each container.
Currently implemented container resources are automatically visible to
user space through /configfs/container/<container_name> after a
container is created.

Signed-off-by: Rohit Seth <rohitseth@google.com>

Diffstat for the patch set (against linux-2.6.18-rc6-mm2_:

Documentation/containers.txt | 65 ++++
fs/inode.c | 3
include/linux/container.h | 167 ++++++++++
include/linux/fs.h | 5
include/linux/mm_inline.h | 4
include/linux/mm_types.h | 4
include/linux/sched.h | 6
init/Kconfig | 8
kernel/Makefile | 1
kernel/container_configfs.c | 440 ++++++++++++++++++++++++++++
kernel/exit.c | 2
kernel/fork.c | 9
mm/Makefile | 2
mm/container.c | 658 +++++++++++++++++++++++++++++++++++++++++++
mm/container_mm.c | 512 +++++++++++++++++++++++++++++++++
mm/filemap.c | 4
mm/page_alloc.c | 3
mm/rmap.c | 8
mm/swap.c | 1
mm/vmscan.c | 1
20 files changed, 1902 insertions(+), 1 deletion(-)

Changes from version 1:
Fixed the Documentation error
Fixed the corruption in container task list
Added the support for showing all the tasks belonging to a container
through showtask attribute
moved the Kconfig changes to init directory (from mm)
Fixed the bug of unregistering container subsystem if we are not able to
create workqueue
Better support for handling limits for file pages. This now includes
support for flushing and invalidating page cache pages.
Minor other changes.

************************************************************ *****
This patch set has basic container support that includes:

- Create a container using mkdir command in configfs

- Free a container using rmdir command

- Dynamically adjust memory and task limits for container.

- Add/Remove a task to container (given a pid)

- Files are currently added as part of open from a task that already
belongs to a container.

- Keep track of active, anonymous, mapped and pagecache usage of
container memory

- Does not allow more than task_limit number of tasks to be created in
the container.

- Over the limit memory handler is called when number of pages (anon +
pagecache) exceed the limit. It is also called when number of active
pages exceed the page limit. Currently, this memory handler scans the
mappings and tasks belonging to container (file and anonymous) and tries
to deactivate pages. If the number of page cache pages is also high
then it also invalidate mappings. The thought behind this scheme is, it
is okay for containers to go over limit as long they run in degraded
manner when they are over their limit. Also, if there is any memory
pressure then pages belonging to over the limit container(s) become
prime candidates for kernel reclaimer. Container mutex is also held
during the time this handler is working its way through to prevent any
further addition of resources (like tasks or mappings) to this
container. Though it also blocks removal of same resources from the
container for the same time. It is possible that over the limit page
handler takes lot of time if memory pressure on a container is
continuously very high. The limits, like how long a task should
schedule out when it hits memory limit, is also on the lower side at
present (particularly when it is memory hogger). But should be easy to
change if need be.

- Indicate the number of times the page limit and task limit is hit

- Indicate the tasks (pids) belonging to container.

Below is a one line description for patches that will follow:

[patch01]: Documentation on how to use containers
(Documentation/container.txt)

[patch02]: Changes in the generic part of kernel code

[patch03]: Container's interface with configfs

[patch04]: Core container support

[patch05]: Over the limit memory handler.

TODO:

- some code(like container_add_task) in mm/container.c should go
elsewhere.
- Support adding/removing a file name to container through configfs
- /proc/pid/container to show the container id (or name)
- More testing for memory controller. Currently it is possible that
limits are exceeded. See if a call to reclaim can be easily integrated.
- Kernel memory tracking (based on patches from BC)
- Limit on user locked memory
- Huge memory support
- Stress testing with containers
- One shot view of all containers
- CKRM folks are interested in seeing all processes belonging to a
container. Add the attribute show_tasks to container.
- Add logic so that the sum of limits are not exceeding appropriate
system requirements.
- Extend it with other controllers (CPU and Disk I/O)
- Add flags bits for supporting different actions (like in some cases
provide a hard memory limit and in some cases it could be soft).
- Capability to kill processes for the extreme cases.
...

This is based on lot of discussions over last month or so. I hope this
patch set is something that we can agree and more support can be added
on top of this. Please provide feedback and add other extensions that
are useful in the TODO list.

Thanks,
-rohit
Re: [patch00/05]: Containers(V2)- Introduction [message #6536 is a reply to message #6523] Wed, 20 September 2006 05:39 Go to previous messageGo to next message
Nick Piggin is currently offline  Nick Piggin
Messages: 35
Registered: March 2006
Member
Rohit Seth wrote:

>Containers:
>
>
[...]

>This is based on lot of discussions over last month or so. I hope this
>patch set is something that we can agree and more support can be added
>on top of this. Please provide feedback and add other extensions that
>are useful in the TODO list.
>

Hi Rohit,

Sorry for the late reply. I was just about to comment on your earlier
patchset but I will do so here instead.

Anyway I don't think I have much to say other than: this is almost
exactly as I had imagined the memory resource tracking should look
like. Just a small number of hooks and a very simple set of rules for
tracking allocations. Also, the possibility to track kernel
allocations as a whole rather than at individual callsites (which
shouldn't be too difficult to implement).

If anything I would perhaps even argue for further cutting down the
number of hooks and add them back as they prove to be needed.

I'm not sure about containers & workload management people, but from
a core mm/ perspective I see no reason why this couldn't get in,
given review and testing. Great!

Nick
--

Send instant messages to your online friends http://au.messenger.yahoo.com
Re: [patch00/05]: Containers(V2)- Introduction [message #6574 is a reply to message #6523] Wed, 20 September 2006 13:06 Go to previous messageGo to next message
Cedric Le Goater is currently offline  Cedric Le Goater
Messages: 443
Registered: February 2006
Senior Member
Rohit Seth wrote:
>
> This is based on lot of discussions over last month or so. I hope this
> patch set is something that we can agree and more support can be added
> on top of this. Please provide feedback and add other extensions that
> are useful in the TODO list.

thanks rohit for this patchset.

I applied it on rc7-mm1, compiles smoothly and boots. Here's a edge
issue, I got this oops while rmdir a container with running tasks.

C.


BUG: unable to handle kernel paging request at virtual address 00100100
printing eip:
c01470ea
*pde = 00000000
Oops: 0000 [#1]
last sysfs file: /block/hda/range
Modules linked in:
CPU: 0
EIP: 0060:[<c01470ea>] Not tainted VLI
EFLAGS: 00000282 (2.6.18-rc7-mm1-lxc2 #8)
EIP is at container_remove_tasks+0x21/0x3c
eax: 00000007 ebx: 000ffbb0 ecx: c13e4330 edx: 00000196
esi: c13e42f4 edi: 00000000 ebp: c6e39eb8 esp: c6e39eb0
ds: 007b es: 007b ss: 0068
Process rmdir (pid: 9081, ti=c6e38000 task=c7223710 task.ti=c6e38000)
Stack: c13e42f4 00000000 c6e39ec8 c0147206 c13e42c0 c0307e44 c6e39ed4 c012e060
c13e42c0 c6e39eec c017c0ac 00000000 c13e42d8 c017c0cb c13e42c0 c6e39ef4
c017c0d6 c6e39f04 c01f441e c0307de0 c0307de0 c6e39f0c c017c068 c6e39f30
Call Trace:
[<c0147206>] free_container+0x18/0x95
[<c012e060>] simple_containerfs_release+0x15/0x21
[<c017c0ac>] config_item_cleanup+0x42/0x61
[<c017c0d6>] config_item_release+0xb/0xd
[<c01f441e>] kref_put+0x66/0x74
[<c017c068>] config_item_put+0x14/0x16
[<c017b760>] configfs_rmdir+0x16d/0x1b5
[<c014ff07>] vfs_rmdir+0x54/0x96
[<c015180c>] do_rmdir+0x8f/0xcc
[<c0151888>] sys_rmdir+0x10/0x12
[<c01027dc>] syscall_call+0x7/0xb
Re: [patch00/05]: Containers(V2)- Introduction [message #6576 is a reply to message #6536] Wed, 20 September 2006 16:27 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Wed, 2006-09-20 at 15:39 +1000, Nick Piggin wrote:


> Anyway I don't think I have much to say other than: this is almost
> exactly as I had imagined the memory resource tracking should look
> like. Just a small number of hooks and a very simple set of rules for
> tracking allocations. Also, the possibility to track kernel
> allocations as a whole rather than at individual callsites (which
> shouldn't be too difficult to implement).
>

I've started looking in that direction. First shot could just be
tracking kernel memory consumption w/o worrying about whether it is slab
or PT etc. Hopefully next patchset will have that support integrated.

> If anything I would perhaps even argue for further cutting down the
> number of hooks and add them back as they prove to be needed.
>

I think the current set of changes (and tracking of different
components) is necessary for memory handler to do the right thing. Plus
it is possible that user land management tools can also make use of this
information.

> I'm not sure about containers & workload management people, but from
> a core mm/ perspective I see no reason why this couldn't get in,
> given review and testing. Great!
>

That is great to know. Thanks. Hopefully it is getting enough coverage
to get there.

-rohit
Re: [patch00/05]: Containers(V2)- Introduction [message #6580 is a reply to message #6574] Wed, 20 September 2006 16:45 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Wed, 2006-09-20 at 15:06 +0200, Cedric Le Goater wrote:
> Rohit Seth wrote:
> >
> > This is based on lot of discussions over last month or so. I hope this
> > patch set is something that we can agree and more support can be added
> > on top of this. Please provide feedback and add other extensions that
> > are useful in the TODO list.
>
> thanks rohit for this patchset.
>
> I applied it on rc7-mm1, compiles smoothly and boots. Here's a edge
> issue, I got this oops while rmdir a container with running tasks.
>

Thanks for pointing this out. I will look into this and send the update.

-rohit
Re: [patch00/05]: Containers(V2)- Introduction [message #6581 is a reply to message #6523] Wed, 20 September 2006 16:48 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Wed, 20 Sep 2006, Nick Piggin wrote:

> > Right thats what cpusets do and it has been working fine for years. Maybe
> > Paul can help you if you find anything missing in the existing means to
> > control resources.
>
> What I like about Rohit's patches is the page tracking stuff which
> seems quite simple but capable.
>
> I suspect cpusets don't quite provide enough features for non-exclusive
> use of memory (eg. page tracking for directed reclaim).

Look at the VM statistics please. We have detailed page statistics per
zone these days. If there is anything missing then this would best be put
into general functionality. When I looked at it, I saw page statistics
that were replicating things that we already track per zone. All these
would become available if a container is realized via a node and we would
be using proven VM code.
Re: [patch00/05]: Containers(V2)- Introduction [message #6582 is a reply to message #6523] Wed, 20 September 2006 16:56 Go to previous messageGo to next message
Nick Piggin is currently offline  Nick Piggin
Messages: 35
Registered: March 2006
Member
Christoph Lameter wrote:
> On Wed, 20 Sep 2006, Nick Piggin wrote:
>
>
>>I'm not sure about containers & workload management people, but from
>>a core mm/ perspective I see no reason why this couldn't get in,
>>given review and testing. Great!
>
>
> Nack. We already have the ability to manage workloads. We may want to
> extend the existing functionality but this is duplicating what is already
> available through cpusets.

If it wasn't clear was talking specifically about the hooks for page
tracking rather than the whole patchset. If anybody wants such page
tracking infrastructure in the kernel, then this (as opposed to the
huge beancounters stuff) is what it should look like.

But as I said above, I don't know what the containers and workload
management people want exactly... The recent discussions about using
nodes and cpusets for memory workload management does seem like a
promising idea, and if it would avoid the need for this kind of
per-page tracking entirely, then that would probably be even better.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Re: [patch00/05]: Containers(V2)- Introduction [message #6583 is a reply to message #6523] Wed, 20 September 2006 17:00 Go to previous messageGo to next message
Nick Piggin is currently offline  Nick Piggin
Messages: 35
Registered: March 2006
Member
(this time to the lists as well)

Peter Zijlstra wrote:

> I'd much rather containterize the whole reclaim code, which should not
> be too hard since he already adds a container pointer to struct page.


Yes, and I tend to agree with you. I probably wasn't clear, but I was
mainly talking about just the memory resource tracking part of this
patchset.

I am less willing to make a judgement about reclaim, because I don't
know very much about the workloads or the guarantees they attempt to
provide.

> Esp. when we get some of my page reclaim abstractions merged, moving the
> reclaim from struct zone to a container is not a lot of work. (this is
> basically what one of the ckrm mm policies did too)


I do agree that it would be nicer to not have a completely different
scheme for doing their own page reclaim, but rather use the existing
code (*provided* that it is designed in the same, minimally intrusive
manner as the page tracking).

I can understand how it is attractive to create a new subsystem to
solve a particular problem, but once it is in the kernel it has to be
maintained regardless, so if it can be done in a way that shares more
of the current infrastructure (nicely) then that would be a better
solution.

I like that they're investigating the use of memory nodes for this.
It seems like the logical starting place.

> I still have to reread what Rohit does for file backed pages, that gave
> my head a spin.
> I've been thinking a bit on that problem, and it would be possible to
> share all address_space pages equally between attached containers, this
> would lose some accuracy, since one container could read 10% of the file
> and another 90%, but I don't think that is a common scenario.


Yeah, I'm not sure about that. I don't think really complex schemes
are needed... but again I might need more knowledge of their workloads
and problems.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Re: [patch00/05]: Containers(V2)- Introduction [message #6584 is a reply to message #6523] Wed, 20 September 2006 16:25 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Tue, 19 Sep 2006, Rohit Seth wrote:

> For example, a user can run a batch job like backup inside containers.
> This job if run unconstrained could step over most of the memory present
> in system thus impacting other workloads running on the system at that
> time. But when the same job is run inside containers then the backup
> job is run within container limits.

I just saw this for the first time since linux-mm was not cced. We have
discussed a similar mechanism on linux-mm.

We already have such a functionality in the kernel its called a cpuset. A
container could be created simply by creating a fake node that then
allows constraining applications to this node. We already track the
types of pages per node. The statistics you want are already existing.
See /proc/zoneinfo and /sys/devices/system/node/node*/*.

> We use the term container to indicate a structure against which we track
> and charge utilization of system resources like memory, tasks etc for a
> workload. Containers will allow system admins to customize the
> underlying platform for different applications based on their
> performance and HW resource utilization needs. Containers contain
> enough infrastructure to allow optimal resource utilization without
> bogging down rest of the kernel. A system admin should be able to
> create, manage and free containers easily.

Right thats what cpusets do and it has been working fine for years. Maybe
Paul can help you if you find anything missing in the existing means to
control resources.
Re: [patch00/05]: Containers(V2)- Introduction [message #6585 is a reply to message #6536] Wed, 20 September 2006 16:26 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Wed, 20 Sep 2006, Nick Piggin wrote:

> I'm not sure about containers & workload management people, but from
> a core mm/ perspective I see no reason why this couldn't get in,
> given review and testing. Great!

Nack. We already have the ability to manage workloads. We may want to
extend the existing functionality but this is duplicating what is already
available through cpusets.
Re: [patch00/05]: Containers(V2)- Introduction [message #6586 is a reply to message #6582] Wed, 20 September 2006 17:08 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Thu, 21 Sep 2006, Nick Piggin wrote:

> If it wasn't clear was talking specifically about the hooks for page
> tracking rather than the whole patchset. If anybody wants such page
> tracking infrastructure in the kernel, then this (as opposed to the
> huge beancounters stuff) is what it should look like.

Could you point to the patch and a description for what is meant here by
page tracking (did not see that in the patch, maybe I could not find it)?
If these are just statistics then we likely already have
them.
Re: [patch00/05]: Containers(V2)- Introduction [message #6587 is a reply to message #6584] Wed, 20 September 2006 17:10 Go to previous messageGo to next message
Alan Cox is currently offline  Alan Cox
Messages: 48
Registered: May 2006
Member
Ar Mer, 2006-09-20 am 09:25 -0700, ysgrifennodd Christoph Lameter:
> We already have such a functionality in the kernel its called a cpuset. A
> container could be created simply by creating a fake node that then
> allows constraining applications to this node. We already track the
> types of pages per node. The statistics you want are already existing.
> See /proc/zoneinfo and /sys/devices/system/node/node*/*.

CPUsets don't appear to scale to large numbers of containers (say 5000,
with 200-500 doing stuff at a time). They also don't appear to do any
tracking of kernel side resource objects, which is critical to
containment. Indeed for some of us the CPU management and user memory
management angle is mostly uninteresting.

I'm also not clear how you handle shared pages correctly under the fake
node system, can you perhaps explain that further how this works for say
a single apache/php/glibc shared page set across 5000 containers each a
web site.

Alan
Re: [patch00/05]: Containers(V2)- Introduction [message #6588 is a reply to message #6523] Wed, 20 September 2006 17:12 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Wed, 20 Sep 2006, Nick Piggin wrote:

> Look at what the patches do. These are not only for hard partitioning
> of memory per container but also those that share memory (eg. you might
> want each to share 100MB of memory, up to a max of 80MB for an individual
> container).

So far I have not been able to find the hooks to the VM. The sharing
would also work with nodes. Just create a couple of nodes with the sizes you
want and then put the node with the shared memory into the cpusets for the
apps sharing them.

> The nodes+cpusets stuff doesn't seem to help with that because you
> with that because you fundamentally need to track pages on a per
> container basis otherwise you don't know who's got what.

Hmmm... That gets into issues of knowing how many pages are in use by an
application and that is fundamentally difficult to do due to pages being
shared between processes.

> Now if, in practice, it turns out that nobody really needed these
> features then of course I would prefer the cpuset+nodes approach. My
> point is that I am not in a position to know who wants what, so I
> hope people will come out and discuss some of these issues.

I'd really be interested to have proper tracking of memory use for
processes but I am not sure what use the containers would be there.
Re: [patch00/05]: Containers(V2)- Introduction [message #6589 is a reply to message #6583] Wed, 20 September 2006 17:12 Go to previous messageGo to next message
Alan Cox is currently offline  Alan Cox
Messages: 48
Registered: May 2006
Member
Ar Iau, 2006-09-21 am 03:00 +1000, ysgrifennodd Nick Piggin:
> > I've been thinking a bit on that problem, and it would be possible to
> > share all address_space pages equally between attached containers, this
> > would lose some accuracy, since one container could read 10% of the file
> > and another 90%, but I don't think that is a common scenario.
>
>
> Yeah, I'm not sure about that. I don't think really complex schemes
> are needed... but again I might need more knowledge of their workloads
> and problems.

Any scenario which permits "cheating" will be a scenario that happens
because people will try and cheat.

Alan
Re: [patch00/05]: Containers(V2)- Introduction [message #6590 is a reply to message #6587] Wed, 20 September 2006 17:15 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Wed, 20 Sep 2006, Alan Cox wrote:

> Ar Mer, 2006-09-20 am 09:25 -0700, ysgrifennodd Christoph Lameter:
> > We already have such a functionality in the kernel its called a cpuset. A
> > container could be created simply by creating a fake node that then
> > allows constraining applications to this node. We already track the
> > types of pages per node. The statistics you want are already existing.
> > See /proc/zoneinfo and /sys/devices/system/node/node*/*.
>
> CPUsets don't appear to scale to large numbers of containers (say 5000,
> with 200-500 doing stuff at a time). They also don't appear to do any
> tracking of kernel side resource objects, which is critical to
> containment. Indeed for some of us the CPU management and user memory
> management angle is mostly uninteresting.

The scalability issues can certainly be managed. See the discussions on
linux-mm. Kernel side resource objects? slab pages? Those are tracked.

> I'm also not clear how you handle shared pages correctly under the fake
> node system, can you perhaps explain that further how this works for say
> a single apache/php/glibc shared page set across 5000 containers each a
> web site.

Cpusets can share nodes. I am not sure what the problem would be? Paul may
be able to give you more details.
Re: [patch00/05]: Containers(V2)- Introduction [message #6591 is a reply to message #6582] Wed, 20 September 2006 17:16 Go to previous messageGo to next message
Alan Cox is currently offline  Alan Cox
Messages: 48
Registered: May 2006
Member
Ar Iau, 2006-09-21 am 02:56 +1000, ysgrifennodd Nick Piggin:
> But as I said above, I don't know what the containers and workload
> management people want exactly... The recent discussions about using
> nodes and cpusets for memory workload management does seem like a
> promising idea, and if it would avoid the need for this kind of
> per-page tracking entirely, then that would probably be even better.

I think you can roughly break it down to

- I want one group of users not to be able to screw another group of
users or the box but don't care about anything else. The basic
beancounter stuff handles this. Generally they also want maximal
sharing.

- I want to charge people for portions of machine use (mostly accounting
and some fairness)

- I don't want any user to be able to hog the system to the harm of the
others but don't care about overcommit of idle resources (think about
the 5000 apaches on a box case)

- I want to be able to divide resources reasonably accurately all the
time between groups of users

Alan
Re: [patch00/05]: Containers(V2)- Introduction [message #6593 is a reply to message #6590] Wed, 20 September 2006 17:24 Go to previous messageGo to next message
Alan Cox is currently offline  Alan Cox
Messages: 48
Registered: May 2006
Member
Ar Mer, 2006-09-20 am 10:15 -0700, ysgrifennodd Christoph Lameter:
> The scalability issues can certainly be managed. See the discussions on
> linux-mm.

I'll take a look at a web archive of it, I don't follow -mm.

> Kernel side resource objects? slab pages? Those are tracked.

Slab pages isn't a useful tracking tool for two reasons. The first is
that some resources are genuinely a shared kernel managed pool and
should be treated that way - thats obviously easy to sort out.

The second is that slab pages are not the granularity of allocations so
it becomes possible (and deliberately influencable) to make someone else
allocate the pages all the time so you don't pay the cost. Hence the
beancounters track the real objects.

> Cpusets can share nodes. I am not sure what the problem would be? Paul may
> be able to give you more details.

If it can do it in a human understandable way, configured at runtime
with dynamic sharing, overcommit and reconfiguration of sizes then
great. Lets see what Paul has to say.

Alan
Re: [patch00/05]: Containers(V2)- Introduction [message #6594 is a reply to message #6584] Wed, 20 September 2006 17:26 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Wed, 2006-09-20 at 09:25 -0700, Christoph Lameter wrote:
> On Tue, 19 Sep 2006, Rohit Seth wrote:
>
> > For example, a user can run a batch job like backup inside containers.
> > This job if run unconstrained could step over most of the memory present
> > in system thus impacting other workloads running on the system at that
> > time. But when the same job is run inside containers then the backup
> > job is run within container limits.
>
> I just saw this for the first time since linux-mm was not cced. We have
> discussed a similar mechanism on linux-mm.
>
> We already have such a functionality in the kernel its called a cpuset. A
> container could be created simply by creating a fake node that then
> allows constraining applications to this node. We already track the
> types of pages per node. The statistics you want are already existing.
> See /proc/zoneinfo and /sys/devices/system/node/node*/*.
>
> > We use the term container to indicate a structure against which we track
> > and charge utilization of system resources like memory, tasks etc for a
> > workload. Containers will allow system admins to customize the
> > underlying platform for different applications based on their
> > performance and HW resource utilization needs. Containers contain
> > enough infrastructure to allow optimal resource utilization without
> > bogging down rest of the kernel. A system admin should be able to
> > create, manage and free containers easily.
>
> Right thats what cpusets do and it has been working fine for years. Maybe
> Paul can help you if you find anything missing in the existing means to
> control resources.

cpusets provides cpu and memory NODES binding to tasks. And I think it
works great for NUMA machines where you have different nodes with its
own set of CPUs and memory. The number of those nodes on a commodity HW
is still 1. And they can have 8-16 CPUs and in access of 100G of
memory. You may start using fake nodes (untested territory) to
translate a single node machine into N different nodes. But am not sure
if this number of nodes can change dynamically on the running machine or
a reboot is required to change the number of nodes.

Though when you want to have in access of 100 containers then the cpuset
function starts popping up on the oprofile chart very aggressively. And
this is the cost that shouldn't have to be paid (particularly) for a
single node machine.

And what happens when you want to have cpuset with memory that needs to
be even further fine grained than each node.

Containers also provide a mechanism to move files to containers. Any
further references to this file come from the same container rather than
the container which is bringing in a new page.

In future there will be more handlers like CPU and disk that can be
easily embeded into this container infrastructure.

-rohit
Re: [patch00/05]: Containers(V2)- Introduction [message #6596 is a reply to message #6589] Wed, 20 September 2006 17:30 Go to previous messageGo to next message
Nick Piggin is currently offline  Nick Piggin
Messages: 35
Registered: March 2006
Member
Alan Cox wrote:
> Ar Iau, 2006-09-21 am 03:00 +1000, ysgrifennodd Nick Piggin:
>
>> > I've been thinking a bit on that problem, and it would be possible to
>> > share all address_space pages equally between attached containers, this
>> > would lose some accuracy, since one container could read 10% of the file
>> > and another 90%, but I don't think that is a common scenario.
>>
>>
>>Yeah, I'm not sure about that. I don't think really complex schemes
>>are needed... but again I might need more knowledge of their workloads
>>and problems.
>
>
> Any scenario which permits "cheating" will be a scenario that happens
> because people will try and cheat.

That's true, and that's one reason why I've advocated the solution
implemented by Rohit's patches, that is: just throw in the towel and
be happy to count just pages.

Look at the beancounter stuff, and it has hooks (in the form of gfp
flags) throughput the tree, and they still manage to miss accounting
user exploitable memory overallocation from some callers. Maintaining
that will be much more difficult and error prone.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Re: [patch00/05]: Containers(V2)- Introduction [message #6597 is a reply to message #6523] Wed, 20 September 2006 17:30 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Thu, 21 Sep 2006, Nick Piggin wrote:

> Patch 2/5 in this series provides hooks, and they are pretty unintrusive.

Ok. We shadow existing vm counters add stuff to the adress_space
structure. The task add / remove is duplicating what some of the cpuset
hooks do. That clearly shows that we are just duplicating functionality.

The mapping things are new.
Re: [patch00/05]: Containers(V2)- Introduction [message #6598 is a reply to message #6593] Wed, 20 September 2006 17:35 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Wed, 20 Sep 2006, Alan Cox wrote:

> If it can do it in a human understandable way, configured at runtime
> with dynamic sharing, overcommit and reconfiguration of sizes then
> great. Lets see what Paul has to say.

You have full VM support this means overcommit and every else you are used
ot.
Re: [patch00/05]: Containers(V2)- Introduction [message #6599 is a reply to message #6594] Wed, 20 September 2006 17:38 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Wed, 20 Sep 2006, Rohit Seth wrote:

> cpusets provides cpu and memory NODES binding to tasks. And I think it
> works great for NUMA machines where you have different nodes with its
> own set of CPUs and memory. The number of those nodes on a commodity HW
> is still 1. And they can have 8-16 CPUs and in access of 100G of
> memory. You may start using fake nodes (untested territory) to

See linux-mm. We just went through a series of tests and functionality
wise it worked just fine.

> translate a single node machine into N different nodes. But am not sure
> if this number of nodes can change dynamically on the running machine or
> a reboot is required to change the number of nodes.

This is commonly discussed under the subject of memory hotplug.

> Though when you want to have in access of 100 containers then the cpuset
> function starts popping up on the oprofile chart very aggressively. And
> this is the cost that shouldn't have to be paid (particularly) for a
> single node machine.

Yes this is a new way of using cpusets but it works and we could fix the
scalability issues rather than adding new subsystems.

> And what happens when you want to have cpuset with memory that needs to
> be even further fine grained than each node.

New node?

> Containers also provide a mechanism to move files to containers. Any
> further references to this file come from the same container rather than
> the container which is bringing in a new page.

Hmmmm... Thats is interesting.

> In future there will be more handlers like CPU and disk that can be
> easily embeded into this container infrastructure.

I think we should have one container mechanism instead of multiple. Maybe
merge the two? The cpuset functionality is well established and working
right.
Re: [patch00/05]: Containers(V2)- Introduction [message #6601 is a reply to message #6583] Wed, 20 September 2006 17:50 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Thu, 2006-09-21 at 03:00 +1000, Nick Piggin wrote:
> (this time to the lists as well)
>
> Peter Zijlstra wrote:
>
> > I'd much rather containterize the whole reclaim code, which should not
> > be too hard since he already adds a container pointer to struct page.
>
>

Right now the memory handler in this container subsystem is written in
such a way that when existing kernel reclaimer kicks in, it will first
operate on those (container with pages over the limit) pages first. But
in general I like the notion of containerizing the whole reclaim code.

> > I still have to reread what Rohit does for file backed pages, that gave
> > my head a spin.

Please let me know if there is any specific part that isn't making much
sense.

-rohit
Re: [patch00/05]: Containers(V2)- Introduction [message #6602 is a reply to message #6601] Wed, 20 September 2006 17:52 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Wed, 20 Sep 2006, Rohit Seth wrote:

> Right now the memory handler in this container subsystem is written in
> such a way that when existing kernel reclaimer kicks in, it will first
> operate on those (container with pages over the limit) pages first. But
> in general I like the notion of containerizing the whole reclaim code.

Which comes naturally with cpusets.
Re: [patch00/05]: Containers(V2)- Introduction [message #6603 is a reply to message #6597] Wed, 20 September 2006 18:03 Go to previous messageGo to next message
Nick Piggin is currently offline  Nick Piggin
Messages: 35
Registered: March 2006
Member
Christoph Lameter wrote:
> On Thu, 21 Sep 2006, Nick Piggin wrote:
>
>
>>Patch 2/5 in this series provides hooks, and they are pretty unintrusive.
>
>
> Ok. We shadow existing vm counters add stuff to the adress_space
> structure. The task add / remove is duplicating what some of the cpuset
> hooks do. That clearly shows that we are just duplicating functionality.

I don't think so. To start with, the point about containers is they are
not per address_space.

But secondly, these are hooks from the container subsystem into the mm
subsystem. As such, they might do something a bit more or different
than simple statistics, and we don't want to teach the core mm/ about
what that might be. You also want to be able to configure them out
entirely.

I think it is fine to add some new hooks in fundamental (ie mm agnostic)
points. Without getting to the fine details about exactly how the hooks
are implemented, or what information needs to be tracked, I think we can
say that they are not much burden for mm/ to bear (if they turn out to
be usable).

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Re: [patch00/05]: Containers(V2)- Introduction [message #6604 is a reply to message #6599] Wed, 20 September 2006 18:07 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Wed, 2006-09-20 at 10:38 -0700, Christoph Lameter wrote:
> On Wed, 20 Sep 2006, Rohit Seth wrote:
>
> > cpusets provides cpu and memory NODES binding to tasks. And I think it
> > works great for NUMA machines where you have different nodes with its
> > own set of CPUs and memory. The number of those nodes on a commodity HW
> > is still 1. And they can have 8-16 CPUs and in access of 100G of
> > memory. You may start using fake nodes (untested territory) to
>
> See linux-mm. We just went through a series of tests and functionality
> wise it worked just fine.
>

I thought the fake NUMA support still does not work on x86_64 baseline
kernel. Though Paul and Andrew have patches to make it work.

> > translate a single node machine into N different nodes. But am not sure
> > if this number of nodes can change dynamically on the running machine or
> > a reboot is required to change the number of nodes.
>
> This is commonly discussed under the subject of memory hotplug.
>

So now we depend on getting memory hot-plug to work for faking up these
nodes ...for the memory that is already present in the system. It just
does not sound logical.

> > Though when you want to have in access of 100 containers then the cpuset
> > function starts popping up on the oprofile chart very aggressively. And
> > this is the cost that shouldn't have to be paid (particularly) for a
> > single node machine.
>
> Yes this is a new way of using cpusets but it works and we could fix the
> scalability issues rather than adding new subsystems.
>

I think when you have 100's of zones then cost of allocating a page will
include checking cpuset validation and different zone list traversals
and checks...unless there is major surgery.

> > And what happens when you want to have cpuset with memory that needs to
> > be even further fine grained than each node.
>
> New node?
>

Am not clear how is this possible. Could you or Paul please explain.

> > Containers also provide a mechanism to move files to containers. Any
> > further references to this file come from the same container rather than
> > the container which is bringing in a new page.
>
> Hmmmm... Thats is interesting.
>
> > In future there will be more handlers like CPU and disk that can be
> > easily embeded into this container infrastructure.
>
> I think we should have one container mechanism instead of multiple. Maybe
> merge the two? The cpuset functionality is well established and working
> right.
>

I agree that we will need one container subsystem in the long run.
Something that can easily adapt to different configurations.

-rohit
Re: [patch00/05]: Containers(V2)- Introduction [message #6607 is a reply to message #6523] Wed, 20 September 2006 18:14 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Wed, 2006-09-20 at 20:06 +0200, Peter Zijlstra wrote:
> On Wed, 2006-09-20 at 10:52 -0700, Christoph Lameter wrote:
> > On Wed, 20 Sep 2006, Rohit Seth wrote:
> >
> > > Right now the memory handler in this container subsystem is written in
> > > such a way that when existing kernel reclaimer kicks in, it will first
> > > operate on those (container with pages over the limit) pages first. But
> > > in general I like the notion of containerizing the whole reclaim code.
> >
> > Which comes naturally with cpusets.
>
> How are shared mappings dealt with, are pages charged to the set that
> first faults them in?
>

For anonymous pages (simpler case), they get charged to the faulting
task's container.

For filesystem pages (could be shared across tasks running different
containers): Every time a new file mapping is created, it is bound to a
container of the process creating that mapping. All subsequent pages
belonging to this mapping will belong to this container, irrespective of
different tasks running in different containers accessing these pages.
Currently, I've not implemented a mechanism to allow a file to be
specifically moved into or out of container. But when that gets
implemented then all pages belonging to a mapping will also move out of
container (or into a new container).

-rohit
Re: [patch00/05]: Containers(V2)- Introduction [message #6608 is a reply to message #6586] Wed, 20 September 2006 17:19 Go to previous messageGo to next message
Nick Piggin is currently offline  Nick Piggin
Messages: 35
Registered: March 2006
Member
Christoph Lameter wrote:
> On Thu, 21 Sep 2006, Nick Piggin wrote:
>
>
>>If it wasn't clear was talking specifically about the hooks for page
>>tracking rather than the whole patchset. If anybody wants such page
>>tracking infrastructure in the kernel, then this (as opposed to the
>>huge beancounters stuff) is what it should look like.
>
>
> Could you point to the patch and a description for what is meant here by
> page tracking (did not see that in the patch, maybe I could not find it)?
> If these are just statistics then we likely already have
> them.
>

Patch 2/5 in this series provides hooks, and they are pretty unintrusive.

mm/filemap.c | 4 ++++
mm/page_alloc.c | 3 +++
mm/rmap.c | 8 +++++++-
mm/swap.c | 1 +
mm/vmscan.c | 1 +

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction [message #6612 is a reply to message #6584] Wed, 20 September 2006 18:34 Go to previous messageGo to next message
Chandra Seetharaman is currently offline  Chandra Seetharaman
Messages: 88
Registered: August 2006
Member
On Wed, 2006-09-20 at 09:25 -0700, Christoph Lameter wrote:
> On Tue, 19 Sep 2006, Rohit Seth wrote:
>
> > For example, a user can run a batch job like backup inside containers.
> > This job if run unconstrained could step over most of the memory present
> > in system thus impacting other workloads running on the system at that
> > time. But when the same job is run inside containers then the backup
> > job is run within container limits.
>
> I just saw this for the first time since linux-mm was not cced. We have
> discussed a similar mechanism on linux-mm.
>
> We already have such a functionality in the kernel its called a cpuset. A

Christoph,

There had been multiple discussions in the past (as recent as Aug 18,
2006), where we (Paul and CKRM/RG folks) have concluded that cpuset and
resource management are orthogonal features.

cpuset provides "resource isolation", and what we, the resource
management guys want is work-conserving resource control.

cpuset partitions resource and hence the resource that are assigned to a
node is not available for other cpuset, which is not good for "resource
management".

chandra
PS:
Aug 18 link: http://marc.theaimsgroup.com/?l=linux-
kernel&m=115593114408336&w=2

Feb 2005 thread: http://marc.theaimsgroup.com/?l=ckrm-
tech&m=110790400330617&w=2
> container could be created simply by creating a fake node that then
> allows constraining applications to this node. We already track the
> types of pages per node. The statistics you want are already existing.
> See /proc/zoneinfo and /sys/devices/system/node/node*/*.
>
> > We use the term container to indicate a structure against which we track
> > and charge utilization of system resources like memory, tasks etc for a
> > workload. Containers will allow system admins to customize the
> > underlying platform for different applications based on their
> > performance and HW resource utilization needs. Containers contain
> > enough infrastructure to allow optimal resource utilization without
> > bogging down rest of the kernel. A system admin should be able to
> > create, manage and free containers easily.
>
> Right thats what cpusets do and it has been working fine for years. Maybe
> Paul can help you if you find anything missing in the existing means to
> control resources.
>
> ------------------------------------------------------------ -------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys -- and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourc eforge&CID=DEVDEV
> _______________________________________________
> ckrm-tech mailing list
> https://lists.sourceforge.net/lists/listinfo/ckrm-tech
--

------------------------------------------------------------ ----------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
------------------------------------------------------------ ----------
Re: [patch00/05]: Containers(V2)- Introduction [message #6613 is a reply to message #6523] Wed, 20 September 2006 18:38 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Wed, 2006-09-20 at 20:27 +0200, Peter Zijlstra wrote:

> Yes, I read that in your patches, I was wondering how the cpuset
> approach would handle this.
>
> Neither are really satisfactory for shared mappings.
>

In which way? We could have the per container flag indicating whether
to charge this container for shared mapping that it initiates or to the
container where mapping belongs...or is there something different that
you are referring.

-rohit
Re: [patch00/05]: Containers(V2)- Introduction [message #6615 is a reply to message #6523] Wed, 20 September 2006 18:57 Go to previous messageGo to next message
Rohit Seth is currently offline  Rohit Seth
Messages: 101
Registered: August 2006
Senior Member
On Wed, 2006-09-20 at 20:37 +0200, Peter Zijlstra wrote:
> On Wed, 2006-09-20 at 10:50 -0700, Rohit Seth wrote:
> > On Thu, 2006-09-21 at 03:00 +1000, Nick Piggin wrote:
> > > (this time to the lists as well)
> > >
> > > Peter Zijlstra wrote:
> > >
> > > > I'd much rather containterize the whole reclaim code, which should not
> > > > be too hard since he already adds a container pointer to struct page.
> > >
> > >
> >
> > Right now the memory handler in this container subsystem is written in
> > such a way that when existing kernel reclaimer kicks in, it will first
> > operate on those (container with pages over the limit) pages first. But
> > in general I like the notion of containerizing the whole reclaim code.
>
> Patch 5/5 seems to have a horrid deactivation scheme.
>
> > > > I still have to reread what Rohit does for file backed pages, that gave
> > > > my head a spin.
> >
> > Please let me know if there is any specific part that isn't making much
> > sense.
>
> Well, the whole over the limit handler is quite painfull, having taken a
> second reading it isn't all that complex after all, just odd.
>

It is very basic right now.

> You just start invalidating whole files for file backed pages. Granted,
> this will get you below the threshold. but you might just have destroyed
> your working set.
>

When a container gone over the limit then it is okay to penalize it. I
agree that I'm not making an attempt to maintain the current working
set. Any suggestions that I can incorporate to improve this algorithm
will be very appreciated.

> Pretty much the same for you anonymous memory handler, you scan through
> the pages in linear fashion and demote the first that you encounter.
>
> Both things pretty thoroughly destroy the existing kernel reclaim.
>

I agree that with in a container I need to do add more smarts to (for
example) not do a linear search. Simple additions like last task or
last mapping visited could be useful. And I definitely want to improve
on that.

Though it should not destroy the existing kernel reclaim. Pages
belonging to over the limit container should be the first ones to either
get flushed out to FS or swapped if necessary. (Means that is the cost
that you will have to pay if you, for example, want to container your
tar to 100MB memory foot print).

-rohit
Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction [message #6616 is a reply to message #6584] Wed, 20 September 2006 19:09 Go to previous messageGo to next message
Chandra Seetharaman is currently offline  Chandra Seetharaman
Messages: 88
Registered: August 2006
Member
On Wed, 2006-09-20 at 09:25 -0700, Christoph Lameter wrote:

For some reason the email i sent about 30 mins back didn't make it...
her is a resend.

> On Tue, 19 Sep 2006, Rohit Seth wrote:
>
> > For example, a user can run a batch job like backup inside containers.
> > This job if run unconstrained could step over most of the memory present
> > in system thus impacting other workloads running on the system at that
> > time. But when the same job is run inside containers then the backup
> > job is run within container limits.
>
> I just saw this for the first time since linux-mm was not cced. We have
> discussed a similar mechanism on linux-mm.
>
> We already have such a functionality in the kernel its called a cpuset. A

Christoph,

There had been multiple discussions in the past (as recent as Aug 18,
2006), where we (Paul and CKRM/RG folks) have concluded that cpuset and
resource management are orthogonal features.

cpuset provides "resource isolation", and what we, the resource
management guys want is work-conserving resource control.

cpuset partitions resource and hence the resource that are assigned to a
node is not available for other cpuset, which is not good for "resource
management".

chandra
PS:
Aug 18 link: http://marc.theaimsgroup.com/?l=linux-
kernel&m=115593114408336&w=2

Feb 2005 thread: http://marc.theaimsgroup.com/?l=ckrm-
tech&m=110790400330617&w=2

> container could be created simply by creating a fake node that then
> allows constraining applications to this node. We already track the
> types of pages per node. The statistics you want are already existing.
> See /proc/zoneinfo and /sys/devices/system/node/node*/*.
>
> > We use the term container to indicate a structure against which we track
> > and charge utilization of system resources like memory, tasks etc for a
> > workload. Containers will allow system admins to customize the
> > underlying platform for different applications based on their
> > performance and HW resource utilization needs. Containers contain
> > enough infrastructure to allow optimal resource utilization without
> > bogging down rest of the kernel. A system admin should be able to
> > create, manage and free containers easily.
>
> Right thats what cpusets do and it has been working fine for years. Maybe
> Paul can help you if you find anything missing in the existing means to
> control resources.
>
> ------------------------------------------------------------ -------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys -- and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourc eforge&CID=DEVDEV
> _______________________________________________
> ckrm-tech mailing list
> https://lists.sourceforge.net/lists/listinfo/ckrm-tech
--

------------------------------------------------------------ ----------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
------------------------------------------------------------ ----------
Re: [patch00/05]: Containers(V2)- Introduction [message #6619 is a reply to message #6523] Wed, 20 September 2006 19:48 Go to previous messageGo to next message
Paul Jackson is currently offline  Paul Jackson
Messages: 157
Registered: February 2006
Senior Member
Peter wrote:
> > Which comes naturally with cpusets.
>
> How are shared mappings dealt with, are pages charged to the set that
> first faults them in?

Cpusets does not attempt to manage how much memory a task can allocate,
but where it can allocate it. If a task can find an existing page to
share, and avoid the allocation, then it entirely avoids dealing with
cpusets in that case.

Cpusets pays no attention to how often a page is shared. It controls
which tasks can allocate a given free page, based on the node on which
that page resides. If that node is allowed in a tasks 'nodemask_t
mems_allowed' (a task struct field), then the task can allocate
that page, so far as cpusets is concerned.

Cpusets does not care who links to a page, once it is allocated.

Every page is assigned to one specific node, and may only be allocated
by tasks allowed to allocate from that node.

These cpusets can overlap - which so far as memory goes, roughly means
that the various mems_allowed nodemask_t's of different tasks can overlap.

Here's an oddball example configuration that might make this easier to
think about.

Let's say we have a modest sized NUMA system with an extra bank
of memory added, in addition to the per-node memory. Let's say
the extra bank is a huge pile of cheaper (slower) memory, off a
slower bus.

Normal sized tasks running on one or more of the NUMA nodes just
get to fight for the CPUs and memory on those nodes allowed them.

Let's say an occassional big memory job is to be allowed to use
some of the extra cheap memory, and we use the idea of Andrew
and others to split that memory into fake nodes to manage the
portion of memory available to specified tasks.

Then one of these big jobs could be in a cpuset that let it use
one or more of the CPUs and memory on the node it ran on, plus
some number of the fake nodes on the extra cheap memory.

Other jobs could be allowed, using cpusets, to use any combination
of the same or overlapping CPUs or nodes, and/or other disjoint
CPUs or nodes, fake or real.

Another example, restating some of the above.

If say some application happened to fault in a libc.so page,
it would be required to place that page on one of the nodes
allowed to it. If an other application comes along later and
ends up wanting shared references to that same page, it could
certainly do so, regardless of its cpuset settings. It would
not be allocating a new page for this, so would not encounter
the cpuset constraints on where it could allocate such a page.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
Re: [patch00/05]: Containers(V2)- Introduction [message #6620 is a reply to message #6523] Wed, 20 September 2006 19:48 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Wed, 20 Sep 2006, Peter Zijlstra wrote:

> > Which comes naturally with cpusets.
>
> How are shared mappings dealt with, are pages charged to the set that
> first faults them in?

They are charged to the node from which they were allocated. If the
process is restricted to the node (container) then all pages allocated
are are charged to the container regardless if they are shared or not.
Re: [patch00/05]: Containers(V2)- Introduction [message #6621 is a reply to message #6604] Wed, 20 September 2006 19:51 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Wed, 20 Sep 2006, Rohit Seth wrote:

> I thought the fake NUMA support still does not work on x86_64 baseline
> kernel. Though Paul and Andrew have patches to make it work.

Read linux-mm. There is work in progress.

> > This is commonly discussed under the subject of memory hotplug.
> So now we depend on getting memory hot-plug to work for faking up these
> nodes ...for the memory that is already present in the system. It just
> does not sound logical.

It is logical since nodes are containers of memory today and we have
established VM functionality to deal with these containers. If you read
the latest linx-mm then you will find that this was tested and works fine.
Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction [message #6622 is a reply to message #6612] Wed, 20 September 2006 19:52 Go to previous messageGo to next message
Christoph Lameter is currently offline  Christoph Lameter
Messages: 123
Registered: September 2006
Senior Member
On Wed, 20 Sep 2006, Chandra Seetharaman wrote:

> cpuset partitions resource and hence the resource that are assigned to a
> node is not available for other cpuset, which is not good for "resource
> management".

cpusets can have one node in multiple cpusets.
Re: [patch00/05]: Containers(V2)- Introduction [message #6624 is a reply to message #6604] Wed, 20 September 2006 20:06 Go to previous messageGo to next message
Paul Jackson is currently offline  Paul Jackson
Messages: 157
Registered: February 2006
Senior Member
Seth wrote:
> I thought the fake NUMA support still does not work on x86_64 baseline
> kernel. Though Paul and Andrew have patches to make it work.

It works. Having long zonelists where one expects to have to scan a
long way down the list has a performance glitch, in the
get_page_from_freelist() code sucks. We don't want to be doing a
linear scan of a long list on this code path.

The cpuset_zone_allowed() routine happens to be the most obvious canary
in this linear scan loop (google 'canary in the mine shaft' for the
idiom), so shows up the problem first.

We don't have patches yet to fix this (well, we might, I still haven't
digested the last couple days worth of postings.) But we are persuing
Andrew's suggestion to cache the zone that we found memory on last time
around, so as to dramatically reduce the chance we have to rescan the
entire dang zonelist every time through this code.

Initially these zonelists had been designed to handle the various
kinds of dma, main and upper memory on common PC architectures, then
they were (ab)used to handle multiple Non-Uniform Memory Nodes (NUMA)
on bigger boxen. So it is not entirely surprising that we hit a
performance speed bump when further (ab)using them to handle multiple
Uniform sub-nodes as part of a memory containerization effort. Each
different kind of use hits these algorithms and data structures
differently.

It seems pretty clear by now that we will be able to pave over this
speed bump without doing any major reconstruction.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
Re: [patch00/05]: Containers(V2)- Introduction [message #6630 is a reply to message #6588] Wed, 20 September 2006 22:27 Go to previous messageGo to next message
Paul Jackson is currently offline  Paul Jackson
Messages: 157
Registered: February 2006
Senior Member
Christoph, responding to Nick:
> > Look at what the patches do. These are not only for hard partitioning
> > of memory per container but also those that share memory (eg. you might
> > want each to share 100MB of memory, up to a max of 80MB for an individual
> > container).
>
> So far I have not been able to find the hooks to the VM. The sharing
> would also work with nodes. Just create a couple of nodes with the sizes you
> want and then put the node with the shared memory into the cpusets for the
> apps sharing them.

Cpusets certainly allows for sharing - in the sense that multiple
tasks can be each be allowed to allocate from the same node (fake
or real.)

However, this is not sharing quite in the sense that Nick describes it.

In cpuset sharing, it is predetermined which pages are allowed to be
allocated by which tasks. Not "how many" pages, but "just which" pages.

Let's say we carve this 100 MB's up into 5 cpusets, of 20 MBs each, and
allow each of our many tasks to allocate from some specified 4 of these
5 cpusets. Then, even if some of those 100 MB's were still free, and
if a task was well below its allowed 80 MB's, the task might still not
be able to use that free memory, if that free memory happened to be in
whatever was the 5th cpuset that it was not allowed to use.

Seth:

Could your container proposal handle the above example, and let that
task have some of that memory, up to 80 MB's if available, but not
more, regardless of what node the free memory was on?

I presume so.


Another example that highlights this difference - airline overbooking.
If an airline has to preassign every seat, it can't overbook, short of
putting two passengers in the same seat and hoping one is a no show,
which is pretty cut throat. If an airline is willing to bet that
seldom more than 90% of the ticketed passengers will show up, and it
doesn't preassign all seats, it can wait until flight time, see who
shows up, and hand out the seats then. It can preassign some seats,
but it needs some passengers showing up unassigned, free to take what's
left over.

Cpusets preassigns which nodes are allowed a task. If not all the
pages on a node are allocated by one of the tasks it is preassigned to,
those pages "fly empty" -- remain unallocated. This happens regardless
of how overbooked is the memory on other nodes.

If you just want to avoid fisticuffs at the gate between overbooked
passengers, cpusets are enough. If you further want to maximize utilization,
then you need the capacity management of resource groups, or some such.


> > The nodes+cpusets stuff doesn't seem to help with that because you
> > with that because you fundamentally need to track pages on a per
> > container basis otherwise you don't know who's got what.
>
> Hmmm... That gets into issues of knowing how many pages are in use by an
> application and that is fundamentally difficult to do due to pages being
> shared between processes.

Fundamentally difficult or not, it seems to be required for what Nick
describes, and for sure cpusets doesn't do it (track memory usage per
container.)

> > Now if, in practice, it turns out that nobody really needed these
> > features then of course I would prefer the cpuset+nodes approach. My
> > point is that I am not in a position to know who wants what, so I
> > hope people will come out and discuss some of these issues.

I don't know either ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
Re: [patch00/05]: Containers(V2)- Introduction [message #6631 is a reply to message #6594] Wed, 20 September 2006 22:51 Go to previous messageGo to next message
Paul Jackson is currently offline  Paul Jackson
Messages: 157
Registered: February 2006
Senior Member
Seth wrote:
> But am not sure
> if this number of nodes can change dynamically on the running machine or
> a reboot is required to change the number of nodes.

The current numa=fake=N kernel command line option is just boottime,
and just x86_64.

I presume we'd have to remove these two constraints for this to be
generally usable to containerize memory.

We also, in my current opinion, need to fix up the node_distance
between such fake numa sibling nodes, to correctly reflect that they
are on the same real node (LOCAL_DISTANCE).

And some non-trivial, arch-specific, zonelist sorting and reconstruction
work will be needed.

And an API devised for the above mentioned dynamic changing.

And this will push on the memory hotplug/unplug technology.

All in all, it could avoid anything more than trivial changes to the
existing memory allocation code hot paths. But the infrastructure
needed for managing this mechanism needs some non-trivial work.


> Though when you want to have in access of 100 containers then the cpuset
> function starts popping up on the oprofile chart very aggressively.

As the linux-mm discussion last weekend examined in detail, we can
eliminate this performance speed bump, probably by caching the
last zone on which we found some memory. The linear search that was
implicit in __alloc_pages()'s use of zonelists for many years finally
become explicit with this new usage pattern.


> Containers also provide a mechanism to move files to containers. Any
> further references to this file come from the same container rather than
> the container which is bringing in a new page.

I haven't read these patches enough to quite make sense of this, but I
suspect that this is not a distinction between cpusets and these
containers, for the basic reason that cpusets doesn't need to 'move'
a file's references because it has no clue what such are.


> In future there will be more handlers like CPU and disk that can be
> easily embeded into this container infrastructure.

This may be a deciding point.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
Re: [patch00/05]: Containers(V2)- Introduction [message #6632 is a reply to message #6604] Wed, 20 September 2006 22:58 Go to previous messageGo to previous message
Paul Jackson is currently offline  Paul Jackson
Messages: 157
Registered: February 2006
Senior Member
Seth wrote:
> So now we depend on getting memory hot-plug to work for faking up these
> nodes ...for the memory that is already present in the system. It just
> does not sound logical.

It's logical to me. Part of memory hotplug is adding physial memory,
which is not an issue here. Part of it is adding another logical
memory node (turning on another bit in node_online_map) and fixing up
any code that thought a systems memory nodes were baked in at boottime.
Perhaps the hardest part is the memory hot-un-plug, which would become
more urgently needed with such use of fake numa nodes. The assumption
that memory doesn't just up and vanish is non-trivial to remove from
the kernel. A useful memory containerization should (IMHO) allow for
both adding and removing such containers.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
Previous Topic: Re: Re: namespace and nsproxy syscalls
Next Topic: [Lxc-devel] [patch 093/203] pidspace: is_init()
Goto Forum:
  


Current Time: Sat Nov 02 00:10:21 GMT 2024

Total time taken to generate the page: 0.03313 seconds