[patch00/05]: Containers(V2)- Introduction [message #6523]
Wed, 20 September 2006 02:16
Rohit Seth
Containers:
Commodity HW is becoming more powerful. This gives us the opportunity to
run different workloads on the same platform for better HW resource
utilization. To run different workloads efficiently on the same
platform, it is critical that we have a notion of limits for each
workload in the Linux kernel. The current cpuset feature in the Linux kernel
provides grouping of CPU and memory support to some extent (for NUMA
machines).
For example, a user can run a batch job like backup inside containers.
This job if run unconstrained could step over most of the memory present
in system thus impacting other workloads running on the system at that
time. But when the same job is run inside containers then the backup
job is run within container limits.
We use the term container to indicate a structure against which we track
and charge utilization of system resources like memory, tasks etc for a
workload. Containers will allow system admins to customize the
underlying platform for different applications based on their
performance and HW resource utilization needs. Containers contain
enough infrastructure to allow optimal resource utilization without
bogging down rest of the kernel. A system admin should be able to
create, manage and free containers easily.
At the same time, changes in the kernel are minimized so that this support can
be easily integrated with the mainline kernel.
The user interface for containers is through configfs. Appropriate file
system privileges are required to do operations on each container.
Currently implemented container resources are automatically visible to
user space through /configfs/container/<container_name> after a
container is created.
Signed-off-by: Rohit Seth <rohitseth@google.com>
Diffstat for the patch set (against linux-2.6.18-rc6-mm2):
 Documentation/containers.txt |   65 ++++
 fs/inode.c                   |    3
 include/linux/container.h    |  167 ++++++++++
 include/linux/fs.h           |    5
 include/linux/mm_inline.h    |    4
 include/linux/mm_types.h     |    4
 include/linux/sched.h        |    6
 init/Kconfig                 |    8
 kernel/Makefile              |    1
 kernel/container_configfs.c  |  440 ++++++++++++++++++++++++++++
 kernel/exit.c                |    2
 kernel/fork.c                |    9
 mm/Makefile                  |    2
 mm/container.c               |  658 +++++++++++++++++++++++++++++++++++++++++++
 mm/container_mm.c            |  512 +++++++++++++++++++++++++++++++++
 mm/filemap.c                 |    4
 mm/page_alloc.c              |    3
 mm/rmap.c                    |    8
 mm/swap.c                    |    1
 mm/vmscan.c                  |    1
 20 files changed, 1902 insertions(+), 1 deletion(-)
Changes from version 1:
- Fixed the Documentation error
- Fixed the corruption in container task list
- Added the support for showing all the tasks belonging to a container
through showtask attribute
- Moved the Kconfig changes to init directory (from mm)
- Fixed the bug of unregistering container subsystem if we are not able to
create workqueue
- Better support for handling limits for file pages. This now includes
support for flushing and invalidating page cache pages.
- Minor other changes.
*****************************************************************
This patch set has basic container support that includes:
- Create a container using mkdir command in configfs
- Free a container using rmdir command
- Dynamically adjust memory and task limits for container.
- Add/Remove a task to container (given a pid)
- Files are currently added as part of open from a task that already
belongs to a container.
- Keep track of active, anonymous, mapped and pagecache usage of
container memory
- Does not allow more than task_limit number of tasks to be created in
the container.
- Over the limit memory handler is called when the number of pages (anon +
pagecache) exceeds the limit. It is also called when the number of active
pages exceeds the page limit. Currently, this memory handler scans the
mappings and tasks belonging to the container (file and anonymous) and tries
to deactivate pages. If the number of page cache pages is also high
then it also invalidates mappings. The thought behind this scheme is that it
is okay for containers to go over the limit as long as they run in a degraded
manner while they are over their limit. Also, if there is any memory
pressure then pages belonging to over the limit container(s) become
prime candidates for the kernel reclaimer. The container mutex is held
while this handler is working its way through, to prevent any
further addition of resources (like tasks or mappings) to this
container, though it also blocks removal of the same resources from the
container for the same time. It is possible that the over the limit page
handler takes a lot of time if memory pressure on a container is
continuously very high. The limits, like how long a task should
schedule out when it hits the memory limit, are also on the lower side at
present (particularly when it is a memory hog), but should be easy to
change if need be.
- Indicate the number of times the page limit and task limit are hit
- Indicate the tasks (pids) belonging to container.
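The bookkeeping in the list above can be sketched as a small user-space
model. This is not code from the patch set: the class, method and field
names (Container, charge, page_limit_hits and so on) are hypothetical and
only mirror the description. The task limit is treated as hard, while the
page limit may be exceeded and only triggers the over the limit handler,
which is left as a placeholder here. The active-page trigger is not modeled.

```python
class Container:
    """Toy user-space model of the accounting described above.

    All names are hypothetical; this is not code from the patch set.
    """

    def __init__(self, name, page_limit, task_limit):
        self.name = name
        self.page_limit = page_limit
        self.task_limit = task_limit
        self.anon_pages = 0
        self.pagecache_pages = 0
        self.tasks = set()
        self.page_limit_hits = 0   # times the page limit was hit
        self.task_limit_hits = 0   # times the task limit was hit

    def add_task(self, pid):
        # No more than task_limit tasks are allowed in the container.
        if len(self.tasks) >= self.task_limit:
            self.task_limit_hits += 1
            return False
        self.tasks.add(pid)
        return True

    def remove_task(self, pid):
        self.tasks.discard(pid)

    def charge(self, anon=0, pagecache=0):
        # Going over the page limit is allowed; it only invokes the handler.
        self.anon_pages += anon
        self.pagecache_pages += pagecache
        if self.anon_pages + self.pagecache_pages > self.page_limit:
            self.page_limit_hits += 1
            self.over_limit_handler()

    def over_limit_handler(self):
        # Placeholder: the described handler deactivates pages and may
        # invalidate mappings; none of that is modeled here.
        pass


backup = Container("backup", page_limit=100, task_limit=2)
backup.add_task(101)
backup.add_task(102)
backup.add_task(103)          # refused, counts as a task limit hit
backup.charge(anon=60)        # 60 pages, under the limit
backup.charge(pagecache=50)   # 110 pages, over the limit
```

In this model the third task is refused, while the second charge is
accepted and merely recorded as a page limit hit, which matches the "okay
to go over the limit, but run degraded" idea in the description.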
Below is a one line description for patches that will follow:
[patch01]: Documentation on how to use containers
(Documentation/containers.txt)
[patch02]: Changes in the generic part of kernel code
[patch03]: Container's interface with configfs
[patch04]: Core container support
[patch05]: Over the limit memory handler.
TODO:
- Some code (like container_add_task) in mm/container.c should go
elsewhere.
- Support adding/removing a file name to container through configfs
- /proc/pid/container to show the container id (or name)
- More testing for memory controller. Currently it is possible that
limits are exceeded. See if a call to reclaim can be easily integrated.
- Kernel memory tracking (based on patches from BC)
- Limit on user locked memory
- Huge memory support
- Stress testing with containers
- One shot view of all containers
- CKRM folks are interested in seeing all processes belonging to a
container. Add the attribute show_tasks to container.
- Add logic so that the sum of limits does not exceed appropriate
system requirements.
- Extend it with other controllers (CPU and Disk I/O)
- Add flags bits for supporting different actions (like in some cases
provide a hard memory limit and in some cases it could be soft).
- Capability to kill processes for the extreme cases.
...
This is based on a lot of discussions over the last month or so. I hope this
patch set is something that we can agree on and that more support can be added
on top of it. Please provide feedback and add other useful extensions
to the TODO list.
Thanks,
-rohit
Re: [patch00/05]: Containers(V2)- Introduction [message #6582 is a reply to message #6523]
Wed, 20 September 2006 16:56
Nick Piggin
Christoph Lameter wrote:
> On Wed, 20 Sep 2006, Nick Piggin wrote:
>
>
>>I'm not sure about containers & workload management people, but from
>>a core mm/ perspective I see no reason why this couldn't get in,
>>given review and testing. Great!
>
>
> Nack. We already have the ability to manage workloads. We may want to
> extend the existing functionality but this is duplicating what is already
> available through cpusets.
If it wasn't clear, I was talking specifically about the hooks for page
tracking rather than the whole patchset. If anybody wants such page
tracking infrastructure in the kernel, then this (as opposed to the
huge beancounters stuff) is what it should look like.
But as I said above, I don't know what the containers and workload
management people want exactly... The recent discussions about using
nodes and cpusets for memory workload management do seem like a
promising idea, and if it would avoid the need for this kind of
per-page tracking entirely, then that would probably be even better.
--
SUSE Labs, Novell Inc.
Re: [patch00/05]: Containers(V2)- Introduction [message #6583 is a reply to message #6523]
Wed, 20 September 2006 17:00
Nick Piggin
(this time to the lists as well)
Peter Zijlstra wrote:
> I'd much rather containerize the whole reclaim code, which should not
> be too hard since he already adds a container pointer to struct page.
Yes, and I tend to agree with you. I probably wasn't clear, but I was
mainly talking about just the memory resource tracking part of this
patchset.
I am less willing to make a judgement about reclaim, because I don't
know very much about the workloads or the guarantees they attempt to
provide.
> Esp. when we get some of my page reclaim abstractions merged, moving the
> reclaim from struct zone to a container is not a lot of work. (this is
> basically what one of the ckrm mm policies did too)
I do agree that it would be nicer to not have a completely different
scheme for doing their own page reclaim, but rather use the existing
code (*provided* that it is designed in the same, minimally intrusive
manner as the page tracking).
I can understand how it is attractive to create a new subsystem to
solve a particular problem, but once it is in the kernel it has to be
maintained regardless, so if it can be done in a way that shares more
of the current infrastructure (nicely) then that would be a better
solution.
I like that they're investigating the use of memory nodes for this.
It seems like the logical starting place.
> I still have to reread what Rohit does for file backed pages, that gave
> my head a spin.
> I've been thinking a bit on that problem, and it would be possible to
> share all address_space pages equally between attached containers, this
> would lose some accuracy, since one container could read 10% of the file
> and another 90%, but I don't think that is a common scenario.
Yeah, I'm not sure about that. I don't think really complex schemes
are needed... but again I might need more knowledge of their workloads
and problems.
--
SUSE Labs, Novell Inc.
Re: [patch00/05]: Containers(V2)- Introduction [message #6596 is a reply to message #6589]
Wed, 20 September 2006 17:30
Nick Piggin
Alan Cox wrote:
> Ar Iau, 2006-09-21 am 03:00 +1000, ysgrifennodd Nick Piggin:
>
>> > I've been thinking a bit on that problem, and it would be possible to
>> > share all address_space pages equally between attached containers, this
>> > would lose some accuracy, since one container could read 10% of the file
>> > and another 90%, but I don't think that is a common scenario.
>>
>>
>>Yeah, I'm not sure about that. I don't think really complex schemes
>>are needed... but again I might need more knowledge of their workloads
>>and problems.
>
>
> Any scenario which permits "cheating" will be a scenario that happens
> because people will try and cheat.
That's true, and that's one reason why I've advocated the solution
implemented by Rohit's patches, that is: just throw in the towel and
be happy to count just pages.
Look at the beancounter stuff: it has hooks (in the form of gfp
flags) throughout the tree, and they still manage to miss accounting
user exploitable memory overallocation from some callers. Maintaining
that will be much more difficult and error prone.
--
SUSE Labs, Novell Inc.
Re: [patch00/05]: Containers(V2)- Introduction [message #6603 is a reply to message #6597]
Wed, 20 September 2006 18:03
Nick Piggin
Christoph Lameter wrote:
> On Thu, 21 Sep 2006, Nick Piggin wrote:
>
>
>>Patch 2/5 in this series provides hooks, and they are pretty unintrusive.
>
>
> Ok. We shadow existing vm counters and add stuff to the address_space
> structure. The task add / remove is duplicating what some of the cpuset
> hooks do. That clearly shows that we are just duplicating functionality.
I don't think so. To start with, the point about containers is they are
not per address_space.
But secondly, these are hooks from the container subsystem into the mm
subsystem. As such, they might do something a bit more or different
than simple statistics, and we don't want to teach the core mm/ about
what that might be. You also want to be able to configure them out
entirely.
I think it is fine to add some new hooks in fundamental (i.e. mm-agnostic)
points. Without getting to the fine details about exactly how the hooks
are implemented, or what information needs to be tracked, I think we can
say that they are not much burden for mm/ to bear (if they turn out to
be usable).
--
SUSE Labs, Novell Inc.
Re: [patch00/05]: Containers(V2)- Introduction [message #6607 is a reply to message #6523]
Wed, 20 September 2006 18:14
Rohit Seth
On Wed, 2006-09-20 at 20:06 +0200, Peter Zijlstra wrote:
> On Wed, 2006-09-20 at 10:52 -0700, Christoph Lameter wrote:
> > On Wed, 20 Sep 2006, Rohit Seth wrote:
> >
> > > Right now the memory handler in this container subsystem is written in
> > > such a way that when the existing kernel reclaimer kicks in, it will first
> > > operate on those (container with pages over the limit) pages. But
> > > in general I like the notion of containerizing the whole reclaim code.
> >
> > Which comes naturally with cpusets.
>
> How are shared mappings dealt with, are pages charged to the set that
> first faults them in?
>
For anonymous pages (simpler case), they get charged to the faulting
task's container.
For filesystem pages (which could be shared across tasks running in different
containers): Every time a new file mapping is created, it is bound to the
container of the process creating that mapping. All subsequent pages
belonging to this mapping will belong to this container, irrespective of
different tasks running in different containers accessing these pages.
Currently, I've not implemented a mechanism to allow a file to be
specifically moved into or out of a container. But when that gets
implemented, all pages belonging to a mapping will also move out of the
container (or into a new container).
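The charging rule just described can be illustrated with a short
user-space sketch. The names here (FileMapping, charge_page, the container
labels "A" and "B", the file path) are hypothetical and do not come from
the patches; the sketch only encodes the stated rule that anonymous pages
go to the faulting task's container while file pages go to the container
the mapping was bound to when it was created.

```python
from collections import Counter


class FileMapping:
    """A file mapping bound to the container of the process that created it."""

    def __init__(self, path, creator_container):
        self.path = path
        self.container = creator_container  # fixed at creation time


def charge_page(usage, faulting_container, mapping=None):
    """Charge one page and return the container that was charged.

    Anonymous page (mapping is None): charged to the faulting task's
    container. File page: charged to the mapping's container, regardless
    of which container the faulting task runs in.
    """
    owner = faulting_container if mapping is None else mapping.container
    usage[owner] += 1
    return owner


usage = Counter()
shared_file = FileMapping("/data/shared.db", creator_container="A")

anon_owner = charge_page(usage, "B")                # anonymous fault in B
file_owner = charge_page(usage, "B", shared_file)   # file page touched from B
charge_page(usage, "A", shared_file)                # file page touched from A
```

Under these assumptions container A ends up charged for both file pages,
even though one of them was first touched by a task running in B.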
-rohit
Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction [message #6612 is a reply to message #6584]
Wed, 20 September 2006 18:34
Chandra Seetharaman
On Wed, 2006-09-20 at 09:25 -0700, Christoph Lameter wrote:
> On Tue, 19 Sep 2006, Rohit Seth wrote:
>
> > For example, a user can run a batch job like backup inside containers.
> > This job if run unconstrained could step over most of the memory present
> > in system thus impacting other workloads running on the system at that
> > time. But when the same job is run inside containers then the backup
> > job is run within container limits.
>
> I just saw this for the first time since linux-mm was not cced. We have
> discussed a similar mechanism on linux-mm.
>
> We already have such a functionality in the kernel, it's called a cpuset. A
Christoph,
There have been multiple discussions in the past (as recent as Aug 18,
2006), where we (Paul and CKRM/RG folks) have concluded that cpuset and
resource management are orthogonal features.
cpuset provides "resource isolation", and what we, the resource
management guys, want is work-conserving resource control.
cpuset partitions resources, and hence the resources that are assigned to a
node are not available to other cpusets, which is not good for "resource
management".
chandra
PS:
Aug 18 link: http://marc.theaimsgroup.com/?l=linux-kernel&m=115593114408336&w=2
Feb 2005 thread: http://marc.theaimsgroup.com/?l=ckrm-tech&m=110790400330617&w=2
> container could be created simply by creating a fake node that then
> allows constraining applications to this node. We already track the
> types of pages per node. The statistics you want are already existing.
> See /proc/zoneinfo and /sys/devices/system/node/node*/*.
>
> > We use the term container to indicate a structure against which we track
> > and charge utilization of system resources like memory, tasks etc for a
> > workload. Containers will allow system admins to customize the
> > underlying platform for different applications based on their
> > performance and HW resource utilization needs. Containers contain
> > enough infrastructure to allow optimal resource utilization without
> > bogging down rest of the kernel. A system admin should be able to
> > create, manage and free containers easily.
>
> Right, that's what cpusets do and it has been working fine for years. Maybe
> Paul can help you if you find anything missing in the existing means to
> control resources.
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
Re: [patch00/05]: Containers(V2)- Introduction [message #6615 is a reply to message #6523]
Wed, 20 September 2006 18:57
Rohit Seth
On Wed, 2006-09-20 at 20:37 +0200, Peter Zijlstra wrote:
> On Wed, 2006-09-20 at 10:50 -0700, Rohit Seth wrote:
> > On Thu, 2006-09-21 at 03:00 +1000, Nick Piggin wrote:
> > > (this time to the lists as well)
> > >
> > > Peter Zijlstra wrote:
> > >
> > > > I'd much rather containerize the whole reclaim code, which should not
> > > > be too hard since he already adds a container pointer to struct page.
> > >
> > >
> >
> > Right now the memory handler in this container subsystem is written in
> > such a way that when the existing kernel reclaimer kicks in, it will first
> > operate on those (container with pages over the limit) pages. But
> > in general I like the notion of containerizing the whole reclaim code.
>
> Patch 5/5 seems to have a horrid deactivation scheme.
>
> > > > I still have to reread what Rohit does for file backed pages, that gave
> > > > my head a spin.
> >
> > Please let me know if there is any specific part that isn't making much
> > sense.
>
> Well, the whole over the limit handler is quite painful; having taken a
> second reading, it isn't all that complex after all, just odd.
>
It is very basic right now.
> You just start invalidating whole files for file backed pages. Granted,
> this will get you below the threshold. but you might just have destroyed
> your working set.
>
When a container has gone over its limit, it is okay to penalize it. I
agree that I'm not making an attempt to maintain the current working
set. Any suggestions that I can incorporate to improve this algorithm
will be much appreciated.
> Pretty much the same for your anonymous memory handler: you scan through
> the pages in linear fashion and demote the first that you encounter.
>
> Both things pretty thoroughly destroy the existing kernel reclaim.
>
I agree that within a container I need to add more smarts to (for
example) not do a linear search. Simple additions like the last task or
last mapping visited could be useful. And I definitely want to improve
on that.
Though it should not destroy the existing kernel reclaim. Pages
belonging to an over the limit container should be the first ones to either
get flushed out to the FS or swapped if necessary. (That is the cost
that you will have to pay if you, for example, want to confine your
tar to a 100MB memory footprint.)
-rohit
Re: [patch00/05]: Containers(V2)- Introduction [message #6619 is a reply to message #6523]
Wed, 20 September 2006 19:48
Paul Jackson
Peter wrote:
> > Which comes naturally with cpusets.
>
> How are shared mappings dealt with, are pages charged to the set that
> first faults them in?
Cpusets does not attempt to manage how much memory a task can allocate,
but where it can allocate it. If a task can find an existing page to
share, and avoid the allocation, then it entirely avoids dealing with
cpusets in that case.
Cpusets pays no attention to how often a page is shared. It controls
which tasks can allocate a given free page, based on the node on which
that page resides. If that node is allowed in a task's 'nodemask_t
mems_allowed' (a task struct field), then the task can allocate
that page, so far as cpusets is concerned.
Cpusets does not care who links to a page, once it is allocated.
Every page is assigned to one specific node, and may only be allocated
by tasks allowed to allocate from that node.
These cpusets can overlap - which so far as memory goes, roughly means
that the various mems_allowed nodemask_t's of different tasks can overlap.
Here's an oddball example configuration that might make this easier to
think about.
Let's say we have a modest sized NUMA system with an extra bank
of memory added, in addition to the per-node memory. Let's say
the extra bank is a huge pile of cheaper (slower) memory, off a
slower bus.
Normal sized tasks running on one or more of the NUMA nodes just
get to fight for the CPUs and memory on those nodes allowed them.
Let's say an occasional big memory job is to be allowed to use
some of the extra cheap memory, and we use the idea of Andrew
and others to split that memory into fake nodes to manage the
portion of memory available to specified tasks.
Then one of these big jobs could be in a cpuset that let it use
one or more of the CPUs and memory on the node it ran on, plus
some number of the fake nodes on the extra cheap memory.
Other jobs could be allowed, using cpusets, to use any combination
of the same or overlapping CPUs or nodes, and/or other disjoint
CPUs or nodes, fake or real.
Another example, restating some of the above.
If say some application happened to fault in a libc.so page,
it would be required to place that page on one of the nodes
allowed to it. If another application comes along later and
ends up wanting shared references to that same page, it could
certainly do so, regardless of its cpuset settings. It would
not be allocating a new page for this, so would not encounter
the cpuset constraints on where it could allocate such a page.
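The libc.so example can be modeled in a few lines of user-space code. This
is only an illustration of the rule stated above: the PageCache class, its
fault method and the node numbers are hypothetical, and mems_allowed is
represented as a plain Python set rather than a nodemask_t. The point it
shows is that the mems_allowed check happens only when a new page has to be
allocated, never when an existing page is shared.

```python
class PageCache:
    """Toy model: remembers on which node each cached page was placed."""

    def __init__(self):
        self.node_of = {}  # page key -> node the page lives on

    def fault(self, key, mems_allowed, nodes_with_free_pages):
        # Sharing an existing page involves no allocation, so the task's
        # mems_allowed is never consulted.
        if key in self.node_of:
            return self.node_of[key]
        # Allocating a new page: only nodes in mems_allowed may be used.
        for node in nodes_with_free_pages:
            if node in mems_allowed:
                self.node_of[key] = node
                return node
        raise MemoryError("no free page on any allowed node")


cache = PageCache()
# First application is confined to node 0 and faults in a libc.so page.
first = cache.fault(("libc.so", 0), mems_allowed={0},
                    nodes_with_free_pages=[0, 1])
# Second application is confined to node 1 but shares the same page.
second = cache.fault(("libc.so", 0), mems_allowed={1},
                     nodes_with_free_pages=[0, 1])
# A private page of the second application must come from node 1.
third = cache.fault(("heap", 7), mems_allowed={1},
                    nodes_with_free_pages=[0, 1])
```

The second application ends up referencing a page on node 0 even though
node 0 is not in its allowed set, while its own new allocation is placed on
node 1.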
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
Re: [patch00/05]: Containers(V2)- Introduction [message #6630 is a reply to message #6588]
Wed, 20 September 2006 22:27
Paul Jackson
Christoph, responding to Nick:
> > Look at what the patches do. These are not only for hard partitioning
> > of memory per container but also those that share memory (eg. you might
> > want each to share 100MB of memory, up to a max of 80MB for an individual
> > container).
>
> So far I have not been able to find the hooks to the VM. The sharing
> would also work with nodes. Just create a couple of nodes with the sizes you
> want and then put the node with the shared memory into the cpusets for the
> apps sharing them.
Cpusets certainly allows for sharing - in the sense that multiple
tasks can each be allowed to allocate from the same node (fake
or real).
However, this is not sharing quite in the sense that Nick describes it.
In cpuset sharing, it is predetermined which pages are allowed to be
allocated by which tasks. Not "how many" pages, but "just which" pages.
Let's say we carve this 100 MB's up into 5 cpusets, of 20 MBs each, and
allow each of our many tasks to allocate from some specified 4 of these
5 cpusets. Then, even if some of those 100 MB's were still free, and
if a task was well below its allowed 80 MB's, the task might still not
be able to use that free memory, if that free memory happened to be in
whatever was the 5th cpuset that it was not allowed to use.
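The arithmetic of this example can be worked through in a short sketch.
The numbers for the five 20 MB pieces and the 80 MB allowance come from the
text above; the task's current usage of 30 MB, the assumption that all free
memory sits in the fifth piece, and the function names are made up for
illustration.

```python
# Five 20 MB pieces; in this scenario only the fifth one has free memory.
free_mb = {0: 0, 1: 0, 2: 0, 3: 0, 4: 20}
task_allowed = {0, 1, 2, 3}   # the task may use 4 of the 5 pieces
task_used_mb = 30             # assumed current usage of the task
task_limit_mb = 80            # allowance per task, from the example


def available_by_placement(free, allowed):
    """cpuset-style: only free memory on allowed pieces counts."""
    return sum(mb for piece, mb in free.items() if piece in allowed)


def available_by_count(free, used, limit):
    """Counting-style limit: any free memory counts, up to the allowance."""
    return min(sum(free.values()), limit - used)


placement_mb = available_by_placement(free_mb, task_allowed)
counted_mb = available_by_count(free_mb, task_used_mb, task_limit_mb)
```

With placement-based control the task can get 0 MB more, although 20 MB are
free and it is 50 MB below its allowance; with a pure count-based limit it
could get the full 20 MB.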
Seth:
Could your container proposal handle the above example, and let that
task have some of that memory, up to 80 MB's if available, but not
more, regardless of what node the free memory was on?
I presume so.
Another example that highlights this difference - airline overbooking.
If an airline has to preassign every seat, it can't overbook, short of
putting two passengers in the same seat and hoping one is a no show,
which is pretty cut throat. If an airline is willing to bet that
seldom more than 90% of the ticketed passengers will show up, and it
doesn't preassign all seats, it can wait until flight time, see who
shows up, and hand out the seats then. It can preassign some seats,
but it needs some passengers showing up unassigned, free to take what's
left over.
Cpusets preassigns which nodes are allowed a task. If not all the
pages on a node are allocated by one of the tasks it is preassigned to,
those pages "fly empty" -- remain unallocated. This happens regardless
of how overbooked is the memory on other nodes.
If you just want to avoid fisticuffs at the gate between overbooked
passengers, cpusets are enough. If you further want to maximize utilization,
then you need the capacity management of resource groups, or some such.
> > The nodes+cpusets stuff doesn't seem to help with that because you
> > fundamentally need to track pages on a per container basis,
> > otherwise you don't know who's got what.
>
> Hmmm... That gets into issues of knowing how many pages are in use by an
> application and that is fundamentally difficult to do due to pages being
> shared between processes.
Fundamentally difficult or not, it seems to be required for what Nick
describes, and for sure cpusets doesn't do it (track memory usage per
container.)
> > Now if, in practice, it turns out that nobody really needed these
> > features then of course I would prefer the cpuset+nodes approach. My
> > point is that I am not in a position to know who wants what, so I
> > hope people will come out and discuss some of these issues.
I don't know either ;).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
Re: [patch00/05]: Containers(V2)- Introduction [message #6631 is a reply to message #6594]
Wed, 20 September 2006 22:51
Paul Jackson
Seth wrote:
> But am not sure
> if this number of nodes can change dynamically on the running machine or
> a reboot is required to change the number of nodes.
The current numa=fake=N kernel command line option is just boottime,
and just x86_64.
I presume we'd have to remove these two constraints for this to be
generally usable to containerize memory.
We also, in my current opinion, need to fix up the node_distance
between such fake numa sibling nodes, to correctly reflect that they
are on the same real node (LOCAL_DISTANCE).
And some non-trivial, arch-specific, zonelist sorting and reconstruction
work will be needed.
And an API devised for the above mentioned dynamic changing.
And this will push on the memory hotplug/unplug technology.
All in all, it could avoid anything more than trivial changes to the
existing memory allocation code hot paths. But the infrastructure
needed for managing this mechanism needs some non-trivial work.
> Though when you want to have in excess of 100 containers then the cpuset
> function starts popping up on the oprofile chart very aggressively.
As the linux-mm discussion last weekend examined in detail, we can
eliminate this performance speed bump, probably by caching the
last zone on which we found some memory. The linear search that was
implicit in __alloc_pages()'s use of zonelists for many years finally
became explicit with this new usage pattern.
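To make the cost of that linear search concrete, here is a rough
user-space sketch of one possible reading of "caching the last zone on
which we found some memory". It is not the kernel's allocator and not a
proposal from this thread: the Zonelist class, its fields and the 100-zone
layout are hypothetical, and starting the scan at the cached position
ignores the zone ordering preferences a real zonelist encodes.

```python
class Zonelist:
    """Toy zonelist: a list of free page counts, scanned in order."""

    def __init__(self, free_pages):
        self.free = list(free_pages)
        self.last_hit = 0   # index of the zone that last satisfied a request
        self.probes = 0     # number of zones examined so far

    def alloc(self, use_cache):
        n = len(self.free)
        # Without the cache every scan starts at the head of the list.
        start = self.last_hit if use_cache else 0
        for step in range(n):
            i = (start + step) % n
            self.probes += 1
            if self.free[i] > 0:
                self.free[i] -= 1
                self.last_hit = i
                return i
        return None  # no zone has free pages


# 100 fake zones; only the last one still has free pages.
layout = [0] * 99 + [3]
plain = Zonelist(layout)
cached = Zonelist(layout)
for _ in range(3):
    plain.alloc(use_cache=False)
    cached.alloc(use_cache=True)
```

In this toy setup three allocations cost 300 zone probes without the cache
and 102 with it, since only the first allocation has to walk past the 99
exhausted zones.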
> Containers also provide a mechanism to move files to containers. Any
> further references to this file come from the same container rather than
> the container which is bringing in a new page.
I haven't read these patches enough to quite make sense of this, but I
suspect that this is not a distinction between cpusets and these
containers, for the basic reason that cpusets doesn't need to 'move'
a file's references because it has no clue what such are.
> In future there will be more handlers like CPU and disk that can be
> easily embedded into this container infrastructure.
This may be a deciding point.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401