	| 
		
			| Re: [patch00/05]: Containers(V2)- Introduction [message #6636 is a reply to message #6590] | Wed, 20 September 2006 23:18   |  
			| 
				
				
					|  Paul Jackson Messages: 157
 Registered: February 2006
 | Senior Member |  |  |  
	| Christoph, responding to Alan:
 > > I'm also not clear how you handle shared pages correctly under the fake
 > > node system, can you perhaps explain that further how this works for say
 > > a single apache/php/glibc shared page set across 5000 containers each a
 > > web site.
 >
 > Cpusets can share nodes. I am not sure what the problem would be? Paul may
 > be able to give you more details.
 
 Cpusets share pre-assigned nodes, but not anonymous proportions of the
 total system memory.
 
 So sharing an apache/php/glibc page set across 5000 containers using
 cpusets would be awkward.  Unless I'm missing something, you'd have to
 prepage in that page set, from some task allowed that many pages in
 its own cpuset, then you'd run each of the 5000 web servers in smaller
 cpusets that allowed space for the remainder of whatever that web
 server was provisioned, not counting the shared pages.  The shared pages
 wouldn't count, because cpusets doesn't ding you for using a page that
 is already in memory -- it just keeps you from allocating fresh pages
 on certain nodes.  When it came time to do rolling upgrades to new
 versions of the software, and add a marketing driven list of 57
 additional applications that the customers could use to build their
 website, this could become an official nightmare.
 
 Overbooking (selling say 10 MB of memory for each server, even though
 there is less than 5000 * 10 MB total RAM in the system) would also be
 awkward.  One could simulate with overlapping sets of fake numa nodes,
 as I described in an earlier post today (the one that gave each task
 some four of the five 20 MB fake cpusets.) But there would still be
 false resource conflicts, and the (ab)use of the cpuset apparatus for
 this seems unintuitive, in my opinion.
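 
 To make the overlapping-set idea concrete, here is a minimal sketch, not kernel code. It assumes five fake nodes of 20 MB each, with every task allowed to allocate on four of them. The helper names (allowed_node_sets, allocate) are hypothetical. It also shows one kind of false resource conflict: a task is refused memory while another node still has free space.

```python
from itertools import combinations

NODE_MB = 20         # size of each fake NUMA node (assumed from the example)
NUM_NODES = 5        # five fake nodes, 100 MB in total
NODES_PER_TASK = 4   # each task may allocate on four of the five


def allowed_node_sets():
    """Every way of giving a task four of the five fake nodes."""
    return list(combinations(range(NUM_NODES), NODES_PER_TASK))


def allocate(free_mb, allowed, want_mb):
    """Take up to want_mb from the allowed nodes only; return the MB obtained."""
    got = 0
    for node in allowed:
        take = min(free_mb[node], want_mb - got)
        free_mb[node] -= take
        got += take
    return got


free_mb = [NODE_MB] * NUM_NODES
sets = allowed_node_sets()

# Task A may use nodes (0, 1, 2, 3) and fills all 80 MB available to it.
got_a = allocate(free_mb, sets[0], 80)

# Task C shares the same node set. It gets nothing, although node 4
# still has 20 MB free: a false resource conflict.
got_c = allocate(free_mb, sets[0], 10)

print(got_a, got_c, sum(free_mb))
```

 Each task can reach at most 80 MB of the 100 MB total, so the sets overbook the machine, but whether a given request succeeds depends on which nodes happen to be full rather than on total free memory.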
 
 I imagine that a web site supporting 5000 web servers would be very
 interested in overbooking working well.  I'm sure the $7.99/month
 cheap as dirt virtual web servers of which I am a customer overbook.
 
 --
 I won't rest till it's the best ...
 Programmer, Linux Scalability
 Paul Jackson <pj@sgi.com> 1.925.600.0401
 |  
	|  |  |  
	| 
		
			| Re: [patch00/05]: Containers(V2)- Introduction [message #6637 is a reply to message #6631] | Wed, 20 September 2006 23:22   |  
			| 
				
				
					|  Rohit Seth Messages: 101
 Registered: August 2006
 | Senior Member |  |  |  
	| On Wed, 2006-09-20 at 15:51 -0700, Paul Jackson wrote:
 > Seth wrote:
 > > But am not sure
 > > if this number of nodes can change dynamically on the running machine or
 > > a reboot is required to change the number of nodes.
 >
 > The current numa=fake=N kernel command line option is just boottime,
 > and just x86_64.
 >
 
 Ah okay.
 
 > I presume we'd have to remove these two constraints for this to be
 > generally usable to containerize memory.
 >
 
 Right.
 
 > We also, in my current opinion, need to fix up the node_distance
 > between such fake numa sibling nodes, to correctly reflect that they
 > are on the same real node (LOCAL_DISTANCE).
 >
 > And some non-trivial, arch-specific, zonelist sorting and reconstruction
 > work will be needed.
 >
 > And an API devised for the above mentioned dynamic changing.
 >
 > And this will push on the memory hotplug/unplug technology.
 >
 
 Yes, if we use the existing notion of nodes for other purposes then you
 have captured the right set of changes that will be needed to make that
 happen.  Such changes are not required for container patches as such.
 
 > All in all, it could avoid anything more than trivial changes to the
 > existing memory allocation code hot paths.  But the infrastructure
 > needed for managing this mechanism needs some non-trivial work.
 >
 >
 > > Though when you want to have in excess of 100 containers then the cpuset
 > > function starts popping up on the oprofile chart very aggressively.
 >
 > As the linux-mm discussion last weekend examined in detail, we can
 > eliminate this performance speed bump, probably by caching the
 > last zone on which we found some memory.  The linear search that was
 > implicit in __alloc_pages()'s use of zonelists for many years finally
 > became explicit with this new usage pattern.
 >
 
 Okay.
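 
 As a rough illustration of the caching idea quoted above, the sketch below starts each search at the zone that last satisfied a request instead of at the head of the list. It is a toy model under stated assumptions, not the kernel's zonelist code: the ZoneList class, its fields, and the wrap-around search order are all hypothetical, and wrapping around gives up the strict preference order that a real zonelist encodes.

```python
class ZoneList:
    """Toy zonelist: free page counts per zone, in preference order."""

    def __init__(self, free_pages):
        self.free = list(free_pages)
        self.last_hit = 0        # hypothetical cache: zone that last had memory
        self.zones_checked = 0   # how many zones we had to look at in total

    def alloc_page(self):
        """Return the index of the zone that supplied a page, or None."""
        n = len(self.free)
        for step in range(n):
            # Start at the cached zone rather than at index 0.
            idx = (self.last_hit + step) % n
            self.zones_checked += 1
            if self.free[idx] > 0:
                self.free[idx] -= 1
                self.last_hit = idx
                return idx
        return None


# 99 exhausted zones followed by one zone with free pages.
zl = ZoneList([0] * 99 + [2])
first = zl.alloc_page()    # walks the whole list once
second = zl.alloc_page()   # finds memory on the first zone it checks
print(first, second, zl.zones_checked)
```

 With many small fake nodes the first allocation pays for the full linear walk, while later ones start where memory was last found.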
 
 >
 > > Containers also provide a mechanism to move files to containers. Any
 > > further references to this file come from the same container rather than
 > > the container which is bringing in a new page.
 >
 > I haven't read these patches enough to quite make sense of this, but I
 > suspect that this is not a distinction between cpusets and these
 > containers, for the basic reason that cpusets doesn't need to 'move'
 > a file's references because it has no clue what such are.
 >
 
 But container support will allow a given file's pages to come from
 the same container irrespective of who is using them.  That is useful
 for shared libs etc.
 
 -rohit
 
 >
 > > In future there will be more handlers like CPU and disk that can be
 > > easily embedded into this container infrastructure.
 >
 > This may be a deciding point.
 >
 |  
	|  |  |  
	| 
		
			| Re: [patch00/05]: Containers(V2)- Introduction [message #6729 is a reply to message #6581] | Wed, 20 September 2006 17:07   |  
			| 
				
				
					|  Nick Piggin Messages: 35
 Registered: March 2006
 | Member |  |  |  
	| On Wed, Sep 20, 2006 at 09:48:13AM -0700, Christoph Lameter wrote:
 > On Wed, 20 Sep 2006, Nick Piggin wrote:
 >
 > > > Right thats what cpusets do and it has been working fine for years. Maybe
 > > > Paul can help you if you find anything missing in the existing means to
 > > > control resources.
 > >
 > > What I like about Rohit's patches is the page tracking stuff which
 > > seems quite simple but capable.
 > >
 > > I suspect cpusets don't quite provide enough features for non-exclusive
 > > use of memory (eg. page tracking for directed reclaim).
 >
 > Look at the VM statistics please. We have detailed page statistics per
 > zone these days. If there is anything missing then this would best be put
 > into general functionality. When I looked at it, I saw page statistics
 > that were replicating things that we already track per zone. All these
 > would become available if a container is realized via a node and we would
 > be using proven VM code.
 
 Look at what the patches do. These are not only for hard partitioning
 of memory per container but also those that share memory (eg. you might
 want each to share 100MB of memory, up to a max of 80MB for an individual
 container).
 
 The nodes+cpusets stuff doesn't seem to help with that because you
 fundamentally need to track pages on a per-container basis, otherwise
 you don't know who's got what.
 
 Now if, in practice, it turns out that nobody really needed these
 features then of course I would prefer the cpuset+nodes approach. My
 point is that I am not in a position to know who wants what, so I
 hope people will come out and discuss some of these issues.
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction [message #6730 is a reply to message #6583] | Wed, 20 September 2006 17:23   |  
			| 
				
				
					|  Paul Menage Messages: 642
 Registered: September 2006
 | Senior Member |  |  |  
	| On 9/20/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote: 
 > Yeah, I'm not sure about that. I don't think really complex schemes
 > are needed... but again I might need more knowledge of their workloads
 > and problems.
 >
 
 The basic need for separating files into containers distinct from the
 tasks that are using them arises when you have several "jobs" all
 working with the same large data set. (Possibly read-only data files,
 or possibly one job is updating a dataset that's being used by other
 jobs).
 For automated job-tracking and scheduling, it's important to be able
 to distinguish shared usage from individual usage (e.g. to be able to
 answer questions "if I kill job X, how much memory do I get back?" and
 "how do I recover 1G of memory on this machine?").
 
 As an example, assume two jobs each with 100M of anonymous memory both
 mapping the same 1G file, for a total usage of 1.2G.
 
 Any setup that doesn't let you distinguish shared and private usage
 makes it hard to answer that kind of scheduling question. E.g.:
 
 - first user gets charged for the page -> first job reported as 1.1G,
 and the second as 0.1G.
 
 - page charges get shared between all users of the page -> two tasks
 using 0.6G each.
 
 - all users get charged for the page -> two tasks using 1.1G each.
 
 But in fact killing either one of these jobs individually would only
 free up 100M.
 
 By explicitly letting userspace see that there are two jobs each with
 a private usage of 100M, and they're sharing a dataset of 1G, it's
 possible to make more informed decisions.
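 
 The arithmetic in the example above can be checked with a small sketch. It assumes 1G is taken as 1000M so that the totals match the 1.2G figure, and the function names are hypothetical labels for the three charging policies listed, not anything from the patches.

```python
PRIVATE_MB = {"job1": 100, "job2": 100}   # anonymous memory per job
SHARED_FILE_MB = 1000                     # the 1G file mapped by both jobs


def charge_first_user(private, shared, first):
    """The first job to touch the file is charged for all of it."""
    return {j: mb + (shared if j == first else 0) for j, mb in private.items()}


def charge_split(private, shared):
    """The file's pages are divided evenly between all users."""
    share = shared / len(private)
    return {j: mb + share for j, mb in private.items()}


def charge_all_users(private, shared):
    """Every user is charged for the whole file."""
    return {j: mb + shared for j, mb in private.items()}


def freed_by_killing(private, shared, victim):
    """Memory actually recovered: the file stays while another job maps it."""
    others_still_mapping = len(private) > 1
    return private[victim] + (0 if others_still_mapping else shared)


print(charge_first_user(PRIVATE_MB, SHARED_FILE_MB, "job1"))
print(charge_split(PRIVATE_MB, SHARED_FILE_MB))
print(charge_all_users(PRIVATE_MB, SHARED_FILE_MB))
print(freed_by_killing(PRIVATE_MB, SHARED_FILE_MB, "job1"))
```

 None of the three per-job totals equals the 100M that killing a job would really free, which is the reason for reporting private and shared usage separately.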
 
 The issue of telling the kernel exactly which files/directories need
 to be accounted separately can be handled by userspace.
 
 It could be done by per-page accounting, or by constraining particular
 files to particular memory zones, or by just tracking/limiting the
 number of pages from each address_space in the pagecache, but I think
 that it's important that the kernel at least provide the primitive
 support for this.
 
 Paul
 |  
	|  |  |  
	| 
		
			| Re: [patch00/05]: Containers(V2)- Introduction [message #6735 is a reply to message #6601] | Wed, 20 September 2006 18:37   |  
			| 
				
				
					|  Peter Zijlstra Messages: 61
 Registered: September 2006
 | Member |  |  |  
	| On Wed, 2006-09-20 at 10:50 -0700, Rohit Seth wrote:
 > On Thu, 2006-09-21 at 03:00 +1000, Nick Piggin wrote:
 > > (this time to the lists as well)
 > >
 > > Peter Zijlstra wrote:
 > >
 > >  > I'd much rather containerize the whole reclaim code, which should not
 > >  > be too hard since he already adds a container pointer to struct page.
 > >
 > >
 >
 > Right now the memory handler in this container subsystem is written in
 > such a way that when existing kernel reclaimer kicks in, it will first
 > operate on those (container with pages over the limit) pages first.  But
 > in general I like the notion of containerizing the whole reclaim code.
 
 Patch 5/5 seems to have a horrid deactivation scheme.
 
 > >  > I still have to reread what Rohit does for file backed pages, that gave
 > >  > my head a spin.
 >
 > Please let me know if there is any specific part that isn't making much
 > sense.
 
 Well, the whole over-the-limit handler is quite painful. Having taken a
 second reading, it isn't all that complex after all, just odd.
 
 You just start invalidating whole files for file backed pages. Granted,
 this will get you below the threshold, but you might just have destroyed
 your working set.
 
 Pretty much the same goes for your anonymous memory handler: you scan
 through the pages in linear fashion and demote the first that you encounter.
 
 Both things pretty thoroughly destroy the existing kernel reclaim.
 |  
	|  |  |  
	| 
		
			| Re: [patch00/05]: Containers(V2)- Introduction [message #6736 is a reply to message #6607] | Wed, 20 September 2006 18:27   |  
			| 
				
				
					|  Peter Zijlstra Messages: 61
 Registered: September 2006
 | Member |  |  |  
	| On Wed, 2006-09-20 at 11:14 -0700, Rohit Seth wrote:
 > On Wed, 2006-09-20 at 20:06 +0200, Peter Zijlstra wrote:
 > > On Wed, 2006-09-20 at 10:52 -0700, Christoph Lameter wrote:
 > > > On Wed, 20 Sep 2006, Rohit Seth wrote:
 > > >
 > > > > Right now the memory handler in this container subsystem is written in
 > > > > such a way that when existing kernel reclaimer kicks in, it will first
 > > > > operate on those (container with pages over the limit) pages first.  But
 > > > > in general I like the notion of containerizing the whole reclaim code.
 > > >
 > > > Which comes naturally with cpusets.
 > >
 > > How are shared mappings dealt with, are pages charged to the set that
 > > first faults them in?
 > >
 >
 > For anonymous pages (simpler case), they get charged to the faulting
 > task's container.
 >
 > For filesystem pages (could be shared across tasks running different
 > containers): Every time a new file mapping is created, it is bound to a
 > container of the process creating that mapping.  All subsequent pages
 > belonging to this mapping will belong to this container, irrespective of
 > different tasks running in different containers accessing these pages.
 > Currently, I've not implemented a mechanism to allow a file to be
 > specifically moved into or out of container. But when that gets
 > implemented then all pages belonging to a mapping will also move out of
 > container (or into a new container).
 
 Yes, I read that in your patches, I was wondering how the cpuset
 approach would handle this.
 
 Neither is really satisfactory for shared mappings.
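 
 The charging rules Rohit describes in the quoted text can be modelled in a few lines. This is only a sketch of the stated policy, with hypothetical names (Container, Mapping, create_mapping, fault_anon, fault_file) that are not taken from the patches: anonymous pages are charged to the faulting task's container, and file pages are charged to the container that created the mapping, whoever faults them in.

```python
from dataclasses import dataclass


@dataclass
class Container:
    name: str
    anon_pages: int = 0
    file_pages: int = 0


@dataclass
class Mapping:
    owner: Container   # fixed when the mapping is created


def create_mapping(creator: Container) -> Mapping:
    """Bind a new file mapping to the container of the creating process."""
    return Mapping(owner=creator)


def fault_anon(task_container: Container) -> None:
    """Anonymous page: charged to the faulting task's container."""
    task_container.anon_pages += 1


def fault_file(mapping: Mapping, task_container: Container) -> None:
    """File page: charged to the mapping's owner, not to the faulting task."""
    mapping.owner.file_pages += 1


a = Container("A")
b = Container("B")
libc = create_mapping(a)   # a task in A opens the file first

for _ in range(3):
    fault_file(libc, b)    # a task in B touches three pages of it
fault_anon(b)

print(a, b)
```

 Container A ends up charged for pages that only B is using, which is the shared-mapping problem raised above.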
 |  
	|  |  |  
	|  |  
	| 
		
			| Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction [message #6738 is a reply to message #6612] | Wed, 20 September 2006 18:43   |  
			| 
				
				
					|  Paul Menage Messages: 642
 Registered: September 2006
 | Senior Member |  |  |  
	| On 9/20/06, Chandra Seetharaman <sekharan@us.ibm.com> wrote:
 > > We already have such a functionality in the kernel its called a cpuset. A
 >
 > Christoph,
 >
 > There had been multiple discussions in the past (as recent as Aug 18,
 > 2006), where we (Paul and CKRM/RG folks) have concluded that cpuset and
 > resource management are orthogonal features.
 >
 > cpuset provides "resource isolation", and what we, the resource
 > management guys want is work-conserving resource control.
 
 CPUset provides two things:
 
 - a generic process container abstraction
 
 - "resource controllers" for CPU masks and memory nodes.
 
 Rather than adding a new process container abstraction, wouldn't it
 make more sense to change cpuset to make it more extensible (more
 separation between resource controllers), possibly rename it to
 "containers", and let the various resource controllers fight it out
 (e.g. zone/node-based memory controller vs multiple LRU controller,
 CPU masks vs a properly QoS-based CPU scheduler, etc)?
 
 Or more specifically, what would need to be added to cpusets to make
 it possible to bolt the CKRM/RG resource controllers on to it?
 
 Paul
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction [message #6925 is a reply to message #6523] | Wed, 27 September 2006 19:50   |  
			| 
				
				
					|  Chandra Seetharaman Messages: 88
 Registered: August 2006
 | Member |  |  |  
	| Rohit, 
 I finally looked into your memory controller patches. Here are some of
 the issues I see:
 
 (All points below are in the context of page limit of containers being
 hit and the new code starts freeing up pages)
 
 1. LRU is ignored totally thereby thrashing the working set (as pointed
 out by Peter Zijlstra).
 2. Frees up file pages first when hitting the page limit thereby making
 vm_swappiness ineffective.
 3. Starts writing back pages when the # of file pages is close to the
 limit, thereby breaking the current writeback algorithm/logic.
 4. MAPPED files are not counted against the page limit. why ?. This
 affects reclamation behavior and makes vm_swappiness ineffective.
 5. Starts freeing up pages from the first task or the first file in the
 linked list. This logic unfairly penalizes the early members of the
 list.
 6. Both active and inactive pages use physical pages. But, the
 controller only counts active pages and not inactive pages. why ?
 7. Page limit is checked against the sum of (anon and file pages) in
 some places and against active pages at some other places. IMO, it
 should be always compared to the same value.
 
 BTW, it will be easier to read/follow the patches if you separate them
 out by functionality.
 
 regards,
 
 chandra
 
 On Tue, 2006-09-19 at 19:16 -0700, Rohit Seth wrote:
 > Containers:
 >
 > Commodity HW is becoming more powerful.  This is giving opportunity to
 > run different workloads on the same platform for better HW resource
 > utilization.  To run different workloads efficiently on the same
 > platform, it is critical that we have a notion of limits for each
 > workload in Linux kernel.  Current cpuset feature in Linux kernel
 > provides grouping of CPU and memory support to some extent (for NUMA
 > machines).
 >
 > For example, a user can run a batch job like backup inside containers.
 > This job if run unconstrained could step over most of the memory present
 > in system thus impacting other workloads running on the system at that
 > time.  But when the same job is run inside containers then the backup
 > job is run within container limits.
 >
 > We use the term container to indicate a structure against which we track
 > and charge utilization of system resources like memory, tasks etc for a
 > workload. Containers will allow system admins to customize the
 > underlying platform for different applications based on their
 > performance and HW resource utilization needs.  Containers contain
 > enough infrastructure to allow optimal resource utilization without
 > bogging down rest of the kernel.  A system admin should be able to
 > create, manage and free containers easily.
 >
 > At the same time, changes in kernel are minimized so as this support can
 > be easily integrated with mainline kernel.
 >
 > The user interface for containers is through configfs.  Appropriate file
 > system privileges are required to do operations on each container.
 > Currently implemented container resources are automatically visible to
 > user space through /configfs/container/<container_name> after a
 > container is created.
 >
 > Signed-off-by: Rohit Seth <rohitseth@google.com>
 >
 > Diffstat for the patch set (against linux-2.6.18-rc6-mm2):
 >
 >  Documentation/containers.txt |   65 ++++
 >  fs/inode.c                   |    3
 >  include/linux/container.h    |  167 ++++++++++
 >  include/linux/fs.h           |    5
 >  include/linux/mm_inline.h    |    4
 >  include/linux/mm_types.h     |    4
 >  include/linux/sched.h        |    6
 >  init/Kconfig                 |    8
 >  kernel/Makefile              |    1
 >  kernel/container_configfs.c  |  440 ++++++++++++++++++++++++++++
 >  kernel/exit.c                |    2
 >  kernel/fork.c                |    9
 >  mm/Makefile                  |    2
 >  mm/container.c               |  658 +++++++++++++++++++++++++++++++++++++++++++
 >  mm/container_mm.c            |  512 +++++++++++++++++++++++++++++++++
 >  mm/filemap.c                 |    4
 >  mm/page_alloc.c              |    3
 >  mm/rmap.c                    |    8
 >  mm/swap.c                    |    1
 >  mm/vmscan.c                  |    1
 >  20 files changed, 1902 insertions(+), 1 deletion(-)
 >
 > Changes from version 1:
 > Fixed the Documentation error
 > Fixed the corruption in container task list
 > Added the support for showing all the tasks belonging to a container
 > through showtask attribute
 > moved the Kconfig changes to init directory (from mm)
 > Fixed the bug of unregistering container subsystem if we are not able to
 > create workqueue
 > Better support for handling limits for file pages.  This now includes
 > support for flushing and invalidating page cache pages.
 > Minor other changes.
 >
 >  *****************************************************************
 > This patch set has basic container support that includes:
 >
 > - Create a container using mkdir command in configfs
 >
 > - Free a container using rmdir command
 >
 > - Dynamically adjust memory and task limits for container.
 >
 > - Add/Remove a task to container (given a pid)
 >
 > - Files are currently added as part of open from a task that already
 > belongs to a container.
 >
 > - Keep track of active, anonymous, mapped and pagecache usage of
 > container memory
 >
 > - Does not allow more than task_limit number of tasks to be created in
 > the container.
 >
 > - Over the limit memory handler is called when number of pages (anon +
 > pagecache) exceed the limit.  It is also called when number of active
 > pages exceed the page limit.  Currently, this memory handler scans the
 > mappings and tasks belonging to container (file and anonymous) and tries
 > to deactivate pages.  If the number of page cache pages is also high
 > then it also invalidate mappings.  The thought behind this scheme is, it
 > is okay for containers to go over limit as long they run in degraded
 > manner when they are over their limit. Also, if there is any memory
 > pressure then pages belonging to over the limit container(s) become
 > prime candidates for kernel reclaimer.  Container mutex is also held
 > during the time this handler is working its way through to prevent any
 > further addition of resources (like tasks or mappings) to this
 > container.  Though it also blocks removal of same resources from the
 > container for the same time. It is possible that over the limit page
 > handler takes lot of time if memory pressure on a container is
 > continuously very high.  The limits, like how long a task should
 > schedule out when it hits memory limit, is also on the lower side at
 > present (particularly when it is memory hogger).  But should be easy to
 > change if need be.
 >
 > - Indicate the number of times the page limit and task limit is hit
 >
 > - Indicate the tasks (pids) belonging to container.
 >
 > Below is a one line description for patches that will follow:
 >
 > [patch01]: Documentation on how to use containers
 > (Documentation/container.txt)
 >
 > [patch02]: Changes in the generic part of kernel code
 >
 > [patch03]: Container's interface with configfs
 >
 > [patch04]: Core container support
 >
 > [patch05]: Over the limit memory handler.
 >
 > TODO:
 >
 > - some code(like container_add_task) in mm/container.c should go
 > elsewhere.
 > - Support adding/removing a file name to container through configfs
 > - /proc/pid/container to show the container id (or name)
 > - More testing for memory controller.  Currently it is possible that
 > limits are exceeded.  See if a call to reclaim can be easily integrated.
 > - Kernel memory tracking (based on patches from BC)
 > - Limit on user locked memory
 > - Huge memory support
 > - Stress testing with containers
 > - One shot view of all containers
 > - CKRM folks are interested in seeing all processes belonging to a
 > container.  Add the attribute show_tasks to container.
 > - Add logic so that the sum of limits are not exceeding appropriate
 > system requirements.
 > - Extend it with other controllers (CPU and Disk I/O)
 > - Add flags bits for supporting different actions (like in some cases
 > provide a hard memory limit and in some cases it could be soft).
 > - Capability to kill processes for the extreme cases.
 >  ...
 >
 > This is based on lot of discussions over last month or so.  I hope this
 > patch set is something that we can agree and more support can be added
 > on top of this.  Please provide feedback and add other extensions that
 > are useful in the TODO list.
 >
 > Thanks,
 > -rohit
 >
 >
 >
 >
 >
 --
 
 ----------------------------------------------------------------------
 Chandra Seetharaman               | Be careful what you choose....
 - sekharan@us.ibm.com   |      .......you may get it.
 ----------------------------------------------------------------------
...
 
 
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction [message #6928 is a reply to message #6925] | Wed, 27 September 2006 21:28   |  
			| 
				
				
					|  Rohit Seth Messages: 101
 Registered: August 2006
 | Senior Member |  |  |  
	| On Wed, 2006-09-27 at 12:50 -0700, Chandra Seetharaman wrote:
 > Rohit,
 >
 > I finally looked into your memory controller patches. Here are some of
 > the issues I see:
 > (All points below are in the context of page limit of containers being
 > hit and the new code starts freeing up pages)
 >
 > 1. LRU is ignored totally thereby thrashing the working set (as pointed
 >    by Peter Zijlstra).
 
 As the container goes over the limit, this algorithm deactivates some of
 the pages.  I agree that the logic to find out the correct pages to
 deactivate needs to be improved.  But the idea is that these pages go in
 front of inactive list so that if there is any memory pressure system
 wide then these pages can easily be reclaimed.
 
 > 2. Frees up file pages first when hitting the page limit thereby making
 >    vm_swappiness ineffective.
 
 Not sure if I understood this part correctly.  But the choice when the
 container goes over its limit is between swap out some of the anonymous
 memory first or writeback some of the dirty file pages belonging to this
 container.
 
 > 3. Starts writing back pages when the # of file pages is close to the
 >    limit, thereby breaking the current writeback algorithm/logic.
 
 That is done so as to ensure processes belonging to the container (whose
 limit is hit) are the first ones getting penalized.  For example, if you
 run a tar in a container with a 100MB limit then the dirty file pages will
 be written back to disk when the 100MB limit is hit.  I will, though, be
 adding a HARD_LIMIT flag for page cache, and the strict limit will only be
 maintained if this container flag is set.
 
 > 4. MAPPED files are not counted against the page limit. why ?. This
 >    affects reclamation behavior and makes vm_swappiness ineffective.
 
 num_mapped_pages only indicates how many page cache pages are mapped in
 user page tables.  More of an accounting variable.
 
 > 5. Starts freeing up pages from the first task or the first file in the
 >    linked list. This logic unfairly penalizes the early members of the
 >    list.
 
 This is the part that I have to fix.  Some per-container variables that
 remember the last values will help here.
 
 > 6. Both active and inactive pages use physical pages. But, the
 >    controller only counts active pages and not inactive pages. why ?
 
 The thought is, it is okay for containers to go over its limit as long
 as there is enough memory in the system. When there is any memory
 pressure then the inactive (+ dereferenced) pages get swapped out thus
 penalizing the container.  I'm also thinking of having hard limit for
 anonymous pages beyond which the container will not be able to grow its
 anonymous pages.
 
 > 7. Page limit is checked against the sum of (anon and file pages) in
 >    some places and against active pages at some other places. IMO, it
 >    should be always compared to the same value.
 >
 It is checked against the sum of anon+file pages at the time when a new
 page is getting allocated.  But since the reclaimer activates pages, it is
 also important to make sure the number of active pages is not going
 above its limit.
 
 Thanks for your comments,
 -rohit
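 
 The fix suggested for point 5, per-container variables that remember where the last scan stopped, could look roughly like the sketch below. The class and method names (ContainerScanState, next_victims) are hypothetical and the items are just labels for tasks or files. It only addresses fairness between list members and does nothing about LRU ordering within or among them.

```python
class ContainerScanState:
    """Remembers where the previous over-the-limit scan stopped."""

    def __init__(self, items):
        self.items = list(items)   # tasks or files attached to the container
        self.cursor = 0            # hypothetical per-container "last position"

    def next_victims(self, count):
        """Pick up to count items, resuming after the last one scanned."""
        n = len(self.items)
        picked = []
        for _ in range(min(count, n)):
            picked.append(self.items[self.cursor])
            self.cursor = (self.cursor + 1) % n
        return picked


scan = ContainerScanState(["task1", "task2", "task3"])
round1 = scan.next_victims(2)   # starts at the head of the list
round2 = scan.next_victims(2)   # resumes at task3 instead of task1
print(round1, round2)
```

 Without the cursor every invocation would start at task1 again, so the early members of the list would always be penalized first.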
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction [message #6929 is a reply to message #6928] | Wed, 27 September 2006 22:24   |  
			| 
				
				
					|  Chandra Seetharaman Messages: 88
 Registered: August 2006
 | Member |  |  |  
	| On Wed, 2006-09-27 at 14:28 -0700, Rohit Seth wrote: 
 Rohit,
 
 For 1-4, I understand the rationale. But, your implementation deviates
 from the current behavior of the VM subsystem which could affect the
 ability of these patches getting into mainline.
 
 IMO, the current behavior in terms of reclamation, LRU, vm_swappiness,
 and writeback logic should be maintained.
 
 > On Wed, 2006-09-27 at 12:50 -0700, Chandra Seetharaman wrote:
 > > Rohit,
 > >
 > > I finally looked into your memory controller patches. Here are some of
 > > the issues I see:
 > > (All points below are in the context of page limit of containers being
 > > hit and the new code starts freeing up pages)
 > >
 > > 1. LRU is ignored totally thereby thrashing the working set (as pointed
 > >    by Peter Zijlstra).
 >
 > As the container goes over the limit, this algorithm deactivates some of
 > the pages.  I agree that the logic to find out the correct pages to
 > deactivate needs to be improved.  But the idea is that these pages go in
 > front of inactive list so that if there is any memory pressure system
 > wide then these pages can easily be reclaimed.
 >
 > > 2. Frees up file pages first when hitting the page limit thereby making
 > >    vm_swappiness ineffective.
 >
 > Not sure if I understood this part correctly.  But the choice when the
 > container goes over its limit is between swap out some of the anonymous
 > memory first or writeback some of the dirty file pages belonging to this
 > container.
 >
 > > 3. Starts writing back pages when the # of file pages is close to the
 > >    limit, thereby breaking the current writeback algorithm/logic.
 >
 > That is done so as to ensure processes belonging to container (Whose
 > limit is hit) are the first ones getting penalized.  For example, if you
 > run a tar in a container with 100MB limit then the dirty file pages will
 > be written back to disk when 100MB limit is hit).  Though I will be
 > adding a HARD_LIMIT on page cache flag and the strict limit will be only
 > maintained if this container flag is set.
 >
 > > 4. MAPPED files are not counted against the page limit. why ?. This
 > >    affects reclamation behavior and makes vm_swappiness ineffective.
 >
 > num_mapped_pages only indicates how many page cache pages are mapped in
 > user page tables.  More of an accounting variable.
 
 But the # of mapped pages is used in the reclamation path logic. This set
 of patches doesn't take them into account.
 
 >
 > > 5. Starts freeing up pages from the first task or the first file in the
 > >    linked list. This logic unfairly penalizes the early members of the
 > >    list.
 >
 > This is the part that I've to fix.  Some per container variables that
 > remembers the last values will help here.
 
 Yes, that will help in fairness between the items in the list.
 
 But, it will still suffer from (1) above, as we would have no idea of
 the current working set (LRU) (within an item or among the items).
 
 >
 > > 6. Both active and inactive pages use physical pages. But, the
 > >    controller only counts active pages and not inactive pages. why ?
 >
 > The thought is, it is okay for containers to go over its limit as long
 
 Real number of "physical pages" used by the container is the sum of
 active and inactive pages.
 
 My question is, shouldn't that be used to check against page limit
 instead of active pages alone ?
 
 How do we describe "page limit" (to the user)?
 
 > as there is enough memory in the system. When there is any memory
 > pressure then the inactive (+ dereferenced) pages get swapped out thus
 > penalizing the container.  I'm also thinking of having hard limit for
 
 Reclamation goes through active pages and page cache pages before it
 gets into inactive pages. So, this may not work as you are explaining.
 
 > anonymous pages beyond which the container will not be able to grow its
 > anonymous pages.
 
 You might break the current behavior (memory pressure must be very high
 before these start failing) if you are going to be strict about it.
 
 >
 > > 7. Page limit is checked against the sum of (anon and file pages) in
 > >    some places and against active pages at some other places. IMO, it
 > >    should be always compared to the same value.
 > >
 > It is checked against sum of anon+file pages at the time when new pages
 
 why can't we check against active pages here ?
 
 > is getting allocated.  But as the reclaimer activate the pages, so it is
 > also important to make sure the number of active pages is not going
 > above its limit.
 
 My point is that they won't be same (ever) and hence the check is
 inconsistent.
 
 Also, if it is this way, how do we describe the purpose of "page
 limit" (for the user).
 
 --
 
 ------------------------------------------------------------ ----------
 Chandra Seetharaman               | Be careful what you choose....
 - sekharan@us.ibm.com   |      .......you may get it.
 ------------------------------------------------------------ ----------
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction [message #6941 is a reply to message #6929] | Thu, 28 September 2006 08:01   |  
			| 
				
				
					|  Balbir Singh Messages: 491
 Registered: August 2006
 | Senior Member |  |  |  
	| Chandra Seetharaman wrote:
 > On Wed, 2006-09-27 at 14:28 -0700, Rohit Seth wrote:
 >
 > Rohit,
 >
 > For 1-4, I understand the rationale. But, your implementation deviates
 > from the current behavior of the VM subsystem which could affect the
 > ability of these patches getting into mainline.
 >
 > IMO, the current behavior in terms of reclamation, LRU, vm_swappiness,
 > and writeback logic should be maintained.
 >
 
 <snip>
 
 Hi, Rohit,
 
 I have been playing around with the containers patch. I finally got
 around to reading the code.
 
 
 1. Comments on reclaiming
 
 You could try the following options to overcome some of the disadvantages of the
 current scheme.
 
 (a) You could consider a reclaim path based on Dave Hansen's Challenged memory
 controller (see http://marc.theaimsgroup.com/?l=linux-mm&m=115566982532345&w=2).
 
 (b) The other option is to do what the resource group memory controller does -
 build a per group LRU list of pages (active, inactive) and reclaim
 them using the existing code (by passing the correct container pointer,
 instead of the zone pointer). One disadvantage of this approach is that
 the global reclaim is impacted as the global LRU list is broken. At the
 expense of another list, we could maintain two lists, global LRU and
 container LRU lists. Depending on the context of the reclaim - (container
 over limit, memory pressure) we could update/manipulate both lists.
 This approach is definitely very expensive.
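
 A toy model of the two-list variant in (b), to make the bookkeeping cost visible. This is illustrative Python under stated assumptions, not the resource group controller's code: `DualLru` and its methods are hypothetical names, and each page is assumed to belong to exactly one container at a time.

```python
from collections import OrderedDict


class DualLru:
    """Toy model: every page sits on a global LRU and on its container's LRU."""

    def __init__(self):
        self.global_lru = OrderedDict()  # page -> container, oldest first
        self.container_lru = {}          # container -> OrderedDict of pages

    def touch(self, page, container):
        """Record a reference: move the page to the recent end of both lists."""
        old = self.global_lru.pop(page, None)
        if old is not None:
            del self.container_lru[old][page]
        self.global_lru[page] = container
        self.container_lru.setdefault(container, OrderedDict())[page] = True

    def reclaim_from_container(self, container):
        """Container over its limit: evict that container's oldest page."""
        page, _ = self.container_lru[container].popitem(last=False)
        del self.global_lru[page]
        return page

    def reclaim_global(self):
        """System-wide memory pressure: evict the globally oldest page."""
        page, container = self.global_lru.popitem(last=False)
        del self.container_lru[container][page]
        return page


lru = DualLru()
for page, owner in [("p1", "a"), ("p2", "b"), ("p3", "a"), ("p1", "a")]:
    lru.touch(page, owner)

evicted_for_limit = lru.reclaim_from_container("a")  # oldest page of "a"
evicted_for_pressure = lru.reclaim_global()          # oldest page overall
```

 Every reference and every eviction has to update both lists, which is where the extra expense comes from.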
 
 2. Comments on task migration support
 
 (a) One of the issues I found while using the container code is that one could
 add a task to a container, say "a", and "a" gets charged for the task's usage.
 When the same task later moves to a different container, say "b", and then
 exits, the credit goes to "b" and "a" remains indefinitely charged.
 
 (b) For tasks addition and removal, I think it's probably better to move
 the entire process (thread group) rather than allow each individual thread
 to move across containers. Having threads belonging to the same process
 reside in different containers can be complex to handle, since they
 share the same VM. Do you have a scenario where the above condition
 would be useful?
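
 The accounting problem in 2(a) can be illustrated with a toy model in which the charge follows the task when it moves. This is a sketch in Python, not the container patch; `Group`, `Task`, `attach`, and `task_exit` are hypothetical names, and a task's usage is reduced to a single page count.

```python
class Group:
    """Toy container that only tracks how many pages are charged to it."""

    def __init__(self, name):
        self.name = name
        self.charged_pages = 0


class Task:
    def __init__(self, pages):
        self.pages = pages
        self.group = None


def attach(task, group):
    """Move a task, crediting the old group before charging the new one."""
    if task.group is not None:
        task.group.charged_pages -= task.pages
    group.charged_pages += task.pages
    task.group = group


def task_exit(task):
    """On exit, credit whichever group currently holds the charge."""
    if task.group is not None:
        task.group.charged_pages -= task.pages
        task.group = None


a, b = Group("a"), Group("b")
t = Task(pages=10)
attach(t, a)   # "a" is charged 10 pages
attach(t, b)   # charge moves: "a" is credited, "b" is charged
task_exit(t)   # "b" is credited
```

 If `attach` skipped the credit to the old group, "a" would stay at 10 pages after the task exits, which is the behavior described above.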
 
 
 --
 
 Warm Regards,
 Balbir Singh,
 Linux Technology Center,
 IBM Software Labs
 
 PS: Chandra, I hope the details of the resource group memory controller
 are correct.
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction [message #6962 is a reply to message #6929] | Thu, 28 September 2006 18:12   |  
			| 
				
				
					|  Rohit Seth Messages: 101
 Registered: August 2006
 | Senior Member |  |  |  
	| On Wed, 2006-09-27 at 15:24 -0700, Chandra Seetharaman wrote:
 > On Wed, 2006-09-27 at 14:28 -0700, Rohit Seth wrote:
 >
 > Rohit,
 >
 > For 1-4, I understand the rationale. But, your implementation deviates
 > from the current behavior of the VM subsystem which could affect the
 > ability of these patches getting into mainline.
 >
 
 I agree that this implementation differs from the existing VM subsystem.
 But the key point here is that it puts up for reclamation the pages that
 should be reclaimed. And this part needs further refining.
 
 > IMO, the current behavior in terms of reclamation, LRU, vm_swappiness,
 > and writeback logic should be maintained.
 >
 
 How?  I don't want to duplicate the whole logic for containers.
 
 > > On Wed, 2006-09-27 at 12:50 -0700, Chandra Seetharaman wrote:
 > > > Rohit,
 > > >
 > > > I finally looked into your memory controller patches. Here are some of
 > > > the issues I see:
 > > > (All points below are in the context of page limit of containers being
 > > > hit and the new code starts freeing up pages)
 > > >
 > > > 1. LRU is ignored totally thereby thrashing the working set (as pointed
 > > >    by Peter Zijlstra).
 > >
 > > As the container goes over the limit, this algorithm deactivates some of
 > > the pages.  I agree that the logic to find out the correct pages to
 > > deactivate needs to be improved.  But the idea is that these pages go in
 > > front of inactive list so that if there is any memory pressure system
 > > wide then these pages can easily be reclaimed.
 > >
 > > > 2. Frees up file pages first when hitting the page limit thereby making
 > > >    vm_swappiness ineffective.
 > >
 > > Not sure if I understood this part correctly.  But the choice when the
 > > container goes over its limit is between swap out some of the anonymous
 > > memory first or writeback some of the dirty file pages belonging to this
 > > container.
 > >
 > > > 3. Starts writing back pages when the # of file pages is close to the
 > > >    limit, thereby breaking the current writeback algorithm/logic.
 > >
 > > That is done so as to ensure processes belonging to container (Whose
 > > limit is hit) are the first ones getting penalized.  For example, if you
 > > run a tar in a container with 100MB limit then the dirty file pages will
 > > be written back to disk when 100MB limit is hit).  Though I will be
 > > adding a HARD_LIMIT on page cache flag and the strict limit will be only
 > > maintained if this container flag is set.
 > >
 > > > 4. MAPPED files are not counted against the page limit. why ?. This
 > > >    affects reclamation behavior and makes vm_swappiness ineffective.
 > >
 > > num_mapped_pages only indicates how many page cache pages are mapped in
 > > user page tables.  More of an accounting variable.
 >
 > But, # of mapped pages is used in the reclamation path logic. These set
 > of patches doesn't take them into account.
 >
 > >
 > > > 5. Starts freeing up pages from the first task or the first file in the
 > > >    linked list. This logic unfairly penalizes the early members of the
 > > >    list.
 > >
 > > This is the part that I've to fix.  Some per container variables that
 > > remembers the last values will help here.
 >
 > Yes, that will help in fairness between the items in the list.
 >
 > But, it will still suffer from (1) above, as we would have no idea of
 > the current working set (LRU) (within an item or among the items).
 >
 
 Please let me know how you propose to have another LRU for pages in
 containers.  I can add some heuristics, though.
 
 > >
 > > > 6. Both active and inactive pages use physical pages. But, the
 > > >    controller only counts active pages and not inactive pages. why ?
 > >
 > > The thought is, it is okay for containers to go over its limit as long
 >
 > Real number of "physical pages" used by the container is the sum of
 > active and inactive pages.
 >
 
 From the user pov, the real sum of pages that are used by container for
 user land is anon + file.  Now some times it is possible that there are
 active pages that are neither in page cache nor in use as anon.
 
 
 > My question is, shouldn't that be used to check against page limit
 > instead of active pages alone ?
 I can use active+inactive as the test.  Sure.  But I will have to also
 still have a check to make sure that number of active pages themselves
 is not bigger than page_limit.
 
 > How do we describe "page limit" as (to the user) ?
 >
 
 Amount of memory below which no container throttling will happen.  And
 if the system is properly configured, it also ensures that this much
 memory will always be available to the user.  If a container goes over
 this limit then it will be throttled and its performance will suffer.
 
 > > as there is enough memory in the system. When there is any memory
 > > pressure then the inactive (+ dereferenced) pages get swapped out thus
 > > penalizing the container.  I'm also thinking of having hard limit for
 >
 > Reclamation goes through active pages and page cache pages before it
 > gets into inactive pages. So, this may not work as you are explaining.
 
 That is a good point.  I'll have to add a check in reclaim so that when
 the system is ready for swap or write back, containers are looked at
 first.
 
 >
 > > anonymous pages beyond which the container will not be able to grow its
 > > anonymous pages.
 >
 > You might break the current behavior (memory pressure must be very high
 > before these starts failing) if you are going to be strict about it.
 >
 
 That feature, when implemented, will be container specific.
 
 > >
 > > > 7. Page limit is checked against the sum of (anon and file pages) in
 > > >    some places and against active pages at some other places. IMO, it
 > > >    should be always compared to the same value.
 > > >
 > > It is checked against sum of anon+file pages at the time when new pages
 >
 > why can't we check against active pages here ?
 >
 > > is getting allocated.  But as the reclaimer activate the pages, so it is
 > > also important to make sure the number of active pages is not going
 > > above its limit.
 >
 > My point is that they won't be same (ever) and hence the check is
 > inconsistent.
 >
 
 The check ensures:
 1- when a new page is getting added, the total sum of pages is
 checked against the limit.
 2- the number of active pages doesn't exceed the limit.

 These two points combined enforce the decision that once the
 container goes over the limit, we scan the pages again to deactivate the
 excess.
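
 A minimal sketch of the two checks as described, assuming a container tracks separate anon, file, and active counters. This is illustrative Python, not the patch code; `ContainerCounts` and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ContainerCounts:
    page_limit: int
    anon: int = 0
    file: int = 0
    active: int = 0

    def over_limit_at_allocation(self) -> bool:
        """Check 1: when a page is added, compare anon + file to the limit."""
        return self.anon + self.file > self.page_limit

    def over_limit_after_activation(self) -> bool:
        """Check 2: the number of active pages must not exceed the limit."""
        return self.active > self.page_limit


c = ContainerCounts(page_limit=100, anon=60, file=50, active=80)
alloc_check = c.over_limit_at_allocation()       # compares 110 with 100
active_check = c.over_limit_after_activation()   # compares 80 with 100
```

 With these sample numbers the first check trips and the second does not, since the two checks compare different quantities against the same limit.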
 
 Thanks,
 -rohit
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction [message #6965 is a reply to message #6941] | Thu, 28 September 2006 18:31   |  
			| 
				
				
					|  Rohit Seth Messages: 101
 Registered: August 2006
 | Senior Member |  |  |  
	| On Thu, 2006-09-28 at 13:31 +0530, Balbir Singh wrote:
 > Chandra Seetharaman wrote:
 > > On Wed, 2006-09-27 at 14:28 -0700, Rohit Seth wrote:
 > >
 > > Rohit,
 > >
 > > For 1-4, I understand the rationale. But, your implementation deviates
 > > from the current behavior of the VM subsystem which could affect the
 > > ability of these patches getting into mainline.
 > >
 > > IMO, the current behavior in terms of reclamation, LRU, vm_swappiness,
 > > and writeback logic should be maintained.
 > >
 >
 > <snip>
 >
 > Hi, Rohit,
 >
 > I have been playing around with the containers patch. I finally got
 > around to reading the code.
 >
 >
 > 1. Comments on reclaiming
 >
 > You could try the following options to overcome some of the disadvantages of the
 > current scheme.
 >
 > (a) You could consider a reclaim path based on Dave Hansen's Challenged memory
 > controller (see http://marc.theaimsgroup.com/?l=linux-mm&m=115566982532345&w=2).
 >
 
 I will go through that.  Did you get a chance to stress the system and
 find any shortcomings that should be resolved?
 
 > (b) The other option is to do what the resource group memory controller does -
 > build a per group LRU list of pages (active, inactive) and reclaim
 > them using the existing code (by passing the correct container pointer,
 > instead of the zone pointer). One disadvantage of this approach is that
 > the global reclaim is impacted as the global LRU list is broken. At the
 > expense of another list, we could maintain two lists, global LRU and
 > container LRU lists. Depending on the context of the reclaim - (container
 > over limit, memory pressure) we could update/manipulate both lists.
 > This approach is definitely very expensive.
 >
 
 Two LRUs is a nice idea.  Though I don't think it will go too far.  It
 will involve adding another set of list pointers to the page structure.
 I agree that the mem handler is not optimal at all, but I don't want to
 make it mimic the kernel reclaimer at the same time.
 
 > 2. Comments on task migration support
 >
 > (a) One of the issues I found while using the container code is that, one could
 > add a task to a container say "a". "a" gets charged for the tasks usage,
 > when the same task moves to a different container say "b", when the task
 > exits, the credit goes to "b" and "a" remains indefinitely charged.
 >
 Hmm, when the task is removed from "a" then "a" gets the credits for the
 amount of anon memory that is used by the task.  Or do you mean
 something different?
 
 > (b) For tasks addition and removal, I think it's probably better to move
 > the entire process (thread group) rather than allow each individual thread
 > to move across containers. Having threads belonging to the same process
 > reside in different containers can be complex to handle, since they
 > share the same VM. Do you have a scenario where the above condition
 > would be useful?
 >
 >
 I don't have a scenario where a task actually gets to move out of a
 container (except exit).  That asynchronous removal of tasks has already
 made the code very complicated with respect to locking etc.  But if you
 think moving a thread group is useful then I will add that functionality.
 
 Thanks,
 -rohit
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction [message #6970 is a reply to message #6962] | Thu, 28 September 2006 20:23   |  
			| 
				
				
					|  Chandra Seetharaman Messages: 88
 Registered: August 2006
 | Member |  |  |  
	| On Thu, 2006-09-28 at 11:12 -0700, Rohit Seth wrote:
 > On Wed, 2006-09-27 at 15:24 -0700, Chandra Seetharaman wrote:
 > > On Wed, 2006-09-27 at 14:28 -0700, Rohit Seth wrote:
 > >
 > > Rohit,
 > >
 > > For 1-4, I understand the rationale. But, your implementation deviates
 > > from the current behavior of the VM subsystem which could affect the
 > > ability of these patches getting into mainline.
 > >
 >
 > I agree that this implementation differs from existing VM subsystem.
 > But the key point here is, it puts the pages that should be reclaimed.
 
 But, you are putting the pages up for reclamation without any
 consideration to the working set and system memory pressure.
 
 > And this part needs further refining.
 >
 > > IMO, the current behavior in terms of reclamation, LRU, vm_swappiness,
 > > and writeback logic should be maintained.
 > >
 >
 > How?  I don't want to duplicate the whole logic for containers.
 
 We don't have to be duplicating the whole logic. Just make sure that the
 existing mechanisms are aware of containers, if they exist.
 
 <snip>
 
 > >
 > > But, it will still suffer from (1) above, as we would have no idea of
 > > the current working set (LRU) (within an item or among the items).
 > >
 >
 > Please let me know how do you propose to have another LRU for pages in
 > containers.  Though I can add some heuristics.
 
 There are multiple ways, as Balbir pointed out in his email:
 - reclamation per container (as in current RG implementation)
 ( + do a system wide reclaim when the system pressure is high)
 - reclaim with the knowledge of containers that are over limit
 (Dave Hansen's patches + avoid overhead of combing the list)
 - have two lists one for the system and one per container
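
 As a rough illustration of the second option (container-aware reclaim over a single global list), here is a toy model in Python. It is one possible reading of "reclaim with the knowledge of containers that are over limit", not Dave Hansen's patch; `select_reclaim_candidates` and its arguments are hypothetical.

```python
def select_reclaim_candidates(global_lru, usage, limits, want):
    """Walk the LRU oldest-first, picking pages of over-limit containers.

    global_lru: list of (page, container) tuples, oldest first.
    usage, limits: dicts mapping container -> page counts.
    want: maximum number of pages to pick.
    """
    usage = dict(usage)  # work on a copy; the caller's counts are untouched
    picked = []
    for page, container in global_lru:
        if len(picked) == want:
            break
        # Only containers still above their limit give up pages, so the
        # LRU order is respected within each over-limit container.
        if usage[container] > limits[container]:
            picked.append(page)
            usage[container] -= 1
    return picked


lru = [("p1", "a"), ("p2", "b"), ("p3", "a"), ("p4", "b"), ("p5", "b")]
candidates = select_reclaim_candidates(
    lru, usage={"a": 2, "b": 3}, limits={"a": 2, "b": 1}, want=5
)
```

 The walk still visits pages of containers that are within their limits, which is the list-combing overhead mentioned above.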
 
 >
 > > >
 > > > > 6. Both active and inactive pages use physical pages. But, the
 > > > >    controller only counts active pages and not inactive pages. why ?
 > > >
 > > > The thought is, it is okay for containers to go over its limit as long
 > >
 > > Real number of "physical pages" used by the container is the sum of
 > > active and inactive pages.
 > >
 >
 > From the user pov, the real sum of pages that are used by container for
 > user land is anon + file.  Now some times it is possible that there are
 > active pages that are neither in page cache nor in use as anon.
 >
 >
 > > My question is, shouldn't that be used to check against page limit
 > > instead of active pages alone ?
 > I can use active+inactive as the test.  Sure.  But I will have to also
 > still have a check to make sure that number of active pages themselves
 > is not bigger than page_limit.
 >
 > > How do we describe "page limit" as (to the user) ?
 > >
 >
 > Amount of memory below which no container throttling will happen.  And
 
 But, from the implementation one cannot clearly derive what we mean by
 "memory" here (physical ?, file + anon ?; if we say physical, it is not
 correct).
 
 > if the system is properly configured that it also ensures that this much
 > memory will always be there to user.  If a container goes over this
 > limit then it will be throttled and it will suffer performance.
 
 But, the user's expectation would be that we would be throwing out pages
 based on LRU (within that container). But this implementation doesn't
 provide that behavior. It doesn't care about the working set.
 
 The performance impact will be smaller if we consider the working set and
 throw out pages based on LRU (within a container).
 
 >
 > > > as there is enough memory in the system. When there is any memory
 > > > pressure then the inactive (+ dereferenced) pages get swapped out thus
 > > > penalizing the container.  I'm also thinking of having hard limit for
 > >
 > > Reclamation goes through active pages and page cache pages before it
 > > gets into inactive pages. So, this may not work as you are explaining.
 >
 > That is a good point.  I'll have to make a check in reclaim so that when
 > the system is ready for swap or write back then containers are looked
 > first.
 >
 > >
 > > > anonymous pages beyond which the container will not be able to grow its
 > > > anonymous pages.
 > >
 > > You might break the current behavior (memory pressure must be very high
 > > before these starts failing) if you are going to be strict about it.
 > >
 >
 > That feature when implemented will be a container specific.
 
 My point is, even though it is container specific, the behavior (inside
 a container) should be same as what a user sees at the system level now.
 
 For example, consider a workload that is run on a 1G system now, where the
 user sees only occasional memory allocation failures and just a handful
 of oom kills. When the workload is moved to a container with 1G, the
 failures the user sees should be of the same order (and similarly for
 performance characteristics).
 
 Do you agree that it will be the user's expectation ?
 
 >
 > > >
 > > > > 7. Page limit is checked against the sum of (anon and file pages) in
 > > > >    some places and against active pages at some other places. IMO, it
 > > > >    should be always compared to the same value.
 > > > >
 > > > It is checked against sum of anon+file pages at the time when new pages
 > >
 > > why can't we check against active pages here ?
 > >
 > > > is getting allocated.  But as the reclaimer activate the pages, so it is
 > > > also important to make sure the number of active pages is not going
 > > > above its limit.
 > >
 > > My point is that they won't be same (ever) and hence the check is
 > > inconsistent.
 > >
 >
 > The check ensures
 > 1- when a new page is getting added then the total sum of pages is
 > checked against the limit.
 > 2- Number of active pages don't exceed the limit.
 >
 > These two points combined together enforce the decision that once the
 > container goes over the limit, we scan the pages again to deactivate the
 > excess.
 
 Again, I understand the rationale. But it is not consistent.
 >
 > Thanks,
 > -rohit
 >
 --
 
 ------------------------------------------------------------ ----------
 Chandra Seetharaman               | Be careful what you choose....
 - sekharan@us.ibm.com   |      .......you may get it.
 ------------------------------------------------------------ ----------
 |  
	|  |  |  
	| 
		
			| Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction [message #6973 is a reply to message #6970] | Thu, 28 September 2006 21:38   |  
			| 
				
				
					|  Rohit Seth Messages: 101
 Registered: August 2006
 | Senior Member |  |  |  
	| On Thu, 2006-09-28 at 13:23 -0700, Chandra Seetharaman wrote:
 > On Thu, 2006-09-28 at 11:12 -0700, Rohit Seth wrote:
 > > On Wed, 2006-09-27 at 15:24 -0700, Chandra Seetharaman wrote:
 > > > On Wed, 2006-09-27 at 14:28 -0700, Rohit Seth wrote:
 > > >
 > > > Rohit,
 > > >
 > > > For 1-4, I understand the rationale. But, your implementation deviates
 > > > from the current behavior of the VM subsystem which could affect the
 > > > ability of these patches getting into mainline.
 > > >
 > >
 > > I agree that this implementation differs from existing VM subsystem.
 > > But the key point here is, it puts the pages that should be reclaimed.
 >
 > But, you are putting the pages up for reclamation without any
 > consideration to the working set and system memory pressure.
 >
 
 And I agree with you that some heuristics need to be put there to make
 that algorithm go better.
 
 > > And this part needs further refining.
 > >
 > > > IMO, the current behavior in terms of reclamation, LRU, vm_swappiness,
 > > > and writeback logic should be maintained.
 > > >
 > >
 > > How?  I don't want to duplicate the whole logic for containers.
 >
 > We don't have to be duplicating the whole logic. Just make sure that the
 > existing mechanisms are aware of containers, if they exist.
 >
 
 The next version is going to have hooks in kernel reclaim path for
 containers.  But that will still not make it close to what normal
 reclaim path does for pages outside containers.
 
 > <snip>
 >
 > > >
 > > > But, it will still suffer from (1) above, as we would have no idea of
 > > > the current working set (LRU) (within an item or among the items).
 > > >
 > >
 > > Please let me know how do you propose to have another LRU for pages in
 > > containers.  Though I can add some heuristics.
 >
 > There are multiple ways as Balbir pointed in his email:
 >  - reclamation per container (as in current RG implementation)
 >    ( + do a system wide reclaim when the system pressure is high)
 >  - reclaim with the knowledge of containers that are over limit
 >    (Dave Hansen's patches + avoid overhead of combing the list)
 >  - have two lists one for the system and one per container
 >
 
 I will look at Dave's patch.  Having two different lists is not the right
 approach.  I will add some reclaim logic in the kernel reclaim path.

 Any idea why the current RG implementation is not in mainline?  Is there
 any effort to revive it and get it into Andrew's tree?
 
 > >
 > > > >
 > > > > > 6. Both active and inactive pages use physical pages. But, the
 > > > > >    controller only counts active pages and not inactive pages. why ?
 > > > >
 > > > > The thought is, it is okay for containers to go over its limit as long
 > > >
 > > > Real number of "physical pages" used by the container is the sum of
 > > > active and inactive pages.
 > > >
 > >
 > > From the user pov, the real sum of pages that are used by container for
 > > user land is anon + file.  Now some times it is possible that there are
 > > active pages that are neither in page cache nor in use as anon.
 > >
 > >
 > > > My question is, shouldn't that be used to check against page limit
 > > > instead of active pages alone ?
 > > I can use active+inactive as the test.  Sure.  But I will have to also
 > > still have a check to make sure that number of active pages themselves
 > > is not bigger than page_limit.
 > >
 > > > How do we describe "page limit" as (to the user) ?
 > > >
 > >
 > > Amount of memory below which no container throttling will happen.  And
 >
 > But, from the implementation one cannot clearly derive what we mean by
 > "memory" here (physical ?, file + anon ?; if we say physical, it is not
 > correct).
 >
 
 It is mostly correct, i.e. anon+file == total user physical memory for
 the container.  The exceptions are corner cases where there could be stale
 pagecache pages that are no longer in the page cache but still on the LRU
 (not yet reclaimed).
 
 > > if the system is properly configured that it also ensures that this much
 > > memory will always be there to user.  If a container goes over this
 > > limit then it will be throttled and it will suffer performance.
 >
 > But, the user's expectation would be that we would be throwing out pages
 > based on LRU (within that container). But this implementation doesn't
 > provide that behavior. It doesn't care about the working set.
 >
 
 IMO, user is expected to live inside the limits when containers are
 defined.  If the limits are exceeded then some performance impact will
 happen.  Having said that though I would still like to get some
 optimizations in memory handler so that more appropriate pages are
 deactivated.
 
 > Performance impact will be lesser if we consider the working set and
 > throw out pages based on LRU (within a container).
 >
 I don't deny it.  But two separate LRUs is not an option.
 
 > >
 > > > > as there is enough memory in the system. When there is any memory
 > > > > pressure then the inactive (+ dereferenced) pages get swapped out thus
 > > > > penalizing the container.  I'm also thinking of having hard limit for
 > > >
 > > > Reclamation goes through active pages and page cache pages before it
 > > > gets into inactive pages. So, this may not work as you are explaining.
 > >
 > > That is a good point.  I'll have to make a check in reclaim so that when
 > > the system is ready for swap or write back then containers are looked
 > > first.
 > >
 > > >
 > > > > anonymous pages beyond which the container will not be able to grow its
 > > > > anonymous pages.
 > > >
 > > > You might break the current behavior (memory pressure must be very high
 > > > before these starts failing) if you are going to be strict about it.
 > > >
 > >
 > > That feature when implemented will be a container specific.
 >
 > My point is, even though it is container specific, the behavior (inside
 > a container) should be same as what a user sees at the system level now.
 >
 > For example, consider a workload that is run on a 1G system now, and
 > user sees only occasional memory allocation failures and just a handful
 > of oom kills. When the workload is moved to a container with 1G,
 > failures the user see should be in the same order ( and similar with
 > performance characteristics).
 >
 > Do you agree that it will be the user's expectation ?
 >
 
 That will be a nice-to-have feature.  And for that, any container
 implementation will have to be as tightly intertwined with the rest of
 the vm as cpuset is.
 
 thanks,
 -rohit
 >
 |  
	|  |  | 
 
 