| Home » Mailing lists » Devel » [RFD] L2 Network namespace infrastructure Goto Forum:
	| 
		
			| [RFD] L2 Network namespace infrastructure [message #19089] | Fri, 22 June 2007 19:39  |  
			| 
				
				
					|  ebiederm Messages: 1354
 Registered: February 2006
 | Senior Member |  |  |  
	| Currently all of the prerequisite work for implementing a network
namespace (i.e. virtualization of the network stack with one kernel)
has already been merged or is in the process of being merged.
Therefore it is now time for a bit of high level design review of
the network namespace work and time to begin sending patches.
-- User space semantics
If you are in a different network namespace it looks like you have a
separate independent copy of the network stack.
User visible kernel structures that will appear to be per network
namespace include network devices, routing tables, sockets, and
netfilter rules.
-- The basic design
There will be a network namespace structure that holds the global
variables for a network namespace, making those global variables
per network namespace.
One of those per network namespace global variables will be the
loopback device.  Which means the network namespace a packet resides
in can be found simply by examining the network device or the socket
the packet is traversing.
Either a pointer to this global structure will be passed into
the functions that need to reference per network namespace variables
or a structure that is already passed in (such as the network device)
will be modified to contain a pointer to the network namespace
structure.
Depending upon the data structure it will either be modified to hold
a per entry network namespace pointer or it there will be a separate
copy per network namespace.  For large global data structures like
the ipv4 routing cache hash table adding an additional pointer to the
entries appears the more reasonable solution.
The initialization and cleanup functions will be refactored into
functions that do the work on a network namespace basis and functions
that perform truly global initialization and cleanup.  And a
registration mechanism will be available to register functions that
are per network namespace.
It is a namespace so like the other namespaces that have been
implemented a clone flag will exist to create the namespace during
clone or unshare.
There will be an additional network stack feature that will allow
you to migrate network devices between namespaces.
When complete all of the features of the network stack ipv4, ipv6,
decnet, sysctls, virtual devices, routing tables, scheduling, ipsec,
netfilter, etc  should be able to operate in a per network namespace
fashion.
--- The implementation plan
The plan for implementing this is to first get network namespace
infrastructure merged.  So that pieces of the network stack can be
made to operate in a per network namespace fashion.
Then the plan is to proceed as if we are doing a global kernel lock
removal.  For each layer of the networking stack pass down the per
network namespace parameter to the functions and modify the functions
to verify they are only operating on the initial network namespace.
Then one piece at a time update the code to handle working in
multiple network namespaces, and push the network namespace
information down to the lower levels.
This plan calls for a lot of patches that are essentially noise.  But
the result is simple and generally obviously correct patches, that can
be easily reviewed, and can be safely merged one at a time, and don't
impose any additional ongoing maintenance overhead.
In my current proof of concept patchset it takes about 100 patches
before ipv4 is up and working.
--- Performance
In initial measurements the only performance overhead we have been
able to measure is getting the packet to the network namespace.
Going through ethernet bridging or routing seems to trigger copies
of the packet that slow things down.  When packets go directly to
the network namespace no performance penalty has yet been measured.
--- The question
At the design level does this approach sound reasonable?
Eric
p.s.  I will follow up shortly with a patch that is one implementation
of the basic network namespace infrastructure.  Feel free to cut it
to shreds (as it is likely overkill) but it should help put the pieces
of what I am talking about into perspective.
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| [PATCH] net: Basic network namespace infrastructure. [message #19090 is a reply to message #19089] | Fri, 22 June 2007 21:22   |  
			| 
				
				
					|  ebiederm Messages: 1354
 Registered: February 2006
 | Senior Member |  |  |  
	| This is the basic infrastructure needed to support network
namespaces.  This infrastructure is:
- Registration functions to support initializing per network
  namespace data when a network namespaces is created or destroyed.
- struct net.  The network namespace datastructure.
  This structure will grow as variables are made per network
  namespace but this is the minimal starting point.
- Functions to grab a reference to the network namespace.
  I provide both get/put functions that keep a network namespace
  from being freed.  And hold/release functions serve as weak references
  and will warn if their count is not zero when the data structure
  is freed.  Useful for dealing with more complicated data structures
  like the ipv4 route cache.
- A list of all of the network namespaces so we can iterate over them.
- A slab for the network namespace data structures allowing leaks
  to be spotted.
I have deliberately chosen to not make it possible to compile out the
code as the support for per-network namespace initialization and
uninitialization needs to always be compiled in once code has
started using it (even if we don't have network namespaces,
and because no one has ever measured any performance overhead
specific to network namespace infrastructure.  As code
to compile out the network namespace pointers etc is complicated
it is best to avoid that code unless that complexity is justified.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 include/net/net_namespace.h |   66 ++++++++++
 net/core/Makefile           |    2 +-
 net/core/net_namespace.c    |  291 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 358 insertions(+), 1 deletions(-)
 create mode 100644 include/net/net_namespace.h
 create mode 100644 net/core/net_namespace.c
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
new file mode 100644
index 0000000..c909b3a
--- /dev/null
+++ b/include/net/net_namespace.h
@@ -0,0 +1,66 @@
+/*
+ * Operations on the network namespace
+ */
+#ifndef __NET_NET_NAMESPACE_H
+#define __NET_NET_NAMESPACE_H
+
+#include <asm/atomic.h>
+#include <linux/workqueue.h>
+#include <linux/list.h>
+
+struct net {
+	atomic_t count;		/* To decided when the network namespace
+				 * should go
+				 */
+	atomic_t use_count;	/* For references we destroy on demand */
+	struct list_head list;	/* list of network namespace structures */
+	struct work_struct work;	/* work struct for freeing */
+};
+
+extern struct net init_net;
+extern struct list_head net_namespace_list;
+
+extern void __put_net(struct net *net);
+
+static inline struct net *get_net(struct net *net)
+{
+	atomic_inc(&net->count);
+	return net;
+}
+
+static inline void put_net(struct net *net)
+{
+	if (atomic_dec_and_test(&net->count))
+		__put_net(net);
+}
+
+static inline struct net *hold_net(struct net *net)
+{
+	atomic_inc(&net->use_count);
+	return net;
+}
+
+static inline void release_net(struct net *net)
+{
+	atomic_dec(&net->use_count);
+}
+
+extern void net_lock(void);
+extern void net_unlock(void);
+
+#define for_each_net(VAR)				\
+	list_for_each_entry(VAR, &net_namespace_list, list);
+
+
+struct pernet_operations {
+	struct list_head list;
+	int (*init)(struct net *net);
+	void (*exit)(struct net *net);
+};
+
+extern int register_pernet_subsys(struct pernet_operations *);
+extern void unregister_pernet_subsys(struct pernet_operations *);
+extern int register_pernet_device(struct pernet_operations *);
+extern void unregister_pernet_device(struct pernet_operations *);
+
+#endif /* __NET_NET_NAMESPACE_H */
diff --git a/net/core/Makefile b/net/core/Makefile
index 4751613..ea9b3f3 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -3,7 +3,7 @@
 #
 
 obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o stream.o scm.o \
-	 gen_stats.o gen_estimator.o
+	 gen_stats.o gen_estimator.o net_namespace.o
 
 obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
 
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
new file mode 100644
index 0000000..397c15f
--- /dev/null
+++ b/net/core/net_namespace.c
@@ -0,0 +1,291 @@
+#include <linux/workqueue.h>
+#include <linux/rtnetlink.h>
+#include <linux/cache.h>
+#include <linux/slab.h>
+#include <linux/list.h>
+#include <linux/delay.h>
+#include <net/net_namespace.h>
+
+/*
+ *	Our network namespace constructor/destructor lists
+ */
+
+static LIST_HEAD(pernet_list);
+static struct list_head *first_device = &pernet_list;
+static DEFINE_MUTEX(net_mutex);
+
+static DEFINE_MUTEX(net_list_mutex);
+LIST_HEAD(net_namespace_list);
+
+static struct kmem_cache *net_cachep;
+
+struct net init_net;
+
+void net_lock(void)
+{
+	mutex_lock(&net_list_mutex);
+}
+
+void net_unlock(void)
+{
+	mutex_unlock(&net_list_mutex);
+}
+
+static struct net *net_alloc(void)
+{
+	return kmem_cache_alloc(net_cachep, GFP_KERNEL);
+}
+
+static void net_free(struct net *net)
+{
+	if (!net)
+		return;
+
+	if (unlikely(atomic_read(&net->use_count) != 0)) {
+		printk(KERN_EMERG "network namespace not free! Usage: %d\n",
+			atomic_read(&net->use_count));
+		return;
+	}
+
+	kmem_cache_free(net_cachep, net);
+}
+
+static void cleanup_net(struct work_struct *work)
+{
+	struct pernet_operations *ops;
+	struct list_head *ptr;
+	struct net *net;
+
+	net = container_of(work, struct net, work);
+
+	mutex_lock(&net_mutex);
+
+	/* Don't let anyone else find us. */
+	net_lock();
+	list_del(&net->list);
+	net_unlock();
+
+	/* Run all of the network namespace exit methods */
+	list_for_each_prev(ptr, &pernet_list) {
+		ops = list_entry(ptr, struct pernet_operations, list);
+		if (ops->exit)
+			ops->exit(net);
+	}
+
+	mutex_unlock(&net_mutex);
+
+	/* Ensure there are no outstanding rcu callbacks using this
+	 * network namespace.
+	 */
+	rcu_barrier();
+
+	/* Finally it is safe to free my network namespace structure */
+	net_free(net);
+}
+
+
+void __put_net(struct net *net)
+{
+	/* Cleanup the network namespace in process context */
+	INIT_WORK(&net->work, cleanup_net);
+	schedule_work(&net->work);
+}
+EXPORT_SYMBOL_GPL(__put_net);
+
+/*
+ * setup_net runs the initializers for the network namespace object.
+ */
+static int setup_net(struct net *net)
+{
+	/* Must be called with net_mutex held */
+	struct pernet_operations *ops;
+	struct list_head *ptr;
+	int error;
+
+	memset(net, 0, sizeof(struct net));
+	atomic_set(&net->count, 1);
+	atomic_set(&net->use_count, 0);
+
+	error = 0;
+	list_for_each(ptr, &pernet_list) {
+		ops = list_entry(ptr, struct pernet_operations, list);
+		if (ops->init) {
+			error = ops->init(net);
+			if (error < 0)
+				goto out_undo;
+		}
+	}
+out:
+	return error;
+out_undo:
+	/* Walk through the list backwards calling the exit functions
+	 * for the pernet modules whose init functions did not fail.
+	 */
+	for (ptr = ptr->prev; ptr != &pernet_list; ptr = ptr->prev) {
+		ops = list_entry(ptr, struct pernet_operations, list);
+		if (ops->exit)
+			ops->exit(net);
+	}
+	goto out;
+}
+
+static int __init net_ns_init(void)
+{
+	int err;
+
+	printk(KERN_INFO "net_namespace: %zd bytes\n", sizeof(struct net));
+	net_cachep = kmem_cache_create("net_namespace", sizeof(struct net),
+					SMP_CACHE_BYTES,
+					SLAB_PANIC, NULL, NULL);
+	mutex_lock(&net_mutex);
+	err = setup_net(&init_net);
+
+	net_lock();
+	list_add_tail(&init_net.list, &net_namespace_list);
+	net_unlock();
+
+	mutex_unlock(&net_mutex);
+	if (err)
+		panic("Could not setup the initial network namespace");
+
+	return 0;
+}
+
+pure_initcall(net_ns_init);
+
+static int register_pernet_operations(struct list_head *list,
+				      struct pernet_operations *ops)
+{
+	struct net *net, *undo_net;
+	int error;
+
+	error = 0;
+	list_add_tail(&ops->list, list);
+	for_each_net(net) {
+		if (ops->init) {
+			error = ops->init(net);
+			if (error)
+				goto out_undo;
+		}
+	}
+out:
+	return error;
+
+out_undo:
+	/* If I have an error cleanup all namespaces I initialized */
+	list_del(&ops->list);
+	for_each_net(undo_net) {
+		if (undo_net == net)
+			goto undone;
+		if (ops->exit)
+			ops->exit(undo_net);
+	}
+undone:
+	goto out;
+}
+
+static void unregister_pernet_operations(struct pernet_operations *ops)
+{
+	struct net *net;
+
+	list_del(&ops->list);
+	for_each_net(net)
+		if (ops->exit)
+			ops->exit(net);
+}
+
+/**
+ *      register_pernet_subsys - register a network namespace subsystem
+ *	@ops:  pernet operations structure for the subsystem
+ *
+ *	Register a subsystem which has init and exit functions
+ *	that are called when network namespaces are created and
+ *	destroyed respectively.
+ *
+ *	When registered all network namespace init functions are
+ *	called for every existing network namespace.  Allowing kernel
+ *	modules to have a race free view of the set of network namespaces.
+ *
+ *	When a new network namespace is created all of the init
+ *	methods are called in the order in which they were registered.
+ *
+ *	When a network namespace is destroyed all of the exit methods
+ *	are called in the reverse of the order with which they were
+ *	registered.
+ */
+int register_pernet_subsys(struct pernet_operations *ops)
+{
+	int error;
+	mutex_lock(&net_mutex);
+	error =  register_pernet_operations(first_device, ops);
+	mutex_unlock(&net_mutex);
+	return error;
+}
+EXPORT_SYMBOL_GPL(register_pernet_subsys);
+
+/**
+ *      unregister_pernet_subsys - unregister a network namespace subsystem
+ *	@ops: pernet operations structure to manipulate
+ *
+ *	Remove the pernet operations structure from the list to be
+ *	used when network namespaces are created or destoryed.  In
+ *	addition run the exit method for all existing network
+ *	namespaces.
+ */
+void unregister_pernet_subsys(struct pernet_operations *module)
+{
+	mutex_lock(&net...
 
 |  
	|  |  |  
	|  |  
	|  |  
	|  |  
	|  |  
	|  |  
	|  |  
	|  |  
	|  |  
	|  |  
	|  |  
	|  |  
	|  |  
	|  |  
	| 
		
			| Re: [RFD] L2 Network namespace infrastructure [message #19107 is a reply to message #19104] | Sat, 23 June 2007 21:41   |  
			| 
				
				
					|  ebiederm Messages: 1354
 Registered: February 2006
 | Senior Member |  |  |  
	| David Miller <davem@davemloft.net> writes:
> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Sat, 23 Jun 2007 11:19:34 -0600
>
>> Further and fundamentally all a global achieves is removing the need
>> for the noise patches where you pass the pointer into the various
>> functions.  For long term maintenance it doesn't help anything.
>
> I don't accept that we have to add another function argument
> to a bunch of core routines just to support this crap,
> especially since you give no way to turn it off and get
> that function argument slot back.
>
> To be honest I think this form of virtualization is a complete
> waste of time, even the openvz approach.
>
> We're protecting the kernel from itself, and that's an endless
> uphill battle that you will never win.  Let's do this kind of
> stuff properly with a real minimal hypervisor, hopefully with
> appropriate hardware level support and good virtualized device
> interfaces, instead of this namespace stuff.
>
> At least the hypervisor approach you have some chance to fully
> harden in some verifyable and truly protected way, with
> namespaces it's just a pipe dream and everyone who works on
> these namespace approaches knows that very well.
>
> The only positive thing that came out of this work is the
> great auditing that the openvz folks have done and the bugs
> they have found, but it basically ends right there.
Dave thank you for your candor, it looks like I have finally made
the pieces small enough that we can discuss them.
If you want the argument to compile out.  That is not a problem at all.
I dropped that part from my patch because it makes infrastructure more
complicated and there appeared to be no gain.  However having a type
that you can pass that the compiler can optimize away is not a
problem.  Basically you just make the argument:
typedef struct {} you_can_compile_me_out;  /* when you don't want it. */
typedef void * you_can_compile_me_out;     /* when you do want it. */
And gcc will generate no code to pass the argument when you compile
it out.
As far as the hardening goes.  There is definitely a point there,
short of a kernel proof subsystem that sounds correct to me.
There are some other factors that make a different tradeoff interesting.
First hypervisors do not allow global optimizations (because of the
better isolation) so have an inherent performance disadvantage.
Something like a 10x scaling penalty from the figures I have seen.
Even more interesting for me is the possibility of unmodified
application migration.  Where the limiting factor is that
you cannot reliably restore an application because the global
identifiers are not available.
So yes monolithic kernels may have grown so complex that they cannot
be verified and thus you cannot actually keep untrusted users from
doing bad things to each other with any degree of certainty.
However the interesting cases for me are cases where the users are not
aggressively hostile with each other but being stuck with one set of
global identifiers are a problem.
Eric
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	|  |  
	|  |  
	|  |  
	| 
		
			| Re: [RFD] L2 Network namespace infrastructure [message #19114 is a reply to message #19108] | Sun, 24 June 2007 01:28   |  
			| 
				
				
					|  Jeff Garzik Messages: 9
 Registered: February 2006
 | Junior Member |  |  |  
	| Eric W. Biederman wrote:
> Jeff Garzik <jeff@garzik.org> writes:
> 
>> David Miller wrote:
>>> I don't accept that we have to add another function argument
>>> to a bunch of core routines just to support this crap,
>>> especially since you give no way to turn it off and get
>>> that function argument slot back.
>>>
>>> To be honest I think this form of virtualization is a complete
>>> waste of time, even the openvz approach.
>>>
>>> We're protecting the kernel from itself, and that's an endless
>>> uphill battle that you will never win.  Let's do this kind of
>>> stuff properly with a real minimal hypervisor, hopefully with
>>> appropriate hardware level support and good virtualized device
>>> interfaces, instead of this namespace stuff.
>> Strongly seconded.  This containerized virtualization approach just bloats up
>> the kernel for something that is inherently fragile and IMO less secure --
>> protecting the kernel from itself.
>>
>> Plenty of other virt approaches don't stir the code like this, while
>> simultaneously providing fewer, more-clean entry points for the virtualization
>> to occur.
> 
> Wrong.  I really don't want to get into a my virtualization approach is better
> then yours.  But this is flat out wrong.
> 99% of the changes I'm talking about introducing are just:
> - variable 
> + ptr->variable
> 
> There are more pieces mostly with when we initialize those variables but
> that is the essence of the change.
You completely dodged the main objection.  Which is OK if you are 
selling something to marketing departments, but not OK
Containers introduce chroot-jail-like features that give one a false 
sense of security, while still requiring one to "poke holes" in the 
illusion to get hardware-specific tasks accomplished.
The capable/not-capable model (i.e. superuser / normal user) is _still_ 
being secured locally, even after decades of work and whitepapers and 
audits.
You are drinking Deep Kool-Aid if you think adding containers to the 
myriad kernel subsystems does anything besides increasing fragility, and 
decreasing security.  You are securing in-kernel subsystems against 
other in-kernel subsystems.  superuser/user model made that difficult 
enough... now containers add exponential audit complexity to that.  Who 
is to say that a local root does not also pierce the container model?
> And as opposed to other virtualization approaches so far no one has been
> able to measure the overhead.  I suspect there will be a few more cache
> line misses somewhere but they haven't shown up yet.
> 
> If the only use was strong isolation which Dave complains about I would
> concur that the namespace approach is inappropriate.  However there are
> a lot other uses.
Sure there are uses.  There are uses to putting the X server into the 
kernel, too.  At some point complexity and featuritis has to take a back 
seat to basic sanity.
	Jeff
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	|  |  
	|  |  
	|  |  
	|  |  
	|  |  
	|  |  
	| 
		
			| Re: [RFD] L2 Network namespace infrastructure [message #19122 is a reply to message #19104] | Mon, 25 June 2007 15:11   |  
			| 
				
				
					|  serue Messages: 750
 Registered: February 2006
 | Senior Member |  |  |  
	| Quoting David Miller (davem@davemloft.net):
> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Sat, 23 Jun 2007 11:19:34 -0600
> 
> > Further and fundamentally all a global achieves is removing the need
> > for the noise patches where you pass the pointer into the various
> > functions.  For long term maintenance it doesn't help anything.
> 
> I don't accept that we have to add another function argument
> to a bunch of core routines just to support this crap,
> especially since you give no way to turn it off and get
> that function argument slot back.
> 
> To be honest I think this form of virtualization is a complete
> waste of time, even the openvz approach.
> 
> We're protecting the kernel from itself, and that's an endless
> uphill battle that you will never win.  Let's do this kind of
Hi David,
just to be clear this isn't so much about security.  Security can be
provided using selinux, just as with the userid namespace.  But like
with the userid namespace, this provides usability for the virtual
servers, plus some support for restarting checkpointed applications.
That doesn't attempt to justify the extra argument - if you don't
like it, you don't like it  :)  Just wanted to clarify.
thanks,
-serge
> stuff properly with a real minimal hypervisor, hopefully with
> appropriate hardware level support and good virtualized device
> interfaces, instead of this namespace stuff.
> 
> At least the hypervisor approach you have some chance to fully
> harden in some verifyable and truly protected way, with
> namespaces it's just a pipe dream and everyone who works on
> these namespace approaches knows that very well.
> 
> The only positive thing that came out of this work is the
> great auditing that the openvz folks have done and the bugs
> they have found, but it basically ends right there.
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	| 
		
			| Re: [RFD] L2 Network namespace infrastructure [message #19123 is a reply to message #19114] | Mon, 25 June 2007 15:23   |  
			| 
				
				
					|  serue Messages: 750
 Registered: February 2006
 | Senior Member |  |  |  
	| Quoting Jeff Garzik (jeff@garzik.org):
> Eric W. Biederman wrote:
> >Jeff Garzik <jeff@garzik.org> writes:
> >
> >>David Miller wrote:
> >>>I don't accept that we have to add another function argument
> >>>to a bunch of core routines just to support this crap,
> >>>especially since you give no way to turn it off and get
> >>>that function argument slot back.
> >>>
> >>>To be honest I think this form of virtualization is a complete
> >>>waste of time, even the openvz approach.
> >>>
> >>>We're protecting the kernel from itself, and that's an endless
> >>>uphill battle that you will never win.  Let's do this kind of
> >>>stuff properly with a real minimal hypervisor, hopefully with
> >>>appropriate hardware level support and good virtualized device
> >>>interfaces, instead of this namespace stuff.
> >>Strongly seconded.  This containerized virtualization approach just 
> >>bloats up
> >>the kernel for something that is inherently fragile and IMO less secure --
> >>protecting the kernel from itself.
> >>
> >>Plenty of other virt approaches don't stir the code like this, while
> >>simultaneously providing fewer, more-clean entry points for the 
> >>virtualization
> >>to occur.
> >
> >Wrong.  I really don't want to get into a my virtualization approach is 
> >better
> >then yours.  But this is flat out wrong.
> 
> >99% of the changes I'm talking about introducing are just:
> >- variable 
> >+ ptr->variable
> >
> >There are more pieces mostly with when we initialize those variables but
> >that is the essence of the change.
> 
> You completely dodged the main objection.  Which is OK if you are 
> selling something to marketing departments, but not OK
> 
> Containers introduce chroot-jail-like features that give one a false 
> sense of security, while still requiring one to "poke holes" in the 
> illusion to get hardware-specific tasks accomplished.
> 
> The capable/not-capable model (i.e. superuser / normal user) is _still_ 
> being secured locally, even after decades of work and whitepapers and 
> audits.
> 
> You are drinking Deep Kool-Aid if you think adding containers to the 
> myriad kernel subsystems does anything besides increasing fragility, and 
> decreasing security.  You are securing in-kernel subsystems against 
> other in-kernel subsystems.
No we're not.  As the name 'network namespaces' implies, we are
introducing namespaces for network-related variables.  That's it.
We are not trying to protect in-kernel subsystems from each other.  In
fact we're not even trying to protect userspace process from each other.
Though that will in part come free when user processes can't access each
other's data because they are in different namespaces.  But using an
LSM like selinux or a custom one to tag and enforce isolation would
still be encouraged.
> superuser/user model made that difficult 
> enough... now containers add exponential audit complexity to that.  Who 
> is to say that a local root does not also pierce the container model?
At the moment it does.
> >And as opposed to other virtualization approaches so far no one has been
> >able to measure the overhead.  I suspect there will be a few more cache
> >line misses somewhere but they haven't shown up yet.
> >
> >If the only use was strong isolation which Dave complains about I would
> >concur that the namespace approach is inappropriate.  However there are
> >a lot other uses.
> 
> Sure there are uses.  There are uses to putting the X server into the 
> kernel, too.  At some point complexity and featuritis has to take a back 
> seat to basic sanity.
Generally true, yes.
-serge
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	|  |  
	|  |  
	|  |  
	|  |  
	|  |  
	| 
		
			| Re:  Re: [RFD] L2 Network namespace infrastructure [message #19138 is a reply to message #19114] | Wed, 27 June 2007 15:38   |  
			| 
				
				
					|  dev Messages: 1693
 Registered: September 2005
 Location: Moscow
 | Senior Member |  
 |  |  
	| Jeff Garzik wrote:
> Eric W. Biederman wrote:
> 
>>Jeff Garzik <jeff@garzik.org> writes:
>>
>>
>>>David Miller wrote:
>>>
>>>>I don't accept that we have to add another function argument
>>>>to a bunch of core routines just to support this crap,
>>>>especially since you give no way to turn it off and get
>>>>that function argument slot back.
>>>>
>>>>To be honest I think this form of virtualization is a complete
>>>>waste of time, even the openvz approach.
>>>>
>>>>We're protecting the kernel from itself, and that's an endless
>>>>uphill battle that you will never win.  Let's do this kind of
>>>>stuff properly with a real minimal hypervisor, hopefully with
>>>>appropriate hardware level support and good virtualized device
>>>>interfaces, instead of this namespace stuff.
>>>
>>>Strongly seconded.  This containerized virtualization approach just bloats up
>>>the kernel for something that is inherently fragile and IMO less secure --
>>>protecting the kernel from itself.
>>>
>>>Plenty of other virt approaches don't stir the code like this, while
>>>simultaneously providing fewer, more-clean entry points for the virtualization
>>>to occur.
>>
>>Wrong.  I really don't want to get into a my virtualization approach is better
>>then yours.  But this is flat out wrong.
> 
> 
>>99% of the changes I'm talking about introducing are just:
>>- variable 
>>+ ptr->variable
>>
>>There are more pieces mostly with when we initialize those variables but
>>that is the essence of the change.
> 
> 
> You completely dodged the main objection.  Which is OK if you are 
> selling something to marketing departments, but not OK
It is a pure illusion that one kind of virtualization is better then the other one.
Look, maybe *hardware* virtualization and a small size of the hypervisor
make you feel safer, however you totally forget about all the emulation drivers etc.
which can have bugs and security implications as well and maybe triggerable
from inside VM.
> Containers introduce chroot-jail-like features that give one a false 
> sense of security, while still requiring one to "poke holes" in the 
> illusion to get hardware-specific tasks accomplished.
The concepts of users in any OS (and in Linux in particular) give people
a false sense of security as well!
I know a bunch of examples of how one user can crash/DoS/abuse another one
e.g. on RHEL5 or mainstream.
But no one have problems with that and no one thinks that multiuserness
is a step-backward or maybe someone does?!
Yes, there are always bugs and sometimes security implications,
but people leave fine with it when it is fixed.
> The capable/not-capable model (i.e. superuser / normal user) is _still_ 
> being secured locally, even after decades of work and whitepapers and 
> audits.
> You are drinking Deep Kool-Aid if you think adding containers to the 
> myriad kernel subsystems does anything besides increasing fragility, and 
> decreasing security.  You are securing in-kernel subsystems against 
> other in-kernel subsystems.  superuser/user model made that difficult 
> enough... now containers add exponential audit complexity to that.  Who 
> is to say that a local root does not also pierce the container model?
containers do the only thing:
make sure that objects from one context are not visible to another one.
If containers are not used - everything returns to the case as it is now, i.e.
everything is global and globally visible and no auditing overhead at all.
So if you are not interested in containers - the code auditing
won't be noticably harder for you.
>>And as opposed to other virtualization approaches so far no one has been
>>able to measure the overhead.  I suspect there will be a few more cache
>>line misses somewhere but they haven't shown up yet.
>>
>>If the only use was strong isolation which Dave complains about I would
>>concur that the namespace approach is inappropriate.  However there are
>>a lot other uses.
> 
> 
> Sure there are uses.  There are uses to putting the X server into the 
> kernel, too.  At some point complexity and featuritis has to take a back 
> seat to basic sanity.
I agree about sanity, however I totally disagree about complexity
you talk about.
Bugs we face/fix in Linux kernel which are found with the help of
that kind of virtualization makes me believe that Linux kernel
only wins from it.
Thanks,
Kirill
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers |  
	|  |  |  
	|  |  
	|  | 
 
 
 Current Time: Sun Oct 26 14:11:42 GMT 2025 
 Total time taken to generate the page: 0.09938 seconds |