OpenVZ Forum


Re: [PATCH 1/2] namespaces: introduce sys_hijack (v10) [message #23834 is a reply to message #23812] Tue, 27 November 2007 16:11
From: serue
Quoting Crispin Cowan (crispin@crispincowan.com):
> Just the name "sys_hijack" makes me concerned.
> 
> This post describes a bunch of "what", but doesn't tell us about "why"
> we would want this. What is it for?

Please see my response to Casey's email.

> And I second Casey's concern about careful management of the privilege
> required to "hijack" a process.

Absolutely.  We're definitely still in RFC territory.

Note that there are currently several proposed (but no upstream) ways to
accomplish entering a namespace:

	1. bind_ns() is a new pair of syscalls proposed by Cedric.  An
	nsproxy is given an integer id.  The id can be used to enter
	an nsproxy, basically a straight current->nsproxy = target_nsproxy;

	2. I had previously posted a patchset on top of the nsproxy
	cgroup which allowed entering a nsproxy through the ns cgroup
	interface.

There are objections to both those patchsets because simply switching a
task's nsproxy using a syscall or file write in the middle of running a
binary is quite unsafe.  Eric Biederman had suggested using ptrace or
something like it to accomplish the goal.
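
To make option 1 concrete, here is a minimal user-space sketch of an id-based scheme.  It is a toy model only: the struct layouts and the names register_ns() and enter_ns() are hypothetical and are not the bind_ns() interface from Cedric's patches.  It just shows what "a straight pointer switch" amounts to.

```c
#include <stddef.h>

/* Toy stand-ins for kernel structures; the fields are illustrative only. */
struct nsproxy { const char *hostname; };
struct task    { struct nsproxy *nsproxy; };

#define MAX_NS 8
struct nsproxy *ns_table[MAX_NS];	/* integer id -> nsproxy */

/* Hypothetical: bind an nsproxy to an integer id.  Returns 0 or -1. */
int register_ns(int id, struct nsproxy *ns)
{
	if (id < 0 || id >= MAX_NS || ns == NULL || ns_table[id] != NULL)
		return -1;
	ns_table[id] = ns;
	return 0;
}

/* Hypothetical: enter by id, which is nothing more than a pointer switch. */
int enter_ns(struct task *t, int id)
{
	if (id < 0 || id >= MAX_NS || ns_table[id] == NULL)
		return -1;
	t->nsproxy = ns_table[id];	/* current->nsproxy = target_nsproxy */
	return 0;
}
```

In this model nothing else about the task changes when enter_ns() succeeds; only the nsproxy pointer is replaced while the task keeps running.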

Just using ptrace is however not safe either.  You are inheriting *all*
of the target's context, so it shouldn't be difficult for a nefarious
container/vserver admin to trick the host admin into running something
which gives the container/vserver admin full access to the host.

That's where the hijack idea came from.  Yes, I called it hijack to make
sure alarm bells went off :) because it's definitely still worrisome.  But
at this point I believe it is the safest solution suggested so far.

-serge
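
The patch description quoted below says the child starts as a verbatim copy of the calling task, after which selected pieces are copied from the hijack source.  A rough user-space sketch of that ordering follows.  The struct toy_task, its integer fields and toy_hijack() are made up for illustration and are not the kernel's task_struct or copy_hijackable_taskinfo().  Treating the rootfs info as falling back to the caller's in the HIJACK_NS case is an assumption drawn from the "full copy from current" changelog entry.

```c
#include <stdbool.h>

/* Toy task: fields picked from the patch description, not from task_struct. */
struct toy_task {
	int files;	/* file table: stays with the caller */
	int parent;	/* ->parent: stays with the caller */
	int nsproxy;	/* namespaces: taken from the hijack source */
	int cgroup;	/* cgroup: taken from the hijack source */
	int root;	/* rootfs info: only available from a live source task */
};

/*
 * Build the child: verbatim copy of the caller first, then overwrite the
 * hijackable pieces from the source.  have_task is false for the HIJACK_NS
 * case, where only the cgroup and nsproxy can be copied.
 */
struct toy_task toy_hijack(const struct toy_task *caller,
			   const struct toy_task *source, bool have_task)
{
	struct toy_task child = *caller;

	child.nsproxy = source->nsproxy;
	child.cgroup = source->cgroup;
	if (have_task)
		child.root = source->root;	/* assumption: else keep caller's */
	return child;
}
```

The point of the ordering is that anything not explicitly copied from the source defaults to the caller's context rather than the target's.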

> Crispin
> 
> Mark Nelson wrote:
> > Here's the latest version of sys_hijack.
> > Apologies for its lateness.
> >
> > Thanks!
> >
> > Mark.
> >
> > Subject: [PATCH 1/2] namespaces: introduce sys_hijack (v10)
> >
> > Move most of do_fork() into a new do_fork_task() which acts on
> > a new argument, task, rather than on current.  do_fork() becomes
> > a call to do_fork_task(current, ...).
> >
> > Introduce sys_hijack (for i386 and s390 only so far).  It is like
> > clone, but in place of a stack pointer (which is assumed null) it
> > accepts a pid.  The process identified by that pid is the one
> > which is actually cloned.  Some state - including the file
> > table, the signals and sighand (and hence tty), and the ->parent
> > are taken from the calling process.
> >
> > A process to be hijacked may be identified by process id, in the
> > case of HIJACK_PID.  Alternatively, in the case of HIJACK_CG an
> > open fd for a cgroup 'tasks' file may be specified.  The first
> > available task in that cgroup will then be hijacked.
> >
> > HIJACK_NS is implemented as a third hijack method.  The main
> > purpose is to allow entering an empty cgroup without having
> > to keep a task alive in the target cgroup.  When HIJACK_NS
> > is called, only the cgroup and nsproxy are copied from the
> > cgroup.  Security, user, and rootfs info is not retained
> > in the cgroups and so cannot be copied to the child task.
> >
> > In order to hijack a process, the calling process must be
> > allowed to ptrace the target.
> >
> > Sending sigstop to the hijacked task can trick its parent shell
> > (if it is a shell foreground task) into thinking it should retake
> > its tty.
> >
> > So try not sending SIGSTOP, and instead hold the task_lock over
> > the hijacked task throughout the do_fork_task() operation.
> > This is really dangerous.  I've fixed cgroup_fork() to not
> > task_lock(task) in the hijack case, but there may well be other
> > code called during fork which can under "some circumstances"
> > task_lock(task).
> >
> > Still, this is working for me.
> >
> > The effect is a sort of namespace enter.  The following program
> > uses sys_hijack to 'enter' all namespaces of the specified task.
> > For instance in one terminal, do
> >
> > 	mount -t cgroup -ons cgroup /cgroup
> > 	hostname
> > 	  qemu
> > 	ns_exec -u /bin/sh
> > 	  hostname serge
> >           echo $$
> >             1073
> > 	  cat /proc/$$/cgroup
> > 	    ns:/node_1073
> >
> > In another terminal then do
> >
> > 	hostname
> > 	  qemu
> > 	cat /proc/$$/cgroup
> > 	  ns:/
> > 	hijack pid 1073
> > 	  hostname
> > 	    serge
> > 	  cat /proc/$$/cgroup
> > 	    ns:/node_1073
> > 	hijack cgroup /cgroup/node_1073/tasks
> >
> > Changelog:
> > 	Aug 23: send a stop signal to the hijacked process
> > 		(like ptrace does).
> > 	Oct 09: Update for 2.6.23-rc8-mm2 (mainly pidns)
> > 		Don't take task_lock under rcu_read_lock
> > 		Send hijacked process to cgroup_fork() as
> > 		the first argument.
> > 		Removed some unneeded task_locks.
> > 	Oct 16: Fix bug introduced into alloc_pid.
> > 	Oct 16: Add 'int which' argument to sys_hijack to
> > 		allow later expansion to use cgroup in place
> > 		of pid to specify what to hijack.
> > 	Oct 24: Implement hijack by open cgroup file.
> > 	Nov 02: Switch copying of task info: do full copy
> > 		from current, then copy relevant pieces from
> > 		hijacked task.
> > 	Nov 06: Verbatim task_struct copy now comes from current,
> > 		after which copy_hijackable_taskinfo() copies
> > 		relevant context pieces from the hijack source.
> > 	Nov 07: Move arch-independent hijack code to kernel/fork.c
> > 	Nov 07: powerpc and x86_64 support (Mark Nelson)
> > 	Nov 07: Don't allow hijacking members of same session.
> > 	Nov 07: introduce cgroup_may_hijack, and may_hijack hook to
> > 		cgroup subsystems.  The ns subsystem uses this to
> > 		enforce the rule that one may only hijack descendent
> > 		namespaces.
> > 	Nov 07: s390 support
> > 	Nov 08: don't send SIGSTOP to hijack source task
> > 	Nov 10: cache reference to nsproxy in ns cgroup for use in
> > 		hijacking an empty cgroup.
> > 	Nov 10: allow partial hijack of empty cgroup
> > 	Nov 13: don't double-get cgroup for hijack_ns
> > 		find_css_set() actually returns the set with a
> > 		reference already held, so cgroup_fork_fromcgroup()
> > 		by doing a get_css_set() was getting a second
> > 		reference.  Therefore after exiting the hijack
> > 		task we could not rmdir the cgroup.
> > 	Nov 22: temporarily remove x86_64 and powerpc support
> > 	Nov 27: rebased on 2.6.24-rc3
> >
> > ==============================================================
> > hijack.c
> > ==============================================================
> > /*
> >  * Your options are:
> >  *	hijack pid 1078
> >  *	hijack cgroup /cgroup/node_1078/tasks
> >  *	hijack ns /cgroup/node_1078/tasks
> >  */
> >
> > #define _BSD_SOURCE
> > #include <unistd.h>
> > #include <sys/syscall.h>
> > #include <sys/types.h>
> > #include <sys/wait.h>
> > #include <sys/stat.h>
> > #include <fcntl.h>
> > #include <sched.h>
> > #include <signal.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> >
> > #if __i386__
> > #    define __NR_hijack		325
> > #elif __s390x__
> > #    define __NR_hijack		319
> > #else
> > #    error "Architecture not supported"
> > #endif
> >
> > #ifndef CLONE_NEWUTS
> > #define CLONE_NEWUTS 0x04000000
> > #endif
> >
> > void usage(char *me)
> > {
> > 	printf("Usage: %s pid <pid>\n", me);
> > 	printf("     | %s cgroup <cgroup_tasks_file>\n", me);
> > 	printf("     | %s ns <cgroup_tasks_file>\n", me);
> > 	exit(1);
> > }
> >
> > int exec_shell(void)
> > {
> > 	execl("/bin/sh", "/bin/sh", (char *)NULL);
> > 	/* execl() only returns on failure */
> > 	perror("execl");
> > 	return 1;
> > }
> >
> > #define HIJACK_PID 1
> > #define HIJACK_CG 2
> > #define HIJACK_NS 3
> >
> > int main(int argc, char *argv[])
> > {
> > 	int id;
> > 	int ret;
> > 	int status;
> > 	int which_hijack;
> >
> > 	if (argc < 3 || !strcmp(argv[1], "-h"))
> > 		usage(argv[0]);
> > 	if (strcmp(argv[1], "cgroup") == 0)
> > 		which_hijack = HIJACK_CG;
> > 	else if (strcmp(argv[1], "ns") == 0)
> > 		which_hijack = HIJACK_NS;
> > 	else
> > 		which_hijack = HIJACK_PID;
> >
> > 	switch(which_hijack) {
> > 		case HIJACK_PID:
> > 			id = atoi(argv[2]);
> > 			printf("hijacking pid %d\n", id);
> > 			break;
> > 		case HIJACK_CG:
> > 		case HIJACK_NS:
> > 			id = open(argv[2], O_RDONLY);
> > 			if (id == -1) {
> > 				perror("cgroup open");
> > 				return 1;
> > 			}
> > 			break;
> > 	}
> >
> > 	ret = syscall(__NR_hijack, SIGCHLD, which_hijack, (unsigned long)id);
> >
> > 	if (which_hijack != HIJACK_PID)
> > 		close(id);
> > 	if (ret == 0) {
> > 		return exec_shell();
> > 	} else if (ret < 0) {
> > 		perror("sys_hijack");
> > 	} else {
> > 		printf("waiting on cloned process %d\n", ret);
> > 		while(waitpid(-1, &status, __WALL) != -1)
> > 				;
> >
...
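
One note on the argument handling in the quoted hijack.c: anything other than "cgroup" or "ns" is treated as pid mode, so a mistyped mode name silently becomes HIJACK_PID.  A stricter variant might look like the sketch below.  The HIJACK_* values are the ones from the quoted program; parse_hijack_mode() is a hypothetical helper and not part of the posted patch.

```c
#include <string.h>

#define HIJACK_PID 1
#define HIJACK_CG  2
#define HIJACK_NS  3

/* Map a mode name to its HIJACK_* constant, or -1 if it is not recognised. */
int parse_hijack_mode(const char *name)
{
	if (name == NULL)
		return -1;
	if (strcmp(name, "pid") == 0)
		return HIJACK_PID;
	if (strcmp(name, "cgroup") == 0)
		return HIJACK_CG;
	if (strcmp(name, "ns") == 0)
		return HIJACK_NS;
	return -1;
}
```

main() could then call usage() whenever parse_hijack_mode(argv[1]) returns -1 instead of falling through to the pid case.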