OpenVZ Forum


Home » Mailing lists » Devel » [PATCH 1/2] namespaces: introduce sys_hijack (v10)
[PATCH 1/2] namespaces: introduce sys_hijack (v10) [message #23792] Tue, 27 November 2007 01:54 Go to previous message
Mark Nelson is currently offline  Mark Nelson
Messages: 6
Registered: September 2007
Junior Member
Here's the latest version of sys_hijack.
Apologies for its lateness.

Thanks!

Mark.

Subject: [PATCH 1/2] namespaces: introduce sys_hijack (v10)

Move most of do_fork() into a new do_fork_task() which acts on
a new argument, task, rather than on current.  do_fork() becomes
a call to do_fork_task(current, ...).

Introduce sys_hijack (for i386 and s390 only so far).  It is like
clone, but in place of a stack pointer (which is assumed null) it
accepts a pid.  The process identified by that pid is the one
which is actually cloned.  Some state - including the file
table, the signals and sighand (and hence tty), and the ->parent
are taken from the calling process.

A process to be hijacked may be identified by process id, in the
case of HIJACK_PID.  Alternatively, in the case of HIJACK_CG an
open fd for a cgroup 'tasks' file may be specified.  The first
available task in that cgroup will then be hijacked.

HIJACK_NS is implemented as a third hijack method.  The main
purpose is to allow entering an empty cgroup without having
to keep a task alive in the target cgroup.  When HIJACK_NS
is called, only the cgroup and nsproxy are copied from the
cgroup.  Security, user, and rootfs info is not retained
in the cgroups and so cannot be copied to the child task.

In order to hijack a process, the calling process must be
allowed to ptrace the target.

Sending sigstop to the hijacked task can trick its parent shell
(if it is a shell foreground task) into thinking it should retake
its tty.

So try not sending SIGSTOP, and instead hold the task_lock over
the hijacked task throughout the do_fork_task() operation.
This is really dangerous.  I've fixed cgroup_fork() to not
task_lock(task) in the hijack case, but there may well be other
code called during fork which can under "some circumstances"
task_lock(task).

Still, this is working for me.

The effect is a sort of namespace enter.  The following program
uses sys_hijack to 'enter' all namespaces of the specified task.
For instance in one terminal, do

	mount -t cgroup -ons cgroup /cgroup
	hostname
	  qemu
	ns_exec -u /bin/sh
	  hostname serge
          echo $$
            1073
	  cat /proc/$$/cgroup
	    ns:/node_1073

In another terminal then do

	hostname
	  qemu
	cat /proc/$$/cgroup
	  ns:/
	hijack pid 1073
	  hostname
	    serge
	  cat /proc/$$/cgroup
	    ns:/node_1073
	hijack cgroup /cgroup/node_1073/tasks

Changelog:
	Aug 23: send a stop signal to the hijacked process
		(like ptrace does).
	Oct 09: Update for 2.6.23-rc8-mm2 (mainly pidns)
		Don't take task_lock under rcu_read_lock
		Send hijacked process to cgroup_fork() as
		the first argument.
		Removed some unneeded task_locks.
	Oct 16: Fix bug introduced into alloc_pid.
	Oct 16: Add 'int which' argument to sys_hijack to
		allow later expansion to use cgroup in place
		of pid to specify what to hijack.
	Oct 24: Implement hijack by open cgroup file.
	Nov 02: Switch copying of task info: do full copy
		from current, then copy relevant pieces from
		hijacked task.
	Nov 06: Verbatim task_struct copy now comes from current,
		after which copy_hijackable_taskinfo() copies
		relevant context pieces from the hijack source.
	Nov 07: Move arch-independent hijack code to kernel/fork.c
	Nov 07: powerpc and x86_64 support (Mark Nelson)
	Nov 07: Don't allow hijacking members of same session.
	Nov 07: introduce cgroup_may_hijack, and may_hijack hook to
		cgroup subsystems.  The ns subsystem uses this to
		enforce the rule that one may only hijack descendent
		namespaces.
	Nov 07: s390 support
	Nov 08: don't send SIGSTOP to hijack source task
	Nov 10: cache reference to nsproxy in ns cgroup for use in
		hijacking an empty cgroup.
	Nov 10: allow partial hijack of empty cgroup
	Nov 13: don't double-get cgroup for hijack_ns
		find_css_set() actually returns the set with a
		reference already held, so cgroup_fork_fromcgroup()
		by doing a get_css_set() was getting a second
		reference.  Therefore after exiting the hijack
		task we could not rmdir the csgroup.
	Nov 22: temporarily remove x86_64 and powerpc support
	Nov 27: rebased on 2.6.24-rc3

==============================================================
hijack.c
==============================================================
/*
 * Your options are:
 *	hijack pid 1078
 *	hijack cgroup /cgroup/node_1078/tasks
 *	hijack ns /cgroup/node_1078/tasks
 */

#define _BSD_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#if __i386__
#    define __NR_hijack		325
#elif __s390x__
#    define __NR_hijack		319
#else
#    error "Architecture not supported"
#endif

#ifndef CLONE_NEWUTS
#define CLONE_NEWUTS 0x04000000
#endif

void usage(char *me)
{
	printf("Usage: %s pid <pid>\n", me);
	printf("     | %s cgroup <cgroup_tasks_file>\n", me);
	printf("     | %s ns <cgroup_tasks_file>\n", me);
	exit(1);
}

int exec_shell(void)
{
	execl("/bin/sh", "/bin/sh", NULL);
}

#define HIJACK_PID 1
#define HIJACK_CG 2
#define HIJACK_NS 3

int main(int argc, char *argv[])
{
	int id;
	int ret;
	int status;
	int which_hijack;

	if (argc < 3 || !strcmp(argv[1], "-h"))
		usage(argv[0]);
	if (strcmp(argv[1], "cgroup") == 0)
		which_hijack = HIJACK_CG;
	else if (strcmp(argv[1], "ns") == 0)
		which_hijack = HIJACK_NS;
	else
		which_hijack = HIJACK_PID;

	switch(which_hijack) {
		case HIJACK_PID:
			id = atoi(argv[2]);
			printf("hijacking pid %d\n", id);
			break;
		case HIJACK_CG:
		case HIJACK_NS:
			id = open(argv[2], O_RDONLY);
			if (id == -1) {
				perror("cgroup open");
				return 1;
			}
			break;
	}

	ret = syscall(__NR_hijack, SIGCHLD, which_hijack, (unsigned long)id);

	if (which_hijack != HIJACK_PID)
		close(id);
	if  (ret == 0) {
		return exec_shell();
	} else if (ret < 0) {
		perror("sys_hijack");
	} else {
		printf("waiting on cloned process %d\n", ret);
		while(waitpid(-1, &status, __WALL) != -1)
				;
		printf("cloned process exited with %d (waitpid ret %d)\n",
				status, ret);
	}

	return ret;
}
==============================================================

Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Mark Nelson <markn@au1.ibm.com>
---
 Documentation/cgroups.txt          |    9 +
 arch/s390/kernel/process.c         |   21 +++
 arch/x86/kernel/process_32.c       |   24 ++++
 arch/x86/kernel/syscall_table_32.S |    1 
 include/asm-x86/unistd_32.h        |    3 
 include/linux/cgroup.h             |   28 ++++-
 include/linux/nsproxy.h            |   12 +-
 include/linux/ptrace.h             |    1 
 include/linux/sched.h              |   19 +++
 include/linux/syscalls.h           |    2 
 kernel/cgroup.c                    |  133 +++++++++++++++++++++++-
 kernel/fork.c                      |  201 ++++++++++++++++++++++++++++++++++---
 kernel/ns_cgroup.c                 |   88 +++++++++++++++-
 kernel/nsproxy.c                   |    4 
 kernel/ptrace.c                    |    7 +
 15 files changed, 523 insertions(+), 30 deletions(-)

Index: upstream/arch/s390/kernel/process.c
===================================================================
--- upstream.orig/arch/s390/kernel/process.c
+++ upstream/arch/s390/kernel/process.c
@@ -321,6 +321,27 @@ asmlinkage long sys_clone(void)
 		       parent_tidptr, child_tidptr);
 }
 
+asmlinkage long sys_hijack(void)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	unsigned long sp = regs->orig_gpr2;
+	unsigned long clone_flags = regs->gprs[3];
+	int which = regs->gprs[4];
+	unsigned int fd;
+	pid_t pid;
+
+	switch (which) {
+	case HIJACK_PID:
+		pid = regs->gprs[5];
+		return hijack_pid(pid, clone_flags, *regs, sp);
+	case HIJACK_CGROUP:
+		fd = (unsigned int) regs->gprs[5];
+		return hijack_cgroup(fd, clone_flags, *regs, sp);
+	default:
+		return -EINVAL;
+	}
+}
+
 /*
  * This is trivial, and on the face of it looks like it
  * could equally well be done in user mode.
Index: upstream/arch/x86/kernel/process_32.c
===================================================================
--- upstream.orig/arch/x86/kernel/process_32.c
+++ upstream/arch/x86/kernel/process_32.c
@@ -37,6 +37,7 @@
 #include <linux/personality.h>
 #include <linux/tick.h>
 #include <linux/percpu.h>
+#include <linux/cgroup.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -781,6 +782,29 @@ asmlinkage int sys_clone(struct pt_regs 
 	return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr);
 }
 
+asmlinkage int sys_hijack(struct pt_regs regs)
+{
+	unsigned long sp = regs.esp;
+	unsigned long clone_flags = regs.ebx;
+	int which = regs.ecx;
+	unsigned int fd;
+	pid_t pid;
+
+	switch (which) {
+	case HIJACK_PID:
+		pid = regs.edx;
+		return hijack_pid(pid, clone_flags, regs, sp);
+	case HIJACK_CGROUP:
+		fd = (unsigned int) regs.edx;
+		return hijack_cgroup(fd, clone_flags, regs, sp);
+	case HIJACK_NS:
+		fd = (unsigned int) regs.edx;
+		return hijack_ns(fd, clone_flags, regs, sp);
+	default:
+		return -EINVAL;
+	}
+}
+
 /*
  * This is trivial, and on the face of it looks like it
  * could equally well be done in user mode.
Index: upstream/arch/x86/kernel/syscall_table_32.S
===================================================================
--- upstream.orig/arch/x86/kernel/syscall_table_32.S
+++ upstream/arch/x86/kernel/syscall_table_32.S
@@ -324,3 +324,4 @@ ENTRY(sys_call_table)
 	.long sys_timerfd
 	.long sys_eventfd
 	.long sys_fallocate
+	.long sys_hijack		/* 325 */
Index: upstream/Documentation/cgroups.txt
===================================================================
--- upstream.orig/Documentation/cgroups.txt
+++ upstream/Documentation/cgroups.txt
@@ -495,6 +495,15 @@ LL=cgroup_m
...

 
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Previous Topic: [PATCH 2.6.25] net: removes unnecessary dependencies for net_namespace.h
Next Topic: [PATCH] AB-BA deadlock in drop_caches sysctl (resend, the one sent was for 2.6.18)
Goto Forum:
  


Current Time: Sat Sep 14 14:30:32 GMT 2024

Total time taken to generate the page: 0.03815 seconds