Just the name "sys_hijack" makes me concerned.
This post describes a lot of "what" but says nothing about "why"
we would want this. What is it for?
And I second Casey's concern about careful management of the privilege
required to "hijack" a process.
Crispin
Mark Nelson wrote:
> Here's the latest version of sys_hijack.
> Apologies for its lateness.
>
> Thanks!
>
> Mark.
>
> Subject: [PATCH 1/2] namespaces: introduce sys_hijack (v10)
>
> Move most of do_fork() into a new do_fork_task() which acts on
> a new argument, task, rather than on current. do_fork() becomes
> a call to do_fork_task(current, ...).
>
> Introduce sys_hijack (for i386 and s390 only so far). It is like
> clone, but in place of a stack pointer (which is assumed null) it
> accepts a pid. The process identified by that pid is the one
> which is actually cloned. Some state, including the file
> table, the signals and sighand (and hence the tty), and ->parent,
> is taken from the calling process.
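
To make the split of state concrete, here is a toy user-space model of the description above, following the order given in the Nov 06 changelog entry (verbatim copy from the caller, then selected pieces from the hijack source). It is only a sketch: struct toy_task, toy_fork_task() and toy_fork() are made-up names, the fields are stand-ins rather than real task_struct members, and treating nsproxy and cgroup as the pieces taken from the target is a simplifying assumption, not the patch's exact list.

```c
#include <assert.h>

/* Toy stand-in for task_struct; field names are illustrative only. */
struct toy_task {
	int files_id;	/* stands in for the file table */
	int sighand_id;	/* stands in for signals/sighand (and hence tty) */
	int nsproxy_id;	/* stands in for the namespaces */
	int cgroup_id;	/* stands in for the cgroup */
};

/*
 * Hypothetical analogue of do_fork_task(): build a child on behalf of
 * `caller`, taking its context from `target`.
 */
struct toy_task toy_fork_task(const struct toy_task *caller,
			      const struct toy_task *target)
{
	/* Start from a verbatim copy of the caller ... */
	struct toy_task child = *caller;

	/* ... then overwrite the context pieces with the target's (assumed subset). */
	child.nsproxy_id = target->nsproxy_id;
	child.cgroup_id = target->cgroup_id;
	return child;
}

/* An ordinary fork is the special case where the target is the caller. */
struct toy_task toy_fork(const struct toy_task *current_task)
{
	return toy_fork_task(current_task, current_task);
}
```

With a caller and a target that differ in every field, toy_fork_task() yields a child that keeps the caller's file table and signal handling but ends up in the target's namespaces and cgroup.
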
>
> A process to be hijacked may be identified by process id, in the
> case of HIJACK_PID. Alternatively, in the case of HIJACK_CG an
> open fd for a cgroup 'tasks' file may be specified. The first
> available task in that cgroup will then be hijacked.
>
> HIJACK_NS is implemented as a third hijack method. The main
> purpose is to allow entering an empty cgroup without having
> to keep a task alive in the target cgroup. When HIJACK_NS
> is called, only the cgroup and nsproxy are copied from the
> cgroup. Security, user, and rootfs info is not retained
> in the cgroups and so cannot be copied to the child task.
>
> In order to hijack a process, the calling process must be
> allowed to ptrace the target.
>
> Sending SIGSTOP to the hijacked task can trick its parent shell
> (if it is a shell foreground task) into thinking it should retake
> its tty.
>
> So avoid sending SIGSTOP, and instead hold the task_lock on
> the hijacked task throughout the do_fork_task() operation.
> This is really dangerous. I've fixed cgroup_fork() to not
> task_lock(task) in the hijack case, but there may well be other
> code called during fork which can, under "some circumstances",
> call task_lock(task).
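
The danger described here is a self-deadlock: task_lock() is a non-recursive spinlock, so any fork-path code that takes it again while the hijacker already holds it would spin forever. The following user-space sketch shows the shape of the problem with a toy lock built on C11 atomics. The names toy_lock, toy_trylock(), toy_unlock() and toy_fork_step() are hypothetical, and a trylock is used so that the example reports the conflict instead of hanging.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy non-recursive lock; a stand-in for task_lock(), not kernel code. */
struct toy_lock {
	atomic_flag held;
};

/* Try once to take the lock; true on success, false if already held. */
bool toy_trylock(struct toy_lock *l)
{
	return !atomic_flag_test_and_set(&l->held);
}

void toy_unlock(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

/*
 * Hypothetical fork-path helper that takes the task lock itself.
 * Returns false in the situation where a blocking lock would spin forever.
 */
bool toy_fork_step(struct toy_lock *task_lock)
{
	if (!toy_trylock(task_lock))
		return false;	/* caller already holds it: would deadlock */
	/* ... work done under the lock would go here ... */
	toy_unlock(task_lock);
	return true;
}
```

Calling toy_fork_step() with the lock free succeeds; calling it while the same thread already holds the lock fails, which is the case a real spinlock would turn into a hang.
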
>
> Still, this is working for me.
>
> The effect is a sort of namespace enter. The following program
> uses sys_hijack to 'enter' all namespaces of the specified task.
> For instance in one terminal, do
>
>     mount -t cgroup -ons cgroup /cgroup
>     hostname
>       qemu
>     ns_exec -u /bin/sh
>     hostname serge
>     echo $$
>       1073
>     cat /proc/$$/cgroup
>       ns:/node_1073
>
> In another terminal then do
>
>     hostname
>       qemu
>     cat /proc/$$/cgroup
>       ns:/
>     hijack pid 1073
>     hostname
>       serge
>     cat /proc/$$/cgroup
>       ns:/node_1073
>     hijack cgroup /cgroup/node_1073/tasks
>
> Changelog:
>     Aug 23: send a stop signal to the hijacked process
>             (like ptrace does).
>     Oct 09: Update for 2.6.23-rc8-mm2 (mainly pidns)
>             Don't take task_lock under rcu_read_lock
>             Send hijacked process to cgroup_fork() as
>             the first argument.
>             Removed some unneeded task_locks.
>     Oct 16: Fix bug introduced into alloc_pid.
>     Oct 16: Add 'int which' argument to sys_hijack to
>             allow later expansion to use cgroup in place
>             of pid to specify what to hijack.
>     Oct 24: Implement hijack by open cgroup file.
>     Nov 02: Switch copying of task info: do full copy
>             from current, then copy relevant pieces from
>             hijacked task.
>     Nov 06: Verbatim task_struct copy now comes from current,
>             after which copy_hijackable_taskinfo() copies
>             relevant context pieces from the hijack source.
>     Nov 07: Move arch-independent hijack code to kernel/fork.c
>     Nov 07: powerpc and x86_64 support (Mark Nelson)
>     Nov 07: Don't allow hijacking members of same session.
>     Nov 07: introduce cgroup_may_hijack, and may_hijack hook to
>             cgroup subsystems. The ns subsystem uses this to
>             enforce the rule that one may only hijack descendant
>             namespaces.
>     Nov 07: s390 support
>     Nov 08: don't send SIGSTOP to hijack source task
>     Nov 10: cache reference to nsproxy in ns cgroup for use in
>             hijacking an empty cgroup.
>     Nov 10: allow partial hijack of empty cgroup
>     Nov 13: don't double-get cgroup for hijack_ns
>             find_css_set() actually returns the set with a
>             reference already held, so cgroup_fork_fromcgroup()
>             by doing a get_css_set() was getting a second
>             reference. Therefore after exiting the hijack
>             task we could not rmdir the cgroup.
>     Nov 22: temporarily remove x86_64 and powerpc support
>     Nov 27: rebased on 2.6.24-rc3
>
> ==============================================================
> hijack.c
> ==============================================================
> /*
>  * Your options are:
>  *	hijack pid 1078
>  *	hijack cgroup /cgroup/node_1078/tasks
>  *	hijack ns /cgroup/node_1078/tasks
>  */
>
> #define _BSD_SOURCE
> #include <unistd.h>
> #include <sys/syscall.h>
> #include <sys/types.h>
> #include <sys/wait.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sched.h>
> #include <signal.h>	/* SIGCHLD */
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> #if __i386__
> #    define __NR_hijack 325
> #elif __s390x__
> #    define __NR_hijack 319
> #else
> #    error "Architecture not supported"
> #endif
>
> #ifndef CLONE_NEWUTS
> #define CLONE_NEWUTS 0x04000000
> #endif
>
> void usage(char *me)
> {
> 	printf("Usage: %s pid <pid>\n", me);
> 	printf("     | %s cgroup <cgroup_tasks_file>\n", me);
> 	printf("     | %s ns <cgroup_tasks_file>\n", me);
> 	exit(1);
> }
>
> int exec_shell(void)
> {
> 	execl("/bin/sh", "/bin/sh", NULL);
> 	/* execl() only returns on failure */
> 	perror("execl");
> 	return 1;
> }
>
> #define HIJACK_PID 1
> #define HIJACK_CG 2
> #define HIJACK_NS 3
>
> int main(int argc, char *argv[])
> {
> 	int id;
> 	int ret;
> 	int status;
> 	int which_hijack;
>
> 	if (argc < 3 || !strcmp(argv[1], "-h"))
> 		usage(argv[0]);
> 	/* anything other than "cgroup" or "ns" is treated as "pid" */
> 	if (strcmp(argv[1], "cgroup") == 0)
> 		which_hijack = HIJACK_CG;
> 	else if (strcmp(argv[1], "ns") == 0)
> 		which_hijack = HIJACK_NS;
> 	else
> 		which_hijack = HIJACK_PID;
>
> 	switch (which_hijack) {
> 	case HIJACK_PID:
> 		id = atoi(argv[2]);
> 		printf("hijacking pid %d\n", id);
> 		break;
> 	case HIJACK_CG:
> 	case HIJACK_NS:
> 		/* for these modes, id is an open fd on the cgroup 'tasks' file */
> 		id = open(argv[2], O_RDONLY);
> 		if (id == -1) {
> 			perror("cgroup open");
> 			return 1;
> 		}
> 		break;
> 	}
>
> 	ret = syscall(__NR_hijack, SIGCHLD, which_hijack, (unsigned long)id);
>
> 	if (which_hijack != HIJACK_PID)
> 		close(id);
> 	if (ret == 0) {
> 		/* child: now running in the hijacked context */
> 		return exec_shell();
> 	} else if (ret < 0) {
> 		perror("sys_hijack");
> 	} else {
> 		printf("waiting on cloned process %d\n", ret);
> 		while (waitpid(-1, &status, __WALL) != -1)
> 			;
> 		printf("cloned process exited with %d (waitpid ret %d)\n",
> 		       status, ret);
> 	}
>
> 	return ret;
> }
> ==============================================================
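
One thing the listing above does quietly is treat any first word other than "cgroup" or "ns" as "pid". A stricter front end might reject unknown words instead. The helper below is a sketch of that idea only; parse_hijack_mode() is a hypothetical name that does not appear in the patch, and the HIJACK_* values are copied from the listing.

```c
#include <assert.h>
#include <string.h>

#define HIJACK_PID 1
#define HIJACK_CG  2
#define HIJACK_NS  3

/*
 * Map the first command-line word to a hijack mode.
 * Returns -1 for anything unrecognised rather than defaulting to pid.
 */
int parse_hijack_mode(const char *word)
{
	if (strcmp(word, "pid") == 0)
		return HIJACK_PID;
	if (strcmp(word, "cgroup") == 0)
		return HIJACK_CG;
	if (strcmp(word, "ns") == 0)
		return HIJACK_NS;
	return -1;
}
```

A caller could then print the usage message whenever the helper returns -1, so that a typo such as "cgrp" does not end up passing a path through atoi() as a pid.
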
>
> Signed-off-by: Serge Hallyn <serue@us.ibm.com>
> Signed-off-by: Mark Nelson <markn@au1.ibm.com>
> ---
> Documentation/cgroups.txt | 9 +
> arch/s390/kernel/process.c | 21 +++
> arch/x86/kernel/process_32.c | 24 ++++
> arch/x86/kernel/syscall_table_32.S | 1
> include/asm-x86/unistd_32.h | 3
> include/linux/cgroup.h | 28 ++++-
> include/linux/nsproxy.h | 12 +-
> include/linux/ptrace.h | 1
> include/linux/sched.h | 19 +++
> include/linux/syscalls.h | 2
> kernel/cgroup.c | 133 +++++++++++++++++++++++-
> kernel/fork.c | 201 ++++++++++++++++++++++++++++++++++---
> kernel/ns_cgroup.c | 88 +++++++++++++++-
> kernel/nsproxy.c | 4
> kernel/ptrace.c | 7 +
> 15 files changed, 523 insertions(+), 30 deletions(-)
>
> Index: upstream/arch/s390/kernel/process.c
> ===================================================================
> --- upstream.orig/arch/s390/kernel/process.c
> +++ upstream/arch/s390/kernel/process.c
> @@ -321,6 +321,27 @@ asmlinkage long sys_clone(void)
> parent_tidptr, child_tidptr);
> }
>
> +asmlinkage long sys_hijack(void)
> +{
> +	struct pt_regs *regs = task_pt_regs(current);
> +	unsigned long sp = regs->orig_gpr2;
> +	unsigned long clone_flags = regs->gprs[3];
> +	int which = regs->gprs[4];
> +	unsigned int fd;
> +	pid_t pid;
> +
> +	switch (which) {
> +	case HIJACK_PID:
> +		pid = regs->gprs[5];
> +		return hijack_pid(pid, clone_flags, *regs, sp);
> +	case HIJACK_CGROUP:
> +		fd = (unsigned int) regs->gprs[5];
> +		return hijack_cgroup(fd, clone_flags, *regs, sp);
> +	default:
> +		return -EINVAL;
> +	}
> +}
> +
> /*
> * This is trivial, and on the face of it looks like it
> * could equally well be done in user mode.
> Index: upstream/arch/x86/kernel/process_32.c
> ===================================================================
> --- upstream.ori
...