Quoting Crispin Cowan (crispin@crispincowan.com):
> Just the name "sys_hijack" makes me concerned.
>
> This post describes a bunch of "what", but doesn't tell us about "why"
> we would want this. What is it for?
Please see my response to Casey's email.
> And I second Casey's concern about careful management of the privilege
> required to "hijack" a process.
Absolutely. We're definitely still in RFC territory.
Note that there are currently several proposed (but no upstream) ways to
accomplish entering a namespace:
  1. bind_ns() is a new pair of syscalls proposed by Cedric. An
     nsproxy is given an integer id. The id can be used to enter
     an nsproxy, basically a straight
     current->nsproxy = target_nsproxy;
  2. I had previously posted a patchset on top of the nsproxy
     cgroup which allowed entering an nsproxy through the ns cgroup
     interface.
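
To make option 1 concrete, here is a toy user-space model of "bind an
nsproxy to an id, then enter it by id". It is only a sketch: the toy_*
structs and functions are made up for illustration and are not the
interface from Cedric's patchset or any kernel code. The point is that
"enter" boils down to a table lookup plus one pointer assignment, and
nothing else about the task is touched.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; these are NOT the kernel's nsproxy or task_struct. */
struct toy_nsproxy {
	int id;               /* integer id assigned by the "bind" step */
	const char *hostname; /* stands in for the UTS namespace */
};

struct toy_task {
	int pid;
	struct toy_nsproxy *nsproxy;
};

#define TOY_MAX_NS 8
static struct toy_nsproxy *toy_table[TOY_MAX_NS];

/* "Bind": register an nsproxy under an integer id. Returns 0 or -1. */
int toy_bind_ns(struct toy_nsproxy *ns, int id)
{
	assert(ns != NULL);
	if (id < 0 || id >= TOY_MAX_NS || toy_table[id] != NULL)
		return -1;
	ns->id = id;
	toy_table[id] = ns;
	return 0;
}

/* "Enter": look the id up and overwrite the task's pointer in place. */
int toy_enter_ns(struct toy_task *task, int id)
{
	assert(task != NULL);
	if (id < 0 || id >= TOY_MAX_NS || toy_table[id] == NULL)
		return -1;
	task->nsproxy = toy_table[id]; /* the "straight" assignment */
	return 0;
}
```

A caller would bind a container's toy_nsproxy under some id and later
call toy_enter_ns(&task, id). Everything else the task had set up before
the call stays exactly as it was.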
There are objections to both those patchsets because simply switching a
task's nsproxy using a syscall or file write in the middle of running a
binary is quite unsafe. Eric Biederman had suggested using ptrace or
something like it to accomplish the goal.
Just using ptrace, however, is not safe either. You are inheriting *all*
of the target's context, so it shouldn't be difficult for a nefarious
container/vserver admin to trick the host admin into running something
which gives the container/vserver admin full access to the host.
That's where the hijack idea came from. Yes, I called it hijack to make
sure alarm bells went off :) because it's definitely still worrisome.
But at this point I believe it is the safest solution suggested so far.
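
As a rough picture of how the child is assembled (per the patch
description quoted below: a verbatim copy of the caller, then the
relevant pieces taken from the hijack source), here is a toy user-space
model. The struct, the field names and toy_hijack() are hypothetical and
are not the patch's code. Exactly which fields come from the target is
my reading of the description, so treat that split as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a task; the fields are illustrative only. */
struct toy_task {
	int pid;
	int parent_pid;
	const char *files;   /* file table */
	const char *sighand; /* signals and sighand, and hence the tty */
	const char *nsproxy; /* namespaces */
	const char *cgroup;
	const char *cred;    /* security and user info */
	const char *root;    /* rootfs */
};

enum toy_mode {
	TOY_HIJACK_TASK,    /* a live task is the source (pid or cgroup) */
	TOY_HIJACK_NS_ONLY  /* empty cgroup: only cgroup + nsproxy exist */
};

struct toy_task toy_hijack(const struct toy_task *caller,
			   const struct toy_task *target,
			   enum toy_mode mode, int new_pid)
{
	struct toy_task child;

	assert(caller != NULL && target != NULL);

	/* Step 1: verbatim copy of the caller (files, sighand, ...). */
	child = *caller;
	child.pid = new_pid;
	/* hijack.c waits on the clone, so model the caller as parent. */
	child.parent_pid = caller->pid;

	/* Step 2: overlay the context taken from the hijack source. */
	child.nsproxy = target->nsproxy;
	child.cgroup = target->cgroup;
	if (mode == TOY_HIJACK_TASK) {
		/* Assumption: only a live task can supply these. */
		child.cred = target->cred;
		child.root = target->root;
	}
	return child;
}
```

The child never picks up the target's file table or signal handlers,
which is the difference from running something under ptrace inside the
target, where all of the target's context is inherited.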
-serge
> Crispin
>
> Mark Nelson wrote:
> > Here's the latest version of sys_hijack.
> > Apologies for its lateness.
> >
> > Thanks!
> >
> > Mark.
> >
> > Subject: [PATCH 1/2] namespaces: introduce sys_hijack (v10)
> >
> > Move most of do_fork() into a new do_fork_task() which acts on
> > a new argument, task, rather than on current. do_fork() becomes
> > a call to do_fork_task(current, ...).
> >
> > Introduce sys_hijack (for i386 and s390 only so far). It is like
> > clone, but in place of a stack pointer (which is assumed null) it
> > accepts a pid. The process identified by that pid is the one
> > which is actually cloned. Some state - including the file
> > table, the signals and sighand (and hence tty), and the ->parent
> > are taken from the calling process.
> >
> > A process to be hijacked may be identified by process id, in the
> > case of HIJACK_PID. Alternatively, in the case of HIJACK_CG an
> > open fd for a cgroup 'tasks' file may be specified. The first
> > available task in that cgroup will then be hijacked.
> >
> > HIJACK_NS is implemented as a third hijack method. The main
> > purpose is to allow entering an empty cgroup without having
> > to keep a task alive in the target cgroup. When HIJACK_NS
> > is called, only the cgroup and nsproxy are copied from the
> > cgroup. Security, user, and rootfs info is not retained
> > in the cgroups and so cannot be copied to the child task.
> >
> > In order to hijack a process, the calling process must be
> > allowed to ptrace the target.
> >
> > Sending SIGSTOP to the hijacked task can trick its parent shell
> > (if it is a shell foreground task) into thinking it should retake
> > its tty.
> >
> > So try not sending SIGSTOP, and instead hold the task_lock over
> > the hijacked task throughout the do_fork_task() operation.
> > This is really dangerous. I've fixed cgroup_fork() to not
> > task_lock(task) in the hijack case, but there may well be other
> > code called during fork which can under "some circumstances"
> > task_lock(task).
> >
> > Still, this is working for me.
> >
> > The effect is a sort of namespace enter. The following program
> > uses sys_hijack to 'enter' all namespaces of the specified task.
> > For instance in one terminal, do
> >
> > 	mount -t cgroup -ons cgroup /cgroup
> > 	hostname
> > 	  qemu
> > 	ns_exec -u /bin/sh
> > 	  hostname serge
> > 	  echo $$
> > 	    1073
> > 	  cat /proc/$$/cgroup
> > 	    ns:/node_1073
> >
> > In another terminal then do
> >
> > 	hostname
> > 	  qemu
> > 	cat /proc/$$/cgroup
> > 	  ns:/
> > 	hijack pid 1073
> > 	  hostname
> > 	    serge
> > 	  cat /proc/$$/cgroup
> > 	    ns:/node_1073
> > 	hijack cgroup /cgroup/node_1073/tasks
> >
> > Changelog:
> > Aug 23: send a stop signal to the hijacked process
> > (like ptrace does).
> > Oct 09: Update for 2.6.23-rc8-mm2 (mainly pidns)
> > Don't take task_lock under rcu_read_lock
> > Send hijacked process to cgroup_fork() as
> > the first argument.
> > Removed some unneeded task_locks.
> > Oct 16: Fix bug introduced into alloc_pid.
> > Oct 16: Add 'int which' argument to sys_hijack to
> > allow later expansion to use cgroup in place
> > of pid to specify what to hijack.
> > Oct 24: Implement hijack by open cgroup file.
> > Nov 02: Switch copying of task info: do full copy
> > from current, then copy relevant pieces from
> > hijacked task.
> > Nov 06: Verbatim task_struct copy now comes from current,
> > after which copy_hijackable_taskinfo() copies
> > relevant context pieces from the hijack source.
> > Nov 07: Move arch-independent hijack code to kernel/fork.c
> > Nov 07: powerpc and x86_64 support (Mark Nelson)
> > Nov 07: Don't allow hijacking members of same session.
> > Nov 07: introduce cgroup_may_hijack, and may_hijack hook to
> > cgroup subsystems. The ns subsystem uses this to
> > enforce the rule that one may only hijack descendent
> > namespaces.
> > Nov 07: s390 support
> > Nov 08: don't send SIGSTOP to hijack source task
> > Nov 10: cache reference to nsproxy in ns cgroup for use in
> > hijacking an empty cgroup.
> > Nov 10: allow partial hijack of empty cgroup
> > Nov 13: don't double-get cgroup for hijack_ns
> > find_css_set() actually returns the set with a
> > reference already held, so cgroup_fork_fromcgroup()
> > by doing a get_css_set() was getting a second
> > reference. Therefore after exiting the hijack
> > task we could not rmdir the cgroup.
> > Nov 22: temporarily remove x86_64 and powerpc support
> > Nov 27: rebased on 2.6.24-rc3
> >
> > ==============================================================
> > hijack.c
> > ==============================================================
> > /*
> > * Your options are:
> > * hijack pid 1078
> > * hijack cgroup /cgroup/node_1078/tasks
> > * hijack ns /cgroup/node_1078/tasks
> > */
> >
> > #define _BSD_SOURCE
> > #include <unistd.h>
> > #include <signal.h>
> > #include <sys/syscall.h>
> > #include <sys/types.h>
> > #include <sys/wait.h>
> > #include <sys/stat.h>
> > #include <fcntl.h>
> > #include <sched.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> >
> > #if __i386__
> > # define __NR_hijack 325
> > #elif __s390x__
> > # define __NR_hijack 319
> > #else
> > # error "Architecture not supported"
> > #endif
> >
> > #ifndef CLONE_NEWUTS
> > #define CLONE_NEWUTS 0x04000000
> > #endif
> >
> > void usage(char *me)
> > {
> > 	printf("Usage: %s pid <pid>\n", me);
> > 	printf(" | %s cgroup <cgroup_tasks_file>\n", me);
> > 	printf(" | %s ns <cgroup_tasks_file>\n", me);
> > 	exit(1);
> > }
> >
> > int exec_shell(void)
> > {
> > 	execl("/bin/sh", "/bin/sh", (char *)NULL);
> > 	/* execl() only returns on failure */
> > 	perror("execl");
> > 	return 1;
> > }
> >
> > #define HIJACK_PID 1
> > #define HIJACK_CG 2
> > #define HIJACK_NS 3
> >
> > int main(int argc, char *argv[])
> > {
> > 	int id;
> > 	int ret;
> > 	int status;
> > 	int which_hijack;
> >
> > 	if (argc < 3 || !strcmp(argv[1], "-h"))
> > 		usage(argv[0]);
> > 	if (strcmp(argv[1], "cgroup") == 0)
> > 		which_hijack = HIJACK_CG;
> > 	else if (strcmp(argv[1], "ns") == 0)
> > 		which_hijack = HIJACK_NS;
> > 	else
> > 		which_hijack = HIJACK_PID;
> >
> > 	switch (which_hijack) {
> > 	case HIJACK_PID:
> > 		id = atoi(argv[2]);
> > 		printf("hijacking pid %d\n", id);
> > 		break;
> > 	case HIJACK_CG:
> > 	case HIJACK_NS:
> > 		id = open(argv[2], O_RDONLY);
> > 		if (id == -1) {
> > 			perror("cgroup open");
> > 			return 1;
> > 		}
> > 		break;
> > 	}
> >
> > 	ret = syscall(__NR_hijack, SIGCHLD, which_hijack, (unsigned long)id);
> >
> > 	if (which_hijack != HIJACK_PID)
> > 		close(id);
> > 	if (ret == 0) {
> > 		return exec_shell();
> > 	} else if (ret < 0) {
> > 		perror("sys_hijack");
> > 	} else {
> > 		printf("waiting on cloned process %d\n", ret);
> > 		while (waitpid(-1, &status, __WALL) != -1)
> > 			;
> >
...