[RFC,PATCH 2/2] Documentation: prctl/seccomp_filter
Will Drewry
2012-01-11 17:25:10 UTC
Document how system call filtering with BPF works
and can be used.

Signed-off-by: Will Drewry <***@chromium.org>
---
Documentation/prctl/seccomp_filter.txt | 159 ++++++++++++++++++++++++++++++++
1 files changed, 159 insertions(+), 0 deletions(-)
create mode 100644 Documentation/prctl/seccomp_filter.txt

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..5fb3f44
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,159 @@
+ Seccomp filtering
+ =================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated. A
+certain subset of userland applications benefit by having a reduced set
+of available system calls. The resulting set reduces the total kernel
+surface exposed to the application. System call filtering is meant for
+use with those applications.
+
+Seccomp filtering provides a means for a process to specify a filter
+for incoming system calls. The filter is expressed as a Berkeley Packet
+Filter program, as with socket filters, except that the data operated on
+is the current user_regs_struct. This allows for expressive filtering
+of system calls using the pre-existing system call ABI and using a filter
+program language with a long history of being exposed to userland.
+Additionally, BPF makes it impossible for users of seccomp to fall prey to
+time-of-check-time-of-use (TOCTOU) attacks that are common in system call
+interposition frameworks because the evaluated data is solely register state
+just after system call entry.
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox. It provides a clearly defined
+mechanism for minimizing the exposed kernel surface. Beyond that,
+policy for logical behavior and information flow should be managed with
+a combination of other system hardening techniques and, potentially, an
+LSM of your choosing. Expressive, dynamic filters provide further options down
+this path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance) which could be
+construed, incorrectly, as a more complete sandboxing solution.
+
+Usage
+-----
+
+An additional seccomp mode is added, but it is not set directly by the
+consuming process. The new mode, '2', is only available if
+CONFIG_SECCOMP_FILTER is set, and it is enabled using prctl with the
+PR_ATTACH_SECCOMP_FILTER argument.
+
+Interacting with seccomp filters is done using one prctl(2) call.
+
+PR_ATTACH_SECCOMP_FILTER:
+ Allows the specification of a new filter using a BPF program.
+ The BPF program will be executed over a user_regs_struct data
+ reflecting system call time except with the system call number
+ resident in orig_[register]. To allow a system call, the size
+ of the data must be returned. At present, all other return values
+ result in the system call being blocked, but it is recommended to
+ return 0 in those cases. This will allow for future custom return
+ values to be introduced, if ever desired.
+
+ Usage:
+ prctl(PR_ATTACH_SECCOMP_FILTER, prog);
+
+ The 'prog' argument is a pointer to a struct sock_fprog which will
+ contain the filter program. If the program is invalid, the call
+ will return -1 and set errno to EINVAL.
+
+ The struct user_regs_struct the @prog will see is based on the
+ personality of the task at the time of this prctl call. Additionally,
+ is_compat_task is tracked for the @prog. This means that once set
+ the calling task will have all of its system calls blocked if it
+ switches its system call ABI (via personality or other means).
+
+ If the @prog is installed while the task has CAP_SYS_ADMIN in its user
+ namespace, the @prog will be marked as inheritable across execve. Any
+ inherited filters are still subject to the system call ABI constraints
+ above and any ABI mismatched system calls will result in process death.
+
+The call above returns 0 on success and non-zero on error.
+
+
+Example
+-------
+
+Assume a process would like to cleanly read and write to stdin/out/err and exit
+cleanly. Without using a BPF compiler, it may be done as follows on x86 32-bit:
+
+#include <asm/unistd.h>
+#include <linux/filter.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <sys/prctl.h>
+#include <sys/user.h>
+#include <unistd.h>
+#define regoffset(_reg) (offsetof(struct user_regs_struct, _reg))
+int install_filter(void)
+{
+ struct sock_filter filter[] = {
+ /* Grab the system call number */
+ BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(orig_eax)),
+ /* Jump table for the allowed syscalls */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 7),
+
+ /* Check that read is only using stdin. */
+ BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 5),
+
+ /* Check that write is only using stdout/stderr */
+ BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 2),
+
+ /* Accept by returning the data size; reject by returning 0. */
+ BPF_STMT(BPF_LD+BPF_W+BPF_LEN, 0),
+ BPF_STMT(BPF_RET+BPF_A, 0),
+ BPF_STMT(BPF_RET+BPF_K, 0),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ .filter = filter,
+ };
+ if (prctl(36 /* PR_ATTACH_SECCOMP_FILTER */, &prog)) {
+ perror("prctl");
+ return 1;
+ }
+ return 0;
+}
+
+#define payload(_c) _c, sizeof(_c)
+int main(int argc, char **argv) {
+ char buf[4096];
+ ssize_t bytes = 0;
+ if (install_filter())
+ return 1;
+ syscall(__NR_write, STDOUT_FILENO, payload("OHAI! WHAT IS YOUR NAME? "));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+ syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+ syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+ return 0;
+}
+
+Additionally, if prctl(2) is allowed by the installed filter, additional
+filters may be layered on which will increase evaluation time, but allow for
+further decreasing the attack surface during execution of a process.
+
+
+Caveats
+-------
+
+- execve will fail unless the most recently attached filter was installed by
+ a process with CAP_SYS_ADMIN (in its namespace).
+
+Adding architecture support
+---------------------------
+
+Any platform with seccomp support will support seccomp filters
+as long as CONFIG_SECCOMP_FILTER is enabled.
--
1.7.5.4

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Will Drewry
2012-01-11 17:25:09 UTC
This patch adds support for seccomp mode 2. This mode enables dynamic
enforcement of system call filtering policy in the kernel as specified
by a userland task. The policy is expressed in terms of a BPF program,
as is used for userland-exposed socket filtering. Instead of network
data, the BPF program is evaluated over struct user_regs_struct at the
time of the system call (as retrieved using regviews).
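
As a rough userspace illustration of the evaluation model described above
(not the evaluator in this patch), the sketch below runs a five-instruction
classic BPF program over a small register block. struct fake_regs,
run_filter() and syscall_allowed() are hypothetical names invented for this
example; only struct sock_filter and the BPF_* macros come from
<linux/filter.h>. It supports just the handful of opcodes the sample policy
needs.

```c
#include <assert.h>
#include <linux/filter.h>	/* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register block for illustration; not a real user_regs_struct. */
struct fake_regs {
	uint32_t arg0;	/* plays the role of ebx on x86-32 */
	uint32_t nr;	/* plays the role of orig_eax on x86-32 */
};

/*
 * Evaluate a tiny subset of classic BPF over buf. Returns the value of the
 * RET instruction, or 0 on an out-of-bounds load or unsupported opcode.
 */
static uint32_t run_filter(const struct sock_filter *prog, size_t count,
			   const uint8_t *buf, uint32_t buflen)
{
	uint32_t A = 0;
	size_t pc;

	assert(prog != NULL && buf != NULL);
	for (pc = 0; pc < count; pc++) {
		const struct sock_filter *f = &prog[pc];

		switch (f->code) {
		case BPF_LD | BPF_W | BPF_ABS:
			if (f->k > buflen || buflen - f->k < 4)
				return 0;
			memcpy(&A, buf + f->k, 4);	/* host byte order */
			break;
		case BPF_LD | BPF_W | BPF_LEN:
			A = buflen;
			break;
		case BPF_JMP | BPF_JEQ | BPF_K:
			pc += (A == f->k) ? f->jt : f->jf;
			break;
		case BPF_RET | BPF_K:
			return f->k;
		case BPF_RET | BPF_A:
			return A;
		default:
			return 0;
		}
	}
	return 0;
}

/* Returns 1 if a policy that permits only call number 1 allows (nr, arg0). */
static int syscall_allowed(uint32_t nr, uint32_t arg0)
{
	const struct fake_regs regs = { .arg0 = arg0, .nr = nr };
	const struct sock_filter prog[] = {
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct fake_regs, nr)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 0, 2),
		BPF_STMT(BPF_LD | BPF_W | BPF_LEN, 0),	/* accept: A = size */
		BPF_STMT(BPF_RET | BPF_A, 0),
		BPF_STMT(BPF_RET | BPF_K, 0),		/* reject */
	};

	return run_filter(prog, sizeof(prog) / sizeof(prog[0]),
			  (const uint8_t *)&regs, sizeof(regs)) == sizeof(regs);
}
```

The reject path returns a constant 0 instead of whatever happens to be in
the accumulator, so a call number that equals the size of the register
block (8 in this model) is still refused.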

A filter program may be installed by a userland task by calling
prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
where fprog is of type struct sock_fprog.

If the first filter program allows subsequent prctl(2) calls, then
additional filter programs may be attached. All attached programs
must be evaluated before a system call will be allowed to proceed.
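
The layering rule can be modelled in userspace roughly as follows. This is
only a sketch under stated assumptions: struct model_filter, its run
callback (standing in for a BPF program) and the helper functions are
hypothetical names, not code from the patch. A call is allowed only when
every filter in the chain, walked from the newest back to the oldest,
returns the full data length.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; 'run' stands in for a BPF program. */
struct model_filter {
	const struct model_filter *parent;	/* filter attached before this one */
	size_t (*run)(const unsigned char *regs, size_t len);
};

/* Accepts everything by returning the full data length. */
static size_t allow_all(const unsigned char *regs, size_t len)
{
	(void)regs;
	return len;
}

/* Accepts only when the first byte of the data is 1; returns 0 otherwise. */
static size_t allow_first_byte_one(const unsigned char *regs, size_t len)
{
	return (len > 0 && regs[0] == 1) ? len : 0;
}

/*
 * Returns 1 only if every filter from the newest back to the oldest
 * accepts. With no filter attached the call is denied.
 */
static int chain_allows(const struct model_filter *newest,
			const unsigned char *regs, size_t len)
{
	const struct model_filter *f;
	int allowed = (newest != NULL);

	for (f = newest; f != NULL; f = f->parent) {
		assert(f->run != NULL);
		if (f->run(regs, len) != len)
			allowed = 0;
	}
	return allowed;
}

/* Attach a strict filter first, then a permissive one on top of it. */
static int demo_chain_allows(unsigned char first_byte)
{
	const struct model_filter first = { NULL, allow_first_byte_one };
	const struct model_filter second = { &first, allow_all };
	const unsigned char regs[4] = { first_byte, 0, 0, 0 };

	return chain_allows(&second, regs, sizeof(regs));
}
```

In this model a later, more permissive filter cannot widen what an earlier
filter already denies, which is the property the paragraph above relies on.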

To avoid CONFIG_COMPAT related landmines, once a filter program is
installed using specific is_compat_task() and current->personality, it
is not allowed to make system calls or attach additional filters which
use a different combination of is_compat_task() and
current->personality.

Filter programs may _only_ cross the execve(2) barrier if the last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace. Once a task-local filter program is attached from a
process without privileges, execve will fail. This ensures that only a
privileged parent task can affect its privileged children (e.g., setuid
binary).
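
A minimal userspace model of this execve rule might look like the sketch
below. The names struct model_filter, model_check_exec() and
demo_exec_result() are hypothetical and exist only for this example; the
admin bit is assumed to have been recorded when the filter was attached.
Only the most recently attached filter is consulted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model; 'admin' records CAP_SYS_ADMIN at attach time. */
struct model_filter {
	const struct model_filter *parent;
	unsigned int admin:1;
};

/*
 * Returns 0 if execve may proceed and -EPERM otherwise.
 * mode is 0 (seccomp off), 1 (strict mode) or 2 (filter mode).
 */
static int model_check_exec(int mode, const struct model_filter *newest)
{
	assert(mode >= 0 && mode <= 2);
	if (mode != 2)
		return 0;
	if (newest == NULL)
		return -EPERM;
	return newest->admin ? 0 : -EPERM;
}

/* Attach a privileged filter, then optionally an unprivileged one on top. */
static int demo_exec_result(int layer_unprivileged)
{
	const struct model_filter priv = { .parent = NULL, .admin = 1 };
	const struct model_filter unpriv = { .parent = &priv, .admin = 0 };

	return model_check_exec(2, layer_unprivileged ? &unpriv : &priv);
}
```

Layering an unprivileged filter over a privileged one is enough to make
execve fail in this model, because the older filter's admin bit is never
examined.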

There are a number of benefits to this approach, a few of which are
as follows:
- BPF has been exposed to userland for a long time.
- Userland already knows its ABI: expected register layout and system
call numbers.
- Full register information is provided which may be relevant for
certain syscalls (fork, rt_sigreturn) or for other userland
filtering tactics (checking the PC).
- No time-of-check-time-of-use vulnerable data accesses are possible.

This patch includes its own BPF evaluator, but relies on the
net/core/filter.c BPF checking code. It is possible to share
evaluators, but the performance sensitive nature of the network
filtering path makes it an iterative optimization which (I think :) can
be tackled separately via separate patchsets. (And at some point sharing
BPF JIT code!)

Signed-off-by: Will Drewry <***@chromium.org>
---
fs/exec.c | 5 +
include/linux/prctl.h | 3 +
include/linux/seccomp.h | 70 +++++-
kernel/Makefile | 1 +
kernel/fork.c | 4 +
kernel/seccomp.c | 8 +
kernel/seccomp_filter.c | 639 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 4 +
security/Kconfig | 12 +
9 files changed, 743 insertions(+), 3 deletions(-)
create mode 100644 kernel/seccomp_filter.c

diff --git a/fs/exec.c b/fs/exec.c
index 3625464..e9cc89c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -44,6 +44,7 @@
#include <linux/namei.h>
#include <linux/mount.h>
#include <linux/security.h>
+#include <linux/seccomp.h>
#include <linux/syscalls.h>
#include <linux/tsacct_kern.h>
#include <linux/cn_proc.h>
@@ -1477,6 +1478,10 @@ static int do_execve_common(const char *filename,
if (retval)
goto out_ret;

+ retval = seccomp_check_exec();
+ if (retval)
+ goto out_ret;
+
retval = -ENOMEM;
bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
if (!bprm)
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..15e2460 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -64,6 +64,9 @@
#define PR_GET_SECCOMP 21
#define PR_SET_SECCOMP 22

+/* Set process seccomp filters */
+#define PR_ATTACH_SECCOMP_FILTER 36
+
/* Get/set the capability bounding set (as per security/commoncap.c) */
#define PR_CAPBSET_READ 23
#define PR_CAPBSET_DROP 24
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index cc7a4e9..99d163e 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,9 +5,28 @@
#ifdef CONFIG_SECCOMP

#include <linux/thread_info.h>
+#include <linux/types.h>
#include <asm/seccomp.h>

-typedef struct { int mode; } seccomp_t;
+struct seccomp_filter;
+/**
+ * struct seccomp_struct - the state of a seccomp'ed process
+ *
+ * @mode:
+ * if this is 0, seccomp is not in use.
+ * is 1, the process is under standard seccomp rules.
+ * is 2, the process is only allowed to make system calls where
+ * associated filters evaluate successfully.
+ * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
+ * @filter must only be accessed from the context of current as there
+ * is no guard.
+ */
+typedef struct seccomp_struct {
+ int mode;
+#ifdef CONFIG_SECCOMP_FILTER
+ struct seccomp_filter *filter;
+#endif
+} seccomp_t;

extern void __secure_computing(int);
static inline void secure_computing(int this_syscall)
@@ -28,8 +47,7 @@ static inline int seccomp_mode(seccomp_t *s)

#include <linux/errno.h>

-typedef struct { } seccomp_t;
-
+typedef struct seccomp_struct { } seccomp_t;
#define secure_computing(x) do { } while (0)

static inline long prctl_get_seccomp(void)
@@ -49,4 +67,50 @@ static inline int seccomp_mode(seccomp_t *s)

#endif /* CONFIG_SECCOMP */

+#ifdef CONFIG_SECCOMP_FILTER
+
+#define seccomp_filter_init_task(_tsk) do { \
+ (_tsk)->seccomp.filter = NULL; \
+} while (0)
+
+/* No locking is needed here because the task_struct will
+ * have no parallel consumers.
+ */
+#define seccomp_filter_free_task(_tsk) do { \
+ put_seccomp_filter((_tsk)->seccomp.filter); \
+} while (0)
+
+extern int seccomp_check_exec(void);
+
+extern long prctl_attach_seccomp_filter(char __user *);
+
+extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
+extern void put_seccomp_filter(struct seccomp_filter *);
+
+extern int seccomp_test_filters(int);
+extern void seccomp_filter_log_failure(int);
+extern void seccomp_filter_fork(struct task_struct *child,
+ struct task_struct *parent);
+
+#else /* CONFIG_SECCOMP_FILTER */
+
+#include <linux/errno.h>
+
+struct seccomp_filter { };
+#define seccomp_filter_init_task(_tsk) do { } while (0)
+#define seccomp_filter_fork(_tsk, _orig) do { } while (0)
+#define seccomp_filter_free_task(_tsk) do { } while (0)
+
+static inline int seccomp_check_exec(void)
+{
+ return 0;
+}
+
+
+static inline long prctl_attach_seccomp_filter(char __user *a2)
+{
+ return -ENOSYS;
+}
+
+#endif /* CONFIG_SECCOMP_FILTER */
#endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index e898c5b..0584090 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -79,6 +79,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TREE_RCU) += rcutree.o
obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index da4a6a1..cc1d628 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
#include <linux/cgroup.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/seccomp.h>
#include <linux/swap.h>
#include <linux/syscalls.h>
#include <linux/jiffies.h>
@@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
+ seccomp_filter_free_task(tsk);
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
/* Perform scheduler related setup. Assign this task to a CPU. */
sched_fork(p);

+ seccomp_filter_init_task(p);
retval = perf_event_init_task(p);
if (retval)
goto bad_fork_cleanup_policy;
@@ -1375,6 +1378,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
if (clone_flags & CLONE_THREAD)
threadgroup_fork_read_unlock(current);
perf_event_fork(p);
+ seccomp_filter_fork(p, current);
return p;

bad_fork_free_pid:
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..78719be 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -47,6 +47,14 @@ void __secure_computing(int this_syscall)
return;
} while (*++syscall);
break;
+#ifdef CONFIG_SECCOMP_FILTER
+ case 2:
+ if (seccomp_test_filters(this_syscall) == 0)
+ return;
+
+ seccomp_filter_log_failure(this_syscall);
+ break;
+#endif
default:
BUG();
}
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..4770847
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,639 @@
+/* bpf program-based system call filtering
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2011 The Chromium OS Authors <chromium-os-***@chromium.org>
+ */
+
+#include <linux/capability.h>
+#include <linux/compat.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/rculist.h>
+#include <linux/filter.h>
+#include <linux/kallsyms.h>
+#include <linux/kref.h>
+#include <linux/module.h>
+#include <linux/pid.h>
+#include <linux/prctl.h>
+#include <linux/ptrace.h>
+#include <linux/ratelimit.h>
+#include <linux/reciprocal_div.h>
+#include <linux/regset.h>
+#include <linux/seccomp.h>
+#include <linux/security.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/user.h>
+
+
+/**
+ * struct seccomp_filter - container for seccomp BPF programs
+ *
+ * @usage: reference count to manage the object lifetime.
+ * get/put helpers should be used when accessing an instance
+ * outside of a lifetime-guarded section. In general, this
+ * is only needed for handling filters shared across tasks.
+ * @creator: pointer to the pid that created this filter
+ * @parent: pointer to the ancestor which this filter will be composed with.
+ * @flags: provide information about filter from creation time.
+ * @personality: personality of the process at filter creation time.
+ * @insns: the BPF program instructions to evaluate
+ * @count: the number of instructions in the program.
+ *
+ * seccomp_filter objects should never be modified after being attached
+ * to a task_struct (other than @usage).
+ */
+struct seccomp_filter {
+ struct kref usage;
+ struct pid *creator;
+ struct seccomp_filter *parent;
+ struct {
+ uint32_t admin:1, /* can allow execve */
+ compat:1, /* CONFIG_COMPAT */
+ __reserved:30;
+ } flags;
+ int personality;
+ unsigned short count; /* Instruction count */
+ struct sock_filter insns[0];
+};
+
+static unsigned int seccomp_run_filter(const u8 *buf,
+ const size_t buflen,
+ const struct sock_filter *);
+
+/**
+ * seccomp_filter_alloc - allocates a new filter object
+ * @padding: size of the insns[0] array in bytes
+ *
+ * The @padding should be a multiple of
+ * sizeof(struct sock_filter).
+ *
+ * Returns ERR_PTR on error or an allocated object.
+ */
+static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
+{
+ struct seccomp_filter *f;
+ unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
+
+ /* Drop oversized requests. */
+ if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
+ return ERR_PTR(-EINVAL);
+
+ /* Padding should always be in sock_filter increments. */
+ BUG_ON(padding % sizeof(struct sock_filter));
+
+ f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
+ if (!f)
+ return ERR_PTR(-ENOMEM);
+ kref_init(&f->usage);
+ f->creator = get_task_pid(current, PIDTYPE_PID);
+ f->count = bpf_blocks;
+ return f;
+}
+
+/**
+ * seccomp_filter_free - frees the allocated filter.
+ * @filter: NULL or live object to be completely destructed.
+ */
+static void seccomp_filter_free(struct seccomp_filter *filter)
+{
+ if (!filter)
+ return;
+ put_seccomp_filter(filter->parent);
+ put_pid(filter->creator);
+ kfree(filter);
+}
+
+static void __put_seccomp_filter(struct kref *kref)
+{
+ struct seccomp_filter *orig =
+ container_of(kref, struct seccomp_filter, usage);
+ seccomp_filter_free(orig);
+}
+
+void seccomp_filter_log_failure(int syscall)
+{
+ pr_info("%s[%d]: system call %d blocked at 0x%lx\n",
+ current->comm, task_pid_nr(current), syscall,
+ KSTK_EIP(current));
+}
+
+/* put_seccomp_filter - decrements the ref count of @orig and may free. */
+void put_seccomp_filter(struct seccomp_filter *orig)
+{
+ if (!orig)
+ return;
+ kref_put(&orig->usage, __put_seccomp_filter);
+}
+
+/* get_seccomp_filter - increments the reference count of @orig. */
+struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
+{
+ if (!orig)
+ return NULL;
+ kref_get(&orig->usage);
+ return orig;
+}
+
+static int seccomp_check_personality(struct seccomp_filter *filter)
+{
+ if (filter->personality != current->personality)
+ return -EACCES;
+#ifdef CONFIG_COMPAT
+ if (filter->flags.compat != (!!(is_compat_task())))
+ return -EACCES;
+#endif
+ return 0;
+}
+
+static const struct user_regset *
+find_prstatus(const struct user_regset_view *view)
+{
+ const struct user_regset *regset;
+ int n;
+
+ /* Skip 0. */
+ for (n = 1; n < view->n; ++n) {
+ regset = view->regsets + n;
+ if (regset->core_note_type == NT_PRSTATUS)
+ return regset;
+ }
+
+ return NULL;
+}
+
+/**
+ * seccomp_get_regs - returns a pointer to struct user_regs_struct
+ * @scratch: preallocated storage of size @available
+ * @available: pointer to the size of scratch.
+ *
+ * Returns NULL if the registers cannot be acquired or copied.
+ * Returns a populated pointer to @scratch by default.
+ * Otherwise, returns a pointer to a u8 array containing the struct
+ * user_regs_struct appropriate for the task personality. The pointer
+ * may be to the beginning of @scratch or to an externally managed data
+ * structure. On success, @available should be updated with the
+ * valid region size of the returned pointer.
+ *
+ * If the architecture overrides the linkage, then the pointer may point to
+ * another location.
+ */
+__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
+{
+ /* regset is usually returned based on task personality, not current
+ * system call convention. This behavior makes it unsafe to execute
+ * BPF programs over regviews if is_compat_task or the personality
+ * have changed since the program was installed.
+ */
+ const struct user_regset_view *view = task_user_regset_view(current);
+ const struct user_regset *regset = &view->regsets[0];
+ size_t scratch_size = *available;
+ if (regset->core_note_type != NT_PRSTATUS) {
+ /* The architecture should override this method for speed. */
+ regset = find_prstatus(view);
+ if (!regset)
+ return NULL;
+ }
+ *available = regset->n * regset->size;
+ /* Make sure the scratch space isn't exceeded. */
+ if (*available > scratch_size)
+ *available = scratch_size;
+ if (regset->get(current, regset, 0, *available, scratch, NULL))
+ return NULL;
+ return scratch;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ * @syscall: number of the system call to test
+ *
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(int syscall)
+{
+ struct seccomp_filter *filter;
+ u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
+ size_t regs_size = sizeof(struct user_regs_struct);
+ int ret = -EACCES;
+
+ filter = current->seccomp.filter; /* uses task ref */
+ if (!filter)
+ goto out;
+
+ /* All filters in the list are required to share the same system call
+ * convention so only the first filter is ever checked.
+ */
+ if (seccomp_check_personality(filter))
+ goto out;
+
+ /* Grab the user_regs_struct. Normally, regs == regs_tmp, but
+ * that is not mandatory. E.g., it may return a pointer to
+ * task_pt_regs(current). NULL checking is mandatory.
+ */
+ regs = seccomp_get_regs(regs_tmp, &regs_size);
+ if (!regs)
+ goto out;
+
+ /* Only allow a system call if it is allowed in all ancestors. */
+ ret = 0;
+ for ( ; filter != NULL; filter = filter->parent) {
+ /* Allowed if return value is the size of the data supplied. */
+ if (seccomp_run_filter(regs, regs_size, filter->insns) !=
+ regs_size)
+ ret = -EACCES;
+ }
+out:
+ return ret;
+}
+
+/**
+ * seccomp_attach_filter: Attaches a seccomp filter to current.
+ * @fprog: BPF program to install
+ *
+ * Context: User context only. This function may sleep on allocation and
+ * operates on current. current must be attempting a system call
+ * when this is called (usually prctl).
+ *
+ * This function may be called repeatedly to install additional filters.
+ * Every filter successfully installed will be evaluated (in reverse order)
+ * for each system call the thread makes.
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+long seccomp_attach_filter(struct sock_fprog *fprog)
+{
+ struct seccomp_filter *filter = NULL;
+ /* Note, len is a short so overflow should be impossible. */
+ unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
+ long ret = -EPERM;
+
+ /* Allocate a new seccomp_filter */
+ filter = seccomp_filter_alloc(fp_size);
+ if (IS_ERR(filter)) {
+ ret = PTR_ERR(filter);
+ goto out;
+ }
+
+ /* Lock the process personality and calling convention. */
+#ifdef CONFIG_COMPAT
+ if (is_compat_task())
+ filter->flags.compat = 1;
+#endif
+ filter->personality = current->personality;
+
+ /* Auditing is not needed since the capability wasn't requested */
+ if (security_real_capable_noaudit(current, current_user_ns(),
+ CAP_SYS_ADMIN) == 0)
+ filter->flags.admin = 1;
+
+ /* Copy the instructions from fprog. */
+ ret = -EFAULT;
+ if (copy_from_user(filter->insns, fprog->filter, fp_size))
+ goto out;
+
+ /* Check the fprog */
+ ret = sk_chk_filter(filter->insns, filter->count);
+ if (ret)
+ goto out;
+
+ /* Force all filters to use one system call convention. This is
+ * checked before linking the parent so that the error path cannot
+ * drop a reference it does not own.
+ */
+ ret = -EINVAL;
+ if (current->seccomp.filter) {
+ if (current->seccomp.filter->flags.compat != filter->flags.compat)
+ goto out;
+ if (current->seccomp.filter->personality != filter->personality)
+ goto out;
+ }
+ /* Make any existing filter the parent, reusing the task-based ref. */
+ filter->parent = current->seccomp.filter;
+
+ /* Double claim the new filter so we can release it below simplifying
+ * the error paths earlier.
+ */
+ ret = 0;
+ get_seccomp_filter(filter);
+ current->seccomp.filter = filter;
+ /* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
+ if (!current->seccomp.mode) {
+ current->seccomp.mode = 2;
+ set_thread_flag(TIF_SECCOMP);
+ }
+
+out:
+ put_seccomp_filter(filter); /* for get or task, on err */
+ return ret;
+}
+
+long prctl_attach_seccomp_filter(char __user *user_filter)
+{
+ struct sock_fprog fprog;
+ long ret = -EINVAL;
+
+ ret = -EFAULT;
+ if (!user_filter)
+ goto out;
+
+ if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
+ goto out;
+
+ ret = seccomp_attach_filter(&fprog);
+out:
+ return ret;
+}
+
+/**
+ * seccomp_check_exec: determines if exec is allowed for current
+ * Returns 0 if allowed.
+ */
+int seccomp_check_exec(void)
+{
+ if (current->seccomp.mode != 2)
+ return 0;
+ /* We can rely on the task refcount for the filter. */
+ if (!current->seccomp.filter)
+ return -EPERM;
+ /* The last attached filter set for the process is checked. It must
+ * have been installed with CAP_SYS_ADMIN capabilities.
+ */
+ if (current->seccomp.filter->flags.admin)
+ return 0;
+ return -EPERM;
+}
+
+/* seccomp_filter_fork: manages inheritance on fork
+ * @child: forkee
+ * @parent: forker
+ * Ensures that @child inherits the seccomp mode and filters of @parent iff
+ * seccomp is enabled for @parent.
+ */
+void seccomp_filter_fork(struct task_struct *child,
+ struct task_struct *parent)
+{
+ if (!parent->seccomp.mode)
+ return;
+ child->seccomp.mode = parent->seccomp.mode;
+ child->seccomp.filter = get_seccomp_filter(parent->seccomp.filter);
+}
+
+/* Returns a pointer into @buf for the BPF evaluator after checking the
+ * offset and size boundaries, or NULL if the access is out of bounds. The
+ * signature almost matches the signature from net/core/filter.c with the
+ * hopes of sharing code in the future.
+ */
+static const void *load_pointer(const u8 *buf, size_t buflen,
+ int offset, size_t size,
+ void *unused)
+{
+ if (offset >= buflen)
+ goto fail;
+ if (offset < 0)
+ goto fail;
+ if (size > buflen - offset)
+ goto fail;
+ return buf + offset;
+fail:
+ return NULL;
+}
+
+/**
+ * seccomp_run_filter - evaluate BPF (over user_regs_struct)
+ * @buf: buffer to execute the filter over
+ * @buflen: length of the buffer
+ * @fentry: filter to apply
+ *
+ * Decode and apply filter instructions to the buffer.
+ * Return length to keep, 0 for none. @buf is a regset we are
+ * filtering, @fentry is the array of filter instructions.
+ * Because all jumps are guaranteed to be before last instruction,
+ * and last instruction guaranteed to be a RET, we don't need to check
+ * flen.
+ *
+ * See net/core/filter.c as this is nearly an exact copy.
+ * At some point, it would be nice to merge them to take advantage of
+ * optimizations (like JIT).
+ *
+ * A successful filter must return the full length of the data. Anything less
+ * will currently result in a seccomp failure. In the future, it may be
+ * possible to use that for hard filtering registers on the fly so it is
+ * ideal for consumers to return 0 on intended failure.
+ */
+static unsigned int seccomp_run_filter(const u8 *buf,
+ const size_t buflen,
+ const struct sock_filter *fentry)
+{
+ const void *ptr;
+ u32 A = 0; /* Accumulator */
+ u32 X = 0; /* Index Register */
+ u32 mem[BPF_MEMWORDS]; /* Scratch Memory Store */
+ u32 tmp;
+ int k;
+
+ /*
+ * Process array of filter instructions.
+ */
+ for (;; fentry++) {
+#if defined(CONFIG_X86_32)
+#define K (fentry->k)
+#else
+ const u32 K = fentry->k;
+#endif
+
+ switch (fentry->code) {
+ case BPF_S_ALU_ADD_X:
+ A += X;
+ continue;
+ case BPF_S_ALU_ADD_K:
+ A += K;
+ continue;
+ case BPF_S_ALU_SUB_X:
+ A -= X;
+ continue;
+ case BPF_S_ALU_SUB_K:
+ A -= K;
+ continue;
+ case BPF_S_ALU_MUL_X:
+ A *= X;
+ continue;
+ case BPF_S_ALU_MUL_K:
+ A *= K;
+ continue;
+ case BPF_S_ALU_DIV_X:
+ if (X == 0)
+ return 0;
+ A /= X;
+ continue;
+ case BPF_S_ALU_DIV_K:
+ A = reciprocal_divide(A, K);
+ continue;
+ case BPF_S_ALU_AND_X:
+ A &= X;
+ continue;
+ case BPF_S_ALU_AND_K:
+ A &= K;
+ continue;
+ case BPF_S_ALU_OR_X:
+ A |= X;
+ continue;
+ case BPF_S_ALU_OR_K:
+ A |= K;
+ continue;
+ case BPF_S_ALU_LSH_X:
+ A <<= X;
+ continue;
+ case BPF_S_ALU_LSH_K:
+ A <<= K;
+ continue;
+ case BPF_S_ALU_RSH_X:
+ A >>= X;
+ continue;
+ case BPF_S_ALU_RSH_K:
+ A >>= K;
+ continue;
+ case BPF_S_ALU_NEG:
+ A = -A;
+ continue;
+ case BPF_S_JMP_JA:
+ fentry += K;
+ continue;
+ case BPF_S_JMP_JGT_K:
+ fentry += (A > K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGE_K:
+ fentry += (A >= K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JEQ_K:
+ fentry += (A == K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JSET_K:
+ fentry += (A & K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGT_X:
+ fentry += (A > X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGE_X:
+ fentry += (A >= X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JEQ_X:
+ fentry += (A == X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JSET_X:
+ fentry += (A & X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_LD_W_ABS:
+ k = K;
+load_w:
+ ptr = load_pointer(buf, buflen, k, 4, &tmp);
+ if (ptr != NULL) {
+ /* Note, unlike on network data, values are not
+ * byte swapped.
+ */
+ A = *(const u32 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_H_ABS:
+ k = K;
+load_h:
+ ptr = load_pointer(buf, buflen, k, 2, &tmp);
+ if (ptr != NULL) {
+ A = *(const u16 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_B_ABS:
+ k = K;
+load_b:
+ ptr = load_pointer(buf, buflen, k, 1, &tmp);
+ if (ptr != NULL) {
+ A = *(const u8 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_W_LEN:
+ A = buflen;
+ continue;
+ case BPF_S_LDX_W_LEN:
+ X = buflen;
+ continue;
+ case BPF_S_LD_W_IND:
+ k = X + K;
+ goto load_w;
+ case BPF_S_LD_H_IND:
+ k = X + K;
+ goto load_h;
+ case BPF_S_LD_B_IND:
+ k = X + K;
+ goto load_b;
+ case BPF_S_LDX_B_MSH:
+ ptr = load_pointer(buf, buflen, K, 1, &tmp);
+ if (ptr != NULL) {
+ X = (*(u8 *)ptr & 0xf) << 2;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_IMM:
+ A = K;
+ continue;
+ case BPF_S_LDX_IMM:
+ X = K;
+ continue;
+ case BPF_S_LD_MEM:
+ A = mem[K];
+ continue;
+ case BPF_S_LDX_MEM:
+ X = mem[K];
+ continue;
+ case BPF_S_MISC_TAX:
+ X = A;
+ continue;
+ case BPF_S_MISC_TXA:
+ A = X;
+ continue;
+ case BPF_S_RET_K:
+ return K;
+ case BPF_S_RET_A:
+ return A;
+ case BPF_S_ST:
+ mem[K] = A;
+ continue;
+ case BPF_S_STX:
+ mem[K] = X;
+ continue;
+ case BPF_S_ANC_PROTOCOL:
+ case BPF_S_ANC_PKTTYPE:
+ case BPF_S_ANC_IFINDEX:
+ case BPF_S_ANC_MARK:
+ case BPF_S_ANC_QUEUE:
+ case BPF_S_ANC_HATYPE:
+ case BPF_S_ANC_RXHASH:
+ case BPF_S_ANC_CPU:
+ case BPF_S_ANC_NLATTR:
+ case BPF_S_ANC_NLATTR_NEST:
+ /* ignored */
+ continue;
+ default:
+ WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
+ fentry->code, fentry->jt,
+ fentry->jf, fentry->k);
+ return 0;
+ }
+ }
+
+ return 0;
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 481611f..77f2eda 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1783,6 +1783,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_SET_SECCOMP:
error = prctl_set_seccomp(arg2);
break;
+ case PR_ATTACH_SECCOMP_FILTER:
+ error = prctl_attach_seccomp_filter((char __user *)
+ arg2);
+ break;
case PR_GET_TSC:
error = GET_TSC_CTL(arg2);
break;
diff --git a/security/Kconfig b/security/Kconfig
index 51bd5a0..77b1106 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -84,6 +84,18 @@ config SECURITY_DMESG_RESTRICT

If you are unsure how to answer this question, answer N.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ select SECCOMP
+ depends on EXPERIMENTAL
+ help
+ This kernel feature expands CONFIG_SECCOMP to allow computing
+ in environments with reduced kernel access dictated by a system
+ call filter, expressed in BPF, installed by the application itself
+ through prctl(2).
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config SECURITY
bool "Enable different security models"
depends on SYSFS
--
1.7.5.4

--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Serge Hallyn
2012-01-12 08:53:16 UTC
Post by Will Drewry
This patch adds support for seccomp mode 2. This mode enables dynamic
enforcement of system call filtering policy in the kernel as specified
by a userland task. The policy is expressed in terms of a BPF program,
as is used for userland-exposed socket filtering. Instead of network
data, the BPF program is evaluated over struct user_regs_struct at the
time of the system call (as retrieved using regviews).
A filter program may be installed by a userland task by calling
prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
where fprog is of type struct sock_fprog.
If the first filter program allows subsequent prctl(2) calls, then
additional filter programs may be attached. All attached programs
must be evaluated before a system call will be allowed to proceed.
To avoid CONFIG_COMPAT related landmines, once a filter program is
installed using specific is_compat_task() and current->personality, it
is not allowed to make system calls or attach additional filters which
use a different combination of is_compat_task() and
current->personality.
Filter programs may _only_ cross the execve(2) barrier if last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace. Once a task-local filter program is attached from a
process without privileges, execve will fail. This ensures that only
privileged parent task can affect its privileged children (e.g., setuid
binary).
There are a number of benefits to this approach. A few of which are
- BPF has been exposed to userland for a long time.
- Userland already knows its ABI: expected register layout and system
call numbers.
- Full register information is provided which may be relevant for
certain syscalls (fork, rt_sigreturn) or for other userland
filtering tactics (checking the PC).
- No time-of-check-time-of-use vulnerable data accesses are possible.
This patch includes its own BPF evaluator, but relies on the
net/core/filter.c BPF checking code. It is possible to share
evaluators, but the performance sensitive nature of the network
filtering path makes it an iterative optimization which (I think :) can
be tackled separately via separate patchsets. (And at some point sharing
BPF JIT code!)
Hey Will,

A few comments below, but otherwise

Acked-by: Serge Hallyn <***@canonical.com>

thanks,
-serge
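
For readers following along, the rule quoted above (every attached program is evaluated, and the call proceeds only if all of them accept) can be modeled in user space. This is only a sketch, not the patch's code: struct filter_model, test_filters_model, allow_all and deny_all are hypothetical names, and a plain function pointer stands in for the BPF program. As in the patch, a filter accepts by returning the full size of the register data.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the patch's struct seccomp_filter: each filter
 * points at the one that was attached before it. */
struct filter_model {
    const struct filter_model *parent;
    /* Returns how many bytes of the register data the filter accepts. */
    size_t (*run)(const unsigned char *regs, size_t regs_size);
};

/* Returns 0 when every filter in the chain accepts the full register
 * buffer, and -EACCES otherwise (including when no filter is attached). */
static int test_filters_model(const struct filter_model *newest,
                              const unsigned char *regs, size_t regs_size)
{
    int ret = 0;

    if (newest == NULL)
        return -EACCES;
    /* Walk from the newest filter back through its ancestors. */
    for (const struct filter_model *f = newest; f != NULL; f = f->parent) {
        if (f->run(regs, regs_size) != regs_size)
            ret = -EACCES;
    }
    return ret;
}

/* Example filter that accepts everything. */
static size_t allow_all(const unsigned char *regs, size_t regs_size)
{
    (void)regs;
    return regs_size;
}

/* Example filter that rejects everything. */
static size_t deny_all(const unsigned char *regs, size_t regs_size)
{
    (void)regs;
    (void)regs_size;
    return 0;
}
```

A single rejecting filter anywhere in the chain is enough to block the call, so attaching further filters can only narrow what is allowed.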
Post by Will Drewry
---
fs/exec.c | 5 +
include/linux/prctl.h | 3 +
include/linux/seccomp.h | 70 +++++-
kernel/Makefile | 1 +
kernel/fork.c | 4 +
kernel/seccomp.c | 8 +
kernel/seccomp_filter.c | 639 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 4 +
security/Kconfig | 12 +
9 files changed, 743 insertions(+), 3 deletions(-)
create mode 100644 kernel/seccomp_filter.c
diff --git a/fs/exec.c b/fs/exec.c
index 3625464..e9cc89c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -44,6 +44,7 @@
#include <linux/namei.h>
#include <linux/mount.h>
#include <linux/security.h>
+#include <linux/seccomp.h>
#include <linux/syscalls.h>
#include <linux/tsacct_kern.h>
#include <linux/cn_proc.h>
@@ -1477,6 +1478,10 @@ static int do_execve_common(const char *filename,
if (retval)
goto out_ret;
+ retval = seccomp_check_exec();
+ if (retval)
+ goto out_ret;
+
retval = -ENOMEM;
bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
if (!bprm)
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..15e2460 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -64,6 +64,9 @@
#define PR_GET_SECCOMP 21
#define PR_SET_SECCOMP 22
+/* Set process seccomp filters */
+#define PR_ATTACH_SECCOMP_FILTER 36
+
/* Get/set the capability bounding set (as per security/commoncap.c) */
#define PR_CAPBSET_READ 23
#define PR_CAPBSET_DROP 24
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index cc7a4e9..99d163e 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,9 +5,28 @@
#ifdef CONFIG_SECCOMP
#include <linux/thread_info.h>
+#include <linux/types.h>
#include <asm/seccomp.h>
-typedef struct { int mode; } seccomp_t;
+struct seccomp_filter;
+/**
+ * struct seccomp_struct - the state of a seccomp'ed process
+ *
+ * if this is 0, seccomp is not in use.
+ * is 1, the process is under standard seccomp rules.
+ * is 2, the process is only allowed to make system calls where
+ * associated filters evaluate successfully.
+ * is no guard.
+ */
+typedef struct seccomp_struct {
+ int mode;
+#ifdef CONFIG_SECCOMP_FILTER
+ struct seccomp_filter *filter;
+#endif
+} seccomp_t;
extern void __secure_computing(int);
static inline void secure_computing(int this_syscall)
@@ -28,8 +47,7 @@ static inline int seccomp_mode(seccomp_t *s)
#include <linux/errno.h>
-typedef struct { } seccomp_t;
-
+typedef struct seccomp_struct { } seccomp_t;
#define secure_computing(x) do { } while (0)
static inline long prctl_get_seccomp(void)
@@ -49,4 +67,50 @@ static inline int seccomp_mode(seccomp_t *s)
#endif /* CONFIG_SECCOMP */
+#ifdef CONFIG_SECCOMP_FILTER
+
+#define seccomp_filter_init_task(_tsk) do { \
+ (_tsk)->seccomp.filter = NULL; \
+} while (0);
+
+/* No locking is needed here because the task_struct will
+ * have no parallel consumers.
+ */
+#define seccomp_filter_free_task(_tsk) do { \
+ put_seccomp_filter((_tsk)->seccomp.filter); \
+} while (0);
+
+extern int seccomp_check_exec(void);
+
+extern long prctl_attach_seccomp_filter(char __user *);
+
+extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
+extern void put_seccomp_filter(struct seccomp_filter *);
+
+extern int seccomp_test_filters(int);
+extern void seccomp_filter_log_failure(int);
+extern void seccomp_filter_fork(struct task_struct *child,
+ struct task_struct *parent);
+
+#else /* CONFIG_SECCOMP_FILTER */
+
+#include <linux/errno.h>
+
+struct seccomp_filter { };
+#define seccomp_filter_init_task(_tsk) do { } while (0);
+#define seccomp_filter_fork(_tsk, _orig) do { } while (0);
+#define seccomp_filter_free_task(_tsk) do { } while (0);
+
+static inline int seccomp_check_exec(void)
+{
+ return 0;
+}
+
+
+static inline long prctl_attach_seccomp_filter(char __user *a2)
+{
+ return -ENOSYS;
+}
+
+#endif /* CONFIG_SECCOMP_FILTER */
#endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index e898c5b..0584090 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -79,6 +79,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TREE_RCU) += rcutree.o
obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index da4a6a1..cc1d628 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
#include <linux/cgroup.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/seccomp.h>
#include <linux/swap.h>
#include <linux/syscalls.h>
#include <linux/jiffies.h>
@@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
+ seccomp_filter_free_task(tsk);
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
/* Perform scheduler related setup. Assign this task to a CPU. */
sched_fork(p);
+ seccomp_filter_init_task(p);
retval = perf_event_init_task(p);
if (retval)
goto bad_fork_cleanup_policy;
@@ -1375,6 +1378,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
if (clone_flags & CLONE_THREAD)
threadgroup_fork_read_unlock(current);
perf_event_fork(p);
+ seccomp_filter_fork(p, current);
return p;
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..78719be 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -47,6 +47,14 @@ void __secure_computing(int this_syscall)
return;
} while (*++syscall);
break;
+#ifdef CONFIG_SECCOMP_FILTER
+ case 2:
+ if (seccomp_test_filters(this_syscall) == 0)
+ return;
+
+ seccomp_filter_log_failure(this_syscall);
+ break;
+#endif
default:
BUG();
}
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..4770847
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,639 @@
+/* bpf program-based system call filtering
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ */
+
+#include <linux/capability.h>
+#include <linux/compat.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/rculist.h>
+#include <linux/filter.h>
+#include <linux/kallsyms.h>
+#include <linux/kref.h>
+#include <linux/module.h>
+#include <linux/pid.h>
+#include <linux/prctl.h>
+#include <linux/ptrace.h>
+#include <linux/ratelimit.h>
+#include <linux/reciprocal_div.h>
+#include <linux/regset.h>
+#include <linux/seccomp.h>
+#include <linux/security.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/user.h>
+
+
+/**
+ * struct seccomp_filter - container for seccomp BPF programs
+ *
+ * get/put helpers should be used when accessing an instance
+ * outside of a lifetime-guarded section. In general, this
+ * is only needed for handling filters shared across tasks.
+ *
+ * seccomp_filter objects should never be modified after being attached
+ */
+struct seccomp_filter {
+ struct kref usage;
+ struct pid *creator;
+ struct seccomp_filter *parent;
+ struct {
+ uint32_t admin:1, /* can allow execve */
+ compat:1, /* CONFIG_COMPAT */
+ __reserved:30;
+ } flags;
+ int personality;
+ unsigned short count; /* Instruction count */
+ struct sock_filter insns[0];
+};
+
+static unsigned int seccomp_run_filter(const u8 *buf,
+ const size_t buflen,
+ const struct sock_filter *);
+
+/**
+ * seccomp_filter_alloc - allocates a new filter object
+ *
+ * sizeof(struct sock_filter).
+ *
+ * Returns ERR_PTR on error or an allocated object.
+ */
+static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
+{
+ struct seccomp_filter *f;
+ unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
+
+ /* Drop oversized requests. */
+ if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
+ return ERR_PTR(-EINVAL);
+
+ /* Padding should always be in sock_filter increments. */
+ BUG_ON(padding % sizeof(struct sock_filter));
I still think the BUG_ON here is harsh given that the progsize is passed
in by userspace. Was there a reason not to return -EINVAL here?
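
One way to read this suggestion, as a user-space sketch only: treat a size that is not a whole number of instructions as invalid input and return -EINVAL, alongside the existing empty/oversized check. The names insn_model, MODEL_MAXINSNS and check_prog_size are hypothetical; insn_model is an 8-byte stand-in for struct sock_filter and 4096 mirrors BPF_MAXINSNS.

```c
#include <errno.h>
#include <stdint.h>

/* Stand-in for struct sock_filter (8 bytes); an assumption of this sketch. */
struct insn_model {
    uint16_t code;
    uint8_t jt;
    uint8_t jf;
    uint32_t k;
};

#define MODEL_MAXINSNS 4096UL /* mirrors BPF_MAXINSNS */

/* Validates a program size given in bytes.
 * Returns the instruction count on success or -EINVAL on a bad size. */
static long check_prog_size(unsigned long padding)
{
    unsigned long blocks = padding / sizeof(struct insn_model);

    /* Reject a size that is not a whole number of instructions instead
     * of treating it as a kernel bug. */
    if (padding % sizeof(struct insn_model))
        return -EINVAL;
    /* Reject empty and oversized programs. */
    if (blocks == 0 || blocks > MODEL_MAXINSNS)
        return -EINVAL;
    return (long)blocks;
}
```

With this shape the allocation helper has a single failure mode for bad sizes, and the caller's existing IS_ERR() handling would cover it.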
Post by Will Drewry
+
+ f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
+ if (!f)
+ return ERR_PTR(-ENOMEM);
+ kref_init(&f->usage);
+ f->creator = get_task_pid(current, PIDTYPE_PID);
+ f->count = bpf_blocks;
+ return f;
+}
+
+/**
+ * seccomp_filter_free - frees the allocated filter.
+ */
+static void seccomp_filter_free(struct seccomp_filter *filter)
+{
+ if (!filter)
+ return;
+ put_seccomp_filter(filter->parent);
+ put_pid(filter->creator);
+ kfree(filter);
+}
+
+static void __put_seccomp_filter(struct kref *kref)
+{
+ struct seccomp_filter *orig =
+ container_of(kref, struct seccomp_filter, usage);
+ seccomp_filter_free(orig);
+}
+
+void seccomp_filter_log_failure(int syscall)
+{
+ pr_info("%s[%d]: system call %d blocked at 0x%lx\n",
+ current->comm, task_pid_nr(current), syscall,
+ KSTK_EIP(current));
+}
+
+void put_seccomp_filter(struct seccomp_filter *orig)
+{
+ if (!orig)
+ return;
+ kref_put(&orig->usage, __put_seccomp_filter);
+}
+
+struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
+{
+ if (!orig)
+ return NULL;
+ kref_get(&orig->usage);
+ return orig;
+}
+
+static int seccomp_check_personality(struct seccomp_filter *filter)
+{
+ if (filter->personality != current->personality)
+ return -EACCES;
+#ifdef CONFIG_COMPAT
+ if (filter->flags.compat != (!!(is_compat_task())))
+ return -EACCES;
+#endif
+ return 0;
+}
+
+static const struct user_regset *
+find_prstatus(const struct user_regset_view *view)
+{
+ const struct user_regset *regset;
+ int n;
+
+ /* Skip 0. */
+ for (n = 1; n < view->n; ++n) {
+ regset = view->regsets + n;
+ if (regset->core_note_type == NT_PRSTATUS)
+ return regset;
+ }
+
+ return NULL;
+}
+
+/**
+ * seccomp_get_regs - returns a pointer to struct user_regs_struct
+ *
+ * Returns NULL if the registers cannot be acquired or copied.
+ * Otherwise, returns a pointer to a u8 array containing the struct
+ * user_regs_struct appropriate for the task personality. The pointer
+ * valid region size of the returned pointer.
+ *
+ * If the architecture overrides the linkage, then the pointer may point to
+ * another location.
+ */
+__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
+{
+ /* regset is usually returned based on task personality, not current
+ * system call convention. This behavior makes it unsafe to execute
+ * BPF programs over regviews if is_compat_task or the personality
+ * have changed since the program was installed.
+ */
+ const struct user_regset_view *view = task_user_regset_view(current);
+ const struct user_regset *regset = &view->regsets[0];
+ size_t scratch_size = *available;
+ if (regset->core_note_type != NT_PRSTATUS) {
+ /* The architecture should override this method for speed. */
+ regset = find_prstatus(view);
+ if (!regset)
+ return NULL;
+ }
+ *available = regset->n * regset->size;
+ /* Make sure the scratch space isn't exceeded. */
+ if (*available > scratch_size)
+ *available = scratch_size;
+ if (regset->get(current, regset, 0, *available, scratch, NULL))
+ return NULL;
+ return scratch;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ *
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(int syscall)
+{
+ struct seccomp_filter *filter;
+ u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
+ size_t regs_size = sizeof(struct user_regs_struct);
+ int ret = -EACCES;
+
+ filter = current->seccomp.filter; /* uses task ref */
+ if (!filter)
+ goto out;
+
+ /* All filters in the list are required to share the same system call
+ * convention so only the first filter is ever checked.
+ */
+ if (seccomp_check_personality(filter))
+ goto out;
+
+ /* Grab the user_regs_struct. Normally, regs == &regs_tmp, but
+ * that is not mandatory. E.g., it may return a pointer to
+ * task_pt_regs(current). NULL checking is mandatory.
+ */
+ regs = seccomp_get_regs(regs_tmp, &regs_size);
+ if (!regs)
+ goto out;
+
+ /* Only allow a system call if it is allowed in all ancestors. */
+ ret = 0;
+ for ( ; filter != NULL; filter = filter->parent) {
+ /* Allowed if return value is the size of the data supplied. */
+ if (seccomp_run_filter(regs, regs_size, filter->insns) !=
+ regs_size)
+ ret = -EACCES;
+ }
+out:
+ return ret;
+}
+
+/**
+ * seccomp_attach_filter: Attaches a seccomp filter to current.
+ *
+ * Context: User context only. This function may sleep on allocation and
+ * operates on current. current must be attempting a system call
+ * when this is called (usually prctl).
+ *
+ * This function may be called repeatedly to install additional filters.
+ * Every filter successfully installed will be evaluated (in reverse order)
+ * for each system call the thread makes.
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+long seccomp_attach_filter(struct sock_fprog *fprog)
+{
+ struct seccomp_filter *filter = NULL;
+ /* Note, len is a short so overflow should be impossible. */
+ unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
+ long ret = -EPERM;
+
+ /* Allocate a new seccomp_filter */
+ filter = seccomp_filter_alloc(fp_size);
+ if (IS_ERR(filter)) {
+ ret = PTR_ERR(filter);
+ goto out;
+ }
+
+ /* Lock the process personality and calling convention. */
+#ifdef CONFIG_COMPAT
+ if (is_compat_task())
+ filter->flags.compat = 1;
+#endif
+ filter->personality = current->personality;
+
+ /* Auditing is not needed since the capability wasn't requested */
+ if (security_real_capable_noaudit(current, current_user_ns(),
+ CAP_SYS_ADMIN) == 0)
+ filter->flags.admin = 1;
+
+ /* Copy the instructions from fprog. */
+ ret = -EFAULT;
+ if (copy_from_user(filter->insns, fprog->filter, fp_size))
+ goto out;
+
+ /* Check the fprog */
+ ret = sk_chk_filter(filter->insns, filter->count);
+ if (ret)
+ goto out;
+
+ /* If there is an existing filter, make it the parent
+ * and reuse the existing task-based ref.
+ */
+ filter->parent = current->seccomp.filter;
+
+ /* Force all filters to use one system call convention. */
+ ret = -EINVAL;
+ if (filter->parent) {
+ if (filter->parent->flags.compat != filter->flags.compat)
+ goto out;
+ if (filter->parent->personality != filter->personality)
+ goto out;
+ }
+
+ /* Double claim the new filter so we can release it below simplifying
+ * the error paths earlier.
+ */
+ ret = 0;
+ get_seccomp_filter(filter);
+ current->seccomp.filter = filter;
+ /* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
+ if (!current->seccomp.mode) {
+ current->seccomp.mode = 2;
+ set_thread_flag(TIF_SECCOMP);
+ }
+
+out:
+ put_seccomp_filter(filter); /* for get or task, on err */
+ return ret;
+}
+
+long prctl_attach_seccomp_filter(char __user *user_filter)
+{
+ struct sock_fprog fprog;
+ long ret = -EINVAL;
+
+ ret = -EFAULT;
+ if (!user_filter)
+ goto out;
+
+ if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
+ goto out;
+
+ ret = seccomp_attach_filter(&fprog);
+out:
+ return ret;
+}
+
+/**
+ * seccomp_check_exec: determines if exec is allowed for current
+ * Returns 0 if allowed.
+ */
+int seccomp_check_exec(void)
+{
+ if (current->seccomp.mode != 2)
+ return 0;
+ /* We can rely on the task refcount for the filter. */
+ if (!current->seccomp.filter)
+ return -EPERM;
+ /* The last attached filter set for the process is checked. It must
+ * have been installed with CAP_SYS_ADMIN capabilities.
This comment is confusing. By 'It must' you mean that if not, it's
denied. But if I didn't know better I would read that as "we can't
get to this code unless". Can you change it to something like
"Exec is refused unless the filter was installed with CAP_SYS_ADMIN
privilege"?
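
To make the intended reading concrete, here is a user-space model of the check with the suggested comment wording in place. It is a sketch under assumptions, not the patch itself: exec_filter_model, seccomp_model and check_exec_model are hypothetical stand-ins for the kernel structures and for seccomp_check_exec().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct exec_filter_model {
    bool admin; /* attached by a task holding CAP_SYS_ADMIN */
};

struct seccomp_model {
    int mode;                               /* 2 means filter mode */
    const struct exec_filter_model *filter; /* most recently attached */
};

/* Returns 0 if exec may proceed, -EPERM otherwise. */
static int check_exec_model(const struct seccomp_model *s)
{
    if (s->mode != 2)
        return 0;
    if (s->filter == NULL)
        return -EPERM;
    /* Exec is refused unless the most recently attached filter was
     * installed with CAP_SYS_ADMIN privilege. */
    if (s->filter->admin)
        return 0;
    return -EPERM;
}
```

Phrased this way, the comment describes a condition the code enforces rather than one it assumes.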
Post by Will Drewry
+ */
+ if (current->seccomp.filter->flags.admin)
+ return 0;
+ return -EPERM;
+}
+
+/* seccomp_filter_fork: manages inheritance on fork
+ * and the set of filters is marked as 'enabled'.
+ */
+void seccomp_filter_fork(struct task_struct *child,
+ struct task_struct *parent)
+{
+ if (!parent->seccomp.mode)
+ return;
+ child->seccomp.mode = parent->seccomp.mode;
+ child->seccomp.filter = get_seccomp_filter(parent->seccomp.filter);
+}
+
+/* Returns a pointer to the BPF evaluator after checking the offset and size
+ * boundaries. The signature almost matches the signature from
+ * net/core/filter.c with the hopes of sharing code in the future.
+ */
+static const void *load_pointer(const u8 *buf, size_t buflen,
+ int offset, size_t size,
+ void *unused)
+{
+ if (offset >= buflen)
+ goto fail;
+ if (offset < 0)
+ goto fail;
+ if (size > buflen - offset)
+ goto fail;
+ return buf + offset;
+fail:
+ return NULL;
+}
+
+/**
+ * seccomp_run_filter - evaluate BPF (over user_regs_struct)
+ *
+ * Decode and apply filter instructions to the buffer.
+ * Because all jumps are guaranteed to be before last instruction,
+ * and last instruction guaranteed to be a RET, we dont need to check
+ * flen.
+ *
+ * See net/core/filter.c as this is nearly an exact copy.
+ * At some point, it would be nice to merge them to take advantage of
+ * optimizations (like JIT).
+ *
+ * A successful filter must return the full length of the data. Anything less
+ * will currently result in a seccomp failure. In the future, it may be
+ * possible to use that for hard filtering registers on the fly so it is
+ * ideal for consumers to return 0 on intended failure.
+ */
+static unsigned int seccomp_run_filter(const u8 *buf,
+ const size_t buflen,
+ const struct sock_filter *fentry)
+{
+ const void *ptr;
+ u32 A = 0; /* Accumulator */
+ u32 X = 0; /* Index Register */
+ u32 mem[BPF_MEMWORDS]; /* Scratch Memory Store */
+ u32 tmp;
+ int k;
+
+ /*
+ * Process array of filter instructions.
+ */
+ for (;; fentry++) {
+#if defined(CONFIG_X86_32)
+#define K (fentry->k)
+#else
+ const u32 K = fentry->k;
+#endif
+
+ switch (fentry->code) {
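+ case BPF_S_ALU_ADD_X:
+ A += X;
+ continue;
+ case BPF_S_ALU_ADD_K:
+ A += K;
+ continue;
+ case BPF_S_ALU_SUB_X:
+ A -= X;
+ continue;
+ case BPF_S_ALU_SUB_K:
+ A -= K;
+ continue;
+ case BPF_S_ALU_MUL_X:
+ A *= X;
+ continue;
+ case BPF_S_ALU_MUL_K:
+ A *= K;
+ continue;
+ case BPF_S_ALU_DIV_X:
+ if (X == 0)
+ return 0;
+ A /= X;
+ continue;
+ case BPF_S_ALU_DIV_K:
+ A = reciprocal_divide(A, K);
+ continue;
+ case BPF_S_ALU_AND_X:
+ A &= X;
+ continue;
+ case BPF_S_ALU_AND_K:
+ A &= K;
+ continue;
+ case BPF_S_ALU_OR_X:
+ A |= X;
+ continue;
+ case BPF_S_ALU_OR_K:
+ A |= K;
+ continue;
+ case BPF_S_ALU_LSH_X:
+ A <<= X;
+ continue;
+ case BPF_S_ALU_LSH_K:
+ A <<= K;
+ continue;
+ case BPF_S_ALU_RSH_X:
+ A >>= X;
+ continue;
+ case BPF_S_ALU_RSH_K:
+ A >>= K;
+ continue;
+ case BPF_S_ALU_NEG:
+ A = -A;
+ continue;
+ case BPF_S_JMP_JA:
+ fentry += K;
+ continue;
+ case BPF_S_JMP_JGT_K:
+ fentry += (A > K) ? fentry->jt : fentry->jf;
+ continue;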
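+ case BPF_S_JMP_JGE_K:
+ fentry += (A >= K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JEQ_K:
+ fentry += (A == K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JSET_K:
+ fentry += (A & K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGT_X:
+ fentry += (A > X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGE_X:
+ fentry += (A >= X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JEQ_X:
+ fentry += (A == X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JSET_X:
+ fentry += (A & X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_LD_W_ABS:
+ k = K;
+load_w:
+ ptr = load_pointer(buf, buflen, k, 4, &tmp);
+ if (ptr != NULL) {
+ /* Note, unlike on network data, values are not
+ * byte swapped.
+ */
+ A = *(const u32 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_H_ABS:
+ k = K;
+load_h:
+ ptr = load_pointer(buf, buflen, k, 2, &tmp);
+ if (ptr != NULL) {
+ A = *(const u16 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_B_ABS:
+ k = K;
+load_b:
+ ptr = load_pointer(buf, buflen, k, 1, &tmp);
+ if (ptr != NULL) {
+ A = *(const u8 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_W_LEN:
+ A = buflen;
+ continue;
+ case BPF_S_LDX_W_LEN:
+ X = buflen;
+ continue;
+ case BPF_S_LD_W_IND:
+ k = X + K;
+ goto load_w;
+ case BPF_S_LD_H_IND:
+ k = X + K;
+ goto load_h;
+ case BPF_S_LD_B_IND:
+ k = X + K;
+ goto load_b;
+ case BPF_S_LDX_B_MSH:
+ ptr = load_pointer(buf, buflen, K, 1, &tmp);
+ if (ptr != NULL) {
+ X = (*(u8 *)ptr & 0xf) << 2;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_IMM:
+ A = K;
+ continue;
+ case BPF_S_LDX_IMM:
+ X = K;
+ continue;
+ case BPF_S_LD_MEM:
+ A = mem[K];
+ continue;
+ case BPF_S_LDX_MEM:
+ X = mem[K];
+ continue;
+ case BPF_S_MISC_TAX:
+ X = A;
+ continue;
+ case BPF_S_MISC_TXA:
+ A = X;
+ continue;
+ case BPF_S_RET_K:
+ return K;
+ case BPF_S_RET_A:
+ return A;
+ case BPF_S_ST:
+ mem[K] = A;
+ continue;
+ case BPF_S_STX:
+ mem[K] = X;
+ continue;
+ case BPF_S_ANC_PROTOCOL:
+ case BPF_S_ANC_PKTTYPE:
+ case BPF_S_ANC_IFINDEX:
+ case BPF_S_ANC_MARK:
+ case BPF_S_ANC_QUEUE:
+ case BPF_S_ANC_HATYPE:
+ case BPF_S_ANC_RXHASH:
+ case BPF_S_ANC_CPU:
+ case BPF_S_ANC_NLATTR:
+ case BPF_S_ANC_NLATTR_NEST:
+ /* ignored */
+ continue;
+ default:
+ WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
+ fentry->code, fentry->jt,
+ fentry->jf, fentry->k);
+ return 0;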
+ }
+ }
+
+ return 0;
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 481611f..77f2eda 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1783,6 +1783,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_SET_SECCOMP:
error = prctl_set_seccomp(arg2);
break;
+ case PR_ATTACH_SECCOMP_FILTER:
+ error = prctl_attach_seccomp_filter((char __user *)
+ arg2);
+ break;
case PR_GET_TSC:
error = GET_TSC_CTL(arg2);
break;
diff --git a/security/Kconfig b/security/Kconfig
index 51bd5a0..77b1106 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -84,6 +84,18 @@ config SECURITY_DMESG_RESTRICT
If you are unsure how to answer this question, answer N.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ select SECCOMP
+ depends on EXPERIMENTAL
+ help
+ This kernel feature expands CONFIG_SECCOMP to allow computing
+ in environments with reduced kernel access dictated by a system
+ call filter, expressed in BPF, installed by the application itself
+ through prctl(2).
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config SECURITY
bool "Enable different security models"
depends on SYSFS
--
1.7.5.4
Will Drewry
2012-01-12 16:54:16 UTC
On Thu, Jan 12, 2012 at 2:53 AM, Serge Hallyn
Post by Serge Hallyn
This patch adds support for seccomp mode 2. This mode enables dynamic
enforcement of system call filtering policy in the kernel as specified
by a userland task. The policy is expressed in terms of a BPF program,
as is used for userland-exposed socket filtering. Instead of network
data, the BPF program is evaluated over struct user_regs_struct at the
time of the system call (as retrieved using regviews).
A filter program may be installed by a userland task by calling
prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
where fprog is of type struct sock_fprog.
If the first filter program allows subsequent prctl(2) calls, then
additional filter programs may be attached. All attached programs
must be evaluated before a system call will be allowed to proceed.
To avoid CONFIG_COMPAT related landmines, once a filter program is
installed using specific is_compat_task() and current->personality, it
is not allowed to make system calls or attach additional filters which
use a different combination of is_compat_task() and
current->personality.
Filter programs may _only_ cross the execve(2) barrier if last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace. Once a task-local filter program is attached from a
process without privileges, execve will fail. This ensures that only
privileged parent task can affect its privileged children (e.g., setuid
binary).
There are a number of benefits to this approach. A few of which are
- BPF has been exposed to userland for a long time.
- Userland already knows its ABI: expected register layout and system
call numbers.
- Full register information is provided which may be relevant for
certain syscalls (fork, rt_sigreturn) or for other userland
filtering tactics (checking the PC).
- No time-of-check-time-of-use vulnerable data accesses are possible.
This patch includes its own BPF evaluator, but relies on the
net/core/filter.c BPF checking code. It is possible to share
evaluators, but the performance sensitive nature of the network
filtering path makes it an iterative optimization which (I think :) can
be tackled separately via separate patchsets. (And at some point sharing
BPF JIT code!)
Hey Will,
A few comments below, but otherwise
Thanks! Unimportant responses below. Fixes will be incorporated in
the next round (along with Oleg's feedback).

cheers,
will
Post by Serge Hallyn
thanks,
-serge
---
fs/exec.c | 5 +
include/linux/prctl.h | 3 +
include/linux/seccomp.h | 70 +++++-
kernel/Makefile | 1 +
kernel/fork.c | 4 +
kernel/seccomp.c | 8 +
kernel/seccomp_filter.c | 639 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 4 +
security/Kconfig | 12 +
9 files changed, 743 insertions(+), 3 deletions(-)
create mode 100644 kernel/seccomp_filter.c
diff --git a/fs/exec.c b/fs/exec.c
index 3625464..e9cc89c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -44,6 +44,7 @@
#include <linux/namei.h>
#include <linux/mount.h>
#include <linux/security.h>
+#include <linux/seccomp.h>
#include <linux/syscalls.h>
#include <linux/tsacct_kern.h>
#include <linux/cn_proc.h>
@@ -1477,6 +1478,10 @@ static int do_execve_common(const char *filename,
if (retval)
goto out_ret;
+ retval = seccomp_check_exec();
+ if (retval)
+ goto out_ret;
+
retval = -ENOMEM;
bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
if (!bprm)
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..15e2460 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -64,6 +64,9 @@
#define PR_GET_SECCOMP 21
#define PR_SET_SECCOMP 22
+/* Set process seccomp filters */
+#define PR_ATTACH_SECCOMP_FILTER 36
+
/* Get/set the capability bounding set (as per security/commoncap.c) */
#define PR_CAPBSET_READ 23
#define PR_CAPBSET_DROP 24
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index cc7a4e9..99d163e 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,9 +5,28 @@
#ifdef CONFIG_SECCOMP
#include <linux/thread_info.h>
+#include <linux/types.h>
#include <asm/seccomp.h>
-typedef struct { int mode; } seccomp_t;
+struct seccomp_filter;
+/**
+ * struct seccomp_struct - the state of a seccomp'ed process
+ *
+ * if this is 0, seccomp is not in use.
+ * is 1, the process is under standard seccomp rules.
+ * is 2, the process is only allowed to make system calls where
+ * associated filters evaluate successfully.
t of current as there
+ * is no guard.
+ */
+typedef struct seccomp_struct {
+ int mode;
+#ifdef CONFIG_SECCOMP_FILTER
+ struct seccomp_filter *filter;
+#endif
+} seccomp_t;
extern void __secure_computing(int);
static inline void secure_computing(int this_syscall)
@@ -28,8 +47,7 @@ static inline int seccomp_mode(seccomp_t *s)
#include <linux/errno.h>
-typedef struct { } seccomp_t;
-
+typedef struct seccomp_struct { } seccomp_t;
#define secure_computing(x) do { } while (0)
static inline long prctl_get_seccomp(void)
@@ -49,4 +67,50 @@ static inline int seccomp_mode(seccomp_t *s)
#endif /* CONFIG_SECCOMP */
+#ifdef CONFIG_SECCOMP_FILTER
+
+#define seccomp_filter_init_task(_tsk) do { \
+ =A0 =A0 (_tsk)->seccomp.filter =3D NULL; \
+} while (0);
+
+/* No locking is needed here because the task_struct will
+ * have no parallel consumers.
+ */
+#define seccomp_filter_free_task(_tsk) do { \
+ =A0 =A0 put_seccomp_filter((_tsk)->seccomp.filter); \
+} while (0);
+
+extern int seccomp_check_exec(void);
+
+extern long prctl_attach_seccomp_filter(char __user *);
+
+extern struct seccomp_filter *get_seccomp_filter(struct seccomp_fil=
ter *);
Post by Serge Hallyn
+extern void put_seccomp_filter(struct seccomp_filter *);
+
+extern int seccomp_test_filters(int);
+extern void seccomp_filter_log_failure(int);
+extern void seccomp_filter_fork(struct task_struct *child,
+ =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 struct tas=
k_struct *parent);
Post by Serge Hallyn
+
+#else =A0/* CONFIG_SECCOMP_FILTER */
+
+#include <linux/errno.h>
+
+struct seccomp_filter { };
+#define seccomp_filter_init_task(_tsk) do { } while (0);
+#define seccomp_filter_fork(_tsk, _orig) do { } while (0);
+#define seccomp_filter_free_task(_tsk) do { } while (0);
+
+static inline int seccomp_check_exec(void)
+{
+ =A0 =A0 return 0;
+}
+
+
+static inline long prctl_attach_seccomp_filter(char __user *a2)
+{
+ =A0 =A0 return -ENOSYS;
+}
+
+#endif =A0/* CONFIG_SECCOMP_FILTER */
=A0#endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index e898c5b..0584090 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -79,6 +79,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_TREE_RCU) += rcutree.o
 obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index da4a6a1..cc1d628 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
 #include <linux/cgroup.h>
 #include <linux/security.h>
 #include <linux/hugetlb.h>
+#include <linux/seccomp.h>
 #include <linux/swap.h>
 #include <linux/syscalls.h>
 #include <linux/jiffies.h>
@@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
 	free_thread_info(tsk->stack);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
+	seccomp_filter_free_task(tsk);
 	free_task_struct(tsk);
 }
 EXPORT_SYMBOL(free_task);
@@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	sched_fork(p);
+	seccomp_filter_init_task(p);
 	retval = perf_event_init_task(p);
 	if (retval)
 		goto bad_fork_cleanup_policy;
@@ -1375,6 +1378,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	if (clone_flags & CLONE_THREAD)
 		threadgroup_fork_read_unlock(current);
 	perf_event_fork(p);
+	seccomp_filter_fork(p, current);
 	return p;
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..78719be 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -47,6 +47,14 @@ void __secure_computing(int this_syscall)
 			return;
 		} while (*++syscall);
 		break;
+#ifdef CONFIG_SECCOMP_FILTER
+		if (seccomp_test_filters(this_syscall) == 0)
+			return;
+
+		seccomp_filter_log_failure(this_syscall);
+		break;
+#endif
 		BUG();
 	}
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..4770847
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,639 @@
+/* bpf program-based system call filtering
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
mium.org>
+ */
+
+#include <linux/capability.h>
+#include <linux/compat.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/rculist.h>
+#include <linux/filter.h>
+#include <linux/kallsyms.h>
+#include <linux/kref.h>
+#include <linux/module.h>
+#include <linux/pid.h>
+#include <linux/prctl.h>
+#include <linux/ptrace.h>
+#include <linux/ratelimit.h>
+#include <linux/reciprocal_div.h>
+#include <linux/regset.h>
+#include <linux/seccomp.h>
+#include <linux/security.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/user.h>
+
+
+/**
+ * struct seccomp_filter - container for seccomp BPF programs
+ *
+ *         get/put helpers should be used when accessing an instance
+ *         outside of a lifetime-guarded section.  In general, this
+ *         is only needed for handling filters shared across tasks.
sed with.
.
+ *
+ * seccomp_filter objects should never be modified after being attached
+ */
+struct seccomp_filter {
+	struct kref usage;
+	struct pid *creator;
+	struct seccomp_filter *parent;
+	struct {
+		uint32_t admin:1,  /* can allow execve */
+			 compat:1,  /* CONFIG_COMPAT */
+			 __reserved:30;
+	} flags;
+	int personality;
+	unsigned short count;  /* Instruction count */
+	struct sock_filter insns[0];
+};
+
+static unsigned int seccomp_run_filter(const u8 *buf,
+				       const size_t buflen,
+				       const struct sock_filter *);
+
+/**
+ * seccomp_filter_alloc - allocates a new filter object
+ *
+ * sizeof(struct sock_filter).
+ *
+ * Returns ERR_PTR on error or an allocated object.
+ */
+static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
+{
+	struct seccomp_filter *f;
+	unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
+
+	/* Drop oversized requests. */
+	if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
+		return ERR_PTR(-EINVAL);
+
+	/* Padding should always be in sock_filter increments. */
+	BUG_ON(padding % sizeof(struct sock_filter));
I still think the BUG_ON here is harsh given that the progsize is passed
in by userspace.  Was there a reason not to return -EINVAL here?
I've changed it in the next revision. As is, I don't believe userspace
can control the size of padding directly, just the increment, since it
specifies its length in terms of bpf blocks (sizeof(struct sock_filter)).
But EINVAL is certainly less aggressive :)
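For illustration only, the size check discussed above could return an error instead of calling BUG_ON(). This is a minimal userspace sketch, not the code from the next revision: the sketch_* names are hypothetical, the struct is a stand-in for struct sock_filter, and the limit mirrors BPF_MAXINSNS (4096).

```c
#include <errno.h>
#include <stddef.h>

/* Stand-in for struct sock_filter (8 bytes); hypothetical name. */
struct sketch_sock_filter {
	unsigned short code;
	unsigned char jt;
	unsigned char jf;
	unsigned int k;
};

/* Mirrors BPF_MAXINSNS from <linux/filter.h>. */
#define SKETCH_BPF_MAXINSNS 4096

/* Returns the instruction count on success or a negative errno. */
static long sketch_check_padding(unsigned long padding)
{
	unsigned long blocks = padding / sizeof(struct sketch_sock_filter);

	/* Drop empty and oversized requests. */
	if (blocks == 0 || blocks > SKETCH_BPF_MAXINSNS)
		return -EINVAL;
	/* Reject a partial instruction rather than crashing the kernel. */
	if (padding % sizeof(struct sketch_sock_filter))
		return -EINVAL;
	return (long)blocks;
}
```

With this shape a malformed length is reported to the caller like any other invalid argument.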
Post by Serge Hallyn
+
+	f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
+	if (!f)
+		return ERR_PTR(-ENOMEM);
+	kref_init(&f->usage);
+	f->creator = get_task_pid(current, PIDTYPE_PID);
+	f->count = bpf_blocks;
+	return f;
+}
+
+/**
+ * seccomp_filter_free - frees the allocated filter.
+ */
+static void seccomp_filter_free(struct seccomp_filter *filter)
+{
+	if (!filter)
+		return;
+	put_seccomp_filter(filter->parent);
+	put_pid(filter->creator);
+	kfree(filter);
+}
+
+static void __put_seccomp_filter(struct kref *kref)
+{
+	struct seccomp_filter *orig =
+		container_of(kref, struct seccomp_filter, usage);
+	seccomp_filter_free(orig);
+}
+
+void seccomp_filter_log_failure(int syscall)
+{
+	pr_info("%s[%d]: system call %d blocked at 0x%lx\n",
+		current->comm, task_pid_nr(current), syscall,
+		KSTK_EIP(current));
+}
+
ree. */
+void put_seccomp_filter(struct seccomp_filter *orig)
+{
+	if (!orig)
+		return;
+	kref_put(&orig->usage, __put_seccomp_filter);
+}
+
+struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
+{
+	if (!orig)
+		return NULL;
+	kref_get(&orig->usage);
+	return orig;
+}
+
+static int seccomp_check_personality(struct seccomp_filter *filter)
+{
+	if (filter->personality != current->personality)
+		return -EACCES;
+#ifdef CONFIG_COMPAT
+	if (filter->flags.compat != (!!(is_compat_task())))
+		return -EACCES;
+#endif
+	return 0;
+}
+
+static const struct user_regset *
+find_prstatus(const struct user_regset_view *view)
+{
+	const struct user_regset *regset;
+	int n;
+
+	/* Skip 0. */
+	for (n = 1; n < view->n; ++n) {
+		regset = view->regsets + n;
+		if (regset->core_note_type == NT_PRSTATUS)
+			return regset;
+	}
+
+	return NULL;
+}
+
+/**
+ * seccomp_get_regs - returns a pointer to struct user_regs_struct
+ *
+ * Returns NULL if the registers cannot be acquired or copied.
+ * Otherwise, returns a pointer to a a u8 array containing the struct
+ * user_regs_struct appropriate for the task personality.  The pointer
data
+ * valid region size of the returned pointer.
+ *
+ * If the architecture overrides the linkage, then the pointer may pointer to
+ * another location.
+ */
+__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
+{
+	/* regset is usually returned based on task personality, not current
+	 * system call convention.  This behavior makes it unsafe to execute
+	 * BPF programs over regviews if is_compat_task or the personality
+	 * have changed since the program was installed.
+	 */
+	const struct user_regset_view *view = task_user_regset_view(current);
+	const struct user_regset *regset = &view->regsets[0];
+	size_t scratch_size = *available;
+	if (regset->core_note_type != NT_PRSTATUS) {
+		/* The architecture should override this method for speed. */
+		regset = find_prstatus(view);
+		if (!regset)
+			return NULL;
+	}
+	*available = regset->n * regset->size;
+	/* Make sure the scratch space isn't exceeded. */
+	if (*available > scratch_size)
+		*available = scratch_size;
+	if (regset->get(current, regset, 0, *available, scratch, NULL))
+		return NULL;
+	return scratch;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ *
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(int syscall)
+{
+	struct seccomp_filter *filter;
+	u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
+	size_t regs_size = sizeof(struct user_regs_struct);
+	int ret = -EACCES;
+
+	filter = current->seccomp.filter; /* uses task ref */
+	if (!filter)
+		goto out;
+
+	/* All filters in the list are required to share the same system call
+	 * convention so only the first filter is ever checked.
+	 */
+	if (seccomp_check_personality(filter))
+		goto out;
+
+	/* Grab the user_regs_struct.  Normally, regs == &regs_tmp, but
+	 * that is not mandatory.  E.g., it may return a point to
+	 * task_pt_regs(current).  NULL checking is mandatory.
+	 */
+	regs = seccomp_get_regs(regs_tmp, &regs_size);
+	if (!regs)
+		goto out;
+
+	/* Only allow a system call if it is allowed in all ancestors. */
+	ret = 0;
+	for ( ; filter != NULL; filter = filter->parent) {
+		/* Allowed if return value is the size of the data supplied. */
+		if (seccomp_run_filter(regs, regs_size, filter->insns) !=
+		    regs_size)
+			ret = -EACCES;
+	}
+	return ret;
+}
+
+/**
+ * seccomp_attach_filter: Attaches a seccomp filter to current.
+ *
+ * Context: User context only. This function may sleep on allocation and
+ *          operates on current. current must be attempting a system call
+ *          when this is called (usually prctl).
+ *
+ * This function may be called repeatedly to install additional filters.
+ * Every filter successfully installed will be evaluated (in reverse order)
+ * for each system call the thread makes.
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+long seccomp_attach_filter(struct sock_fprog *fprog)
+{
+	struct seccomp_filter *filter = NULL;
+	/* Note, len is a short so overflow should be impossible. */
+	unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
+	long ret = -EPERM;
+
+	/* Allocate a new seccomp_filter */
+	filter = seccomp_filter_alloc(fp_size);
+	if (IS_ERR(filter)) {
+		ret = PTR_ERR(filter);
+		goto out;
+	}
+
+	/* Lock the process personality and calling convention. */
+#ifdef CONFIG_COMPAT
+	if (is_compat_task())
+		filter->flags.compat = 1;
+#endif
+	filter->personality = current->personality;
+
+	/* Auditing is not needed since the capability wasn't requested */
+	if (security_real_capable_noaudit(current, current_user_ns(),
+					  CAP_SYS_ADMIN) == 0)
+		filter->flags.admin = 1;
+
+	/* Copy the instructions from fprog. */
+	ret = -EFAULT;
+	if (copy_from_user(filter->insns, fprog->filter, fp_size))
+		goto out;
+
+	/* Check the fprog */
+	ret = sk_chk_filter(filter->insns, filter->count);
+	if (ret)
+		goto out;
+
+	/* If there is an existing filter, make it the parent
+	 * and reuse the existing task-based ref.
+	 */
+	filter->parent = current->seccomp.filter;
+
+	/* Force all filters to use one system call convention. */
+	ret = -EINVAL;
+	if (filter->parent) {
+		if (filter->parent->flags.compat != filter->flags.compat)
+			goto out;
+		if (filter->parent->personality != filter->personality)
+			goto out;
+	}
+
+	/* Double claim the new filter so we can release it below simplifying
+	 * the error paths earlier.
+	 */
+	ret = 0;
+	get_seccomp_filter(filter);
+	current->seccomp.filter = filter;
+	/* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
+	if (!current->seccomp.mode) {
+		current->seccomp.mode = 2;
+		set_thread_flag(TIF_SECCOMP);
+	}
+
+	put_seccomp_filter(filter);  /* for get or task, on err */
+	return ret;
+}
+
+long prctl_attach_seccomp_filter(char __user *user_filter)
+{
+	struct sock_fprog fprog;
+	long ret = -EINVAL;
+
+	ret = -EFAULT;
+	if (!user_filter)
+		goto out;
+
+	if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
+		goto out;
+
+	ret = seccomp_attach_filter(&fprog);
+	return ret;
+}
+
+/**
+ * seccomp_check_exec: determines if exec is allowed for current
+ * Returns 0 if allowed.
+ */
+int seccomp_check_exec(void)
+{
+	if (current->seccomp.mode != 2)
+		return 0;
+	/* We can rely on the task refcount for the filter. */
+	if (!current->seccomp.filter)
+		return -EPERM;
+	/* The last attached filter set for the process is checked. It must
+	 * have been installed with CAP_SYS_ADMIN capabilities.
This comment is confusing.  By 'It must' you mean that if not, it's
denied.  But if I didn't know better I would read that as "we can't
get to this code unless".  Can you change it to something like
"Exec is refused unless the filter was installed with CAP_SYS_ADMIN
privilege"?
Sounds good!
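As a rough illustration of how the check reads with the suggested wording, here is a minimal userspace sketch. The sketch_* types are hypothetical stand-ins for the task's seccomp state, and this is not the code from the next revision of the patch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures in the patch. */
struct sketch_filter {
	unsigned int admin : 1;	/* set when installed with CAP_SYS_ADMIN */
};

struct sketch_seccomp {
	int mode;
	struct sketch_filter *filter;
};

/* Returns 0 if exec is allowed, -EPERM otherwise. */
static int sketch_check_exec(const struct sketch_seccomp *s)
{
	if (s->mode != 2)
		return 0;
	/* Exec is refused unless the filter was installed with
	 * CAP_SYS_ADMIN privilege.
	 */
	if (s->filter && s->filter->admin)
		return 0;
	return -EPERM;
}
```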
Post by Serge Hallyn
+	 */
+	if (current->seccomp.filter->flags.admin)
+		return 0;
+	return -EPERM;
+}
+
+/* seccomp_filter_fork: manages inheritance on fork
led
+ * and the set of filters is marked as 'enabled'.
+ */
+void seccomp_filter_fork(struct task_struct *child,
+			 struct task_struct *parent)
+{
+	if (!parent->seccomp.mode)
+		return;
+	child->seccomp.mode = parent->seccomp.mode;
+	child->seccomp.filter = get_seccomp_filter(parent->seccomp.filter);
+}
+
+/* Returns a pointer to the BPF evaluator after checking the offset and size
+ * boundaries.  The signature almost matches the signature from
+ * net/core/filter.c with the hopes of sharing code in the future.
+ */
+static const void *load_pointer(const u8 *buf, size_t buflen,
+				int offset, size_t size,
+				void *unused)
+{
+	if (offset >= buflen)
+		goto fail;
+	if (offset < 0)
+		goto fail;
+	if (size > buflen - offset)
+		goto fail;
+	return buf + offset;
+	return NULL;
+}
+
+/**
+ * seccomp_run_filter - evaluate BPF (over user_regs_struct)
+ *
+ * Decode and apply filter instructions to the buffer.
+ * Because all jumps are guaranteed to be before last instruction,
+ * and last instruction guaranteed to be a RET, we dont need to check
+ * flen.
+ *
+ * See core/net/filter.c as this is nearly an exact copy.
+ * At some point, it would be nice to merge them to take advantage of
+ * optimizations (like JIT).
+ *
+ * A successful filter must return the full length of the data. Anything less
+ * will currently result in a seccomp failure.  In the future, it may be
+ * possible to use that for hard filtering registers on the fly so it is
+ * ideal for consumers to return 0 on intended failure.
+ */
+static unsigned int seccomp_run_filter(const u8 *buf,
+				       const size_t buflen,
+				       const struct sock_filter *fentry)
+{
+	const void *ptr;
+	u32 A = 0;			/* Accumulator */
+	u32 X = 0;			/* Index Register */
+	u32 mem[BPF_MEMWORDS];		/* Scratch Memory Store */
+	u32 tmp;
+	int k;
+
+	/*
+	 * Process array of filter instructions.
+	 */
+	for (;; fentry++) {
+#if defined(CONFIG_X86_32)
+#define	K (fentry->k)
+#else
+		const u32 K = fentry->k;
+#endif
+
+		switch (fentry->code) {
+			A += X;
+			continue;
+			A += K;
+			continue;
+			A -= X;
+			continue;
+			A -= K;
+			continue;
+			A *= X;
+			continue;
+			A *= K;
+			continue;
+			if (X == 0)
+				return 0;
+			A /= X;
+			continue;
+			A = reciprocal_divide(A, K);
+			continue;
+			A &= X;
+			continue;
+			A &= K;
+			continue;
+			A |= X;
+			continue;
+			A |= K;
+			continue;
+			A <<= X;
+			continue;
+			A <<= K;
+			continue;
+			A >>= X;
+			continue;
+			A >>= K;
+			continue;
+			A = -A;
+			continue;
+			fentry += K;
+			continue;
+			fentry += (A > K) ? fentry->jt : fentry->jf;
+			continue;
+			fentry += (A >= K) ? fentry->jt : fentry->jf;
+			continue;
+			fentry += (A == K) ? fentry->jt : fentry->jf;
+			continue;
+			fentry += (A & K) ? fentry->jt : fentry->jf;
+			continue;
+			fentry += (A > X) ? fentry->jt : fentry->jf;
+			continue;
+			fentry += (A >= X) ? fentry->jt : fentry->jf;
+			continue;
+			fentry += (A == X) ? fentry->jt : fentry->jf;
+			continue;
+			fentry += (A & X) ? fentry->jt : fentry->jf;
+			continue;
+			k = K;
+			ptr = load_pointer(buf, buflen, k, 4, &tmp);
+			if (ptr != NULL) {
+				/* Note, unlike on network data, values are not
+				 * byte swapped.
+				 */
+				A = *(const u32 *)ptr;
+				continue;
+			}
+			return 0;
+			k = K;
+			ptr = load_pointer(buf, buflen, k, 2, &tmp);
+			if (ptr != NULL) {
+				A = *(const u16 *)ptr;
+				continue;
+			}
+			return 0;
+			k = K;
+			ptr = load_pointer(buf, buflen, k, 1, &tmp);
+			if (ptr != NULL) {
+				A = *(const u8 *)ptr;
+				continue;
+			}
+			return 0;
+			A = buflen;
+			continue;
+			X = buflen;
+			continue;
+			k = X + K;
+			goto load_w;
+			k = X + K;
+			goto load_h;
+			k = X + K;
+			goto load_b;
+			ptr = load_pointer(buf, buflen, K, 1, &tmp);
+			if (ptr != NULL) {
+				X = (*(u8 *)ptr & 0xf) << 2;
+				continue;
+			}
+			return 0;
+			A = K;
+			continue;
+			X = K;
+			continue;
+			A = mem[K];
+			continue;
+			X = mem[K];
+			continue;
+			X = A;
+			continue;
+			A = X;
+			continue;
+			return K;
+			return A;
+			mem[K] = A;
+			continue;
+			mem[K] = X;
+			continue;
+			/* ignored */
+			continue;
+			WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
+				       fentry->code, fentry->jt,
+				       fentry->jf, fentry->k);
+			return 0;
+		}
+	}
+
+	return 0;
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 481611f..77f2eda 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1783,6 +1783,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			error = prctl_set_seccomp(arg2);
 			break;
+			error = prctl_attach_seccomp_filter((char __user *)
+							    arg2);
+			break;
 			error = GET_TSC_CTL(arg2);
 			break;
diff --git a/security/Kconfig b/security/Kconfig
index 51bd5a0..77b1106 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -84,6 +84,18 @@ config SECURITY_DMESG_RESTRICT
 	  If you are unsure how to answer this question, answer N.
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	select SECCOMP
+	depends on EXPERIMENTAL
+	help
+	  This kernel feature expands CONFIG_SECCOMP to allow computing
+	  in environments with reduced kernel access dictated by a system
+	  call filter, expressed in BPF, installed by the application itself
+	  through prctl(2).
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 config SECURITY
 	bool "Enable different security models"
 	depends on SYSFS
--
1.7.5.4
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Oleg Nesterov
2012-01-12 14:50:40 UTC
Post by Will Drewry
This patch adds support for seccomp mode 2. This mode enables dynamic
enforcement of system call filtering policy in the kernel as specified
by a userland task. The policy is expressed in terms of a BPF program,
as is used for userland-exposed socket filtering. Instead of network
data, the BPF program is evaluated over struct user_regs_struct at the
time of the system call (as retrieved using regviews).
Cool ;)

I didn't really read this patch yet, just one nit.
Post by Will Drewry
+#define seccomp_filter_init_task(_tsk) do { \
+ (_tsk)->seccomp.filter = NULL; \
+} while (0);
Cosmetic and subjective, but imho it would be better to add inline
functions instead of define's.
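
To illustrate the nit: a minimal sketch of the inline-function form, using a hypothetical stand-in struct (seccomp_state and seccomp_filter_init are names invented here, not taken from the patch). Besides type-checking its argument, a function avoids a hazard in the macro as posted: the trailing semicolon after "while (0)" means that "if (c) seccomp_filter_init_task(t); else ..." expands to two statements and the else no longer parses.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures, for illustration only. */
struct seccomp_filter;

struct seccomp_state {
	int mode;
	struct seccomp_filter *filter;
};

/*
 * Inline-function form of the init helper.  The argument is type-checked
 * and evaluated exactly once, and there is no stray semicolon to break
 * an if/else at the call site.
 */
static inline void seccomp_filter_init(struct seccomp_state *s)
{
	s->filter = NULL;
}
```

Taking only the seccomp state, rather than a whole task, is one possible way to keep such a helper in a header without needing the full task definition; that choice is an assumption of this sketch.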
Post by Will Drewry
@@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
+ seccomp_filter_free_task(tsk);
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
/* Perform scheduler related setup. Assign this task to a CPU. */
sched_fork(p);
+ seccomp_filter_init_task(p);
This doesn't look right, or I missed something. seccomp_filter_init_task()
should be called right after dup_task_struct(), at least before copy_process() can
fail.

Otherwise copy_process()->free_fork()->seccomp_filter_free_task() can put
current->seccomp.filter copied by arch_dup_task_struct().
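
The ordering problem can be reproduced in a small userspace model. This is only a sketch under stated assumptions: struct filter, struct task and every helper below are simplified, hypothetical stand-ins, and the plain integer count stands in for the kref. The child is created by a struct copy, so until its pointer is cleared it aliases the parent's filter without holding a reference of its own.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins; not the kernel's definitions. */
struct filter { int usage; };
struct task { struct filter *filter; };

static void filter_put(struct filter *f)
{
	if (f && --f->usage == 0)
		free(f);
}

/* Models the struct copy made when a task is duplicated: the child
 * inherits the parent's pointer without taking a reference. */
static void dup_task(struct task *child, const struct task *parent)
{
	*child = *parent;
}

static void init_task_filter(struct task *t)
{
	t->filter = NULL;
}

static void free_task_filter(struct task *t)
{
	filter_put(t->filter);
	t->filter = NULL;
}

/*
 * Returns the filter's reference count after a fork attempt that fails
 * part-way through.  init_early selects whether the child's pointer is
 * cleared right after duplication, or would only have been cleared at
 * a later step that is never reached.
 */
static int failed_fork_refcount(int init_early)
{
	struct filter *f = malloc(sizeof(*f));
	struct task parent, child;
	int usage;

	if (!f)
		return -1;
	f->usage = 2;	/* extra reference keeps f valid for inspection */
	parent.filter = f;

	dup_task(&child, &parent);
	if (init_early)
		init_task_filter(&child);
	/* ... an intermediate step fails here, before any late init ... */
	free_task_filter(&child);

	usage = f->usage;
	free(f);
	return usage;
}
```

In this model, failed_fork_refcount(1) leaves the count at 2, while failed_fork_refcount(0) drops it to 1: the error path released a reference that belonged to the parent.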
Post by Will Drewry
+struct seccomp_filter {
+ struct kref usage;
+ struct pid *creator;
Why? seccomp_filter->creator is never used, no?

Oleg.

Will Drewry
2012-01-12 16:55:27 UTC
Permalink
Post by Oleg Nesterov
This patch adds support for seccomp mode 2. This mode enables dynamic
enforcement of system call filtering policy in the kernel as specified
by a userland task. The policy is expressed in terms of a BPF program,
as is used for userland-exposed socket filtering. Instead of network
data, the BPF program is evaluated over struct user_regs_struct at the
time of the system call (as retrieved using regviews).
Cool ;)
I didn't really read this patch yet, just one nit.
+#define seccomp_filter_init_task(_tsk) do { \
+     (_tsk)->seccomp.filter = NULL; \
+} while (0);
Cosmetic and subjective, but imho it would be better to add inline
functions instead of define's.
Refactoring it a bit to make that possible. Since seccomp fork/init/free
never needs access to the whole task_structs, I'll just pass in what's
needed (and avoid the sched.h inclusion recursion).

Comments on the next round will most definitely be appreciated!
Post by Oleg Nesterov
@@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
      free_thread_info(tsk->stack);
      rt_mutex_debug_task_free(tsk);
      ftrace_graph_exit_task(tsk);
+     seccomp_filter_free_task(tsk);
      free_task_struct(tsk);
 }
 EXPORT_SYMBOL(free_task);
@@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
      /* Perform scheduler related setup. Assign this task to a CPU. */
      sched_fork(p);
+     seccomp_filter_init_task(p);
This doesn't look right, or I missed something. seccomp_filter_init_task()
should be called right after dup_task_struct(), at least before copy_process() can
fail.
Otherwise copy_process()->free_fork()->seccomp_filter_free_task() can put
current->seccomp.filter copied by arch_dup_task_struct().
Ah - makes sense! I moved it under dup_task_struct before any goto's
to bad_fork_free.
Post by Oleg Nesterov
+struct seccomp_filter {
+     struct kref usage;
+     struct pid *creator;
Why? seccomp_filter->creator is never used, no?
Removing it. It is from a related patch I'm experimenting with (adding
optional tracehook support), but it has no bearing here.

Thanks - new patch revision incoming!
will
Steven Rostedt
2012-01-12 15:43:35 UTC
Permalink
Post by Will Drewry
Filter programs may _only_ cross the execve(2) barrier if last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace. Once a task-local filter program is attached from a
process without privileges, execve will fail. This ensures that only
privileged parent task can affect its privileged children (e.g., setuid
binary).
This means that a non privileged user can not run another program with
limited features? How would a process exec another program and filter
it? I would assume that the filter would need to be attached first and
then the execv() would be performed. But after the filter is attached,
the execv is prevented?

Maybe I don't understand this correctly.

-- Steve
Andrew Lutomirski
2012-01-12 16:14:25 UTC
Permalink
Filter programs may _only_ cross the execve(2) barrier if last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace. Once a task-local filter program is attached from a
process without privileges, execve will fail. This ensures that only
privileged parent task can affect its privileged children (e.g., setuid
binary).
This means that a non privileged user can not run another program with
limited features? How would a process exec another program and filter
it? I would assume that the filter would need to be attached first and
then the execv() would be performed. But after the filter is attached,
the execv is prevented?
Maybe I don't understand this correctly.
Time to resurrect execve_nosecurity? If so, then the rule could be
simplified to: seccomp programs cannot use normal execve at all.

The longer I linger on lists and see neat ideas like this, the more I
get annoyed that execve is magical. I dream of a distribution that
doesn't use setuid, file capabilities, selinux transitions on exec, or
any other privilege changes on exec *at all*. I think that the only
things missing in the kernel (other than something intelligent to do
about SELinux) are execve_nosecurity and the ability for a normal
program to wait for an unrelated program to finish (or some other way
that a program can ask a daemon to spawn a privileged program for it
and then to cleanly wait for that program to finish in a way that
could survive re-exec of the daemon).

--Andy
-- Steve
Steven Rostedt
2012-01-12 16:27:15 UTC
Permalink
Post by Andrew Lutomirski
The longer I linger on lists and see neat ideas like this, the more I
get annoyed that execve is magical. I dream of a distribution that
doesn't use setuid, file capabilities, selinux transitions on exec, or
any other privilege changes on exec *at all*.
Is that the fear with filtering on execv? That if we have filters on an
execv calling a setuid program that we change the behavior of that
privileged program and might cause unexpected results?

In that case, just have execv fail if filtering is enabled and we are
execing a setuid program. But I don't see why non "magical" execv's
should be prohibited.

-- Steve


Andrew Lutomirski
2012-01-12 16:51:38 UTC
Permalink
The longer I linger on lists and see neat ideas like this, the more I
get annoyed that execve is magical. I dream of a distribution that
doesn't use setuid, file capabilities, selinux transitions on exec, or
any other privilege changes on exec *at all*.
Is that the fear with filtering on execv? That if we have filters on an
execv calling a setuid program that we change the behavior of that
privileged program and might cause unexpected results?
Exactly.
In that case, just have execv fail if filtering is enabled and we are
execing a setuid program. But I don't see why non "magical" execv's
should be prohibited.
How do you define "non-magical"?

If setuid is set, then it's obviously magical. On a nosuid
filesystem, strange things happen. If file capabilities are enabled
and set, then different magic happens. With LSMs involved, anything
can be magical. (SELinux AFAICT looks up rules on every single exec,
so it might be impossible for execve to be non-magical.) If execve is
banned entirely when seccomp is enabled, then there will never be any
attacks based on abusing these mechanisms.

My proposal is to have an alternative mechanism that, from a security
POV, does nothing that the caller couldn't have done on its own. The
only reason it would be needed at all is because implementing execve
with correct semantics from userspace is a PITA -- the right set of
fds needs to be closed, threads need to be killed (without races),
vmas need to be found and unmapped, a new program needs to be mapped in
(possibly at the same place that the old one was mapped at),
/proc/self/exe needs to be updated, auxv needs to be recreated
(including using values that glibc might have erased already), etc.

The code is short and it works (although I have no idea whether it
applies to current kernels).

Oleg: my only issue with setting something like LSM_UNSAFE_SECCOMP is
that a different class of vulnerability might be introduced: take a
setuid program that calls other setuid programs (or just uses execve
as a way to get the default execve capability handling, SELinux
handling, etc), run it (as root!) inside seccomp, and watch it
possibly develop security holes. If the alternate execve is a
different syscall, then this can't happen. And if someone remaps
execve to execve_nosecurity (from userspace or via some in-kernel
mechanism) and causes problems, it's entirely clear who to blame.

--Andy
Linus Torvalds
2012-01-12 17:09:31 UTC
Permalink
Post by Steven Rostedt
In that case, just have execv fail if filtering is enabled and we are
execing a setuid program. But I don't see why non "magical" execv's
should be prohibited.
The whole "fail security escalations" thing goes way beyond just
filtering, I think we could seriously try to make it a generic
feature.

For example, somebody just asked me the other day why "chroot()"
requires admin privileges, since it would be good to limit even
non-root things.

And it's really the exact same issue as filtering: in some sense,
chroot() "filters" FS name lookups, and can be used to fool programs
that are written to be secure.

We could easily introduce a per-process flag that just says "cannot
escalate privileges". Which basically just disables execve() of
suid/sgid programs (and possibly other things too), and locks the
process to the current privileges. And then make the rule be that *if*
that flag is set, you can then filter across an execve, or chroot as a
normal user, or whatever.
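
A toy model of the idea, to make the intended properties concrete. Everything here is a hypothetical illustration (struct task, struct program and the three helpers are invented for this sketch and are not kernel interfaces): the flag can be set but never cleared, it is copied to children, and it turns an exec of a setuid program into a failure while leaving ordinary execs alone.

```c
#include <stdbool.h>

/* Hypothetical toy model of a task; none of these names are kernel APIs. */
struct task {
	unsigned int euid;
	bool restricted;	/* "cannot escalate privileges" */
};

struct program {
	bool setuid;
	unsigned int owner;
};

/* One-way switch: there is deliberately no function that clears it. */
static void task_restrict(struct task *t)
{
	t->restricted = true;
}

/* The flag is copied verbatim to the child. */
static struct task task_fork(const struct task *parent)
{
	return *parent;
}

/*
 * Returns 0 on success, -1 if the exec is refused.  A restricted task
 * may still exec ordinary programs, but never one that would change
 * its privileges.  The flag itself is untouched by exec.
 */
static int task_exec(struct task *t, const struct program *p)
{
	if (p->setuid) {
		if (t->restricted)
			return -1;
		t->euid = p->owner;
	}
	return 0;
}
```

With such a flag in place, the rule for filters, chroot and similar features could be stated as "allowed for unprivileged tasks only when restricted is set", since nothing the task later executes can gain privileges it might be fooled into misusing.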

There are probably other things like that - things like allowing users
to do bind mounts etc - that aren't dangerous in themselves, but that
are dangerous mainly because they can be used to fool things into
privilege escalations. So this is definitely not a filter-only issue.

Linus
Steven Rostedt
2012-01-12 17:17:54 UTC
Permalink
Post by Linus Torvalds
The whole "fail security escalations" thing goes way beyond just
filtering, I think we could seriously try to make it a generic
feature.
After I wrote this comment I thought the same thing. It would be nice to
have a way to just set a flag to a process that will prevent it from
doing any escalating of privileges.

I totally agree, this would solve a whole host of issues with regard to
security issues in things that shouldn't be a problem but currently are.

-- Steve




Andrew Lutomirski
2012-01-12 18:18:55 UTC
Permalink
On Thu, Jan 12, 2012 at 9:09 AM, Linus Torvalds
Post by Linus Torvalds
Post by Steven Rostedt
In that case, just have execv fail if filtering is enabled and we are
execing a setuid program. But I don't see why non "magical" execv's
should be prohibited.
The whole "fail security escalations" thing goes way beyond just
filtering, I think we could seriously try to make it a generic
feature.
For example, somebody just asked me the other day why "chroot()"
requires admin privileges, since it would be good to limit even
non-root things.
And it's really the exact same issue as filtering: in some sense,
chroot() "filters" FS name lookups, and can be used to fool programs
that are written to be secure.
We could easily introduce a per-process flag that just says "cannot
escalate privileges". Which basically just disables execve() of
suid/sgid programs (and possibly other things too), and locks the
process to the current privileges. And then make the rule be that *if*
that flag is set, you can then filter across an execve, or chroot as a
normal user, or whatever.
Like this?

http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html

(This depends on execve_nosecurity, which is controversial, but that
dependency would be trivial to remove.)

Note that there's a huge can of worms if execve is allowed but
suid/sgid is not: selinux may elevate privileges on exec of pretty
much anything. (I think that this is a really awful idea, but it's in
the kernel, so we're stuck with it.)

--Andy
Linus Torvalds
2012-01-12 18:32:59 UTC
Permalink
Post by Andrew Lutomirski
Like this?
http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html
I don't know the execve_nosecurity patches, so the diff makes little
sense to me, but yeah, I wouldn't expect it to be more than a couple
of lines. Exactly *how* you set the bit etc is not something I care
deeply about, prctl seems about as good as anything.
Post by Andrew Lutomirski
Note that there's a huge can of worms if execve is allowed but
suid/sgid is not: selinux may elevate privileges on exec of pretty
much anything. (I think that this is a really awful idea, but it's in
the kernel, so we're stuck with it.)
You can do any amount of crazy things with selinux, but the other side
of the coin is that it would also be trivial to teach selinux about
this same "restricted environment" bit, and just say that a process
with that bit set doesn't get to match whatever selinux privilege
escalation rules..

I really don't think this is just about "execve cannot do setuid". I
think it's about the process being marked as restricted.

So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
should simply be "PR_RESTRICT_ME", and be done with it, and not try to
artificially limit it to be some "execve feature", and more think of
it as a "this is a process that has *no* extra privileges at all, and
can never get them".

Linus
Andrew Lutomirski
2012-01-12 18:44:27 UTC
Permalink
On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds
Post by Linus Torvalds
Post by Andrew Lutomirski
Like this?
http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html
I don't know the execve_nosecurity patches, so the diff makes little
sense to me, but yeah, I wouldn't expect it to be more than a couple
of lines. Exactly *how* you set the bit etc is not something I care
deeply about, prctl seems about as good as anything.
Post by Andrew Lutomirski
Note that there's a huge can of worms if execve is allowed but
suid/sgid is not: selinux may elevate privileges on exec of pretty
much anything. (I think that this is a really awful idea, but it's in
the kernel, so we're stuck with it.)
You can do any amount of crazy things with selinux, but the other side
of the coin is that it would also be trivial to teach selinux about
this same "restricted environment" bit, and just say that a process
with that bit set doesn't get to match whatever selinux privilege
escalation rules..
I really don't think this is just about "execve cannot do setuid". I
think it's about the process being marked as restricted.
So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
should simply be "PR_RESTRICT_ME", and be done with it, and not try to
artificially limit it to be some "execve feature", and more think of
it as a "this is a process that has *no* extra privileges at all, and
can never get them".
Fair enough. I'll submit the simpler patch tonight.

execve_nosecurity was my attempt to sidestep selinux issues. It's a
different syscall that does all of the non-security-related things
that execve does but does not escalate (or even change) any
privileges. Maybe I'll try to rework that for newer kernels as well.
The idea is that programs that expect to run in sandboxes / chroots /
namespaces / whatever can use it, and older programs that might
malfunction dangerously if the semantics of execve change will just
fail instead.

--Andy
Kyle Moffett
2012-01-12 19:08:18 UTC
Permalink
Post by Linus Torvalds
Post by Andrew Lutomirski
Like this?
http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html
I don't know the execve_nosecurity patches, so the diff makes little
sense to me, but yeah, I wouldn't expect it to be more than a couple
of lines. Exactly *how* you set the bit etc is not something I care
deeply about, prctl seems about as good as anything.
Post by Andrew Lutomirski
Note that there's a huge can of worms if execve is allowed but
suid/sgid is not: selinux may elevate privileges on exec of pretty
much anything. (I think that this is a really awful idea, but it's in
the kernel, so we're stuck with it.)
You can do any amount of crazy things with selinux, but the other side
of the coin is that it would also be trivial to teach selinux about
this same "restricted environment" bit, and just say that a process
with that bit set doesn't get to match whatever selinux privilege
escalation rules..
I really don't think this is just about "execve cannot do setuid". I
think it's about the process being marked as restricted.
So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
should simply be "PR_RESTRICT_ME", and be done with it, and not try to
artificially limit it to be some "execve feature", and more think of
it as a "this is a process that has *no* extra privileges at all, and
can never get them".
execve_nosecurity was my attempt to sidestep selinux issues. It's a
different syscall that does all of the non-security-related things
that execve does but does not escalate (or even change) any
privileges. Maybe I'll try to rework that for newer kernels as well.
The idea is that programs that expect to run in sandboxes / chroots /
namespaces / whatever can use it, and older programs that might
malfunction dangerously if the semantics of execve change will just
fail instead.
I don't see any issues with SELinux support for this feature.

Specifically, when you try to execute something in SELinux, it will
first look at the types and try to "execute" (involving a type
transition IE: security label change).

But if that fails in many cases it may still be allowed to
"execute_no_trans" (IE: regular non-privileged exec() without a
transition).

If you add this feature, it should just disable the normal "execute"
with transition path and unconditionally fall back to
"execute_no_trans".

Likewise, enabling these bits should also disable the "transition" and
"dyntransition" process access vectors, and I'm on the fence about
whether "setfscreate", etc should be allowed.

Cheers,
Kyle Moffett

-- 
Curious about my work on the Debian powerpcspe port?
I'm keeping a blog here: http://pureperl.blogspot.com/
Eric Paris
2012-01-12 23:05:36 UTC
Permalink
Post by Kyle Moffett
Post by Linus Torvalds
You can do any amount of crazy things with selinux, but the other side
of the coin is that it would also be trivial to teach selinux about
this same "restricted environment" bit, and just say that a process
with that bit set doesn't get to match whatever selinux privilege
escalation rules..
I don't see any issues with SELinux support for this feature.
Specifically, when you try to execute something in SELinux, it will
first look at the types and try to "execute" (involving a type
transition IE: security label change).
But if that fails in many cases it may still be allowed to
"execute_no_trans" (IE: regular non-privileged exec() without a
transition).
That's not true. See specifically
security/selinux/hooks.c::selinux_bprm_set_creds() We calculate a label
for the new task (that may or may not be the same) and then check if
there is permission to run the new binary with the new label. There is
no fallback.

The exception would be if the binary is on a MNT_NOSUID mount point, in
which case we calculate the new label, then just revert to the same
label.

At first glance it looks to me like a reasonable way to implement this
at first would be to do the new checks right next to any place we
already do MNT_NOSUID checks and mimic their behavior. If there are
other priv escalation points in the kernel we might need to consider if
MNT_NOSUID is adequate....

-Eric

Andrew Lutomirski
2012-01-12 23:33:47 UTC
Permalink
Post by Kyle Moffett
You can do any amount of crazy things with selinux, but the other side
of the coin is that it would also be trivial to teach selinux about
this same "restricted environment" bit, and just say that a process
with that bit set doesn't get to match whatever selinux privilege
escalation rules..
I don't see any issues with SELinux support for this feature.
Specifically, when you try to execute something in SELinux, it will
first look at the types and try to "execute" (involving a type
transition IE: security label change).
But if that fails in many cases it may still be allowed to
"execute_no_trans" (IE: regular non-privileged exec() without a
transition).
That's not true. See specifically
security/selinux/hooks.c::selinux_bprm_set_creds() We calculate a label
for the new task (that may or may not be the same) and then check if
there is permission to run the new binary with the new label. There is
no fallback.
The exception would be if the binary is on a MNT_NOSUID mount point, in
which case we calculate the new label, then just revert to the same
label.
At first glance it looks to me like a reasonable way to implement this
at first would be to do the new checks right next to any place we
already do MNT_NOSUID checks and mimic their behavior. If there are
other priv escalation points in the kernel we might need to consider if
MNT_NOSUID is adequate....
I don't really like the current logic. It does:

        if (old_tsec->exec_sid) {
                new_tsec->sid = old_tsec->exec_sid;
                /* Reset exec SID on execve. */
                new_tsec->exec_sid = 0;
        } else {
                /* Check for a default transition on this program. */
                rc = security_transition_sid(old_tsec->sid, isec->sid,
                                             SECCLASS_PROCESS, NULL,
                                             &new_tsec->sid);
                if (rc)
                        return rc;
        }

        COMMON_AUDIT_DATA_INIT(&ad, PATH);
        ad.u.path = bprm->file->f_path;

        if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
                new_tsec->sid = old_tsec->sid;

which means that, if MNT_NOSUID, then exec_sid is silently ignored.
I'd rather fail in that case, but it's probably too late for that.
However, if we set the "no new privileges" flag, then we could fail,
since there's no old ABI to be compatible with. I'll implement it
that way.
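
The difference between the two behaviours can be sketched as a small decision function. This is a hypothetical simplification, not SELinux code: struct exec_ctx, the no_new_privs field name, pick_sid() and the choice of -EPERM as the error are all assumptions made for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical simplification of the label selection discussed above. */
struct exec_ctx {
	unsigned int old_sid;		/* label of the calling task */
	unsigned int wanted_sid;	/* label requested or computed for the new program */
	bool mnt_nosuid;		/* binary lives on a nosuid mount */
	bool no_new_privs;		/* the proposed per-task flag */
};

/*
 * Stores the label to use in *new_sid and returns 0, or returns -EPERM.
 * nosuid keeps the existing behaviour (silently keep the old label);
 * the proposed flag instead refuses any exec that would change the label.
 */
static int pick_sid(const struct exec_ctx *c, unsigned int *new_sid)
{
	if (c->no_new_privs && c->wanted_sid != c->old_sid)
		return -EPERM;
	if (c->mnt_nosuid) {
		*new_sid = c->old_sid;
		return 0;
	}
	*new_sid = c->wanted_sid;
	return 0;
}
```

Failing loudly is only an option for the new flag because no existing program depends on its behaviour; the nosuid branch has to stay silent for compatibility.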

--Andy
Will Drewry
2012-01-12 19:40:23 UTC
Permalink
Post by Andrew Lutomirski
On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds
Post by Linus Torvalds
Post by Andrew Lutomirski
Like this?
http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html
I don't know the execve_nosecurity patches, so the diff makes little
sense to me, but yeah, I wouldn't expect it to be more than a couple
of lines. Exactly *how* you set the bit etc is not something I care
deeply about, prctl seems about as good as anything.
Post by Andrew Lutomirski
Note that there's a huge can of worms if execve is allowed but
suid/sgid is not: selinux may elevate privileges on exec of pretty
much anything. (I think that this is a really awful idea, but it's in
the kernel, so we're stuck with it.)
You can do any amount of crazy things with selinux, but the other side
of the coin is that it would also be trivial to teach selinux about
this same "restricted environment" bit, and just say that a process
with that bit set doesn't get to match whatever selinux privilege
escalation rules..
I really don't think this is just about "execve cannot do setuid". I
think it's about the process being marked as restricted.
So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
should simply be "PR_RESTRICT_ME", and be done with it, and not try to
artificially limit it to be some "execve feature", and more think of
it as a "this is a process that has *no* extra privileges at all, and
can never get them".
Fair enough. I'll submit the simpler patch tonight.
This sounds cool. Do you think you'll go for a new task_struct member
or will it be a securebit? (Seems like securebits might be too tied to
posix file caps, but I figured I'd ask).

I'm planning on going ahead and mocking up your potential patch so I
can respin this series using it and make sure I understand the
interactions.

thanks!
will
Will Drewry
2012-01-12 19:42:55 UTC
Permalink
Post by Andrew Lutomirski
On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds
Post by Andrew Lutomirski
Like this?
http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html
I don't know the execve_nosecurity patches, so the diff makes little
sense to me, but yeah, I wouldn't expect it to be more than a couple
of lines. Exactly *how* you set the bit etc is not something I care
deeply about, prctl seems about as good as anything.
Post by Andrew Lutomirski
Note that there's a huge can of worms if execve is allowed but
suid/sgid is not: selinux may elevate privileges on exec of pretty
much anything. (I think that this is a really awful idea, but it's in
the kernel, so we're stuck with it.)
You can do any amount of crazy things with selinux, but the other side
of the coin is that it would also be trivial to teach selinux about
this same "restricted environment" bit, and just say that a process
with that bit set doesn't get to match whatever selinux privilege
escalation rules..
I really don't think this is just about "execve cannot do setuid". I
think it's about the process being marked as restricted.
So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
should simply be "PR_RESTRICT_ME", and be done with it, and not try to
artificially limit it to be some "execve feature", and more think of
it as a "this is a process that has *no* extra privileges at all, and
can never get them".
Fair enough. I'll submit the simpler patch tonight.
This sounds cool. Do you think you'll go for a new task_struct member
or will it be a securebit? (Seems like securebits might be too tied to
posix file caps, but I figured I'd ask).
Or cred member, etc.
I'm planning on going ahead and mocking up your potential patch so I
can respin this series using it and make sure I understand the
interactions.
thanks!
will
Andrew Lutomirski
2012-01-12 19:46:37 UTC
Permalink
Post by Andrew Lutomirski
On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds
Post by Andrew Lutomirski
Like this?
http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html
I don't know the execve_nosecurity patches, so the diff makes little
sense to me, but yeah, I wouldn't expect it to be more than a couple
of lines. Exactly *how* you set the bit etc is not something I care
deeply about, prctl seems about as good as anything.
Post by Andrew Lutomirski
Note that there's a huge can of worms if execve is allowed but
suid/sgid is not: selinux may elevate privileges on exec of pretty
much anything. (I think that this is a really awful idea, but it's in
the kernel, so we're stuck with it.)
You can do any amount of crazy things with selinux, but the other side
of the coin is that it would also be trivial to teach selinux about
this same "restricted environment" bit, and just say that a process
with that bit set doesn't get to match whatever selinux privilege
escalation rules..
I really don't think this is just about "execve cannot do setuid". I
think it's about the process being marked as restricted.
So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
should simply be "PR_RESTRICT_ME", and be done with it, and not try to
artificially limit it to be some "execve feature", and more think of
it as a "this is a process that has *no* extra privileges at all, and
can never get them".
Fair enough. I'll submit the simpler patch tonight.
This sounds cool. Do you think you'll go for a new task_struct member
or will it be a securebit? (Seems like securebits might be too tied to
posix file caps, but I figured I'd ask).
I'm planning on going ahead and mocking up your potential patch so I
can respin this series using it and make sure I understand the
interactions.
I think securebits and cred didn't exist the first time I did this,
and sticking it in struct cred might unnecessarily prevent sharing
cred (assuming that even happens). So I'd say task_struct.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Linus Torvalds
2012-01-12 20:00:58 UTC
Permalink
Post by Andrew Lutomirski
I think securebits and cred didn't exist the first time I did this,
and sticking it in struct cred might unnecessarily prevent sharing
cred (assuming that even happens). So I'd say task_struct.
I think it almost has to be task state, since we very much want to
make sure it's trivial to see that nothing ever clears that bit, and
that it always gets copied right over a fork/exec/whatever.

Putting it in some cred or capability bit or something would make that
kind of transparency pretty much totally impossible.

Linus
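
To illustrate the property described above (a bit that nothing ever clears and that is copied across fork and exec), here is a minimal userspace model. It is only a sketch: struct task_model and the three helpers are hypothetical names invented for this example, not kernel code and not the patch under discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-task state; this is not task_struct. */
struct task_model {
	int pid;
	bool restricted;	/* one-way flag: no helper ever clears it */
};

/* Set the flag. There is deliberately no function that resets it. */
static void task_restrict(struct task_model *t)
{
	t->restricted = true;
}

/* Model of fork: the child starts with the parent's flag. */
static struct task_model task_fork(const struct task_model *parent, int child_pid)
{
	assert(parent != NULL);
	struct task_model child = {
		.pid = child_pid,
		.restricted = parent->restricted,
	};
	return child;
}

/*
 * Model of exec: the flag is carried over untouched, and a
 * privilege-granting file (e.g. setuid) only takes effect when the
 * task is not restricted.
 */
static bool task_exec_may_gain_privs(const struct task_model *t, bool file_is_setid)
{
	return file_is_setid && !t->restricted;
}
```

Because the only writer of the flag is task_restrict(), it is easy to audit that the bit can be set but never cleared, which is the transparency argument made above.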
Oleg Nesterov
2012-01-12 16:14:18 UTC
Permalink
Post by Steven Rostedt
Post by Will Drewry
Filter programs may _only_ cross the execve(2) barrier if last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace. Once a task-local filter program is attached from a
process without privileges, execve will fail. This ensures that only
privileged parent task can affect its privileged children (e.g., setuid
binary).
This means that a non privileged user can not run another program with
limited features? How would a process exec another program and filter
it? I would assume that the filter would need to be attached first and
then the execv() would be performed. But after the filter is attached,
the execv is prevented?
Maybe I don't understand this correctly.
May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.

OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.

Oleg.
Steven Rostedt
2012-01-12 16:38:44 UTC
Permalink
Post by Oleg Nesterov
May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.
I've never used seccomp, so I admit I'm totally ignorant on this topic.

But looking at seccomp from the outside, the biggest advantage to this
would be the ability for normal processes to be able to limit tasks it
kicks off. If I want to run a task in a sandbox, I don't want to be root
to do so.

I guess a web browser doesn't perform an exec to run java programs. But
it would be nice if I could execute something from the command line that
I could run in a sand box.

What's the problem with making sure that the setuid isn't set before
doing an execv? Only fail when setuid (or some other magic) is enabled
on the file being exec'd.

Or is this a race where I can have a soft link pointing to a normal
file, run this, and have the link change to a setuid file at just the
right time that causes it to happen?


-- Steve


Oleg Nesterov
2012-01-12 16:47:51 UTC
Permalink
Post by Steven Rostedt
Post by Oleg Nesterov
May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.
I've never used seccomp, so I admit I'm totally ignorant on this topic.
me too ;)
Post by Steven Rostedt
But looking at seccomp from the outside, the biggest advantage to this
would be the ability for normal processes to be able to limit tasks it
kicks off. If I want to run a task in a sandbox, I don't want to be root
to do so.
I guess a web browser doesn't perform an exec to run java programs. But
it would be nice if I could execute something from the command line that
I could run in a sand box.
What's the problem with making sure that the setuid isn't set before
doing an execv? Only fail when setuid (or some other magic) is enabled
on the file being exec'd.
I agree. That is why I mentioned LSM_UNSAFE_SECCOMP/cap_bprm_set_creds.
Just I do not know what would be the most simple/clean way to do this.


And in any case I agree that the current seccomp_check_exec() looks
strange. Btw, it does
{
        if (current->seccomp.mode != 2)
                return 0;
        /* We can rely on the task refcount for the filter. */
        if (!current->seccomp.filter)
                return -EPERM;

How it is possible to have seccomp.filter == NULL with mode == 2?

Oleg.

Will Drewry
2012-01-12 17:08:08 UTC
Permalink
Post by Oleg Nesterov
Post by Oleg Nesterov
May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.
I've never used seccomp, so I admit I'm totally ignorant on this topic.
me too ;)
But looking at seccomp from the outside, the biggest advantage to this
would be the ability for normal processes to be able to limit tasks it
kicks off. If I want to run a task in a sandbox, I don't want to be root
to do so.
I guess a web browser doesn't perform an exec to run java programs. But
it would be nice if I could execute something from the command line that
I could run in a sand box.
What's the problem with making sure that the setuid isn't set before
doing an execv? Only fail when setuid (or some other magic) is enabled
on the file being exec'd.
I agree. That is why I mentioned LSM_UNSAFE_SECCOMP/cap_bprm_set_creds.
Just I do not know what would be the most simple/clean way to do this.
And in any case I agree that the current seccomp_check_exec() looks
strange. Btw, it does
{
        if (current->seccomp.mode != 2)
                return 0;
        /* We can rely on the task refcount for the filter. */
        if (!current->seccomp.filter)
                return -EPERM;
How it is possible to have seccomp.filter == NULL with mode == 2?

It shouldn't be. It's another relic I missed from development. (Adding to v3 :)
Jamie Lokier
2012-01-12 17:30:48 UTC
Permalink
Post by Steven Rostedt
Post by Oleg Nesterov
May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.
I've never used seccomp, so I admit I'm totally ignorant on this topic.
But looking at seccomp from the outside, the biggest advantage to this
would be the ability for normal processes to be able to limit tasks it
kicks off. If I want to run a task in a sandbox, I don't want to be root
to do so.
I guess a web browser doesn't perform an exec to run java programs.
Actually it does. Firefox on Linux forks and execs the Java VM.
Same for Flash, using "plugin-container".
Post by Steven Rostedt
But it would be nice if I could execute something from the command
line that I could run in a sand box.
You can do this now, using ptrace(). It's horrible, but half of the
horribleness is needing to understand machine-dependent registers,
which this new patch doesn't address. (The other half is a ton of
undocumented but important ptrace() behaviours on Linux.)

-- Jamie
Steven Rostedt
2012-01-12 17:40:06 UTC
Permalink
Post by Jamie Lokier
You can do this now, using ptrace(). It's horrible, but half of the
horribleness is needing to understand machine-dependent registers,
which this new patch doesn't address. (The other half is a ton of
undocumented but important ptrace() behaviours on Linux.)
Yeah I know the horrid use of ptrace, I've implemented programs that use
it :-p

I guess ptrace can capture the execv and determine if it is OK or not to
run it. But again, this doesn't stop the possible attacks that could
happen, with having the execv on a symlink file, having the ptrace check
say its OK, and then switching the symlink to a setuid file.

When the new execv executed, the parent process would lose all control
over it. The idea is to prevent this.

I like Alan's suggestion. Have userspace decide to allow execv or not,
and even let it decide if it should allow setuid execv's or not, but
still allow non-setuid execvs. If you allow the setuid execv, once that
happens, the same behavior will occur as with ptrace. A setuid execv
will lose all its filtering.

-- Steve


Jamie Lokier
2012-01-12 17:44:33 UTC
Permalink
Post by Steven Rostedt
Post by Jamie Lokier
You can do this now, using ptrace(). It's horrible, but half of the
horribleness is needing to understand machine-dependent registers,
which this new patch doesn't address. (The other half is a ton of
undocumented but important ptrace() behaviours on Linux.)
Yeah I know the horrid use of ptrace, I've implemented programs that use
it :-p
That warm fuzzy feeling :-)
Post by Steven Rostedt
I guess ptrace can capture the execv and determine if it is OK or not to
run it. But again, this doesn't stop the possible attacks that could
happen, with having the execv on a symlink file, having the ptrace check
say its OK, and then switching the symlink to a setuid file.
When the new execv executed, the parent process would lose all control
over it. The idea is to prevent this.
fexecve() exists to solve the problem.
Also known as execve("/proc/self/fd/...") on Linux.
Post by Steven Rostedt
I like Alan's suggestion. Have userspace decide to allow execv or not,
and even let it decide if it should allow setuid execv's or not, but
still allow non-setuid execvs. If you allow the setuid execv, once that
happens, the same behavior will occur as with ptrace. A setuid execv
will lose all its filtering.
I like the idea of letting the tracer decide what it wants.

-- Jamie
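
As an illustration of the fexecve() point above, the sketch below opens the file once, checks the setuid/setgid bits on the open descriptor, and then executes that same descriptor, so a symlink swapped after the check cannot redirect the exec. The helper names fd_is_setid() and exec_unless_setid() are made up for this example. It only addresses the path-swap race; the owner of the file could still change its mode bits between the check and the exec, so it should not be read as a complete policy.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* True if the already-open file carries a setuid or setgid bit. */
static bool fd_is_setid(int fd)
{
	struct stat st;

	assert(fd >= 0);
	if (fstat(fd, &st) != 0)
		return true;	/* cannot tell: treat it as unsafe */
	return (st.st_mode & (S_ISUID | S_ISGID)) != 0;
}

/*
 * Hypothetical helper: open once, check the open file, exec that same
 * file. Returns -1 with errno set on failure; does not return on success.
 * O_CLOEXEC is fine for ELF binaries; it would have to be dropped for
 * interpreter scripts, which need the descriptor to stay open.
 */
static int exec_unless_setid(const char *path, char *const argv[],
			     char *const envp[])
{
	int saved;
	int fd = open(path, O_RDONLY | O_CLOEXEC);

	if (fd < 0)
		return -1;
	if (fd_is_setid(fd)) {
		close(fd);
		errno = EPERM;
		return -1;
	}
	fexecve(fd, argv, envp);	/* only returns on error */
	saved = errno;
	close(fd);
	errno = saved;
	return -1;
}
```

A caller would typically fork() first and call exec_unless_setid() in the child, treating a return value of -1 with errno set to EPERM as "refused because the file is setuid or setgid".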
Steven Rostedt
2012-01-12 17:56:21 UTC
Permalink
Post by Jamie Lokier
Post by Steven Rostedt
I like Alan's suggestion. Have userspace decide to allow execv or not,
and even let it decide if it should allow setuid execv's or not, but
still allow non-setuid execvs. If you allow the setuid execv, once that
happens, the same behavior will occur as with ptrace. A setuid execv
will lose all its filtering.
I like the idea of letting the tracer decide what it wants.
Right, and if we implement the suggestion that Linus made, to set a flag
to prevent a task from ever getting privilege, then seccomp can add
that too.

That is, there can be a filter to say "prevent this task from doing
anything with privilege" and that will prevent execv from gaining setuid
privilege. Perhaps, it would still do the execv, but the program that is
executed will run as the normal user, and just fail when it tries to do
something that requires sys admin privilege.

Thus, execv will not be a "special" case here. Seccomp either allows it
or not. But also add a command to tell seccomp that this task will not
be allowed to do anything privileged.

-- Steve


Alan Cox
2012-01-12 23:27:35 UTC
Permalink
Post by Steven Rostedt
Thus, execv will not be a "special" case here. Seccomp either allows it
or not. But also add a command to tell seccomp that this task will not
be allowed to do anything privileged.
A setuid binary is not necessarily privileged - indeed a root -> user
transition via setuid is pretty much the reverse.

It's a change of user context. Things like ptrace and file permissions
basically mean you can't build a barrier between stuff running as the
same uid to a great extent except with heavy restricting, but saying
"you can't become someone else" is very useful.

Alan
Linus Torvalds
2012-01-12 23:38:39 UTC
Permalink
Post by Alan Cox
It's a change of user context. Things like ptrace and file permissions
basically mean you can't build a barrier between stuff running as the
same uid to a great extent except with heavy restricting, but saying
"you can't become someone else" is very useful.
Not just "someone else".

The guarantee basically has to be "you can't change your security
context". Where "become somebody else" is part of it, but any
capability changes etc would be part of it too. So it should disable
all games with capabilities etc.

And I don't think selinux really should be all that much of a problem
- we should just make sure that selinux would honor such a bit, and
refuse to do any op that would change any selinux capabilities either.
Same goes for other security models.

And that may include restricting the ways a binary can be executed
totally outside of suid/sgid bits. For example, if you consider
binaries under /home to have different selinux rules than system
binaries in /usr/bin, then a cross-execute from one to the other may
not work, regardless of whether it's suid or not.

I think that is the kind of guarantee a sandbox environment really
wants: "I'm setting up a sandbox, you'd better not change the
permissions on me regardless of what crazy things I do".

Linus
Will Drewry
2012-01-12 22:18:46 UTC
Permalink
You can do this now, using ptrace(). It's horrible, but half of the
horribleness is needing to understand machine-dependent registers,
which this new patch doesn't address. (The other half is a ton of
undocumented but important ptrace() behaviours on Linux.)
Yeah I know the horrid use of ptrace, I've implemented programs that use
it :-p
I guess ptrace can capture the execv and determine if it is OK or not to
run it. But again, this doesn't stop the possible attacks that could
happen, with having the execv on a symlink file, having the ptrace check
say its OK, and then switching the symlink to a setuid file.
When the new execv executed, the parent process would lose all control
over it. The idea is to prevent this.
I like Alan's suggestion. Have userspace decide to allow execv or not,
and even let it decide if it should allow setuid execv's or not, but
still allow non-setuid execvs. If you allow the setuid execv, once that
happens, the same behavior will occur as with ptrace. A setuid execv
will lose all its filtering.

In the ptrace case, doesn't it just downgrade the privileges of the new process
if there is a tracer, rather than detach the tracer?

Ignoring that, I've been looking at system call filters as being equivalent to
something like the caps bounding set. Once reduced, there's no going
back. I think Linus's proposal perfectly resolves the policy decision around
suid execution behavior in the run-with-privs or not scenarios (just like with
how ptrace does it). However, I'd like to avoid allowing any process to
escape system call filters once installed. (It's doable to add
suid/caps-based-bypass, but it certainly not ideal from my perspective.)

cheers,
will
Andrew Lutomirski
2012-01-12 23:00:46 UTC
Permalink
You can do this now, using ptrace(). It's horrible, but half of the
horribleness is needing to understand machine-dependent registers,
which this new patch doesn't address. (The other half is a ton of
undocumented but important ptrace() behaviours on Linux.)
Yeah I know the horrid use of ptrace, I've implemented programs that use
it :-p
I guess ptrace can capture the execv and determine if it is OK or not to
run it. But again, this doesn't stop the possible attacks that could
happen, with having the execv on a symlink file, having the ptrace check
say its OK, and then switching the symlink to a setuid file.
When the new execv executed, the parent process would lose all control
over it. The idea is to prevent this.
I like Alan's suggestion. Have userspace decide to allow execv or not,
and even let it decide if it should allow setuid execv's or not, but
still allow non-setuid execvs. If you allow the setuid execv, once that
happens, the same behavior will occur as with ptrace. A setuid execv
will lose all its filtering.
In the ptrace case, doesn't it just downgrade the privileges of the new process
if there is a tracer, rather than detach the tracer?
Ignoring that, I've been looking at system call filters as being equivalent to
something like the caps bounding set. Once reduced, there's no going
back. I think Linus's proposal perfectly resolves the policy decision around
suid execution behavior in the run-with-privs or not scenarios (just like with
how ptrace does it). However, I'd like to avoid allowing any process to
escape system call filters once installed. (It's doable to add
suid/caps-based-bypass, but it certainly not ideal from my perspective.)

I agree.

In principle, it could be safe for an outside (non-seccomp) process
with appropriate credentials to lift seccomp restrictions from a
different process. But why?

--Andy
Will Drewry
2012-01-12 16:59:15 UTC
Permalink
Filter programs may _only_ cross the execve(2) barrier if last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace. Once a task-local filter program is attached from a
process without privileges, execve will fail. This ensures that only
privileged parent task can affect its privileged children (e.g., setuid
binary).
This means that a non privileged user can not run another program with
limited features? How would a process exec another program and filter
it? I would assume that the filter would need to be attached first and
then the execv() would be performed. But after the filter is attached,
the execv is prevented?
Yeah - it means tasks can filter themselves, but not each other.
However, you can inject a filter for any dynamically linked executable
using LD_PRELOAD.
Maybe I don't understand this correctly.
You're right on. This was to ensure that one process didn't cause
crazy behavior in another. I think Alan has a better proposal than
mine below. (Goes back to catching up.)
Will Drewry
2012-01-12 17:35:55 UTC
Permalink
Post by Will Drewry
Filter programs may _only_ cross the execve(2) barrier if last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace. Once a task-local filter program is attached from a
process without privileges, execve will fail. This ensures that only
privileged parent task can affect its privileged children (e.g., setuid
binary).
This means that a non privileged user can not run another program with
limited features? How would a process exec another program and filter
it? I would assume that the filter would need to be attached first and
then the execv() would be performed. But after the filter is attached,
the execv is prevented?
Yeah - it means tasks can filter themselves, but not each other.
However, you can inject a filter for any dynamically linked executable
using LD_PRELOAD.
Maybe I don't understand this correctly.
You're right on. This was to ensure that one process didn't cause
crazy behavior in another. I think Alan has a better proposal than
mine below. (Goes back to catching up.)
You can already use ptrace() to cause crazy behaviour in another
process, including modifying registers arbitrarily at syscall entry
and exit, aborting and emulating syscalls.
ptrace() is quite slow and it would be really nice to speed it up,
especially for trapping a small subset of syscalls, or limiting some
kinds of access to some file descriptors, while everything else runs
at normal speed.
Speeding up ptrace() with BPF filters would be a really nice. Not
that I like ptrace(), but sometimes it's the only thing you can rely on.
LD_PRELOAD and code running in the target process address space can't
always be trusted in some contexts (e.g. the target process may modify
the tracing code or its data); whereas ptrace() is pretty complete and
reliable, if ugly.
There's already a security model around who can use ptrace(); speeding
it up needn't break that.
If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
needed as userspace could have done it, with exactly the restrictions
it wants. Google's NaCl comes to mind as a potential user.
That's not entirely true. ptrace supervisors are subject to races and
always fail open. This makes them effective but not as robust as a
seccomp solution can provide.

With seccomp, it fails close. What I think would make sense would be
to add a user-controllable failure mode with seccomp bpf that calls
tracehook_ptrace_syscall_entry(regs). I've prototyped this and it
works quite well, but I didn't want to conflate the discussions.

Using ptrace() would also mean that all consumers of this interface
would need a supervisor, but with seccomp, the filters are installed
and require no supervisors to stick around for when failure occurs.

Does that make sense?
thanks!
will
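
To make the fail-closed point above concrete, here is a small userspace model of a filter that decides purely from a by-value snapshot of the registers and denies anything it does not recognise. It is a sketch under stated assumptions: struct regs_snapshot and filter_allows() are hypothetical names, and the real patch evaluates a BPF program over the architecture's user_regs_struct rather than calling a C function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical snapshot of syscall entry state. It is passed by value,
 * so the check sees exactly what was captured at entry and never
 * follows pointer arguments into user memory.
 */
struct regs_snapshot {
	long nr;	/* system call number */
	long args[6];	/* raw argument registers, treated as opaque values */
};

/*
 * Allow only the listed syscall numbers. Everything else, including an
 * empty list, is denied: the default outcome is "no" and no supervisor
 * process has to be alive for that to hold.
 */
static bool filter_allows(const long *allowed, size_t n,
			  struct regs_snapshot regs)
{
	assert(allowed != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (allowed[i] == regs.nr)
			return true;
	return false;
}
```

The numbers placed in the allow list are architecture-specific, which is the machine-dependence concern raised earlier in the thread; the sketch treats them as plain placeholders.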
Jamie Lokier
2012-01-12 17:57:50 UTC
Permalink
Post by Will Drewry
Filter programs may _only_ cross the execve(2) barrier if last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace. Once a task-local filter program is attached from a
process without privileges, execve will fail. This ensures that only
privileged parent task can affect its privileged children (e.g., setuid
binary).
This means that a non privileged user can not run another program with
limited features? How would a process exec another program and filter
it? I would assume that the filter would need to be attached first and
then the execv() would be performed. But after the filter is attached,
the execv is prevented?
Yeah - it means tasks can filter themselves, but not each other.
However, you can inject a filter for any dynamically linked executable
using LD_PRELOAD.
Maybe I don't understand this correctly.
You're right on. This was to ensure that one process didn't cause
crazy behavior in another. I think Alan has a better proposal than
mine below. (Goes back to catching up.)
You can already use ptrace() to cause crazy behaviour in another
process, including modifying registers arbitrarily at syscall entry
and exit, aborting and emulating syscalls.
ptrace() is quite slow and it would be really nice to speed it up,
especially for trapping a small subset of syscalls, or limiting some
kinds of access to some file descriptors, while everything else runs
at normal speed.
Speeding up ptrace() with BPF filters would be a really nice. Not
that I like ptrace(), but sometimes it's the only thing you can rely on.
LD_PRELOAD and code running in the target process address space can't
always be trusted in some contexts (e.g. the target process may modify
the tracing code or its data); whereas ptrace() is pretty complete and
reliable, if ugly.
There's already a security model around who can use ptrace(); speeding
it up needn't break that.
If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
needed as userspace could have done it, with exactly the restrictions
it wants. Google's NaCl comes to mind as a potential user.
That's not entirely true. ptrace supervisors are subject to races and
always fail open. This makes them effective but not as robust as a
seccomp solution can provide.
What races do you know about?

I'm not aware of any ptrace races if it's used properly. I'm also not
sure what you mean by fail open/close here, unless you mean the target
process gets to carry on if the tracing process dies.

Having said that, I can think of one race, but I think your BPF scheme
has the same one: After checking the syscall's string arguments and
other pointed to data, another thread can change those arguments
before the real syscall uses them.
With seccomp, it fails close. What I think would make sense would be
to add a user-controllable failure mode with seccomp bpf that calls
tracehook_ptrace_syscall_entry(regs). I've prototyped this and it
works quite well, but I didn't want to conflate the discussions.
I think it's a nice idea. While you're at it could you fix all the
architectures to actually use tracehooks for syscall tracing ;-)

(I think it's ok to call the tracehook function on all archs though.)
Using ptrace() would also mean that all consumers of this interface
would need a supervisor, but with seccomp, the filters are installed
and require no supervisors to stick around for when failure occurs.
Does that make sense?
It does, I agree that ptrace() is quite cumbersome and you don't
always want a separate tracing process, especially if "failure" means
to die or get an error.

On the other hand, sometimes when a failure occurs, having another
process decide what to do, or log the event, is exactly what you want.

For my nefarious purposes I'm really just looking for a faster way to
reliably trace some activities of individual processes, in particular
tracking which files they access. I'd rather not interfere with
debuggers, so I'd really like your ability to stack multiple filters
to work with separate-process tracing as well. And I'd happily use a
filter rule which can dump some information over a pipe, without
waiting for the tracer to respond in most cases.

-- Jamie
Will Drewry
2012-01-12 18:03:48 UTC
Permalink
Post by Jamie Lokier
Post by Will Drewry
Filter programs may _only_ cross the execve(2) barrier if last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace. Once a task-local filter program is attached from a
process without privileges, execve will fail. This ensures that only
privileged parent task can affect its privileged children (e.g., setuid
binary).
This means that a non privileged user can not run another program with
limited features? How would a process exec another program and filter
it? I would assume that the filter would need to be attached first and
then the execv() would be performed. But after the filter is attached,
the execv is prevented?
Yeah - it means tasks can filter themselves, but not each other.
However, you can inject a filter for any dynamically linked executable
using LD_PRELOAD.
Maybe I don't understand this correctly.
You're right on. This was to ensure that one process didn't cause
crazy behavior in another. I think Alan has a better proposal than
mine below. (Goes back to catching up.)
You can already use ptrace() to cause crazy behaviour in another
process, including modifying registers arbitrarily at syscall entry
and exit, aborting and emulating syscalls.
ptrace() is quite slow and it would be really nice to speed it up,
especially for trapping a small subset of syscalls, or limiting some
kinds of access to some file descriptors, while everything else runs
at normal speed.
Speeding up ptrace() with BPF filters would be a really nice. Not
that I like ptrace(), but sometimes it's the only thing you can rely on.
LD_PRELOAD and code running in the target process address space can't
always be trusted in some contexts (e.g. the target process may modify
the tracing code or its data); whereas ptrace() is pretty complete and
reliable, if ugly.
There's already a security model around who can use ptrace(); speeding
it up needn't break that.
If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
needed as userspace could have done it, with exactly the restrictions
it wants. Google's NaCl comes to mind as a potential user.
That's not entirely true. ptrace supervisors are subject to races and
always fail open. This makes them effective but not as robust as a
seccomp solution can provide.
What races do you know about?
I'm pretty sure that if you have two "isolated" processes, they could
cause irregular behavior using signals.
Post by Jamie Lokier
I'm not aware of any ptrace races if it's used properly. I'm also not
sure what you mean by fail open/close here, unless you mean the target
process gets to carry on if the tracing process dies.
Exactly. Security systems that, on failure, allow the action to
proceed can't be relied on.
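A minimal sketch of the fail-closed idea, in the spirit of the
"int ret = -EACCES" default in the patch's seccomp_test_filters(): the
verdict starts out as a denial and is only cleared on an explicit allow, so
a missing filter or an evaluation error never lets the call through. The
names check_syscall, filter_fn, allow_only_zero and always_broken are
hypothetical, and this is plain userspace C for illustration, not the kernel
code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical filter callback: returns 1 to allow, 0 to deny,
 * or a negative errno value if evaluation itself failed. */
typedef int (*filter_fn)(int syscall_nr);

/* Fail-closed check: start from "denied" and only clear the error
 * when a filter positively allows the call. */
int check_syscall(filter_fn filter, int syscall_nr)
{
    int ret = -EACCES;              /* default verdict: deny */

    assert(syscall_nr >= 0);        /* illustrative precondition */
    if (filter == NULL)
        return ret;                 /* no filter available: deny */
    if (filter(syscall_nr) == 1)
        ret = 0;                    /* explicit allow */
    return ret;                     /* denials and errors both end up here */
}

/* Example filters, used only for illustration. */
int allow_only_zero(int nr) { return nr == 0; }
int always_broken(int nr)   { (void)nr; return -EFAULT; }
```

A fail-open supervisor would invert the default: if the checker goes away or
errors out, the call proceeds. That is the property being objected to above.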
Post by Jamie Lokier
Having said that, I can think of one race, but I think your BPF scheme
has the same one: After checking the syscall's string arguments and
other pointed to data, another thread can change those arguments
before the real syscall uses them.
Not a problem - BPF only allows register inspection. No TOCTOU attacks
need apply :D
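To illustrate why register-only inspection sidesteps that race, here is a
rough sketch, assuming a made-up by-value snapshot of the syscall number and
argument registers (the RFC itself operates on user_regs_struct; struct
regs_snapshot, snapshot_regs and filter_allows are hypothetical names). The
verdict is computed from copied register values and never follows a pointer,
so later writes to user memory cannot change it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical by-value snapshot of syscall entry state; this layout
 * is made up for the example and is not the RFC's user_regs_struct. */
struct regs_snapshot {
    long nr;        /* system call number */
    long args[6];   /* raw argument registers, never dereferenced */
};

/* Take the snapshot once at entry by copying the values. */
struct regs_snapshot snapshot_regs(long nr, const long args[6])
{
    struct regs_snapshot s;

    s.nr = nr;
    memcpy(s.args, args, sizeof(s.args));
    return s;
}

/* Decide from register values only: allow calls 0 and 1, and call 2
 * only when its first argument register equals 3. */
bool filter_allows(const struct regs_snapshot *regs)
{
    assert(regs != NULL);
    switch (regs->nr) {
    case 0:
    case 1:
        return true;
    case 2:
        return regs->args[0] == 3;
    default:
        return false;
    }
}
```

The price is that such a filter can only compare raw values; it cannot look
at a path string or any other pointed-to data.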
Post by Jamie Lokier
With seccomp, it fails close.  What I think would make sense would be
to add a user-controllable failure mode with seccomp bpf that calls
tracehook_ptrace_syscall_entry(regs).  I've prototyped this and it
works quite well, but I didn't want to conflate the discussions.
I think it's a nice idea.  While you're at it could you fix all the
architectures to actually use tracehooks for syscall tracing ;-)
(I think it's ok to call the tracehook function on all archs though.)
Using ptrace() would also mean that all consumers of this interface
would need a supervisor, but with seccomp, the filters are installed
and require no supervisors to stick around for when failure occurs.
Does that make sense?
It does, I agree that ptrace() is quite cumbersome and you don't
always want a separate tracing process, especially if "failure" means
to die or get an error.
On the other hand, sometimes when a failure occurs, having another
process decide what to do, or log the event, is exactly what you want.
For my nefarious purposes I'm really just looking for a faster way to
reliably trace some activities of individual processes, in particular
tracking which files they access.  I'd rather not interfere with
debuggers, so I'd really like your ability to stack multiple filters
to work with separate-process tracing as well.  And I'd happily use a
filter rule which can dump some information over a pipe, without
waiting for the tracer to respond in most cases.
Cool - if the rest of this discussion proceeds, then hopefully, we can
move towards discussing if tying it with ptrace is a good idea or a
horrible one :)

thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jamie Lokier
2012-01-13 01:34:46 UTC
Permalink
Post by Will Drewry
Post by Jamie Lokier
There's already a security model around who can use ptrace(); speeding
it up needn't break that.
If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
needed as userspace could have done it, with exactly the restrictions
it wants. Google's NaCl comes to mind as a potential user.
That's not entirely true. ptrace supervisors are subject to races and
always fail open. This makes them effective but not as robust as a
seccomp solution can provide.
What races do you know about?
I'm pretty sure that if you have two "isolated" processes, they could
cause irregular behavior using signals.
Do you have an example? I'm not aware of one and I've been studying
ptrace quite a bit lately. If there's a race (other than temporary
kernel bugs with all the ptrace patching lately ;-), I would like to
know and maybe patch it.

The only signal confusion when ptracing syscalls I'm aware of is with
SIGTRAP, and that was fixed in 2.5.46, long, long ago (PTRACE_SETOPTIONS).
Post by Will Drewry
Post by Jamie Lokier
I'm not aware of any ptrace races if it's used properly.  I'm also not
sure what you mean by fail open/close here, unless you mean the target
process gets to carry on if the tracing process dies.
Exactly. Security systems that, on failure, allow the action to
proceed can't be relied on.
That's fair enough. There are numerous occasions when ptracer death
should kill the tracee anyway regardless of security. E.g. "strace
command..." and strace dies, you'd normally want the command to
be killed as well. So that could be worth a ptrace option anyway.

Thanks,
-- Jamie
Chris Evans
2012-01-13 06:33:17 UTC
Permalink
Post by Jamie Lokier
Post by Will Drewry
Filter programs may _only_ cross the execve(2) barrier if last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace.  Once a task-local filter program is attached from a
process without privileges, execve will fail.  This ensures that only
privileged parent task can affect its privileged children (e.g., setuid
binary).
This means that a non privileged user can not run another program with
limited features? How would a process exec another program and filter
it? I would assume that the filter would need to be attached first and
then the execv() would be performed. But after the filter is attached,
the execv is prevented?
Yeah - it means tasks can filter themselves, but not each other.
However, you can inject a filter for any dynamically linked executable
using LD_PRELOAD.
Maybe I don't understand this correctly.
You're right on.  This was to ensure that one process didn't cause
crazy behavior in another. I think Alan has a better proposal than
mine below.  (Goes back to catching up.)
Post by Jamie Lokier
You can already use ptrace() to cause crazy behaviour in another
process, including modifying registers arbitrarily at syscall entry
and exit, aborting and emulating syscalls.
ptrace() is quite slow and it would be really nice to speed it up,
especially for trapping a small subset of syscalls, or limiting some
kinds of access to some file descriptors, while everything else runs
at normal speed.
Speeding up ptrace() with BPF filters would be really nice.  Not
that I like ptrace(), but sometimes it's the only thing you can rely on.
LD_PRELOAD and code running in the target process address space can't
always be trusted in some contexts (e.g. the target process may modify
the tracing code or its data); whereas ptrace() is pretty complete and
reliable, if ugly.
There's already a security model around who can use ptrace(); speeding
it up needn't break that.
If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
needed as userspace could have done it, with exactly the restrictions
it wants.  Google's NaCl comes to mind as a potential user.
That's not entirely true.  ptrace supervisors are subject to races and
always fail open.  This makes them effective but not as robust as a
seccomp solution can provide.
What races do you know about?
I'm not aware of any ptrace races if it's used properly.  I'm also not
sure what you mean by fail open/close here, unless you mean the target
process gets to carry on if the tracing process dies.
Yeah, that's one and it's a pretty awful one when you can consider
that the untrusted tracee can play games such as trying to get the
kernel to fire OOM SIGKILLs.

My memory is hazy but the last time I looked at this in detail there
were other racy areas:

- Bad problems if the tracee takes a SIGTSTP or (real) SIGCONT.
- Difficulty in stopping the syscall from executing once it has
started, especially if the tracer dies.


Cheers
Chris
Post by Jamie Lokier
Having said that, I can think of one race, but I think your BPF scheme
has the same one: After checking the syscall's string arguments and
other pointed to data, another thread can change those arguments
before the real syscall uses them.
With seccomp, it fails close.  What I think would make sense would be
to add a user-controllable failure mode with seccomp bpf that calls
tracehook_ptrace_syscall_entry(regs).  I've prototyped this and it
works quite well, but I didn't want to conflate the discussions.
I think it's a nice idea.  While you're at it could you fix all the
architectures to actually use tracehooks for syscall tracing ;-)
(I think it's ok to call the tracehook function on all archs though.)
Using ptrace() would also mean that all consumers of this interface
would need a supervisor, but with seccomp, the filters are installed
and require no supervisors to stick around for when failure occurs.
Does that make sense?
It does, I agree that ptrace() is quite cumbersome and you don't
always want a separate tracing process, especially if "failure" means
to die or get an error.
On the other hand, sometimes when a failure occurs, having another
process decide what to do, or log the event, is exactly what you want.
For my nefarious purposes I'm really just looking for a faster way to
reliably trace some activities of individual processes, in particular
tracking which files they access.  I'd rather not interfere with
debuggers, so I'd really like your ability to stack multiple filters
to work with separate-process tracing as well.  And I'd happily use a
filter rule which can dump some information over a pipe, without
waiting for the tracer to respond in most cases.
-- Jamie
Jamie Lokier
2012-01-12 17:22:41 UTC
Permalink
Post by Will Drewry
Filter programs may _only_ cross the execve(2) barrier if last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace.  Once a task-local filter program is attached from a
process without privileges, execve will fail.  This ensures that only
privileged parent task can affect its privileged children (e.g., setuid
binary).
This means that a non privileged user can not run another program with
limited features? How would a process exec another program and filter
it? I would assume that the filter would need to be attached first and
then the execv() would be performed. But after the filter is attached,
the execv is prevented?
Yeah - it means tasks can filter themselves, but not each other.
However, you can inject a filter for any dynamically linked executable
using LD_PRELOAD.
Maybe I don't understand this correctly.
You're right on. This was to ensure that one process didn't cause
crazy behavior in another. I think Alan has a better proposal than
mine below. (Goes back to catching up.)
You can already use ptrace() to cause crazy behaviour in another
process, including modifying registers arbitrarily at syscall entry
and exit, aborting and emulating syscalls.

ptrace() is quite slow and it would be really nice to speed it up,
especially for trapping a small subset of syscalls, or limiting some
kinds of access to some file descriptors, while everything else runs
at normal speed.

Speeding up ptrace() with BPF filters would be really nice. Not
that I like ptrace(), but sometimes it's the only thing you can rely on.

LD_PRELOAD and code running in the target process address space can't
always be trusted in some contexts (e.g. the target process may modify
the tracing code or its data); whereas ptrace() is pretty complete and
reliable, if ugly.

There's already a security model around who can use ptrace(); speeding
it up needn't break that.

If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
needed as userspace could have done it, with exactly the restrictions
it wants. Google's NaCl comes to mind as a potential user.

-- Jamie
Jamie Lokier
2012-01-12 17:36:17 UTC
Permalink
Post by Steven Rostedt
Post by Will Drewry
Filter programs may _only_ cross the execve(2) barrier if last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace. Once a task-local filter program is attached from a
process without privileges, execve will fail. This ensures that only
privileged parent task can affect its privileged children (e.g., setuid
binary).
This means that a non privileged user can not run another program with
limited features? How would a process exec another program and filter
it? I would assume that the filter would need to be attached first and
then the execv() would be performed. But after the filter is attached,
the execv is prevented?
Ugly method: Using ptrace(), trap after the execve() and issue fake
syscalls to install the filter. I feel dirty thinking it, in a good way.

LD_PRELOAD has been suggested. It's not 100% reliable because not all
executables are dynamic (on some uClinux platforms none of them are),
but it will usually work.

-- Jamie
Alan Cox
2012-01-12 16:18:06 UTC
Permalink
Post by Will Drewry
Filter programs may _only_ cross the execve(2) barrier if last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace. Once a task-local filter program is attached from a
process without privileges, execve will fail. This ensures that only
privileged parent task can affect its privileged children (e.g., setuid
binary).
I think this model is wrong. The rest of the policy rules all work on the
basis that dumpable is the decider (the same rules for not dumping, not
tracing, etc). A user should be able to apply filter to their own code
arbitrarily. Any setuid app should IMHO lose the trace subject to the usual
uid rules and capability rules. That would seem to be more flexible and
also the path of least surprise.

[plus you can implement non setuid exec entirely in userspace so it's
a rather meaningless distinction you propose]
Post by Will Drewry
be tackled separately via separate patchsets. (And at some point sharing
BPF JIT code!)
A BPF jit ought to be trivial and would be a big win.

In general I like this approach. It's simple, it's compact and it offers
interesting possibilities for solving some interesting problem spaces,
without the full weight of SELinux, SMACK etc which are still needed for
heavyweight security.

Alan
Will Drewry
2012-01-12 17:03:20 UTC
Permalink
Filter programs may _only_ cross the execve(2) barrier if last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace.  Once a task-local filter program is attached from a
process without privileges, execve will fail.  This ensures that only
privileged parent task can affect its privileged children (e.g., setuid
binary).
I think this model is wrong. The rest of the policy rules all work on the
basis that dumpable is the decider (the same rules for not dumping, not
tracing, etc). A user should be able to apply filter to their own code
arbitrarily. Any setuid app should IMHO lose the trace subject to the usual
uid rules and capability rules. That would seem to be more flexible and
also the path of least surprise.
My line of thinking up to now has been that disallowing setuid exec
would mean there is no risk of an errant setuid binary allowing escape
from the system call filters (which the containers people may care
more about). Since setuid is privilege escalation, then perhaps it
makes sense to allow it as an escape hatch.

Would it be sane to just disallow setuid exec exclusively?
[plus you can implement non setuid exec entirely in userspace so it's
a rather meaningless distinction you propose]
Agreed.
be tackled separately via separate patchsets. (And at some point sharing
BPF JIT code!)
A BPF jit ought to be trivial and would be a big win.
In general I like this approach. It's simple, it's compact and it offers
interesting possibilities for solving some interesting problem spaces,
without the full weight of SELinux, SMACK etc which are still needed for
heavyweight security.
Thanks! Yeah I think merging with the network stack is eminently
doable, but I didn't want to bog down the proposal in how much
overhead I might be adding to the network layer.
Alan Cox
2012-01-12 17:11:45 UTC
Permalink
Post by Will Drewry
more about). Since setuid is privilege escalation, then perhaps it
makes sense to allow it as an escape hatch.
Would it be sane to just disallow setuid exec exclusively?
I think that is a policy question. I can imagine cases where either
behaviour is the "right" one so it may need to be a parameter ?

Alan
Will Drewry
2012-01-12 17:52:59 UTC
Permalink
Post by Alan Cox
more about).  Since setuid is privilege escalation, then perhaps it
makes sense to allow it as an escape hatch.
Would it be sane to just disallow setuid exec exclusively?
I think that is a policy question. I can imagine cases where either
behaviour is the "right" one so it may need to be a parameter ?
Makes sense. I'll make it flaggable (ignoring the parallel conversation
about having a thread-wide suidable bit).

thanks!
James Morris
2012-01-13 01:31:53 UTC
Permalink
Post by Alan Cox
In general I like this approach. It's simple, it's compact and it offers
interesting possibilities for solving some interesting problem spaces,
without the full weight of SELinux, SMACK etc which are still needed for
heavyweight security.
Yes, I can see potential to vastly simplify MAC policy in some cases.


- James
--
James Morris
<***@namei.org>
Oleg Nesterov
2012-01-12 16:22:31 UTC
Permalink
Post by Will Drewry
+__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
+{
+ /* regset is usually returned based on task personality, not current
+ * system call convention. This behavior makes it unsafe to execute
+ * BPF programs over regviews if is_compat_task or the personality
+ * have changed since the program was installed.
+ */
+ const struct user_regset_view *view = task_user_regset_view(current);
+ const struct user_regset *regset = &view->regsets[0];
+ size_t scratch_size = *available;
+ if (regset->core_note_type != NT_PRSTATUS) {
+ /* The architecture should override this method for speed. */
+ regset = find_prstatus(view);
+ if (!regset)
+ return NULL;
+ }
+ *available = regset->n * regset->size;
+ /* Make sure the scratch space isn't exceeded. */
+ if (*available > scratch_size)
+ *available = scratch_size;
+ if (regset->get(current, regset, 0, *available, scratch, NULL))
+ return NULL;
+ return scratch;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ *
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(int syscall)
+{
+ struct seccomp_filter *filter;
+ u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
+ size_t regs_size = sizeof(struct user_regs_struct);
+ int ret = -EACCES;
+
+ filter = current->seccomp.filter; /* uses task ref */
+ if (!filter)
+ goto out;
+
+ /* All filters in the list are required to share the same system call
+ * convention so only the first filter is ever checked.
+ */
+ if (seccomp_check_personality(filter))
+ goto out;
+
+ /* Grab the user_regs_struct. Normally, regs == &regs_tmp, but
+ * that is not mandatory. E.g., it may return a point to
+ * task_pt_regs(current). NULL checking is mandatory.
+ */
+ regs = seccomp_get_regs(regs_tmp, &regs_size);
Stupid question. I am sure you know what are you doing ;) and I know
nothing about !x86 arches.

But could you explain why it is designed to use user_regs_struct ?
Why we can't simply use task_pt_regs() and avoid the (costly) regsets?

Just curious.

Oleg.

Will Drewry
2012-01-12 17:10:55 UTC
Permalink
Post by Oleg Nesterov
Post by Will Drewry
+__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
+{
+     /* regset is usually returned based on task personality, not current
+      * system call convention.  This behavior makes it unsafe to execute
+      * BPF programs over regviews if is_compat_task or the personality
+      * have changed since the program was installed.
+      */
+     const struct user_regset_view *view = task_user_regset_view(current);
+     const struct user_regset *regset = &view->regsets[0];
+     size_t scratch_size = *available;
+     if (regset->core_note_type != NT_PRSTATUS) {
+             /* The architecture should override this method for speed. */
+             regset = find_prstatus(view);
+             if (!regset)
+                     return NULL;
+     }
+     *available = regset->n * regset->size;
+     /* Make sure the scratch space isn't exceeded. */
+     if (*available > scratch_size)
+             *available = scratch_size;
+     if (regset->get(current, regset, 0, *available, scratch, NULL))
+             return NULL;
+     return scratch;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ *
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(int syscall)
+{
+     struct seccomp_filter *filter;
+     u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
+     size_t regs_size = sizeof(struct user_regs_struct);
+     int ret = -EACCES;
+
+     filter = current->seccomp.filter; /* uses task ref */
+     if (!filter)
+             goto out;
+
+     /* All filters in the list are required to share the same system call
+      * convention so only the first filter is ever checked.
+      */
+     if (seccomp_check_personality(filter))
+             goto out;
+
+     /* Grab the user_regs_struct.  Normally, regs == &regs_tmp, but
+      * that is not mandatory.  E.g., it may return a point to
+      * task_pt_regs(current).  NULL checking is mandatory.
+      */
+     regs = seccomp_get_regs(regs_tmp, &regs_size);
Stupid question. I am sure you know what are you doing ;) and I know
nothing about !x86 arches.
But could you explain why it is designed to use user_regs_struct ?
Why we can't simply use task_pt_regs() and avoid the (costly) regsets?

So on x86 32, it would work since user_regs_struct == task_pt_regs
(iirc), but on x86-64 and others, that's not true. I don't think it's
kosher to expose pt_regs to the userspace, but if, let's say, x86-32
overrides the weak linkage, then it could just return task_pt_regs and
be the fastest path.

If it would be appropriate to expose pt_regs to userspace, then I'd
happily do so :)
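
The trade-off described here can be sketched in userspace C, assuming a
made-up register block: a generic getter that copies into the caller's
scratch buffer and clamps the size (as the patch's seccomp_get_regs() does
with *available), and a fast-path getter with the same signature that just
returns a pointer. struct saved_regs, get_regs_copy and get_regs_direct are
hypothetical names for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the kernel's saved register block (hypothetical layout). */
struct saved_regs { unsigned long r[8]; };

/* Generic path: copy into the caller's scratch buffer, clamping the
 * reported size to what the buffer can hold. */
const unsigned char *get_regs_copy(const struct saved_regs *src,
                                   unsigned char *scratch, size_t *available)
{
    size_t want = sizeof(*src);

    assert(src && scratch && available);
    if (want > *available)
        want = *available;          /* never exceed the scratch space */
    memcpy(scratch, src, want);
    *available = want;
    return scratch;
}

/* Fast path: hand back a pointer to the saved registers, no copy.
 * The scratch argument is kept only so both getters share a signature. */
const unsigned char *get_regs_direct(const struct saved_regs *src,
                                     unsigned char *scratch, size_t *available)
{
    (void)scratch;
    assert(src && available);
    *available = sizeof(*src);
    return (const unsigned char *)src;
}
```

Because the caller only sees a pointer and a size, it does not need to know
which variant ran, which is what would let an architecture swap in the
cheaper one.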
Oleg Nesterov
2012-01-12 17:23:15 UTC
Permalink
+      */
+     regs = seccomp_get_regs(regs_tmp, &regs_size);
Stupid question. I am sure you know what are you doing ;) and I know
nothing about !x86 arches.
But could you explain why it is designed to use user_regs_struct ?
Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
So on x86 32, it would work since user_regs_struct == task_pt_regs
(iirc), but on x86-64 and others, that's not true.
Yes sure, I meant that userspace should use pt_regs too.
If it would be appropriate to expose pt_regs to userspace, then I'd
happily do so :)
Ah, so that was the reason. But it is already exported? At least I see
the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.

Once again, I am not arguing, just trying to understand. And I do not
know if this definition is part of abi.

Oleg.

Will Drewry
2012-01-12 17:51:54 UTC
Permalink
Post by Oleg Nesterov
+      */
+     regs = seccomp_get_regs(regs_tmp, &regs_size);
Stupid question. I am sure you know what are you doing ;) and I know
nothing about !x86 arches.
But could you explain why it is designed to use user_regs_struct ?
Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
So on x86 32, it would work since user_regs_struct == task_pt_regs
(iirc), but on x86-64 and others, that's not true.
Yes sure, I meant that userspace should use pt_regs too.
If it would be appropriate to expose pt_regs to userspace, then I'd
happily do so :)
Ah, so that was the reason. But it is already exported? At least I see
the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
Once again, I am not arguing, just trying to understand. And I do not
know if this definition is part of abi.
I don't either :/ My original idea was to operate on task_pt_regs(current),
but I noticed that PTRACE_GETREGS/SETREGS only uses the
user_regs_struct. So I went that route.

I'd love for pt_regs to be fair game to cut down on the copying!
will
Oleg Nesterov
2012-01-13 17:31:53 UTC
Permalink
Post by Oleg Nesterov
+      */
+     regs = seccomp_get_regs(regs_tmp, &regs_size);
Stupid question. I am sure you know what are you doing ;) and I know
nothing about !x86 arches.
But could you explain why it is designed to use user_regs_struct ?
Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
So on x86 32, it would work since user_regs_struct == task_pt_regs
(iirc), but on x86-64 and others, that's not true.
Yes sure, I meant that userspace should use pt_regs too.
If it would be appropriate to expose pt_regs to userspace, then I'd
happily do so :)
Ah, so that was the reason. But it is already exported? At least I see
the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
Once again, I am not arguing, just trying to understand. And I do not
know if this definition is part of abi.
I don't either :/ My original idea was to operate on task_pt_regs(current),
but I noticed that PTRACE_GETREGS/SETREGS only uses the
user_regs_struct. So I went that route.
Well, I don't know where user_regs_struct come from initially. But
probably it is needed to allow to access the "artificial" things like
fs_base. Or perhaps this struct mimics the layout in the coredump.
I'd love for pt_regs to be fair game to cut down on the copying!
Me too. I see no point in using user_regs_struct.

Oleg.

Will Drewry
2012-01-13 19:01:25 UTC
Permalink
Post by Oleg Nesterov
Post by Oleg Nesterov
+      */
+     regs = seccomp_get_regs(regs_tmp, &regs_size);
Stupid question. I am sure you know what are you doing ;) and I know
nothing about !x86 arches.
But could you explain why it is designed to use user_regs_struct ?
Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
So on x86 32, it would work since user_regs_struct == task_pt_regs
(iirc), but on x86-64 and others, that's not true.
Yes sure, I meant that userspace should use pt_regs too.
If it would be appropriate to expose pt_regs to userspace, then I'd
happily do so :)
Ah, so that was the reason. But it is already exported? At least I see
the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
Once again, I am not arguing, just trying to understand. And I do not
know if this definition is part of abi.
I don't either :/  My original idea was to operate on task_pt_regs(current),
but I noticed that PTRACE_GETREGS/SETREGS only uses the
user_regs_struct. So I went that route.
Well, I don't know where user_regs_struct come from initially. But
probably it is needed to allow to access the "artificial" things like
fs_base. Or perhaps this struct mimics the layout in the coredump.
Not sure - added Roland whose name was on many of the files :)

I just noticed that ptrace ABI allows pt_regs access using the register
macros (PTRACE_PEEKUSR) and user_regs_struct access (PTRACE_GETREGS).

But I think the latter is guaranteed to have a certain layout while the
macros for PEEKUSR can do post-processing fixup. (Which could be done in
the bpf evaluator load_pointer() helper if needed.)
Post by Oleg Nesterov
I'd love for pt_regs to be fair game to cut down on the copying!
Me too. I see no point in using user_regs_struct.
I'll rev the change to use pt_regs and drop all the helper code. If
no one says otherwise, that certainly seems ideal from a performance
perspective, and I see pt_regs exported to userland along with ptrace
abi register offset macros.


Thanks!
will
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Will Drewry
2012-01-13 23:10:41 UTC
Permalink
Post by Will Drewry
Post by Oleg Nesterov
Post by Oleg Nesterov
+       */
+     regs = seccomp_get_regs(regs_tmp, &regs_size);
Stupid question. I am sure you know what are you doing ;) and I know
nothing about !x86 arches.
But could you explain why it is designed to use user_regs_struct ?
Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
So on x86 32, it would work since user_regs_struct == task_pt_regs
(iirc), but on x86-64
and others, that's not true.
Yes sure, I meant that userspace should use pt_regs too.
If it would be appropriate to expose pt_regs to userspace, then I'd
happily do so :)
Ah, so that was the reason. But it is already exported? At least I see
the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
Once again, I am not arguing, just trying to understand. And I do not
know if this definition is part of abi.
I don't either :/ My original idea was to operate on task_pt_regs(current),
but I noticed that PTRACE_GETREGS/SETREGS only uses the
user_regs_struct. So I went that route.
Well, I don't know where user_regs_struct come from initially. But
probably it is needed to allow to access the "artificial" things like
fs_base. Or perhaps this struct mimics the layout in the coredump.
Not sure - added Roland whose name was on many of the files :)
I just noticed that ptrace ABI allows pt_regs access using the register
macros (PTRACE_PEEKUSR) and user_regs_struct access (PTRACE_GETREGS).
But I think the latter is guaranteed to have a certain layout while the macros
for PEEKUSR can do post-processing fixup. (Which could be done in the
bpf evaluator load_pointer() helper if needed.)
Post by Oleg Nesterov
I'd love for pt_regs to be fair game to cut down on the copying!
Me too. I see no point in using user_regs_struct.
I'll rev the change to use pt_regs and drop all the helper code. If
no one says otherwise, that certainly seems ideal from a performance
perspective, and I see pt_regs exported to userland along with ptrace
abi register offset macros.
On second thought, pt_regs is scary :)

From looking at
http://lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
and the ia32syscall entry code, it appears that for x86, at least, the
pt_regs for compat processes will be 8 bytes wide per register on the
stack. This means if a self-filtering 32-bit program runs on a 64-bit host in
IA32_EMU, its filters will always index into pt_regs incorrectly.

I'm not 100% sure that I am reading the code right, but it means that I can
either keep using user_regs_struct or fork the code behavior based on compat.
That would need to be arch dependent then, which is pretty rough.

Any thoughts?

I'll do a v5 rev for Eric's comments soon, but I'm not quite sure
about the pt_regs change yet. If the performance boost is worth the effort
of having a per-arch fixup, I can go that route. Otherwise, I could look at
some alternate approach for a faster-than-regview payload.
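
The indexing problem described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: struct hyp_regs32, struct hyp_regs64 and hyp_load32() are hypothetical names, and the three-register layouts are simplified stand-ins, not the kernel's real pt_regs.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layouts, NOT the kernel's pt_regs: the same three
 * registers as a 32-bit filter author expects them (4-byte slots) and
 * as they would sit on a 64-bit kernel stack (8-byte slots). */
struct hyp_regs32 { uint32_t bx, cx, dx; };
struct hyp_regs64 { uint64_t bx, cx, dx; };

/* Read a 32-bit word at a byte offset, roughly what an absolute
 * word load in a filter program would do. */
static uint32_t hyp_load32(const void *regs, size_t off)
{
    uint32_t v;

    memcpy(&v, (const unsigned char *)regs + off, sizeof(v));
    return v;
}
```

A filter compiled against the 32-bit layout looks for cx at byte offset 4, while in the 8-byte-slot layout cx starts at offset 8, so the load lands inside bx instead.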

Thanks!
Will Drewry
2012-01-13 23:12:11 UTC
Permalink
Post by Will Drewry
Post by Will Drewry
Post by Oleg Nesterov
Post by Oleg Nesterov
+       */
+     regs = seccomp_get_regs(regs_tmp, &regs_size);
Stupid question. I am sure you know what are you doing ;) and I know
nothing about !x86 arches.
But could you explain why it is designed to use user_regs_struct ?
Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
So on x86 32, it would work since user_regs_struct == task_pt_regs
(iirc), but on x86-64
and others, that's not true.
Yes sure, I meant that userspace should use pt_regs too.
If it would be appropriate to expose pt_regs to userspace, then I'd
happily do so :)
Ah, so that was the reason. But it is already exported? At least I see
the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
Once again, I am not arguing, just trying to understand. And I do not
know if this definition is part of abi.
I don't either :/ My original idea was to operate on task_pt_regs(current),
but I noticed that PTRACE_GETREGS/SETREGS only uses the
user_regs_struct. So I went that route.
Well, I don't know where user_regs_struct come from initially. But
probably it is needed to allow to access the "artificial" things like
fs_base. Or perhaps this struct mimics the layout in the coredump.
Not sure - added Roland whose name was on many of the files :)
I just noticed that ptrace ABI allows pt_regs access using the register
macros (PTRACE_PEEKUSR) and user_regs_struct access (PTRACE_GETREGS).
But I think the latter is guaranteed to have a certain layout while the macros
for PEEKUSR can do post-processing fixup. (Which could be done in the
bpf evaluator load_pointer() helper if needed.)
Post by Oleg Nesterov
I'd love for pt_regs to be fair game to cut down on the copying!
Me too. I see no point in using user_regs_struct.
I'll rev the change to use pt_regs and drop all the helper code. If
no one says otherwise, that certainly seems ideal from a performance
perspective, and I see pt_regs exported to userland along with ptrace
abi register offset macros.
On second thought, pt_regs is scary :)
From looking at
http://lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
and ia32syscall entry code, it appears that for x86, at least, the
pt_regs for compat processes will be 8 bytes wide per register on the
stack. This means if a self-filtering 32-bit program runs on a 64-bit host in
IA32_EMU, its filters will always index into pt_regs incorrectly.
I'm not 100% that I am reading the code right, but it means that I can either
keep using user_regs_struct or fork the code behavior based on compat. That
would need to be arch dependent then which is pretty rough.
Any thoughts?
I'll do a v5 rev for Eric's comments soon, but I'm not quite sure
about the pt_regs
change yet. If the performance boost is worth the effort of having a
per-arch fixup,
I can go that route. Otherwise, I could look at some alternate approach for a
faster-than-regview payload.
Ugh. Sorry about the formatting. (The other option is to disallow compat ;).

cheers!
will
Eric Paris
2012-01-13 23:30:47 UTC
Permalink
For anyone who is interested I hacked up a program to turn what I think
is a readable seccomp syntax into BPF rules. It should make it easier
to prototype this new thing. The translator needs a LOT of love to be
worth much, but for now it can handle a couple of things and can build a
set of rules!

The rules are of the form:

label object:
    value label

So using Will's BPF example code in my syntax looks like:

start syscall:
    rt_sigreturn success
    sigreturn success
    exit_group success
    exit success
    read read
    write write
read arg0:
    0 success
write arg0:
    1 success
    2 success

So this says the first label is "start" and it is going to deal with the
syscall number. The first value is 'rt_sigreturn', and if syscall ==
rt_sigreturn you jump to 'success' (success and fail are
implied labels). If the syscall is 'write' we will jump to 'write'.
The write rules look at arg0. If arg0 == "1" we jump to "success". If
you run that syntax through my translator you should get Will's BPF
rules!
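
To make the jump semantics concrete, here is a small C sketch that evaluates the rule set above directly instead of translating it to BPF. It is not the translator: every hyp_* name and the syscall identifiers are hypothetical, and treating "no matching value" as a jump to the implied 'fail' label is an assumption the description above does not spell out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical syscall identifiers for illustration only; real syscall
 * numbers differ per architecture. */
enum { HYP_RT_SIGRETURN = 1, HYP_SIGRETURN, HYP_EXIT_GROUP, HYP_EXIT,
       HYP_READ, HYP_WRITE, HYP_OPEN };

enum hyp_object { HYP_OBJ_SYSCALL, HYP_OBJ_ARG0 };

struct hyp_rule  { long value; const char *target; };
struct hyp_block { const char *label; enum hyp_object object;
                   const struct hyp_rule *rules; size_t nrules; };

static const struct hyp_rule hyp_start[] = {
    { HYP_RT_SIGRETURN, "success" }, { HYP_SIGRETURN, "success" },
    { HYP_EXIT_GROUP, "success" },   { HYP_EXIT, "success" },
    { HYP_READ, "read" },            { HYP_WRITE, "write" },
};
static const struct hyp_rule hyp_read[]  = { { 0, "success" } };
static const struct hyp_rule hyp_write[] = { { 1, "success" }, { 2, "success" } };

static const struct hyp_block hyp_blocks[] = {
    { "start", HYP_OBJ_SYSCALL, hyp_start, 6 },
    { "read",  HYP_OBJ_ARG0,    hyp_read,  1 },
    { "write", HYP_OBJ_ARG0,    hyp_write, 2 },
};

/* Returns 1 if the rules reach "success", 0 if they reach "fail". */
static int hyp_evaluate(long syscall_nr, long arg0)
{
    const char *label = "start";
    size_t hops, i, j;

    /* Bound the number of jumps so a cyclic rule set cannot loop forever. */
    for (hops = 0; hops < 16; hops++) {
        const struct hyp_block *b = NULL;
        const char *next = "fail"; /* assumption: no match means fail */
        long v;

        if (strcmp(label, "success") == 0)
            return 1;
        if (strcmp(label, "fail") == 0)
            return 0;

        for (i = 0; i < sizeof(hyp_blocks) / sizeof(hyp_blocks[0]); i++)
            if (strcmp(hyp_blocks[i].label, label) == 0)
                b = &hyp_blocks[i];
        if (!b)
            return 0; /* unknown label: deny */

        v = (b->object == HYP_OBJ_SYSCALL) ? syscall_nr : arg0;
        for (j = 0; j < b->nrules; j++) {
            if (b->rules[j].value == v) {
                next = b->rules[j].target;
                break;
            }
        }
        label = next;
    }
    return 0;
}
```

With these tables, a write to descriptor 1 walks start -> write -> success, while a write to descriptor 3 or any syscall not listed under "start" ends at fail.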

You'll quickly notice that the translator only understands "syscall" and
"arg0" and only x86_32, but it should be easy to add more, support the
right registers on different arches, etc, etc. If others think they
might want to hack on the translator I put it at:

http://git.infradead.org/users/eparis/bpf-translate.git

-Eric
Oleg Nesterov
2012-01-16 18:37:30 UTC
Permalink
Post by Will Drewry
Post by Will Drewry
Post by Oleg Nesterov
Me too. I see no point in using user_regs_struct.
I'll rev the change to use pt_regs and drop all the helper code. If
no one says otherwise, that certainly seems ideal from a performance
perspective, and I see pt_regs exported to userland along with ptrace
abi register offset macros.
On second thought, pt_regs is scary :)
From looking at
http://lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
and ia32syscall entry code, it appears that for x86, at least, the
pt_regs for compat processes will be 8 bytes wide per register on the
stack. This means if a self-filtering 32-bit program runs on a 64-bit host in
IA32_EMU, its filters will always index into pt_regs incorrectly.
Yes, thanks, I forgot about compat tasks again. But this is easy, just
we need regs_64_to_32().

Doesn't matter. I think Indan has a better suggestion.

Oleg.

Will Drewry
2012-01-16 20:15:10 UTC
Permalink
Post by Will Drewry
Post by Will Drewry
Post by Oleg Nesterov
Me too. I see no point in using user_regs_struct.
I'll rev the change to use pt_regs and drop all the helper code. If
no one says otherwise, that certainly seems ideal from a performance
perspective, and I see pt_regs exported to userland along with ptrace
abi register offset macros.
On second thought, pt_regs is scary :)
From looking at
http://lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
and ia32syscall entry code, it appears that for x86, at least, the
pt_regs for compat processes will be 8 bytes wide per register on the
stack. This means if a self-filtering 32-bit program runs on a 64-bit host in
IA32_EMU, its filters will always index into pt_regs incorrectly.
Yes, thanks, I forgot about compat tasks again. But this is easy, just
we need regs_64_to_32().
Yup - we could make the assumption that is_compat_task is always
32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
regs_64_to_32. Seems kinda wonky though :/
Doesn't matter. I think Indan has a better suggestion.
I disagree, but perhaps I'm not fully understanding!

Thanks!
will
Oleg Nesterov
2012-01-17 16:45:23 UTC
Permalink
Post by Will Drewry
Post by Oleg Nesterov
Yes, thanks, I forgot about compat tasks again. But this is easy, just
we need regs_64_to_32().
Yup - we could make the assumption that is_compat_task is always
32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
regs_64_to_32. Seems kinda wonky though :/
much simpler/faster than what regset does to create the artificial
user_regs_struct32.
Post by Will Drewry
Post by Oleg Nesterov
Doesn't matter. I think Indan has a better suggestion.
I disagree, but perhaps I'm not fully understanding!
I have much more chances to be wrong ;) I leave it to you and Indan.

Oleg.

Will Drewry
2012-01-17 16:56:04 UTC
Permalink
Post by Oleg Nesterov
Post by Will Drewry
Yes, thanks, I forgot about compat tasks again. But this is easy, just
we need regs_64_to_32().
Yup - we could make the assumption that is_compat_task is always
32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
regs_64_to_32. Seems kinda wonky though :/
much simpler/faster than what regset does to create the artificial
user_regs_struct32.
True, I could collapse pt_regs to look like the exported ABI pt_regs.
Then only compat processes would get the copy overhead. That could
be tidy and not break ABI. It would mean that I have to assume that
if unsigned long == 64-bit and is_compat_task(), then the task is
32-bit. Do you think if we ever add a crazy 128-bit "supercomputer"
arch that we will add a is_compat64_task() so that I could properly
collapse? :)

I like this idea!
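
A minimal sketch of the collapse idea, assuming a much-reduced register set: the struct names and hyp_regs_64_to_32() are hypothetical stand-ins for whatever a real regs_64_to_32() would operate on, and only four registers are shown.

```c
#include <stdint.h>

/* Hypothetical, reduced register sets; the real pt_regs has more fields. */
struct hyp_regs64 { uint64_t bx, cx, dx, orig_ax; };
struct hyp_regs32 { uint32_t bx, cx, dx, orig_ax; };

/* Copy each 8-byte slot into a 4-byte slot, dropping the upper half, so
 * that a filter written against the 32-bit layout sees the offsets it
 * expects. In the scheme discussed above this copy would only be paid
 * on compat entries. */
static void hyp_regs_64_to_32(struct hyp_regs32 *dst,
                              const struct hyp_regs64 *src)
{
    dst->bx      = (uint32_t)src->bx;
    dst->cx      = (uint32_t)src->cx;
    dst->dx      = (uint32_t)src->dx;
    dst->orig_ax = (uint32_t)src->orig_ax;
}
```

Native entries would hand the filter the 64-bit registers untouched, which is where the saving over the regset path would come from.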
Post by Oleg Nesterov
Post by Will Drewry
Doesn't matter. I think Indan has a better suggestion.
I disagree, but perhaps I'm not fully understanding!
I have much more chances to be wrong ;) I leave it to you and Indan.
We're being very verbose. I hope we can come to a good place! I took
a break from my response to reply here :)

thanks!
will
Andrew Lutomirski
2012-01-17 17:01:09 UTC
Permalink
Post by Oleg Nesterov
Post by Will Drewry
Yes, thanks, I forgot about compat tasks again. But this is easy, just
we need regs_64_to_32().
Yup - we could make the assumption that is_compat_task is always
32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
regs_64_to_32. Seems kinda wonky though :/
much simpler/faster than what regset does to create the artificial
user_regs_struct32.
True, I could collapse pt_regs to look like the exported ABI pt_regs.
Then only compat processes would get the copy overhead. That could
be tidy and not break ABI. It would mean that I have to assume that
if unsigned long == 64-bit and is_compat_task(), then the task is
32-bit. Do you think if we ever add a crazy 128-bit "supercomputer"
arch that we will add a is_compat64_task() so that I could properly
collapse? :)
I like this idea!
FWIW, it's possible for a task to execute in 32-bit mode when
!is_compat_task or in 64-bit mode when is_compat_task. From earlier
in the thread, I think you were planning to block the wrong-bitness
syscall entries, but it's worth double-checking that you don't open up
a hole when a compat task issues the 64-bit syscall instruction.

(is_compat_task says whether the executable was marked as 32-bit. The
actual execution mode is determined by the cs register, which the user
can control. See the user_64bit_mode function in
arch/x86/include/asm/ptrace.h. But maybe it would make more sense to have a
separate 32-bit and 64-bit BPF program and select which one to use
based on the entry point.)

--Andy
Will Drewry
2012-01-17 17:06:59 UTC
Permalink
Post by Oleg Nesterov
Post by Will Drewry
Yes, thanks, I forgot about compat tasks again. But this is easy, just
we need regs_64_to_32().
Yup - we could make the assumption that is_compat_task is always
32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
regs_64_to_32. Seems kinda wonky though :/
much simpler/faster than what regset does to create the artificial
user_regs_struct32.
True, I could collapse pt_regs to look like the exported ABI pt_regs.
Then only compat processes would get the copy overhead. That could
be tidy and not break ABI. It would mean that I have to assume that
if unsigned long == 64-bit and is_compat_task(), then the task is
32-bit. Do you think if we ever add a crazy 128-bit "supercomputer"
arch that we will add a is_compat64_task() so that I could properly
collapse? :)
I like this idea!
FWIW, it's possible for a task to execute in 32-bit mode when
!is_compat_task or in 64-bit mode when is_compat_task. From earlier
in the thread, I think you were planning to block the wrong-bitness
syscall entries, but it's worth double-checking that you don't open up
a hole when a compat task issues the 64-bit syscall instruction.
Yup - I had to (see below).
(is_compat_task says whether the executable was marked as 32-bit. The
actual execution mode is determined by the cs register, which the user
can control. See the user_64bit_mode function in
arch/x86/include/asm/ptrace.h. But maybe it would make more sense to have a
separate 32-bit and 64-bit BPF program and select which one to use
based on the entry point.)
So that was my original design, but the problem was with how regviews
decides on the user_regs_struct. It decides using TIF_IA32 while I
can only check the cross-arch is_compat_task(), which checks TS_COMPAT
on x86. If I'm just collapsing registers for compat calls (which I am
exploring the viability of right now), then I guess I could re-fork
the filtering to support compat versus non-compat. The nastier bit
there was that I don't want a compat call to be allowed just
because a process only defined a non-compat filter. I think that can be made
manageable though.
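
The "don't allow a compat call just because only a non-compat filter exists" concern can be sketched as a default-deny selection. Everything here is hypothetical (hyp_filter_fn, struct hyp_filter_set, hyp_run_filters); the is_compat argument stands in for the result of is_compat_task() at syscall entry.

```c
#include <stddef.h>

/* A filter returns 1 to allow the syscall and 0 to deny it. */
typedef int (*hyp_filter_fn)(long syscall_nr);

/* One filter per entry ABI; either slot may be left empty. */
struct hyp_filter_set {
    hyp_filter_fn native; /* used for non-compat entries */
    hyp_filter_fn compat; /* used for compat entries */
};

/* Pick the filter that matches the entry ABI. A missing filter means
 * deny, so installing only a native filter never opens up the compat
 * syscall table (and vice versa). */
static int hyp_run_filters(const struct hyp_filter_set *fs,
                           int is_compat, long syscall_nr)
{
    hyp_filter_fn f = is_compat ? fs->compat : fs->native;

    if (f == NULL)
        return 0;
    return f(syscall_nr);
}

/* Trivial example filter that allows everything. */
static int hyp_allow_all(long syscall_nr)
{
    (void)syscall_nr;
    return 1;
}
```

The design choice is that the absence of a filter is treated as the strictest possible filter rather than as "unfiltered".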

I'll finish proving out the possibilities here.

Thanks!
will
Oleg Nesterov
2012-01-17 17:05:12 UTC
Permalink
Post by Andrew Lutomirski
(is_compat_task says whether the executable was marked as 32-bit. The
actual execution mode is determined by the cs register, which the user
can control.
Confused... Afaics, TIF_IA32 says that the binary is 32-bit (this comes
along with TS_COMPAT).

TS_COMPAT says that, say, the task did "int 80" to enter the kernel.
64-bit or not, we should treat it as 32-bit in this case.

No?

Oleg.
Andrew Lutomirski
2012-01-17 17:45:25 UTC
Permalink
(is_compat_task says whether the executable was marked as 32-bit. The
actual execution mode is determined by the cs register, which the user
can control.
Confused... Afaics, TIF_IA32 says that the binary is 32-bit (this comes
along with TS_COMPAT).
TS_COMPAT says that, say, the task did "int 80" to enter the kernel.
64-bit or not, we should treat it as 32-bit in this case.
I think you're right, and checking which entry was used is better than
checking the cs register (since 64-bit code can use int80). That's
what I get for insufficiently careful reading of the assembly. (And
for going from memory from when I wrote the vsyscall emulation code --
that code is entered from a page fault, so the entry point used is
irrelevant.)

--Andy
Indan Zupancic
2012-01-18 00:56:20 UTC
Permalink
(is_compat_task says whether the executable was marked as 32-bit. The
actual execution mode is determined by the cs register, which the user
can control.
Confused... Afaics, TIF_IA32 says that the binary is 32-bit (this comes
along with TS_COMPAT).
TS_COMPAT says that, say, the task did "int 80" to enter the kernel.
64-bit or not, we should treat it as 32-bit in this case.
I think you're right, and checking which entry was used is better than
checking the cs register (since 64-bit code can use int80). That's
what I get for insufficiently careful reading of the assembly. (And
for going from memory from when I wrote the vsyscall emulation code --
that code is entered from a page fault, so the entry point used is
irrelevant.)
Wait: If a task is set to 64 bit mode, but calls into the kernel via
int 0x80 it's changed to 32 bit mode for that system call and back to
64 bit mode when the system call is finished!?

Our ptrace jailer is checking cs to figure out if a task is a compat task
or not, if the kernel can change that behind our back it means our jailer
isn't secure for x86_64 with compat enabled. Or is cs changed before the
ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
there another way?

I think this behaviour is so unexpected that it can only cause security
problems in the long run. Is anyone counting on this? Where is this
behaviour documented?

Greetings,

Indan
Andrew Lutomirski
2012-01-18 01:01:41 UTC
Permalink
Post by Indan Zupancic
(is_compat_task says whether the executable was marked as 32-bit. The
actual execution mode is determined by the cs register, which the user
can control.
Confused... Afaics, TIF_IA32 says that the binary is 32-bit (this comes
along with TS_COMPAT).
TS_COMPAT says that, say, the task did "int 80" to enter the kernel.
64-bit or not, we should treat it as 32-bit in this case.
I think you're right, and checking which entry was used is better than
checking the cs register (since 64-bit code can use int80). That's
what I get for insufficiently careful reading of the assembly. (And
for going from memory from when I wrote the vsyscall emulation code --
that code is entered from a page fault, so the entry point used is
irrelevant.)
Wait: If a task is set to 64 bit mode, but calls into the kernel via
int 0x80 it's changed to 32 bit mode for that system call and back to
64 bit mode when the system call is finished!?
Our ptrace jailer is checking cs to figure out if a task is a compat task
or not, if the kernel can change that behind our back it means our jailer
isn't secure for x86_64 with compat enabled. Or is cs changed before the
ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
there another way?
I don't know what your ptrace jailer does. But a task can switch
itself between 32-bit and 64-bit execution at will, and there's
nothing the kernel can do about it. (That isn't quite true -- in
theory the kernel could fiddle with the GDT, but that would be
expensive and wouldn't work on Xen.)

That being said, is_compat_task is apparently a good indication of
whether the current *syscall* entry is a 64-bit syscall or a 32-bit
syscall. Perhaps the function should be renamed to in_compat_syscall,
because that's what it does.
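
The disagreement between "what cs says" and "which entry path was taken" can be modelled in a few lines. This is a simplified sketch: the hyp_* names are hypothetical, only three entry paths are modelled, and the selector values are the ones used for user code on a native (non-paravirtualized) x86-64 kernel.

```c
#include <stdint.h>

/* User code segment selectors on a native x86-64 Linux kernel. */
#define HYP_USER_CS   0x33 /* 64-bit user code */
#define HYP_USER32_CS 0x23 /* 32-bit user code */

/* Simplified set of ways to enter the kernel for a syscall. */
enum hyp_entry { HYP_ENTRY_SYSCALL64, HYP_ENTRY_INT80, HYP_ENTRY_SYSENTER32 };

/* What a tracer concludes if it looks only at the saved cs value. */
static int hyp_cs_is_64bit(uint16_t cs)
{
    return cs == HYP_USER_CS;
}

/* What decides the syscall table: the entry path. Per the discussion
 * above, int 0x80 always means the 32-bit table, whatever cs holds. */
static int hyp_entry_is_compat(enum hyp_entry entry)
{
    return entry != HYP_ENTRY_SYSCALL64;
}
```

For 64-bit code that executes int 0x80, the first check reports a 64-bit task while the second reports a compat syscall, which is exactly the case a cs-based jailer would misjudge.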
Post by Indan Zupancic
I think this behaviour is so unexpected that it can only cause security
problems in the long run. Is anyone counting on this? Where is this
behaviour documented?
Nowhere, I think.

--Andy
Roland McGrath
2012-01-18 01:07:04 UTC
Permalink
Post by Indan Zupancic
Wait: If a task is set to 64 bit mode, but calls into the kernel via
int 0x80 it's changed to 32 bit mode for that system call and back to
64 bit mode when the system call is finished!?
Well, saying it like that suggests that there is more of a "mode change"
than really exists. It's simply that any task can use int $0x80 and
this always means using the 32-bit syscall table with TS_COMPAT set.
Post by Indan Zupancic
Our ptrace jailer is checking cs to figure out if a task is a compat task
or not, if the kernel can change that behind our back it means our jailer
isn't secure for x86_64 with compat enabled. Or is cs changed before the
ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
there another way?
I don't think there's another way. hpa and I once discussed adding a field
to the extractable "register state" that would say which method the syscall
in progress had taken to enter the kernel. That would tell you which
flavor of syscall instruction was used (or none, i.e. a trap/interrupt).
But nobody ever had a real need for it, and we didn't pursue it further.
(We originally talked about it in the context of distinguishing whether a
32-bit task had used sysenter or syscall or int $0x80, I think.)
Post by Indan Zupancic
I think this behaviour is so unexpected that it can only cause security
problems in the long run. Is anyone counting on this? Where is this
behaviour documented?
It's documented the same place the entire Linux machine-level ABI is
documented, which is nowhere. Someone somewhere may once have been
counting on it. (The story I heard was about an implementation of valgrind
for 32-bit code that ran in 64-bit tasks, but I don't know for sure that it
was really done.) The general rule is that if it ever worked before in a
coherent way, we don't break binary compatibility.

In the implementation, it would require a special check to make it barf.
It's really just something that falls out of how the hardware and the
kernel implementation works. I suppose you could add such a check under a
new kconfig option that's marked as being potentially incompatible with
some old applications. Good luck with that.
Indan Zupancic
2012-01-18 01:47:04 UTC
Permalink
Post by Roland McGrath
Post by Indan Zupancic
Wait: If a task is set to 64 bit mode, but calls into the kernel via
int 0x80 it's changed to 32 bit mode for that system call and back to
64 bit mode when the system call is finished!?
Well, saying it like that suggests that there is more of a "mode change"
than really exists. It's simply that any task can use int $0x80 and
this always means using the 32-bit syscall table with TS_COMPAT set.
True, the kernel always runs in 64-bit mode, it just selects which path
is taken.
Post by Roland McGrath
Post by Indan Zupancic
Our ptrace jailer is checking cs to figure out if a task is a compat task
or not, if the kernel can change that behind our back it means our jailer
isn't secure for x86_64 with compat enabled. Or is cs changed before the
ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
there another way?
I don't think there's another way. hpa and I once discussed adding a field
to the extractable "register state" that would say which method the syscall
in progress had taken to enter the kernel. That would tell you which
flavor of syscall instruction was used (or none, i.e. a trap/interrupt).
But nobody ever had a real need for it, and we didn't pursue it further.
(We originally talked about it in the context of distinguishing whether a
32-bit task had used sysenter or syscall or int $0x80, I think.)
Argh. So strace and all other ptrace users will think the task is calling a
different system call than it executes, except if they check for int 0x80,
which I bet they don't.

I suppose I could cache the checked EIP-2's results, but then I also have to
check if the memory is read-only and invalidate the cache when the mapping may
be changed. Probably not worth the complexity.
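
For reference, the EIP-2 check amounts to looking at the two bytes in front of the return address. The sketch below works on a plain byte buffer; in a tracer those bytes would come from something like PTRACE_PEEKTEXT. The hyp_* names are hypothetical, sysenter is deliberately left out, and the result is only as trustworthy as the memory it was read from, since the bytes can change between the trap and the read.

```c
#include <stddef.h>
#include <stdint.h>

enum hyp_insn { HYP_INSN_UNKNOWN, HYP_INSN_INT80, HYP_INSN_SYSCALL };

/* Classify the two bytes that precede the return address `ip`, where
 * `ip` is an index into `text`. Encodings: int $0x80 is CD 80 and
 * syscall is 0F 05. */
static enum hyp_insn hyp_classify(const uint8_t *text, size_t ip)
{
    uint8_t a, b;

    if (ip < 2)
        return HYP_INSN_UNKNOWN;

    a = text[ip - 2];
    b = text[ip - 1];

    if (a == 0xcd && b == 0x80)
        return HYP_INSN_INT80;
    if (a == 0x0f && b == 0x05)
        return HYP_INSN_SYSCALL;
    return HYP_INSN_UNKNOWN;
}
```

Caching the result per address, as suggested above, would additionally need the read-only and mapping-change checks, which this sketch does not attempt.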
Post by Roland McGrath
Post by Indan Zupancic
I think this behaviour is so unexpected that it can only cause security
problems in the long run. Is anyone counting on this? Where is this
behaviour documented?
It's documented the same place the entire Linux machine-level ABI is
documented, which is nowhere.
AMD wrote the "System V Application Binary Interface" which describes
some Linux conventions. It's better than nothing. But it just mentions
'syscall', not what happens when int 0x80 is called anyway.
Post by Roland McGrath
Someone somewhere may once have been
counting on it. (The story I heard was about an implementation of valgrind
for 32-bit code that ran in 64-bit tasks, but I don't know for sure that it
was really done.) The general rule is that if it ever worked before in a
coherent way, we don't break binary compatibility.
Well, considering the code can't be sure if the kernel supports compat mode
at all, I think this case is getting even more obscure than it already is.
Disallowing it won't change the kernel behaviour compared to a kernel with
compat disabled.

What about disallowing this path when the task is being ptraced?
Post by Roland McGrath
In the implementation, it would require a special check to make it barf.
It's really just something that falls out of how the hardware and the
kernel implementation works. I suppose you could add such a check under a
new kconfig option that's marked as being potentially incompatible with
some old applications. Good luck with that.
That seems a hopeless path to follow, and won't solve my problem because
my code has to be able to run on all kernels. Half the point of using
ptrace for jailing was that it's mostly portable with no special kernel
support.

Greetings,

Indan


Jamie Lokier
2012-01-18 01:48:09 UTC
Permalink
Post by Indan Zupancic
(is_compat_task says whether the executable was marked as 32-bit. The
actual execution mode is determined by the cs register, which the user
can control.
Confused... Afaics, TIF_IA32 says that the binary is 32-bit (this comes
along with TS_COMPAT).
TS_COMPAT says that, say, the task did "int 80" to enters the kernel.
64-bit or not, we should treat is as 32-bit in this case.
I think you're right, and checking which entry was used is better than
checking the cs register (since 64-bit code can use int80). That's
what I get for insufficiently careful reading of the assembly. (And
for going from memory from when I wrote the vsyscall emulation code --
that code is entered from a page fault, so the entry point used is
irrelevant.)

Wait: If a tasks is set to 64 bit mode, but calls into the kernel via
int 0x80 it's changed to 32 bit mode for that system call and back to
64 bit mode when the system call is finished!?

Our ptrace jailer is checking cs to figure out if a task is a compat task
or not, if the kernel can change that behind our back it means our jailer
isn't secure for x86_64 with compat enabled. Or is cs changed before the
ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
there another way?
PTRACE_PEEKTEXT won't securely tell you if it's int 0x80 if there's
another thread modifying the code, or changing the mappings, or it's
executing from a file or shared memory that someone's writing to.
Post by Indan Zupancic
I think this behaviour is so unexpected that it can only cause security
problems in the long run. Is anyone counting on this? Where is this
behaviour documented?
It's a surprise to me too. And like you I'm using ptrace, to trace
what a process touches, not restrict it, but it's subject to the same
problem.

This looks like it needs a kernel patch.

-- Jamie
Andi Kleen
2012-01-18 01:50:13 UTC
Permalink
Post by Indan Zupancic
Our ptrace jailer is checking cs to figure out if a task is a compat task
or not, if the kernel can change that behind our back it means our jailer
Every user program change it behind your back.

Your ptrace jailer isn't.
Post by Indan Zupancic
I think this behaviour is so unexpected that it can only cause security
problems in the long run. Is anyone counting on this? Where is this
behaviour documented?
Look up far jumps in any x86 manual.

-Andi
--
***@linux.intel.com -- Speaking for myself only.
Steven Rostedt
2012-01-18 02:00:11 UTC
Permalink
Post by Andi Kleen
Every user program change it behind your back.
Your ptrace jailer isn't.
I'm sorry but I can't read the above two lines without hearing Yoda's
voice. "Hmm hmm"

-- Steve


Jamie Lokier
2012-01-18 02:04:53 UTC
Permalink
Post by Andi Kleen
Post by Indan Zupancic
Our ptrace jailer is checking cs to figure out if a task is a compat task
or not, if the kernel can change that behind our back it means our jailer
Every user program change it behind your back.
..
Post by Andi Kleen
Look up far jumps in any x86 manual.
I'm pretty sure this isn't about changing cs or far jumps

I think Indan means code is running with 64-bit cs, but the kernel
treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
and there's no way for the ptracer to know which syscall the kernel
will perform, even by looking at all registers. It looks like a hole
in ptrace which could be fixed.

-- Jamie
Andi Kleen
2012-01-18 02:22:17 UTC
Permalink
Post by Jamie Lokier
I'm pretty sure this isn't about changing cs or far jumps
He's assuming that code can only run on two code segments and
not arbitrarily switch between them which is a completely incorrect
assumption.
Post by Jamie Lokier
I think Indan means code is running with 64-bit cs, but the kernel
treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
and there's no way for the ptracer to know which syscall the kernel
will perform, even by looking at all registers. It looks like a hole
in ptrace which could be fixed.
Possibly, but anything that bases its security on ptrace is typically
unfixable racy (just think what happens with multiple threads
and syscall arguments), so it's unlikely to do any good.

-Andi
--
***@linux.intel.com -- Speaking for myself only.
Andrew Lutomirski
2012-01-18 02:25:50 UTC
Permalink
Post by Andi Kleen
Post by Jamie Lokier
I'm pretty sure this isn't about changing cs or far jumps
He's assuming that code can only run on two code segments and
not arbitrarily switch between them which is a completely incorrect
assumption.
I think all he needs is to figure out which type of syscall was just
intercepted. (Obviously arguments in memory are a problem.)
Indan Zupancic
2012-01-18 04:22:31 UTC
Permalink
Post by Andi Kleen
Post by Jamie Lokier
I'm pretty sure this isn't about changing cs or far jumps
He's assuming that code can only run on two code segments and
not arbitrarily switch between them which is a completely incorrect
assumption.
All I assumed up to now was that cs shows the current mode of the process,
and that that defines which system call path is taken. Apparently that is
not true and int 0x80 forces the compat system call path.

Looking at EIP - 2 seems like a secure way to check how we entered the kernel.
Post by Andi Kleen
Post by Jamie Lokier
I think Indan means code is running with 64-bit cs, but the kernel
treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
and there's no way for the ptracer to know which syscall the kernel
will perform, even by looking at all registers.
Yes, that's what I meant.
Post by Andi Kleen
Post by Jamie Lokier
It looks like a hole in ptrace which could be fixed.
Possibly, but anything that bases its security on ptrace is typically
unfixable racy (just think what happens with multiple threads
and syscall arguments), so it's unlikely to do any good.
As far as I know, we fixed all races except symlink races caused by malicious
code outside the jail. Those are controllable by limiting what filesystem access
the prisoners get. A special open() flag which causes open to fail when a part
of the path is a symlink with a distinguishable error code would solve this for
us.

Other than that and the abysmal performance, ptrace is fine for jailing.

Greetings,

Indan
Linus Torvalds
2012-01-18 05:23:51 UTC
Permalink
Post by Indan Zupancic
Looking at EIP - 2 seems like a secure way to check how we entered the kernel.
Secure? No. Not at all.

It's actually very easy to fool it. Do something like this:

- map the same physical page executably at one address, and writably
4kB above it (use shared memory, and map it twice).

- in that page, do this:

    lea 1f,%edx
    movl $SYSCALL,%eax
    movl $-1,4096(%edx)
1:
    int 0x80

and what happens is that the move that *overwrites* the int 0x80 will
not be noticed by the I$ coherency because it's at another address,
but by the time you read at $pc-2, you'll get -1, not "int 0x80"
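
The aliasing half of this trick (one physical page visible at two
virtual addresses) can be sketched in C roughly as below. This is only a
minimal sketch under assumptions: the helper name alias_write_visible is
made up for illustration, the page is backed by a tmpfile() rather than
any particular shared memory object, and the writable view is not placed
exactly 4kB above the other one. It shows that a store through one view
is seen through the other; it does not attempt to reproduce the pipeline
behaviour, which depends on the CPU.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one page of a temporary file twice: once with view_prot (the
 * "code" view, which must include PROT_READ for this check) and once
 * writable. Returns 1 if a store through the writable alias becomes
 * visible through the other view, 0 if not, -1 on setup failure.
 */
int alias_write_visible(int view_prot)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    FILE *f = tmpfile();
    if (f == NULL)
        return -1;

    int fd = fileno(f);
    if (ftruncate(fd, page) != 0) {
        fclose(f);
        return -1;
    }

    unsigned char *view = mmap(NULL, (size_t)page, view_prot,
                               MAP_SHARED, fd, 0);
    unsigned char *alias = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
    int result = -1;

    if (view != MAP_FAILED && alias != MAP_FAILED) {
        /* 0xcd 0x80 is the encoding of "int $0x80". */
        alias[0] = 0xcd;
        alias[1] = 0x80;
        int before = (view[0] == 0xcd && view[1] == 0x80);

        /* Overwrite it through the alias, like "movl $-1" above. */
        memset(alias, 0xff, 4);
        int after = (view[0] == 0xff && view[1] == 0xff);

        result = before && after;
    }

    if (view != MAP_FAILED)
        munmap(view, (size_t)page);
    if (alias != MAP_FAILED)
        munmap(alias, (size_t)page);
    fclose(f);
    return result;
}
```

Passing PROT_READ | PROT_EXEC as view_prot comes closer to the setup
described above, but whether an executable mapping of a temporary file
is permitted depends on how the system is configured.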

Linus
Linus Torvalds
2012-01-18 06:25:32 UTC
Permalink
On Tue, Jan 17, 2012 at 9:23 PM, Linus Torvalds
    lea 1f,%edx
    movl $SYSCALL,%eax
    movl $-1,4096(%edx)
1:
    int 0x80
and what happens is that the move that *overwrites* the int 0x80 will
not be noticed by the I$ coherency because it's at another address,
but by the time you read at $pc-2, you'll get -1, not "int 0x80"
Btw, that I$ coherency comment is not technically the correct explanation.

The I$ coherency isn't the problem, the problem is that the pipeline
has already fetched the "int 0x80" before the write happens. And the
write - because it's not to the same linear address as the code fetch
- won't trigger the internal "pipeline flush on write to code stream".
So the D$ (and I$) will have the -1 in it, but the instruction fetch
will have walked ahead and seen the "int 80" that existed earlier, and
will execute it.

And the above depends very much on uarch details, so depending on
microarchitecture it may or may not work. But I think the "use a
different virtual address, but same physical address" thing will fake
out all modern x86 cpu's, and your 'ptrace' will see the -1, even
though the system call happened.

Anyway, the *kernel* knows, since the kernel will have seen which
entrypoint it comes through. So we can handle it in the kernel. But
no, you cannot currently securely/reliably use $pc-2 in gdb or ptrace
to determine how the system call was made, afaik.

Of course, limiting things so that you cannot map the same page
executably *and* writably is one solution - and a good idea regardless
- so secure environments can still exist. But even then you could have
races in a multi-threaded environment (they'd just be *much* harder to
trigger for an attacker).

Linus
Indan Zupancic
2012-01-18 13:12:34 UTC
Permalink
Post by Linus Torvalds
On Tue, Jan 17, 2012 at 9:23 PM, Linus Torvalds
Post by Linus Torvalds
    lea 1f,%edx
    movl $SYSCALL,%eax
    movl $-1,4096(%edx)
1:
    int 0x80
and what happens is that the move that *overwrites* the int 0x80 will
not be noticed by the I$ coherency because it's at another address,
but by the time you read at $pc-2, you'll get -1, not "int 0x80"
Oh jolly. I feared something like that might have been possible.
Post by Linus Torvalds
Btw, that's I$ coherency comment is not technically the correct explanation.
The I$ coherency isn't the problem, the problem is that the pipeline
has already fetched the "int 0x80" before the write happens. And the
write - because it's not to the same linear address as the code fetch
- won't trigger the internal "pipeline flush on write to code stream".
So the D$ (and I$) will have the -1 in it, but the instruction fetch
will have walked ahead and seen the "int 80" that existed earlier, and
will execute it.
And the above depends very much on uarch details, so depending on
microarchitecture it may or may not work. But I think the "use a
different virtual address, but same physical address" thing will fake
out all modern x86 cpu's, and your 'ptrace' will see the -1, even
though the system call happened.
Anyway, the *kernel* knows, since the kernel will have seen which
entrypoint it comes through. So we can handle it in the kernel. But
no, you cannot currently securely/reliably use $pc-2 in gdb or ptrace
to determine how the system call was made, afaik.
So there is this gap and there is no good way to handle it at all for
user space? And even if it's fixed in the kernel, that won't help with
older kernels, so it will stay a problem for a while.

Can this int 0x80 trick be blocked for ptraced task (preferably always),
pretty please?
Post by Linus Torvalds
Of course, limiting things so that you cannot map the same page
executably *and* writably is one solution - and a good idea regardless
- so secure environments can still exist.
We got the infrastructure in place to do that, though it would be a hassle.
But browsing around in /proc/$PID/maps, it seems w+x mappings are very
common, and we want to jail normal programs, so that seems a bit of a
problem. We could disallow system calls coming from such double mapped
memory, instead of disallowing such mappings altogether.

We'd either need to keep track of all mappings or scan /proc/$PID/maps.
Because that is a pain, we need to cache the results and invalidate or
update the cache after each new writeable mapping.

Doable, but starting to look silly and fragile.
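
To make the bookkeeping concrete, here is a minimal sketch of the
per-line check such a scan would need. Everything named here (struct
map_info, parse_maps_line, syscall_site_is_suspect) is hypothetical, and
the policy of distrusting any writable or shared mapping is an
assumption for illustration, not what any existing jailer does. It only
relies on the usual "start-end perms ..." layout of /proc/$PID/maps.

```c
#include <stdio.h>
#include <string.h>

struct map_info {
    unsigned long long start;   /* first address of the mapping */
    unsigned long long end;     /* one past the last address */
    int writable;
    int executable;
    int shared;                 /* 's' in the fourth permission column */
};

/* Parse one /proc/PID/maps line; return 0 on success, -1 on error. */
int parse_maps_line(const char *line, struct map_info *out)
{
    char perms[5] = "";

    if (sscanf(line, "%llx-%llx %4s", &out->start, &out->end, perms) != 3)
        return -1;
    if (strlen(perms) != 4)
        return -1;

    out->writable = (perms[1] == 'w');
    out->executable = (perms[2] == 'x');
    out->shared = (perms[3] == 's');
    return 0;
}

/*
 * Assumed policy: a syscall site is suspect if pc lies in a mapping
 * that is writable, or shared (and so possibly writable elsewhere).
 */
int syscall_site_is_suspect(const struct map_info *m, unsigned long long pc)
{
    if (pc < m->start || pc >= m->end)
        return 0;               /* pc is not inside this mapping */
    return m->writable || m->shared;
}
```

A real tracer would still have to re-read or invalidate this data after
every mmap(), mprotect() or munmap() in the jail, which is the fragile
part.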

I suppose restarting the system call would avoid same-task tricks,
but doesn't solve the other-task-having-a-writeable-mapping problem.
Post by Linus Torvalds
But even then you could have
races in a multi-threaded environment (they'd just be *much* harder to
trigger for an attacker).
All hostile threads are either jailed or running as a different user,
so at least the mapping checks can be done race-free. Syscall from
unknown mappings can be disallowed.

I hope there is a really dirty trick that works reliable to find a very
subtle difference between system call entered via 'syscall' or 'int 0x80'.

At this point it starts to look attractive to only allow system calls
coming from vdso and protecting the vdso mapping (or is that done by
the kernel already?) System calls coming from elsewhere can be
restarted at the vdso (need to fix up EIP post-syscall then too.)
All in all something like this seems the simplest and most practical
solution to me.

Anyone got any better idea?

Greetings,

Indan


Eric Paris
2012-01-18 15:04:43 UTC
Permalink
Post by Linus Torvalds
Of course, limiting things so that you cannot map the same page
executably *and* writably is one solution - and a good idea regardless
- so secure environments can still exist. But even then you could have
races in a multi-threaded environment (they'd just be *much* harder to
trigger for an attacker).
Gratuitous SELinux for the win e-mail! (Feel free to delete now) We
typically, for all confined domains, do not allow mapping anonymous
memory both W and X. Actually you can't even map it W and then map it
X...

Now if there is file which you have both W and X SELinux permissions
(which is rare, but not impossible) you could map it in two places. So
we can (and do) build SELinux sandboxes which address this.

Linus Torvalds
2012-01-18 17:51:57 UTC
Permalink
Post by Eric Paris
Gratuitous SELinux for the win e-mail! (Feel free to delete now) We
typically, for all confined domains, do not allow mapping anonymous
memory both W and X. Actually you can't even map it W and then map it
X...
That doesn't help.

Anonymous memory is the *one* kind of mapping that this cannot happen
for - because then you have the same page mapped only at one
particular virtual address (and all modern x86's are entirely coherent
in the pipeline for that case, afaik).
Post by Eric Paris
Now if there is file which you have both W and X SELinux permissions
(which is rare, but not impossible) you could map it in two places. So
we can (and do) build SELinux sandboxes which address this.
So the cases that matter are file-backed and various shared memory setups.

Linus
Chris Evans
2012-01-18 05:43:54 UTC
Permalink
Post by Indan Zupancic
Post by Andi Kleen
Post by Jamie Lokier
I'm pretty sure this isn't about changing cs or far jumps
He's assuming that code can only run on two code segments and
not arbitrarily switch between them which is a completely incorrect
assumption.
All I assumed up to now was that cs shows the current mode of the process,
and that that defines which system call path is taken. Apparently that is
not true and int 0x80 forces the compat system call path.
Looking at EIP - 2 seems like a secure way to check how we entered the kernel.
For 64-bit processes, you need to look at that (hard due to races) and
_also_ CS.
At least that was the state the last time I played with this in
earnest: http://scary.beasts.org/security/CESA-2009-001.html

I see Linus posted one of the race conditions that "EIP - 2" is
vulnerable to. You can start to chip away at the problem by making
sure your policy doesn't allow mmap() or mprotect() with PROT_EXEC (or
MAP_SHARED) but it's a long battle.
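
As a rough sketch of that first step, a tracer's check on the arguments
of an intercepted mmap() or mprotect() might look like the following.
The function name mapping_request_allowed is hypothetical and the policy
is deliberately crude: it refuses every executable or shared mapping,
which would also refuse loading shared libraries.

```c
#include <sys/mman.h>

/*
 * Return 1 if the requested protection and flags are acceptable.
 * For mprotect(), which has no flags argument, pass flags = 0.
 */
int mapping_request_allowed(unsigned long prot, unsigned long flags)
{
    if (prot & PROT_EXEC)
        return 0;   /* no new executable memory */
    if (flags & MAP_SHARED)
        return 0;   /* no shared mappings that could alias a page */
    return 1;
}
```

This covers only the argument check; the races around reading those
arguments and acting on them are a separate problem.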
Post by Indan Zupancic
Post by Andi Kleen
Post by Jamie Lokier
I think Indan means code is running with 64-bit cs, but the kernel
treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
and there's no way for the ptracer to know which syscall the kernel
will perform, even by looking at all registers.
Yes, that's what I meant.
Post by Andi Kleen
Post by Jamie Lokier
It looks like a hole in ptrace which could be fixed.
Possibly, but anything that bases its security on ptrace is typically
unfixable racy (just think what happens with multiple threads
and syscall arguments), so it's unlikely to do any good.
As far as I know, we fixed all races except symlink races caused by malicious
code outside the jail.
Are you sure? I've remembered possibly the worst one I encountered,
since my previous e-mail to Jamie:

1) Tracee is compromised; executes fork() which is syscall that isn't allowed
2) Tracee traps
2b) Tracee could take a SIGKILL here
3) Tracer looks at registers; bad syscall
3b) Or tracee could take a SIGKILL here
4) The only way to stop the bad syscall from executing is to rewrite
orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
syscall has finished)
5) Disaster: the tracee took a SIGKILL so any attempt to address it by
pid (such as PTRACE_SETREGS) fails.
6) Syscall fork() executes; possible unsupervised process now running
since the tracer wasn't expecting the fork() to be allowed.


All this ptrace() security headache is why vsftpd is waiting for
Will's seccomp enhancements to hit the kernel. Then they will be used
pronto.


Cheers
Chris
Post by Indan Zupancic
Those are controllable by limiting what filesystem access
the prisoners get. A special open() flag which causes open to fail when a part
of the path is a symlink with a distinguishable error code would solve this for
us.
Other than that and the abysmal performance, ptrace is fine for jailing.
Greetings,
Indan
Indan Zupancic
2012-01-18 12:12:42 UTC
Permalink
Post by Chris Evans
Post by Indan Zupancic
As far as I know, we fixed all races except symlink races caused by malicious
code outside the jail.
Are you sure? I've remembered possibly the worst one I encountered,
1) Tracee is compromised; executes fork() which is syscall that isn't allowed
How do you mean compromised? Tracees aren't trusted by definition. And fork is
allowed in our jail, we're ptracing all tasks within the jail.
Post by Chris Evans
2) Tracee traps
2b) Tracee could take a SIGKILL here
3) Tracer looks at registers; bad syscall
3b) Or tracee could take a SIGKILL here
4) The only way to stop the bad syscall from executing is to rewrite
orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
syscall has finished)
Yes, we rewrite it to -1.
Post by Chris Evans
5) Disaster: the tracee took a SIGKILL so any attempt to address it by
pid (such as PTRACE_SETREGS) fails.
I assume that if a task can execute system calls and we get ptrace events
for that, that we can do other ptrace operations too. Are you saying that
the kernel has this ptrace gap between SIGKILL and task exit where ptrace
doesn't work but the task continues executing system calls? That would be
a huge bug, but it seems very unlikely too, as the task is stopped and
shouldn't be able to disappear till it is continued by the tracer.

I mean, really? That would be stupid.

If true we have to work around it by disallowing SIGKILL and just sending
them ourselves within the jail. Meh.
Post by Chris Evans
6) Syscall fork() executes; possible unsupervised process now running
since the tracer wasn't expecting the fork() to be allowed.
We use PTRACE_O_TRACEFORK (or replace it with clone and set CLONE_PTRACE
for 2.4 kernels. Yes, I check for CLONE_UNTRACED in clone calls.)
Post by Chris Evans
All this ptrace() security headache is why vsftpd is waiting for
Will's seccomp enhancements to hit the kernel. Then they will be used
pronto.
How will you avoid file path races with BPF?

Greetings,

Indan


Oleg Nesterov
2012-01-18 17:00:06 UTC
Permalink
Post by Chris Evans
1) Tracee is compromised; executes fork() which is syscall that isn't allowed
2) Tracee traps
2b) Tracee could take a SIGKILL here
3) Tracer looks at registers; bad syscall
3b) Or tracee could take a SIGKILL here
4) The only way to stop the bad syscall from executing is to rewrite
orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
syscall has finished)
5) Disaster: the tracee took a SIGKILL so any attempt to address it by
pid (such as PTRACE_SETREGS) fails.
6) Syscall fork() executes; possible unsupervised process now running
since the tracer wasn't expecting the fork() to be allowed.
As for fork() in particular, it can't succeed after SIGKILL.

But I agree, probably it makes sense to change ptrace_stop() to check
fatal_signal_pending() and do do_group_exit(SIGKILL) after it sleeps
in TASK_TRACED. Or we can change tracehook_report_syscall_entry()

- return 0;
+ return !fatal_signal_pending();

(no, I do not literally mean the change above)

Not only for security. The current behaviour sometime confuses the
users. Debugger sends SIGKILL to the tracee and assumes it should
die asap, but the tracee exits only after syscall.

Oleg.

Oleg Nesterov
2012-01-18 17:12:10 UTC
Permalink
Post by Oleg Nesterov
Post by Chris Evans
1) Tracee is compromised; executes fork() which is syscall that isn't allowed
2) Tracee traps
2b) Tracee could take a SIGKILL here
3) Tracer looks at registers; bad syscall
3b) Or tracee could take a SIGKILL here
4) The only way to stop the bad syscall from executing is to rewrite
orig_eax (PTRACE_CONT + SIGKILL only kills the process after the
syscall has finished)
5) Disaster: the tracee took a SIGKILL so any attempt to address it by
pid (such as PTRACE_SETREGS) fails.
6) Syscall fork() executes; possible unsupervised process now running
since the tracer wasn't expecting the fork() to be allowed.
As for fork() in particular, it can't succeed after SIGKILL.
But I agree, probably it makes sense to change ptrace_stop() to check
fatal_signal_pending() and do do_group_exit(SIGKILL) after it sleeps
in TASK_TRACED. Or we can change tracehook_report_syscall_entry()
- return 0;
+ return !fatal_signal_pending();
(no, I do not literally mean the change above)
Not only for security. The current behaviour sometime confuses the
users. Debugger sends SIGKILL to the tracee and assumes it should
die asap, but the tracee exits only after syscall.
Something like the patch below.

Oleg.

--- x/include/linux/tracehook.h
+++ x/include/linux/tracehook.h
@@ -54,12 +54,12 @@ struct linux_binprm;
 /*
  * ptrace report for syscall entry and exit looks identical.
  */
-static inline void ptrace_report_syscall(struct pt_regs *regs)
+static inline int ptrace_report_syscall(struct pt_regs *regs)
 {
 	int ptrace = current->ptrace;
 
 	if (!(ptrace & PT_PTRACED))
-		return;
+		return 0;
 
 	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
 
@@ -72,6 +72,8 @@ static inline void ptrace_report_syscall
 		send_sig(current->exit_code, current, 1);
 		current->exit_code = 0;
 	}
+
+	return fatal_signal_pending(current);
 }
 
 /**
@@ -96,8 +98,7 @@ static inline void ptrace_report_syscall
 static inline __must_check int tracehook_report_syscall_entry(
 	struct pt_regs *regs)
 {
-	ptrace_report_syscall(regs);
-	return 0;
+	return ptrace_report_syscall(regs);
 }
 
 /**

Linus Torvalds
2012-01-18 02:27:19 UTC
Permalink
Post by Jamie Lokier
I think Indan means code is running with 64-bit cs, but the kernel
treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
and there's no way for the ptracer to know which syscall the kernel
will perform, even by looking at all registers. It looks like a hole
in ptrace which could be fixed.
We could possibly munge the "orig_ax" field to be different for the
int80 vs syscall cases. That's really the only field that isn't direct
x86 state. And it's 64 bits wide, but we really only care about the
low 32 bits in the kernel. So a bit in the high bits that says "this
was a int80 entry" would be possible.
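
From the tracer's side, decoding such a flag could look roughly like the
sketch below. To be clear, this is purely hypothetical: no kernel
defines such a bit, bit 32 is an arbitrary choice made here, and the
names are invented for illustration. The one thing taken as given is
that orig_ax reads as -1 when the stop is not a system call, so that
value has to be excluded before testing any high bit.

```c
#include <stdint.h>

/* HYPOTHETICAL flag: not a real ABI, bit position chosen arbitrarily. */
#define HYPOTHETICAL_ORIG_AX_INT80 (UINT64_C(1) << 32)

/* orig_ax is all ones (-1) when the tracee is not in a system call. */
int is_syscall_stop(uint64_t orig_ax)
{
    return orig_ax != UINT64_MAX;
}

/* Would report an int 0x80 entry if the kernel set the assumed bit. */
int hypothetical_entered_via_int80(uint64_t orig_ax)
{
    return is_syscall_stop(orig_ax) &&
           (orig_ax & HYPOTHETICAL_ORIG_AX_INT80) != 0;
}

/* The kernel only cares about the low 32 bits for the syscall number. */
uint32_t syscall_number(uint64_t orig_ax)
{
    return (uint32_t)orig_ax;
}
```

A tracer that compares the whole 64-bit orig_ax against a syscall
number, rather than masking it first, would stop matching once such a
bit were set.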

Linus
Andi Kleen
2012-01-18 02:31:14 UTC
Permalink
Post by Linus Torvalds
Post by Jamie Lokier
I think Indan means code is running with 64-bit cs, but the kernel
treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
and there's no way for the ptracer to know which syscall the kernel
will perform, even by looking at all registers. It looks like a hole
in ptrace which could be fixed.
We could possibly munge the "orig_ax" field to be different for the
int80 vs syscall cases. That's really the only field that isn't direct
x86 state. And it's 64 bits wide, but we really only care about the
low 32 bits in the kernel. So a bit in the high bits that says "this
was a int80 entry" would be possible.
That would be incompatible. However you could just add another virtual
register with such information (in fact I thought about that
when I did the compat code originally). However I don't think it'll salvage
the original broken by design ptrace jailer. And everyone else
so far has done fine without it.

-Andi
Linus Torvalds
2012-01-18 02:46:11 UTC
Permalink
Post by Andi Kleen
That would be incompatible.
No it wouldn't.

We'd only do it for the case that everybody gets wrong: int80 from a
64-bit context.

All the other cases are trivial to see (look at CS to determine 32-bit
vs 64-bit system call) and are the common case.
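
For reference, the CS check for those common cases can be sketched as
follows. The selector values 0x33 and 0x23 are the user code segments a
native x86-64 Linux kernel uses (__USER_CS and __USER32_CS); the enum
and function names are made up here. Any other selector, for example one
installed through the LDT, is reported as unknown, and as discussed
above the result says nothing about int 0x80 issued from 64-bit code.

```c
/* User code segment selectors on a native x86-64 Linux kernel. */
#define USER_CS_64 0x33UL   /* __USER_CS */
#define USER_CS_32 0x23UL   /* __USER32_CS */

enum abi_guess { ABI_UNKNOWN = 0, ABI_32BIT = 32, ABI_64BIT = 64 };

/* Guess the execution mode from the cs value read via ptrace. */
enum abi_guess abi_from_cs(unsigned long cs)
{
    switch (cs) {
    case USER_CS_64:
        return ABI_64BIT;
    case USER_CS_32:
        return ABI_32BIT;
    default:
        return ABI_UNKNOWN; /* e.g. an LDT segment; do not guess */
    }
}
```

A jailer would presumably want to treat ABI_UNKNOWN as a reason to deny
the call rather than to fall back to one of the two tables.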

So the one new "incompatible" bit case would be the case that existing
users would inevitably get wrong, so it can hardly be "incompatible".

Linus
Martin Mares
2012-01-18 14:06:12 UTC
Permalink
Hello!
Post by Linus Torvalds
Post by Andi Kleen
That would be incompatible.
No it wouldn't.
We'd only do it for the case that everybody gets wrong: int80 from a
64-bit context.
Not everybody. There are programs which try hard to distinguish between
int80 and syscall. One such example is a sandbox for programming contests
I wrote several years ago. It analyses the instruction before EIP and as
it does not allow threads nor executing writeable memory, it should be
correct.

The change you propose would break it. It is not a huge deal, I can fix it
in a minute, but I suspect there are other such pieces of code in the wild.

However, having TS_COMPAT available through ptrace would be great and I do not
see any other nice way how to export it to userspace, so maybe breaking the
ABI in this case is acceptable.

Have a nice fortnight
--
Martin `MJ' Mares <***@ucw.cz> http://mj.ucw.cz/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
Anything is good and useful if it's made of chocolate.
Will Drewry
2012-01-17 19:35:59 UTC
Permalink
Post by Oleg Nesterov
Post by Will Drewry
Yes, thanks, I forgot about compat tasks again. But this is easy, just
we need regs_64_to_32().
Yup - we could make the assumption that is_compat_task is always
32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
regs_64_to_32. Seems kinda wonky though :/
much simpler/faster than what regset does to create the artificial
user_regs_struct32.
True, I could collapse pt_regs to looks like the exported ABI pt_regs.
Then only compat processes would get the copy overhead. That could
be tidy and not break ABI. It would mean that I have to assume that
if unsigned long == 64-bit and is_compat_task(), then the task is
32-bit. Do you think if we ever add a crazy 128-bit "supercomputer"
arch that we will add a is_compat64_task() so that I could properly
collapse? :)
I like this idea!
Ouch, so a few issues:
- pt_regs isn't exported for most arches
- is_compat_task arches would need custom fixups

I think Indan takes this round :) I'll begin integrating a
syscall_get_arguments approach. Hopefully it can be quite efficient.

cheers!
will
Andrew Lutomirski
2012-01-12 17:02:49 UTC
Permalink
This patch adds support for seccomp mode 2. This mode enables dynamic
enforcement of system call filtering policy in the kernel as specified
by a userland task. The policy is expressed in terms of a BPF program,
as is used for userland-exposed socket filtering. Instead of network
data, the BPF program is evaluated over struct user_regs_struct at the
time of the system call (as retrieved using regviews).
There's some seccomp-related code in the vsyscall emulation path in
arch/x86/kernel/vsyscall_64.c. How should time(), getcpu(), and
gettimeofday() be handled? If you want filtering to work, there
aren't any real syscall registers to inspect, but they could be
synthesized.

Preventing a malicious task from figuring out approximately what time
it is is basically impossible because of the way that vvars work. I
don't know how to change that efficiently.

--Andy
Will Drewry
2012-01-16 20:28:25 UTC
Permalink
This patch adds support for seccomp mode 2. This mode enables dynamic
enforcement of system call filtering policy in the kernel as specified
by a userland task. The policy is expressed in terms of a BPF program,
as is used for userland-exposed socket filtering. Instead of network
data, the BPF program is evaluated over struct user_regs_struct at the
time of the system call (as retrieved using regviews).
There's some seccomp-related code in the vsyscall emulation path in
arch/x86/kernel/vsyscall_64.c. How should time(), getcpu(), and
gettimeofday() be handled?
Nice catch:
lxr.linux.no/linux+v3.2.1/arch/x86/kernel/vsyscall_64.c#L180
I'd missed it.
If you want filtering to work, there
aren't any real syscall registers to inspect, but they could be
synthesized.
Hrm, I wonder if making sure orig_eax is populated with the
vsyscall_nr would be enough. Unless I'm misreading, args 0 and 1 are
correct, so there may be other noise, but performing a call to
__secure_computing() (either in the case or with a pre-validate
syscall nr: 0-2) should send the do_exit. Does that sound reasonable?

I'll try to do the right thing in my next patch set.
Preventing a malicious task from figuring out approximately what time
it is is basically impossible because of the way that vvars work. I
don't know how to change that efficiently.
There are other ways to guess the time too, so I don't think it's that
bad. For those that are really worried, they could disable or
otherwise attempt to limit vsyscall access from their sandbox.

thanks!
will
Jonathan Corbet
2012-01-11 20:03:49 UTC
Permalink
Interesting approach to the problem, I think I like it. Watch for news at
11...:)
Post by Will Drewry
+Example
+-------
+
+Assume a process would like to cleanly read and write to stdin/out/err and exit
+
It seems like this little program belongs in the samples/ directory.

Thanks,

jon
Will Drewry
2012-01-11 20:10:19 UTC
Permalink
Interesting approach to the problem, I think I like it. Watch for news at
11...:)
Thanks - I'm glad to hear it!
Post by Will Drewry
+Example
+-------
+
+Assume a process would like to cleanly read and write to stdin/out/err and exit
+cleanly. Without using a BPF compiler, it may be done as follows
+
It seems like this little program belongs in the samples/ directory.
Cool - I'll do that and rev this patch.

cheers!
will
Will Drewry
2012-01-11 23:19:43 UTC
Permalink
Document how system call filtering with BPF works and
may be used. Includes an example for x86 (32-bit).

Signed-off-by: Will Drewry <***@chromium.org>
---
Documentation/prctl/seccomp_filter.txt | 99 ++++++++++++++++++++++++++++++++
samples/Makefile | 2 +-
samples/seccomp/Makefile | 12 ++++
samples/seccomp/bpf-example.c | 74 ++++++++++++++++++++++++
4 files changed, 186 insertions(+), 1 deletions(-)
create mode 100644 Documentation/prctl/seccomp_filter.txt
create mode 100644 samples/seccomp/Makefile
create mode 100644 samples/seccomp/bpf-example.c

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..15d4645
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,99 @@
+ Seccomp filtering
+ =================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated. A
+certain subset of userland applications benefit by having a reduced set
+of available system calls. The resulting set reduces the total kernel
+surface exposed to the application. System call filtering is meant for
+use with those applications.
+
+Seccomp filtering provides a means for a process to specify a filter
+for incoming system calls. The filter is expressed as a Berkeley Packet
+Filter program, as with socket filters, except that the data operated on
+is the current user_regs_struct. This allows for expressive filtering
+of system calls using the pre-existing system call ABI and using a filter
+program language with a long history of being exposed to userland.
+Additionally, BPF makes it impossible for users of seccomp to fall prey to
+time-of-check-time-of-use (TOCTOU) attacks that are common in system call
+interposition frameworks because the evaluated data is solely register state
+just after system call entry.
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox. It provides a clearly defined
+mechanism for minimizing the exposed kernel surface. Beyond that,
+policy for logical behavior and information flow should be managed with
+a combination of other system hardening techniques and, potentially, an
+LSM of your choosing. Expressive, dynamic filters provide further options down
+this path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance) which could be
+construed, incorrectly, as a more complete sandboxing solution.
+
+Usage
+-----
+
+An additional seccomp mode is added, but it is not set directly by the
+consuming process. The new mode, '2', is only available if
+CONFIG_SECCOMP_FILTER is set, and it is enabled using prctl with the
+PR_ATTACH_SECCOMP_FILTER argument.
+
+Interacting with seccomp filters is done using one prctl(2) call.
+
+PR_ATTACH_SECCOMP_FILTER:
+ Allows the specification of a new filter using a BPF program.
+ The BPF program will be executed over a user_regs_struct data
+ reflecting system call time except with the system call number
+ resident in orig_[register]. To allow a system call, the size
+ of the data must be returned. At present, all other return values
+ result in the system call being blocked, but it is recommended to
+ return 0 in those cases. This will allow for future custom return
+ values to be introduced, if ever desired.
+
+ Usage:
+ prctl(PR_ATTACH_SECCOMP_FILTER, prog);
+
+ The 'prog' argument is a pointer to a struct sock_fprog which will
+ contain the filter program. If the program is invalid, the call
+ will return -1 and set errno to EINVAL.
+
+ The struct user_regs_struct the @prog will see is based on the
+ personality of the task at the time of this prctl call. Additionally,
+ is_compat_task is also tracked for the @prog. This means that once set
+ the calling task will have all of its system calls blocked if it
+ switches its system call ABI (via personality or other means).
+
+ If the @prog is installed while the task has CAP_SYS_ADMIN in its user
+ namespace, the @prog will be marked as inheritable across execve. Any
+ inherited filters are still subject to the system call ABI constraints
+ above and any ABI mismatched system calls will result in process death.
+
+ Additionally, if prctl(2) is allowed by the attached filter,
+ additional filters may be layered on which will increase evaluation
+ time, but allow for further decreasing the attack surface during
+ execution of a process.
+
+The above call returns 0 on success and non-zero on error.
+
+Example
+-------
+
+samples/seccomp/bpf-example.c shows an example process that allows read from stdin,
+write to stdout/err, exit and signal returns for 32-bit x86.
+
+Caveats
+-------
+
+- execve will fail unless the most recently attached filter was installed by
+ a process with CAP_SYS_ADMIN (in its namespace).
+
+Adding architecture support
+---------------------------
+
+Any platform with seccomp support will support seccomp filters
+as long as CONFIG_SECCOMP_FILTER is enabled.
diff --git a/samples/Makefile b/samples/Makefile
index 6280817..f29b19c 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,4 @@
# Makefile for Linux samples code

obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ tracepoints/ trace_events/ \
- hw_breakpoint/ kfifo/ kdb/ hidraw/
+ hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
new file mode 100644
index 0000000..80dc8e4
--- /dev/null
+++ b/samples/seccomp/Makefile
@@ -0,0 +1,12 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+# List of programs to build
+hostprogs-$(CONFIG_X86_32) := bpf-example
+bpf-example-objs := bpf-example.o
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_bpf-example.o += -I$(objtree)/usr/include -m32
+HOSTLOADLIBES_bpf-example += -m32
diff --git a/samples/seccomp/bpf-example.c b/samples/seccomp/bpf-example.c
new file mode 100644
index 0000000..f98b70a
--- /dev/null
+++ b/samples/seccomp/bpf-example.c
@@ -0,0 +1,74 @@
+/*
+ * Seccomp BPF example
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromium-os-***@chromium.org>
+ * Author: Will Drewry <***@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <asm/unistd.h>
+#include <linux/filter.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <sys/prctl.h>
+#include <sys/user.h>
+#include <unistd.h>
+
+#ifndef PR_ATTACH_SECCOMP_FILTER
+# define PR_ATTACH_SECCOMP_FILTER 36
+#endif
+
+#define regoffset(_reg) (offsetof(struct user_regs_struct, _reg))
+static int install_filter(void)
+{
+ struct sock_filter filter[] = {
+ /* Grab the system call number */
+ BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(orig_eax)),
+ /* Jump table for the allowed syscalls */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 6),
+
+ /* Check that read is only using stdin. */
+ BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 4),
+
+ /* Check that write is only using stdout/stderr */
+ BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 1),
+
+ /* Put the "accept" value in A */
+ BPF_STMT(BPF_LD+BPF_W+BPF_LEN, 0),
+
+ BPF_STMT(BPF_RET+BPF_A,0),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ .filter = filter,
+ };
+ if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
+ perror("prctl");
+ return 1;
+ }
+ return 0;
+}
+
+#define payload(_c) _c, sizeof(_c)
+int main(int argc, char **argv) {
+ char buf[4096];
+ ssize_t bytes = 0;
+ if (install_filter())
+ return 1;
+ syscall(__NR_write, STDOUT_FILENO, payload("OHAI! WHAT IS YOUR NAME? "));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+ syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+ syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+ return 0;
+}
--
1.7.5.4
Will Drewry
2012-01-12 00:29:29 UTC
Permalink
Hrm, I may need to guard sample compilation based on host arch and not
just target arch. Documentation v3 will be on the way once I have that
behaving properly. :/

Sorry!
will
Randy Dunlap
2012-01-12 18:16:46 UTC
Permalink
Post by Will Drewry
Document how system call filtering with BPF works and
may be used. Includes an example for x86 (32-bit).
Please tell some of us what "BPF" means. wikipedia lists 15 possible
choices, but I don't know which one to choose.
--
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
Will Drewry
2012-01-12 17:23:53 UTC
Permalink
Post by Will Drewry
Document how system call filtering with BPF works and
may be used. Includes an example for x86 (32-bit).
Please tell some of us what "BPF" means. wikipedia lists 15 possible
choices, but I don't know which one to choose.
I'll make it clearer in the documentation file and update the patch description.

BPF == Berkeley Packet Filters which are implemented in Linux Socket
Filters (LSF).

thanks!
Steven Rostedt
2012-01-12 17:34:42 UTC
Permalink
Post by Randy Dunlap
Please tell some of us what "BPF" means. wikipedia lists 15 possible
choices, but I don't know which one to choose.
I'll make it clearer in the documentation file and update the patch description.
BPF == Berkeley Packet Filters which are implemented in Linux Socket
Filters (LSF).
I admit, I was totally clueless in what it meant too ;)

Even the LWN article didn't explain (shame on you Jon).

"he has repurposed the networking layer's packet filtering mechanism
(BPF)"

I didn't know what the "B" stood for.

-- Steve


Łukasz Sowa
2012-01-12 13:13:00 UTC
Permalink
Hi Will,

That's a very different approach to the system call interposition problem.
I find your solution very interesting. It gives far more capabilities
than my syscalls cgroup that you commented on some time ago. It's ready
now but I haven't tried filtering yet. I think that if your solution
makes it to the mainline (and I guess that's really possible at the current
stage :)), there will be no place for my solution but that's ok.

There's one thing that I'm curious about - have you measured overhead in
any way? That was one of the biggest issues in all previous attempts to
limit syscalls. I'd love to compare the numbers with my solution.

I'll examine your patch later on and put some comments if I bump into
something.

Best Regards,
Lukasz Sowa

Will Drewry
2012-01-12 17:25:22 UTC
Permalink
Post by Łukasz Sowa
Hi Will,
That's a very different approach to the system call interposition problem.
I find your solution very interesting. It gives far more capabilities
than my syscalls cgroup that you commented on some time ago. It's ready
now but I haven't tried filtering yet. I think that if your solution
makes it to the mainline (and I guess that's really possible at the current
stage :)), there will be no place for my solution but that's ok.
Yeah - there've been so many tries, I'll be happy when one makes it in
which is usable :)
Post by Łukasz Sowa
There's one thing that I'm curious about - have you measured overhead in
any way? That was one of the biggest issues in all previous attempts to
limit syscalls. I'd love to compare the numbers with my solution.
Certainly. I have some rough numbers, but nothing I'd call strong
measurements. There is still a fair amount of cost due to the syscall
slow path.
Post by Łukasz Sowa
I'll examine your patch later on and put some comments if I bump into
something.
Much appreciated - cheers!
will