This reduces the memory footprint and it enforces that only the current
task can enable seccomp on itself (this is a requirement for a
strightforward [modulo preempt ;) ] TIF_NOTSC implementation).
Signed-off-by: Andrea Arcangeli <andrea@cpushare.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Basically, it will allow a process to unshare its user_struct table,
resetting at the same time its own user_struct and all the associated
accounting.
A new root user (uid == 0) is added to the user namespace upon creation.
Such root users have full privileges and it seems that theses privileges
should be controlled through some means (process capabilities ?)
The unshare is not included in this patch.
Changes since [try #4]:
- Updated get_user_ns and put_user_ns to accept NULL, and
get_user_ns to return the namespace.
Changes since [try #3]:
- moved struct user_namespace to files user_namespace.{c,h}
Changes since [try #2]:
- removed struct user_namespace* argument from find_user()
Changes since [try #1]:
- removed struct user_namespace* argument from find_user()
- added a root_user per user namespace
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Pavel Emelianov <xemul@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Andrew Morgan <agm@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
attach_pid() currently takes a pid_t and then uses find_pid() to find the
corresponding struct pid. Sometimes we already have the struct pid. We can
then skip find_pid() if attach_pid() were to take a struct pid parameter.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: <containers@lists.osdl.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Switch to the defines for these two checks, instead of hard coding the
values.
[akpm@linux-foundation.org: add missing include]
Signed-off-by: Daniel Walker <dwalker@mvista.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If CONFIG_TASK_IO_ACCOUNTING is defined, we update io accounting counters for
each task.
This patch permits reporting of values using the well known getrusage()
syscall, filling ru_inblock and ru_oublock instead of null values.
As TASK_IO_ACCOUNTING currently counts bytes counts, we approximate blocks
count doing : nr_blocks = nr_bytes / 512
Example of use :
----------------------
After patch is applied, /usr/bin/time command can now give a good
approximation of IO that the process had to do.
$ /usr/bin/time grep tototo /usr/include/*
Command exited with non-zero status 1
0.00user 0.02system 0:02.11elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
24288inputs+0outputs (0major+259minor)pagefaults 0swaps
$ /usr/bin/time dd if=/dev/zero of=/tmp/testfile count=1000
1000+0 enregistrements lus
1000+0 enregistrements écrits
512000 octets (512 kB) copiés, 0,00326601 seconde, 157 MB/s
0.00user 0.00system 0:00.00elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+3000outputs (0major+299minor)pagefaults 0swaps
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
sound: convert "sound" subdirectory to UTF-8
MAINTAINERS: Add cxacru website/mailing list
include files: convert "include" subdirectory to UTF-8
general: convert "kernel" subdirectory to UTF-8
documentation: convert the Documentation directory to UTF-8
Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
remove broken URLs from net drivers' output
Magic number prefix consistency change to Documentation/magic-number.txt
trivial: s/i_sem /i_mutex/
fix file specification in comments
drivers/base/platform.c: fix small typo in doc
misc doc and kconfig typos
Remove obsolete fat_cvf help text
Fix occurrences of "the the "
Fix minor typoes in kernel/module.c
Kconfig: Remove reference to external mqueue library
Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
Correct comments in genrtc.c to refer to correct /proc file.
Fix more "deprecated" spellos.
Fix "deprecated" typoes.
...
Fix trivial comment conflict in kernel/relay.c.
Since 2.6.18-something, the community has been bugged by the problem to
provide a clean and a stable mechanism to postpone a cpu-hotplug event as
lock_cpu_hotplug was badly broken.
This is another proposal towards solving that problem. This one is along the
lines of the solution provided in kernel/workqueue.c
Instead of having a global mechanism like lock_cpu_hotplug, we allow the
subsytems to define their own per-subsystem hot cpu mutexes. These would be
taken(released) where ever we are currently calling
lock_cpu_hotplug(unlock_cpu_hotplug).
Also, in the per-subsystem hotcpu callback function,we take this mutex before
we handle any pre-cpu-hotplug events and release it once we finish handling
the post-cpu-hotplug events. A standard means for doing this has been
provided in [PATCH 2/4] and demonstrated in [PATCH 3/4].
The ordering of these per-subsystem mutexes might still prove to be a
problem, but hopefully lockdep should help us get out of that muddle.
The patch set to be applied against linux-2.6.19-rc5 is as follows:
[PATCH 1/4] : Extend notifier_call_chain with an option to specify the
number of notifications to be sent and also count the
number of notifications actually sent.
[PATCH 2/4] : Define events CPU_LOCK_ACQUIRE and CPU_LOCK_RELEASE
and send out notifications for these in _cpu_up and
_cpu_down. This would help us standardise the acquire and
release of the subsystem locks in the hotcpu
callback functions of these subsystems.
[PATCH 3/4] : Eliminate lock_cpu_hotplug from kernel/sched.c.
[PATCH 4/4] : In workqueue_cpu_callback function, acquire(release) the
workqueue_mutex while handling
CPU_LOCK_ACQUIRE(CPU_LOCK_RELEASE).
If the per-subsystem-locking approach survives the test of time, we can expect
a slow phasing out of lock_cpu_hotplug, which has not yet been eliminated in
these patches :)
This patch:
Provide notifier_call_chain with an option to call only a specified number of
notifiers and also record the number of call to notifiers made.
The need for this enhancement was identified in the post entitled
"Slab - Eliminate lock_cpu_hotplug from slab"
(http://lkml.org/lkml/2006/10/28/92) by Ravikiran G Thirumalai and
Andrew Morton.
This patch adds two additional parameters to notifier_call_chain API namely
- int nr_to_calls : Number of notifier_functions to be called.
The don't care value is -1.
- unsigned int *nr_calls : Records the total number of notifier_funtions
called by notifier_call_chain. The don't care
value is NULL.
[michal.k.k.piotrowski@gmail.com: build fix]
Credit: Andrew Morton <akpm@osdl.org>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ With Johannes Berg <johannes@sipsolutions.net> ]
Separate the hibernation (aka suspend to disk code) from the other suspend
code. In particular:
* Remove the definitions related to hibernation from include/linux/pm.h
* Introduce struct hibernation_ops and a new hibernate() function to hibernate
the system, defined in include/linux/suspend.h
* Separate suspend code in kernel/power/main.c from hibernation-related code
in kernel/power/disk.c and kernel/power/user.c (with the help of
hibernation_ops)
* Switch ACPI (the only user of pm_ops.pm_disk_mode) to hibernation_ops
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert the "kernel" subdirectory of the tree to UTF-8. The only file
modified is <kernel/sys.c>.
Signed-off-by: John Anthony Kazos Jr. <jakj@j-a-k-j.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
As discovered here today, the change in Kernel 2.6.17 intended to inhibit
users from setting RLIMIT_CPU to 0 (as that is equivalent to unlimited) by
"cheating" and setting it to 1 in such a case, does not make a difference,
as the check is done in the wrong place (too late), and only applies to the
profiling code.
On all systems I checked running kernels above 2.6.17, no matter what the
hard and soft CPU time limits were before, a user could escape them by
issuing in the shell (sh/bash/zsh) "ulimit -t 0", and then the user's
process was not ever killed.
Attached is a trivial patch to fix that. Simply moving the check to a
slightly earlier location (specifically, before the line that actually
assigns the limit - *old_rlim = new_rlim), does the trick.
Do note that at least the zsh (but not ash, dash, or bash) shell has the
problem of "caching" the limits set by the ulimit command, so when running
zsh the fix will not immediately be evident - after entering "ulimit -t 0",
"ulimit -a" will show "-t: cpu time (seconds) 0", even though the actual
limit as returned by getrlimit(...) will be 1. It can be verified by
opening a subshell (which will not have the values of the parent shell in
cache) and checking in it, or just by running a CPU intensive command like
"echo '65536^1048576' | bc" and verifying that it dumps core after one
second.
Regardless of whether that is a misfeature in the shell, perhaps it would
be better to return -EINVAL from setrlimit in such a case instead of
cheating and setting to 1, as that does not really reflect the actual state
of the process anymore. I do not however know what the ground for that
decision was in the original 2.6.17 change, and whether there would be any
"backward" compatibility issues, so I preferred not to touch that right
now.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove software_suspend() and all its users since
pm_suspend(PM_SUSPEND_DISK) should be equivalent and there's no point in
having two interfaces for the same thing.
The patch also changes the valid_state function to return 0 (false) for
PM_SUSPEND_DISK when SOFTWARE_SUSPEND is not configured instead of
accepting it and having the whole thing fail later.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There isn't any real advantage to this change except that it allows the old
functions to be removed. Which is easier on maintenance and puts the code in
a more uniform style.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Of kernel subsystems that work with pids the tty layer is probably the largest
consumer. But it has the nice virtue that the assiation with a session only
lasts until the session leader exits. Which means that no reference counting
is required. So using struct pid winds up being a simple optimization to
avoid hash table lookups.
In the long term the use of pid_nr also ensures that when we have multiple pid
spaces mixed everything will work correctly.
Signed-off-by: Eric W. Biederman <eric@maxwell.lnxi.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A variety of (mostly) innocuous fixes to the embedded kernel-doc content in
source files, including:
* make multi-line initial descriptions single line
* denote some function names, constants and structs as such
* change erroneous opening '/*' to '/**' in a few places
* reword some text for clarity
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
while lock-profiling the -rt kernel i noticed weird contention during
mmap-intense workloads, and the tracer showed the following gem, in one
of our MM hotpaths:
threaded-2771 1.... 65us : sys_munmap (sysenter_do_call)
threaded-2771 1.... 66us : profile_munmap (sys_munmap)
threaded-2771 1.... 66us : blocking_notifier_call_chain (profile_munmap)
threaded-2771 1.... 66us : rt_down_read (blocking_notifier_call_chain)
ouch! a global rw-semaphore taken in one of the most performance-
sensitive codepaths of the kernel. And i dont even have oprofile
enabled! All distro kernels have CONFIG_PROFILING enabled, so this
scalability problem affects the majority of Linux users.
The fix is to enhance blocking_notifier_call_chain() to only take the
lock if there appears to be work on the call-chain.
With this patch applied i get nicely saturated system, and much higher
munmap performance, on SMP systems.
And as a bonus this also fixes a similar scalability bottleneck in the
thread-exit codepath: profile_task_exit() ...
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All tasks in the process group have the same sid, we don't need to iterate
them all to check that the caller of sys_setpgid() doesn't change its
session.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Replace occurences of task->signal->session by a new process_session() helper
routine.
It will be useful for pid namespaces to abstract the session pid number.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix the locking of signal->tty.
Use ->sighand->siglock to protect ->signal->tty; this lock is already used
by most other members of ->signal/->sighand. And unless we are 'current'
or the tasklist_lock is held we need ->siglock to access ->signal anyway.
(NOTE: sys_unshare() is broken wrt ->sighand locking rules)
Note that tty_mutex is held over tty destruction, so while holding
tty_mutex any tty pointer remains valid. Otherwise the lifetime of ttys
are governed by their open file handles. This leaves some holes for tty
access from signal->tty (or any other non file related tty access).
It solves the tty SLAB scribbles we were seeing.
(NOTE: the change from group_send_sig_info to __group_send_sig_info needs to
be examined by someone familiar with the security framework, I think
it is safe given the SEND_SIG_PRIV from other __group_send_sig_info
invocations)
[schwidefsky@de.ibm.com: 3270 fix]
[akpm@osdl.org: various post-viro fixes]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Alan Cox <alan@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pass the work_struct pointer to the work function rather than context data.
The work function can use container_of() to work out the data.
For the cases where the container of the work_struct may go away the moment the
pending bit is cleared, it is made possible to defer the release of the
structure by deferring the clearing of the pending bit.
To make this work, an extra flag is introduced into the management side of the
work_struct. This governs auto-release of the structure upon execution.
Ordinarily, the work queue executor would release the work_struct for further
scheduling or deallocation by clearing the pending bit prior to jumping to the
work function. This means that, unless the driver makes some guarantee itself
that the work_struct won't go away, the work function may not access anything
else in the work_struct or its container lest they be deallocated.. This is a
problem if the auxiliary data is taken away (as done by the last patch).
However, if the pending bit is *not* cleared before jumping to the work
function, then the work function *may* access the work_struct and its container
with no problems. But then the work function must itself release the
work_struct by calling work_release().
In most cases, automatic release is fine, so this is the default. Special
initiators exist for the non-auto-release case (ending in _NAR).
Signed-Off-By: David Howells <dhowells@redhat.com>
Currently the init_srcu_struct() routine has no way to report out-of-memory
errors. This patch (as761) makes it return -ENOMEM when the per-cpu data
allocation fails.
The patch also makes srcu_init_notifier_head() report a BUG if a notifier
head can't be initialized. Perhaps it should return -ENOMEM instead, but
in the most likely cases where this might occur I don't think any recovery
is possible. Notifier chains generally are not created dynamically.
[akpm@osdl.org: avoid statement-with-side-effect in macro]
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch (as751) adds a new type of notifier chain, based on the SRCU
(Sleepable Read-Copy Update) primitives recently added to the kernel. An
SRCU notifier chain is much like a blocking notifier chain, in that it must
be called in process context and its callout routines are allowed to sleep.
The difference is that the chain's links are protected by the SRCU
mechanism rather than by an rw-semaphore, so calling the chain has
extremely low overhead: no memory barriers and no cache-line bouncing. On
the other hand, unregistering from the chain is expensive and the chain
head requires special runtime initialization (plus cleanup if it is to be
deallocated).
SRCU notifiers are appropriate for notifiers that will be called very
frequently and for which unregistration occurs very seldom. The proposed
"task notifier" scheme qualifies, as may some of the network notifiers.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Acked-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are a few places in the kernel where the init task is signaled. The
ctrl+alt+del sequence is one them. It kills a task, usually init, using a
cached pid (cad_pid).
This patch replaces the pid_t by a struct pid to avoid pid wrap around
problem. The struct pid is initialized at boot time in init() and can be
modified through systctl with
/proc/sys/kernel/cad_pid
[ I haven't found any distro using it ? ]
It also introduces a small helper routine kill_cad_pid() which is used
where it seemed ok to use cad_pid instead of pid 1.
[akpm@osdl.org: cleanups, build fix]
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Replace references to system_utsname to the per-process uts namespace
where appropriate. This includes things like uname.
Changes: Per Eric Biederman's comments, use the per-process uts namespace
for ELF_PLATFORM, sunrpc, and parts of net/ipv4/ipconfig.c
[jdike@addtoit.com: UML fix]
[clg@fr.ibm.com: cleanup]
[akpm@osdl.org: build fix]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When kprobe is re-entered, the re-entered kprobe kernel path will will call
atomic_notifier_call_chain function, if this function is kprobed that will
incur numerous kprobe recursive fault. This patch disallows kprobes on
atomic_notifier_call_chain function.
Signed-off-by: bibo, mao <bibo.mao@intel.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix up kernel/sys.c to be consistent with CodingStyle and the rest of the
file.
Signed-off-by: Cal Peake <cp@absolutedigital.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Get rid of an extraneous printk in kernel_restart().
Signed-off-by: Cal Peake <cp@absolutedigital.net>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For NUMA optimization and some other algorithms it is useful to have a fast
to get the current CPU and node numbers in user space.
x86-64 added a fast way to do this in a vsyscall. This adds a generic
syscall for other architectures to make it a generic portable facility.
I expect some of them will also implement it as a faster vsyscall.
The cache is an optimization for the x86-64 vsyscall optimization. Since
what the syscall returns is an approximation anyways and user space
often wants very fast results it can be cached for some time. The norma
methods to get this information in user space are relatively slow
The vsyscall is in a better position to manage the cache because it has direct
access to a fast time stamp (jiffies). For the generic syscall optimization
it doesn't help much, but enforce a valid argument to keep programs
portable
I only added an i386 syscall entry for now. Other architectures can follow
as needed.
AK: Also added some cleanups from Andrew Morton
Signed-off-by: Andi Kleen <ak@suse.de>
Based on a patch from Ernie Petrides
During security research, Red Hat discovered a behavioral flaw in core
dump handling. A local user could create a program that would cause a
core file to be dumped into a directory they would not normally have
permissions to write to. This could lead to a denial of service (disk
consumption), or allow the local user to gain root privileges.
The prctl() system call should never allow to set "dumpable" to the
value 2. Especially not for non-privileged users.
This can be split into three cases:
1) running as root -- then core dumps will already be done as root,
and so prctl(PR_SET_DUMPABLE, 2) is not useful
2) running as non-root w/setuid-to-root -- this is the debatable case
3) running as non-root w/setuid-to-non-root -- then you definitely
do NOT want "dumpable" to get set to 2 because you have the
privilege escalation vulnerability
With case #2, the only potential usefulness is for a program that has
designed to run with higher privilege (than the user invoking it) that
wants to be able to create root-owned root-validated core dumps. This
might be useful as a debugging aid, but would only be safe if the program
had done a chdir() to a safe directory.
There is no benefit to a production setuid-to-root utility, because it
shouldn't be dumping core in the first place. If this is true, then the
same debugging aid could also be accomplished with the "suid_dumpable"
sysctl.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- proper prototypes for the following functions:
- ctrl_alt_del() (in include/linux/reboot.h)
- getrusage() (in include/linux/resource.h)
- make the following needlessly global functions static:
- kernel_restart_prepare()
- kernel_kexec()
[akpm@osdl.org: compile fix]
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since raw_notifier chains don't benefit from any centralized locking
protections, they shouldn't suffer from the associated limitations. Under
some circumstances it might make sense for a raw_notifier callout routine
to unregister itself from the notifier chain. This patch (as678) changes
the notifier core to allow for such things.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
kernel/sys.c doesn't have anything in it relying on linux/init.h -
remove the include.
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (139 commits)
[POWERPC] re-enable OProfile for iSeries, using timer interrupt
[POWERPC] support ibm,extended-*-frequency properties
[POWERPC] Extra sanity check in EEH code
[POWERPC] Dont look for class-code in pci children
[POWERPC] Fix mdelay badness on shared processor partitions
[POWERPC] disable floating point exceptions for init
[POWERPC] Unify ppc syscall tables
[POWERPC] mpic: add support for serial mode interrupts
[POWERPC] pseries: Print PCI slot location code on failure
[POWERPC] spufs: one more fix for 64k pages
[POWERPC] spufs: fail spu_create with invalid flags
[POWERPC] spufs: clear class2 interrupt status before wakeup
[POWERPC] spufs: fix Makefile for "make clean"
[POWERPC] spufs: remove stop_code from struct spu
[POWERPC] spufs: fix spu irq affinity setting
[POWERPC] spufs: further abstract priv1 register access
[POWERPC] spufs: split the Cell BE support into generic and platform dependant parts
[POWERPC] spufs: dont try to access SPE channel 1 count
[POWERPC] spufs: use kzalloc in create_spu
[POWERPC] spufs: fix initial state of wbox file
...
Manually resolved conflicts in:
drivers/net/phy/Makefile
include/asm-powerpc/spu.h
Avoid taking tasklist_lock for at getrusage for the multithreaded case too.
We don't need to take the tasklist lock for thread traversal of a process
since Oleg's do-__unhash_process-under-siglock.patch and related work.
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This new prctl is intended for changing the execution mode of the
processor, on processors that support both a little-endian mode and a
big-endian mode. It is intended for use by programs such as
instruction set emulators (for example an x86 emulator on PowerPC),
which may find it convenient to use the processor in an alternate
endianness mode when executing translated instructions.
Note that this does not imply the existence of a fully-fledged ABI for
both endiannesses, or of compatibility code for converting system
calls done in the non-native endianness mode. The program is expected
to arrange for all of its system call arguments to be presented in the
native endianness.
Switching between big and little-endian mode will require some care in
constructing the instruction sequence for the switch. Generally the
instructions up to the instruction that invokes the prctl system call
will have to be in the old endianness, and subsequent instructions
will have to be in the new endianness.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The core problem: setsid fails if it is called by init. The effect in 2.6.16
and the earlier kernels that have this problem is that if you do a "ps -j 1 or
ps -ej 1" you will see that init and several of it's children have process
group and session == 0. Instead of process group == session == 1. Despite
init calling setsid.
The reason it fails is that daemonize calls set_special_pids(1,1) on kernel
threads that are launched before /sbin/init is called.
The only remaining effect in that current->signal->leader == 0 for init
instead of 1. And the setsid call fails. No one has noticed because
/sbin/init does not check the return value of setsid.
In 2.4 where we don't have the pidhash table, and daemonize doesn't exist
setsid actually works for init.
I care a lot about pid == 1 not being a special case that we leave broken,
because of the container/jail work that I am doing.
- Carefully allow init (pid == 1) to call setsid despite the kernel using
its session.
- Use find_task_by_pid instead of find_pid because find_pid taking a
pidtype is going away.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch reverts 'CONFIG_SMP && thread_group_empty()' optimization in
sys_times(). The reason is that the next patch breaks memory ordering which
is needed for that optimization.
tasklist_lock in sys_times() will be eliminated completely by further patch.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The kernel's implementation of notifier chains is unsafe. There is no
protection against entries being added to or removed from a chain while the
chain is in use. The issues were discussed in this thread:
http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2
We noticed that notifier chains in the kernel fall into two basic usage
classes:
"Blocking" chains are always called from a process context
and the callout routines are allowed to sleep;
"Atomic" chains can be called from an atomic context and
the callout routines are not allowed to sleep.
We decided to codify this distinction and make it part of the API. Therefore
this set of patches introduces three new, parallel APIs: one for blocking
notifiers, one for atomic notifiers, and one for "raw" notifiers (which is
really just the old API under a new name). New kinds of data structures are
used for the heads of the chains, and new routines are defined for
registration, unregistration, and calling a chain. The three APIs are
explained in include/linux/notifier.h and their implementation is in
kernel/sys.c.
With atomic and blocking chains, the implementation guarantees that the chain
links will not be corrupted and that chain callers will not get messed up by
entries being added or removed. For raw chains the implementation provides no
guarantees at all; users of this API must provide their own protections. (The
idea was that situations may come up where the assumptions of the atomic and
blocking APIs are not appropriate, so it should be possible for users to
handle these things in their own way.)
There are some limitations, which should not be too hard to live with. For
atomic/blocking chains, registration and unregistration must always be done in
a process context since the chain is protected by a mutex/rwsem. Also, a
callout routine for a non-raw chain must not try to register or unregister
entries on its own chain. (This did happen in a couple of places and the code
had to be changed to avoid it.)
Since atomic chains may be called from within an NMI handler, they cannot use
spinlocks for synchronization. Instead we use RCU. The overhead falls almost
entirely in the unregister routine, which is okay since unregistration is much
less frequent that calling a chain.
Here is the list of chains that we adjusted and their classifications. None
of them use the raw API, so for the moment it is only a placeholder.
ATOMIC CHAINS
-------------
arch/i386/kernel/traps.c: i386die_chain
arch/ia64/kernel/traps.c: ia64die_chain
arch/powerpc/kernel/traps.c: powerpc_die_chain
arch/sparc64/kernel/traps.c: sparc64die_chain
arch/x86_64/kernel/traps.c: die_chain
drivers/char/ipmi/ipmi_si_intf.c: xaction_notifier_list
kernel/panic.c: panic_notifier_list
kernel/profile.c: task_free_notifier
net/bluetooth/hci_core.c: hci_notifier
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_chain
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_expect_chain
net/ipv6/addrconf.c: inet6addr_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_expect_chain
net/netlink/af_netlink.c: netlink_chain
BLOCKING CHAINS
---------------
arch/powerpc/platforms/pseries/reconfig.c: pSeries_reconfig_chain
arch/s390/kernel/process.c: idle_chain
arch/x86_64/kernel/process.c idle_notifier
drivers/base/memory.c: memory_chain
drivers/cpufreq/cpufreq.c cpufreq_policy_notifier_list
drivers/cpufreq/cpufreq.c cpufreq_transition_notifier_list
drivers/macintosh/adb.c: adb_client_list
drivers/macintosh/via-pmu.c sleep_notifier_list
drivers/macintosh/via-pmu68k.c sleep_notifier_list
drivers/macintosh/windfarm_core.c wf_client_list
drivers/usb/core/notify.c usb_notifier_list
drivers/video/fbmem.c fb_notifier_list
kernel/cpu.c cpu_chain
kernel/module.c module_notify_list
kernel/profile.c munmap_notifier
kernel/profile.c task_exit_notifier
kernel/sys.c reboot_notifier_list
net/core/dev.c netdev_chain
net/decnet/dn_dev.c: dnaddr_chain
net/ipv4/devinet.c: inetaddr_chain
It's possible that some of these classifications are wrong. If they are,
please let us know or submit a patch to fix them. Note that any chain that
gets called very frequently should be atomic, because the rwsem read-locking
used for blocking chains is very likely to incur cache misses on SMP systems.
(However, if the chain's callout routines may sleep then the chain cannot be
atomic.)
The patch set was written by Alan Stern and Chandra Seetharaman, incorporating
material written by Keith Owens and suggestions from Paul McKenney and Andrew
Morton.
[jes@sgi.com: restructure the notifier chain initialization macros]
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch avoids arithmetic on 'signed' types that are slower than
'unsigned'. This saves space and cpu cycles.
size of kernel/sys.o before the patch (gcc-3.4.5)
text data bss dec hex filename
10924 252 4 11180 2bac kernel/sys.o
size of kernel/sys.o after the patch
text data bss dec hex filename
10903 252 4 11159 2b97 kernel/sys.o
I noticed that gcc-4.1.0 (from Fedora Core 5) even uses idiv instruction for
(a+b)/2 if a and b are signed.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While doing some benchmarks of an Apache/PHP SMP server, I noticed high
oprofile numbers in in_group_p() and _atomic_dec_and_lock().
rank percent
1 4.8911 % __link_path_walk
2 4.8503 % __d_lookup
*3 4.2911 % _atomic_dec_and_lock
4 3.9307 % __copy_to_user_ll
5 4.9004 % sysenter_past_esp
*6 3.3248 % in_group_p
It appears that in_group_p() does an uncessary
get_group_info(current->group_info); /* atomic_inc() */
... /* access current->group_info */
put_group_info(current->group_info); /* _atomic_dec_and_lock */
It is not necessary to do this, because the current task holds a reference
on its own group_info, and this reference cannot change during the lookup.
This patch deletes the get_group_info()/put_group_info() pair from
sys_getgroups(), in_group_p() and in_egroup_p() functions.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Tim Hockin <thockin@hockin.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move capable() to kernel/capability.c and eliminate duplicate
implementations. Add __capable() function which can be used to check for
capabiilty of any process.
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Document the fact that setrlimit(RLIMIT_CPU) doesn't return error codes when
it should. I don't think we can fix this without a 2.7.x..
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ulrich Weigand <uweigand@de.ibm.com>
Cc: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
At present the kernel doesn't honour an attempt to set RLIMIT_CPU to zero
seconds. But the spec says it should, and that's what 2.4.x does.
Fixing this for real would involve some complexity (such as adding a new
it-has-been-set flag to the task_struct, and testing that everwhere, instead
of overloading the value of it_prof_expires).
Given that a 2.4 kernel won't actually send the signal until one second has
expired anyway, let's just handle this case by treating the caller's
zero-seconds as one second.
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ulrich Weigand <uweigand@de.ibm.com>
Cc: Cliff Wickman <cpw@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Whitespace cleanups
- Make that expression comprehensible.
There's a potential logic change here: we do the "is it_prof_expires equal to
zero" test after converting it to seconds, rather than doing the comparison
between raw cputime_t's.
But given that it's in units of seconds anyway, that shouldn't change
anything.
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ulrich Weigand <uweigand@de.ibm.com>
Cc: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Semaphore to mutex conversion.
The conversion was generated via scripts, and the result was validated
automatically via a script as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Avoid taking the global tasklist_lock when possible, if a process is single
threaded during getrusage(). Any avoidance of tasklist_lock is good for
NUMA boxes (and possibly for large SMPs). Thanks to Oleg Nesterov for
review and suggestions.
Signed-off-by: Nippun Goel <nippung@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>