* Use struct path in fs_struct.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Add path_put() functions for releasing a reference to the dentry and
vfsmount of a struct path in the right order
* Switch from path_release(nd) to path_put(&nd->path)
* Rename dput_path() to path_put_conditional()
[akpm@linux-foundation.org: fix cifs]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the central patch of a cleanup series. In most cases there is no good
reason why someone would want to use a dentry for itself. This series reflects
that fact and embeds a struct path into nameidata.
Together with the other patches of this series
- it enforced the correct order of getting/releasing the reference count on
<dentry,vfsmount> pairs
- it prepares the VFS for stacking support since it is essential to have a
struct path in every place where the stack can be traversed
- it reduces the overall code size:
without patch series:
text data bss dec hex filename
5321639 858418 715768 6895825 6938d1 vmlinux
with patch series:
text data bss dec hex filename
5320026 858418 715768 6894212 693284 vmlinux
This patch:
Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix cifs]
[akpm@linux-foundation.org: fix smack]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commit c6caeb7c45 ("proc: fix the
threaded /proc/self"), since Eric says "The patch really is wrong.
There is at least one corner case in procps that cares."
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Guillaume Chazarain" <guichaz@yahoo.fr>
Cc: "Pavel Emelyanov" <xemul@openvz.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Long ago when the CLONE_THREAD support first went it someone thought it
would be wise to point /proc/self at /proc/<tgid> instead of /proc/<pid>.
Given that /proc/<tgid> can return information about a very different task
(if enough things have been unshared) then our current process /proc/<tgid>
seems blatantly wrong. So far I have yet to think up an example where the
current behavior would be advantageous, and I can see several places where
it is seriously non-intuitive.
We may be stuck with the current broken behavior for backwards
compatibility reasons but lets try fixing our ancient bug for the 2.6.25
time frame and see if anyone screams.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: "Guillaume Chazarain" <guichaz@yahoo.fr>
Cc: "Pavel Emelyanov" <xemul@openvz.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently if you access a /proc that is not mounted with your processes
current pid namespace /proc/self will point at a completely random task.
This patch fixes /proc/self to point to the current process if it is
available in the particular mount of /proc or to return -ENOENT if the
current process is not visible.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we possibly lookup the pid in the wrong pid namespace. So
seq_file convert proc_pid_status which ensures the proper pid namespaces is
passed in.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: another build fix]
[akpm@linux-foundation.org: s390 build fix]
[akpm@linux-foundation.org: fix task_name() output]
[akpm@linux-foundation.org: fix nommu build]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andrew Morgan <morgan@kernel.org>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This conversion is just for code cleanliness, uniformity, and general safety.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently (as pointed out by Oleg) do_task_stat has a race when calling
task_pid_nr_ns with the task exiting. In addition do_task_stat is not
currently displaying information in the context of the pid namespace that
mounted the /proc filesystem. So "cut -d' ' -f 1 /proc/<pid>/stat" may not
equal <pid>.
This patch fixes the problem by converting to a single_open seq_file show
method. Getting the pid namespace from the filesystem superblock instead of
current, and simply using the the struct pid from the inode instead of
attempting to get that same pid from the task.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently many /proc/pid files use a crufty precursor to the current seq_file
api, and they don't have direct access to the pid_namespace or the pid of for
which they are displaying data.
So implement proc_single_file_operations to make the seq_file routines easy to
use, and to give access to the full state of the pid of we are displaying data
for.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes a sles9 system hang in start_this_handle from a customer
with some heavy workload where all tasks are waiting on kjournald to commit
the transaction, but kjournald waits on t_updates to go down to zero (it
never does).
This was reported as a lowmem shortage deadlock but when checking the debug
data I noticed the VM wasn't under pressure at all (well it was really
under vm pressure, because lots of tasks hanged in the VM prune_dcache
methods trying to flush dirty inodes, but no task was hanging in GFP_NOFS
mode, the holder of the journal handle should have if this was a vm issue
in the first place).
No task was apparently holding the leftover handle in the committing
transaction, so I deduced t_updates was stuck to 1 because a journal_stop
was never run by some path (this turned out to be correct). With a debug
patch adding proper reverse links and stack trace logging in ext3 deployed
in production, I found journal_stop is never run because
mark_inode_dirty_sync is called inside release_task called by do_exit.
(that was quite fun because I would have never thought about this
subtleness, I thought a regular path in ext3 had a bug and it forgot to
call journal_stop)
do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
come back to run journal_stop)
The reason is that shrink_dcache_parent is racy by design (feature not
a bug) and it can do blocking I/O in some case, but the point is that
calling shrink_dcache_parent at the last stage of do_exit isn't safe
for self-reaping tasks.
I guess the memory pressure of the unbalanced highmem system allowed
to trigger this more easily.
Now mainline doesn't have this line in iput (like sles9 has):
if (inode->i_state & I_DIRTY_DELAYED)
mark_inode_dirty_sync(inode);
so it will probably not crash with ext3, but for example ext2 implements an
I/O-blocking ext2_put_inode that will lead to similar screwups with
ext2_free_blocks never coming back and it's definitely wrong to call
blocking-IO paths inside do_exit. So this should fix a subtle bug in
mainline too (not verified in practice though). The equivalent fix for
ext3 is also not verified yet to fix the problem in sles9 but I don't have
doubt it will (it usually takes days to crash, so it'll take weeks to be
sure).
An alternate fix would be to offload that work to a kernel thread, but I
don't think a reschedule for this is worth it, the vm should be able to
collect those entries for the synchronous release_task.
Signed-off-by: Andrea Arcangeli <andrea@suse.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make /proc/ page monitoring configurable
This puts the following files under an embedded config option:
/proc/pid/clear_refs
/proc/pid/smaps
/proc/pid/pagemap
/proc/kpagecount
/proc/kpageflags
[akpm@linux-foundation.org: Kconfig fix]
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This interface provides a mapping for each page in an address space to its
physical page frame number, allowing precise determination of what pages are
mapped and what pages are shared between processes.
New in this version:
- headers gone again (as recommended by Dave Hansen and Alan Cox)
- 64-bit entries (as per discussion with Andi Kleen)
- swap pte information exported (from Dave Hansen)
- page walker callback for holes (from Dave Hansen)
- direct put_user I/O (as suggested by Rusty Russell)
This patch folds in cleanups and swap PTE support from Dave Hansen
<haveblue@us.ibm.com>.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This puts all the clear_refs code where it belongs and probably lets things
compile on MMU-less systems as well.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: (22 commits)
Remove commented-out code copied from NFS
NFS: Switch from intr mount option to TASK_KILLABLE
Add wait_for_completion_killable
Add wait_event_killable
Add schedule_timeout_killable
Use mutex_lock_killable in vfs_readdir
Add mutex_lock_killable
Use lock_page_killable
Add lock_page_killable
Add fatal_signal_pending
Add TASK_WAKEKILL
exit: Use task_is_*
signal: Use task_is_*
sched: Use task_contributes_to_load, TASK_ALL and TASK_NORMAL
ptrace: Use task_is_*
power: Use task_is_*
wait: Use TASK_NORMAL
proc/base.c: Use task_is_*
proc/array.c: Use TASK_REPORT
perfmon: Use task_is_*
...
Fixed up conflicts in NFS/sunrpc manually..
LatencyTOP kernel infrastructure; it measures latencies in the
scheduler and tracks it system wide and per process.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Contents of /proc/*/maps is sensitive and may become sensitive after
open() (e.g. if target originally shares our ->mm and later does exec
on suid-root binary).
Check at read() (actually, ->start() of iterator) time that mm_struct
we'd grabbed and locked is
- still the ->mm of target
- equal to reader's ->mm or the target is ptracable by reader.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg noticed that the call of task_pid_nr_ns() in proc_pid_readdir
is racy with respect to tasks exiting.
After a bit of examination it also appears that the call itself
is completely unnecessary.
So to fix the problem this patch modifies next_tgid() to return
both a tgid and the task struct in question.
A structure is introduced to return these values because it is
slightly cleaner and easier to optimize, and the resulting code
is a little shorter.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we special case when we have only the initial pid namespace.
Unfortunately in doing so the copied case for the other namespaces was
broken so we don't properly flush the thread directories :(
So this patch removes the unnecessary special case (removing a usage of
proc_mnt) and corrects the flushing of the thread directories.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix mnt_flush_task() misplaced kernel-doc.
Fix typos in some of the doc text.
Warning(linux-2.6.23-git17//fs/proc/base.c:2280): No description found for parameter 'mnt'
Warning(linux-2.6.23-git17//fs/proc/base.c:2280): No description found for parameter 'pid'
Warning(linux-2.6.23-git17//fs/proc/base.c:2280): No description found for parameter 'tgid'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When removing the explicit task_struct->pid usage I found that
proc_readfd_common() and proc_pident_readdir() get this field, but do not
use it at all. So this cleanup is a cheap help with the task_struct->pid
isolation.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, there exists no method for a process to query the resource
limits of another process. They can be inferred via some mechanisms but
they cannot be explicitly determined. Given that this information can be
usefull to know during the debugging of an application, I've written this
patch which exports all of a processes limits via /proc/<pid>/limits.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With pid namespaces this field is now dangerous to use explicitly, so hide
it behind the helpers.
Also the pid and pgrp fields o task_struct and signal_struct are to be
deprecated. Unfortunately this patch cannot be sent right now as this
leads to tons of warnings, so start isolating them, and deprecate later.
Actually the p->tgid == pid has to be changed to has_group_leader_pid(),
but Oleg pointed out that in case of posix cpu timers this is the same, and
thread_group_leader() is more preferable.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.
The idea is:
- all in-kernel data structures must store either struct pid itself
or the pid's global nr, obtained with pid_nr() call;
- when seeking the task from kernel code with the stored id one
should use find_task_by_pid() call that works with global pids;
- when showing pid's numerical value to the user the virtual one
should be used, but however when one shows task's pid outside this
task's namespace the global one is to be used;
- when getting the pid from userspace one need to consider this as
the virtual one and use appropriate task/pid-searching functions.
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: nuther build fix]
[akpm@linux-foundation.org: yet nuther build fix]
[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The namespace's proc_mnt must be kern_mount-ed to make this pointer always
valid, independently of whether the user space mounted the proc or not. This
solves raced in proc_flush_task, etc. with the proc_mnt switching from NULL
to not-NULL.
The initialization is done after the init's pid is created and hashed to make
proc_get_sb() finr it and get for root inode.
Sice the namespace holds the vfsmnt, vfsmnt holds the superblock and the
superblock holds the namespace we must explicitly break this circle to destroy
all the stuff. This is done after the init of the namespace dies. Running a
few steps forward - when init exits it will kill all its children, so no
proc_mnt will be needed after its death.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This means that proc_flush_task_mnt() is to be called for many proc mounts and
with different ids, depending on the namespace this pid is to be flushed from.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When searching the task by numerical id on may need to find it using global
pid (as it is done now in kernel) or by its virtual id, e.g. when sending a
signal to a task from one namespace the sender will specify the task's virtual
id and we should find the task by this value.
[akpm@linux-foundation.org: fix gfs2 linkage]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The first part is trivial - we just make the proc_flush_task() to operate on
arbitrary vfsmount with arbitrary ids and pass the pid and global proc_mnt to
it.
The other change is more tricky: I moved the proc_flush_task() call in
release_task() higher to address the following problem.
When flushing task from many proc trees we need to know the set of ids (not
just one pid) to find the dentries' names to flush. Thus we need to pass the
task's pid to proc_flush_task() as struct pid is the only object that can
provide all the pid numbers. But after __exit_signal() task has detached all
his pids and this information is lost.
This creates a tiny gap for proc_pid_lookup() to bring some dentries back to
tree and keep them in hash (since pids are still alive before __exit_signal())
till the next shrink, but since proc_flush_task() does not provide a 100%
guarantee that the dentries will be flushed, this is OK to do so.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When someone wants to deal with some other taks's namespaces it has to lock
the task and then to get the desired namespace if the one exists. This is
slow on read-only paths and may be impossible in some cases.
E.g. Oleg recently noticed a race between unshare() and the (sent for
review in cgroups) pid namespaces - when the task notifies the parent it
has to know the parent's namespace, but taking the task_lock() is
impossible there - the code is under write locked tasklist lock.
On the other hand switching the namespace on task (daemonize) and releasing
the namespace (after the last task exit) is rather rare operation and we
can sacrifice its speed to solve the issues above.
The access to other task namespaces is proposed to be performed
like this:
rcu_read_lock();
nsproxy = task_nsproxy(tsk);
if (nsproxy != NULL) {
/ *
* work with the namespaces here
* e.g. get the reference on one of them
* /
} / *
* NULL task_nsproxy() means that this task is
* almost dead (zombie)
* /
rcu_read_unlock();
This patch has passed the review by Eric and Oleg :) and,
of course, tested.
[clg@fr.ibm.com: fix unshare()]
[ebiederm@xmission.com: Update get_net_ns_by_pid]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the filesystem support logic from the cpusets system and makes cpusets
a cgroup subsystem
The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
passed through to the cgroup filesystem with the appropriate options to
emulate the old cpuset filesystem behaviour.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/PID/environ currently truncates at 4096 characters, patch based on
the /proc/PID/mem code.
Signed-off-by: James Pearson <james-p@moving-picture.com>
Cc: Anton Arapov <aarapov@redhat.com>
Cc: Jan Engelhardt <jengelh@computergmbh.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix f_version type: should be u64 instead of long
There is a type inconsistency between struct inode i_version and struct file
f_version.
fs.h:
struct inode
u64 i_version;
and
struct file
unsigned long f_version;
Users do:
fs/ext3/dir.c:
if (filp->f_version != inode->i_version) {
So why isn't f_version a u64 ? It becomes a problem if versions gets
higher than 2^32 and we are on an architecture where longs are 32 bits.
This patch changes the f_version type to u64, and updates the users accordingly.
It applies to 2.6.23-rc2-mm2.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Martin Bligh <mbligh@google.com>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: <linux-ext4@vger.kernel.org>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch marks a number of allocations that are either short-lived such as
network buffers or are reclaimable such as inode allocations. When something
like updatedb is called, long-lived and unmovable kernel allocations tend to
be spread throughout the address space which increases fragmentation.
This patch groups these allocations together as much as possible by adding a
new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be
reclaimed on demand, but not moved. i.e. they can be migrated by deleting
them and re-reading the information from elsewhere.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rename all 'cnt' fields and variables to the less yucky 'count' name.
yuckage noticed by Andrew Morton.
no change in code, other than the /proc/sched_debug bkl_count string got
a bit larger:
text data bss dec hex filename
38236 3506 24 41766 a326 sched.o.before
38240 3506 24 41770 a32a sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds an interface to set/reset flags which determines each memory
segment should be dumped or not when a core file is generated.
/proc/<pid>/coredump_filter file is provided to access the flags. You can
change the flag status for a particular process by writing to or reading from
the file.
The flag status is inherited to the child process when it is created.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch changes mm_struct.dumpable to a pair of bit flags.
set_dumpable() converts three-value dumpable to two flags and stores it into
lower two bits of mm_struct.flags instead of mm_struct.dumpable.
get_dumpable() behaves in the opposite way.
[akpm@linux-foundation.org: export set_dumpable]
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KSYM_NAME_LEN is peculiar in that it does not include the space for the
trailing '\0', forcing all users to use KSYM_NAME_LEN + 1 when allocating
buffer. This is nonsense and error-prone. Moreover, when the caller
forgets that it's very likely to subtly bite back by corrupting the stack
because the last position of the buffer is always cleared to zero.
This patch increments KSYM_NAME_LEN by one and updates code accordingly.
* off-by-one bug in asm-powerpc/kprobes.h::kprobe_lookup_name() macro
is fixed.
* Where MODULE_NAME_LEN and KSYM_NAME_LEN were used together,
MODULE_NAME_LEN was treated as if it didn't include space for the
trailing '\0'. Fix it.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Paulo Marques <pmarques@grupopie.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reduces the memory footprint and it enforces that only the current
task can enable seccomp on itself (this is a requirement for a
strightforward [modulo preempt ;) ] TIF_NOTSC implementation).
Signed-off-by: Andrea Arcangeli <andrea@cpushare.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's a bit dopey-looking and can permit a task to cause a pagefault in an mm
which it doesn't have permission to read from.
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/pid/clear_refs is only defined in the CONFIG_MMU case, so make sure we
don't have any references to clear_refs_smap() in generic procfs code.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cleanup using simple_read_from_buffer() in procfs.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
notify_change() already calls security_inode_setattr() before
calling iop->setattr.
Signed-off-by: Tony Jones <tonyj@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: John Johansen <jjohansen@suse.de>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kallsyms_lookup() can go iterating over modules list unprotected which is OK
for emergency situations (oops), but not OK for regular stuff like
/proc/*/wchan.
Introduce lookup_symbol_name()/lookup_module_symbol_name() which copy symbol
name into caller-supplied buffer or return -ERANGE. All copying is done with
module_mutex held, so...
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several kallsyms_lookup() pass dummy arguments but only need, say, module's
name. Make kallsyms_lookup() accept NULLs where possible.
Also, makes picture clearer about what interfaces are needed for all symbol
resolving business.
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.
Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eternal quest to make
while true; do cat /proc/fs/xfs/stat >/dev/null 2>/dev/null; done
while true; do find /proc -type f 2>/dev/null | xargs cat >/dev/null 2>/dev/null; done
while true; do modprobe xfs; rmmod xfs; done
work reliably continues and now kernel oopses in the following way:
BUG: unable to handle ... at virtual address 6b6b6b6b
EIP is at badness
process: cat
proc_oom_score
proc_info_read
sys_fstat64
vfs_read
proc_info_read
sys_read
Failing code is prefetch hidden in list_for_each_entry() in badness().
badness() is reachable from two points. One is proc_oom_score, another
is out_of_memory() => select_bad_process() => badness().
Second path grabs tasklist_lock, while first doesn't.
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for finding out the current file position, open flags and
possibly other info in the future.
These new entries are added:
/proc/PID/fdinfo/FD
/proc/PID/task/TID/fdinfo/FD
For each fd the information is provided in the following format:
pos: 1234
flags: 0100002
[bunk@stusta.de: make struct proc_fdinfo_file_operations static]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the order of fields of struct pid_entry (file fs/proc/base.c) in order
to avoid a hole on 64bit archs. (8 bytes saved per object)
Also change all pid_entry arrays to be const qualified, to make clear they
must not be modified.
Before (on x86_64) :
# size fs/proc/base.o
text data bss dec hex filename
15549 2192 0 17741 454d fs/proc/base.o
After :
# size fs/proc/base.o
text data bss dec hex filename
17229 176 0 17405 43fd fs/proc/base.o
Thats 336 bytes saved on kernel size on x86_64
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The /proc/pid/ "maps", "smaps", and "numa_maps" files contain sensitive
information about the memory location and usage of processes. Issues:
- maps should not be world-readable, especially if programs expect any
kind of ASLR protection from local attackers.
- maps cannot just be 0400 because "-D_FORTIFY_SOURCE=2 -O2" makes glibc
check the maps when %n is in a *printf call, and a setuid(getuid())
process wouldn't be able to read its own maps file. (For reference
see http://lkml.org/lkml/2006/1/22/150)
- a system-wide toggle is needed to allow prior behavior in the case of
non-root applications that depend on access to the maps contents.
This change implements a check using "ptrace_may_attach" before allowing
access to read the maps contents. To control this protection, the new knob
/proc/sys/kernel/maps_protect has been added, with corresponding updates to
the procfs documentation.
[akpm@linux-foundation.org: build fixes]
[akpm@linux-foundation.org: New sysctl numbers are old hat]
Signed-off-by: Kees Cook <kees@outflux.net>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/$PID/fd has r-x------ permissions, so if process does setuid(), it
will not be able to access /proc/*/fd/. This breaks fstatat() emulation
in glibc.
open("foo", O_RDONLY|O_DIRECTORY) = 4
setuid32(65534) = 0
stat64("/proc/self/fd/4/bar", 0xbfafb298) = -1 EACCES (Permission denied)
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Acked-By: Kirill Korotaev <dev@openvz.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adds /proc/pid/clear_refs. When any non-zero number is written to this file,
pte_mkold() and ClearPageReferenced() is called for each pte and its
corresponding page, respectively, in that task's VMAs. This file is only
writable by the user who owns the task.
It is now possible to measure _approximately_ how much memory a task is using
by clearing the reference bits with
echo 1 > /proc/pid/clear_refs
and checking the reference count for each VMA from the /proc/pid/smaps output
at a measured time interval. For example, to observe the approximate change
in memory footprint for a task, write a script that clears the references
(echo 1 > /proc/pid/clear_refs), sleeps, and then greps for Pgs_Referenced and
extracts the size in kB. Add the sizes for each VMA together for the total
referenced footprint. Moments later, repeat the process and observe the
difference.
For example, using an efficient Mozilla:
accumulated time referenced memory
---------------- -----------------
0 s 408 kB
1 s 408 kB
2 s 556 kB
3 s 1028 kB
4 s 872 kB
5 s 1956 kB
6 s 416 kB
7 s 1560 kB
8 s 2336 kB
9 s 1044 kB
10 s 416 kB
This is a valuable tool to get an approximate measurement of the memory
footprint for a task.
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
[akpm@linux-foundation.org: build fixes]
[mpm@selenic.com: rename for_each_pmd]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
have it return the buffer it had allocated
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pointers to user data should be marked with a __user hint. This one is
missing.
Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many struct inode_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many struct file_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
They are fat: 4x8 bytes in task_struct.
They are uncoditionally updated in every fork, read, write and sendfile.
They are used only if you have some "extended acct fields feature".
And please, please, please, read(2) knows about bytes, not characters,
why it is called "rchar"?
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a simple /proc/pid/io to show the IO accounting fields.
Maybe this shouldn't be merged in mainline - the preferred reporting channel
is taskstats. But given the poor state of our userspace support for
taskstats, this is useful for developer-testing, at least. And it improves
the changes that the procps developers will wire it up into top(1). Opinions
are sought.
The patch also wires up the existing IO-accounting fields.
It's a bit racy on 32-bit machines: if process A reads process B's
/proc/pid/io while process B is updating one of those 64-bit counters, process
A could see an intermediate result.
Cc: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Cc: David Wright <daw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch provides process filtering feature.
The process filter allows failing only permitted processes
by /proc/<pid>/make-it-fail
Please see the example that demostrates how to inject slab allocation
failures into module init/cleanup code
in Documentation/fault-injection/fault-injection.txt
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rename 'struct namespace' to 'struct mnt_namespace' to avoid confusion with
other namespaces being developped for the containers : pid, uts, ipc, etc.
'namespace' variables and attributes are also renamed to 'mnt_ns'
Signed-off-by: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in the proc
filesystem code.
Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently a user process cannot rise its own oom_adj value (i.e.
unprotecting itself from the OOM killer). As this value is stored in the
task structure it gets inherited and the unprivileged childs will be unable
to rise it.
The EPERM will be handled by the generic proc fs layer, as only processes
with the proper caps or the owner of the process will be able to write to
the file. So we allow only the processes with CAP_SYS_RESOURCE to lower
the value, otherwise it will get an EACCES which seems more appropriate
than EPERM.
Signed-off-by: Guillem Jover <guillem.jover@nokia.com>
Acked-by: Andrea Arcangeli <andrea@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Despite mm.h is not being exported header, it does contain one thing
which is part of userspace ABI -- value disabling OOM killer for given
process. So,
a) create and export include/linux/oom.h
b) move OOM_DISABLE define there.
c) turn bounding values of /proc/$PID/oom_adj into defines and export
them too.
Note: mass __KERNEL__ removal will be done later.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Actually, the decimal representation of a 32-bit signed number can take 12
bytes, including the \0.
And then some code adds a \n as well, so let's give it 13 bytes.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
proc_pid_make_inode:
ei->pid = get_pid(task_pid(task));
I think this is not safe. get_pid() can be preempted after checking "pid
!= NULL". Then the task exits, does detach_pid(), and RCU frees the pid.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It was pointed out that since I am taking ARRAY_SIZE anyway the trailing empty
entry is silly and just wastes space.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The value doesn't change but this ensures I will have the proper value when
other files are added to proc_base_stuff.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The implementation is exactly the same and there is currently nothing to
distinguish proc_tid_attr, and proc_tgid_attr. So it is pointless to have two
separate implementations.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The hard coded inode numbers in proc currently limit its maintainability,
its flexibility, and what can be done with the rest of system. /proc limits
pid-max to 32768 on 32 bit systems it limits fd-max to 32768 on all systems,
and placing the pid in the inode number really gets in the way of implementing
subdirectories of per process information.
Ever since people started adding to the middle of the file type enumeration we
haven't been maintaing the historical inode numbers, all we have really
succeeded in doing is keeping the pid in the proc inode number. The pid is
already available in the directory name so no information is lost removing it
from the inode number.
So if something in user space cares if we remove the inode number from the
/proc inode it is almost certainly broken.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
To remove the hard coded proc inode numbers it is necessary to be able to
create the proc inodes during readdir. The instantiate methods are the subset
of lookup that is needed to accomplish that.
This first step just splits the lookup methods into 2 functions.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch generalizes the concept of files in /proc that are related to
processes but live in the root directory of /proc
Ideally this would reuse infrastructure from the rest of the process specific
parts of proc but unfortunately security_task_to_inode must not be called on
files that are not strictly per process. security_task_to_inode really needs
to be reexamined as the security label can change in important places that we
are not currently catching, but I'm not certain that simplifies this problem.
By at least matching the structure of the rest of proc we get more idiom reuse
and it becomes easier to spot problems in the way things are put together.
Later things like /proc/mounts are likely to be moved into proc_base as well.
If union mounts are ever supported we may be able to make /proc a union mount,
and properly split it into 2 filesystems.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This moves the mount namespace into the nsproxy. The mount namespace count
now refers to the number of nsproxies point to it, rather than the number of
tasks. As a result, the unshare_namespace() function in kernel/fork.c no
longer checks whether it is being shared.
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently proc_pident_lookup gets the names and types from a table and then
has a huge switch statement to get the inode and file operations it needs.
That is silly and is becoming increasingly hard to maintain so I just put all
of the information in the table.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There were enough changes in my last round of cleaning up proc I had to break
up the patch series into smaller chunks, and my last chunk never got resent.
This patchset gives proc dynamic inode numbers (the static inode numbers were
a pain to maintain and prevent all kinds of things), and removes the horrible
switch statements that had to be kept in sync with everything else. Being
fully table driver takes us 90% of the way of being able to register new
process specific attributes in proc.
This patch:
Group the functions by what they implement instead of by type of operation.
As it existed base.c was quickly approaching the point where it could not be
followed.
No functionality or code changes asside from adding/removing forward
declartions are implemented in this patch.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The problem: An opendir, readdir, closedir sequence can fail to report
process ids that are continually in use throughout the sequence of system
calls. For this race to trigger the process that proc_pid_readdir stops at
must exit before readdir is called again.
This can cause ps to fail to report processes, and it is in violation of
posix guarantees and normal application expectations with respect to
readdir.
Currently there is no way to work around this problem in user space short
of providing a gargantuan buffer to user space so the directory read all
happens in on system call.
This patch implements the normal directory semantics for proc, that
guarantee that a directory entry that is neither created nor destroyed
while reading the directory entry will be returned. For directory that are
either created or destroyed during the readdir you may or may not see them.
Furthermore you may seek to a directory offset you have previously seen.
These are the guarantee that ext[23] provides and that posix requires, and
more importantly that user space expects. Plus it is a simple semantic to
implement reliable service. It is just a matter of calling readdir a
second time if you are wondering if something new has show up.
These better semantics are implemented by scanning through the pids in
numerical order and by making the file offset a pid plus a fixed offset.
The pid scan happens on the pid bitmap, which when you look at it is
remarkably efficient for a brute force algorithm. Given that a typical
cache line is 64 bytes and thus covers space for 64*8 == 200 pids. There
are only 40 cache lines for the entire 32K pid space. A typical system
will have 100 pids or more so this is actually fewer cache lines we have to
look at to scan a linked list, and the worst case of having to scan the
entire pid bitmap is pretty reasonable.
If we need something more efficient we can go to a more efficient data
structure for indexing the pids, but for now what we have should be
sufficient.
In addition this takes no additional locks and is actually less code than
what we are doing now.
Also another very subtle bug in this area has been fixed. It is possible
to catch a task in the middle of de_thread where a thread is assuming the
thread of it's thread group leader. This patch carefully handles that case
so if we hit it we don't fail to return the pid, that is undergoing the
de_thread dance.
Thanks to KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> for
providing the first fix, pointing this out and working on it.
[oleg@tv-sign.ru: fix it]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
At the beginning of the routine, "copied" is set to 0, but it is no good
because in lines 805 and 812 it is set to other values. Finally, the
routine returns as if it copied 12 (=ENOMEM) bytes less than it actually
did.
Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com>
Acked-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This just turns off chmod() on the /proc/<pid>/ files, since there is no
good reason to allow it, and had we disallowed it originally, the nasty
/proc race exploit wouldn't have been possible.
The other patches already fixed the problem chmod() could cause, so this
is really just some final mop-up..
This particular version is based off a patch by Eugene and Marcel which
had much better naming than my original equivalent one.
Signed-off-by: Eugene Teo <eteo@redhat.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We have a bad interaction with both the kernel and user space being able
to change some of the /proc file status. This fixes the most obvious
part of it, but I expect we'll also make it harder for users to modify
even their "own" files in /proc.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Below is a patch to add a new /proc/self/attr/sockcreate A process may write a
context into this interface and all subsequent sockets created will be labeled
with that context. This is the same idea as the fscreate interface where a
process can specify the label of a file about to be created. At this time one
envisioned user of this will be xinetd. It will be able to better label
sockets for the actual services. At this time all sockets take the label of
the creating process, so all xinitd sockets would just be labeled the same.
I tested this by creating a tcp sender and listener. The sender was able to
write to this new proc file and then create sockets with the specified label.
I am able to be sure the new label was used since the avc denial messages
kicked out by the kernel included both the new security permission
setsockcreate and all the socket denials were for the new label, not the label
of the running process.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Try to make next_tid() a bit more readable and deletes unnecessary
"pid_alive(pos)" check.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
first_tid:
/* If nr exceeds the number of threads there is nothing todo */
if (nr) {
if (nr >= get_nr_threads(leader))
goto done;
}
This is not reliable: sub-threads can exit after this check, so the
'for' loop below can overlap and proc_task_readdir() can return an
already filldir'ed dirents.
for (; pos && pid_alive(pos); pos = next_thread(pos)) {
if (--nr > 0)
continue;
Off-by-one error, will return 'leader' when nr == 1.
This patch tries to fix these problems and simplify the code.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is just like my previous removal of tasklist_lock from first_tgid, and
next_tgid. It simply had to wait until it was rcu safe to walk the thread
list.
This should be the last instance of the tasklist_lock in proc. So user
processes should not be able to influence the tasklist lock hold times.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In process of getting proc_fd_access_allowed to work it has developed a few
warts. In particular the special case that always allows introspection and
the special case to allow inspection of kernel threads.
The special case for introspection is needed for /proc/self/mem.
The special case for kernel threads really should be overridable
by security modules.
So consolidate these checks into ptrace.c:may_attach().
The check to always allow introspection is trivial.
The check to allow access to kernel threads, and zombies is a little
trickier. mem_read and mem_write already verify an mm exists so it isn't
needed twice. proc_fd_access_allowed only doesn't want a check to verify
task->mm exits, s it prevents all access to kernel threads. So just move
the task->mm check into ptrace_attach where it is needed for practical
reasons.
I did a quick audit and none of the security modules in the kernel seem to
care if they are passed a task without an mm into security_ptrace. So the
above move should be safe and it allows security modules to come up with
more restrictive policy.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since 2.2 we have been doing a chroot check to see if it is appropriate to
return a read or follow one of these magic symlinks. The chroot check was
asking a question about the visibility of files to the calling process and
it was actually checking the destination process, and not the files
themselves. That test was clearly bogus.
In my first pass through I simply fixed the test to check the visibility of
the files themselves. That naive approach to fixing the permissions was
too strict and resulted in cases where a task could not even see all of
it's file descriptors.
What has disturbed me about relaxing this check is that file descriptors
are per-process private things, and they are occasionaly used a user space
capability tokens. Looking a little farther into the symlink path on /proc
I did find userid checks and a check for capability (CAP_DAC_OVERRIDE) so
there were permissions checking this.
But I was still concerned about privacy. Besides /proc there is only one
other way to find out this kind of information, and that is ptrace. ptrace
has been around for a long time and it has a well established security
model.
So after thinking about it I finally realized that the permission checks
that make sense are the permission checks applied to ptrace_attach. The
checks are simple per process, and won't cause nasty surprises for people
coming from less capable unices.
Unfortunately there is one case that the current ptrace_attach test does
not cover: Zombies and kernel threads. Single stepping those kinds of
processes is impossible. Being able to see which file descriptors are open
on these tasks is important to lsof, fuser and friends. So for these
special processes I made the rule you can't find out unless you have
CAP_SYS_PTRACE.
These proc permission checks should now conform to the principle of least
surprise. As well as using much less code to implement :)
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The code doesn't need to sleep to when making this check so I can just do the
comparison and not worry about the reference counts.
TODO: While looking at this I realized that my original cleanup did not push
the permission check far enough down into the stack. The call of
proc_check_dentry_visible needs to move out of the generic proc
readlink/follow link code and into the individual get_link instances.
Otherwise the shared resources checks are not quite correct (shared
files_struct does not require a shared fs_struct), and there are races with
unshare.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Incrementally update my proc-dont-lock-task_structs-indefinitely patches so
that they work with struct pid instead of struct task_ref.
Mostly this is a straight 1-1 substitution.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Every inode in /proc holds a reference to a struct task_struct. If a
directory or file is opened and remains open after the the task exits this
pinning continues. With 8K stacks on a 32bit machine the amount pinned per
file descriptor is about 10K.
Normally I would figure a reasonable per user process limit is about 100
processes. With 80 processes, with a 1000 file descriptors each I can trigger
the 00M killer on a 32bit kernel, because I have pinned about 800MB of useless
data.
This patch replaces the struct task_struct pointer with a pointer to a struct
task_ref which has a struct task_struct pointer. The so the pinning of dead
tasks does not happen.
The code now has to contend with the fact that the task may now exit at any
time. Which is a little but not muh more complicated.
With this change it takes about 1000 processes each opening up 1000 file
descriptors before I can trigger the OOM killer. Much better.
[mlp@google.com: task_mmu small fixes]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Paul Jackson <pj@sgi.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Albert Cahalan <acahalan@gmail.com>
Signed-off-by: Prasanna Meda <mlp@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently in /proc at several different places we define buffers to hold a
process id, or a file descriptor . In most of them we use either a hard coded
number or a different define. Modify them all to use PROC_NUMBUF, so the code
has a chance of being maintained.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Like the bug Oleg spotted in first_tid there was also a small off by one
error in first_tgid, when a seek was done on the /proc directory. This
fixes that and changes the code structure to make it a little more obvious
what is going on.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since we no longer need the tasklist_lock for get_task_struct the lookup
methods no longer need the tasklist_lock.
This just depends on my previous patch that makes get_task_struct() rcu
safe.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We don't need the tasklist_lock to safely iterate through processes
anymore.
This depends on my previous to task patches that make get_task_struct rcu
safe, and that make next_task() rcu safe. I haven't gotten
first_tid/next_tid yet only because next_thread is missing an
rcu_dereference.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are a couple of problems this patch addresses.
- /proc/<tgid>/task currently does not work correctly if you stop reading
in the middle of a directory.
- /proc/ currently requires a full pass through the task list with
the tasklist lock held, to determine there are no more processes to read.
- The hand rolled integer to string conversion does not properly running
out of buffer space.
- We seem to be batching reading of pids from the tasklist without reason,
and complicating the logic of the code.
This patch addresses that by changing how tasks are processed. A
first_<task_type> function is built that handles restarts, and a
next_<task_type> function is built that just advances to the next task.
first_<task_type> when it detects a restart usually uses find_task_by_pid. If
that doesn't work because there has been a seek on the directory, or we have
already given a complete directory listing, it first checks the number tasks
of that type, and only if we are under that count does it walk through all of
the tasks to find the one we are interested in.
The code that fills in the directory is simpler because there is only a single
for loop.
The hand rolled integer to string conversion is replaced by snprintf which
should handle the the out of buffer case correctly.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
proc_lookup and task exiting are not synchronized, although some of the
previous code may have suggested that. Every time before we reuse a dentry
namei.c calls d_op->derevalidate which prevents us from reusing a stale dcache
entry. Unfortunately it does not prevent us from returning a stale dcache
entry. This race has been explicitly plugged in proc_pid_lookup but there is
nothing to confine it to just that proc lookup function.
So to prevent the race I call revalidate explictily in all of the proc lookup
functions after I call d_add, and report an error if the revalidate does not
succeed.
Years ago Al Viro did something similar but those changes got lost in the
churn.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
To keep the dcache from filling up with dead /proc entries we flush them on
process exit. However over the years that code has gotten hairy with a
dentry_pointer and a lock in task_struct and misdocumented as a correctness
feature.
I have rewritten this code to look and see if we have a corresponding entry in
the dcache and if so flush it on process exit. This removes the extra fields
in the task_struct and allows me to trivially handle the case of a
/proc/<tgid>/task/<pid> entry as well as the current /proc/<pid> entries.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
All of the functions for proc_maps_operations are already defined in
task_mmu.c so move the operations structure to keep the functionality
together.
Since task_nommu.c implements a dummy version of /proc/<pid>/maps give it a
simplified version of proc_maps_operations that it can modify to best suit its
needs.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use getattr to get an accurate link count when needed. This is cheaper and
more accurate than trying to derive it by walking the thread list of a
process.
Especially as it happens when needed stat instead of at readdir time.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Long ago and far away in 2.2 we started checking to ensure the files we
displayed in /proc were visible to the current process. It was an
unsophisticated time and no one was worried about functions full of FIXMES in
a stable kernel. As time passed the function became sacred and was enshrined
in the shrine of how things have always been. The fixes came in but only to
keep the function working no one really remembering or documenting why we did
things that way.
The intent and the functionality make a lot of sense. Don't let /proc be an
access point for files a process can see no other way. The implementation
however is completely wrong.
We are currently checking the root directories of the two processes, we are
not checking the actual file descriptors themselves.
We are strangely checking with a permission method instead of just when we use
the data.
This patch fixes the logic to actually check the file descriptors and make a
note that implementing a permission method for this part of /proc almost
certainly indicates a bug in the reasoning.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The inode operations only exist to support the proc_permission function.
Currently mem_read and mem_write have all the same permission checks as
ptrace. The fs check makes no sense in this context, and we can trivially get
around it by calling ptrace.
So simply the code by killing the strange weird case.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
First we can access every /proc/<tgid>/task/<pid> directory as /proc/<pid> so
proc_task_permission is not usefully limiting visibility.
Second having related filesystems information should have nothing to do with
process visibility. kill does not implement any checks like that.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The sole renaming use of proc_inode.type is to discover the file descriptor
number, so just store the file descriptor number and don't wory about
processing this field. This removes any /proc limits on the maximum number of
file descriptors, and clears the path to make the hard coded /proc inode
numbers go away.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently in /proc if the task is dumpable all of files are owned by the tasks
effective users. Otherwise the files are owned by root. Unless it is the
/proc/<tgid>/ or /proc/<tgid>/task/<pid> directory in that case we always make
the directory owned by the effective user.
However the special case for directories is pointless except as a way to read
the effective user, because the permissions on both of those directories are
world readable, and executable.
/proc/<tgid>/status provides a much better way to read a processes effecitve
userid, so it is silly to try to provide that on the directory.
So this patch simplifies the code by removing a pointless special case and
gets us one step closer to being able to remove the hard coded /proc inode
numbers.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The removed fields are already set by proc_alloc_inode. Initializing them in
proc_alloc_inode implies they need it for proper cleanup. At least ei->pde
was not set on all paths making it look like proc_alloc_inode was buggy. So
just remove the redundant assignments.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We already call everything except do_proc_readlink outside of the BKL in
proc_pid_followlink, and there appears to be nothing in do_proc_readlink that
needs any special protection.
So remove this leftover from one of the BKL cleanup efforts.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a /proc/<pid>/attr/keycreate entry that stores the appropriate context for
newly-created keys. Modify the selinux_key_alloc hook to make use of the new
entry. Update the flask headers to include a new "setkeycreate" permission
for processes. Update the flask headers to include a new "create" permission
for keys. Use the create permission to restrict which SIDs each task can
assign to newly-created keys. Add a new parameter to the security hook
"security_key_alloc" to indicate whether it is being invoked by the kernel, or
from userspace. If it is being invoked by the kernel, the security hook
should never fail. Update the documentation to reflect these changes.
Signed-off-by: Michael LeMay <mdlemay@epoch.ncsc.mil>
Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are places in the kernel where we look up files in fd tables and
access the file structure without holding refereces to the file. So, we
need special care to avoid the race between looking up files in the fd
table and tearing down of the file in another CPU. Otherwise, one might
see a NULL f_dentry or such torn down version of the file. This patch
fixes those special places where such a race may happen.
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
proc_check_chroot() does the check in a very unintuitive way (keeping a
copy of the argument, then modifying the argument), and has uncommented
sideeffects.
Signed-off-by: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Create a new file under /proc/self, called mountstats, where mounted file
systems can export information (configuration options, performance counters,
and so on). Use a mechanism similar to /proc/mounts and s_ops->show_options.
This mechanism does not violate namespace security, and is safe to use while
other processes are unmounting file systems.
Thanks to Mike Waychison for his review and comments.
Test-plan:
Test concurrent mount/unmount operations while cat'ing /proc/self/mountstats.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
fs: Use <linux/capability.h> where capable() is used.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
"proc_smaps_operations" is not defined in case of "CONFIG_MMU=n".
Signed-off-by: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
fs/proc/base.c: In function `proc_task_root_link':
fs/proc/base.c:364: warning: ISO C90 forbids mixed declarations and code
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When the main thread of a thread group has done pthread_exit() and died,
the other threads are still happily running, but will not be visible
under /proc because their leader is no longer accessible.
This fixes the access control so that we can see the sub-threads again.
Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Acked-by: Al Viro <viro@ftp.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With the use of RCU in files structure, the look-up of files using fds can now
be lock-free. The lookup is protected by rcu_read_lock()/rcu_read_unlock().
This patch changes the readers to use lock-free lookup.
Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Ravikiran Thirumalai <kiran_th@gmail.com>
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In order for the RCU to work, the file table array, sets and their sizes must
be updated atomically. Instead of ensuring this through too many memory
barriers, we put the arrays and their sizes in a separate structure. This
patch takes the first step of putting the file table elements in a separate
structure fdtable that is embedded withing files_struct. It also changes all
the users to refer to the file table using files_fdtable() macro. Subsequent
applciation of RCU becomes easier after this.
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Extract common code used by ptrace_attach() and may_ptrace_attach()
into a separate function.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <viro@parcelfarce.linux.theplanet.co.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes wrongly placed elements in the pid_directory_inos
enum. Also add comment so this mistake is not repeated.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <viro@parcelfarce.linux.theplanet.co.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch cleans up proc_cwd_link() and proc_root_link() by factoring
out common code into get_fs_struct().
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <viro@parcelfarce.linux.theplanet.co.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a "smaps" entry to /proc/pid: show howmuch memory is resident in each
mapping.
People that want to perform a memory consumption analysing can use it
mainly if someone needs to figure out which libraries can be reduced for
embedded systems. So the new features are the physical size of shared and
clean [or dirty]; private and clean [or dirty].
Take a look the example below:
# cat /proc/4576/smaps
08048000-080dc000 r-xp /bin/bash
Size: 592 KB
Rss: 500 KB
Shared_Clean: 500 KB
Shared_Dirty: 0 KB
Private_Clean: 0 KB
Private_Dirty: 0 KB
080dc000-080e2000 rw-p /bin/bash
Size: 24 KB
Rss: 24 KB
Shared_Clean: 0 KB
Shared_Dirty: 0 KB
Private_Clean: 0 KB
Private_Dirty: 24 KB
080e2000-08116000 rw-p
Size: 208 KB
Rss: 208 KB
Shared_Clean: 0 KB
Shared_Dirty: 0 KB
Private_Clean: 0 KB
Private_Dirty: 208 KB
b7e2b000-b7e34000 r-xp /lib/tls/libnss_files-2.3.2.so
Size: 36 KB
Rss: 12 KB
Shared_Clean: 12 KB
Shared_Dirty: 0 KB
Private_Clean: 0 KB
Private_Dirty: 0 KB
...
(Includes a cleanup from "Richard Purdie" <rpurdie@rpsys.net>)
From: Torsten Foertsch <torsten.foertsch@gmx.net>
show_smap calls first show_map and then prints its additional information to
the seq_file. show_map checks if all it has to print fits into the buffer and
if yes marks the current vma as written. While that is correct for show_map
it is not for show_smap. Here the vma should be marked as written only after
the additional information is also written.
The attached patch cures the problem. It moves the functionality of the
show_map function to a new function show_map_internal that is called with an
additional struct mem_size_stats* argument. Then show_map calls
show_map_internal with NULL as struct mem_size_stats* whereas show_smap calls
it with a real pointer. Now the final
if (m->count < m->size) /* vma is copied successfully */
m->version = (vma != get_gate_vma(task))? vma->vm_start: 0;
is done only if the whole entry fits into the buffer.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch was recently discussed on linux-mm:
http://marc.theaimsgroup.com/?t=112085728500002&r=1&w=2
I inherited a large code base from Ray for page migration. There was a
small patch in there that I find to be very useful since it allows the
display of the locality of the pages in use by a process. I reworked that
patch and came up with a /proc/<pid>/numa_maps that gives more information
about the vma's of a process. numa_maps is indexes by the start address
found in /proc/<pid>/maps. F.e. with this patch you can see the page use
of the "getty" process:
margin:/proc/12008 # cat maps
00000000-00004000 r--p 00000000 00:00 0
2000000000000000-200000000002c000 r-xp 00000000 08:04 516 /lib/ld-2.3.3.so
2000000000038000-2000000000040000 rw-p 00028000 08:04 516 /lib/ld-2.3.3.so
2000000000040000-2000000000044000 rw-p 2000000000040000 00:00 0
2000000000058000-2000000000260000 r-xp 00000000 08:04 54707842 /lib/tls/libc.so.6.1
2000000000260000-2000000000268000 ---p 00208000 08:04 54707842 /lib/tls/libc.so.6.1
2000000000268000-2000000000274000 rw-p 00200000 08:04 54707842 /lib/tls/libc.so.6.1
2000000000274000-2000000000280000 rw-p 2000000000274000 00:00 0
2000000000280000-20000000002b4000 r--p 00000000 08:04 9126923 /usr/lib/locale/en_US.utf8/LC_CTYPE
2000000000300000-2000000000308000 r--s 00000000 08:04 60071467 /usr/lib/gconv/gconv-modules.cache
2000000000318000-2000000000328000 rw-p 2000000000318000 00:00 0
4000000000000000-4000000000008000 r-xp 00000000 08:04 29576399 /sbin/mingetty
6000000000004000-6000000000008000 rw-p 00004000 08:04 29576399 /sbin/mingetty
6000000000008000-600000000002c000 rw-p 6000000000008000 00:00 0 [heap]
60000fff7fffc000-60000fff80000000 rw-p 60000fff7fffc000 00:00 0
60000ffffff44000-60000ffffff98000 rw-p 60000ffffff44000 00:00 0 [stack]
a000000000000000-a000000000020000 ---p 00000000 00:00 0 [vdso]
cat numa_maps
2000000000000000 default MaxRef=43 Pages=11 Mapped=11 N0=4 N1=3 N2=2 N3=2
2000000000038000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2
2000000000040000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
2000000000058000 default MaxRef=43 Pages=61 Mapped=61 N0=14 N1=15 N2=16 N3=16
2000000000268000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2
2000000000274000 default MaxRef=1 Pages=3 Mapped=3 Anon=3 N0=3
2000000000280000 default MaxRef=8 Pages=3 Mapped=3 N0=3
2000000000300000 default MaxRef=8 Pages=2 Mapped=2 N0=2
2000000000318000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N2=1
4000000000000000 default MaxRef=6 Pages=2 Mapped=2 N1=2
6000000000004000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
6000000000008000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
60000fff7fffc000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
60000ffffff44000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
getty uses ld.so. The first vma is the code segment which is used by 43
other processes and the pages are evenly distributed over the 4 nodes.
The second vma is the process specific data portion for ld.so. This is
only one page.
The display format is:
<startaddress> Links to information in /proc/<pid>/map
<memory policy> This can be "default" "interleave={}", "prefer=<node>" or "bind={<zones>}"
MaxRef= <maximum reference to a page in this vma>
Pages= <Nr of pages in use>
Mapped= <Nr of pages with mapcount >
Anon= <nr of anonymous pages>
Nx= <Nr of pages on Node x>
The content of the proc-file is self-evident. If this would be tied into
the sparsemem system then the contents of this file would not be too
useful.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a new `suid_dumpable' sysctl:
This value can be used to query and set the core dump mode for setuid
or otherwise protected/tainted binaries. The modes are
0 - (default) - traditional behaviour. Any process which has changed
privilege levels or is execute only will not be dumped
1 - (debug) - all processes dump core when possible. The core dump is
owned by the current user and no security is applied. This is intended
for system debugging situations only. Ptrace is unchecked.
2 - (suidsafe) - any binary which normally would not be dumped is dumped
readable by root only. This allows the end user to remove such a dump but
not access it directly. For security reasons core dumps in this mode will
not overwrite one another or other files. This mode is appropriate when
adminstrators are attempting to debug problems in a normal environment.
(akpm:
> > +EXPORT_SYMBOL(suid_dumpable);
>
> EXPORT_SYMBOL_GPL?
No problem to me.
> > if (current->euid == current->uid && current->egid == current->gid)
> > current->mm->dumpable = 1;
>
> Should this be SUID_DUMP_USER?
Actually the feedback I had from last time was that the SUID_ defines
should go because its clearer to follow the numbers. They can go
everywhere (and there are lots of places where dumpable is tested/used
as a bool in untouched code)
> Maybe this should be renamed to `dump_policy' or something. Doing that
> would help us catch any code which isn't using the #defines, too.
Fair comment. The patch was designed to be easy to maintain for Red Hat
rather than for merging. Changing that field would create a gigantic
diff because it is used all over the place.
)
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some KernelDoc descriptions are updated to match the current code.
No code changes.
Signed-off-by: Martin Waitz <tali@admingilde.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I have recompiled Linux kernel 2.6.11.5 documentation for me and our
university students again. The documentation could be extended for more
sources which are equipped by structured comments for recent 2.6 kernels. I
have tried to proceed with that task. I have done that more times from 2.6.0
time and it gets boring to do same changes again and again. Linux kernel
compiles after changes for i386 and ARM targets. I have added references to
some more files into kernel-api book, I have added some section names as well.
So please, check that changes do not break something and that categories are
not too much skewed.
I have changed kernel-doc to accept "fastcall" and "asmlinkage" words reserved
by kernel convention. Most of the other changes are modifications in the
comments to make kernel-doc happy, accept some parameters description and do
not bail out on errors. Changed <pid> to @pid in the description, moved some
#ifdef before comments to correct function to comments bindings, etc.
You can see result of the modified documentation build at
http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gz
Some more sources are ready to be included into kernel-doc generated
documentation. Sources has been added into kernel-api for now. Some more
section names added and probably some more chaos introduced as result of quick
cleanup work.
Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz>
Signed-off-by: Martin Waitz <tali@admingilde.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The current logic assumes that a /proc/<PID>/task directory should have a
hardlink count of 3, probably counting ".", "..", and a directory for a
single child task.
It's fairly obvious that this doesn't work out correctly when a PID has
more than one child task, which is quite often the case.
Signed-off-by: Daniel Drake <dsd@gentoo.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The pid directories in /proc/ currently return the wrong hardlink count - 3,
when there are actually 4 : ".", "..", "fd", and "task".
This is easy to notice using find(1):
cd /proc/<pid>
find
In the output, you'll see a message similar to:
find: WARNING: Hard link count is wrong for .: this may be a bug in your
filesystem driver. Automatically turning on find's -noleaf option.
Earlier results may have failed to include directories that should have
been searched.
http://bugs.gentoo.org/show_bug.cgi?id=86031
I also noticed that CONFIG_SECURITY can add a 5th: attr, and performed a
similar fix on the task directories too.
Signed-off-by: Daniel Drake <dsd@gentoo.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Attached is a new patch that solves the issue of getting valid credentials
into the LOGIN message. The current code was assuming that the audit context
had already been copied. This is not always the case for LOGIN messages.
To solve the problem, the patch passes the task struct to the function that
emits the message where it can get valid credentials.
Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
iscsi/lvm2/multipath needs guaranteed protection from the oom-killer, so
make the magical value of -17 in /proc/<pid>/oom_adj defeat the oom-killer
altogether.
(akpm: we still need to document oom_adj and friends in
Documentation/filesystems/proc.txt!)
Signed-off-by: Andrea Arcangeli <andrea@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!