linux

q3k/linux

Author	SHA1	Message	Date
Tom Zanussi	ac1adc55fc	tracing/filters: add filter_mutex to protect filter predicates This patch adds a filter_mutex to prevent the filter predicates from being accessed concurrently by various external functions. It's based on a previous patch by Li Zefan: "[PATCH 7/7] tracing/filters: make filter preds RCU safe" v2 changes: - fixed wrong value returned in a add_subsystem_pred() failure case noticed by Li Zefan. [ Impact: fix trace filter corruption/crashes on parallel access ] Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Tested-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: paulmck@linux.vnet.ibm.com LKML-Reference: <1239946028.6639.13.camel@tropicana> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-17 18:28:27 +02:00
Li Zefan	339ae5d3c3	tracing: fix file mode of trace and README trace is read-write and README is read-only. [ Impact: fix /debug/tracing/ file permissions. ] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <49E7EAB6.4070605@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-17 18:04:36 +02:00
Peter Zijlstra	c8a2500586	lockdep: more robust lockdep_map init sequence Steven Rostedt reported: > OK, I think I figured this bug out. This is a lockdep issue with respect > to tracepoints. > > The trace points in lockdep are called all the time. Outside the lockdep > logic. But if lockdep were to trigger an error / warning (which this run > did) we might be in trouble. For new locks, like the dentry->d_lock, that > are created, they will not get a name: > > void lockdep_init_map(struct lockdep_map lock, const char name, > struct lock_class_key *key, int subclass) > { > if (unlikely(!debug_locks)) > return; > > When a problem is found by lockdep, debug_locks becomes false. Thus we > stop allocating names for locks. This dentry->d_lock I had, now has no > name. Worse yet, I have CONFIG_DEBUG_VM set, that scrambles non > initialized memory. Thus, when the trace point was hit, it had junk for > the lock->name, and the machine crashed. Ah, nice catch. I think we should put at least the name in regardless. Ensure we at least initialize the trivial entries of the depmap so that they can be relied upon, even when lockdep itself decided to pack up and go home. [ Impact: fix lock tracing after lockdep warnings. ] Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1239954049.23397.4156.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-17 18:00:00 +02:00
Steven Rostedt	9ea21c1ecd	tracing/events: perform function tracing in event selftests We can find some bugs in the trace events if we stress the writes as well. The function tracer is a good way to stress the events. [ Impact: extend scope of event tracer self-tests ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20090416161746.604786131@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-17 17:10:35 +02:00
Avadh Patel	69abe6a5d1	tracing: add saved_cmdlines file to show cached task comms Export the cached task comms to userspace. This allows user apps to translate the pids from a trace into their respective task command lines. [ Impact: let userspace apps reading binary buffer know comm's of pids ] Signed-off-by: Avadh Patel <avadh4all@gmail.com> [ added error checking and use of buf pointer to index file_buf ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-17 17:04:12 +02:00
Steven Rostedt	d1b182a8d4	tracing/events/ring-buffer: expose format of ring buffer headers to users Currently, every thing needed to read the binary output from the ring buffers is available, with the exception of the way the ring buffers handles itself internally. This patch creates two special files in the debugfs/tracing/events directory: # cat /debug/tracing/events/header_page field: u64 timestamp; offset:0; size:8; field: local_t commit; offset:8; size:8; field: char data; offset:16; size:4080; # cat /debug/tracing/events/header_event type : 2 bits len : 3 bits time_delta : 27 bits array : 32 bits padding : type == 0 time_extend : type == 1 data : type == 3 This is to allow a userspace app to see if the ring buffer format changes or not. [ Impact: allow userspace apps to know of ringbuffer format changes ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-17 17:03:28 +02:00
Steven Rostedt	e6187007d6	tracing/events: add startup tests for events As events start to become popular, and the new way to add tracing infrastructure into ftrace, it is important to catch any problems that might happen with a mistake in the TRACE_EVENT macro. This patch introduces a startup self test on the registered trace events. Note, it can only do a generic test, any type of testing that needs more involement is needed to be implemented by the tracepoint creators. The test goes down one by one enabling a trace point and running some random tasks (random in the sense that I just made them up). Those tasks are creating threads, grabbing mutexes and spinlocks and using workqueues. After testing each event individually, it does the same test after enabling each system of trace points. Like sched, irq, lockdep. Then finally it enables all tracepoints and performs the tasks again. The output to the console on bootup will look like this when everything works: Running tests on trace events: Testing event kfree_skb: OK Testing event kmalloc: OK Testing event kmem_cache_alloc: OK Testing event kmalloc_node: OK Testing event kmem_cache_alloc_node: OK Testing event kfree: OK Testing event kmem_cache_free: OK Testing event irq_handler_exit: OK Testing event irq_handler_entry: OK Testing event softirq_entry: OK Testing event softirq_exit: OK Testing event lock_acquire: OK Testing event lock_release: OK Testing event sched_kthread_stop: OK Testing event sched_kthread_stop_ret: OK Testing event sched_wait_task: OK Testing event sched_wakeup: OK Testing event sched_wakeup_new: OK Testing event sched_switch: OK Testing event sched_migrate_task: OK Testing event sched_process_free: OK Testing event sched_process_exit: OK Testing event sched_process_wait: OK Testing event sched_process_fork: OK Testing event sched_signal_send: OK Running tests on trace event systems: Testing event system skb: OK Testing event system kmem: OK Testing event system irq: OK Testing event system lockdep: OK Testing event system sched: OK Running tests on all trace events: Testing all events: OK [ folded in: tracing: add #include <linux/delay.h> to fix build failure in test_work() This build failure occured on a few rare configs: kernel/trace/trace_events.c: In function ‘test_work’: kernel/trace/trace_events.c:975: error: implicit declaration of function ‘udelay’ kernel/trace/trace_events.c:980: error: implicit declaration of function ‘msleep’ delay.h is included in way too many other headers, hiding cases where new usage is added without header inclusion. [ Impact: build fix ] Signed-off-by: Ingo Molnar <mingo@elte.hu> ] [ Impact: add event tracer self-tests ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-17 17:01:37 +02:00
Steven Rostedt	93eb677d74	ftrace: use module notifier for function tracer The hooks in the module code for the function tracer must be called before any of that module code runs. The function tracer hooks modify the module (replacing calls to mcount to nops). If the code is executed while the change occurs, then the CPU can take a GPF. To handle the above with a bit of paranoia, I originally implemented the hooks as calls directly from the module code. After examining the notifier calls, it looks as though the start up notify is called before any of the module's code is executed. This makes the use of the notify safe with ftrace. Only the startup notify is required to be "safe". The shutdown simply removes the entries from the ftrace function list, and does not modify any code. This change has another benefit. It removes a issue with a reverse dependency in the mutexes of ftrace_lock and module_mutex. [ Impact: fix lock dependency bug, cleanup ] Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-17 16:59:15 +02:00
Linus Torvalds	9f76208c33	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Fix branch tracer header tracing: Fix power tracer header	2009-04-16 18:17:22 -07:00
Linus Torvalds	d06be22150	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Avoid printing sched_group::__cpu_power for default case tracing, sched: mark get_parent_ip() notrace	2009-04-16 18:16:29 -07:00
Linus Torvalds	4d831f53dd	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: kernel/softirq.c: fix sparse warning rcu: Make hierarchical RCU less IPI-happy	2009-04-16 17:56:39 -07:00
H Hartley Sweeten	79d381c9f2	kernel/softirq.c: fix sparse warning Fix sparse warning in kernel/softirq.c. warning: do-while statement is not a compound statement Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> LKML-Reference: <BD79186B4FD85F4B8E60E381CAEE1909015F9033@mi8nycmail19.Mi8.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-17 01:57:54 +02:00
Gautham R Shenoy	381512cf3d	sched: Avoid printing sched_group::__cpu_power for default case Commit `46e0bb9c12` ("sched: Print sched_group::__cpu_power in sched_domain_debug") produces a messy dmesg output while attempting to print the sched_group::__cpu_power for each group in the sched_domain hierarchy. Fix this by avoid printing the __cpu_power for default cases. (i.e, __cpu_power == SCHED_LOAD_SCALE). [ Impact: reduce syslog clutter ] Reported-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Fixed-by: Tony Luck <tony.luck@intel.com> Cc: a.p.zijlstra@chello.nl LKML-Reference: <20090414033936.GA534@in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-17 00:46:05 +02:00
Li Zefan	f3948f8857	blktrace: fix context-info when mixed-using blk tracer and trace events When current tracer is set to blk tracer, TRACE_ITER_CONTEXT_INFO is unset, but actually context-info is printed: pdflush-431 [000] 821.181576: 8,0 P N [pdflush] And then if we enable TRACE_ITER_CONTEXT_INFO: # echo context-info > trace_options We'll see context-info printed twice. What's worse, when we use blk tracer and trace events at the same time, we'll see no context-info for trace events at all: jbd2_commit_logging: dev dm-0:8 transaction 333227 jbd2_end_commit: dev dm-0:8 transaction 333227 head 332814 rm-25433 [001] 9578.307485: 8,18 m N cfq25433 slice expired t=0 rm-25433 [001] 9578.307486: 8,18 m N cfq25433 put_queue This patch adds blk_tracer->set_flags(), and context-info flag is unset only when we set the output to classic mode. Note after this patch, one should unset context-info explicitly if he wants to get binary output that can be parsed by blkparse: # echo nocontext-info > trace_options # echo bin > trace_options # echo blk > current_tracer # cat trace_pipe \| blkparse -i - Reported-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <49E54E60.50408@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-16 10:11:01 +02:00
Li Zefan	1d54ad6da9	blktrace: add trace/ to /sys/block/sda Impact: allow ftrace-plugin blktrace to trace device-mapper devices To trace a single partition: # echo 1 > /sys/block/sda/sda1/enable To trace the whole sda instead: # echo 1 > /sys/block/sda/enable Thus we also fix an issue reported by Ted, that ftrace-plugin blktrace can't be used to trace device-mapper devices. Now: # echo 1 > /sys/block/dm-0/trace/enable echo: write error: No such device or address # mount -t ext4 /dev/dm-0 /mnt # echo 1 > /sys/block/dm-0/trace/enable # echo blk > /debug/tracing/current_tracer Reported-by: Theodore Tso <tytso@mit.edu> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Shawn Du <duyuyang@gmail.com> Cc: Jens Axboe <jens.axboe@oracle.com> LKML-Reference: <49E42665.6020506@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-16 10:10:59 +02:00
Li Zefan	9908c30997	blktrace: support per-partition tracing for ftrace plugin The previous patch adds support to trace a single partition for relay+ioctl blktrace, and this patch is for ftrace plugin blktrace: # echo 1 > /sys/block/sda/sda7/enable # cat start_lba 102398373 # cat end_lba 102703545 Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Shawn Du <duyuyang@gmail.com> Cc: Jens Axboe <jens.axboe@oracle.com> LKML-Reference: <49E42646.4060608@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-16 10:10:58 +02:00
Shawn Du	d0deef5b14	blktrace: support per-partition tracing Though one can specify '-d /dev/sda1' when using blktrace, it still traces the whole sda. To support per-partition tracing, when we start tracing, we initialize bt->start_lba and bt->end_lba to the start and end sector of that partition. Note some actions are per device, thus we don't filter 0-sector events. The original patch and discussion can be found here: http://marc.info/?l=linux-btrace&m=122949374214540&w=2 Signed-off-by: Shawn Du <duyuyang@gmail.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> LKML-Reference: <49E42620.4050701@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-16 10:10:57 +02:00
David Howells	5b1d07ed0e	RCU: Don't try and predeclare inline funcs as it upsets some versions of gcc Don't try and predeclare inline funcs like this: static inline void wait_migrated_callbacks(void) ... static void _rcu_barrier(enum rcu_barrier type) { ... wait_migrated_callbacks(); } ... static inline void wait_migrated_callbacks(void) { wait_event(rcu_migrate_wq, !atomic_read(&rcu_migrate_type_count)); } as it upsets some versions of gcc under some circumstances: kernel/rcupdate.c: In function `_rcu_barrier': kernel/rcupdate.c:125: sorry, unimplemented: inlining failed in call to 'wait_migrated_callbacks': function body not available kernel/rcupdate.c:152: sorry, unimplemented: called from here This can be dealt with by simply putting the static variables (rcu_migrate_*) at the top, and moving the implementation of the function up so that it replaces its forward declaration. Signed-off-by: David Howells <dhowells@redhat.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-15 13:55:14 -07:00
Nikanth Karthikesan	297dbf50d7	swap: Remove code handling bio_alloc failure with __GFP_WAIT Remove code handling bio_alloc failure with __GFP_WAIT. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 12:10:13 +02:00
Steven Rostedt	ad8d75fff8	tracing/events: move trace point headers into include/trace/events Impact: clean up Create a sub directory in include/trace called events to keep the trace point headers in their own separate directory. Only headers that declare trace points should be defined in this directory. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Zhao Lei <zhaolei@cn.fujitsu.com> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-14 22:05:43 -04:00
Steven Rostedt	61f919a12f	tracing/events: fix compile for modules disabled Impact: compile fix The addition of TRACE_EVENT for modules breaks the build for when modules are disabled. This code fixes that. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-14 22:04:19 -04:00
Steven Rostedt	6d723736e4	tracing/events: add support for modules to TRACE_EVENT Impact: allow modules to add TRACE_EVENTS on load This patch adds the final hooks to allow modules to use the TRACE_EVENT macro. A notifier and a data structure are used to link the TRACE_EVENTs defined in the module to connect them with the ftrace event tracing system. It also adds the necessary automated clean ups to the trace events when a module is removed. Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-14 12:58:03 -04:00
Steven Rostedt	17c873ec28	tracing/events: add export symbols for trace events in modules Impact: let modules add trace events The trace event code requires some functions to be exported to allow modules to use TRACE_EVENT. This patch adds EXPORT_SYMBOL_GPL to the necessary functions. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-14 12:58:01 -04:00
Steven Rostedt	a59fd60272	tracing/events: convert event call sites to use a link list Impact: makes it possible to define events in modules The events are created by reading down the section that they are linked in by the macros. But this is not scalable to modules. This patch converts the manipulations to use a global link list, and on boot up it adds the items in the section to the list. This change will allow modules to add their tracing events to the list as well. Note, this change alone does not permit modules to use the TRACE_EVENT macros, but the change is needed for them to eventually do so. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-14 12:58:00 -04:00
Steven Rostedt	f42c85e74f	tracing/events: move the ftrace event tracing code to core This patch moves the ftrace creation into include/trace/ftrace.h and simplifies the work of developers in adding new tracepoints. Just the act of creating the trace points in include/trace and including define_trace.h will create the events in the debugfs/tracing/events directory. This patch removes the need of include/trace/trace_events.h Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-14 12:57:59 -04:00
Steven Rostedt	97f2025153	tracing/events: move declarations from trace directory to core include In preparation to allowing trace events to happen in modules, we need to move some of the local declarations in the kernel/trace directory into include/linux. This patch simply moves the declarations and performs no context changes. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-14 12:57:58 -04:00
Steven Rostedt	9504504cba	tracing: make trace_seq operations available for core kernel In the process to make TRACE_EVENT macro work for modules, the trace_seq operations must be available for core kernel code. These operations are quite useful and can be used for other implementations. The main idea is that we create a trace_seq handle that acts very much like the seq_file handle. struct trace_seq s = kmalloc(sizeof(s, GFP_KERNEL); trace_seq_init(s); trace_seq_printf(s, "some data %d\n", variable); printk("%s", s->buffer); The main use is to allow a top level function call several other functions that may store printf like data into the buffer. Then at the end, the top level function can process all the data with any method it would like to. It could be passed to userspace, output via printk or even use seq_file: trace_seq_to_user(s, ubuf, cnt); seq_puts(m, s->buffer); Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-14 12:57:57 -04:00
Steven Rostedt	a8d154b009	tracing: create automated trace defines This patch lowers the number of places a developer must modify to add new tracepoints. The current method to add a new tracepoint into an existing system is to write the trace point macro in the trace header with one of the macros TRACE_EVENT, TRACE_FORMAT or DECLARE_TRACE, then they must add the same named item into the C file with the macro DEFINE_TRACE(name) and then add the trace point. This change cuts out the needing to add the DEFINE_TRACE(name). Every file that uses the tracepoint must still include the trace/<type>.h file, but the one C file must also add a define before the including of that file. #define CREATE_TRACE_POINTS #include <trace/mytrace.h> This will cause the trace/mytrace.h file to also produce the C code necessary to implement the trace point. Note, if more than one trace/<type>.h is used to create the C code it is best to list them all together. #define CREATE_TRACE_POINTS #include <trace/foo.h> #include <trace/bar.h> #include <trace/fido.h> Thanks to Mathieu Desnoyers and Christoph Hellwig for coming up with the cleaner solution of the define above the includes over my first design to have the C code include a "special" header. This patch converts sched, irq and lockdep and skb to use this new method. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Zhao Lei <zhaolei@cn.fujitsu.com> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-14 12:57:28 -04:00
Ingo Molnar	56449f437a	tracing: make the trace clocks available generally Jeremy Fitzhardinge reported this build failure: LD .tmp_vmlinux1 arch/x86/kernel/built-in.o: In function `ds_take_timestamp': git/linux/arch/x86/kernel/ds.c:1380: undefined reference to `trace_clock_global' git/linux/arch/x86/kernel/ds.c:1380: undefined reference to `trace_clock_global' Which is due to !CONFIG_TRACING && CONFIG_X86_DS=y. Expose the trace clock code to CONFIG_X86_DS as well. [ Unfortunately librarizing doesnt work well - ancient architectures with no raw_local_irq_save() primitive break the build. ] Reported-by: Jeremy Fitzhardinge <jeremy@goop.org> LKML-Reference: <49E4413F.7070700@goop.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-14 18:35:13 +02:00
Steven Rostedt	ea20d9293c	tracing: consolidate trace and trace_event headers Impact: clean up Neil Horman (et. al.) criticized the way the trace events were broken up into two files. The reason for that was that ftrace needed to separate out the declarations from where the #include <linux/tracepoint.h> was used. It then dawned on me that the tracepoint.h header only needs to define the TRACE_EVENT macro if it is not already defined. The solution is simply to test if TRACE_EVENT is defined, and if it is not then the linux/tracepoint.h header can define it. This change consolidates all the <traces>.h and <traces>_event_types.h into the <traces>.h file. Reported-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Theodore Tso <tytso@mit.edu> Reported-by: Jiaying Zhang <jiayingz@google.com> Cc: Zhaolei <zhaolei@cn.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-14 09:43:40 -04:00
Pallipadi, Venkatesh	6ec3cfeca0	x86, irq: Remove IRQ_DISABLED check in process context IRQ move As discussed in the thread here: http://marc.info/?l=linux-kernel&m=123964468521142&w=2 Eric W. Biederman observed: > It looks like some additional bugs have slipped in since last I looked. > > set_irq_affinity does this: > ifdef CONFIG_GENERIC_PENDING_IRQ > if (desc->status & IRQ_MOVE_PCNTXT \|\| desc->status & IRQ_DISABLED) { > cpumask_copy(desc->affinity, cpumask); > desc->chip->set_affinity(irq, cpumask); > } else { > desc->status \|= IRQ_MOVE_PENDING; > cpumask_copy(desc->pending_mask, cpumask); > } > #else > > That IRQ_DISABLED case is a software state and as such it has nothing to > do with how safe it is to move an irq in process context. [...] > > The only reason we migrate MSIs in interrupt context today is that there > wasn't infrastructure for support migration both in interrupt context > and outside of it. Yes. The idea here was to force the MSI migration to happen in process context. One of the patches in the series did disable_irq(dev->irq); irq_set_affinity(dev->irq, cpumask_of(dev->cpu)); enable_irq(dev->irq); with the above patch adding irq/manage code check for interrupt disabled and moving the interrupt in process context. IIRC, there was no IRQ_MOVE_PCNTXT when we were developing this HPET code and we ended up having this ugly hack. IRQ_MOVE_PCNTXT was there when we eventually submitted the patch upstream. But, looks like I did a blind rebasing instead of using IRQ_MOVE_PCNTXT in hpet MSI code. Below patch fixes this. i.e., revert commit `932775a4ab` and add PCNTXT to HPET MSI setup. Also removes copying of desc->affinity in generic code as set_affinity routines are doing it internally. Reported-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Li Shaohua" <shaohua.li@intel.com> Cc: Gary Hade <garyhade@us.ibm.com> Cc: "lcm@us.ibm.com" <lcm@us.ibm.com> Cc: suresh.b.siddha@intel.com LKML-Reference: <20090413222058.GB8211@linux-os.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-14 15:21:13 +02:00
Paul E. McKenney	ef631b0ca0	rcu: Make hierarchical RCU less IPI-happy This patch fixes a hierarchical-RCU performance bug located by Anton Blanchard. The problem stems from a misguided attempt to provide a work-around for jiffies-counter failure. This work-around uses a per-CPU n_rcu_pending counter, which is incremented on each call to rcu_pending(), which in turn is called from each scheduling-clock interrupt. Each CPU then treats this counter as a surrogate for the jiffies counter, so that if the jiffies counter fails to advance, the per-CPU n_rcu_pending counter will cause RCU to invoke force_quiescent_state(), which in turn will (among other things) send resched IPIs to CPUs that have thus far failed to pass through an RCU quiescent state. Unfortunately, each CPU resets only its own counter after sending a batch of IPIs. This means that the other CPUs will also (needlessly) send -another- round of IPIs, for a full N-squared set of IPIs in the worst case every three scheduler-clock ticks until the grace period finally ends. It is not reasonable for a given CPU to reset each and every n_rcu_pending for all the other CPUs, so this patch instead simply disables the jiffies-counter "training wheels", thus eliminating the excessive IPIs. Note that the jiffies-counter IPIs do not have this problem due to the fact that the jiffies counter is global, so that the CPU sending the IPIs can easily reset things, thus preventing the other CPUs from sending redundant IPIs. Note also that the n_rcu_pending counter remains, as it will continue to be used for tracing. It may also see use to update the jiffies counter, should an appropriate kick-the-jiffies-counter API appear. Located-by: Anton Blanchard <anton@au1.ibm.com> Tested-by: Anton Blanchard <anton@au1.ibm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: anton@samba.org Cc: akpm@linux-foundation.org Cc: dipankar@in.ibm.com Cc: manfred@colorfullife.com Cc: cl@linux-foundation.org Cc: josht@linux.vnet.ibm.com Cc: schamp@sgi.com Cc: niv@us.ibm.com Cc: dvhltc@us.ibm.com Cc: ego@in.ibm.com Cc: laijs@cn.fujitsu.com Cc: rostedt@goodmis.org Cc: peterz@infradead.org Cc: penberg@cs.helsinki.fi Cc: andi@firstfloor.org Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> LKML-Reference: <12396834793575-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-14 11:31:50 +02:00
Zhaolei	557055bebe	tracing: Fix branch tracer header Before patch: # tracer: branch # # TASK-PID CPU# TIMESTAMP FUNCTION # \| \| \| \| \| <...>-2981 [000] 24008.872738: [ ok ] trace_irq_handler_exit:irq_event_types.h:41 <...>-2981 [000] 24008.872742: [ ok ] note_interrupt:spurious.c:229 ... After patch: # tracer: branch # # TASK-PID CPU# TIMESTAMP CORRECT FUNC:FILE:LINE # \| \| \| \| \| \| <...>-2985 [000] 26329.142970: [ ok ] slab_free:slub.c:1776 <...>-2985 [000] 26329.142972: [ ok ] trace_kmem_cache_free:kmem_event_types.h:191 ... Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> LKML-Reference: <49E2F19A.3040006@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-14 02:30:36 +02:00
Lai Jiangshan	132380a06b	tracing, sched: mark get_parent_ip() notrace Impact: remove overly redundant tracing entries When tracer is "function" or "function_graph", way too much "get_parent_ip" entries are recorded in ring_buffer. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Steven Rostedt <srostedt@redhat.com> LKML-Reference: <49D458B1.5000703@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-14 02:11:05 +02:00
Andi Kleen	3d26dcf767	kernel/sys.c: clean up sys_shutdown exit path Impact: cleanup, fix Clean up sys_shutdown() exit path. Factor out common code. Return correct error code instead of always 0 on failure. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-13 15:04:32 -07:00
Oleg Nesterov	f1671f6d78	ptrace: fix exit_ptrace() vs ptrace_traceme() race Pointed out by Roland. The bug was recently introduced by me in "forget_original_parent: split out the un-ptrace part", commit `39c626ae47`. Since that patch we have a window after exit_ptrace() drops tasklist and before forget_original_parent() takes it again. In this window the child can do ptrace(PTRACE_TRACEME) and nobody can untrace this child after that. Change ptrace_traceme() to not attach to the exiting ->real_parent. We don't report the error in this case, we pretend we attach right before ->real_parent calls exit_ptrace() which should untrace us anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-13 15:04:31 -07:00
Peter Zijlstra	4be6f6bb66	mm: move the scan_unevictable_pages sysctl to the vm table vm knobs should go in the vm table. Probably too late for randomize_va_space though. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-13 15:04:28 -07:00
Tom Zanussi	0a19e53c15	tracing/filters: allow on-the-fly filter switching This patch allows event filters to be safely removed or switched on-the-fly while avoiding the use of rcu or the suspension of tracing of previous versions. It does it by adding a new filter_pred_none() predicate function which does nothing and by never deallocating either the predicates or any of the filter_pred members used in matching; the predicate lists are allocated and initialized during ftrace_event_calls initialization. Whenever a filter is removed or replaced, the filter_pred_* functions currently in use by the affected ftrace_event_call are immediately switched over to to the filter_pred_none() function, while the rest of the filter_pred members are left intact, allowing any currently executing filter_pred_* functions to finish up, using the values they're currently using. In the case of filter replacement, the new predicate values are copied into the old predicates after the above step, and the filter_pred_none() functions are replaced by the filter_pred_* functions for the new filter. In this case, it is possible though very unlikely that a previous filter_pred_* is still running even after the filter_pred_none() switch and the switch to the new filter_pred_. In that case, however, because nothing has been deallocated in the filter_pred, the worst that can happen is that the old filter_pred_ function sees the new values and as a result produces either a false positive or a false negative, depending on the values it finds. So one downside to this method is that rarely, it can produce a bad match during the filter switch, but it should be possible to live with that, IMHO. The other downside is that at least in this patch the predicate lists are always pre-allocated, taking up memory from the start. They could probably be allocated on first-use, and de-allocated when tracing is completely stopped - if this patch makes sense, I could create another one to do that later on. Oh, and it also places a restriction on the size of __arrays in events, currently set to 128, since they can't be larger than the now embedded str_val arrays in the filter_pred struct. Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: paulmck@linux.vnet.ibm.com LKML-Reference: <1239610670.6660.49.camel@tropicana> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-14 00:03:55 +02:00
Ingo Molnar	b5c851a88a	Merge branch 'linus' into tracing/core Merge reason: merge latest tracing fixes to avoid conflicts in kernel/trace/trace_events_filter.c with upcoming change Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-14 00:02:22 +02:00
Tom Zanussi	eb02ce017d	tracing/filters: use ring_buffer_discard_commit() in filter_check_discard() This patch changes filter_check_discard() to make use of the new ring_buffer_discard_commit() function and modifies the current users to call the old commit function in the non-discard case. It also introduces a version of filter_check_discard() that uses the global trace buffer (filter_current_check_discard()) for those cases. v2 changes: - fix compile error noticed by Ingo Molnar Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: fweisbec@gmail.com LKML-Reference: <1239178554.10295.36.camel@tropicana> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-14 00:00:56 +02:00
Tom Zanussi	5f77a88b3f	tracing/infrastructure: separate event tracer from event support Add a new config option, CONFIG_EVENT_TRACING that gets selected when CONFIG_TRACING is selected and adds everything needed by the stuff in trace_export - basically all the event tracing support needed by e.g. bprint, minus the actual events, which are only included if CONFIG_EVENT_TRACER is selected. So CONFIG_EVENT_TRACER can be used to turn on or off the generated events (what I think of as the 'event tracer'), while CONFIG_EVENT_TRACING turns on or off the base event tracing support used by both the event tracer and the other things such as bprint that can't be configured out. Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: fweisbec@gmail.com LKML-Reference: <1239178441.10295.34.camel@tropicana> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-14 00:00:55 +02:00
Steven Rostedt	77d9f465d4	tracing/filters: use ring_buffer_discard_commit for discarded events The ring_buffer_discard_commit makes better usage of the ring_buffer when an event has been discarded. It tries to remove it completely if possible. This patch converts the trace event filtering to use ring_buffer_discard_commit instead of the ring_buffer_event_discard. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-14 00:00:54 +02:00
Steven Rostedt	fa1b47dd85	ring-buffer: add ring_buffer_discard_commit The ring_buffer_discard_commit is similar to ring_buffer_event_discard but it can only be done on an event that has yet to be commited. Unpredictable results can happen otherwise. The main difference between ring_buffer_discard_commit and ring_buffer_event_discard is that ring_buffer_discard_commit will try to free the data in the ring buffer if nothing has addded data after the reserved event. If something did, then it acts almost the same as ring_buffer_event_discard followed by a ring_buffer_unlock_commit. Note, either ring_buffer_commit_discard and ring_buffer_unlock_commit can be called on an event, not both. This commit also exports both discard functions to be usable by GPL modules. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-14 00:00:53 +02:00
Tom Zanussi	e45f2e2bd2	tracing/filters: add TRACE_EVENT_FORMAT_NOFILTER event macro Frederic Weisbecker suggested that the trace_special event shouldn't be filterable; this patch adds a TRACE_EVENT_FORMAT_NOFILTER event macro that allows an event format to be exported without having a filter attached, and removes filtering from the trace_special event. Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-14 00:00:51 +02:00
Tom Zanussi	e1112b4d96	tracing/filters: add run-time field descriptions to TRACE_EVENT_FORMAT events This patch adds run-time field descriptions to all the event formats exported using TRACE_EVENT_FORMAT. It also hooks up all the tracers that use them (i.e. the tracers in the 'ftrace subsystem') so they can also have their output filtered by the event-filtering mechanism. When I was testing this, there were a couple of things that fooled me into thinking the filters weren't working, when actually they were - I'll mention them here so others don't make the same mistakes (and file bug reports. ;-) One is that some of the tracers trace multiple events e.g. the sched_switch tracer uses the context_switch and wakeup events, and if you don't set filters on all of the traced events, the unfiltered output from the events without filters on them can make it look like the filtering as a whole isn't working properly, when actually it is doing what it was asked to do - it just wasn't asked to do the right thing. The other is that for the really high-volume tracers e.g. the function tracer, the volume of filtered events can be so high that it pushes the unfiltered events out of the ring buffer before they can be read so e.g. cat'ing the trace file repeatedly shows either no output, or once in awhile some output but that isn't there the next time you read the trace, which isn't what you normally expect when reading the trace file. If you read from the trace_pipe file though, you can catch them before they disappear. Changes from v1: As suggested by Frederic Weisbecker: - get rid of externs in functions - added unlikely() to filter_check_discard() Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-14 00:00:50 +02:00
Zhaolei	a3d03ecaf9	tracing: Fix power tracer header Before patch: # tracer: power # # TASK-PID CPU# TIMESTAMP FUNCTION # \| \| \| \| \| [ 676.875865889] CSTATE: Going to C1 on cpu 0 for 0.005911463 [ 676.882938805] CSTATE: Going to C1 on cpu 0 for 0.104796532 ... After patch: # tracer: power # # TIMESTAMP STATE EVENT # \| \| \| [ 676.875865889] CSTATE: Going to C1 on cpu 0 for 0.005911463 [ 676.882938805] CSTATE: Going to C1 on cpu 0 for 0.104796532 ... v2: Use seq_puts instead of seq_printf Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Tom Zanussi <tzanussi@gmail.com> LKML-Reference: <49E2E889.5000903@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-13 23:39:57 +02:00
Rafael J. Wysocki	c751085943	PM/Hibernate: Wait for SCSI devices scan to complete during resume There is a race between resume from hibernation and the asynchronous scanning of SCSI devices and to prevent it from happening we need to call scsi_complete_async_scans() during resume from hibernation. In addition, if the resume from hibernation is userland-driven, it's better to wait for all device probes in the kernel to complete before attempting to open the resume device. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-13 11:37:07 -07:00
Linus Torvalds	8255309b88	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing/filters: return proper error code when writing filter file tracing/filters: allow user input integer to be oct or hex tracing/filters: fix NULL pointer dereference tracing/filters: NIL-terminate user input filter ftrace: Output REC->var instead of __entry->var for trace format Make __stringify support variable argument macros too tracing: fix document references tracing: fix splice return too large tracing: update file->f_pos when splice(2) it tracing: allocate page when needed tracing: disable seeking for trace_pipe_raw	2009-04-13 11:31:28 -07:00
Linus Torvalds	bf20753c0c	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: continue lock debugging despite some taints lockdep: warn about lockdep disabling after kernel taint	2009-04-13 11:30:26 -07:00
Linus Torvalds	d811f236d9	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: percpu: unbreak alpha percpu mutex: have non-spinning mutexes on s390 by default	2009-04-13 08:22:43 -07:00
Frederic Weisbecker	574bbe7820	lockdep: continue lock debugging despite some taints Impact: broaden lockdep checks Lockdep is disabled after any kernel taints. This might be convenient to ignore bad locking issues which sources come from outside the kernel tree. Nevertheless, it might be a frustrating experience for the staging developers or those who experience a warning but are focused on another things that require lockdep. The v2 of this patch simply don't disable anymore lockdep in case of TAINT_CRAP and TAINT_WARN events. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: LTP <ltp-list@lists.sourceforge.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Greg KH <gregkh@suse.de> LKML-Reference: <1239412638-6739-2-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-12 16:10:52 +02:00
Frederic Weisbecker	9eeba6138c	lockdep: warn about lockdep disabling after kernel taint Impact: provide useful missing info for developers Kernel taint can occur in several situations such as warnings, load of prorietary or staging modules, bad page, etc... But when such taint happens, a developer might still be working on the kernel, expecting that lockdep is still enabled. But a taint disables lockdep without ever warning about it. Such a kernel behaviour doesn't really help for kernel development. This patch adds this missing warning. Since the taint is done most of the time after the main message that explain the real source issue, it seems safe to warn about it inside add_taint() so that it appears at last, without hurting the main information. v2: Use a generic helper to disable lockdep instead of an open coded xchg(). Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <1239412638-6739-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-12 16:10:51 +02:00
Li Zefan	66de7792c0	blktrace: fix output of BLK_TC_PC events BLK_TC_PC events should be treated differently with BLK_TC_FS events. Before this patch: # echo 1 > /sys/block/sda/sda1/trace/enable # echo pc > /sys/block/sda/sda1/trace/act_mask # echo blk > /debugfs/tracing/current_tracer # (generate some BLK_TC_PC events) # cat trace bash-2184 [000] 1774.275413: 8,7 I N [bash] bash-2184 [000] 1774.275435: 8,7 D N [bash] bash-2184 [000] 1774.275540: 8,7 I R [bash] bash-2184 [000] 1774.275547: 8,7 D R [bash] ksoftirqd/0-4 [000] 1774.275580: 8,7 C N 0 [0] bash-2184 [000] 1774.275648: 8,7 I R [bash] bash-2184 [000] 1774.275653: 8,7 D R [bash] ksoftirqd/0-4 [000] 1774.275682: 8,7 C N 0 [0] bash-2184 [000] 1774.275739: 8,7 I R [bash] bash-2184 [000] 1774.275744: 8,7 D R [bash] ksoftirqd/0-4 [000] 1774.275771: 8,7 C N 0 [0] bash-2184 [000] 1774.275804: 8,7 I R [bash] bash-2184 [000] 1774.275808: 8,7 D R [bash] ksoftirqd/0-4 [000] 1774.275836: 8,7 C N 0 [0] After this patch: # cat trace bash-2263 [000] 366.782149: 8,7 I N 0 (00 ..) [bash] bash-2263 [000] 366.782323: 8,7 D N 0 (00 ..) [bash] bash-2263 [000] 366.782557: 8,7 I R 8 (25 00 ..) [bash] bash-2263 [000] 366.782560: 8,7 D R 8 (25 00 ..) [bash] ksoftirqd/0-4 [000] 366.782582: 8,7 C N (25 00 ..) [0] bash-2263 [000] 366.782648: 8,7 I R 8 (5a 00 3f 00) [bash] bash-2263 [000] 366.782650: 8,7 D R 8 (5a 00 3f 00) [bash] ksoftirqd/0-4 [000] 366.782669: 8,7 C N (5a 00 3f 00) [0] bash-2263 [000] 366.782710: 8,7 I R 8 (5a 00 08 00) [bash] bash-2263 [000] 366.782713: 8,7 D R 8 (5a 00 08 00) [bash] ksoftirqd/0-4 [000] 366.782730: 8,7 C N (5a 00 08 00) [0] bash-2263 [000] 366.783375: 8,7 I R 36 (5a 00 08 00) [bash] bash-2263 [000] 366.783379: 8,7 D R 36 (5a 00 08 00) [bash] ksoftirqd/0-4 [000] 366.783404: 8,7 C N (5a 00 08 00) [0] This is what we do with PC events in user-space blktrace. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <49D32387.9040106@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-12 15:32:46 +02:00
Li Zefan	b78825d608	blktrace: fix output of unknown events Not all events are pc (packet command) events. An event is a pc event only if it has BLK_TC_PC bit set. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <49D3236D.3090705@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-12 15:32:45 +02:00
Zhaolei	02af61bb50	tracing, kmemtrace: Separate include/trace/kmemtrace.h to kmemtrace part and tracepoint part Impact: refactor code for future changes Current kmemtrace.h is used both as header file of kmemtrace and kmem's tracepoints definition. Tracepoints' definition file may be used by other code, and should only have definition of tracepoint. We can separate include/trace/kmemtrace.h into 2 files: include/linux/kmemtrace.h: header file for kmemtrace include/trace/kmem.h: definition of kmem tracepoints Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Acked-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Tom Zanussi <tzanussi@gmail.com> LKML-Reference: <49DEE68A.5040902@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-12 15:22:55 +02:00
Li Zefan	44e9c8b7ad	tracing/filters: return proper error code when writing filter file - propagate return value of filter_add_pred() to the user - return -ENOSPC but not -ENOMEM or -EINVAL when the filter array is full Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Tom Zanussi <tzanussi@gmail.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <49E04CF0.3010105@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-12 11:59:29 +02:00
Li Zefan	a3e0ab0507	tracing/filters: allow user input integer to be oct or hex Before patch: # echo 'parent_pid == 0x10' > events/sched/sched_process_fork/filter # cat sched/sched_process_fork/filter parent_pid == 0 After patch: # cat sched/sched_process_fork/filter parent_pid == 16 Also check the input more strictly. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Tom Zanussi <tzanussi@gmail.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <49E04C53.4010600@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-12 11:59:28 +02:00
Li Zefan	bcabd91c27	tracing/filters: fix NULL pointer dereference Try this, and you'll see NULL pointer dereference bug: # echo -n 'parent_comm ==' > sched/sched_process_fork/filter Because we passed NULL ptr to simple_strtoull(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Tom Zanussi <tzanussi@gmail.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <49E04C43.1050504@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-12 11:59:28 +02:00
Li Zefan	8433a40eb7	tracing/filters: NIL-terminate user input filter Make sure messages from user space are NIL-terminated strings, otherwise we could dump random memory while reading filter file. Try this: # echo 'parent_comm ==' > events/sched/sched_process_fork/filter # cat events/sched/sched_process_fork/filter parent_comm == � Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Tom Zanussi <tzanussi@gmail.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <49E04C32.6060508@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-12 11:59:27 +02:00
Linus Torvalds	d6de2c80e9	async: Fix module loading async-work regression Several drivers use asynchronous work to do device discovery, and we synchronize with them in the compiled-in case before we actually try to mount root filesystems etc. However, when compiled as modules, that synchronization is missing - the module loading completes, but the driver hasn't actually finished probing for devices, and that means that any user mode that expects to use the devices after the 'insmod' is now potentially broken. We already saw one case of a similar issue in the ACPI battery code, where the kernel itself expected the module to be all done, and unmapped the init memory - but the async device discovery was still running. That got hacked around by just removing the "__init" (see commit `5d38258ec0` "ACPI battery: fix async boot oops"), but the real fix is to just make the module loading wait for all async work to be completed. It will slow down module loading, but since common devices should be built in anyway, and since the bug is really annoying and hard to handle from user space (and caused several S3 resume regressions), the simple fix to wait is the right one. This fixes at least http://bugzilla.kernel.org/show_bug.cgi?id=13063 but probably a few other bugzilla entries too (12936, for example), and is confirmed to fix Rafael's storage driver breakage after resume bug report (no bugzilla entry). We should also be able to now revert that ACPI battery fix. Reported-and-tested-by: Rafael J. Wysocki <rjw@suse.com> Tested-by: Heinz Diehl <htd@fancy-poultry.org> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-11 12:44:49 -07:00
Zhaolei	0462b5664b	ftrace: Output REC->var instead of __entry->var for trace format print fmt: "irq=%d return=%s", __entry->irq, __entry->ret ? \"handled\" : \"unhandled\" "__entry" should be convert to "REC" by __stringify() macro. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <49DC679D.2090901@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-10 15:48:53 +02:00
Li Zefan	4d1f4372db	tracing: fix document references When moving documents to Documentation/trace/, I forgot to grep Kconfig to find out those references. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Pekka Paalanen <pq@iki.fi> Cc: eduard.munteanu@linux360.ro LKML-Reference: <49DE97EF.7080208@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-10 13:08:50 +02:00
Frederic Weisbecker	2062501ae6	tracing/lockdep: report the time waited for a lock While trying to optimize the new lock on reiserfs to replace the bkl, I find the lock tracing very useful though it lacks something important for performance (and latency) instrumentation: the time a task waits for a lock. That's what this patch implements: bash-4816 [000] 202.652815: lock_contended: lock_contended: &sb->s_type->i_mutex_key bash-4816 [000] 202.652819: lock_acquired: &rq->lock (0.000 us) <...>-4787 [000] 202.652825: lock_acquired: &rq->lock (0.000 us) <...>-4787 [000] 202.652829: lock_acquired: &rq->lock (0.000 us) bash-4816 [000] 202.652833: lock_acquired: &sb->s_type->i_mutex_key (16.005 us) As shown above, the "lock acquired" field is followed by the time it has been waiting for the lock. Usually, a lock contended entry is followed by a near lock_acquired entry with a non-zero time waited. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1238975373-15739-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-10 12:50:56 +02:00
Ingo Molnar	1cad1252ed	Merge branch 'tracing/urgent' into tracing/core Merge reason: pick up both v2.6.30-rc1 [which includes tracing/urgent fixes] and pick up the current lineup of tracing/urgent fixes as well Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-10 12:46:51 +02:00
Lai Jiangshan	93cfb3c9fd	tracing: fix splice return too large I got these from strace: splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 12288 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 12288 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 12288 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 16384 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 8192 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 8192 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 8192 I wanted to splice_read 4096 bytes, but it returns 8192 or larger. It is because the return value of tracing_buffers_splice_read() does not include "zero out any left over data" bytes. But tracing_buffers_read() includes these bytes, we make them consistent. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <srostedt@redhat.com> LKML-Reference: <49D46674.9030804@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-10 12:44:46 +02:00
Lai Jiangshan	c7625a555f	tracing: update file->f_pos when splice(2) it Impact: Cleanup These two lines: if (unlikely(*ppos)) return -ESPIPE; in tracing_buffers_splice_read() are not needed, VFS layer has disabled seek(2). We remove these two lines, and then we can update file->f_pos. And tracing_buffers_read() updates file->f_pos, this fix make tracing_buffers_splice_read() updates file->f_pos too. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <srostedt@redhat.com> LKML-Reference: <49D46670.4010503@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-10 12:44:44 +02:00
Lai Jiangshan	ddd538f3e6	tracing: allocate page when needed Impact: Cleanup Sometimes, we open trace_pipe_raw, but we don't read(2) it, we just splice(2) it, thus, the page is not used. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <srostedt@redhat.com> LKML-Reference: <49D4666B.4010608@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-10 12:44:43 +02:00
Lai Jiangshan	d1e7e02f30	tracing: disable seeking for trace_pipe_raw Impact: disable pread() We set tracing_buffers_fops.llseek to no_llseek, but we can still perform pread() to read this file. That is not expected. This fix uses nonseekable_open() to disable it. tracing_buffers_fops.llseek is still set to no_llseek, it mark this file is a "non-seekable device" and is used by sys_splice(). See also do_splice() or manual of splice(2): ERRORS EINVAL Target file system doesn't support splicing; neither of the descriptors refers to a pipe; or offset given for non-seekable device. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <srostedt@redhat.com> LKML-Reference: <49D46668.8030806@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-10 12:44:42 +02:00
Linus Torvalds	c2ea122cd7	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: consolidate documents blktrace: pass the right pointer to kfree() tracing/syscalls: use a dedicated file header tracing: append a comma to INIT_FTRACE_GRAPH	2009-04-09 10:37:46 -07:00
Linus Torvalds	17b2e9bf27	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: do not count frozen tasks toward load sched: refresh MAINTAINERS entry sched: Print sched_group::__cpu_power in sched_domain_debug cpuacct: add per-cgroup utime/stime statistics posixtimers, sched: Fix posix clock monotonicity sched_rt: don't allocate cpumask in fastpath cpuacct: make cpuacct hierarchy walk in cpuacct_charge() safe when rcupreempt is used -v2	2009-04-09 10:37:28 -07:00
Linus Torvalds	422a253483	Merge branches 'core-fixes-for-linus', 'irq-fixes-for-linus' and 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: printk: fix wrong format string iter for printk futex: comment requeue key reference semantics * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: irq: fix cpumask memory leak on offstack cpumask kernels * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: posix-timers: fix RLIMIT_CPU && setitimer(CPUCLOCK_PROF) posix-timers: fix RLIMIT_CPU && fork() timers: add missing kernel-doc	2009-04-09 10:35:30 -07:00
Heiko Carstens	36cd3c9f92	mutex: have non-spinning mutexes on s390 by default Impact: performance regression fix for s390 The adaptive spinning mutexes will not always do what one would expect on virtualized architectures like s390. Especially the cpu_relax() loop in mutex_spin_on_owner might hurt if the mutex holding cpu has been scheduled away by the hypervisor. We would end up in a cpu_relax() loop when there is no chance that the state of the mutex changes until the target cpu has been scheduled again by the hypervisor. For that reason we should change the default behaviour to no-spin on s390. We do have an instruction which allows to yield the current cpu in favour of a different target cpu. Also we have an instruction which allows us to figure out if the target cpu is physically backed. However we need to do some performance tests until we can come up with a solution that will do the right thing on s390. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> LKML-Reference: <20090409184834.7a0df7b2@osiris.boeblingen.de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-09 19:28:24 +02:00
Li Zefan	9eb85125ce	blktrace: pass the right pointer to kfree() Impact: fix kfree crash with non-standard act_mask string If passing a string with leading white spaces to strstrip(), the returned ptr != the original ptr. This bug was introduced by me. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <49DD694C.8020902@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-09 05:52:40 +02:00
Frederic Weisbecker	47788c58e6	tracing/syscalls: use a dedicated file header Impact: fix build warnings and possibe compat misbehavior on IA64 Building a kernel on ia64 might trigger these ugly build warnings: CC arch/ia64/ia32/sys_ia32.o In file included from arch/ia64/ia32/sys_ia32.c:55: arch/ia64/ia32/ia32priv.h:290:1: warning: "elf_check_arch" redefined In file included from include/linux/elf.h:7, from include/linux/module.h:14, from include/linux/ftrace.h:8, from include/linux/syscalls.h:68, from arch/ia64/ia32/sys_ia32.c:18: arch/ia64/include/asm/elf.h:19:1: warning: this is the location of the previous definition [...] sys_ia32.c includes linux/syscalls.h which in turn includes linux/ftrace.h to import the syscalls tracing prototypes. But including ftrace.h can pull too much things for a low level file, especially on ia64 where the ia32 private headers conflict with higher level headers. Now we isolate the syscall tracing headers in their own lightweight file. Reported-by: Tony Luck <tony.luck@intel.com> Tested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jason Baron <jbaron@redhat.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Jiaying Zhang <jiayingz@google.com> Cc: Michael Rubin <mrubin@google.com> Cc: Martin Bligh <mbligh@google.com> Cc: Michael Davidson <md@google.com> LKML-Reference: <20090408184058.GB6017@nowhere> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-09 05:43:32 +02:00
Andrew Morton	6b44003e5c	work_on_cpu(): rewrite it to create a kernel thread on demand Impact: circular locking bugfix The various implemetnations and proposed implemetnations of work_on_cpu() are vulnerable to various deadlocks because they all used queues of some form. Unrelated pieces of kernel code thus gained dependencies wherein if one work_on_cpu() caller holds a lock which some other work_on_cpu() callback also takes, the kernel could rarely deadlock. Fix this by creating a short-lived kernel thread for each work_on_cpu() invokation. This is not terribly fast, but the only current caller of work_on_cpu() is pci_call_probe(). It would be nice to find some other way of doing the node-local allocations in the PCI probe code so that we can zap work_on_cpu() altogether. The code there is rather nasty. I can't think of anything simple at this time... Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-04-09 09:50:37 +09:30
Oleg Nesterov	1c99315bb3	kthread: move sched-realeted initialization from kthreadd context kthreadd is the single thread which implements ths "create" request, move sched_setscheduler/etc from create_kthread() to kthread_create() to improve the scalability. We should be careful with sched_setscheduler(), use _nochek helper. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Vitaliy Gusev <vgusev@openvz.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-04-09 09:50:37 +09:30
Vitaliy Gusev	3217ab97f1	kthread: Don't looking for a task in create_kthread() #2 Remove the unnecessary find_task_by_pid_ns(). kthread() can just use "current" to get the same result. Signed-off-by: Vitaliy Gusev <vgusev@openvz.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-04-09 09:50:36 +09:30
Roland McGrath	3a70970353	ptrace: some checkpatch fixes This fixes all the checkpatch --file complaints about kernel/ptrace.c and also removes an unused #include. I've verified that there are no changes to the compiled code on x86_64. Signed-off-by: Roland McGrath <roland@redhat.com> [ Removed the parts that just split a line - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-08 10:21:44 -07:00
Oleg Nesterov	8f2e586567	posix-timers: fix RLIMIT_CPU && setitimer(CPUCLOCK_PROF) update_rlimit_cpu() tries to optimize out set_process_cpu_timer() in case when we already have CPUCLOCK_PROF timer which should expire first. But it uses cputime_lt() instead of cputime_gt(). Test case: int main(void) { struct itimerval it = { .it_value = { .tv_sec = 1000 }, }; assert(!setitimer(ITIMER_PROF, &it, NULL)); struct rlimit rl = { .rlim_cur = 1, .rlim_max = 1, }; assert(!setrlimit(RLIMIT_CPU, &rl)); for (;;) ; return 0; } Without this patch, the task is not killed as RLIMIT_CPU demands. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Peter Lojkin <ia6432@inbox.ru> Cc: Roland McGrath <roland@redhat.com> Cc: stable@kernel.org LKML-Reference: <20090327000610.GA10108@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 17:51:39 +02:00
Oleg Nesterov	6279a751fe	posix-timers: fix RLIMIT_CPU && fork() See http://bugzilla.kernel.org/show_bug.cgi?id=12911 copy_signal() copies signal->rlim, but RLIMIT_CPU is "lost". Because posix_cpu_timers_init_group() sets cputime_expires.prof_exp = 0 and thus fastpath_timer_check() returns false unless we have other expired cpu timers. Change copy_signal() to set cputime_expires.prof_exp if we have RLIMIT_CPU. Also, set cputimer.running = 1 in that case. This is not strictly necessary, but imho makes sense. Reported-by: Peter Lojkin <ia6432@inbox.ru> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Peter Lojkin <ia6432@inbox.ru> Cc: Roland McGrath <roland@redhat.com> Cc: stable@kernel.org LKML-Reference: <20090327000607.GA10104@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 17:51:38 +02:00
Ingo Molnar	5af8c4e0fa	Merge commit 'v2.6.30-rc1' into sched/urgent Merge reason: update to latest upstream to queue up fix Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 17:26:00 +02:00
Ingo Molnar	ff96e612cb	Merge commit 'v2.6.30-rc1' into core/urgent Merge reason: need latest upstream to queue up dependent fix Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 17:02:57 +02:00
Linus Torvalds	1551260d1f	Merge branch 'core/softlockup' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core/softlockup' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: softlockup: make DETECT_HUNG_TASK default depend on DETECT_SOFTLOCKUP softlockup: move 'one' to the softlockup section in sysctl.c softlockup: ensure the task has been switched out once softlockup: remove timestamp checking from hung_task softlockup: convert read_lock in hung_task to rcu_read_lock softlockup: check all tasks in hung_task softlockup: remove unused definition for spawn_softlockup_task softlockup: fix potential race in hung_task when resetting timeout softlockup: fix to allow compiling with !DETECT_HUNG_TASK softlockup: decouple hung tasks check from softlockup detection	2009-04-07 14:11:07 -07:00
Linus Torvalds	c93f216b5b	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: branch tracer, intel-iommu: fix build with CONFIG_BRANCH_TRACER=y branch tracer: Fix for enabling branch profiling makes sparse unusable ftrace: Correct a text align for event format output Update /debug/tracing/README tracing/ftrace: alloc the started cpumask for the trace file tracing, x86: remove duplicated #include ftrace: Add check of sched_stopped for probe_sched_wakeup function-graph: add proper initialization for init task tracing/ftrace: fix missing include string.h tracing: fix incorrect return type of ns2usecs() tracing: remove CALLER_ADDR2 from wakeup tracer blktrace: fix pdu_len when tracing packet command requests blktrace: small cleanup in blk_msg_write() blktrace: NUL-terminate user space messages tracing: move scripts/trace/power.pl to scripts/tracing/power.pl	2009-04-07 14:10:10 -07:00
Linus Torvalds	c61b79b6ef	Merge branch 'irq/threaded' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq/threaded' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: fix devres.o build for GENERIC_HARDIRQS=n genirq: provide old request_irq() for CONFIG_GENERIC_HARDIRQ=n genirq: threaded irq handlers review fixups genirq: add support for threaded interrupts to devres genirq: add threaded interrupt handler support	2009-04-07 14:07:52 -07:00
Masami Hiramatsu	de5bd88d5a	kprobes: support per-kprobe disabling Add disable_kprobe() and enable_kprobe() to disable/enable kprobes temporarily. disable_kprobe() asynchronously disables probe handlers of specified kprobe. So, after calling it, some handlers can be called at a while. enable_kprobe() enables specified kprobe. aggr_pre_handler and aggr_post_handler check disabled probes. On the other hand aggr_break_handler and aggr_fault_handler don't check it because these handlers will be called while executing pre or post handlers and usually those help error handling. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:08 -07:00
Masami Hiramatsu	e579abeb58	kprobes: rename kprobe_enabled to kprobes_all_disarmed Rename kprobe_enabled to kprobes_all_disarmed and invert logic due to avoiding naming confusion from per-probe disabling. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:08 -07:00
Masami Hiramatsu	99081ab553	kprobes: move EXPORT_SYMBOL_GPL just after function definitions Clean up positions of EXPORT_SYMBOL_GPL in kernel/kprobes.c according to checkpatch.pl. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:08 -07:00
Masami Hiramatsu	b918e5e60d	kprobes: cleanup aggr_kprobe related code Currently, kprobes can disable all probes at once, but can't disable it individually (not unregister, just disable an kprobe, because unregistering needs to wait for scheduler synchronization). These patches introduce APIs for on-the-fly per-probe disabling and re-enabling by dis-arming/re-arming its breakpoint instruction. This patch: Change old_p to ap in add_new_kprobe() for readability, copy flags member in add_aggr_kprobe(), and simplify the code flow of register_aggr_kprobe(). Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:07 -07:00
Peter W Morreale	fafd688e4c	mm: add /proc controls for pdflush threads Add /proc entries to give the admin the ability to control the minimum and maximum number of pdflush threads. This allows finer control of pdflush on both large and small machines. The rationale is simply one size does not fit all. Admins on large and/or small systems may want to tune the min/max pdflush thread count to best suit their needs. Right now the min/max is hardcoded to 2/8. While probably a fair estimate for smaller machines, large machines with large numbers of CPUs and large numbers of filesystems/block devices may benefit from larger numbers of threads working on different block devices. Even if the background flushing algorithm is radically changed, it is still likely that multiple threads will be involved and admins would still desire finer control on the min/max other than to have to recompile the kernel. The patch adds '/proc/sys/vm/nr_pdflush_threads_min' and '/proc/sys/vm/nr_pdflush_threads_max' with r/w permissions. The minimum value for nr_pdflush_threads_min is 1 and the maximum value is the current value of nr_pdflush_threads_max. This minimum is required since additional thread creation is performed in a pdflush thread itself. The minimum value for nr_pdflush_threads_max is the current value of nr_pdflush_threads_min and the maximum value can be 1000. Documentation/sysctl/vm.txt is also updated. [akpm@linux-foundation.org: fix comment, fix whitespace, use __read_mostly] Signed-off-by: Peter W Morreale <pmorreale@novell.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:03 -07:00
Zhaolei	dcef788eb9	ftrace: clean up enable logic for sched_switch Unify sched_switch and sched_wakeup's action to following logic: Do record_cmdline when start_cmdline_record() is called. Start tracing events when the tracer is started. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> LKML-Reference: <49D1C596.5050203@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-07 14:43:09 +02:00
Steven Rostedt	597af81537	function-graph: use int instead of atomic for ftrace_graph_active Impact: cleanup The variable ftrace_graph_active is only modified under the ftrace_lock mutex, thus an atomic is not necessary for modification. Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-07 14:43:08 +02:00
Frederic Weisbecker	5452af664f	tracing/ftrace: factorize the tracing files creation Impact: cleanup Most of the tracing files creation follow the same pattern: ret = debugfs_create_file(...) if (!ret) pr_warning("Couldn't create ... entry\n") Unify it! Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1238109938-11840-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-07 14:43:07 +02:00
Li Zefan	a5dec5573f	tracing: use macros to denote usec and nsec per second Impact: cleanup Use USEC_PER_SEC and NSEC_PER_SEC instead of 1000000 and 1000000000. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <49CC7870.9000309@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-04-07 14:43:06 +02:00
Ingo Molnar	86665c75da	Merge branch 'tracing/urgent' into tracing/ftrace	2009-04-07 14:41:17 +02:00
Zhaolei	1bbe2a83ab	ftrace: Correct a text align for event format output If we cat debugfs/tracing/events/ftrace/bprint/format, we'll see: name: bprint ID: 6 format: field:unsigned char common_type; offset:0; size:1; field:unsigned char common_flags; offset:1; size:1; field:unsigned char common_preempt_count; offset:2; size:1; field:int common_pid; offset:4; size:4; field:int common_tgid; offset:8; size:4; field:unsigned long ip; offset:12; size:4; field:char * fmt; offset:16; size:4; field: char buf; offset:20; size:0; print fmt: "%08lx (%d) fmt:%p %s" There is an inconsistent blank before char buf. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> LKML-Reference: <49D5E3EE.70201@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 14:02:42 +02:00
Nikanth Karthikesan	bc2b6871c1	Update /debug/tracing/README Some of the tracers have been renamed, which was not updated in the in-kernel run-time README file. Update it. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> LKML-Reference: <200903231158.32151.knikanth@suse.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 14:02:36 +02:00
Frederic Weisbecker	b0dfa978c7	tracing/ftrace: alloc the started cpumask for the trace file Impact: fix a crash while cat trace file Currently we are using a cpumask to remind each cpu where a trace occured. It lets us notice the user that a cpu just had its first trace. But on latest -tip we have the following crash once we cat the trace file: IP: [<c0270c4a>] print_trace_fmt+0x45/0xe7 *pde = 00000000 Oops: 0000 [#1] PREEMPT SMP last sysfs file: /sys/class/net/eth0/carrier Pid: 3897, comm: cat Not tainted (2.6.29-tip-02825-g0f22972-dirty #81) EIP: 0060:[<c0270c4a>] EFLAGS: 00010297 CPU: 0 EIP is at print_trace_fmt+0x45/0xe7 EAX: 00000000 EBX: 00000000 ECX: c12d9e98 EDX: ccdb7010 ESI: d31f4000 EDI: `00322401` EBP: d31f3f10 ESP: d31f3efc DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process cat (pid: 3897, ti=d31f2000 task=d3b3cf20 task.ti=d31f2000) Stack: d31f4080 ccdb7010 d31f4000 d691fe70 ccdb7010 d31f3f24 c0270e5c d31f4000 d691fe70 d31f4000 d31f3f34 c02718e8 c12d9e98 d691fe70 d31f3f70 c02bfc33 00001000 09130000 d3b46e00 d691fe98 00000000 00000079 00000001 00000000 Call Trace: [<c0270e5c>] ? print_trace_line+0x170/0x17c [<c02718e8>] ? s_show+0xa7/0xbd [<c02bfc33>] ? seq_read+0x24a/0x327 [<c02bf9e9>] ? seq_read+0x0/0x327 [<c02ab18b>] ? vfs_read+0x86/0xe1 [<c02ab289>] ? sys_read+0x40/0x65 [<c0202d8f>] ? sysenter_do_call+0x12/0x3c Code: 00 00 00 89 45 ec f7 c7 00 20 00 00 89 55 f0 74 4e f6 86 98 10 00 00 02 74 45 8b 86 8c 10 00 00 8b 9e a8 10 00 00 e8 52 f3 ff ff <0f> a3 03 19 c0 85 c0 75 2b 8b 86 8c 10 00 00 8b 9e a8 10 00 00 EIP: [<c0270c4a>] print_trace_fmt+0x45/0xe7 SS:ESP 0068:d31f3efc CR2: 0000000000000000 ---[ end trace aa9cf38e5ebed9dd ]--- This is because we alloc the iter->started cpumask on tracing_pipe_open but not on tracing_open. It hadn't been noticed until now because we need to have ring buffer overruns to activate the starting of cpu buffer detection. Also, we need a check to not print the messagge for the first trace on the file. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1238619188-6109-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 14:02:03 +02:00
Zhaolei	8bcae09b93	ftrace: Add check of sched_stopped for probe_sched_wakeup The wakeup tracing in sched_switch does not stop when a user disables tracing. This is because the probe_sched_wakeup() is missing the check to prevent the wakeup from being traced. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> LKML-Reference: <49D1C543.3010307@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 14:01:11 +02:00
Frederic Weisbecker	5f0c6c03c5	tracing/ftrace: fix missing include string.h Building a kernel with tracing can raise the following warning on tip/master: kernel/trace/trace.c:1249: error: implicit declaration of function 'vbin_printf' We are missing an include to string.h Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1238160130-7437-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 14:00:18 +02:00
Lai Jiangshan	cf8e347465	tracing: fix incorrect return type of ns2usecs() Impact: fix time output bug in 32bits system ns2usecs() returns 'long', it's incorrect. (In i386) ... <idle>-0 [000] 521.442100: _spin_lock <-tick_do_update_jiffies64 <idle>-0 [000] 521.442101: do_timer <-tick_do_update_jiffies64 <idle>-0 [000] 521.442102: update_wall_time <-do_timer <idle>-0 [000] 521.442102: update_xtime_cache <-update_wall_time .... (It always print the time less than 2200 seconds besides ...) Because 'long' is 32bits in i386. ( (1<<31) useconds is about 2200 seconds) ... <idle>-0 [001] 4154502640.134759: rcu_bh_qsctr_inc <-__do_softirq <idle>-0 [001] 4154502640.134760: _local_bh_enable <-__do_softirq <idle>-0 [001] 4154502640.134761: idle_cpu <-irq_exit ... (very large value) Because 'long' is a signed type and it is 32bits in i386. Changes in v2: return 'unsigned long long' instead of 'cycle_t' Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <49D05D10.4030009@cn.fujitsu.com> Reported-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 13:59:23 +02:00
Steven Rostedt	301fd748e2	tracing: remove CALLER_ADDR2 from wakeup tracer Maneesh Soni was getting a crash when running the wakeup tracer. We debugged it down to the recording of the function with the CALLER_ADDR2 macro. This is used to get the location of the caller to schedule. But the problem comes when schedule is called by assmebly. In the case that Maneesh had, retint_careful would call schedule. But retint_careful does not set up a proper frame pointer. CALLER_ADDR2 is defined as __builtin_return_address(2). This produces the following assembly in the wakeup tracer code. mov 0x0(%rbp),%rcx <--- get the frame pointer of the caller mov %r14d,%r8d mov 0xf2de8e(%rip),%rdi mov 0x8(%rcx),%rsi <-- this is __builtin_return_address(1) mov 0x28(%rdi,%rax,8),%rbx mov (%rcx),%rax <-- get the frame pointer of the caller's caller mov %r12,%rcx mov 0x8(%rax),%rdx <-- this is __builtin_return_address(2) At the reading of 0x8(%rax) Maneesh's machine would take a fault. The reason is that retint_careful did not set up the return address and the content of %rax here was zero. To verify this, I sent Maneesh a patch to create a frame pointer in retint_careful. He ran the test again but this time he would take the same type of fault from sysret_careful. The retint_careful was no longer an issue, but there are other callers that still have issues. Instead of adding frame pointers for all callers to schedule (in possibly all archs), it is much safer to simply not use CALLER_ADDR2. This loses out on knowing what called schedule, but the function tracer will help there if needed. Reported-by: Maneesh Soni <maneesh@in.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 13:58:54 +02:00
Ingo Molnar	93776a8ec7	Merge branch 'linus' into tracing/core Merge reason: update to upstream tracing facilities Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 13:47:45 +02:00
Ingo Molnar	a053958f07	Merge branch 'tracing/blktrace-fixes' into tracing/urgent Merge reason: this used to be a tracing/blktrace-v2 devel topic still cooking during the merge window - has propagated to fixes Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 13:40:55 +02:00
Markus Metzger	0f4814065f	x86, ptrace: add bts context unconditionally Add the ptrace bts context field to task_struct unconditionally. Initialize the field directly in copy_process(). Remove all the unneeded functionality used to initialize that field. Signed-off-by: Markus Metzger <markus.t.metzger@intel.com> Cc: roland@redhat.com Cc: eranian@googlemail.com Cc: oleg@redhat.com Cc: juan.villacis@intel.com Cc: ak@linux.jf.intel.com LKML-Reference: <20090403144603.292754000@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 13:36:31 +02:00
Markus Metzger	4d657e51df	x86, hw-branch-tracer: allocate selftest iterator on heap Allocate the trace_iterator for the hw-branch-tracer selftest on the heap. Signed-off-by: Markus Metzger <markus.t.metzger@intel.com> Cc: roland@redhat.com Cc: eranian@googlemail.com Cc: oleg@redhat.com Cc: juan.villacis@intel.com Cc: ak@linux.jf.intel.com LKML-Reference: <20090403144556.578777000@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 13:36:21 +02:00
Markus Metzger	de79f54f53	x86, bts, hw-branch-tracer: add _noirq variants to the debug store interface The hw-branch-tracer uses debug store functions from an on_each_cpu() context, which is simply wrong since the functions may sleep. Add _noirq variants for most functions, which may be called with interrupts disabled. Separate per-cpu and per-task tracing and allow per-cpu tracing to be controlled from any cpu. Make the hw-branch-tracer use the new debug store interface, synchronize with hotplug cpu event using get/put_online_cpus(), and remove the unnecessary spinlock. Make the ptrace bts and the ds selftest code use the new interface. Defer the ds selftest. Signed-off-by: Markus Metzger <markus.t.metzger@intel.com> Cc: roland@redhat.com Cc: eranian@googlemail.com Cc: oleg@redhat.com Cc: juan.villacis@intel.com Cc: ak@linux.jf.intel.com LKML-Reference: <20090403144555.658136000@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 13:36:20 +02:00
Markus Metzger	a26b89f05d	sched, hw-branch-tracer: add wait_task_context_switch() function to sched.h Add a function to wait until some other task has been switched out at least once. This differs from wait_task_inactive() subtly, in that the latter will wait until the task has left the CPU. Signed-off-by: Markus Metzger <markus.t.metzger@intel.com> Cc: markus.t.metzger@gmail.com Cc: roland@redhat.com Cc: eranian@googlemail.com Cc: oleg@redhat.com Cc: juan.villacis@intel.com Cc: ak@linux.jf.intel.com LKML-Reference: <20090403144549.794157000@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 13:36:12 +02:00
Ingo Molnar	2e8844e13a	Merge branch 'linus' into tracing/hw-branch-tracing Merge reason: update to latest tracing and ptrace APIs Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 13:34:42 +02:00
Ingo Molnar	5e34437840	Merge branch 'linus' into core/softlockup Conflicts: kernel/sysctl.c	2009-04-07 11:15:40 +02:00
Rusty Russell	2e45e77787	Revert "module: remove the SHF_ALLOC flag on the __versions section." This reverts commit `9cb610d8e3`. This was an impressively stupid patch. Firstly, we reset the SHF_ALLOC flag lower down in the same function, so the patch was useless. Even better, find_sec() ignores sections with SHF_ALLOC not set, so it breaks CONFIG_MODVERSIONS=y with CONFIG_MODULE_FORCE_LOAD=n, which refuses to load the module since it can't find the __versions section. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-04-07 17:12:43 +09:30
Oleg Nesterov	432870dab8	exit_notify: kill the wrong capable(CAP_KILL) check The CAP_KILL check in exit_notify() looks just wrong, kill it. Whatever logic we have to reset ->exit_signal, the malicious user can bypass it if it execs the setuid application before exiting. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-06 14:57:23 -07:00
Linus Torvalds	cd5f9a4c31	kernel/sysctl.c: avoid annoying warnings Some of the limit constants are used only depending on some complex configuration dependencies, yet it's not worth making the simple variables depend on those configuration details. Just mark them as perhaps not being unused, and avoid the warning. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-06 13:38:46 -07:00
Linus Torvalds	609862be07	Merge branch 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: add stack dumps to asserts hrtimer: fix rq->lock inversion (again)	2009-04-06 13:37:30 -07:00
Linus Torvalds	12fe32e4f9	Merge branch 'kmemtrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'kmemtrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: kmemtrace: trace kfree() calls with NULL or zero-length objects kmemtrace: small cleanups kmemtrace: restore original tracing data binary format, improve ABI kmemtrace: kmemtrace_alloc() must fill type_id kmemtrace: use tracepoints kmemtrace, rcu: don't include unnecessary headers, allow kmemtrace w/ tracepoints kmemtrace, rcu: fix rcupreempt.c data structure dependencies kmemtrace, rcu: fix rcu_tree_trace.c data structure dependencies kmemtrace, rcu: fix linux/rcutree.h and linux/rcuclassic.h dependencies kmemtrace, mm: fix slab.h dependency problem in mm/failslab.c kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_unlzma.c kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_bunzip2.c kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_inflate.c kmemtrace, squashfs: fix slab.h dependency problem in squasfs kmemtrace, befs: fix slab.h dependency problem kmemtrace, security: fix linux/key.h header file dependencies kmemtrace, fs: fix linux/fdtable.h header file dependencies kmemtrace, fs: uninline simple_transaction_set() kmemtrace, fs, security: move alloc_secdata() and free_secdata() to linux/security.h	2009-04-06 13:30:00 -07:00
Ingo Molnar	9efe21cb82	Merge branch 'linus' into irq/threaded Conflicts: include/linux/irq.h kernel/irq/handle.c	2009-04-06 01:41:22 +02:00
Linus Torvalds	0221c81b1b	Merge branch 'audit.b62' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current * 'audit.b62' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: Audit: remove spaces from audit_log_d_path audit: audit_set_auditable defined but not used audit: incorrect ref counting in audit tree tag_chunk audit: Fix possible return value truncation in audit_get_context() audit: ignore terminating NUL in AUDIT_USER_TTY messages Audit: fix handling of 'strings' with NULL characters make the e->rule.xxx shorter in kernel auditfilter.c auditsc: fix kernel-doc notation audit: EXECVE record - removed bogus newline	2009-04-05 12:36:11 -07:00
Linus Torvalds	714f83d5d9	Merge branch 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (413 commits) tracing, net: fix net tree and tracing tree merge interaction tracing, powerpc: fix powerpc tree and tracing tree interaction ring-buffer: do not remove reader page from list on ring buffer free function-graph: allow unregistering twice trace: make argument 'mem' of trace_seq_putmem() const tracing: add missing 'extern' keywords to trace_output.h tracing: provide trace_seq_reserve() blktrace: print out BLK_TN_MESSAGE properly blktrace: extract duplidate code blktrace: fix memory leak when freeing struct blk_io_trace blktrace: fix blk_probes_ref chaos blktrace: make classic output more classic blktrace: fix off-by-one bug blktrace: fix the original blktrace blktrace: fix a race when creating blk_tree_root in debugfs blktrace: fix timestamp in binary output tracing, Text Edit Lock: cleanup tracing: filter fix for TRACE_EVENT_FORMAT events ftrace: Using FTRACE_WARN_ON() to check "freed record" in ftrace_release() x86: kretprobe-booster interrupt emulation code fix ... Fix up trivial conflicts in arch/parisc/include/asm/ftrace.h include/linux/memory.h kernel/extable.c kernel/module.c	2009-04-05 11:04:19 -07:00
Eric Paris	def5754341	Audit: remove spaces from audit_log_d_path audit_log_d_path had spaces in the strings which would be emitted on the error paths. This patch simply replaces those spaces with an _ or removes the needless spaces entirely. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-05 13:49:04 -04:00
Eric Paris	679173b724	audit: audit_set_auditable defined but not used after `0590b9335a` audit_set_auditable() is now only used by the audit tree code. If CONFIG_AUDIT_TREE is unset it will be defined but unused. This patch simply moves the function inside a CONFIG_AUDIT_TREE block. cc1: warnings being treated as errors /home/acme_unencrypted/git/linux-2.6-tip/kernel/auditsc.c:745: error: ‘audit_set_auditable’ defined but not used make[2]: * [kernel/auditsc.o] Error 1 make[1]: * [kernel] Error 2 make[1]: *** Waiting for unfinished jobs.... Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-05 13:48:52 -04:00
Eric Paris	318b6d3d7d	audit: incorrect ref counting in audit tree tag_chunk tag_chunk has bad exit paths in which the inotify ref counting is wrong. At the top of the function we found &old_watch using inotify_find_watch(). inotify_find_watch takes a reference to the watch. This is never dropped on an error path. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-05 13:48:26 -04:00
Paul Moore	6d208da89a	audit: Fix possible return value truncation in audit_get_context() The audit subsystem treats syscall return codes as type long, unfortunately the audit_get_context() function mistakenly converts the return code to an int type in the parameters which could cause problems on systems where the sizeof(int) != sizeof(long). Signed-off-by: Paul Moore <paul.moore@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-05 13:46:19 -04:00
Miloslav Trmac	55ad2f8d34	audit: ignore terminating NUL in AUDIT_USER_TTY messages AUDIT_USER_TTY, like all other messages sent from user-space, is sent NUL-terminated. Unlike other user-space audit messages, which come only from trusted sources, AUDIT_USER_TTY messages are processed using audit_log_n_untrustedstring(). This patch modifies AUDIT_USER_TTY handling to ignore the trailing NUL and use the "quoted_string" representation of the message if possible. Signed-off-by: Miloslav Trmac <mitr@redhat.com> Cc: Eric Paris <eparis@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-05 13:43:36 -04:00
Miloslav Trmac	b3897f5671	Audit: fix handling of 'strings' with NULL characters currently audit_log_n_untrustedstring() uses audit_string_contains_control() to check if the 'string' has any control characters. If the 'string' has an embedded NULL audit_string_contains_control() will return that the data has no control characters and will then pass the string to audit_log_n_string with the total length, not the length up to the first NULL. audit_log_n_string() does a memcpy of the entire length and so the actual audit record emitted may then contain a NULL and then whatever random memory is after the NULL. Since we want to log the entire octet stream (if we can't trust the data to be a string we can't trust that a NULL isn't actually a part of it) we should just consider NULL as a control character. If the caller is certain they want to stop at the first NULL they should be using audit_log_untrustedstring. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-05 13:43:24 -04:00
Zhenwen Xu	c28bb7da74	make the e->rule.xxx shorter in kernel auditfilter.c make the e->rule.xxx shorter in kernel/auditfilter.c -- --------------------------------- Zhenwen Xu - Open and Free Home Page: http://zhwen.org My Studio: http://dim4.cn >From 99692dc640b278f1cb1a15646ce42f22e89c0f77 Mon Sep 17 00:00:00 2001 From: Zhenwen Xu <Helight.Xu@gmail.com> Date: Thu, 12 Mar 2009 22:04:59 +0800 Subject: [PATCH] make the e->rule.xxx shorter in kernel/auditfilter.c Signed-off-by: Zhenwen Xu <Helight.Xu@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-05 13:40:33 -04:00
Randy Dunlap	6b96255998	auditsc: fix kernel-doc notation Fix auditsc kernel-doc notation: Warning(linux-2.6.28-git7//kernel/auditsc.c:2156): No description found for parameter 'attr' Warning(linux-2.6.28-git7//kernel/auditsc.c:2156): Excess function parameter 'u_attr' description in '__audit_mq_open' Warning(linux-2.6.28-git7//kernel/auditsc.c:2204): No description found for parameter 'notification' Warning(linux-2.6.28-git7//kernel/auditsc.c:2204): Excess function parameter 'u_notification' description in '__audit_mq_notify' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-05 13:39:19 -04:00
Jiri Pirko	ca96a895a6	audit: EXECVE record - removed bogus newline (updated) Added hunk that changes the comment, the rest is the same. EXECVE records contain a newline after every argument. auditd converts "\n" to " " so you cannot see newlines even in raw logs, but they're there nevertheless. If you're not using auditd, you need to work round them. These '\n' chars are can be easily replaced by spaces when creating record in kernel. Note there is no need for trailing '\n' in an audit record. record before this patch: "type=EXECVE msg=audit(1231421801.566:31): argc=4 a0=\"./test\"\na1=\"a\"\na2=\"b\"\na3=\"c\"\n" record after this patch: "type=EXECVE msg=audit(1231421801.566:31): argc=4 a0=\"./test\" a1=\"a\" a2=\"b\" a3=\"c\"" Signed-off-by: Jiri Pirko <jpirko@redhat.com> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-05 13:38:59 -04:00
Linus Torvalds	90975ef712	Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask: (36 commits) cpumask: remove cpumask allocation from idle_balance, fix numa, cpumask: move numa_node_id default implementation to topology.h, fix cpumask: remove cpumask allocation from idle_balance x86: cpumask: x86 mmio-mod.c use cpumask_var_t for downed_cpus x86: cpumask: update 32-bit APM not to mug current->cpus_allowed x86: microcode: cleanup x86: cpumask: use work_on_cpu in arch/x86/kernel/microcode_core.c cpumask: fix CONFIG_CPUMASK_OFFSTACK=y cpu hotunplug crash numa, cpumask: move numa_node_id default implementation to topology.h cpumask: convert node_to_cpumask_map[] to cpumask_var_t cpumask: remove x86 cpumask_t uses. cpumask: use cpumask_var_t in uv_flush_tlb_others. cpumask: remove cpumask_t assignment from vector_allocation_domain() cpumask: make Xen use the new operators. cpumask: clean up summit's send_IPI functions cpumask: use new cpumask functions throughout x86 x86: unify cpu_callin_mask/cpu_callout_mask/cpu_initialized_mask/cpu_sibling_setup_mask cpumask: convert struct cpuinfo_x86's llc_shared_map to cpumask_var_t cpumask: convert node_to_cpumask_map[] to cpumask_var_t x86: unify 32 and 64-bit node_to_cpumask_map ...	2009-04-05 10:33:07 -07:00
Linus Torvalds	cab4e4c43f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-module-and-param * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-module-and-param: module: use strstarts() strstarts: helper function for !strncmp(str, prefix, strlen(prefix)) arm: allow usage of string functions in linux/string.h module: don't use stop_machine on module load module: create a request_module_nowait() module: include other structures in module version check module: remove the SHF_ALLOC flag on the __versions section. module: clarify the force-loading taint message. module: Export symbols needed for Ksplice Ksplice: Add functions for walking kallsyms symbols module: remove module_text_address() module: __module_address module: Make find_symbol return a struct kernel_symbol kernel/module.c: fix an unused goto label param: fix charp parameters set via sysfs Fix trivial conflicts in kernel/extable.c manually.	2009-04-05 10:30:21 -07:00
Linus Torvalds	e4c393fd55	Merge branch 'printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: printk: correct the behavior of printk_timed_ratelimit() vsprintf: unify the format decoding layer for its 3 users, cleanup fix regression from "vsprintf: unify the format decoding layer for its 3 users" vsprintf: fix bug in negative value printing vsprintf: unify the format decoding layer for its 3 users vsprintf: add binary printf printk: introduce printk_once() Fix trivial conflicts (printk_once vs log_buf_kexec_setup() added near each other) in include/linux/kernel.h.	2009-04-05 10:23:25 -07:00
Linus Torvalds	09f38dc19d	Merge branch 'core-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ptrace: remove a useless goto	2009-04-03 17:35:06 -07:00
Linus Torvalds	30a39e0e97	Merge branch 'stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: symbols, stacktrace: look up init symbols after module symbols	2009-04-03 17:34:41 -07:00
Linus Torvalds	c7edad5fcb	Merge branch 'rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: rcu_barrier VS cpu_hotplug: Ensure callbacks in dead cpu are migrated to online cpu	2009-04-03 17:34:12 -07:00
Linus Torvalds	b1dbb67911	Merge branch 'ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: s390: remove arch specific smp_send_stop() panic: clean up kernel/panic.c panic, smp: provide smp_send_stop() wrapper on UP too panic: decrease oops_in_progress only after having done the panic generic-ipi: eliminate WARN_ON()s during oops/panic generic-ipi: cleanups generic-ipi: remove CSD_FLAG_WAIT generic-ipi: remove kmalloc() generic IPI: simplify barriers and locking	2009-04-03 17:33:30 -07:00
Linus Torvalds	492f59f526	Merge branch 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: locking: rename trace_softirq_[enter\|exit] => lockdep_softirq_[enter\|exit] lockdep: remove duplicate CONFIG_DEBUG_LOCKDEP definitions lockdep: require framepointers for x86 lockdep: remove extra "irq" string lockdep: fix incorrect state name	2009-04-03 17:29:53 -07:00
Linus Torvalds	811158b147	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits) trivial: Update my email address trivial: NULL noise: drivers/mtd/tests/mtd_*test.c trivial: NULL noise: drivers/media/dvb/frontends/drx397xD_fw.h trivial: Fix misspelling of "Celsius". trivial: remove unused variable 'path' in alloc_file() trivial: fix a pdlfush -> pdflush typo in comment trivial: jbd header comment typo fix for JBD_PARANOID_IOFAIL trivial: wusb: Storage class should be before const qualifier trivial: drivers/char/bsr.c: Storage class should be before const qualifier trivial: h8300: Storage class should be before const qualifier trivial: fix where cgroup documentation is not correctly referred to trivial: Give the right path in Documentation example trivial: MTD: remove EOL from MODULE_DESCRIPTION trivial: Fix typo in bio_split()'s documentation trivial: PWM: fix of #endif comment trivial: fix typos/grammar errors in Kconfig texts trivial: Fix misspelling of firmware trivial: cgroups: documentation typo and spelling corrections trivial: Update contact info for Jochen Hein trivial: fix typo "resgister" -> "register" ...	2009-04-03 15:24:35 -07:00
Yinghai Lu	9756b15e1b	irq: fix cpumask memory leak on offstack cpumask kernels Need to free the old cpumask for affinity and pending_mask. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <49D18FF0.50707@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-03 19:14:44 +02:00
David Howells	8f0aa2f25b	Document the slow work thread pool Document the slow work thread pool. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>	2009-04-03 16:42:35 +01:00
David Howells	12e22c5e4b	Make the slow work pool configurable Make the slow work pool configurable through /proc/sys/kernel/slow-work. () /proc/sys/kernel/slow-work/min-threads The minimum number of threads that should be in the pool as long as it is in use. This may be anywhere between 2 and max-threads. () /proc/sys/kernel/slow-work/max-threads The maximum number of threads that should in the pool. This may be anywhere between min-threads and 255 or NR_CPUS * 2, whichever is greater. (*) /proc/sys/kernel/slow-work/vslow-percentage The percentage of active threads in the pool that may be used to execute very slow work items. This may be between 1 and 99. The resultant number is bounded to between 1 and one fewer than the number of active threads. This ensures there is always at least one thread that can process very slow work items, and always at least one thread that won't. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>	2009-04-03 16:42:35 +01:00
David Howells	109d9272c4	Make slow-work thread pool actually dynamic Make the slow-work thread pool actually dynamic in the number of threads it contains. With this patch, it will both create additional threads when it has extra work to do, and cull excess threads that aren't doing anything. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>	2009-04-03 16:42:35 +01:00
David Howells	07fe7cb7c7	Create a dynamically sized pool of threads for doing very slow work items Create a dynamically sized pool of threads for doing very slow work items, such as invoking mkdir() or rmdir() - things that may take a long time and may sleep, holding mutexes/semaphores and hogging a thread, and are thus unsuitable for workqueues. The number of threads is always at least a settable minimum, but more are started when there's more work to do, up to a limit. Because of the nature of the load, it's not suitable for a 1-thread-per-CPU type pool. A system with one CPU may well want several threads. This is used by FS-Cache to do slow caching operations in the background, such as looking up, creating or deleting cache objects. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>	2009-04-03 16:42:35 +01:00
Li Zefan	e2494e1b42	blktrace: fix pdu_len when tracing packet command requests Impact: output all of packet commands - not just the first 4 / 8 bytes Since commit `d7e3c3249e` ("block: add large command support"), struct request->cmd has been changed from unsinged char cmd[BLK_MAX_CDB] to unsigned char *cmd. v1 -> v2: by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> - make sure rq->cmd_len is always intialized, and then we can use rq->cmd_len instead of BLK_MAX_CDB. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jens Axboe <jens.axboe@oracle.com> LKML-Reference: <49D4507E.2060602@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-03 15:29:26 +02:00
Li Zefan	7635b03adf	blktrace: small cleanup in blk_msg_write() Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: "Alan D. Brunelle" <alan.brunelle@hp.com> Cc: Jens Axboe <jens.axboe@oracle.com> LKML-Reference: <49D5BB56.7000807@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-03 14:48:11 +02:00
Carl Henrik Lunde	a4b3ada83d	blktrace: NUL-terminate user space messages Impact: fix corrupted blkparse output Make sure messages from user space are NUL-terminated strings, otherwise we could dump random memory to the block trace file. Additionally, I've limited the message to BLK_TN_MAX_MSG-1 characters, because the last character would be stripped by vscnprintf anyway. Signed-off-by: Carl Henrik Lunde <chlunde@ping.uio.no> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: "Alan D. Brunelle" <alan.brunelle@hp.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20090403122714.GT5178@kernel.dk> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-03 14:46:22 +02:00
Ingo Molnar	c826e3cd0c	kmemtrace: small cleanups Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> LKML-Reference: <161be9ca8a27b432c4a6ab79f47788c4521652ae.1237813499.git.eduard.munteanu@linux360.ro> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-03 12:23:09 +02:00
Eduard - Gabriel Munteanu	42af9054c0	kmemtrace: restore original tracing data binary format, improve ABI When kmemtrace was ported to ftrace, the marker strings were taken as an indication of how the traced data was being exposed to the userspace. However, the actual format had been binary, not text. This restores the original binary format, while also adding an origin CPU field (since ftrace doesn't expose the data per-CPU to userspace), and re-adding the timestamp field. It also drops arch-independent field sizing where it didn't make sense, so pointers won't always be 64 bits wide like they used to. Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> LKML-Reference: <161be9ca8a27b432c4a6ab79f47788c4521652ae.1237813499.git.eduard.munteanu@linux360.ro> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-03 12:23:08 +02:00
Eduard - Gabriel Munteanu	da2635a985	kmemtrace: kmemtrace_alloc() must fill type_id Impact: fix trace output kmemtrace_alloc() was not filling type_id, which allowed garbage to make it into tracing data. Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> LKML-Reference: <284dba2732a144849d5aa82258fe0de2ad8dcb0b.1237813499.git.eduard.munteanu@linux360.ro> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-03 12:23:07 +02:00
Eduard - Gabriel Munteanu	ca2b84cb3c	kmemtrace: use tracepoints kmemtrace now uses tracepoints instead of markers. We no longer need to use format specifiers to pass arguments. Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> [ folded: Use the new TP_PROTO and TP_ARGS to fix the build. ] [ folded: fix build when CONFIG_KMEMTRACE is disabled. ] [ folded: define tracepoints when CONFIG_TRACEPOINTS is enabled. ] Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> LKML-Reference: <ae61c0f37156db8ec8dc0d5778018edde60a92e3.1237813499.git.eduard.munteanu@linux360.ro> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-03 12:23:06 +02:00
Ingo Molnar	a979241c53	kmemtrace, rcu: fix rcupreempt.c data structure dependencies Impact: cleanup We want to remove percpu.h from rcupreempt.h, but if we do that the percpu primitives there wont build anymore. Move them to the .c file instead. Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: paulmck@linux.vnet.ibm.com LKML-Reference: <1237898630.25315.83.camel@penberg-laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-03 12:23:04 +02:00
Ingo Molnar	6258c4fb59	kmemtrace, rcu: fix rcu_tree_trace.c data structure dependencies Impact: cleanup We want to remove rcutree internals from the public rcutree.h file for upcoming kmemtrace changes - but kernel/rcutree_trace.c depends on them. Introduce kernel/rcutree.h for internal definitions. (Probably all the other data types from include/linux/rcutree.h could be moved here too - except rcu_data.) Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: paulmck@linux.vnet.ibm.com LKML-Reference: <1237898630.25315.83.camel@penberg-laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-03 12:23:03 +02:00
Ingo Molnar	b1f77b0581	kmemtrace, rcu: fix linux/rcutree.h and linux/rcuclassic.h dependencies Impact: build fix for all non-x86 architectures We want to remove percpu.h from rcuclassic.h/rcutree.h (for upcoming kmemtrace changes) but that would break the DECLARE_PER_CPU based declarations in these files. Move the quiescent counter management functions to their respective RCU implementation .c files - they were slightly above the inlining limit anyway. Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: paulmck@linux.vnet.ibm.com LKML-Reference: <1237898630.25315.83.camel@penberg-laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-03 12:23:02 +02:00
Linus Torvalds	8fe74cf053	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: Remove two unneeded exports and make two symbols static in fs/mpage.c Cleanup after commit `585d3bc06f` Trim includes of fdtable.h Don't crap into descriptor table in binfmt_som Trim includes in binfmt_elf Don't mess with descriptor table in load_elf_binary() Get rid of indirect include of fs_struct.h New helper - current_umask() check_unsafe_exec() doesn't care about signal handlers sharing New locking/refcounting for fs_struct Take fs_struct handling to new file (fs/fs_struct.c) Get rid of bumping fs_struct refcount in pivot_root(2) Kill unsharing fs_struct in __set_personality()	2009-04-02 21:09:10 -07:00
Robin Holt	f5f7eac41d	Allow rwlocks to re-enable interrupts Pass the original flags to rwlock arch-code, so that it can re-enable interrupts if implemented for that architecture. Initially, make __raw_read_lock_flags and __raw_write_lock_flags stubs which just do the same thing as non-flags variants. Signed-off-by: Petr Tesarik <ptesarik@suse.cz> Signed-off-by: Robin Holt <holt@sgi.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <linux-arch@vger.kernel.org> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:11 -07:00
Robin Holt	e8c158bb31	Factor out #ifdefs from kernel/spinlock.c to LOCK_CONTENDED_FLAGS SGI has observed that on large systems, interrupts are not serviced for a long period of time when waiting for a rwlock. The following patch series re-enables irqs while waiting for the lock, resembling the code which is already there for spinlocks. I only made the ia64 version, because the patch adds some overhead to the fast path. I assume there is currently no demand to have this for other architectures, because the systems are not so large. Of course, the possibility to implement raw_{read\|write}_lock_flags for any architecture is still there. This patch: The new macro LOCK_CONTENDED_FLAGS expands to the correct implementation depending on the config options, so that IRQ's are re-enabled when possible, but they remain disabled if CONFIG_LOCKDEP is set. Signed-off-by: Petr Tesarik <ptesarik@suse.cz> Signed-off-by: Robin Holt <holt@sgi.com> Cc: <linux-arch@vger.kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:10 -07:00
Aravind Srinivasan	2c53d9109f	relay: fix for possible loss/corruption of produced subbufs Fix possible loss/corruption of produced subbufs in relay_subbufs_consumed(). When buf->subbufs_produced wraps around after UINT_MAX and buf->subbufs_consumed is still < UINT_MAX, the condition if (buf->subbufs_consumed > buf->subbufs_produced) will be true even for certain valid values of subbufs_consumed. This may lead to loss or corruption of produced subbufs. Signed-off-by: Aravind Srinivasan <raa.aars@gmail.com> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Tom Zanussi <zanussi@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:05 -07:00
Dmitri Vorobiev	edb79a2132	kexec: vmcoreinfo_data[] can become static The vmcoreinfo_data[] array is not used outside of kernel/kexec.c, and can therefore become static. This patch adds the relevant keyword to the definition of the array. Noticed by sparse. Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@movial.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:04 -07:00
Neil Horman	04d491ab2a	kexec: add dmesg log symbols to /proc/vmcoreinfo lists It would be nice to be able to extract the dmesg log from a vmcore file without needing to keep the debug symbols for the running kernel handy all the time. We have a facility to do this in /proc/vmcore. This patch adds the log_buf and log_end symbols to the vmcoreinfo area so that tools (like makedumpfile) can easily extract the dmesg logs from a vmcore image. [akpm@linux-foundation.org: several fixes and cleanups] [akpm@linux-foundation.org: fix unused log_buf_kexec_setup()] [akpm@linux-foundation.org: build fix] Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Cc: Simon Horman <horms@verge.net.au> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Simon Horman <horms@verge.net.au> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:04 -07:00
Oleg Nesterov	1b0f7ffd0e	pids: kill signal_struct-> __pgrp/__session and friends We are wasting 2 words in signal_struct without any reason to implement task_pgrp_nr() and task_session_nr(). task_session_nr() has no callers since `2e2ba22ea4`, we can remove it. task_pgrp_nr() is still (I believe wrongly) used in fs/autofsX and fs/coda. This patch reimplements task_pgrp_nr() via task_pgrp_nr_ns(), and kills __pgrp/__session and the related helpers. The change in drivers/char/tty_io.c is cosmetic, but hopefully makes sense anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Alan Cox <number6@the-village.bc.nu> [tty parts] Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:02 -07:00
Oleg Nesterov	52ee2dfdd4	pids: refactor vnr/nr_ns helpers to make them safe Inho, the safety rules for vnr/nr_ns helpers are horrible and buggy. task_pid_nr_ns(task) needs rcu/tasklist depending on task == current. As for "special" pids, vnr/nr_ns helpers always need rcu. However, if task != current, they are unsafe even under rcu lock, we can't trust task->group_leader without the special checks. And almost every helper has a callsite which needs a fix. Also, it is a bit annoying that the implementations of, say, task_pgrp_vnr() and task_pgrp_nr_ns() are not "symmetrical". This patch introduces the new helper, __task_pid_nr_ns(), which is always safe to use, and turns all other helpers into the trivial wrappers. After this I'll send another patch which converts task_tgid_xxx() as well, they're are a bit special. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Louis Rilling <Louis.Rilling@kerlabs.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:02 -07:00
Oleg Nesterov	2ae448efc8	pids: improve get_task_pid() to fix the unsafe sys_wait4()->task_pgrp() sys_wait4() does get_pid(task_pgrp(current)), this is not safe. We can add rcu lock/unlock around, but we already have get_task_pid() which can be improved to handle the special pids in more reliable manner. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Louis Rilling <Louis.Rilling@kerlabs.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:02 -07:00
Matthew Wilcox	8e654fba4a	sysctl: fix suid_dumpable and lease-break-time sysctls Arne de Bruijn points out that commit `76fdbb25f9` ("coredump masking: bound suid_dumpable sysctl") mistakenly limits lease-break-time instead of suid_dumpable. Signed-off-by: Matthew Wilcox <matthew@wil.cx> Reported-by: Arne de Bruijn <kernelbt@arbruijn.dds.nl> Cc: Kawai, Hidehiro <hidehiro.kawai.ez@hitachi.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:01 -07:00
Serge E. Hallyn	11dea19009	proc_sysctl: use CONFIG_PROC_SYSCTL around ipc and utsname proc_handlers As pointed out by Cedric Le Goater (in response to Alexey's original comment wrt mqns), ipc_sysctl.c and utsname_sysctl.c are using CONFIG_PROC_FS, not CONFIG_PROC_SYSCTL, to determine whether to define the proc_handlers. Change that. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Acked-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:01 -07:00
Lai Jiangshan	2355b70fd5	workqueue: avoid recursion in run_workqueue() 1) lockdep will complain when run_workqueue() performs recursion. 2) The recursive implementation of run_workqueue() means that flush_workqueue() and its documentation are inconsistent. This may hide deadlocks and other bugs. 3) The recursion in run_workqueue() will poison cwq->current_work, but flush_work() and __cancel_work_timer(), etcetera need a reliable cwq->current_work. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:00 -07:00
Oleg Nesterov	1ee1184485	ptrace_untrace: fix the SIGNAL_STOP_STOPPED check This bug is ancient too. ptrace_untrace() must not resume the task if the group stop in progress, we should set TASK_STOPPED instead. Unfortunately, we still have problems here: - if the process/thread was traced, SIGNAL_STOP_STOPPED does not necessary means this thread group is stopped. - ptrace breaks the bookkeeping of ->group_stop_count. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:00 -07:00
Oleg Nesterov	95a3540da9	ptrace_detach: the wrong wakeup breaks the ERESTARTxxx logic Another ancient bug. Consider this trivial test-case, int main(void) { int pid = fork(); if (pid) { ptrace(PTRACE_ATTACH, pid, NULL, NULL); wait(NULL); ptrace(PTRACE_DETACH, pid, NULL, NULL); } else { pause(); printf("WE HAVE A KERNEL BUG!!!\n"); } return 0; } the child must not "escape" for sys_pause(), but it can and this was seen in practice. This is because ptrace_detach does: if (!child->exit_state) wake_up_process(child); this wakeup can happen after this child has already restarted sys_pause(), because it gets another wakeup from ptrace_untrace(). With or without this patch, perhaps sys_pause() needs a fix. But this wakeup also breaks the SIGNAL_STOP_STOPPED logic in ptrace_untrace(). Remove this wakeup. The caller saw this task in TASK_TRACED state, and unless it was SIGKILL'ed in between __ptrace_unlink()->ptrace_untrace() should handle this case correctly. If it was SIGKILL'ed, we don't need to wakup the dying tracee too. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:00 -07:00
Oleg Nesterov	5dfc80be73	forget_original_parent: do not abuse child->ptrace_entry By discussion with Roland. - Use ->sibling instead of ->ptrace_entry to chain the need to be release_task'd childs. Nobody else can use ->sibling, this task is EXIT_DEAD and nobody can find it on its own list. - rename ptrace_dead to dead_childs. - Now that we don't have the "parallel" untrace code, change back reparent_thread() to return void, pass dead_childs as an argument. Actually, I don't understand why do we notify /sbin/init when we reparent a zombie, probably it is better to reap it unconditionally. [akpm@linux-foundation.org: s/childs/children/] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Metzger, Markus T" <markus.t.metzger@intel.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:00 -07:00
Oleg Nesterov	39c626ae47	forget_original_parent: split out the un-ptrace part By discussion with Roland. - Rename ptrace_exit() to exit_ptrace(), and change it to do all the necessary work with ->ptraced list by its own. - Move this code from exit.c to ptrace.c - Update the comment in ptrace_detach() to explain the rechecking of the child->ptrace. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Metzger, Markus T" <markus.t.metzger@intel.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:00 -07:00
Oleg Nesterov	7f5d3652d4	reparent_thread: fix a zombie leak if /sbin/init ignores SIGCHLD If /sbin/init ignores SIGCHLD and we re-parent a zombie, it is leaked. reparent_thread() does do_notify_parent() which sets ->exit_signal = -1 in this case. This means that nobody except us can reap it, the detached task is not visible to do_wait(). Change reparent_thread() to return a boolean (like __pthread_detach) to indicate that the thread is dead and must be released. Also change forget_original_parent() to add the child to ptrace_dead list in this case. The naming becomes insane, the next patch does the cleanup. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:59 -07:00
Oleg Nesterov	b1442b055c	reparent_thread: fix the "is it traced" check reparent_thread() uses ptrace_reparented() to check whether this thread is ptraced, in that case we should not notify the new parent. But ptrace_reparented() is not exactly correct when the reparented thread is traced by /sbin/init, because forget_original_parent() has already changed ->real_parent. Currently, the only problem is the false notification. But with the next patch the kernel crash in this (yes, pathological) case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:59 -07:00
Oleg Nesterov	0a967a044a	reparent_thread: don't call kill_orphaned_pgrp() if task_detached() If task_detached(p) == T, then either a) p is not the main thread, we will find the group leader on the ->children list. or b) p is the group leader but its ->exit_state = EXIT_DEAD. This can only happen when the last sub-thread has died, but in that case that thread has already called kill_orphaned_pgrp() from exit_notify(). In both cases kill_orphaned_pgrp() looks bogus. Move the task_detached() check up and simplify the code, this is also right from the "common sense" pov: we should do nothing with the detached childs, except move them to the new parent's ->children list. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:59 -07:00
Oleg Nesterov	4576145c1e	ptrace: fix possible zombie leak on PTRACE_DETACH When ptrace_detach() takes tasklist, the tracee can be SIGKILL'ed. If it has already passed exit_notify() we can leak a zombie, because a) ptracing disables the auto-reaping logic, and b) ->real_parent was not notified about the child's death. ptrace_detach() should follow the ptrace_exit's logic, change the code accordingly. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Tested-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:59 -07:00
Oleg Nesterov	b1b4c6799f	ptrace: reintroduce __ptrace_detach() as a callee of ptrace_exit() No functional changes, preparation for the next patch. Move the "should we release this child" logic into the separate handler, __ptrace_detach(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:59 -07:00
Oleg Nesterov	6d69cb87f0	ptrace: simplify ptrace_exit()->ignoring_children() path ignoring_children() takes parent->sighand->siglock and checks k_sigaction[SIGCHLD] atomically. But this buys nothing, we can't get the "really" wrong result even if we race with sigaction(SIGCHLD). If we read the "stale" sa_handler/sa_flags we can pretend it was changed right after the check. Remove spin_lock(->siglock), and kill "int ign" which caches the result of ignoring_children() which becomes rather trivial. Perhaps it makes sense to export this helper, do_notify_parent() can use it too. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:59 -07:00
Oleg Nesterov	95c3eb76dc	ptrace: kill __ptrace_detach(), fix ->exit_state check Move the code from __ptrace_detach() to its single caller and kill this helper. Also, fix the ->exit_state check, we shouldn't wake up EXIT_DEAD tasks. Actually, I think task_is_stopped_or_traced() makes more sense, but this needs another patch. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:59 -07:00
Sukadev Bhattiprolu	6588c1e3ff	signals: SI_USER: Masquerade si_pid when crossing pid ns boundary When sending a signal to a descendant namespace, set ->si_pid to 0 since the sender does not have a pid in the receiver's namespace. Note: - If rt_sigqueueinfo() sets si_code to SI_USER when sending a signal across a pid namespace boundary, the value in ->si_pid will be cleared to 0. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:58 -07:00
Sukadev Bhattiprolu	b3bfa0cba8	signals: protect cinit from blocked fatal signals Normally SIG_DFL signals to global and container-init are dropped early. But if a signal is blocked when it is posted, we cannot drop the signal since the receiver may install a handler before unblocking the signal. Once this signal is queued however, the receiver container-init has no way of knowing if the signal was sent from an ancestor or descendant namespace. This patch ensures that contianer-init drops all SIG_DFL signals in get_signal_to_deliver() except SIGKILL/SIGSTOP. If SIGSTOP/SIGKILL originate from a descendant of container-init they are never queued (i.e dropped in sig_ignored() in an earler patch). If SIGSTOP/SIGKILL originate from parent namespace, the signal is queued and container-init processes the signal. IOW, if get_signal_to_deliver() sees a sig_kernel_only() signal for global or container-init, the signal must have been generated internally or must have come from an ancestor ns and we process the signal. Further, the signal_group_exit() check was needed to cover the case of a multi-threaded init sending SIGKILL to other threads when doing an exit() or exec(). But since the new sig_kernel_only() check covers the SIGKILL, the signal_group_exit() check is no longer needed and can be removed. Finally, now that we have all pieces in place, set SIGNAL_UNKILLABLE for container-inits. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:58 -07:00
Sukadev Bhattiprolu	e4da026f98	signals: zap_pid_ns_process() should use force_sig() send_signal() assumes that signals with SEND_SIG_PRIV are generated from within the same namespace. So any nested container-init processes become immune to the SIGKILL generated by kill_proc_info() in zap_pid_ns_processes(). Use force_sig() in zap_pid_ns_processes() instead - force_sig() clears the SIGNAL_UNKILLABLE flag ensuring the signal is processed by container-inits. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:58 -07:00
Sukadev Bhattiprolu	921cf9f630	signals: protect cinit from unblocked SIG_DFL signals Drop early any SIG_DFL or SIG_IGN signals to container-init from within the same container. But queue SIGSTOP and SIGKILL to the container-init if they are from an ancestor container. Blocked, fatal signals (i.e when SIG_DFL is to terminate) from within the container can still terminate the container-init. That will be addressed in the next patch. Note: To be bisect-safe, SIGNAL_UNKILLABLE will be set for container-inits in a follow-on patch. Until then, this patch is just a preparatory step. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:58 -07:00
Sukadev Bhattiprolu	7978b567d3	signals: add from_ancestor_ns parameter to send_signal() send_signal() (or its helper) needs to determine the pid namespace of the sender. But a signal sent via kill_pid_info_as_uid() comes from within the kernel and send_signal() does not need to determine the pid namespace of the sender. So define a helper for send_signal() which takes an additional parameter, 'from_ancestor_ns' and have kill_pid_info_as_uid() use that helper directly. The 'from_ancestor_ns' parameter will be used in a follow-on patch. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:58 -07:00
Oleg Nesterov	f008faff0e	signals: protect init from unwanted signals more (This is a modified version of the patch submitted by Oleg Nesterov http://lkml.org/lkml/2008/11/18/249 and tries to address comments that came up in that discussion) init ignores the SIG_DFL signals but we queue them anyway, including SIGKILL. This is mostly OK, the signal will be dropped silently when dequeued, but the pending SIGKILL has 2 bad implications: - it implies fatal_signal_pending(), so we confuse things like wait_for_completion_killable/lock_page_killable. - for the sub-namespace inits, the pending SIGKILL can mask (legacy_queue) the subsequent SIGKILL from the parent namespace which must kill cinit reliably. (preparation, cinits don't have SIGNAL_UNKILLABLE yet) The patch can't help when init is ptraced, but ptracing of init is not "safe" anyway. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:58 -07:00
Oleg Nesterov	43918f2bf4	signals: remove 'handler' parameter to tracehook functions Container-init must behave like global-init to processes within the container and hence it must be immune to unhandled fatal signals from within the container (i.e SIG_DFL signals that terminate the process). But the same container-init must behave like a normal process to processes in ancestor namespaces and so if it receives the same fatal signal from a process in ancestor namespace, the signal must be processed. Implementing these semantics requires that send_signal() determine pid namespace of the sender but since signals can originate from workqueues/ interrupt-handlers, determining pid namespace of sender may not always be possible or safe. This patchset implements the design/simplified semantics suggested by Oleg Nesterov. The simplified semantics for container-init are: - container-init must never be terminated by a signal from a descendant process. - container-init must never be immune to SIGKILL from an ancestor namespace (so a process in parent namespace must always be able to terminate a descendant container). - container-init may be immune to unhandled fatal signals (like SIGUSR1) even if they are from ancestor namespace. SIGKILL/SIGSTOP are the only reliable signals to a container-init from ancestor namespace. This patch: Based on an earlier patch submitted by Oleg Nesterov and comments from Roland McGrath (http://lkml.org/lkml/2008/11/19/258). The handler parameter is currently unused in the tracehook functions. Besides, the tracehook functions are called with siglock held, so the functions can check the handler if they later need to. Removing the parameter simiplifies changes to sig_ignored() in a follow-on patch. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:58 -07:00
Oleg Nesterov	90bc8d8b1a	do_wait: fix waiting for the group stop with the dead leader do_wait(WSTOPPED) assumes that p->state must be == TASK_STOPPED, this is not true if the leader is already dead. Check SIGNAL_STOP_STOPPED instead and use signal->group_exit_code. Trivial test-case: void tfunc(void arg) { pause(); return NULL; } int main(void) { pthread_t thr; pthread_create(&thr, NULL, tfunc, NULL); pthread_exit(NULL); return 0; } It doesn't react to ^Z (and then to ^C or ^\). The task is stopped, but bash can't see this. The bug is very old, and it was reported multiple times. This patch was sent more than a year ago (http://marc.info/?t=119713920000003) but it was ignored. This change also fixes other oddities (but not all) in this area. For example, before this patch: $ sleep 100 ^Z [1]+ Stopped sleep 100 $ strace -p `pidof sleep` Process 11442 attached - interrupt to quit strace hangs in do_wait(), because ->exit_code was already consumed by bash. After this patch, strace happily proceeds: --- SIGTSTP (Stopped) @ 0 (0) --- restart_syscall(<... resuming interrupted call ...> To me, this looks much more "natural" and correct. Another example. Let's suppose we have the main thread M and sub-thread T, the process is stopped, and its parent did wait(WSTOPPED). Now we can ptrace T but not M. This looks at least strange to me. Imho, do_wait() should not confuse the per-thread ptrace stops with the per-process job control stops. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jan Kratochvil <jan.kratochvil@redhat.com> Cc: Kaz Kylheku <kkylheku@gmail.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Roland McGrath <roland@redhat.com> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:57 -07:00
David Rientjes	6d7b2f5f9e	cpusets: prevent PF_THREAD_BOUND tasks from attaching to non-root cpusets Kthreads that have the PF_THREAD_BOUND bit set in their flags are bound to a specific cpu. Thus, their set of allowed cpus shall not change. This patch prevents such threads from attaching to non-root cpusets. They do not have mempolicies that restrict them to a subset of system nodes and, since their cpumask may never change, they cannot use any of the features of cpusets. The tasks will forever be a member of the root cpuset and will be returned when listing the tasks attached to that cpuset. Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:57 -07:00
Paul Menage	db7f47cf48	cpusets: allow cpusets to be configured/built on non-SMP systems Allow cpusets to be configured/built on non-SMP systems Currently it's impossible to build cpusets under UML on x86-64, since cpusets depends on SMP and x86-64 UML doesn't support SMP. There's code in cpusets that doesn't depend on SMP. This patch surrounds the minimum amount of cpusets code with #ifdef CONFIG_SMP in order to allow cpusets to build/run on UP systems (for testing purposes under UML). Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:57 -07:00
David Rientjes	a1bc5a4eee	cpusets: replace zone allowed functions with node allowed The cpuset_zone_allowed() variants are actually only a function of the zone's node. Cc: Paul Menage <menage@google.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:57 -07:00
Li Zefan	7f81b1ae18	cpuset: remove struct cpuset_hotplug_scanner Use cgroup_scanner.data, instead of introducing cpuset_hotplug_scanner. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:57 -07:00
Li Zefan	010cfac4ca	cpuset: avoid changing cpuset's mems when errno returned When writing to cpuset.mems, cpuset has to update its mems_allowed before calling update_tasks_nodemask(), but this function might return -ENOMEM. To avoid this rare case, we allocate the memory before changing mems_allowed, and then pass to update_tasks_nodemask(). Similar to what update_cpumask() does. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:57 -07:00
Li Zefan	3b6766fe66	cpuset: rewrite update_tasks_nodemask() This patch uses cgroup_scan_tasks() to rebind tasks' vmas to new cpuset's mems_allowed. Not only simplify the code largely, but also avoid allocating an array to hold mm pointers of all the tasks in the cpuset. This array can be big (size > PAGESIZE) if we have lots of tasks in that cpuset, thus has a chance to fail the allocation when under memory stress. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:57 -07:00
Li Zefan	0b4217b3fd	cpuset: fix possible races in cpu/memory hotplug Change to cpuset->cpus_allowed and cpuset->mems_allowed should be protected by callback_mutex, otherwise the reader may read wrong cpus/mems. This is cpuset's lock rule. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:56 -07:00
KAMEZAWA Hiroyuki	0b7f569e45	memcg: fix OOM killer under memcg This patch tries to fix OOM Killer problems caused by hierarchy. Now, memcg itself has OOM KILL function (in oom_kill.c) and tries to kill a task in memcg. But, when hierarchy is used, it's broken and correct task cannot be killed. For example, in following cgroup /groupA/ hierarchy=1, limit=1G, 01 nolimit 02 nolimit All tasks' memory usage under /groupA, /groupA/01, groupA/02 is limited to groupA's 1Gbytes but OOM Killer just kills tasks in groupA. This patch provides makes the bad process be selected from all tasks under hierarchy. BTW, currently, oom_jiffies is updated against groupA in above case. oom_jiffies of tree should be updated. To see how oom_jiffies is used, please check mem_cgroup_oom_called() callers. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: const fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:55 -07:00
Li Zefan	d969fbe69e	debug cgroup: remove unneeded cgroup_lock Since we are in cgroup write handler, so the cgrp is valid, so we don't have to hold cgroup_mutex when calling cgroup_task_count(). One similar example is in cgroup_tasks_open(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:54 -07:00
Li Zefan	0670e08bdf	cgroups: don't change release_agent when remount failed Remount can fail in either case: - wrong mount options is specified, or option 'noprefix' is changed. - a to-be-added subsys is already mounted/active. When using remount to change 'release_agent', for the above former failure case, remount will return errno with release_agent unchanged, but for the latter case, remount will return EBUSY with relase_agent changed, which is unexpected I think: # mount -t cgroup -o cpu xxx /cgrp1 # mount -t cgroup -o cpuset,release_agent=agent1 yyy /cgrp2 # cat /cgrp2/release_agent agent1 # mount -t cgroup -o remount,cpuset,noprefix,release_agent=agent2 yyy /cgrp2 mount: /cgrp2 not mounted already, or bad option # cat /cgrp2/release_agent agent1 <-- ok # mount -t cgroup -o remount,cpu,cpuset,release_agent=agent2 yyy /cgrp2 mount: /cgrp2 is busy # cat /cgrp2/release_agent agent2 <-- unexpected! Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:54 -07:00
Li Zefan	099fca3225	cgroups: show correct file mode We have some read-only files and write-only files, but currently they are all set to 0644, which is counter-intuitive and cause trouble for some cgroup tools like libcgroup. This patch adds 'mode' to struct cftype to allow cgroup subsys to set it's own files' file mode, and for the most cases cft->mode can be default to 0 and cgroup will figure out proper mode. Acked-by: Paul Menage <menage@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:54 -07:00
Jesper Juhl	66bdc9cfc7	kernel/cgroup.c: kfree(NULL) is legal Reduces object file size a bit: Before: $ size kernel/cgroup.o text data bss dec hex filename 21593 7804 4924 34321 8611 kernel/cgroup.o After: $ size kernel/cgroup.o text data bss dec hex filename 21537 7744 4924 34205 859d kernel/cgroup.o Signed-off-by: Jesper Juhl <jj@chaosbits.net> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:54 -07:00
KAMEZAWA Hiroyuki	ec64f51545	cgroup: fix frequent -EBUSY at rmdir In following situation, with memory subsystem, /groupA use_hierarchy==1 /01 some tasks /02 some tasks /03 some tasks /04 empty When tasks under 01/02/03 hit limit on /groupA, hierarchical reclaim is triggered and the kernel walks tree under groupA. In this case, rmdir /groupA/04 fails with -EBUSY frequently because of temporal refcnt from the kernel. In general. cgroup can be rmdir'd if there are no children groups and no tasks. Frequent fails of rmdir() is not useful to users. (And the reason for -EBUSY is unknown to users.....in most cases) This patch tries to modify above behavior, by - retries if css_refcnt is got by someone. - add "return value" to pre_destroy() and allows subsystem to say "we're really busy!" Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:54 -07:00
KAMEZAWA Hiroyuki	38460b48d0	cgroup: CSS ID support Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code. This patch attaches unique ID to each css and provides following. - css_lookup(subsys, id) returns pointer to struct cgroup_subysys_state of id. - css_get_next(subsys, id, rootid, depth, foundid) returns the next css under "root" by scanning When cgroup_subsys->use_id is set, an id for css is maintained. The cgroup framework only parepares - css_id of root css for subsys - id is automatically attached at creation of css. - id is not freed automatically. Because the cgroup framework don't know lifetime of cgroup_subsys_state. free_css_id() function is provided. This must be called by subsys. There are several reasons to develop this. - Saving space .... For example, memcg's swap_cgroup is array of pointers to cgroup. But it is not necessary to be very fast. By replacing pointers(8bytes per ent) to ID (2byes per ent), we can reduce much amount of memory usage. - Scanning without lock. CSS_ID provides "scan id under this ROOT" function. By this, scanning css under root can be written without locks. ex) do { rcu_read_lock(); next = cgroup_get_next(subsys, id, root, &found); /* check sanity of next here */ css_tryget(); rcu_read_unlock(); id = found + 1 } while(...) Characteristics: - Each css has unique ID under subsys. - Lifetime of ID is controlled by subsys. - css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy - Allowed ID is 1-65535, ID 0 is UNUSED ID. Design Choices: - scan-by-ID v.s. scan-by-tree-walk. As /proc's pid scan does, scan-by-ID is robust when scanning is done by following kind of routine. scan -> rest a while(release a lock) -> conitunue from interrupted memcg's hierarchical reclaim does this. - When subsys->use_id is set, # of css in the system is limited to 65535. [bharata@linux.vnet.ibm.com: remove rcu_read_lock() from css_get_next()] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:53 -07:00
Grzegorz Nosek	313e924c08	cgroups: relax ns_can_attach checks to allow attaching to grandchild cgroups The ns_proxy cgroup allows moving processes to child cgroups only one level deep at a time. This commit relaxes this restriction and makes it possible to attach tasks directly to grandchild cgroups, e.g.: ($pid is in the root cgroup) echo $pid > /cgroup/CG1/CG2/tasks Previously this operation would fail with -EPERM and would have to be performed as two steps: echo $pid > /cgroup/CG1/tasks echo $pid > /cgroup/CG1/CG2/tasks Also, the target cgroup no longer needs to be empty to move a task there. Signed-off-by: Grzegorz Nosek <root@localdomain.pl> Acked-by: Serge Hallyn <serue@us.ibm.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:53 -07:00
Alexey Dobriyan	6f2c55b843	Simplify copy_thread() First argument unused since 2.3.11. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:51 -07:00
David Howells	33e5d76979	nommu: fix a number of issues with the per-MM VMA patch Fix a number of issues with the per-MM VMA patch: (1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on a NOMMU system with more than 2G pages. Makes no difference on a 32-bit system. (2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value, lest it overflow. (3) Move the allocation of the vm_area_struct slab back for fork.c. (4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs. (5) Use BUG_ON() rather than if () BUG(). (6) Make the default validate_nommu_regions() a static inline rather than a #define. (7) Make free_page_series()'s objection to pages with a refcount != 1 more informative. (8) Adjust the __put_nommu_region() banner comment to indicate that the semaphore must be held for writing. (9) Limit the number of warnings about munmaps of non-mmapped regions. Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:48 -07:00
Darren Hart	cd84a42f31	futex: comment requeue key reference semantics We've tripped over the futex_requeue drop_count refering to key2 instead of key1. The code is actually correct, but is non-intuitive. This patch adds an explicit comment explaining the requeue. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-02 23:39:53 +02:00

... 2 3 4 5 6 ...

7053 commits