linux

q3k/linux

Author	SHA1	Message	Date
Pallipadi, Venkatesh	6ec3cfeca0	x86, irq: Remove IRQ_DISABLED check in process context IRQ move As discussed in the thread here: http://marc.info/?l=linux-kernel&m=123964468521142&w=2 Eric W. Biederman observed: > It looks like some additional bugs have slipped in since last I looked. > > set_irq_affinity does this: > ifdef CONFIG_GENERIC_PENDING_IRQ > if (desc->status & IRQ_MOVE_PCNTXT \|\| desc->status & IRQ_DISABLED) { > cpumask_copy(desc->affinity, cpumask); > desc->chip->set_affinity(irq, cpumask); > } else { > desc->status \|= IRQ_MOVE_PENDING; > cpumask_copy(desc->pending_mask, cpumask); > } > #else > > That IRQ_DISABLED case is a software state and as such it has nothing to > do with how safe it is to move an irq in process context. [...] > > The only reason we migrate MSIs in interrupt context today is that there > wasn't infrastructure for support migration both in interrupt context > and outside of it. Yes. The idea here was to force the MSI migration to happen in process context. One of the patches in the series did disable_irq(dev->irq); irq_set_affinity(dev->irq, cpumask_of(dev->cpu)); enable_irq(dev->irq); with the above patch adding irq/manage code check for interrupt disabled and moving the interrupt in process context. IIRC, there was no IRQ_MOVE_PCNTXT when we were developing this HPET code and we ended up having this ugly hack. IRQ_MOVE_PCNTXT was there when we eventually submitted the patch upstream. But, looks like I did a blind rebasing instead of using IRQ_MOVE_PCNTXT in hpet MSI code. Below patch fixes this. i.e., revert commit `932775a4ab` and add PCNTXT to HPET MSI setup. Also removes copying of desc->affinity in generic code as set_affinity routines are doing it internally. Reported-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Li Shaohua" <shaohua.li@intel.com> Cc: Gary Hade <garyhade@us.ibm.com> Cc: "lcm@us.ibm.com" <lcm@us.ibm.com> Cc: suresh.b.siddha@intel.com LKML-Reference: <20090413222058.GB8211@linux-os.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-14 15:21:13 +02:00
Paul E. McKenney	ef631b0ca0	rcu: Make hierarchical RCU less IPI-happy This patch fixes a hierarchical-RCU performance bug located by Anton Blanchard. The problem stems from a misguided attempt to provide a work-around for jiffies-counter failure. This work-around uses a per-CPU n_rcu_pending counter, which is incremented on each call to rcu_pending(), which in turn is called from each scheduling-clock interrupt. Each CPU then treats this counter as a surrogate for the jiffies counter, so that if the jiffies counter fails to advance, the per-CPU n_rcu_pending counter will cause RCU to invoke force_quiescent_state(), which in turn will (among other things) send resched IPIs to CPUs that have thus far failed to pass through an RCU quiescent state. Unfortunately, each CPU resets only its own counter after sending a batch of IPIs. This means that the other CPUs will also (needlessly) send -another- round of IPIs, for a full N-squared set of IPIs in the worst case every three scheduler-clock ticks until the grace period finally ends. It is not reasonable for a given CPU to reset each and every n_rcu_pending for all the other CPUs, so this patch instead simply disables the jiffies-counter "training wheels", thus eliminating the excessive IPIs. Note that the jiffies-counter IPIs do not have this problem due to the fact that the jiffies counter is global, so that the CPU sending the IPIs can easily reset things, thus preventing the other CPUs from sending redundant IPIs. Note also that the n_rcu_pending counter remains, as it will continue to be used for tracing. It may also see use to update the jiffies counter, should an appropriate kick-the-jiffies-counter API appear. Located-by: Anton Blanchard <anton@au1.ibm.com> Tested-by: Anton Blanchard <anton@au1.ibm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: anton@samba.org Cc: akpm@linux-foundation.org Cc: dipankar@in.ibm.com Cc: manfred@colorfullife.com Cc: cl@linux-foundation.org Cc: josht@linux.vnet.ibm.com Cc: schamp@sgi.com Cc: niv@us.ibm.com Cc: dvhltc@us.ibm.com Cc: ego@in.ibm.com Cc: laijs@cn.fujitsu.com Cc: rostedt@goodmis.org Cc: peterz@infradead.org Cc: penberg@cs.helsinki.fi Cc: andi@firstfloor.org Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> LKML-Reference: <12396834793575-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-14 11:31:50 +02:00
Zhaolei	557055bebe	tracing: Fix branch tracer header Before patch: # tracer: branch # # TASK-PID CPU# TIMESTAMP FUNCTION # \| \| \| \| \| <...>-2981 [000] 24008.872738: [ ok ] trace_irq_handler_exit:irq_event_types.h:41 <...>-2981 [000] 24008.872742: [ ok ] note_interrupt:spurious.c:229 ... After patch: # tracer: branch # # TASK-PID CPU# TIMESTAMP CORRECT FUNC:FILE:LINE # \| \| \| \| \| \| <...>-2985 [000] 26329.142970: [ ok ] slab_free:slub.c:1776 <...>-2985 [000] 26329.142972: [ ok ] trace_kmem_cache_free:kmem_event_types.h:191 ... Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> LKML-Reference: <49E2F19A.3040006@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-14 02:30:36 +02:00
Lai Jiangshan	132380a06b	tracing, sched: mark get_parent_ip() notrace Impact: remove overly redundant tracing entries When tracer is "function" or "function_graph", way too much "get_parent_ip" entries are recorded in ring_buffer. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Steven Rostedt <srostedt@redhat.com> LKML-Reference: <49D458B1.5000703@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-14 02:11:05 +02:00
Andi Kleen	3d26dcf767	kernel/sys.c: clean up sys_shutdown exit path Impact: cleanup, fix Clean up sys_shutdown() exit path. Factor out common code. Return correct error code instead of always 0 on failure. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-13 15:04:32 -07:00
Oleg Nesterov	f1671f6d78	ptrace: fix exit_ptrace() vs ptrace_traceme() race Pointed out by Roland. The bug was recently introduced by me in "forget_original_parent: split out the un-ptrace part", commit `39c626ae47`. Since that patch we have a window after exit_ptrace() drops tasklist and before forget_original_parent() takes it again. In this window the child can do ptrace(PTRACE_TRACEME) and nobody can untrace this child after that. Change ptrace_traceme() to not attach to the exiting ->real_parent. We don't report the error in this case, we pretend we attach right before ->real_parent calls exit_ptrace() which should untrace us anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-13 15:04:31 -07:00
Peter Zijlstra	4be6f6bb66	mm: move the scan_unevictable_pages sysctl to the vm table vm knobs should go in the vm table. Probably too late for randomize_va_space though. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-13 15:04:28 -07:00
Zhaolei	a3d03ecaf9	tracing: Fix power tracer header Before patch: # tracer: power # # TASK-PID CPU# TIMESTAMP FUNCTION # \| \| \| \| \| [ 676.875865889] CSTATE: Going to C1 on cpu 0 for 0.005911463 [ 676.882938805] CSTATE: Going to C1 on cpu 0 for 0.104796532 ... After patch: # tracer: power # # TIMESTAMP STATE EVENT # \| \| \| [ 676.875865889] CSTATE: Going to C1 on cpu 0 for 0.005911463 [ 676.882938805] CSTATE: Going to C1 on cpu 0 for 0.104796532 ... v2: Use seq_puts instead of seq_printf Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Tom Zanussi <tzanussi@gmail.com> LKML-Reference: <49E2E889.5000903@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-13 23:39:57 +02:00
Rafael J. Wysocki	c751085943	PM/Hibernate: Wait for SCSI devices scan to complete during resume There is a race between resume from hibernation and the asynchronous scanning of SCSI devices and to prevent it from happening we need to call scsi_complete_async_scans() during resume from hibernation. In addition, if the resume from hibernation is userland-driven, it's better to wait for all device probes in the kernel to complete before attempting to open the resume device. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-13 11:37:07 -07:00
Linus Torvalds	8255309b88	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing/filters: return proper error code when writing filter file tracing/filters: allow user input integer to be oct or hex tracing/filters: fix NULL pointer dereference tracing/filters: NIL-terminate user input filter ftrace: Output REC->var instead of __entry->var for trace format Make __stringify support variable argument macros too tracing: fix document references tracing: fix splice return too large tracing: update file->f_pos when splice(2) it tracing: allocate page when needed tracing: disable seeking for trace_pipe_raw	2009-04-13 11:31:28 -07:00
Linus Torvalds	bf20753c0c	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: continue lock debugging despite some taints lockdep: warn about lockdep disabling after kernel taint	2009-04-13 11:30:26 -07:00
Linus Torvalds	d811f236d9	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: percpu: unbreak alpha percpu mutex: have non-spinning mutexes on s390 by default	2009-04-13 08:22:43 -07:00
Frederic Weisbecker	574bbe7820	lockdep: continue lock debugging despite some taints Impact: broaden lockdep checks Lockdep is disabled after any kernel taints. This might be convenient to ignore bad locking issues which sources come from outside the kernel tree. Nevertheless, it might be a frustrating experience for the staging developers or those who experience a warning but are focused on another things that require lockdep. The v2 of this patch simply don't disable anymore lockdep in case of TAINT_CRAP and TAINT_WARN events. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: LTP <ltp-list@lists.sourceforge.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Greg KH <gregkh@suse.de> LKML-Reference: <1239412638-6739-2-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-12 16:10:52 +02:00
Frederic Weisbecker	9eeba6138c	lockdep: warn about lockdep disabling after kernel taint Impact: provide useful missing info for developers Kernel taint can occur in several situations such as warnings, load of prorietary or staging modules, bad page, etc... But when such taint happens, a developer might still be working on the kernel, expecting that lockdep is still enabled. But a taint disables lockdep without ever warning about it. Such a kernel behaviour doesn't really help for kernel development. This patch adds this missing warning. Since the taint is done most of the time after the main message that explain the real source issue, it seems safe to warn about it inside add_taint() so that it appears at last, without hurting the main information. v2: Use a generic helper to disable lockdep instead of an open coded xchg(). Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <1239412638-6739-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-12 16:10:51 +02:00
Li Zefan	44e9c8b7ad	tracing/filters: return proper error code when writing filter file - propagate return value of filter_add_pred() to the user - return -ENOSPC but not -ENOMEM or -EINVAL when the filter array is full Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Tom Zanussi <tzanussi@gmail.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <49E04CF0.3010105@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-12 11:59:29 +02:00
Li Zefan	a3e0ab0507	tracing/filters: allow user input integer to be oct or hex Before patch: # echo 'parent_pid == 0x10' > events/sched/sched_process_fork/filter # cat sched/sched_process_fork/filter parent_pid == 0 After patch: # cat sched/sched_process_fork/filter parent_pid == 16 Also check the input more strictly. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Tom Zanussi <tzanussi@gmail.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <49E04C53.4010600@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-12 11:59:28 +02:00
Li Zefan	bcabd91c27	tracing/filters: fix NULL pointer dereference Try this, and you'll see NULL pointer dereference bug: # echo -n 'parent_comm ==' > sched/sched_process_fork/filter Because we passed NULL ptr to simple_strtoull(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Tom Zanussi <tzanussi@gmail.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <49E04C43.1050504@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-12 11:59:28 +02:00
Li Zefan	8433a40eb7	tracing/filters: NIL-terminate user input filter Make sure messages from user space are NIL-terminated strings, otherwise we could dump random memory while reading filter file. Try this: # echo 'parent_comm ==' > events/sched/sched_process_fork/filter # cat events/sched/sched_process_fork/filter parent_comm == � Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Tom Zanussi <tzanussi@gmail.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <49E04C32.6060508@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-12 11:59:27 +02:00
Linus Torvalds	d6de2c80e9	async: Fix module loading async-work regression Several drivers use asynchronous work to do device discovery, and we synchronize with them in the compiled-in case before we actually try to mount root filesystems etc. However, when compiled as modules, that synchronization is missing - the module loading completes, but the driver hasn't actually finished probing for devices, and that means that any user mode that expects to use the devices after the 'insmod' is now potentially broken. We already saw one case of a similar issue in the ACPI battery code, where the kernel itself expected the module to be all done, and unmapped the init memory - but the async device discovery was still running. That got hacked around by just removing the "__init" (see commit `5d38258ec0` "ACPI battery: fix async boot oops"), but the real fix is to just make the module loading wait for all async work to be completed. It will slow down module loading, but since common devices should be built in anyway, and since the bug is really annoying and hard to handle from user space (and caused several S3 resume regressions), the simple fix to wait is the right one. This fixes at least http://bugzilla.kernel.org/show_bug.cgi?id=13063 but probably a few other bugzilla entries too (12936, for example), and is confirmed to fix Rafael's storage driver breakage after resume bug report (no bugzilla entry). We should also be able to now revert that ACPI battery fix. Reported-and-tested-by: Rafael J. Wysocki <rjw@suse.com> Tested-by: Heinz Diehl <htd@fancy-poultry.org> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-11 12:44:49 -07:00
Zhaolei	0462b5664b	ftrace: Output REC->var instead of __entry->var for trace format print fmt: "irq=%d return=%s", __entry->irq, __entry->ret ? \"handled\" : \"unhandled\" "__entry" should be convert to "REC" by __stringify() macro. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <49DC679D.2090901@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-10 15:48:53 +02:00
Li Zefan	4d1f4372db	tracing: fix document references When moving documents to Documentation/trace/, I forgot to grep Kconfig to find out those references. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Pekka Paalanen <pq@iki.fi> Cc: eduard.munteanu@linux360.ro LKML-Reference: <49DE97EF.7080208@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-10 13:08:50 +02:00
Lai Jiangshan	93cfb3c9fd	tracing: fix splice return too large I got these from strace: splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 12288 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 12288 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 12288 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 16384 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 8192 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 8192 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 8192 I wanted to splice_read 4096 bytes, but it returns 8192 or larger. It is because the return value of tracing_buffers_splice_read() does not include "zero out any left over data" bytes. But tracing_buffers_read() includes these bytes, we make them consistent. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <srostedt@redhat.com> LKML-Reference: <49D46674.9030804@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-10 12:44:46 +02:00
Lai Jiangshan	c7625a555f	tracing: update file->f_pos when splice(2) it Impact: Cleanup These two lines: if (unlikely(*ppos)) return -ESPIPE; in tracing_buffers_splice_read() are not needed, VFS layer has disabled seek(2). We remove these two lines, and then we can update file->f_pos. And tracing_buffers_read() updates file->f_pos, this fix make tracing_buffers_splice_read() updates file->f_pos too. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <srostedt@redhat.com> LKML-Reference: <49D46670.4010503@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-10 12:44:44 +02:00
Lai Jiangshan	ddd538f3e6	tracing: allocate page when needed Impact: Cleanup Sometimes, we open trace_pipe_raw, but we don't read(2) it, we just splice(2) it, thus, the page is not used. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <srostedt@redhat.com> LKML-Reference: <49D4666B.4010608@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-10 12:44:43 +02:00
Lai Jiangshan	d1e7e02f30	tracing: disable seeking for trace_pipe_raw Impact: disable pread() We set tracing_buffers_fops.llseek to no_llseek, but we can still perform pread() to read this file. That is not expected. This fix uses nonseekable_open() to disable it. tracing_buffers_fops.llseek is still set to no_llseek, it mark this file is a "non-seekable device" and is used by sys_splice(). See also do_splice() or manual of splice(2): ERRORS EINVAL Target file system doesn't support splicing; neither of the descriptors refers to a pipe; or offset given for non-seekable device. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <srostedt@redhat.com> LKML-Reference: <49D46668.8030806@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-10 12:44:42 +02:00
Linus Torvalds	c2ea122cd7	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: consolidate documents blktrace: pass the right pointer to kfree() tracing/syscalls: use a dedicated file header tracing: append a comma to INIT_FTRACE_GRAPH	2009-04-09 10:37:46 -07:00
Linus Torvalds	17b2e9bf27	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: do not count frozen tasks toward load sched: refresh MAINTAINERS entry sched: Print sched_group::__cpu_power in sched_domain_debug cpuacct: add per-cgroup utime/stime statistics posixtimers, sched: Fix posix clock monotonicity sched_rt: don't allocate cpumask in fastpath cpuacct: make cpuacct hierarchy walk in cpuacct_charge() safe when rcupreempt is used -v2	2009-04-09 10:37:28 -07:00
Linus Torvalds	422a253483	Merge branches 'core-fixes-for-linus', 'irq-fixes-for-linus' and 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: printk: fix wrong format string iter for printk futex: comment requeue key reference semantics * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: irq: fix cpumask memory leak on offstack cpumask kernels * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: posix-timers: fix RLIMIT_CPU && setitimer(CPUCLOCK_PROF) posix-timers: fix RLIMIT_CPU && fork() timers: add missing kernel-doc	2009-04-09 10:35:30 -07:00
Heiko Carstens	36cd3c9f92	mutex: have non-spinning mutexes on s390 by default Impact: performance regression fix for s390 The adaptive spinning mutexes will not always do what one would expect on virtualized architectures like s390. Especially the cpu_relax() loop in mutex_spin_on_owner might hurt if the mutex holding cpu has been scheduled away by the hypervisor. We would end up in a cpu_relax() loop when there is no chance that the state of the mutex changes until the target cpu has been scheduled again by the hypervisor. For that reason we should change the default behaviour to no-spin on s390. We do have an instruction which allows to yield the current cpu in favour of a different target cpu. Also we have an instruction which allows us to figure out if the target cpu is physically backed. However we need to do some performance tests until we can come up with a solution that will do the right thing on s390. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> LKML-Reference: <20090409184834.7a0df7b2@osiris.boeblingen.de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-09 19:28:24 +02:00
Peter Zijlstra	d3d21c412d	perf_counter: log full path names Impact: fix perf-report output for /home mounted binaries, etc. dentry_path() only provide path-names up to the mount root, which is unsuited for out purpose, use d_path() instead. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090409085524.601794134@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-09 11:50:54 +02:00
Peter Zijlstra	1ccd154978	perf_counter: sysctl for system wide perf counters Impact: add sysctl for paranoid/relaxed perfcounters policy Allow the use of system wide perf counters to everybody, but provide a sysctl to disable it for the paranoid security minded. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090409085524.514046352@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-09 11:50:52 +02:00
Peter Zijlstra	9ee318a782	perf_counter: optimize mmap/comm tracking Impact: performance optimization The mmap/comm tracking code does quite a lot of work before it discovers there's no interest in it, avoid that by keeping a counter. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090409085524.427173196@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-09 11:50:43 +02:00
Ingo Molnar	888fcee066	perf_counter: fix off task->comm by one strlen() does not include the \0. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-09 09:48:22 +02:00
Li Zefan	9eb85125ce	blktrace: pass the right pointer to kfree() Impact: fix kfree crash with non-standard act_mask string If passing a string with leading white spaces to strstrip(), the returned ptr != the original ptr. This bug was introduced by me. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <49DD694C.8020902@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-09 05:52:40 +02:00
Frederic Weisbecker	47788c58e6	tracing/syscalls: use a dedicated file header Impact: fix build warnings and possibe compat misbehavior on IA64 Building a kernel on ia64 might trigger these ugly build warnings: CC arch/ia64/ia32/sys_ia32.o In file included from arch/ia64/ia32/sys_ia32.c:55: arch/ia64/ia32/ia32priv.h:290:1: warning: "elf_check_arch" redefined In file included from include/linux/elf.h:7, from include/linux/module.h:14, from include/linux/ftrace.h:8, from include/linux/syscalls.h:68, from arch/ia64/ia32/sys_ia32.c:18: arch/ia64/include/asm/elf.h:19:1: warning: this is the location of the previous definition [...] sys_ia32.c includes linux/syscalls.h which in turn includes linux/ftrace.h to import the syscalls tracing prototypes. But including ftrace.h can pull too much things for a low level file, especially on ia64 where the ia32 private headers conflict with higher level headers. Now we isolate the syscall tracing headers in their own lightweight file. Reported-by: Tony Luck <tony.luck@intel.com> Tested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jason Baron <jbaron@redhat.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Jiaying Zhang <jiayingz@google.com> Cc: Michael Rubin <mrubin@google.com> Cc: Martin Bligh <mbligh@google.com> Cc: Michael Davidson <md@google.com> LKML-Reference: <20090408184058.GB6017@nowhere> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-09 05:43:32 +02:00
Andrew Morton	6b44003e5c	work_on_cpu(): rewrite it to create a kernel thread on demand Impact: circular locking bugfix The various implemetnations and proposed implemetnations of work_on_cpu() are vulnerable to various deadlocks because they all used queues of some form. Unrelated pieces of kernel code thus gained dependencies wherein if one work_on_cpu() caller holds a lock which some other work_on_cpu() callback also takes, the kernel could rarely deadlock. Fix this by creating a short-lived kernel thread for each work_on_cpu() invokation. This is not terribly fast, but the only current caller of work_on_cpu() is pci_call_probe(). It would be nice to find some other way of doing the node-local allocations in the PCI probe code so that we can zap work_on_cpu() altogether. The code there is rather nasty. I can't think of anything simple at this time... Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-04-09 09:50:37 +09:30
Oleg Nesterov	1c99315bb3	kthread: move sched-realeted initialization from kthreadd context kthreadd is the single thread which implements ths "create" request, move sched_setscheduler/etc from create_kthread() to kthread_create() to improve the scalability. We should be careful with sched_setscheduler(), use _nochek helper. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Vitaliy Gusev <vgusev@openvz.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-04-09 09:50:37 +09:30
Vitaliy Gusev	3217ab97f1	kthread: Don't looking for a task in create_kthread() #2 Remove the unnecessary find_task_by_pid_ns(). kthread() can just use "current" to get the same result. Signed-off-by: Vitaliy Gusev <vgusev@openvz.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-04-09 09:50:36 +09:30
Roland McGrath	3a70970353	ptrace: some checkpatch fixes This fixes all the checkpatch --file complaints about kernel/ptrace.c and also removes an unused #include. I've verified that there are no changes to the compiled code on x86_64. Signed-off-by: Roland McGrath <roland@redhat.com> [ Removed the parts that just split a line - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-08 10:21:44 -07:00
Peter Zijlstra	78f13e9525	perf_counter: allow for data addresses to be recorded Paul suggested we allow for data addresses to be recorded along with the traditional IPs as power can provide these. For now, only the software pagefault events provide data addresses, but in the future power might as well for some events. x86 doesn't seem capable of providing this atm. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090408130409.394816925@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 19:05:56 +02:00
Peter Zijlstra	4d855457d8	perf_counter: move PERF_RECORD_TIME Move PERF_RECORD_TIME so that all the fixed length items come before the variable length ones. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090408130409.307926436@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 19:05:55 +02:00
Peter Zijlstra	8d1b2d9361	perf_counter: track task-comm data Similar to the mmap data stream, add one that tracks the task COMM field, so that the userspace reporting knows what to call a task. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090408130409.127422406@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 19:05:47 +02:00
Peter Zijlstra	6b6e5486b3	perf_counter: use misc field to widen type Push the PERF_EVENT_COUNTER_OVERFLOW bit into the misc field so that we can have the full 32bit for PERF_RECORD_ bits. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090408130408.891867663@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 18:53:28 +02:00
Peter Zijlstra	6fab01927e	perf_counter: provide misc bits in the event header Limit the size of each record to 64k (or should we count in multiples of u64 and have a 512K limit?), this gives 16 bits or spare room in the header, which we can use for misc bits, so as to not have to grow the record with u64 every time we have a few bits to report. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090408130408.769271806@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 18:53:27 +02:00
Peter Zijlstra	e30e08f65c	perf_counter: fix NMI race in task clock We should not be updating ctx->time from NMI context, work around that. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090408130408.681326666@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 18:53:27 +02:00
Oleg Nesterov	8f2e586567	posix-timers: fix RLIMIT_CPU && setitimer(CPUCLOCK_PROF) update_rlimit_cpu() tries to optimize out set_process_cpu_timer() in case when we already have CPUCLOCK_PROF timer which should expire first. But it uses cputime_lt() instead of cputime_gt(). Test case: int main(void) { struct itimerval it = { .it_value = { .tv_sec = 1000 }, }; assert(!setitimer(ITIMER_PROF, &it, NULL)); struct rlimit rl = { .rlim_cur = 1, .rlim_max = 1, }; assert(!setrlimit(RLIMIT_CPU, &rl)); for (;;) ; return 0; } Without this patch, the task is not killed as RLIMIT_CPU demands. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Peter Lojkin <ia6432@inbox.ru> Cc: Roland McGrath <roland@redhat.com> Cc: stable@kernel.org LKML-Reference: <20090327000610.GA10108@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 17:51:39 +02:00
Oleg Nesterov	6279a751fe	posix-timers: fix RLIMIT_CPU && fork() See http://bugzilla.kernel.org/show_bug.cgi?id=12911 copy_signal() copies signal->rlim, but RLIMIT_CPU is "lost". Because posix_cpu_timers_init_group() sets cputime_expires.prof_exp = 0 and thus fastpath_timer_check() returns false unless we have other expired cpu timers. Change copy_signal() to set cputime_expires.prof_exp if we have RLIMIT_CPU. Also, set cputimer.running = 1 in that case. This is not strictly necessary, but imho makes sense. Reported-by: Peter Lojkin <ia6432@inbox.ru> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Peter Lojkin <ia6432@inbox.ru> Cc: Roland McGrath <roland@redhat.com> Cc: stable@kernel.org LKML-Reference: <20090327000607.GA10104@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 17:51:38 +02:00
Ingo Molnar	5af8c4e0fa	Merge commit 'v2.6.30-rc1' into sched/urgent Merge reason: update to latest upstream to queue up fix Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 17:26:00 +02:00
Ingo Molnar	ff96e612cb	Merge commit 'v2.6.30-rc1' into core/urgent Merge reason: need latest upstream to queue up dependent fix Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 17:02:57 +02:00
Ingo Molnar	5ea472a77f	Merge commit 'v2.6.30-rc1' into perfcounters/core Conflicts: arch/powerpc/include/asm/systbl.h arch/powerpc/include/asm/unistd.h include/linux/init_task.h Merge reason: the conflicts are non-trivial: PowerPC placement of sys_perf_counter_open has to be mixed with the new preadv/pwrite syscalls. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-08 10:35:30 +02:00
Linus Torvalds	1551260d1f	Merge branch 'core/softlockup' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core/softlockup' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: softlockup: make DETECT_HUNG_TASK default depend on DETECT_SOFTLOCKUP softlockup: move 'one' to the softlockup section in sysctl.c softlockup: ensure the task has been switched out once softlockup: remove timestamp checking from hung_task softlockup: convert read_lock in hung_task to rcu_read_lock softlockup: check all tasks in hung_task softlockup: remove unused definition for spawn_softlockup_task softlockup: fix potential race in hung_task when resetting timeout softlockup: fix to allow compiling with !DETECT_HUNG_TASK softlockup: decouple hung tasks check from softlockup detection	2009-04-07 14:11:07 -07:00
Linus Torvalds	c93f216b5b	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: branch tracer, intel-iommu: fix build with CONFIG_BRANCH_TRACER=y branch tracer: Fix for enabling branch profiling makes sparse unusable ftrace: Correct a text align for event format output Update /debug/tracing/README tracing/ftrace: alloc the started cpumask for the trace file tracing, x86: remove duplicated #include ftrace: Add check of sched_stopped for probe_sched_wakeup function-graph: add proper initialization for init task tracing/ftrace: fix missing include string.h tracing: fix incorrect return type of ns2usecs() tracing: remove CALLER_ADDR2 from wakeup tracer blktrace: fix pdu_len when tracing packet command requests blktrace: small cleanup in blk_msg_write() blktrace: NUL-terminate user space messages tracing: move scripts/trace/power.pl to scripts/tracing/power.pl	2009-04-07 14:10:10 -07:00
Linus Torvalds	c61b79b6ef	Merge branch 'irq/threaded' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq/threaded' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: fix devres.o build for GENERIC_HARDIRQS=n genirq: provide old request_irq() for CONFIG_GENERIC_HARDIRQ=n genirq: threaded irq handlers review fixups genirq: add support for threaded interrupts to devres genirq: add threaded interrupt handler support	2009-04-07 14:07:52 -07:00
Masami Hiramatsu	de5bd88d5a	kprobes: support per-kprobe disabling Add disable_kprobe() and enable_kprobe() to disable/enable kprobes temporarily. disable_kprobe() asynchronously disables probe handlers of specified kprobe. So, after calling it, some handlers can be called at a while. enable_kprobe() enables specified kprobe. aggr_pre_handler and aggr_post_handler check disabled probes. On the other hand aggr_break_handler and aggr_fault_handler don't check it because these handlers will be called while executing pre or post handlers and usually those help error handling. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:08 -07:00
Masami Hiramatsu	e579abeb58	kprobes: rename kprobe_enabled to kprobes_all_disarmed Rename kprobe_enabled to kprobes_all_disarmed and invert logic due to avoiding naming confusion from per-probe disabling. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:08 -07:00
Masami Hiramatsu	99081ab553	kprobes: move EXPORT_SYMBOL_GPL just after function definitions Clean up positions of EXPORT_SYMBOL_GPL in kernel/kprobes.c according to checkpatch.pl. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:08 -07:00
Masami Hiramatsu	b918e5e60d	kprobes: cleanup aggr_kprobe related code Currently, kprobes can disable all probes at once, but can't disable it individually (not unregister, just disable an kprobe, because unregistering needs to wait for scheduler synchronization). These patches introduce APIs for on-the-fly per-probe disabling and re-enabling by dis-arming/re-arming its breakpoint instruction. This patch: Change old_p to ap in add_new_kprobe() for readability, copy flags member in add_aggr_kprobe(), and simplify the code flow of register_aggr_kprobe(). Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:07 -07:00
Peter W Morreale	fafd688e4c	mm: add /proc controls for pdflush threads Add /proc entries to give the admin the ability to control the minimum and maximum number of pdflush threads. This allows finer control of pdflush on both large and small machines. The rationale is simply one size does not fit all. Admins on large and/or small systems may want to tune the min/max pdflush thread count to best suit their needs. Right now the min/max is hardcoded to 2/8. While probably a fair estimate for smaller machines, large machines with large numbers of CPUs and large numbers of filesystems/block devices may benefit from larger numbers of threads working on different block devices. Even if the background flushing algorithm is radically changed, it is still likely that multiple threads will be involved and admins would still desire finer control on the min/max other than to have to recompile the kernel. The patch adds '/proc/sys/vm/nr_pdflush_threads_min' and '/proc/sys/vm/nr_pdflush_threads_max' with r/w permissions. The minimum value for nr_pdflush_threads_min is 1 and the maximum value is the current value of nr_pdflush_threads_max. This minimum is required since additional thread creation is performed in a pdflush thread itself. The minimum value for nr_pdflush_threads_max is the current value of nr_pdflush_threads_min and the maximum value can be 1000. Documentation/sysctl/vm.txt is also updated. [akpm@linux-foundation.org: fix comment, fix whitespace, use __read_mostly] Signed-off-by: Peter W Morreale <pmorreale@novell.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:03 -07:00
Zhaolei	1bbe2a83ab	ftrace: Correct a text align for event format output If we cat debugfs/tracing/events/ftrace/bprint/format, we'll see: name: bprint ID: 6 format: field:unsigned char common_type; offset:0; size:1; field:unsigned char common_flags; offset:1; size:1; field:unsigned char common_preempt_count; offset:2; size:1; field:int common_pid; offset:4; size:4; field:int common_tgid; offset:8; size:4; field:unsigned long ip; offset:12; size:4; field:char * fmt; offset:16; size:4; field: char buf; offset:20; size:0; print fmt: "%08lx (%d) fmt:%p %s" There is an inconsistent blank before char buf. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> LKML-Reference: <49D5E3EE.70201@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 14:02:42 +02:00
Nikanth Karthikesan	bc2b6871c1	Update /debug/tracing/README Some of the tracers have been renamed, which was not updated in the in-kernel run-time README file. Update it. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> LKML-Reference: <200903231158.32151.knikanth@suse.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 14:02:36 +02:00
Frederic Weisbecker	b0dfa978c7	tracing/ftrace: alloc the started cpumask for the trace file Impact: fix a crash while cat trace file Currently we are using a cpumask to remind each cpu where a trace occured. It lets us notice the user that a cpu just had its first trace. But on latest -tip we have the following crash once we cat the trace file: IP: [<c0270c4a>] print_trace_fmt+0x45/0xe7 *pde = 00000000 Oops: 0000 [#1] PREEMPT SMP last sysfs file: /sys/class/net/eth0/carrier Pid: 3897, comm: cat Not tainted (2.6.29-tip-02825-g0f22972-dirty #81) EIP: 0060:[<c0270c4a>] EFLAGS: 00010297 CPU: 0 EIP is at print_trace_fmt+0x45/0xe7 EAX: 00000000 EBX: 00000000 ECX: c12d9e98 EDX: ccdb7010 ESI: d31f4000 EDI: `00322401` EBP: d31f3f10 ESP: d31f3efc DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process cat (pid: 3897, ti=d31f2000 task=d3b3cf20 task.ti=d31f2000) Stack: d31f4080 ccdb7010 d31f4000 d691fe70 ccdb7010 d31f3f24 c0270e5c d31f4000 d691fe70 d31f4000 d31f3f34 c02718e8 c12d9e98 d691fe70 d31f3f70 c02bfc33 00001000 09130000 d3b46e00 d691fe98 00000000 00000079 00000001 00000000 Call Trace: [<c0270e5c>] ? print_trace_line+0x170/0x17c [<c02718e8>] ? s_show+0xa7/0xbd [<c02bfc33>] ? seq_read+0x24a/0x327 [<c02bf9e9>] ? seq_read+0x0/0x327 [<c02ab18b>] ? vfs_read+0x86/0xe1 [<c02ab289>] ? sys_read+0x40/0x65 [<c0202d8f>] ? sysenter_do_call+0x12/0x3c Code: 00 00 00 89 45 ec f7 c7 00 20 00 00 89 55 f0 74 4e f6 86 98 10 00 00 02 74 45 8b 86 8c 10 00 00 8b 9e a8 10 00 00 e8 52 f3 ff ff <0f> a3 03 19 c0 85 c0 75 2b 8b 86 8c 10 00 00 8b 9e a8 10 00 00 EIP: [<c0270c4a>] print_trace_fmt+0x45/0xe7 SS:ESP 0068:d31f3efc CR2: 0000000000000000 ---[ end trace aa9cf38e5ebed9dd ]--- This is because we alloc the iter->started cpumask on tracing_pipe_open but not on tracing_open. It hadn't been noticed until now because we need to have ring buffer overruns to activate the starting of cpu buffer detection. Also, we need a check to not print the messagge for the first trace on the file. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1238619188-6109-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 14:02:03 +02:00
Zhaolei	8bcae09b93	ftrace: Add check of sched_stopped for probe_sched_wakeup The wakeup tracing in sched_switch does not stop when a user disables tracing. This is because the probe_sched_wakeup() is missing the check to prevent the wakeup from being traced. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> LKML-Reference: <49D1C543.3010307@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 14:01:11 +02:00
Frederic Weisbecker	5f0c6c03c5	tracing/ftrace: fix missing include string.h Building a kernel with tracing can raise the following warning on tip/master: kernel/trace/trace.c:1249: error: implicit declaration of function 'vbin_printf' We are missing an include to string.h Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1238160130-7437-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 14:00:18 +02:00
Lai Jiangshan	cf8e347465	tracing: fix incorrect return type of ns2usecs() Impact: fix time output bug in 32bits system ns2usecs() returns 'long', it's incorrect. (In i386) ... <idle>-0 [000] 521.442100: _spin_lock <-tick_do_update_jiffies64 <idle>-0 [000] 521.442101: do_timer <-tick_do_update_jiffies64 <idle>-0 [000] 521.442102: update_wall_time <-do_timer <idle>-0 [000] 521.442102: update_xtime_cache <-update_wall_time .... (It always print the time less than 2200 seconds besides ...) Because 'long' is 32bits in i386. ( (1<<31) useconds is about 2200 seconds) ... <idle>-0 [001] 4154502640.134759: rcu_bh_qsctr_inc <-__do_softirq <idle>-0 [001] 4154502640.134760: _local_bh_enable <-__do_softirq <idle>-0 [001] 4154502640.134761: idle_cpu <-irq_exit ... (very large value) Because 'long' is a signed type and it is 32bits in i386. Changes in v2: return 'unsigned long long' instead of 'cycle_t' Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <49D05D10.4030009@cn.fujitsu.com> Reported-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 13:59:23 +02:00
Steven Rostedt	301fd748e2	tracing: remove CALLER_ADDR2 from wakeup tracer Maneesh Soni was getting a crash when running the wakeup tracer. We debugged it down to the recording of the function with the CALLER_ADDR2 macro. This is used to get the location of the caller to schedule. But the problem comes when schedule is called by assmebly. In the case that Maneesh had, retint_careful would call schedule. But retint_careful does not set up a proper frame pointer. CALLER_ADDR2 is defined as __builtin_return_address(2). This produces the following assembly in the wakeup tracer code. mov 0x0(%rbp),%rcx <--- get the frame pointer of the caller mov %r14d,%r8d mov 0xf2de8e(%rip),%rdi mov 0x8(%rcx),%rsi <-- this is __builtin_return_address(1) mov 0x28(%rdi,%rax,8),%rbx mov (%rcx),%rax <-- get the frame pointer of the caller's caller mov %r12,%rcx mov 0x8(%rax),%rdx <-- this is __builtin_return_address(2) At the reading of 0x8(%rax) Maneesh's machine would take a fault. The reason is that retint_careful did not set up the return address and the content of %rax here was zero. To verify this, I sent Maneesh a patch to create a frame pointer in retint_careful. He ran the test again but this time he would take the same type of fault from sysret_careful. The retint_careful was no longer an issue, but there are other callers that still have issues. Instead of adding frame pointers for all callers to schedule (in possibly all archs), it is much safer to simply not use CALLER_ADDR2. This loses out on knowing what called schedule, but the function tracer will help there if needed. Reported-by: Maneesh Soni <maneesh@in.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 13:58:54 +02:00
Ingo Molnar	a053958f07	Merge branch 'tracing/blktrace-fixes' into tracing/urgent Merge reason: this used to be a tracing/blktrace-v2 devel topic still cooking during the merge window - has propagated to fixes Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 13:40:55 +02:00
Ingo Molnar	6c009ecef8	Merge branch 'linus' into perfcounters/core Merge reason: need the upstream facility added by: `7f1e2ca`: hrtimer: fix rq->lock inversion (again) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 12:05:25 +02:00
Ingo Molnar	5e34437840	Merge branch 'linus' into core/softlockup Conflicts: kernel/sysctl.c	2009-04-07 11:15:40 +02:00
Peter Zijlstra	bce379bf35	perf_counter: minimize context time updates Push the update_context_time() calls up the stack so that we get less invokations and thereby a less noisy output: before: # ./perfstat -e 1:0 -e 1:1 -e 1:1 -e 1:1 -l ls > /dev/null Performance counter stats for 'ls': 10.163691 cpu clock ticks (msecs) (scaled from 98.94%) 10.215360 task clock ticks (msecs) (scaled from 98.18%) 10.185549 task clock ticks (msecs) (scaled from 98.53%) 10.183581 task clock ticks (msecs) (scaled from 98.71%) Wall-clock time elapsed: 11.912858 msecs after: # ./perfstat -e 1:0 -e 1:1 -e 1:1 -e 1:1 -l ls > /dev/null Performance counter stats for 'ls': 9.316630 cpu clock ticks (msecs) 9.280789 task clock ticks (msecs) 9.280789 task clock ticks (msecs) 9.280789 task clock ticks (msecs) Wall-clock time elapsed: 9.574872 msecs Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094518.618876874@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:49:01 +02:00
Peter Zijlstra	849691a6cd	perf_counter: remove rq->lock usage Now that all the task runtime clock users are gone, remove the ugly rq->lock usage from perf counters, which solves the nasty deadlock seen when a software task clock counter was read from an NMI overflow context. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094518.531137582@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:49:01 +02:00
Peter Zijlstra	a39d6f2556	perf_counter: rework the task clock software counter Rework the task clock software counter to use the context time instead of the task runtime clock, this removes the last such user. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094518.445450972@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:49:00 +02:00
Peter Zijlstra	4af4998b8a	perf_counter: rework context time Since perf_counter_context is switched along with tasks, we can maintain the context time without using the task runtime clock. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094518.353552838@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:49:00 +02:00
Peter Zijlstra	4c9e25428f	perf_counter: change event definition Currently the definition of an event is slightly ambiguous. We have wakeup events, for poll() and SIGIO, which are either generated when a record crosses a page boundary (hw_events.wakeup_events == 0), or every wakeup_events new records. Now a record can be either a counter overflow record, or a number of different things, like the mmap PROT_EXEC region notifications. Then there is the PERF_COUNTER_IOC_REFRESH event limit, which only considers counter overflows. This patch changes then wakeup_events and SIGIO notification to only consider overflow events. Furthermore it changes the SIGIO notification to report SIGHUP when the event limit is reached and the counter will be disabled. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094518.266679874@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:48:59 +02:00
Peter Zijlstra	79f1464156	perf_counter: counter overflow limit Provide means to auto-disable the counter after 'n' overflow events. Create the counter with hw_event.disabled = 1, and then issue an ioctl(fd, PREF_COUNTER_IOC_REFRESH, n); to set the limit and enable the counter. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094518.083139737@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:48:58 +02:00
Peter Zijlstra	339f7c90b8	perf_counter: PERF_RECORD_TIME By popular request, provide means to log a timestamp along with the counter overflow event. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094518.024173282@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:48:57 +02:00
Peter Zijlstra	ebb3c4c4cb	perf_counter: fix the mlock accounting Reading through the code I saw I forgot the finish the mlock accounting. Do so now. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094517.899767331@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:48:57 +02:00
Peter Zijlstra	f6c7d5fe58	perf_counter: theres more to overflow than writing events Prepare for more generic overflow handling. The new perf_counter_overflow() method will handle the generic bits of the counter overflow, and can return a !0 return value, in which case the counter should be (soft) disabled, so that it won't count until it's properly disabled. XXX: do powerpc and swcounter Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094517.812109629@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:48:56 +02:00
Peter Zijlstra	671dec5daf	perf_counter: generalize pending infrastructure Prepare the pending infrastructure to do more than wakeups. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094517.634732847@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:48:55 +02:00
Peter Zijlstra	3c446b3d3b	perf_counter: SIGIO support Provide support for fcntl() I/O availability signals. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094517.579788800@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:48:55 +02:00
Peter Zijlstra	9c03d88e32	perf_counter: add more context information Change the callchain context entries to u16, so as to gain some space. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <20090406094517.457320003@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-07 10:48:54 +02:00
Rusty Russell	2e45e77787	Revert "module: remove the SHF_ALLOC flag on the __versions section." This reverts commit `9cb610d8e3`. This was an impressively stupid patch. Firstly, we reset the SHF_ALLOC flag lower down in the same function, so the patch was useless. Even better, find_sec() ignores sections with SHF_ALLOC not set, so it breaks CONFIG_MODVERSIONS=y with CONFIG_MODULE_FORCE_LOAD=n, which refuses to load the module since it can't find the __versions section. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-04-07 17:12:43 +09:30
Oleg Nesterov	432870dab8	exit_notify: kill the wrong capable(CAP_KILL) check The CAP_KILL check in exit_notify() looks just wrong, kill it. Whatever logic we have to reset ->exit_signal, the malicious user can bypass it if it execs the setuid application before exiting. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-06 14:57:23 -07:00
Linus Torvalds	cd5f9a4c31	kernel/sysctl.c: avoid annoying warnings Some of the limit constants are used only depending on some complex configuration dependencies, yet it's not worth making the simple variables depend on those configuration details. Just mark them as perhaps not being unused, and avoid the warning. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-06 13:38:46 -07:00
Linus Torvalds	609862be07	Merge branch 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: add stack dumps to asserts hrtimer: fix rq->lock inversion (again)	2009-04-06 13:37:30 -07:00
Linus Torvalds	12fe32e4f9	Merge branch 'kmemtrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'kmemtrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: kmemtrace: trace kfree() calls with NULL or zero-length objects kmemtrace: small cleanups kmemtrace: restore original tracing data binary format, improve ABI kmemtrace: kmemtrace_alloc() must fill type_id kmemtrace: use tracepoints kmemtrace, rcu: don't include unnecessary headers, allow kmemtrace w/ tracepoints kmemtrace, rcu: fix rcupreempt.c data structure dependencies kmemtrace, rcu: fix rcu_tree_trace.c data structure dependencies kmemtrace, rcu: fix linux/rcutree.h and linux/rcuclassic.h dependencies kmemtrace, mm: fix slab.h dependency problem in mm/failslab.c kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_unlzma.c kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_bunzip2.c kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_inflate.c kmemtrace, squashfs: fix slab.h dependency problem in squasfs kmemtrace, befs: fix slab.h dependency problem kmemtrace, security: fix linux/key.h header file dependencies kmemtrace, fs: fix linux/fdtable.h header file dependencies kmemtrace, fs: uninline simple_transaction_set() kmemtrace, fs, security: move alloc_secdata() and free_secdata() to linux/security.h	2009-04-06 13:30:00 -07:00
Peter Zijlstra	92f22a3865	perf_counter: update mmap() counter read Paul noted that we don't need SMP barriers for the mmap() counter read because its always on the same cpu (otherwise you can't access the hw counter anyway). So remove the SMP barriers and replace them with regular compiler barriers. Further, update the comment to include a race free method of reading said hardware counter. The primary change is putting the pmc_read inside the seq-loop, otherwise we can still race and read rubbish. Noticed-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Orig-LKML-Reference: <20090402091319.577951445@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:47 +02:00
Peter Zijlstra	5872bdb88a	perf_counter: add more context information Put in counts to tell which ips belong to what context. ----- \| \| hv \| -- nr \| \| kernel \| -- \| \| user ----- Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Orig-LKML-Reference: <20090402091319.493101305@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:46 +02:00
Peter Zijlstra	c457810ab4	perf_counter: per event wakeups By request, provide a way to request a wakeup every 'n' events instead of every page of output. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Orig-LKML-Reference: <20090402091319.323309784@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:45 +02:00
Peter Zijlstra	8a057d8491	perf_counter: move the event overflow output bits to record_type Per suggestion from Paul, move the event overflow bits to record_type and sanitize the enums a bit. Breaks the ABI -- again ;-) Suggested-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Orig-LKML-Reference: <20090402091319.151921176@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:45 +02:00
Peter Zijlstra	394ee07623	perf_counter: provide generic callchain bits Provide the generic callchain support bits. If hw_event->callchain is set the arch specific perf_callchain() function is called upon to provide a perf_callchain_entry structure filled with the current callchain. If it does so, it is added to the overflow output event. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Orig-LKML-Reference: <20090330171024.254266860@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:43 +02:00
Peter Zijlstra	5ed00415e3	perf_counter: re-arrange the perf_event_type Breaks ABI yet again :-) Change the event type so that [0, 2^31-1] are regular event types, but [2^31, 2^32-1] forms a bitmask for overflow events. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Orig-LKML-Reference: <20090330171024.047961770@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:42 +02:00
Peter Zijlstra	78d613eb12	perf_counter: small cleanup of the output routines Move the nmi argument to the _begin() function, so that _end() only needs the handle. This allows the _begin() function to generate a wakeup on event loss. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Orig-LKML-Reference: <20090330171023.959404268@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:41 +02:00
Paul Mackerras	d5d2bc0dd0	perf_counter: make it possible for hw_perf_counter_init to return error codes Impact: better error reporting At present, if hw_perf_counter_init encounters an error, all it can do is return NULL, which causes sys_perf_counter_open to return an EINVAL error to userspace. This isn't very informative for userspace; it means that userspace can't tell the difference between "sorry, oprofile is already using the PMU" and "we don't support this CPU" and "this CPU doesn't support the requested generic hardware event". This commit uses the PTR_ERR/ERR_PTR/IS_ERR set of macros to let hw_perf_counter_init return an error code on error rather than just NULL if it wishes. If it does so, that error code will be returned from sys_perf_counter_open to userspace. If it returns NULL, an EINVAL error will be returned to userspace, as before. This also adapts the powerpc hw_perf_counter_init to make use of this to return ENXIO, EINVAL, EBUSY, or EOPNOTSUPP as appropriate. It would be good to add extra error numbers in future to allow userspace to distinguish the various errors that are currently reported as EINVAL, i.e. irq_period < 0, too many events in a group, conflict between exclude_* settings in a group, and PMU resource conflict in a group. [ v2: fix a bug pointed out by Corey Ashford where error returns from hw_perf_counter_init were not handled correctly in the case of raw hardware events.] Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Orig-LKML-Reference: <20090330171023.682428180@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:40 +02:00
Peter Zijlstra	0a4a93919b	perf_counter: executable mmap() information Currently the profiling information returns userspace IPs but no way to correlate them to userspace code. Userspace could look into /proc/$pid/maps but that might not be current or even present anymore at the time of analyzing the IPs. Therefore provide means to track the mmap information and provide it in the output stream. XXX: only covers mmap()/munmap(), mremap() and mprotect() are missing. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Cc: Andrew Morton <akpm@linux-foundation.org> Orig-LKML-Reference: <20090330171023.417259499@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:38 +02:00
Peter Zijlstra	38ff667b32	perf_counter: fix update_userpage() It just occured to me it is possible to have multiple contending updates of the userpage (mmap information vs overflow vs counter). This would break the seqlock logic. It appear the arch code uses this from NMI context, so we cannot possibly serialize its use, therefore separate the data_head update from it and let it return to its original use. The arch code needs to make sure there are no contending callers by disabling the counter before using it -- powerpc appears to do this nicely. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Orig-LKML-Reference: <20090330171023.241410660@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:37 +02:00
Peter Zijlstra	925d519ab8	perf_counter: unify and fix delayed counter wakeup While going over the wakeup code I noticed delayed wakeups only work for hardware counters but basically all software counters rely on them. This patch unifies and generalizes the delayed wakeup to fix this issue. Since we're dealing with NMI context bits here, use a cmpxchg() based single link list implementation to track counters that have pending wakeups. [ This should really be generic code for delayed wakeups, but since we cannot use cmpxchg()/xchg() in generic code, I've let it live in the perf_counter code. -- Eric Dumazet could use it to aggregate the network wakeups. ] Furthermore, the x86 method of using TIF flags was flawed in that its quite possible to end up setting the bit on the idle task, loosing the wakeup. The powerpc method uses per-cpu storage and does appear to be sufficient. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Orig-LKML-Reference: <20090330171023.153932974@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:36 +02:00
Paul Mackerras	53cfbf5937	perf_counter: record time running and time enabled for each counter Impact: new functionality Currently, if there are more counters enabled than can fit on the CPU, the kernel will multiplex the counters on to the hardware using round-robin scheduling. That isn't too bad for sampling counters, but for counting counters it means that the value read from a counter represents some unknown fraction of the true count of events that occurred while the counter was enabled. This remedies the situation by keeping track of how long each counter is enabled for, and how long it is actually on the cpu and counting events. These times are recorded in nanoseconds using the task clock for per-task counters and the cpu clock for per-cpu counters. These values can be supplied to userspace on a read from the counter. Userspace requests that they be supplied after the counter value by setting the PERF_FORMAT_TOTAL_TIME_ENABLED and/or PERF_FORMAT_TOTAL_TIME_RUNNING bits in the hw_event.read_format field when creating the counter. (There is no way to change the read format after the counter is created, though it would be possible to add some way to do that.) Using this information it is possible for userspace to scale the count it reads from the counter to get an estimate of the true count: true_count_estimate = count * total_time_enabled / total_time_running This also lets userspace detect the situation where the counter never got to go on the cpu: total_time_running == 0. This functionality has been requested by the PAPI developers, and will be generally needed for interpreting the count values from counting counters correctly. In the implementation, this keeps 5 time values (in nanoseconds) for each counter: total_time_enabled and total_time_running are used when the counter is in state OFF or ERROR and for reporting back to userspace. When the counter is in state INACTIVE or ACTIVE, it is the tstamp_enabled, tstamp_running and tstamp_stopped values that are relevant, and total_time_enabled and total_time_running are determined from them. (tstamp_stopped is only used in INACTIVE state.) The reason for doing it like this is that it means that only counters being enabled or disabled at sched-in and sched-out time need to be updated. There are no new loops that iterate over all counters to update total_time_enabled or total_time_running. This also keeps separate child_total_time_running and child_total_time_enabled fields that get added in when reporting the totals to userspace. They are separate fields so that they can be atomic. We don't want to use atomics for total_time_running, total_time_enabled etc., because then we would have to use atomic sequences to update them, which are slower than regular arithmetic and memory accesses. It is possible to measure total_time_running by adding a task_clock counter to each group of counters, and total_time_enabled can be measured approximately with a top-level task_clock counter (though inaccuracies will creep in if you need to disable and enable groups since it is not possible in general to disable/enable the top-level task_clock counter simultaneously with another group). However, that adds extra overhead - I measured around 15% increase in the context switch latency reported by lat_ctx (from lmbench) when a task_clock counter was added to each of 2 groups, and around 25% increase when a task_clock counter was added to each of 4 groups. (In both cases a top-level task-clock counter was also added.) In contrast, the code added in this commit gives better information with no overhead that I could measure (in fact in some cases I measured lower times with this code, but the differences were all less than one standard deviation). [ v2: address review comments by Andrew Morton. ] Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrew Morton <akpm@linux-foundation.org> Orig-LKML-Reference: <18890.6578.728637.139402@cargo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:36 +02:00
Peter Zijlstra	7730d86558	perf_counter: allow and require one-page mmap on counting counters A brainfart stopped single page mmap()s working. The rest of the code should be perfectly fine with not having any data pages. Reported-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Orig-LKML-Reference: <1237981712.7972.812.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:35 +02:00
Peter Zijlstra	ea5d20cf99	perf_counter: optionally provide the pid/tid of the sampled task Allow cpu wide counters to profile userspace by providing what process the sample belongs to. This raises the first issue with the output type, lots of these options: group, tid, callchain, etc.. are non-exclusive and could be combined, suggesting a bitfield. However, things like the mmap() data stream doesn't fit in that. How to split the type field... Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Orig-LKML-Reference: <20090325113317.013775235@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:34 +02:00
Peter Zijlstra	63e35b25d6	perf_counter: sanity check on the output API Ensure we never write more than we said we would. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Orig-LKML-Reference: <20090325113316.921433024@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-06 09:30:33 +02:00

1 2 3 4 5 ...

7012 commits