Although we check if "missed" is not zero, we divide by hit + missed,
and the addition can possible overflow and become a divide by zero.
This patch checks for this case, and will report it when it happens
then modify "hit" to make the calculation be non zero.
[ Impact: prevent possible divide by zero in ring-buffer-benchmark ]
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The use of numeric constants is discouraged. It is cleaner and more
descriptive to use macros for constant time conversions.
This patch also removes an extra new line.
[ Impact: more descriptive time conversions ]
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix kprobes to lock text_mutex around some arch_arm/disarm_kprobe() which
are newly added by commit de5bd88d5a.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Other parts of the kernel may need to be able to enable or disable
specific events. Especially parts that create trace events.
[ Impact: allow enabling of trace events by those that create the event ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 8f31bfe538
tracing/events: clean up for ftrace_set_clr_event()
Moved out the code for ftrace_set_clr_event into a helper funciton but
did not initialize the return value. As a result, we do not warn about
a typo in the echoing of events in set_event.
This patch restores the old warning:
# echo foobar > set_event
-bash: echo: write error: Invalid argument
[ Impact: restore warning of invalid entries to set_event ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A smarter way to figure out the output of an enable file.
[ Impact: clean up ]
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A0399A5.2080603@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a helper function __ftrace_set_clr_event(), and replace some
ftrace_set_clr_event() calls with this helper, thus we don't need any
kstrdup() or kmalloc().
As a side effect, this patch fixes an issue in self tests code, which is
similar to the one fixed in commit d6bf81ef0f
("tracing: append ":*" to internal setting of system events")
It's a small issue and won't cause any bug in fact, but we should do things
right anyway.
[ Impact: prevent spurious event-enabling in tracing self-tests ]
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A03998E.3020503@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There's a WARN_ON in the ring buffer code that makes sure preemption
is disabled. It checks "!preempt_count()". But when CONFIG_PREEMPT is not
enabled, preempt_count() is always zero, and this will trigger the warning.
[ Impact: prevent false warning on non preemptible kernels ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It is nice to see the overhead of the benchmark test when tracing is
disabled. That is, we turn off the ring buffer just to see what the
cost of running the loop that calls into the ring buffer is.
Currently, if no entries wer made, we get 0. This is not informative.
This patch changes it to check if we had any "missed" (non recorded)
events. If so, a total count is also reported.
[ Impact: evaluate the over head of the ring buffer benchmark test ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Calling cond_resched at every iteration of the loop adds a bit of
overhead to the benchmark.
This patch does two things.
1) only calls cond-resched when CONFIG_PREEMPT is not enabled
2) only calls cond-resched after so many traces has been performed.
[ Impact: less overhead to the ring-buffer-benchmark ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Tracing can be very helpful to debug the kernel. When DEBUG_KERNEL is
enabled it is nice to enable the trace menu as well.
This patch only make the tracing menu enabled by default, it does not
make any of the tracers enabled. And the menu is only enabled by
default if DEBUG_KERNEL is enabled.
[ Impact: show tracing options to those debugging the kernel ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The system enabling of events uses the same code as the set_event file.
It passes in the name of the system to the parser and that will enable
all the events that has that system as a name.
The problem is that it will also enable events with the same name as the
system.
If you have system name foo, and system name bar, but within the system
bar, there exists an event called foo. By setting the system name foo,
you will also be enabling the event foo in the system bar. This is not
an expected result.
The solution is to pass in "foo:*", which will only enable the system
foo and not events called foo.
[ Impact: prevent accidental enabling of events with same name as a system ]
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Ingo Molnar thought that the code to calculate the time in cond_resched
is a bit too ugly and is not needed. This patch removes it and replaces
it with a simple call to cond_resched. I kept the comment that explains
the reason for the cond_resched.
[ Impact: remove ugly code ]
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Merge reason: this topic is ready for upstream now. It passed
Oleg's review and Andrew had no further mm/*
objections/observations either.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge reason: tracing/core was on a .30-rc1 base and was missing out on
on a handful of tracing fixes present in .30-rc5-almost.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In filter_add_subsystem_pred() we should release event_mutex before
calling filter_free_subsystem_preds(), since both functions hold
event_mutex.
[ Impact: fix deadlock when writing invalid pred into subsystem filter ]
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: tzanussi@gmail.com
Cc: a.p.zijlstra@chello.nl
Cc: fweisbec@gmail.com
Cc: rostedt@goodmis.org
LKML-Reference: <4A028993.7020509@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When we set a filter for an event, such as:
echo "name == my_lock_name" > \
/debug/tracing/events/lockdep/lock_acquired/filter
then the following order of token type is parsed:
- space
- operator
- parentheses
- operand
Because the operators and parentheses have a higher precedence
than the operand characters, which is normal, then we can't
use any string containing such special characters:
()=<>!&|
To get this support and also avoid ambiguous intepretation from
the parser or the human, we can do it using double quotes so that
we keep the usual languages habits.
Then after this patch you can still declare string condition like
before:
echo name == myname
But if you want to compare against a string containing an operator
character, you can use double quotes:
echo 'name == "&myname"'
Don't forget to include the whole expression into single quotes or
the double ones will be eaten by echo.
[ Impact: support strings with special characters for tracing filters ]
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Zhaolei <zhaolei@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Currently the filtering infrastructure supports well the
numeric types and fixed sized array types.
But the recently added __string() field uses a specific
indirect offset mechanism which requires a specific
predicate. Until now it wasn't supported.
This patch adds this support and implies very few changes,
only a new predicate is needed, the management of this specific
field can be done through the usual string helpers in the
filtering infrastructure.
[ Impact: support all kinds of strings in the tracing filters ]
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Zhaolei <zhaolei@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
With the current event directory, you can only enable individual events.
The file debugfs/tracing/set_event is used to be able to enable or
disable several events at once. But that can still be awkward.
This patch adds hierarchical enabling of events. That is, each directory
in debugfs/tracing/events has an "enable" file. This file can enable
or disable all events within the directory and below.
# echo 1 > /debugfs/tracing/events/enable
will enable all events.
# echo 1 > /debugfs/tracing/events/sched/enable
will enable all events in the sched subsystem.
# echo 1 > /debugfs/tracing/events/enable
# echo 0 > /debugfs/tracing/events/irq/enable
will enable all events, but then disable just the irq subsystem events.
When reading one of these enable files, there are four results:
0 - all events this file affects are disabled
1 - all events this file affects are enabled
X - there is a mixture of events enabled and disabled
? - this file does not affect any event
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Li Zefan found that there's a race using the event ids of events and
modules. When a module is loaded, an event id is incremented. We only
have 16 bits for event ids (65536) and there is a possible (but highly
unlikely) race that we could load and unload a module that registers
events so many times that the event id counter overflows.
When it overflows, it then restarts and goes looking for available
ids. An id is available if it was added by a module and released.
The race is if you have one module add an id, and then is removed.
Another module loaded can use that same event id. But if the old module
still had events in the ring buffer, the new module's call back would
get bogus data. At best (and most likely) the output would just be
garbage. But if the module for some reason used pointers (not recommended)
then this could potentially crash.
The safest thing to do is just reset the ring buffer if a module that
registered events is removed.
[ Impact: prevent unpredictable results of event id overflows ]
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <49FEAFD0.30106@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When building with gcc 3.2 I get thousands of warnings such as
include/linux/gfp.h: In function `allocflags_to_migratetype':
include/linux/gfp.h:105: warning: null format string
due to passing a NULL format string to warn_slowpath() in
#define __WARN() warn_slowpath(__FILE__, __LINE__, NULL)
Split this case out into a separate call. This also shrinks the kernel
slightly:
text data bss dec hex filename
4802274 707668 712704 6222646 5ef336 vmlinux
text data bss dec hex filename
4799027 703572 712704 6215303 5ed687 vmlinux
due to removeing one argument from the commonly-called __WARN().
[akpm@linux-foundation.org: reduce scope of `empty']
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is what we believe to be a false positive reported by lockdep.
inotify_inode_queue_event() => take inotify_mutex => kernel_event() =>
kmalloc() => SLOB => alloc_pages_node() => page reclaim => slab reclaim =>
dcache reclaim => inotify_inode_is_dead => take inotify_mutex => deadlock
The plan is to fix this via lockdep annotation, but that is proving to be
quite involved.
The patch flips the allocation over to GFP_NFS to shut the warning up, for
the 2.6.30 release.
Hopefully we will fix this for real in 2.6.31. I'll queue a patch in -mm
to switch it back to GFP_KERNEL so we don't forget.
=================================
[ INFO: inconsistent lock state ]
2.6.30-rc2-next-20090417 #203
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/380 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&inode->inotify_mutex){+.+.?.}, at: [<ffffffff8112f1b5>] inotify_inode_is_dead+0x35/0xb0
{RECLAIM_FS-ON-W} state was registered at:
[<ffffffff81079188>] mark_held_locks+0x68/0x90
[<ffffffff810792a5>] lockdep_trace_alloc+0xf5/0x100
[<ffffffff810f5261>] __kmalloc_node+0x31/0x1e0
[<ffffffff81130652>] kernel_event+0xe2/0x190
[<ffffffff81130826>] inotify_dev_queue_event+0x126/0x230
[<ffffffff8112f096>] inotify_inode_queue_event+0xc6/0x110
[<ffffffff8110444d>] vfs_create+0xcd/0x140
[<ffffffff8110825d>] do_filp_open+0x88d/0xa20
[<ffffffff810f6b68>] do_sys_open+0x98/0x140
[<ffffffff810f6c50>] sys_open+0x20/0x30
[<ffffffff8100c272>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
irq event stamp: 690455
hardirqs last enabled at (690455): [<ffffffff81564fe4>] _spin_unlock_irqrestore+0x44/0x80
hardirqs last disabled at (690454): [<ffffffff81565372>] _spin_lock_irqsave+0x32/0xa0
softirqs last enabled at (690178): [<ffffffff81052282>] __do_softirq+0x202/0x220
softirqs last disabled at (690157): [<ffffffff8100d50c>] call_softirq+0x1c/0x50
other info that might help us debug this:
2 locks held by kswapd0/380:
#0: (shrinker_rwsem){++++..}, at: [<ffffffff810d0bd7>] shrink_slab+0x37/0x180
#1: (&type->s_umount_key#17){++++..}, at: [<ffffffff8110cfbf>] shrink_dcache_memory+0x11f/0x1e0
stack backtrace:
Pid: 380, comm: kswapd0 Not tainted 2.6.30-rc2-next-20090417 #203
Call Trace:
[<ffffffff810789ef>] print_usage_bug+0x19f/0x200
[<ffffffff81018bff>] ? save_stack_trace+0x2f/0x50
[<ffffffff81078f0b>] mark_lock+0x4bb/0x6d0
[<ffffffff810799e0>] ? check_usage_forwards+0x0/0xc0
[<ffffffff8107b142>] __lock_acquire+0xc62/0x1ae0
[<ffffffff810f478c>] ? slob_free+0x10c/0x370
[<ffffffff8107c0a1>] lock_acquire+0xe1/0x120
[<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0
[<ffffffff81562d43>] mutex_lock_nested+0x63/0x420
[<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0
[<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0
[<ffffffff81012fe9>] ? sched_clock+0x9/0x10
[<ffffffff81077165>] ? lock_release_holdtime+0x35/0x1c0
[<ffffffff8112f1b5>] inotify_inode_is_dead+0x35/0xb0
[<ffffffff8110c9dc>] dentry_iput+0xbc/0xe0
[<ffffffff8110cb23>] d_kill+0x33/0x60
[<ffffffff8110ce23>] __shrink_dcache_sb+0x2d3/0x350
[<ffffffff8110cffa>] shrink_dcache_memory+0x15a/0x1e0
[<ffffffff810d0cc5>] shrink_slab+0x125/0x180
[<ffffffff810d1540>] kswapd+0x560/0x7a0
[<ffffffff810ce160>] ? isolate_pages_global+0x0/0x2c0
[<ffffffff81065a30>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8107953d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff810d0fe0>] ? kswapd+0x0/0x7a0
[<ffffffff8106555b>] kthread+0x5b/0xa0
[<ffffffff8100d40a>] child_rip+0xa/0x20
[<ffffffff8100cdd0>] ? restore_args+0x0/0x30
[<ffffffff81065500>] ? kthread+0x0/0xa0
[<ffffffff8100d400>] ? child_rip+0x0/0x20
[eparis@redhat.com: fix audit too]
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ring buffer benchmark/test runs a producer for 10 seconds.
This is done with preemption and interrupts enabled. But if the kernel
is not compiled with CONFIG_PREEMPT, it basically stops everything
but interrupts for 10 seconds.
Although this is just a test and is not for production, this attribute
can be quite annoying. It can also spawn badness elsewhere.
This patch solves the issues by calling "cond_resched" when the system
is not compiled with CONFIG_PREEMPT. It also keeps track of the time
spent to call cond_resched such that it does not go against the
time calculations. That is, if the task schedules away, the time scheduled
out is removed from the test data. Note, this only works for non PREEMPT
because we do not know when the task is scheduled out if we have PREEMPT
enabled.
[ Impact: prevent test from stopping the world for 10 seconds ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Ingo Molnar thought the code would be cleaner if we used a function call
instead of a goto for moving the tail page. After implementing this,
it seems that gcc still inlines the result and the output is pretty much
the same. Since this is considered a cleaner approach, might as well
implement it.
[ Impact: code clean up ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The result of the allocation of the ring buffer read page in the
ring buffer bench mark does not check the return to see if a page
was actually allocated. This patch fixes that.
[ Impact: avoid NULL dereference ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The code in __rb_reserve_next checks on page overflow if it is the
original commiter and then resets the page back to the original
setting. Although this is fine, and the code is correct, it is
a bit fragil. Some experimental work I did breaks it easily.
The better and more robust solution is to have all commiters that
overflow the page, simply subtract what they added.
[ Impact: more robust ring buffer account management ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This compiler warning:
CC kernel/trace/trace_output.o
kernel/trace/trace_output.c: In function ‘register_ftrace_event’:
kernel/trace/trace_output.c:544: warning: ‘list’ may be used uninitialized in this function
Is wrong as 'list' is always initialized - but GCC (4.3.2) does not
recognize this relationship properly.
Work around the warning by initializing the variable to NULL.
[ Impact: fix false positive compiler warning ]
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This attempts to clarify names utilized during block I/O remap
operations (partition, volume manager). It correctly matches up the
/from/ information for both device & sector. This takes in the concept
from Kosaki Motohiro and extends it to include better naming for the
"device_from" field.
[ Impact: cleanup ]
Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <49FF4FAE.3000301@hp.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The orig_cpu parameter in trace_sched_migrate_task() is not necessary,
it can be got by using task_cpu(p) in the probe.
[ Impact: micro-optimization ]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
[ modified from Mathieu's patch. The original patch is at:
http://marc.info/?l=linux-kernel&m=123791201716239&w=2 ]
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Cc: fweisbec@gmail.com
Cc: rostedt@goodmis.org
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: zhaolei@cn.fujitsu.com
Cc: laijs@cn.fujitsu.com
LKML-Reference: <49FFFDB7.1050402@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A module will add/remove its trace events when it gets loaded/unloaded, so
the ftrace_events list is not "const", and concurrent access needs to be
protected.
This patch thus fixes races between loading/unloding modules and read
'available_events' or read/write 'set_event', etc.
Below shows how to reproduce the race:
# for ((; ;)) { cat /mnt/tracing/available_events; } > /dev/null &
# for ((; ;)) { insmod trace-events-sample.ko; rmmod sample; } &
After a while:
BUG: unable to handle kernel paging request at 0010011c
IP: [<c1080f27>] t_next+0x1b/0x2d
...
Call Trace:
[<c10c90e6>] ? seq_read+0x217/0x30d
[<c10c8ecf>] ? seq_read+0x0/0x30d
[<c10b4c19>] ? vfs_read+0x8f/0x136
[<c10b4fc3>] ? sys_read+0x40/0x65
[<c1002a68>] ? sysenter_do_call+0x12/0x36
[ Impact: fix races when concurrent accessing ftrace_events list ]
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4A00F709.3080800@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When unloading a module, memory allocated by init_preds() and
trace_define_field() is not freed.
[ Impact: fix memory leak ]
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <4A00F6E0.3040503@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds code that can benchmark the ring buffer as well as
test it. This code can be compiled into the kernel (not recommended)
or as a module.
A separate ring buffer is used to not interfer with other users, like
ftrace. It creates a producer and a consumer (option to disable creation
of the consumer) and will run for 10 seconds, then sleep for 10 seconds
and then repeat.
While running, the producer will write 10 byte loads into the ring
buffer with just putting in the current CPU number. The reader will
continually try to read the buffer. The reader will alternate from reading
the buffer via event by event, or by full pages.
The output is a pr_info, thus it will fill up the syslogs.
Starting ring buffer hammer
End ring buffer hammer
Time: 9000349 (usecs)
Overruns: 12578640
Read: 5358440 (by events)
Entries: 0
Total: 17937080
Missed: 0
Hit: 17937080
Entries per millisec: 1993
501 ns per entry
Sleeping for 10 secs
Starting ring buffer hammer
End ring buffer hammer
Time: 9936350 (usecs)
Overruns: 0
Read: 28146644 (by pages)
Entries: 74
Total: 28146718
Missed: 0
Hit: 28146718
Entries per millisec: 2832
353 ns per entry
Sleeping for 10 secs
Time: is the time the test ran
Overruns: the number of events that were overwritten and not read
Read: the number of events read (either by pages or events)
Entries: the number of entries left in the buffer
(the by pages will only read full pages)
Total: Entries + Read + Overruns
Missed: the number of entries that failed to write
Hit: the number of entries that were written
The above example shows that it takes ~353 nanosecs per entry when
there is a reader, reading by pages (and no overruns)
The event by event reader slowed the producer down to 501 nanosecs.
[ Impact: see how changes to the ring buffer affect stability and performance ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In the hot path of the ring buffer "__rb_reserve_next" there's a big
if statement that does not even return back to the work flow.
code;
if (cross to next page) {
[ lots of code ]
return;
}
more code;
The condition is even the unlikely path, although we do not denote it
with an unlikely because gcc is fine with it. The condition is true when
the write crosses a page boundary, and we need to start at a new page.
Having this if statement makes it hard to read, but calling another
function to do the work is also not appropriate, because we are using a lot
of variables that were set before the if statement, and we do not want to
send them as parameters.
This patch changes it to a goto:
code;
if (cross to next page)
goto next_page;
more code;
return;
next_page:
[ lots of code]
This makes the code easier to understand, and a bit more obvious.
The output from gcc is practically identical. For some reason, gcc decided
to use different registers when I switched it to a goto. But other than that,
the logic is the same.
[ Impact: easier to read code ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When adding the EXPORT_SYMBOL to some of the tracing API, I accidently
used EXPORT_SYMBOL instead of EXPORT_SYMBOL_GPL. This patch fixes
that mistake.
[ Impact: export the tracing code only for GPL modules ]
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As a precaution, it is best to disable writing to the ring buffers
when reseting them.
[ Impact: prevent weird things if write happens during reset ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In the swap page ring buffer code that is used by the ftrace splice code,
we scan the page to increment the counter of entries read.
With the number of entries already in the page we simply need to add it.
[ Impact: speed up reading page from ring buffer ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'irq/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
Revert "genirq: assert that irq handlers are indeed running in hardirq context"
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: x86, mmiotrace: fix range test
tracing: fix ref count in splice pages
Currently, when the ring buffer writer overflows the buffer and must
write over non consumed data, we increment the overrun counter by
reading the entries on the page we are about to overwrite. This reads
the entries one by one.
This is not very effecient. This patch adds another entry counter
into each buffer page descriptor that keeps track of the number of
entries on the page. Now on overwrite, the overrun counter simply
needs to add the number of entries that is on the page it is about
to overwrite.
[ Impact: speed up of ring buffer in overwrite mode ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The entries counter in cpu buffer is not atomic. It can be updated by
other interrupts or from another CPU (readers).
But making entries into "atomic_t" causes an atomic operation that can
hurt performance. Instead we convert it to a local_t that will increment
a counter with a local CPU atomic operation (if the arch supports it).
Instead of fighting with readers and overwrites that decrement the counter,
I added a "read" counter. Every time a reader reads an entry it is
incremented.
We already have a overrun counter and with that, the entries counter and
the read counter, we can calculate the total number of entries in the
buffer with:
(entries - overrun) - read
As long as the total number of entries in the ring buffer is less than
the word size, this will work. But since the entries counter was previously
a long, this is no different than what we had before.
Thanks to Andrew Morton for pointing out in the first version that
atomic_t does not replace unsigned long. I switched to atomic_long_t
even though it is signed. A negative count is most likely a bug.
[ Impact: keep accurate count of cpu buffer entries ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch adds stats to the ftrace ring buffers:
# cat /debugfs/tracing/per_cpu/cpu0/stats
entries: 42360
overrun: 30509326
commit overrun: 0
nmi dropped: 0
Where entries are the total number of data entries in the buffer.
overrun is the number of entries not consumed and were overwritten by
the writer.
commit overrun is the number of entries dropped due to nested writers
wrapping the buffer before the initial writer finished the commit.
nmi dropped is the number of entries dropped due to the ring buffer
lock being held when an nmi was going to write to the ring buffer.
Note, this field will be meaningless and will go away when the ring
buffer becomes lockless.
[ Impact: let userspace know what is happening in the ring buffers ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The WARN_ON in the ring buffer when a commit is preempted and the
buffer is filled by preceding writes can happen in normal operations.
The WARN_ON makes it look like a bug, not to mention, because
it does not stop tracing and calls printk which can also recurse, this
is prone to deadlock (the WARN_ON is not in a position to recurse).
This patch removes the WARN_ON and replaces it with a counter that
can be retrieved by a tracer. This counter is called commit_overrun.
While at it, I added a nmi_dropped counter to count any time an NMI entry
is dropped because the NMI could not take the spinlock.
[ Impact: prevent deadlock by printing normal case warning ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
I'm adding a module to do a series of tests on the ring buffer as well
as benchmarks. This module needs to have more of the ring buffer API
exported. There's nothing wrong with reading the ring buffer from a
module.
[ Impact: allow modules to read pages from the ring buffer ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tick_handle_periodic() can lock up hard when a one shot clock event
device is used in combination with jiffies clocksource.
Avoid an endless loop issue by requiring that a highres valid
clocksource be installed before we call tick_periodic() in a loop when
using ONESHOT mode. The result is we will only increment jiffies once
per interrupt until a continuous hardware clocksource is available.
Without this, we can run into a endless loop, where each cycle through
the loop, jiffies is updated which increments time by tick_period or
more (due to clock steering), which can cause the event programming to
think the next event was before the newly incremented time and fail
causing tick_periodic() to be called again and the whole process loops
forever.
[ Impact: prevent hard lock up ]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
This reverts commit 044d408409.
The commit added a warning when handle_IRQ_event() is called outside
of hard interrupt context. This breaks the generic tasklet based
interrupt resend mechanism which is used when the hardware has no way
to retrigger the interrupt. So we get a warning for a use case which
is correct and worked for years. Remove it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Sparse reports the following in kernel/posix-cpu-timers.c:
warning: symbol 'firing' shadows an earlier one
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Subrata Modak <subrata@linux.vnet.ibm.com>
LKML-Reference: <BD79186B4FD85F4B8E60E381CAEE1909016C1AFE@mi8nycmail19.Mi8.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Andrew Gallatin reported that IRQ and SOFTIRQ times were
sometime not reported correctly on recent kernels, and even
bisected to commit 457533a7d3
([PATCH] fix scaled & unscaled cputime accounting) as the first
bad commit.
Further analysis pointed that commit
79741dd357 ([PATCH] idle cputime
accounting) was the real cause of the problem.
account_process_tick() was not taking into account timer IRQ
interrupting the idle task servicing a hard or soft irq.
On mostly idle cpu, irqs were thus not accounted and top or
mpstat could tell user/admin that cpu was 100 % idle, 0.00 %
irq, 0.00 % softirq, while it was not.
[ Impact: fix occasionally incorrect CPU statistics in top/mpstat ]
Reported-by: Andrew Gallatin <gallatin@myri.com>
Re-reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: rick.jones2@hp.com
Cc: brice@myri.com
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
LKML-Reference: <49F84BC1.7080602@cosmosbay.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
"tracing: create automated trace defines" causes this compile error on s390,
as reported by Sachin Sant against linux-next:
kernel/built-in.o: In function `__do_softirq':
(.text+0x1c680): undefined reference to `__tracepoint_softirq_entry'
This happens because the definitions of the softirq tracepoints were moved
from kernel/softirq.c to kernel/irq/handle.c. Since s390 doesn't support
generic hardirqs handle.c doesn't get compiled and the definitions are
missing.
So move the tracepoints to softirq.c again.
[ Impact: fix build failure on s390 ]
Reported-by: Sachin Sant <sachinp@in.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: fweisbec@gmail.com
LKML-Reference: <20090429135139.5fac79b8@osiris.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Replace the current event parser hack with a better one. Filters are
no longer specified predicate by predicate, but all at once and can
use parens and any of the following operators:
numeric fields:
==, !=, <, <=, >, >=
string fields:
==, !=
predicates can be combined with the logical operators:
&&, ||
examples:
"common_preempt_count > 4" > filter
"((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter
If there was an error, the erroneous string along with an error
message can be seen by looking at the filter e.g.:
((sig >= 10 && sig < 15) || dsig == 17) && comm != bash
^
parse_error: Field not found
Currently the caret for an error always appears at the beginning of
the filter; a real position should be used, but the error message
should be useful even without it.
To clear a filter, '0' can be written to the filter file.
Filters can also be set or cleared for a complete subsystem by writing
the same filter as would be written to an individual event to the
filter file at the root of the subsytem. Note however, that if any
event in the subsystem lacks a field specified in the filter being
set, the set will fail and all filters in the subsytem are
automatically cleared. This change from the previous version was made
because using only the fields that happen to exist for a given event
would most likely result in a meaningless filter.
Because the logical operators are now implemented as predicates, the
maximum number of predicates in a filter was increased from 8 to 16.
[ Impact: add new, extended trace-filter implementation ]
Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: fweisbec@gmail.com
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <1240905899.6416.121.camel@tropicana>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The new filter comparison ops need to be able to distinguish between
signed and unsigned field types, so add an is_signed flag/param to the
event field struct/trace_define_fields(). Also define a simple macro,
is_signed_type() to determine the signedness at compile time, used in the
trace macros. If the is_signed_type() macro won't work with a specific
type, a new slightly modified version of TRACE_FIELD() called
TRACE_FIELD_SIGN(), allows the signedness to be set explicitly.
[ Impact: extend trace-filter code for new feature ]
Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: fweisbec@gmail.com
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <1240905893.6416.120.camel@tropicana>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Create a new event_filter object, and move the pred-related members
out of the call and subsystem objects and into the filter object - the
details of the filter implementation don't need to be exposed in the
call and subsystem in any case, and it will also help make the new
parser implementation a little cleaner.
[ Impact: refactor trace-filter code to prepare for new features ]
Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: fweisbec@gmail.com
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <1240905887.6416.119.camel@tropicana>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The pages allocated for the splice binary buffer did not initialize
the ref count correctly. This caused pages not to be freed and causes
a drastic memory leak.
Thanks to logdev I was able to trace the tracer to find where the leak
was.
[ Impact: stop memory leak when using splice ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The warning output in trace_recursive_lock uses %d for a long when
it should be %ld.
[ Impact: fix compile warning ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Splice works with pages, it is much more effecient to use an entire
page than to copy bits over several pages.
Using logdev to trace the internals of the splice mechanism, I was
able to see that splice can be very aggressive. When tracing is
occurring, and the reader caught up to the writer, and the writer
is on the reader page, the reader will copy what is there into the
splice page. Splice may iterate over several pages and if the
writer is still writing to the page, the reader will keep copying
bits to new pages to pass to userspace.
This patch changes it to only pass data to userspace if the page
is full (the writer has left the page). This has a small side effect
that splice can not read a partial page, and must wait for the
page to fill. This should not be an issue. If tracing has stopped,
then a use of "read" will still read all of the page.
[ Impact: better performance for ring buffer splice code ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The splice code allocates a page even when the ring buffer is empty.
It detects the ring buffer being empty when it it fails to copy
anything from the ring buffer into the page.
This patch adds a check to see if there is anything in the ring buffer
before allocating a page.
Thanks to logdev for letting me trace the tracer to find this.
[ Impact: speed up due to removing unnecessary allocation ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The pages allocated for the splice binary buffer did not initialize
the ref count correctly. This caused pages not to be freed and causes
a drastic memory leak.
Thanks to logdev I was able to trace the tracer to find where the leak
was.
[ Impact: stop memory leak when using splice ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_dump is used for printing out the contents of the ftrace ring buffer
to the console on failure. Currently it uses a spinlock to synchronize
the output from multiple failures on different CPUs. This spin lock
currently is a normal spinlock and can cause issues with lockdep and
lock tracing.
This patch converts it to raw since it is for error handling only.
The lock is local to the ftrace_dump and is not used by any other
infrastructure.
[ Impact: prevent ftrace_dump from locking up by internal tracing ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86/irq: mark NUMA_MIGRATE_IRQ_DESC broken
x86, irq: Remove IRQ_DISABLED check in process context IRQ move
For proper module reference counting, the file_operations that modules use
must have the "owner" field set to the module. Unfortunately, the trace events
use share file_operations. The same file_operations are used by all both
kernel core and all modules.
This patch makes the modules allocate their own file_operations and
copies the functions from the core kernel. This allows those file
operations to be owned by the module.
Care is taken to free this code on module unload.
Thanks to Greg KH for reminding me that file_operations must be owned
by the module to have reference counting take place.
[ Impact: fix modular tracepoints / potential crash ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
With modules being able to add trace events, and the max trace event
counter is 16 bits (65536) we can overflow the counter easily
with a simple while loop adding and removing modules that contain
trace events.
This patch links together the registered trace events and on overflow
searches for available trace event ids. It will still fail if
over 65536 events are registered, but considering that a typical
kernel only has 22000 functions, 65000 events should be sufficient.
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit c751085943 ("PM/Hibernate: Wait for
SCSI devices scan to complete during resume") added a call to
scsi_complete_async_scans() to software_resume(), so that it waited for
the SCSI scanning to complete, but the call was added at a wrong place.
Namely, it should have been added after wait_for_device_probe(), which
is called only if the image partition hasn't been specified yet. Also,
it's reasonable to check if the image partition is present and only wait
for the device probing and SCSI scanning to complete if it is not the
case.
Additionally, since noresume is checked right at the beginning of
software_resume() and the function returns immediately if it's set, it
doesn't make sense to check it once again later.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Slow-work appears to delete its timer as soon as the first user
unregisters, even though other users could be active. At the same time, it
never seems to delete slow_work_oom_timer. Arrange for both to happen in
the shutdown path.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
arch/x86/kernel/ptrace.c
Merge reason: fix the conflict above, and also pick up the CONFIG_BROKEN
dependency change from upstream so that we can remove it
here.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
RB_MAX_SMALL_DATA = 28bytes is too small for most tracers, it wastes
an 'u32' to save the actually length for events which data size > 28.
This fix uses compressed event header and enlarges RB_MAX_SMALL_DATA.
[ Impact: saves about 0%-12.5%(depends on tracer) memory in ring_buffer ]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <49F13189.3090000@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The events exported by TRACE_EVENT are automated and are guaranteed
to be correct when used.
The internal ftrace structures on the other hand are more manually
exported. These require the ftrace maintainer to make sure they
are up to date.
This patch adds a size check to help flag when a type changes in
an internal ftrace data structure, and the update needs to be reflected
in the export.
If a export is incorrect, then the only harm is that the user space
tools will not know how to correctly read the internal structures of
ftrace.
[ Impact: help prevent inconsistent ftrace format print outs ]
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
With the new event tracing registration, we must increase the number
of events that can be registered. Currently the type field is only
one byte, which leaves us only 256 possible events.
Since we do not save the CPU number in the tracer anymore (it is determined
by the per cpu ring buffer that is used) we have an extra byte to use.
This patch increases the size of type from 1 byte (256 events) to
2 bytes (65,536 events).
It also adds a WARN_ON_ONCE if we exceed that limit.
[ Impact: allow more than 255 events ]
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The code had the following outside the lock:
if (next != wakeup_task)
return;
pc = preempt_count();
/* The task we are waiting for is waking up */
data = wakeup_trace->data[wakeup_cpu];
On initialization, wakeup_task is NULL and wakeup_cpu -1. This code
is not under a lock. If wakeup_task is set on another CPU as that
task is waking up, we can see the wakeup_task before wakeup_cpu is
set. If we read wakeup_cpu while it is still -1 then we will have
a bad data pointer.
This patch moves the reading of wakeup_cpu within the protection of
the spinlock used to protect the writing of wakeup_cpu and wakeup_task.
[ Impact: remove possible race causing invalid pointer dereference ]
Reported-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Andi Kleen reported this message triggering on non-lockdep kernels:
Disabling lockdep due to kernel taint
Clarify the message to say 'lock debugging' - debug_locks_off()
turns off all things lock debugging, not just lockdep.
[ Impact: change kernel warning message text ]
Reported-by: Andi Kleen <andi@firstfloor.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
struct trace_entry->type is unsigned char, while trace event's id is
int type, thus for a event with id >= 256, it's entry->type is cast
to (id % 256), and then we can't see the trace output of this event.
# insmod trace-events-sample.ko
# echo foo_bar > /mnt/tracing/set_event
# cat /debug/tracing/events/trace-events-sample/foo_bar/id
256
# cat /mnt/tracing/trace_pipe
<...>-3548 [001] 215.091142: Unknown type 0
<...>-3548 [001] 216.089207: Unknown type 0
<...>-3548 [001] 217.087271: Unknown type 0
<...>-3548 [001] 218.085332: Unknown type 0
[ Impact: fix output for trace events with id >= 256 ]
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <49EEDB0E.5070207@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add enable() and disable() callbacks for clocksources.
This allows us to put unused clocksources in power save mode. The
functions clocksource_enable() and clocksource_disable() wrap the
callbacks and are inserted in the timekeeping code to enable before use
and disable after switching to a new clocksource.
Signed-off-by: Magnus Damm <damm@igel.co.jp>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pass clocksource pointer to the read() callback for clocksources. This
allows us to share the callback between multiple instances.
[hugh@veritas.com: fix powerpc build of clocksource pass clocksource mods]
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Magnus Damm <damm@igel.co.jp>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On boot up, to save memory, ftrace allocates the minimum buffer
which is two pages. Ftrace also goes through a series of tests
(when configured) on boot up. These tests can fill up a page within
a single interrupt.
The ring buffer also has a WARN_ON when it detects that the buffer was
completely filled within a single commit (other commits are allowed to
be nested).
Combine the small buffer on start up, with the tests that can fill more
than a single page within an interrupt, this can trigger the WARN_ON.
This patch makes the WARN_ON only happen when the ring buffer consists
of more than two pages.
[ Impact: prevent false WARN_ON in ftrace startup tests ]
Reported-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <20090421094616.GA14561@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Suppose we would like to trace all tasks named '123', but this
will fail:
# echo 'parent_comm == 123' > events/sched/sched_process_fork/filter
bash: echo: write error: Invalid argument
Don't guess the type of the filter pred in filter_parse(), but instead
we check it in __filter_add_pred().
[ Impact: extend allowed filter field string values ]
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <49ED8DEB.6000700@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If writing subsys->filter returns EINVAL or ENOSPC, the original
filters in subsys/ and subsys/events/ will be removed. This is
definitely wrong.
[ Impact: fix filter setting semantics on error condition ]
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <49ED8DD2.2070700@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The startup tests for the event tracer also runs with the function
tracer enabled. The "wakeup" version of the trace commit was used
which can grab spinlocks. If a task was preempted by an NMI
that called a function being traced, it could deadlock due to the
function tracer trying to grab the same lock.
Thanks to Frederic Weisbecker for pointing out where the bug was.
Reported-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Althought using the irq level (hardirq_count, softirq_count and in_nmi)
was nice to detect bad recursion right away, but since the counters are
not atomically updated with respect to the interrupts, the function tracer
might trigger the test from an interrupt handler before the hardirq_count
is updated. This will trigger a false warning.
This patch converts the recursive detection to a simple counter.
If the depth is greater than 16 then the recursive detection will trigger.
16 is more than enough for any nested interrupts.
[ Impact: fix false positive trace recursion detection ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ring_buffer_event_discard is not tied to ring_buffer_lock_reserve.
It can be called inside or outside the reserve/commit. Even if it
is called inside the reserve/commit the commit part must also be called.
Only ring_buffer_discard_commit can be used as a replacement for
ring_buffer_unlock_commit.
This patch removes the trace_recursive_unlock from ring_buffer_event_discard
since it would be the wrong place to do so.
[Impact: prevent breakage in trace recursive testing ]
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The recursive tests to detect same level recursion in the ring buffers
did not account for the hard/softirq_counts to be shifted. Thus the
numbers could be larger than then mask to be tested.
This patch includes the shift for the calculation of the irq depth.
[ Impact: stop false positives in trace recursion detection ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The late_initcall calls a helper function instead of the proper
init event selftest function.
This update may have been lost due to conflicting merges.
[ Impact: fix compiler warning and call extended event trace self tests ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently we have two configs: EVENT_TRACING and EVENT_TRACER.
All tracers enable EVENT_TRACING. The EVENT_TRACER is only a
convenience to enable the EVENT_TRACING when no other tracers
are enabled.
The names EVENT_TRACER and EVENT_TRACING are too similar and confusing.
This patch renames EVENT_TRACER to ENABLE_EVENT_TRACING to be more
appropriate to what it actually does, as well as add a comment in
the help menu to explain the option's purpose.
[ Impact: rename config option to reduce confusion ]
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
During testing we often use randconfig to test various kernels.
The current configuration set up does not give an easy way to disable
all tracing with a single config. The case where randconfig would
test all tracing disabled is very unlikely.
This patch adds a config option to enable or disable all tracing.
It is hooked into the tracing menu just like other submenus are done.
[ Impact: allow randconfig to easily produce all traces disabled ]
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch makes the branch profiling into a choice selection:
None - no branch profiling
likely/unlikely - only profile likely/unlikely branches
all - profile all branches
The all profiler will also enable the likely/unlikely branches.
This does not change the way the profiler works or the dependencies
between the profilers.
What this patch does, is keep the branch profiling from being selected
by an allyesconfig make. The branch profiler is very intrusive and
it is known to break various architecture builds when selected as an
allyesconfig.
[ Impact: prevent branch profiler from being selected in allyesconfig ]
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In case of tracing recursion detection, we only get the stacktrace.
But the current context may be very useful to debug the issue.
This patch adds the softirq/hardirq/nmi context with the warning
using lockdep context display to have a familiar output.
v2: Use printk_once()
v3: drop {hardirq,softirq}_context which depend on lockdep,
only keep what is part of current->trace_recursion,
sufficient to debug the warning source.
[ Impact: print context necessary to debug recursion ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Commit 900af0d973 (PM: Change suspend
code ordering) changed the ordering of suspend code in such a way
that the platform .prepare() callback is now executed after the
device drivers' late suspend callbacks have run. Unfortunately, this
turns out to break ARM platforms that need to talk via I2C to power
control devices during the .prepare() callback.
For this reason introduce two new platform suspend callbacks,
.prepare_late() and .wake(), that will be called just prior to
disabling non-boot CPUs and right after bringing them back on line,
respectively, and use them instead of .prepare() and .finish() for
ACPI suspend. Make the PM core execute the .prepare() and .finish()
platform suspend callbacks where they were executed previously (that
is, right after calling the regular suspend methods provided by
device drivers and right before executing their regular resume
methods, respectively).
It is not necessary to make analogous changes to the hibernation
code and data structures at the moment, because they are only used
by ACPI platforms.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reported-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Len Brown <len.brown@intel.com>
This function is not actually used right now, since the original use
case for it was done with insert_resource_expand_to_fit() instead.
However, we now have another usage case that wants to basically do a
"reserve IO resource, splitting around existing resources", however that
one doesn't actually want the "recurse into the conflicting resource"
logic at all.
And since recursing into the conflicting resource was the most complex
part, and isn't wanted, just remove it. Maybe we'll some day want both
versions, but we can just resurrect the logic then.
Tested-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
trace_printk can be called from any context, including NMIs.
If this happens, then we must test for for recursion before
grabbing any spinlocks.
This patch prevents trace_printk from being called recursively.
[ Impact: prevent hard lockup in lockdep event tracer ]
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing infrastructure allows for recursion. That is, an interrupt
may interrupt the act of tracing an event, and that interrupt may very well
perform its own trace. This is a recursive trace, and is fine to do.
The problem arises when there is a bug, and the utility doing the trace
calls something that recurses back into the tracer. This recursion is not
caused by an external event like an interrupt, but by code that is not
expected to recurse. The result could be a lockup.
This patch adds a bitmask to the task structure that keeps track
of the trace recursion. To find the interrupt depth, the following
algorithm is used:
level = hardirq_count() + softirq_count() + in_nmi;
Here, level will be the depth of interrutps and softirqs, and even handles
the nmi. Then the corresponding bit is set in the recursion bitmask.
If the bit was already set, we know we had a recursion at the same level
and we warn about it and fail the writing to the buffer.
After the data has been committed to the buffer, we clear the bit.
No atomics are needed. The only races are with interrupts and they reset
the bitmask before returning anywy.
[ Impact: detect same irq level trace recursion ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Not all the necessary symbols were exported to allow for tracing
by modules. This patch adds them in.
[ Impact: allow modules to commit data to the ring buffer ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>