linux

Commit Graph

Author	SHA1	Message	Date
Thomas Gleixner	9d591edd02	genirq: Allow shared oneshot interrupts Support ONESHOT on shared interrupts, if all drivers agree on it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20110223234956.483640430@linutronix.de>	2011-02-25 20:24:21 +01:00
Thomas Gleixner	b5faba21a6	genirq: Prepare the handling of shared oneshot interrupts For level type interrupts we need to track how many threads are on flight to avoid useless interrupt storms when not all thread handlers have finished yet. Keep track of the woken threads and only unmask when there are no more threads in flight. Yes, I'm lazy and using a bitfield. But not only because I'm lazy, the main reason is that it's way simpler than using a refcount. A refcount based solution would need to keep track of various things like crashing the irq thread, spurious interrupts coming in, disables/enables, free_irq() and some more. The bitfield keeps the tracking simple and makes things just work. It's also nicely confined to the thread code pathes and does not require additional checks all over the place. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20110223234956.388095876@linutronix.de>	2011-02-25 20:24:21 +01:00
Thomas Gleixner	1204e95689	genirq: Make warning in handle_percpu_event useful The WARN_ON_ONCE in handle_percpu_event() which emits a warning when an action handler returns with interrupts enabled is not really useful. It does not reveal the interrupt number and handler function which caused it. Make it WARN_ONCE() and add the information. Reported-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-25 17:17:18 +01:00
Heiko Carstens	7e9498705e	sched: Add #ifdef around irq time accounting functions Get rid of this: kernel/sched.c:3731:13: warning: 'irqtime_account_idle_ticks' defined but not used kernel/sched.c:3732:13: warning: 'irqtime_account_process_tick' defined but not used Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Venkatesh Pallipadi <venki@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110225133228.GD7469@osiris.boeblingen.de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-25 14:39:48 +01:00
Peter Zijlstra	768a06e2ca	perf: Simplify task_clock_event_read() There is no point in us having different code paths for nmi and !nmi here, so remove the !nmi one. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:35:47 +01:00
Stephane Eranian	3f7cce3c18	perf_events: Fix rcu and locking issues with cgroup support This patches ensures that we do not end up calling perf_cgroup_from_task() when there is no cgroup event. This avoids potential RCU and locking issues. The change in perf_cgroup_set_timestamp() ensures we check against ctx->nr_cgroups. It also avoids calling perf_clock() tiwce in a row. It also ensures we do need to grab ctx->lock before calling the function. We drop update_cgrp_time() from task_clock_event_read() because it is not needed. This also avoids having to deal with perf_cgroup_from_task(). Thanks to Peter Zijlstra for his help on this. Signed-off-by: Stephane Eranian <eranian@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4d5e76b8.815bdf0a.7ac3.774f@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:35:46 +01:00
Mike Galbraith	511f67a599	sched, autogroup: Stop claiming ownership of the root task group Disown it, and only display autogroup association if one exists. Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1298383320.8036.5.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:34:03 +01:00
Yong Zhang	800d4d30c8	sched, autogroup: Stop going ahead if autogroup is disabled when autogroup is disable from the beginning, sched_autogroup_create_attach() autogroup_move_group() <== 1 sched_move_task() <== 2 task_move_group_fair() set_task_rq() task_group() autogroup_task_group() We go the whole path without doing anything useful. Then stop going further if autogroup is disabled. But there will be a race window between 1 and 2, in which sysctl_sched_autogroup_enabled is enabled. This issue will be toke by following patch. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <1298185696-4403-4-git-send-email-yong.zhang0@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:33:59 +01:00
Yong Zhang	1747b21fec	sched, autogroup, sysctl: Use proc_dointvec_minmax() instead sched_autogroup_enabled has min/max value, proc_dointvec_minmax() is be used for this case. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1298185696-4403-2-git-send-email-yong.zhang0@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:33:58 +01:00
Peter Zijlstra	866ab43efd	sched: Fix the group_imb logic On a 262 machine something like: taskset -c 3-11 bash -c 'for ((i=0;i<9;i++)) do while :; do :; done & done' _should_ result in 9 busy CPUs, each running 1 task. However it didn't quite work reliably, most of the time one cpu of the second socket (6-11) would be idle and one cpu of the first socket (0-5) would have two tasks on it. The group_imb logic is supposed to deal with this and detect when a particular group is imbalanced (like in our case, 0-2 are idle but 3-5 will have 4 tasks on it). The detection phase needed a bit of a tweak as it was too weak and required more than 2 avg weight tasks difference between idle and busy cpus in the group which won't trigger for our test-case. So cure that to be one or more avg task weight difference between cpus. Once the detection phase worked, it was then defeated by the f_b_g() tests trying to avoid ping-pongs. In particular, this_load >= max_load triggered because the pulling cpu (the (first) idle cpu in on the second socket, say 6) would find this_load to be 5 and max_load to be 4 (there'd be 5 tasks running on our socket and only 4 on the other socket). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nikhil Rao <ncrao@google.com> Cc: Venkatesh Pallipadi <venki@google.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:33:57 +01:00
Peter Zijlstra	cc57aa8f4b	sched: Clean up some f_b_g() comments The existing comment tends to grow state (as it already has), split it up and place it near the actual tests. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nikhil Rao <ncrao@google.com> Cc: Venkatesh Pallipadi <venki@google.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:33:56 +01:00
Peter Zijlstra	c186fafe9a	sched: Clean up remnants of sd_idle With the wholesale removal of the sd_idle SMT logic we can clean up some more. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nikhil Rao <ncrao@google.com> Cc: Venkatesh Pallipadi <venki@google.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:33:55 +01:00
Ingo Molnar	d927dc9379	Merge commit 'v2.6.38-rc6' into sched/core Merge reason: Pick up the latest fixes before queueing up new changes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:31:38 +01:00
Jan Beulich	fd4afaf333	genirq: Streamline kernel/irq/Kconfig "def_bool n" without prompt is pointless, these should be just "bool". [ tglx: Adapted to latest changes ] Signed-off-by: Jan Beulich <jbeulich@novell.com> LKML-Reference: <4D5D3309020000780003264A@vpn.id2.novell.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-22 22:33:10 +01:00
Thomas Gleixner	dbebbfbb16	rtmutex: tester: Remove the remaining BKL leftovers We just leave the numbers assinged as commemoration and in case that someone was crazy enough to reimplement the test stuff out of tree. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-22 22:07:22 +01:00
Linus Torvalds	571020df6f	Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Disable the SHIRQ_DEBUG call in request_threaded_irq for now genirq: Prevent access beyond allocated_irqs bitmap	2011-02-22 09:26:17 -08:00
Linus Torvalds	ee88347755	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Fix throttle logic perf, x86: P4 PMU: Fix spurious NMI messages	2011-02-22 09:25:55 -08:00
Thomas Gleixner	70433c0161	genirq: Use the correct variable for note_interrupt note_interrupt wants to be called with the combined result of all handlers called, not with the last one. If it's a shared interrupt then the last handler might return IRQ_NONE often enough to trigger the spurious dectector which turns off a perfectly fine working interrupt line. Bug was introduced in commit 1277a532(genirq: Simplify handle_irq_event()). Yes, I really messed up there. First the variable ret should not have been named differently to avoid similarity with retval. Second it should have been declared in the do {} loop. Rename it to res and move it into the do {} loop and vanish under a huge brown paperbag. Reported-bisected-tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-22 13:02:03 +01:00
John Stultz	7fdd7f8900	timers: Export CLOCK_BOOTTIME via the posix timers interface This patch exports CLOCK_BOOTTIME through the posix timers interface CC: Jamie Lokier <jamie@shareable.org> CC: Thomas Gleixner <tglx@linutronix.de> CC: Alexander Shishkin <virtuoso@slind.org> CC: Arve Hjønnevåg <arve@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-02-21 12:53:09 -08:00
John Stultz	70a08cca12	timers: Add CLOCK_BOOTTIME hrtimer base CLOCK_MONOTONIC stops while the system is in suspend. This is because to applications system suspend is invisible. However, there is a growing set of applications that are wanting to be suspend-aware, but do not want to deal with the complications of CLOCK_REALTIME (which might jump around if settimeofday is called). For these applications, I propose a new clockid: CLOCK_BOOTTIME. CLOCK_BOOTTIME is idential to CLOCK_MONOTONIC, except it also includes any time spent in suspend. This patch add hrtimer base for CLOCK_BOOTTIME, using get_monotonic_boottime/ktime_get_boottime, to allow in kernel users to set timers against. CC: Jamie Lokier <jamie@shareable.org> CC: Thomas Gleixner <tglx@linutronix.de> CC: Alexander Shishkin <virtuoso@slind.org> CC: Arve Hjønnevåg <arve@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-02-21 12:53:08 -08:00
John Stultz	314ac37150	time: Extend get_xtime_and_monotonic_offset() to also return sleep Extend get_xtime_and_monotonic_offset to get_xtime_and_monotonic_and_sleep_offset(). CC: Jamie Lokier <jamie@shareable.org> CC: Thomas Gleixner <tglx@linutronix.de> CC: Alexander Shishkin <virtuoso@slind.org> CC: Arve Hjønnevåg <arve@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-02-21 12:53:07 -08:00
John Stultz	abb3a4ea2e	time: Introduce get_monotonic_boottime and ktime_get_boottime This adds new functions that return the monotonic time since boot (in other words, CLOCK_MONOTONIC + suspend time). CC: Jamie Lokier <jamie@shareable.org> CC: Thomas Gleixner <tglx@linutronix.de> CC: Alexander Shishkin <virtuoso@slind.org> CC: Arve Hjønnevåg <arve@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-02-21 12:53:05 -08:00
John Stultz	e06383db9e	hrtimers: extend hrtimer base code to handle more then 2 clockids The hrtimer code is written mainly with CLOCK_REALTIME and CLOCK_MONOTONIC in mind. These are clockids 0 and 1 resepctively. However, if we are to introduce any new hrtimer bases, using new clockids, we have to skip the cputimers (clockids 2,3) as well as other clockids that may not impelement timers. This patch adds a little bit of indirection between the clockid and the base, so that we can extend the base by one when we add a new clockid at number 7 or so. CC: Jamie Lokier <jamie@shareable.org> CC: Thomas Gleixner <tglx@linutronix.de> CC: Alexander Shishkin <virtuoso@slind.org> CC: Arve Hjønnevåg <arve@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-02-21 12:53:04 -08:00
Thomas Gleixner	8fff39e069	genirq: Add missing break in __irq_set_trigger() The switch case in __irq_set_trigger() lacks a break, which emits a pr_err unconditionally on success. Reported-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-21 21:20:00 +01:00
Yinghai Lu	ed4dea6e0e	genirq: Use IRQ_BITMAP_BITS as search size in irq_alloc_descs() The runtime expansion of nr_irqs does not take into account that bitmap_find_next_zero_area() returns "start" + size in case the search for an matching zero area fails. That results in a start value which can be completely off and is not covered by the following expand_nr_irqs() and possibly outside of the absolute limit. But we use it without further checking. Use IRQ_BITMAP_BITS as the limit for the bitmap search and expand nr_irqs when the start bit is beyond nr_irqs. So start is always pointing to the correct area in the bitmap. nr_irqs is just the limit for irq enumerations, not the real limit for the irq space. [ tglx: Let irq_expand_nr_irqs() take the new upper end so we do not expand nr_irqs more than necessary. Made changelog readable ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4D6014F9.8040605@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-21 21:20:00 +01:00
Thomas Gleixner	a61d825808	genirq: Fix misplaced status update in irq_disable() We lazy disable interrupt lines, so only mark the line masked, when the chip provides an irq_disable callback. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-21 21:19:59 +01:00
Tejun Heo	24d51add74	workqueue: fix build failure introduced by s/freezeable/freezable/ wq:fixes-2.6.38 does s/WQ_FREEZEABLE/WQ_FREEZABLE and wq:for-2.6.39 adds new usage of the flag. The combination of the two creates a build failure after merge. Fix it by renaming all freezeables to freezables. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>	2011-02-21 10:07:23 +01:00
Tejun Heo	43d133c18b	Merge branch 'master' into for-2.6.39	2011-02-21 09:43:56 +01:00
David S. Miller	da935c66ba	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: Documentation/feature-removal-schedule.txt drivers/net/e1000e/netdev.c net/xfrm/xfrm_policy.c	2011-02-19 19:17:35 -08:00
Thomas Gleixner	a439520f8b	genirq: Implement irq_data based move_*_irq() versions No need to lookup the irq descriptor when calling from a chip callback function which has irq_data already handy. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:25 +01:00
Thomas Gleixner	77694b408a	genirq; Add fasteoi irq_chip quirk Some chips want irq_eoi() only called when an interrupt is actually handled. So they have checks for INPROGRESS and DISABLED in their irq_eoi callbacks. Add a chip flag, which allows to handle that in the generic code. No impact on the fastpath. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:24 +01:00
Thomas Gleixner	781295762d	genirq: Add preflow handler support sparc64 needs to call a preflow handler on certain interrupts befor calling the action chain. Integrate it into handle_fasteoi_irq. Must be enabled via CONFIG_IRQ_FASTEOI_PREFLOW. No impact when disabled. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: David S. Miller <davem@davemloft.net>	2011-02-19 12:58:24 +01:00
Thomas Gleixner	3836ca08aa	genirq: Consolidate set_chip_handler functions No need to have separate functions if we have one plus inline wrappers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:23 +01:00
Thomas Gleixner	02725e7471	genirq: Use irq_get/put functions Convert the management functions to use the common irq_get/put function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:23 +01:00
Thomas Gleixner	d5eb4ad2df	genirq: Implement irq_get/put_desc_[bus]locked/unlock() Most of the managing functions get the irq descriptor and lock it - either with or without buslock. Instead of open coding this over and over provide a common function to do that. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:23 +01:00
Thomas Gleixner	091738a266	genirq: Remove real old transition functions These transition helpers are stale for years now. Remove them. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:23 +01:00
Thomas Gleixner	a6967caf00	genirq: Remove desc->status when GENERIC_HARDIRQS_NO_COMPAT=y If everything uses the right accessors, then enabling GENERIC_HARDIRQS_NO_COMPAT should just work. If not it will tell you. Don't be lazy and use the trick which I use in the core code! git grep status_use_accessors will unearth it in a split second. Offenders are tracked down and not slapped with stinking trouts. This time we use frozen shark for a better educational value. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:22 +01:00
Thomas Gleixner	e1ef824146	genirq: Reflect IRQ_MOVE_PCNTXT in irq_data state Required by x86. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:22 +01:00
Thomas Gleixner	7f94226f03	genirq: Move wakeup state to irq_data Some irq_chips need to know the state of wakeup mode for setting the trigger type etc. Reflect it in irq_data state. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:22 +01:00
Thomas Gleixner	d4d5e08960	genirq: Add IRQCHIP_SET_TYPE_MASKED flag irq_chips, which require to mask the chip before changing the trigger type should set this flag. So the core takes care of it and the requirement for looking into desc->status in the chip goes away. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Walleij <linus.walleij@stericsson.com> Cc: Lars-Peter Clausen <lars@metafoo.de>	2011-02-19 12:58:22 +01:00
Thomas Gleixner	5d4d8fc9ac	genirq: Cleanup irq.h Put the constants into an enum and document them. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:21 +01:00
Thomas Gleixner	f9e4989eb8	genirq: Force wrapped access to desc->status in core code Force the usage of wrappers by another nasty CPP substitution. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:21 +01:00
Thomas Gleixner	1ccb4e612f	genirq: Wrap the remaning IRQ_* flags Use wrappers to keep them away from the core code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:21 +01:00
Thomas Gleixner	876dbd4cc1	genirq: Mirror irq trigger type bits in irq_data.state That's the data structure chip functions get provided. Also allow them to signal the core code that they updated the flags in irq_data.state by returning IRQ_SET_MASK_OK_NOCOPY. The default is unchanged. The type bits should be accessed via: val = irqd_get_trigger_type(irqdata); and irqd_set_trigger_type(irqdata, val); Coders who access them directly will be tracked down and slapped with stinking trouts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:20 +01:00
Thomas Gleixner	2bdd10558c	genirq: Move IRQ_AFFINITY_SET to core Keep status in sync until last abuser is gone. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:20 +01:00
Thomas Gleixner	bce43032ad	genirq: Reuse existing can set affinty check Add a !desc check while at it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:20 +01:00
Thomas Gleixner	a005677b3d	genirq: Mirror IRQ_PER_CPU and IRQ_NO_BALANCING in irq_data.state That's the right data structure to look at for arch code. Accessor functions are provided. irqd_is_per_cpu(irqdata); irqd_can_balance(irqdata); Coders who access them directly will be tracked down and slapped with stinking trouts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:20 +01:00
Thomas Gleixner	1ce6068dac	genirq: Move debug code to separate header It'll break when I'm going to undefine the constants. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:19 +01:00
Thomas Gleixner	fae581e588	genirq: Remove CHECK_IRQ_PER_CPU from core code Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:19 +01:00
Thomas Gleixner	6a58fb3bad	genirq: Remove CONFIG_IRQ_PER_CPU The saving of this switch is minimal versus the ifdef mess it creates. Simple enable PER_CPU unconditionally and remove the config switch. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:19 +01:00
Thomas Gleixner	f230b6d5c4	genirq: Add IRQ_MOVE_PENDING to irq_data.state chip implementations need to know about it. Keep status in sync until all users are fixed. Accessor function: irqd_is_setaffinity_pending(irqdata) Coders who access them directly will be tracked down and slapped with stinking trouts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:18 +01:00
Thomas Gleixner	6d2cd17fde	genirq: Move IRQ_WAKEUP to core No users outside of core. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:18 +01:00
Thomas Gleixner	c531e8361f	genirq: Move IRQ_SUSPENDED to core No users outside of core. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:18 +01:00
Thomas Gleixner	6e40262ea4	genirq: Move IRQ_MASKED to core Keep status in sync until all users are fixed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:17 +01:00
Thomas Gleixner	2a0d6fb335	genirq: Move IRQ_PENDING flag to core Keep status in sync until all users are fixed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:17 +01:00
Thomas Gleixner	c1594b77e4	genirq: Move IRQ_DISABLED to core Keep status in sync until all abusers are fixed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:17 +01:00
Thomas Gleixner	163ef30911	genirq: Move IRQ_REPLAY and IRQ_WAITING to core No users outside of core. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:16 +01:00
Thomas Gleixner	3d67baec7f	genirq: Move IRQ_ONESHOT to core No users outside of core. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:16 +01:00
Thomas Gleixner	009b4c3b8a	genirq: Add IRQ_INPROGRESS to core We need to maintain the flag for now in both fields status and istate. Add a CONFIG_GENERIC_HARDIRQS_NO_COMPAT switch to allow testing w/o the status one. Wrap the access to status IRQ_INPROGRESS in a inline which can be turned of with CONFIG_GENERIC_HARDIRQS_NO_COMPAT along with the define. There is no reason that anything outside of core looks at this. That needs some modifications, but we'll get there. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:16 +01:00
Thomas Gleixner	6954b75b48	genirq: Move IRQ_POLL_INPROGRESS to core No users outside of core. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:15 +01:00
Thomas Gleixner	6f91a52d9b	genirq: Use modify_status for set_irq_nested_thread No need for a separate function in the core code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:15 +01:00
Thomas Gleixner	7acdd53e5b	genirq: Move IRQ_SPURIOUS_DISABLED to core state No users outside. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:15 +01:00
Thomas Gleixner	bd062e7667	genirq: Move IRQ_AUTODETECT to internal state No users outside of core Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:15 +01:00
Thomas Gleixner	e6bea9c404	genirq: Protect tglx from tripping over his own feet The irq_desc.status field will either go away or renamed to settings. Anyway we need to maintain compatibility to avoid breaking the world and some more. While moving bits into the core, I need to avoid that I use any of the still existing IRQ_ bits in the core code by typos. So that file will hold the inline wrappers and some nasty CPP tricks to break the build when typoed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:14 +01:00
Thomas Gleixner	dbec07bac6	genirq: Add internal state field to irq_desc That field will contain internal state information which is not going to be exposed to anything outside the core code - except via accessor functions. I'm tired of everyone fiddling in irq_desc.status. core_internal_state__do_not_mess_with_it is clear enough, annoying to type and easy to grep for. Offenders will be tracked down and slapped with stinking trouts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:14 +01:00
Thomas Gleixner	35e857cbeb	genirq: Fixup core code namespace fallout Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:13 +01:00
Thomas Gleixner	c78b9b65fa	genirq: Implement generic irq_show_interrupts() All archs implement show_interrupts() in more or less the same way. That's tons of duplicated code with different bugs with no value. Implement a generic version and deprecate show_interrupts() Unfortunately we need some ifdeffery for !GENERIC_HARDIRQ archs. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:13 +01:00
Thomas Gleixner	1277a5325a	genirq: Simplify handle_irq_event() Now that all core users are converted one layer can go. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:13 +01:00
Thomas Gleixner	0877d66257	genirq: Use handle_irq_event() in the spurious poll code Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:13 +01:00
Thomas Gleixner	849f061c25	genirq: Use handle_perpcu_event() in handle_percpu_irq() Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:12 +01:00
Thomas Gleixner	a60a5dc2db	genirq: Use handle_irq_event() in handle_edge_irq() It's safe to drop the IRQ_INPROGRESS flag between action chain walks as we are protected by desc->lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:12 +01:00
Thomas Gleixner	a7ae4de5c8	genirq: Use handle_irq_event() in handle_fasteoi_irq() Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:12 +01:00
Thomas Gleixner	1529866c63	genirq: Use handle_irq_event() in handle_level_irq() Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:12 +01:00
Thomas Gleixner	107781e721	genirq: Use handle_irq_event() in handle_simple_irq() Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:11 +01:00
Thomas Gleixner	4912609f22	genirq: Implement handle_irq_event() Core code replacement for the ugly camel case. It contains all the code which is shared in all handlers. clear status flags set INPROGRESS flag unlock call action chain note_interrupt lock clr INPROGRESS flag Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:11 +01:00
Thomas Gleixner	d78f8dd36b	genirq: Do not fiddle with IRQ_MASKED in handle_edge_irq() IRQ_MASKED is set in mask_ack_irq() anyway. Remove it from handle_edge_irq() to allow simpler ab^HHreuse of that function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20110202212551.918484270@linutronix.de>	2011-02-19 12:58:11 +01:00
Thomas Gleixner	3aae994fb0	genirq: Consolidate IRQ_DISABLED Handle IRQ_DISABLED consistent. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:11 +01:00
Thomas Gleixner	50f7c03275	genirq: Remove default magic Now that everything uses the wrappers, we can remove the default functions. None of those functions is performance critical. That makes the IRQ_MASKED flag tracking fully consistent. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:10 +01:00
Thomas Gleixner	87923470c7	genirq: Consolidate disable/enable Create irq_disable/enable and use them to keep the flags consistent. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:10 +01:00
Thomas Gleixner	4699923861	genirq: Consolidate startup/shutdown of interrupts Aside of duplicated code some of the startup/shutdown sites do not handle the MASKED/DISABLED flags and the depth field at all. Move that to a helper function and take care of it there. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20110202212551.787481468@linutronix.de>	2011-02-19 12:58:10 +01:00
Thomas Gleixner	3b56f0585f	genirq: Remove bogus conditional The if (chip->irq_shutdown) check will always evaluate to true, as we fill in chip->irq_shutdown with default_shutdown in irq_chip_set_defaults() if the chip does not provide its own function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20110202212551.667607458@linutronix.de>	2011-02-19 12:58:10 +01:00
Thomas Gleixner	1535dfacbf	genirq: Move irq thread flags to core Soleley used in core code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:09 +01:00
Thomas Gleixner	fe200ae48e	genirq: Mark polled irqs and defer the real handler With the chip.end() function gone we might run into a situation where a poll call runs and the real interrupt comes in, sees IRQ_INPROGRESS and disables the line. That might be a perfect working one, which will then be masked forever. So mark them polled while the poll runs. When the real handler sees IRQ_INPROGRESS it checks the poll flag and waits for the polling to complete. Add the necessary amount of sanity checks to it to avoid deadlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:09 +01:00
Thomas Gleixner	d05c65fff0	genirq: spurious: Run only one poller at a time No point in running concurrent pollers which confuse each other by setting PENDING. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:09 +01:00
Thomas Gleixner	c7259cd7af	genirq: Do not poll disabled, percpu and timer interrupts There is no point in polling disabled lines. percpu does not make sense at all because we only poll on the cpu we're currently running on. Also polling per_cpu interrupts is racy as hell. The handler runs without locking so we might get a huge surprise. If the timer interrupt needs polling, then we wont get there anyway. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:09 +01:00
Thomas Gleixner	fa27271bc8	genirq: Fixup poll handling try_one_irq() contains redundant code and lots of useless checks for shared interrupts. Check for shared before setting IRQ_INPROGRESS and then call handle_IRQ_event() while pending. Shorter version with the same functionality. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:08 +01:00
Thomas Gleixner	b738a50a20	genirq: Warn when handler enables interrupts We run all handlers with interrupts disabled and expect them not to enable them. Warn when we catch one who does. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:08 +01:00
Thomas Gleixner	1082687e8d	genirq: Plug race in report_bad_irq() We cannot walk the action chain unlocked. Even if IRQ_INPROGRESS is set an action can be removed and we follow a null pointer. It's safe to take the lock there, because the code which removes the action will call synchronize_irq() which waits unlocked for IRQ_INPROGRESS going away. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:08 +01:00
Thomas Gleixner	2b879eaf09	genirq: Remove redundant thread affinity setting Thread affinity is already set by setup_affinity(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:08 +01:00
Thomas Gleixner	3b8249e759	genirq: Do not copy affinity before set While rumaging through arch code I found that there are a few workarounds which deal with the fact that the initial affinity setting from request_irq() copies the mask into irq_data->affinity before the chip code is called. In the normal path we unconditionally copy the mask when the chip code returns 0. Copy after the code is called and add a return code IRQ_SET_MASK_OK_NOCOPY for the chip functions, which prevents the copy. That way we see the real mask when the chip function decided to truncate it further as some arches do. IRQ_SET_MASK_OK is 0, which is the current behaviour. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:07 +01:00
Thomas Gleixner	569bda8df1	genirq: Always apply cpu online mask If the affinity had been set by the user, then a later request_irq() will honour that setting. But online cpus can have changed. So apply the online mask and for this case as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:07 +01:00
Thomas Gleixner	b008207cbd	genirq: Rremove redundant check IRQ_NO_BALANCING is already checked in irq_can_set_affinity() above, no need to check it again. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:07 +01:00
Thomas Gleixner	1fa46f1f07	genirq: Simplify affinity related code There is lot of #ifdef CONFIG_GENERIC_PENDING_IRQ along with duplicated code in the irq core. Move the #ifdeffery into one place and cleanup the code so it's readable. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:07 +01:00
Thomas Gleixner	a0cd9ca2b9	genirq: Namespace cleanup The irq namespace has become quite convoluted. My bad. Clean it up and deprecate the old functions. All new functions follow the scheme: irq number based: irq_set/get/xxx/_xxx(unsigned int irq, ...) irq_data based: irq_data_set/get/xxx/_xxx(struct irq_data d, ....) irq_desc based: irq_desc_get_xxx(struct irq_desc desc) Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:06 +01:00
Thomas Gleixner	43abe43ce0	genirq: Add missing buslock to set_irq_type(), set_irq_wake() chips behind a slow bus cannot update the chip under desc->lock, but we miss the chip_buslock/chip_bus_sync_unlock() calls around the set type and set wake functions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:06 +01:00
Thomas Gleixner	e7bcecb7b1	genirq: Make nr_irqs runtime expandable We face more and more the requirement to expand nr_irqs at runtime. The reason are irq expanders which can not be detected in the early boot stage. So we speculate nr_irqs to have enough room. Further Xen needs extra irq numbers and we really want to avoid adding more "detection" code into the early boot. There is no real good reason why we need to limit nr_irqs at early boot. Allow the allocation code to expand nr_irqs. We have already 8k extra number space in the allocation bitmap, so lets use it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:06 +01:00
Thomas Gleixner	218502bfe6	Merge branch 'irq/urgent' into irq/core Reason: Further patches are conflicting with mainline fixes Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:56:43 +01:00
Thomas Gleixner	6d83f94db9	genirq: Disable the SHIRQ_DEBUG call in request_threaded_irq for now With CONFIG_SHIRQ_DEBUG=y we call a newly installed interrupt handler in request_threaded_irq(). The original implementation (commit `a304e1b8`) called the handler _BEFORE_ it was installed, but that caused problems with handlers calling disable_irq_nosync(). See commit `377bf1e4`. It's braindead in the first place to call disable_irq_nosync in shared handlers, but .... Moving this call after we installed the handler looks innocent, but it is very subtle broken on SMP. Interrupt handlers rely on the fact, that the irq core prevents reentrancy. Now this debug call violates that promise because we run the handler w/o the IRQ_INPROGRESS protection - which we cannot apply here because that would result in a possibly forever masked interrupt line. A concurrent real hardware interrupt on a different CPU results in handler reentrancy and can lead to complete wreckage, which was unfortunately observed in reality and took a fricking long time to debug. Leave the code here for now. We want this debug feature, but that's not easy to fix. We really should get rid of those disable_irq_nosync() abusers and remove that function completely. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Anton Vorontsov <avorontsov@ru.mvista.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Arjan van de Ven <arjan@infradead.org> Cc: stable@kernel.org # .28 -> .37	2011-02-19 12:11:13 +01:00
Thomas Gleixner	c1ee626428	genirq: Prevent access beyond allocated_irqs bitmap Lars-Peter Clausen pointed out: I stumbled upon this while looking through the existing archs using SPARSE_IRQ. Even with SPARSE_IRQ the NR_IRQS is still the upper limit for the number of IRQs. Both PXA and MMP set NR_IRQS to IRQ_BOARD_START, with IRQ_BOARD_START being the number of IRQs used by the core. In various machine files the nr_irqs field of the ARM machine defintion struct is then set to "IRQ_BOARD_START + NR_BOARD_IRQS". As a result "nr_irqs" will greater then NR_IRQS which then again causes the "allocated_irqs" bitmap in the core irq code to be accessed beyond its size overwriting unrelated data. The core code really misses a sanity check there. This went unnoticed so far as by chance the compiler/linker places data behind that bitmap which gets initialized later on those affected platforms. So the obvious fix would be to add a sanity check in early_irq_init() and break all affected platforms. Though that check wants to be backported to stable as well, which will require to fix all known problematic platforms and probably some more yet not known ones as well. Lots of churn. A way simpler solution is to allocate a slightly larger bitmap and avoid the whole churn w/o breaking anything. Add a few warnings when an arch returns utter crap. Reported-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org # .37 Cc: Haojian Zhuang <haojian.zhuang@marvell.com> Cc: Eric Miao <eric.y.miao@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org>	2011-02-19 12:10:51 +01:00
Linus Torvalds	bc3adfc670	Merge branch 'fixes-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'fixes-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: make sure MAYDAY_INITIAL_TIMEOUT is at least 2 jiffies long workqueue, freezer: unify spelling of 'freeze' + 'able' to 'freezable' workqueue: wake up a worker when a rescuer is leaving a gcwq	2011-02-18 12:36:06 -08:00
Richard Cochran	db1c1cce4a	ntp: Remove redundant and incorrect parameter check The ADJ_SETOFFSET code redundantly checks the range of the nanoseconds field of the time value. This field is checked again in the subsequent call to timekeeping_inject_offset(). Also, as is, the check will not detect whether the number of microseconds is out of range. Let timekeeping_inject_offset() do the error checking. Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Cc: johnstul@us.ibm.com LKML-Reference: <20110218090724.GA2924@riccoc20.at.omicron.at> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-18 17:01:12 +01:00
Ingo Molnar	e9345aab67	Revert "tracing: Add unstable sched clock note to the warning" This reverts commit `5e38ca8f3e`. Breaks the build of several !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK architectures. Cc: Jiri Olsa <jolsa@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Message-ID: <20110217171823.GB17058@elte.hu> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-18 08:09:49 +01:00
Ingo Molnar	5beda5f6e4	Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core	2011-02-17 14:11:15 +01:00
Stanislaw Gruszka	2e725a065b	PM / Hibernate: Return error code when alloc_image_page() fails Currently we return 0 in swsusp_alloc() when alloc_image_page() fails. Fix that. Also remove unneeded "error" variable since the only useful value of error is -ENOMEM. [rjw: Fixed up the changelog and changed subject.] Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl> Cc: stable@kernel.org Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2011-02-16 21:53:52 +01:00
Tejun Heo	3233cdbd9f	workqueue: make sure MAYDAY_INITIAL_TIMEOUT is at least 2 jiffies long MAYDAY_INITIAL_TIMEOUT is defined as HZ / 100 and depending on configuration may end up 0 or 1. Even when it's 1, depending on when the mayday timer is added in the current jiffy interval, it may expire way before a jiffy has passed. Make sure MAYDAY_INITIAL_TIMEOUT is at least two to guarantee that at least a full jiffy has passed before calling rescuers. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ray Jui <rjui@broadcom.com> Cc: stable@kernel.org	2011-02-16 18:10:19 +01:00
Tejun Heo	58a69cb47e	workqueue, freezer: unify spelling of 'freeze' + 'able' to 'freezable' There are two spellings in use for 'freeze' + 'able' - 'freezable' and 'freezeable'. The former is the more prominent one. The latter is mostly used by workqueue and in a few other odd places. Unify the spelling to 'freezable'. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Alan Stern <stern@rowland.harvard.edu> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Dmitry Torokhov <dtor@mail.ru> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Alex Dubov <oakad@yahoo.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Steven Whitehouse <swhiteho@redhat.com>	2011-02-16 17:48:59 +01:00
Steven Rostedt	48228f7b47	lockdep/timers: Explain in detail the locking problems del_timer_sync() may cause Twice I had to explain the output about why lockdep gives an error with locks in IRQ context and with del_timer_sync(). Might as well write it up and place it in the comments above the code in del_timer_sync(). Perhaps the next time this lockdep dump triggers people will understand the issues. It is a ticky issue and very subtle, explaining it in detail in the code may help others understand the issue when they stumble upon the bug again. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1297186794.23343.19.camel@gandalf.stny.rr.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:35:08 +01:00
Ingo Molnar	a3ec4a603f	Merge commit 'v2.6.38-rc5' into core/locking Merge reason: pick up upstream fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:33:41 +01:00
Venkatesh Pallipadi	46e49b3836	sched: Wholesale removal of sd_idle logic sd_idle logic was introduced way back in 2005 (commit `5969fe06`), as an HT optimization. As per the discussion in the thread here: lkml - sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1 https://patchwork.kernel.org/patch/532501/ The capacity based logic in the load balancer right now handles this in a much cleaner way, handling more than 2 SMT siblings etc, and sd_idle does not seem to bring any additional benefits. sd_idle logic also has some bugs that has performance impact. Here is the patch that removes the sd_idle logic altogether. Also, there was a dependency of sched_mc_power_savings == 2, with sd_idle logic. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Acked-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1297723130-693-1-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:33:20 +01:00
Ingo Molnar	48fa4b8ecf	Merge commit 'v2.6.38-rc5' into sched/core Merge reason: Pick up upstream fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:31:55 +01:00
Peter Zijlstra	ba3dd36c67	perf: Optimize hrtimer events There is no need to re-initialize the hrtimer every time we start it, so don't do that (shaves a few cycles). Also, since we know hrtimers run at a fixed rate (nanoseconds) we can pre-compute the desired frequency at which they tick. This avoids us having to go through the whole adaptive frequency feedback logic (shaves another few cycles). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1297448589.5226.47.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:30:57 +01:00
Peter Zijlstra	163ec4354a	perf: Optimize throttling code By pre-computing the maximum number of samples per tick we can avoid a multiplication and a conditional since MAX_INTERRUPTS > max_samples_per_tick. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:30:55 +01:00
Stephane Eranian	e5d1367f17	perf: Add cgroup support This kernel patch adds the ability to filter monitoring based on container groups (cgroups). This is for use in per-cpu mode only. The cgroup to monitor is passed as a file descriptor in the pid argument to the syscall. The file descriptor must be opened to the cgroup name in the cgroup filesystem. For instance, if the cgroup name is foo and cgroupfs is mounted in /cgroup, then the file descriptor is opened to /cgroup/foo. Cgroup mode is activated by passing PERF_FLAG_PID_CGROUP in the flags argument to the syscall. For instance to measure in cgroup foo on CPU1 assuming cgroupfs is mounted under /cgroup: struct perf_event_attr attr; int cgroup_fd, fd; cgroup_fd = open("/cgroup/foo", O_RDONLY); fd = perf_event_open(&attr, cgroup_fd, 1, -1, PERF_FLAG_PID_CGROUP); close(cgroup_fd); Signed-off-by: Stephane Eranian <eranian@google.com> [ added perf_cgroup_{exit,attach} ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4d590250.114ddf0a.689e.4482@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:30:48 +01:00
Peter Zijlstra	d41d5a0163	cgroup: Fix cgroup_subsys::exit callback Make the ::exit method act like ::attach, it is after all very nearly the same thing. The bug had no effect on correctness - fixing it is an optimization for the scheduler. Also, later perf-cgroups patches rely on it. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Menage <menage@google.com> LKML-Reference: <1297160655.13327.92.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:30:47 +01:00
Ingo Molnar	b00560f2d4	Merge branch 'perf/urgent' into perf/core Merge reason: we need to queue up dependent patch Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:27:23 +01:00
Peter Zijlstra	4fe757dd48	perf: Fix throttle logic It was possible to call pmu::start() on an already running event. In particular this lead so some wreckage as the hrtimer events would re-initialize active timers. This was due to throttled events being activated again by scheduling. Scheduling in a context would add and force start events, resulting in running events with a possible throttle status. The next tick to hit that task will then try to unthrottle the event and call ->start() on an already running event. Reported-by: Jeff Moyer <jmoyer@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:25:29 +01:00
Linus Torvalds	a1213b091c	Merge branches 'core-fixes-for-linus' and 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: Revert "lockdep, timer: Fix del_timer_sync() annotation" * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: timer debug: Hide kernel addresses via %pK in /proc/timer_list	2011-02-15 10:19:18 -08:00
Linus Torvalds	1cecd791f2	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Fix text_poke_smp_batch() deadlock perf tools: Fix thread_map event synthesizing in top and record watchdog, nmi: Lower the severity of error messages ARM: oprofile: Fix backtraces in timer mode oprofile: Fix usage of CONFIG_HW_PERF_EVENTS for oprofile_perf_init and friends	2011-02-15 10:18:48 -08:00
Jiri Kosina	0a9d59a246	Merge branch 'master' into for-next	2011-02-15 10:24:31 +01:00
Chris Smith	d4f7e51323	sh: Enable CONFIG_GCOV_PROFILE_ALL for sh This patch enables gcov kernel profiling over the whole kernel for sh. Profiling of specific files individually already worked. A handful of files have to be explicitly excluded from the profiling to avoid breaking things, notably pmb.c. Signed-off-by: Chris Smith <chris.smith@st.com> Signed-off-by: Stuart Menefy <stuart.menefy@st.com> Signed-off-by: Paul Mundt <lethal@linux-sh.org>	2011-02-15 16:47:17 +09:00
Ingo Molnar	bf1af3a809	Merge branch 'tip/perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core	2011-02-14 15:15:16 +01:00
Tejun Heo	7576958a9d	workqueue: wake up a worker when a rescuer is leaving a gcwq After executing the matching works, a rescuer leaves the gcwq whether there are more pending works or not. This may decrease the concurrency level to zero and stall execution until a new work item is queued on the gcwq. Make rescuer wake up a regular worker when it leaves a gcwq if there are more works to execute, so that execution isn't stalled. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ray Jui <rjui@broadcom.com> Cc: stable@kernel.org	2011-02-14 14:04:46 +01:00
Masami Hiramatsu	0de4b34d46	tracing/kprobe: Fix NULL pointer deref check Add NULL check for avoiding NULL pointer deref. This bug has been introduced by: 1ff511e35ed8: tracing/kprobes: Add bitfield type which causes a null pointer dereference bug when kprobe-tracer parses an argument without type. Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: 2nddept-manager@sdl.hitachi.co.jp Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20110214054807.8919.69740.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu> Reported-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>	2011-02-14 12:09:53 +01:00
Kees Cook	f590308536	timer debug: Hide kernel addresses via %pK in /proc/timer_list In the continuing effort to avoid kernel addresses leaking to unprivileged users, this patch switches to %pK for /proc/timer_list reporting. Signed-off-by: Kees Cook <kees.cook@canonical.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Dan Rosenberg <drosenberg@vsecurity.com> Cc: Eugene Teo <eugeneteo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20110212032125.GA23571@outflux.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-12 14:11:56 +01:00
Linus Torvalds	f7909fb835	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: pci: use security_capable() when checking capablities during config space read security: add cred argument to security_capable() tpm_tis: Use timeouts returned from TPM	2011-02-11 16:16:03 -08:00
Tejun Heo	01e05e9a90	ptrace: use safer wake up on ptrace_detach() The wake_up_process() call in ptrace_detach() is spurious and not interlocked with the tracee state. IOW, the tracee could be running or sleeping in any place in the kernel by the time wake_up_process() is called. This can lead to the tracee waking up unexpectedly which can be dangerous. The wake_up is spurious and should be removed but for now reduce its toxicity by only waking up if the tracee is in TRACED or STOPPED state. This bug can possibly be used as an attack vector. I don't think it will take too much effort to come up with an attack which triggers oops somewhere. Most sleeps are wrapped in condition test loops and should be safe but we have quite a number of places where sleep and wakeup conditions are expected to be interlocked. Although the window of opportunity is tiny, ptrace can be used by non-privileged users and with some loading the window can definitely be extended and exploited. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-02-11 16:12:19 -08:00
Steven Rostedt	868baf07b1	ftrace: Fix memory leak with function graph and cpu hotplug When the fuction graph tracer starts, it needs to make a special stack for each task to save the real return values of the tasks. All running tasks have this stack created, as well as any new tasks. On CPU hot plug, the new idle task will allocate a stack as well when init_idle() is called. The problem is that cpu hotplug does not create a new idle_task. Instead it uses the idle task that existed when the cpu went down. ftrace_graph_init_task() will add a new ret_stack to the task that is given to it. Because a clone will make the task have a stack of its parent it does not check if the task's ret_stack is already NULL or not. When the CPU hotplug code starts a CPU up again, it will allocate a new stack even though one already existed for it. The solution is to treat the idle_task specially. In fact, the function_graph code already does, just not at init_idle(). Instead of using the ftrace_graph_init_task() for the idle task, which that function expects the task to be a clone, have a separate ftrace_graph_init_idle_task(). Also, we will create a per_cpu ret_stack that is used by the idle task. When we call ftrace_graph_init_idle_task() it will check if the idle task's ret_stack is NULL, if it is, then it will assign it the per_cpu ret_stack. Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stable Tree <stable@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-11 16:23:33 -05:00
Arnaldo Carvalho de Melo	7c940c18c5	Merge remote branch 'acme/perf/urgent' into perf/core Fixups due to rename of event_t routines from event__ to perf_event__ done in perf/core. Conflicts: tools/perf/builtin-record.c tools/perf/builtin-top.c tools/perf/util/event.c tools/perf/util/event.h Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2011-02-11 11:45:54 -02:00
Chris Wright	6037b715d6	security: add cred argument to security_capable() Expand security_capable() to include cred, so that it can be usable in a wider range of call sites. Signed-off-by: Chris Wright <chrisw@sous-sol.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: James Morris <jmorris@namei.org>	2011-02-11 17:41:58 +11:00
Linus Torvalds	ee24aebffb	cap_syslog: accept CAP_SYS_ADMIN for now In commit `ce6ada35bd` ("security: Define CAP_SYSLOG") Serge Hallyn introduced CAP_SYSLOG, but broke backwards compatibility by no longer accepting CAP_SYS_ADMIN as an override (it would cause a warning and then reject the operation). Re-instate CAP_SYS_ADMIN - but keeping the warning - as an acceptable capability until any legacy applications have been updated. There are apparently applications out there that drop all capabilities except for CAP_SYS_ADMIN in order to access the syslog. (This is a re-implementation of a patch by Serge, cleaning the logic up and making the code more readable) Acked-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: James Morris <jmorris@namei.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-02-10 17:53:55 -08:00
Thomas Gleixner	51327ada71	Merge branch 'irq/for-mips' into irq/core Reason: irq/for-mips is provided for mips to make core independent progress. Merge it into irq/core to avoid conflicts Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-11 00:27:03 +01:00
David Daney	986c011ddb	genirq: Call bus_lock/unlock functions in setup_irq() irq_chips that supply .irq_bus_lock/.irq_bus_sync_unlock functions, expect that the other chip methods will be called inside of calls to the pair. If this expectation is not met, things tend to not work. Make setup_irq() call chip_bus_lock()/chip_bus_sync_unlock() too. For the vast majority of irq_chips, this will be a NOP as most don't have these bus lock functions. [ tglx: No we don't want to call that in __setup_irq(). Way too many error exit pathes. ] Signed-off-by: David Daney <ddaney@caviumnetworks.com> LKML-Reference: <1297296265-18680-1-git-send-email-ddaney@caviumnetworks.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-11 00:25:13 +01:00
Don Zickus	5651f7f47d	watchdog, nmi: Lower the severity of error messages During boot if the hardlockup detector fails to initialize, it complains very loudly. Some failures should be expected under certain situations, ie no lapics, or resource in-use. Tone those error messages down a bit. Keep the rest at a high level. Reported-by: Paul Bolle <pebolle@tiscali.nl> Tested-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <1297278153-21111-1-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-10 13:21:59 +01:00
Linus Torvalds	aceb91cd35	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: cdrom: support devices that have check_events but not media_changed cfq-iosched: Don't wait if queue already has requests. blkio-throttle: Avoid calling blkiocg_lookup_group() for root group cfq: rename a function to give it more appropriate name cciss: make cciss_revalidate not loop through CISS_MAX_LUNS volumes unnecessarily. drivers/block/aoe/Makefile: replace the use of <module>-objs with <module>-y loop: queue_lock NULL pointer derefence in blk_throtl_exit drivers/block/Makefile: replace the use of <module>-objs with <module>-y blktrace: Don't output messages if NOTIFY isn't set.	2011-02-09 11:45:21 -08:00
Tejun Heo	4149efb22d	workqueue: add system_freezeable_wq Add system wide freezeable workqueue. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl>	2011-02-09 09:37:49 +01:00
Steven Rostedt	6752ab4a9c	tracing: Deprecate tracing_enabled for tracing_on tracing_enabled should not be used, it is heavy weight and does not do much in helping lower the overhead. tracing_on should be used instead. Warn users to use tracing_on when tracing_enabled is used as it will soon be removed from the tracing directory. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-08 17:14:58 -05:00
Steven Rostedt	87d80de280	tracing: Remove obsolete sched_switch tracer The trace events sched_switch and sched_wakeup do the same thing as the stand alone sched_switch tracer does. It is no longer needed. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-08 17:14:56 -05:00
Thomas Gleixner	44951a60ff	genirq: Remove dead code CONFIG_KSTAT_IRQS_ONDEMAND does not exist. It's not worth to implement it. Use sparse irqs if you care about memory consumption of the interrupt layer. Found by undertaker: http://vamos.informatik.uni-erlangen.de/trac/undertaker Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-08 19:37:13 +01:00
Thomas Gleixner	c305d524e5	softirq: Avoid stack switch from ksoftirqd ksoftirqd() calls do_softirq() which switches stacks on several architectures. That makes no sense at all. ksoftirqd's stack is sufficient. Call __do_softirq() directly. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: David Miller <davem@davemloft.net> Cc: Paul Mundt <lethal@linux-sh.org> Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> LKML-Reference: <alpine.LFD.2.00.1102021704530.31804@localhost6.localdomain6>	2011-02-08 19:37:12 +01:00
Jiri Olsa	5e38ca8f3e	tracing: Add unstable sched clock note to the warning The warning "Delta way too big" warning might appear on a system with unstable shed clock right after the system is resumed and tracing was enabled during the suspend. Since it's not realy bug, and the unstable sched clock is working fast and reliable otherwise, Steven suggested to keep using the sched clock in any case and just to make note in the warning itself. Signed-off-by: Jiri Olsa <jolsa@redhat.com> LKML-Reference: <1296649698-6003-1-git-send-email-jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-08 11:57:22 -05:00
Thomas Gleixner	c9a443cdf7	Merge branch 'irq/for-xen' into irq/core irq/for-xen contains new functionality to avoid Xen private irq hackery. That branch has a single irq commit and is pulled by Xen to base their new features on. Merge it into irq/core as other patches modify the same code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-08 16:38:21 +01:00
Thomas Gleixner	dc5f219e88	genirq: Add IRQF_FORCE_RESUME Xen needs to reenable interrupts which are marked IRQF_NO_SUSPEND in the resume path. Add a flag to force the reenabling in the resume code. Tested-and-acked-by: Ian Campbell <Ian.Campbell@eu.citrix.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-08 16:36:47 +01:00
Peter Zijlstra	7ff207928e	Revert "lockdep, timer: Fix del_timer_sync() annotation" Both attempts at trying to allow softirq usage for del_timer_sync() failed (produced bogus warnings), so revert the commit for this release: f266a5110d45: lockdep, timer: Fix del_timer_sync() annotation and try again later. Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Yong Zhang <yong.zhang0@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <1297174680.13327.107.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-08 16:18:39 +01:00
Ian Munsie	ae07f551c4	tracing/syscalls: Early terminate search for sys_ni_syscall Many system calls are unimplemented and mapped to sys_ni_syscall, but at boot ftrace would still search through every syscall metadata entry for a match which wouldn't be there. This patch adds causes the search to terminate early if the system call is not mapped. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> LKML-Reference: <1296703645-18718-7-git-send-email-imunsie@au1.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 21:30:14 -05:00
Ian Munsie	b2d5549681	tracing/syscalls: Allow arch specific syscall symbol matching Some architectures have unusual symbol names and the generic code to match the symbol name with the function name for the syscall metadata will fail. For example, symbols on PPC64 start with a period and the generic code will fail to match them. This patch moves the match logic out into a separate function which an arch can override by defining ARCH_HAS_SYSCALL_MATCH_SYM_NAME in asm/ftrace.h and implementing arch_syscall_match_sym_name. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> LKML-Reference: <1296703645-18718-5-git-send-email-imunsie@au1.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 21:29:03 -05:00
Ian Munsie	c763ba06bd	tracing/syscalls: Make arch_syscall_addr weak Some architectures use non-trivial system call tables and will not work with the generic arch_syscall_addr code. For example, PowerPC64 uses a table of twin long longs. This patch makes the generic arch_syscall_addr weak to allow architectures with non-trivial system call tables to override it. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> LKML-Reference: <1296703645-18718-4-git-send-email-imunsie@au1.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 21:26:22 -05:00
Ian Munsie	3773b389b6	tracing/syscalls: Convert redundant syscall_nr checks into WARN_ON With the ftrace events now checking if the syscall_nr is valid upon initialisation it should no longer be possible to register or unregister a syscall event without a valid syscall_nr since they should not be created. This adds a WARN_ON_ONCE in the register and unregister functions to locate potential regressions in the future. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> LKML-Reference: <1296703645-18718-3-git-send-email-imunsie@au1.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 21:25:52 -05:00
Ian Munsie	ba976970c7	tracing/syscalls: Don't add events for unmapped syscalls FTRACE_SYSCALLS would create events for each and every system call, even if it had failed to map the system call's name with it's number. This resulted in a number of events being created that would not behave as expected. This could happen, for example, on architectures who's symbol names are unusual and will not match the system call name. It could also happen with system calls which were mapped to sys_ni_syscall. This patch changes the default system call number in the metadata to -1. If the system call name from the metadata is not successfully mapped to a system call number during boot, than the event initialisation routine will now return an error, preventing the event from being created. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> LKML-Reference: <1296703645-18718-2-git-send-email-imunsie@au1.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 21:24:44 -05:00
Steven Rostedt	4defe682d8	tracing/filter: Remove synchronize_sched() from __alloc_preds() Because the filters are processed first and then activated (added to the call), we no longer need to worry about the preds of the filter in __alloc_preds() being used. As the filter that is allocating preds is not activated yet. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:20 -05:00
Steven Rostedt	75b8e98263	tracing/filter: Swap entire filter of events When creating a new filter, instead of allocating the filter to the event call first and then processing the filter, it is easier to process a temporary filter and then just swap it with the call filter. By doing this, it simplifies the code. A filter is allocated and processed, when it is done, it is swapped with the call filter, synchronize_sched() is called to make sure all callers are done with the old filter (filters are called with premption disabled), and then the old filter is freed. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:20 -05:00
Steven Rostedt	bf93f9ed3a	tracing/filter: Increase the max preds to 2^14 Now that the filter logic does not require to save the pred results on the stack, we can increase the max number of preds we allow. As the preds are index by a short value, and we use the MSBs as flags we can increase the max preds to 2^14 (16384) which should be way more than enough. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:20 -05:00
Steven Rostedt	4a3d27e98a	tracing/filter: Move MAX_FILTER_PRED to local tracing directory The MAX_FILTER_PRED is only needed by the kernel/trace/*.c files. Move it to kernel/trace/trace.h. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:20 -05:00
Steven Rostedt	43cd414552	tracing/filter: Optimize filter by folding the tree There are many cases that a filter will contain multiple ORs or ANDs together near the leafs. Walking up and down the tree to get to the next compare can be a waste. If there are several ORs or ANDs together, fold them into a single pred and allocate an array of the conditions that they check. This will speed up the filter by linearly walking an array and can still break out if a short circuit condition is met. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:19 -05:00
Steven Rostedt	ec126cac23	tracing/filter: Check the created pred tree Since the filter walks a tree to determine if a match is made or not, if the tree was incorrectly created, it could cause an infinite loop. Add a check to walk the entire tree before assigning it as a filter to make sure the tree is correct. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:19 -05:00
Steven Rostedt	5571927418	tracing/filter: Optimize short ciruit check The test if we should break out early for OR and AND operations can be optimized by comparing the current result with (pred->op == OP_OR) That is if the result is true and the op is an OP_OR, or if the result is false and the op is not an OP_OR (thus an OP_AND) we can break out early in either case. Otherwise we continue processing. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:19 -05:00
Steven Rostedt	61e9dea20e	tracing/filter: Use a tree instead of stack for filter_match_preds() Currently the filter_match_preds() requires a stack to push and pop the preds to determine if the filter matches the record or not. This has two drawbacks: 1) It requires a stack to store state information. As this is done in fast paths we can't allocate the storage for this stack, and we can't use a global as it must be re-entrant. The stack is stored on the kernel stack and this greatly limits how many preds we may allow. 2) All conditions are calculated even when a short circuit exists. a \|\| b will always calculate a and b even though a was determined to be true. Using a tree we can walk a constant structure that will save the state as we go. The algorithm is simply: pred = root; do { switch (move) { case MOVE_DOWN: if (OR or AND) { pred = left; continue; } if (pred == root) break; match = pred->fn(); pred = pred->parent; move = left child ? MOVE_UP_FROM_LEFT : MOVE_UP_FROM_RIGHT; continue; case MOVE_UP_FROM_LEFT: /* Only OR or AND can be a parent / if (match && OR \|\| !match && AND) { / short circuit */ if (pred == root) break; pred = pred->parent; move = left child ? MOVE_UP_FROM_LEFT : MOVE_UP_FROM_RIGHT; continue; } pred = pred->right; move = MOVE_DOWN; continue; case MOVE_UP_FROM_RIGHT: if (pred == root) break; pred = pred->parent; move = left child ? MOVE_UP_FROM_LEFT : MOVE_UP_FROM_RIGHT; continue; } done = 1; } while (!done); This way there's no strict limit to how many preds we allow and it also will short circuit the logical operations when possible. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:19 -05:00
Steven Rostedt	f76690afd0	tracing/filter: Free pred array on disabling of filter When a filter is disabled, free the preds. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:18 -05:00
Steven Rostedt	74e9e58c35	tracing/filter: Allocate the preds in an array Currently we allocate an array of pointers to filter_preds, and then allocate a separate filter_pred for each item in the array. This adds slight overhead in the filters as it needs to derefernce twice to get to the op condition. Allocating the preds themselves in a single array removes a dereference as well as helps on the cache footprint. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:18 -05:00
Steven Rostedt	0fc3ca9a10	tracing/filter: Call synchronize_sched() just once for system filters By separating out the reseting of the filter->n_preds to zero from the reallocation of preds for the filter, we can reset groups of filters first, call synchronize_sched() just once, and then reallocate each of the filters in the system group. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:18 -05:00
Steven Rostedt	c9c53ca03d	tracing/filter: Dynamically allocate preds For every filter that is made, we create predicates to hold every operation within the filter. We have a max of 32 predicates that we can hold. Currently, we allocate all 32 even if we only need to use one. Part of the reason we do this is that the filter can be used at any moment by any event. Fortunately, the filter is only used with preemption disabled. By reseting the count of preds used "n_preds" to zero, then performing a synchronize_sched(), we can safely free and reallocate a new array of preds. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:18 -05:00
Steven Rostedt	58d9a597c4	tracing/filter: Move OR and AND logic out of fn() method The ops OR and AND act different from the other ops, as they are the only ones to take other ops as their arguements. These ops als change the logic of the filter_match_preds. By removing the OR and AND fn's we can also remove the val1 and val2 that is passed to all other fn's and are unused. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:18 -05:00
Steven Rostedt	6d54057d76	tracing/filter: Have no filter return a match The n_preds field of a file can change at anytime, and even can become zero, just as the filter is about to be processed by an event. In the case that is zero on entering the filter, return 1, telling the caller the event matchs and should be trace. Also use a variable and assign it with ACCESS_ONCE() such that the count stays consistent within the function. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:17 -05:00
Tetsuo Handa	fb2b2a1d37	CRED: Fix memory and refcount leaks upon security_prepare_creds() failure In prepare_kernel_cred() since 2.6.29, put_cred(new) is called without assigning new->usage when security_prepare_creds() returned an error. As a result, memory for new and refcount for new->{user,group_info,tgcred} are leaked because put_cred(new) won't call __put_cred() unless old->usage == 1. Fix these leaks by assigning new->usage (and new->subscribers which was added in 2.6.32) before calling security_prepare_creds(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-02-07 14:04:00 -08:00
Tetsuo Handa	2edeaa34a6	CRED: Fix BUG() upon security_cred_alloc_blank() failure In cred_alloc_blank() since 2.6.32, abort_creds(new) is called with new->security == NULL and new->magic == 0 when security_cred_alloc_blank() returns an error. As a result, BUG() will be triggered if SELinux is enabled or CONFIG_DEBUG_CREDENTIALS=y. If CONFIG_DEBUG_CREDENTIALS=y, BUG() is called from __invalid_creds() because cred->magic == 0. Failing that, BUG() is called from selinux_cred_free() because selinux_cred_free() is not expecting cred->security == NULL. This does not affect smack_cred_free(), tomoyo_cred_free() or apparmor_cred_free(). Fix these bugs by (1) Set new->magic before calling security_cred_alloc_blank(). (2) Handle null cred->security in creds_are_invalid() and selinux_cred_free(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-02-07 14:04:00 -08:00
Masami Hiramatsu	1ff511e35e	tracing/kprobes: Add bitfield type Add bitfield type for tracing arguments on kprobe-tracer. The syntax of a bitfield type is: b<bit-size>@<bit-offset>/<container-size> e.g. Accessing 2 bits-width field with 4 bits-offset in 32 bits-width data at 4 bytes offseted from the address pointed by AX register: +4(%ax):b2@4/32 Since the width of container data depends on the arch, so I just added the container-size at the end. Cc: 2nddept-manager@sdl.hitachi.co.jp Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20110204125205.9507.11363.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2011-02-07 09:18:11 -02:00
Masami Hiramatsu	e374536998	tracing/kprobes: Support longer (>128 bytes) command Expand command line buffer of kprobe-tracer to 4096 bytes. Reported-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: 2nddept-manager@sdl.hitachi.co.jp Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20110204125159.9507.20895.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2011-02-07 09:15:59 -02:00
Masami Hiramatsu	76022db323	tracing/kprobes: Cleanup strict_strtol() using code Since strict_strtol() accepts minus digits started with '-', it doesn't need to invert after converting. Cc: 2nddept-manager@sdl.hitachi.co.jp Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20110204125153.9507.49335.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2011-02-07 09:13:36 -02:00
Ingo Molnar	c7f9a6f377	Merge branch 'linus' into perf/core Merge reason: Pick up perf fixes that are now upstream Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-07 08:44:26 +01:00
Linus Torvalds	f0adc82064	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep, timer: Fix del_timer_sync() annotation RTC: Prevents a division by zero in kernel code.	2011-02-06 12:05:15 -08:00
Thomas Gleixner	285c1a2c3a	Merge branch 'irq/urgent' into irq/core Reason: Get mainline fixes integrated. Further patches conflict with them Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-05 21:50:05 +01:00
Ingo Molnar	862b6f62bf	Merge branch 'tip/perf/urgent-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/urgent	2011-02-04 19:02:53 +01:00
Peter Zijlstra	f266a5110d	lockdep, timer: Fix del_timer_sync() annotation Calling local_bh_enable() will want to actually start processing softirqs, which isn't a good idea since this can get called with IRQs disabled. Cure this by using _local_bh_enable() which doesn't start processing softirqs, and use raw_local_irq_save() to avoid any softirqs from happening without letting lockdep think IRQs are in fact disabled. Reported-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> LKML-Reference: <20110203141548.039540914@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-04 10:31:22 +01:00
Linus Torvalds	aba99437f5	Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Prevent irq storm on migration	2011-02-03 09:17:41 -08:00
Linus Torvalds	49abda9892	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix update_curr_rt() sched, docs: Update schedstats documentation to version 15	2011-02-03 08:55:07 -08:00
Linus Torvalds	eb487ab4d5	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Fix reading in perf_event_read() watchdog: Don't change watchdog state on read of sysctl watchdog: Fix sysctl consistency watchdog: Fix broken nowatchdog logic perf: Fix Pentium4 raw event validation perf: Fix alloc_callchain_buffers()	2011-02-03 08:52:05 -08:00
Steven Rostedt	3d56e331b6	tracing: Replace syscall_meta_data struct array with pointer array Currently the syscall_meta structures for the syscall tracepoints are placed in the __syscall_metadata section, and at link time, the linker makes one large array of all these syscall metadata structures. On boot up, this array is read (much like the initcall sections) and the syscall data is processed. The problem is that there is no guarantee that gcc will place complex structures nicely together in an array format. Two structures in the same file may be placed awkwardly, because gcc has no clue that they are suppose to be in an array. A hack was used previous to force the alignment to 4, to pack the structures together. But this caused alignment issues with other architectures (sparc). Instead of packing the structures into an array, the structures' addresses are now put into the __syscall_metadata section. As pointers are always the natural alignment, gcc should always pack them tightly together (otherwise initcall, extable, etc would also fail). By having the pointers to the structures in the section, we can still iterate the trace_events without causing unnecessary alignment problems with other architectures, or depending on the current behaviour of gcc that will likely change in the future just to tick us kernel developers off a little more. The __syscall_metadata section is also moved into the .init.data section as it is now only needed at boot up. Suggested-by: David Miller <davem@davemloft.net> Acked-by: David S. Miller <davem@davemloft.net> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-03 09:29:06 -05:00
Mathieu Desnoyers	6549864629	tracepoints: Fix section alignment using pointer array Make the tracepoints more robust, making them solid enough to handle compiler changes by not relying on anything based on compiler-specific behavior with respect to structure alignment. Implement an approach proposed by David Miller: use an array of const pointers to refer to the individual structures, and export this pointer array through the linker script rather than the structures per se. It will consume 32 extra bytes per tracepoint (24 for structure padding and 8 for the pointers), but are less likely to break due to compiler changes. History: commit `7e066fb8` tracepoints: add DECLARE_TRACE() and DEFINE_TRACE() added the aligned(32) type and variable attribute to the tracepoint structures to deal with gcc happily aligning statically defined structures on 32-byte multiples. One attempt was to use a 8-byte alignment for tracepoint structures by applying both the variable and type attribute to tracepoint structures definitions and declarations. It worked fine with gcc 4.5.1, but broke with gcc 4.4.4 and 4.4.5. The reason is that the "aligned" attribute only specify the _minimum_ alignment for a structure, leaving both the compiler and the linker free to align on larger multiples. Because tracepoint.c expects the structures to be placed as an array within each section, up-alignment cause NULL-pointer exceptions due to the extra unexpected padding. (this patch applies on top of -tip) Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: David S. Miller <davem@davemloft.net> LKML-Reference: <20110126222622.GA10794@Krystal> CC: Frederic Weisbecker <fweisbec@gmail.com> CC: Ingo Molnar <mingo@elte.hu> CC: Thomas Gleixner <tglx@linutronix.de> CC: Andrew Morton <akpm@linux-foundation.org> CC: Peter Zijlstra <peterz@infradead.org> CC: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-03 09:28:46 -05:00
Mike Galbraith	d95f412200	sched: Add yield_to(task, preempt) functionality Currently only implemented for fair class tasks. Add a yield_to_task method() to the fair scheduling class. allowing the caller of yield_to() to accelerate another thread in it's thread group, task group. Implemented via a scheduler hint, using cfs_rq->next to encourage the target being selected. We can rely on pick_next_entity to keep things fair, so noone can accelerate a thread that has already used its fair share of CPU time. This also means callers should only call yield_to when they really mean it. Calling it too often can result in the scheduler just ignoring the hint. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110201095051.4ddb7738@annuminas.surriel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-03 14:20:33 +01:00
Rik van Riel	ac53db596c	sched: Use a buddy to implement yield_task_fair() Use the buddy mechanism to implement yield_task_fair. This allows us to skip onto the next highest priority se at every level in the CFS tree, unless doing so would introduce gross unfairness in CPU time distribution. We order the buddy selection in pick_next_entity to check yield first, then last, then next. We need next to be able to override yield, because it is possible for the "next" and "yield" task to be different processen in the same sub-tree of the CFS tree. When they are, we need to go into that sub-tree regardless of the "yield" hint, and pick the correct entity once we get to the right level. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110201095103.3a79e92a@annuminas.surriel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-03 14:20:33 +01:00
Rik van Riel	2c13c919d9	sched: Limit the scope of clear_buddies The clear_buddies function does not seem to play well with the concept of hierarchical runqueues. In the following tree, task groups are represented by 'G', tasks by 'T', next by 'n' and last by 'l'. (nl) / \ G(nl) G / \ \ T(l) T(n) T This situation can arise when a task is woken up T(n), and the previously running task T(l) is marked last. When clear_buddies is called from either T(l) or T(n), the next and last buddies of the group G(nl) will be cleared. This is not the desired result, since we would like to be able to find the other type of buddy in many cases. This especially a worry when implementing yield_task_fair through the buddy system. The fix is simple: only clear the buddy type that the task itself is indicated to be. As an added bonus, we stop walking up the tree when the buddy has already been cleared or pointed elsewhere. Signed-off-by: Rik van Riel <riel@redhat.coM> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110201094837.6b0962a9@annuminas.surriel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-03 14:20:32 +01:00
Rik van Riel	725e7580aa	sched: Check the right ->nr_running in yield_task_fair() With CONFIG_FAIR_GROUP_SCHED, each task_group has its own cfs_rq. Yielding to a task from another cfs_rq may be worthwhile, since a process calling yield typically cannot use the CPU right now. Therefor, we want to check the per-cpu nr_running, not the cgroup local one. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110201094715.798c4f86@annuminas.surriel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-03 14:20:32 +01:00
Peter Zijlstra	06c3bc6556	sched: Fix update_curr_rt() cpu_stopper_thread() migration_cpu_stop() __migrate_task() deactivate_task() dequeue_task() dequeue_task_rq() update_curr_rt() Will call update_curr_rt() on rq->curr, which at that time is rq->stop. The problem is that rq->stop.prio matches an RT prio and thus falsely assumes its a rt_sched_class task. Reported-Debuged-Tested-Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Cc: stable@kernel.org # .37 Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-03 12:21:33 +01:00
Peter Zijlstra	542e72fc90	perf: Fix reading in perf_event_read() It is quite possible for the event to have been disabled between perf_event_read() sending the IPI and the CPU servicing the IPI and calling __perf_event_read(), hence revalidate the state. Reported-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-03 12:15:46 +01:00
Peter Zijlstra	fe4b04fa31	perf: Cure task_oncpu_function_call() races Oleg reported that on architectures with __ARCH_WANT_INTERRUPTS_ON_CTXSW the IPI from task_oncpu_function_call() can land before perf_event_task_sched_in() and cause interesting situations for eg. perf_install_in_context(). This patch reworks the task_oncpu_function_call() interface to give a more usable primitive as well as rework all its users to hopefully be more obvious as well as remove the races. While looking at the code I also found a number of races against perf_event_task_sched_out() which can flip contexts between tasks so plug those too. Reported-and-reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-03 12:14:43 +01:00
Steven Rostedt	e4a9ea5ee7	tracing: Replace trace_event struct array with pointer array Currently the trace_event structures are placed in the _ftrace_events section, and at link time, the linker makes one large array of all the trace_event structures. On boot up, this array is read (much like the initcall sections) and the events are processed. The problem is that there is no guarantee that gcc will place complex structures nicely together in an array format. Two structures in the same file may be placed awkwardly, because gcc has no clue that they are suppose to be in an array. A hack was used previous to force the alignment to 4, to pack the structures together. But this caused alignment issues with other architectures (sparc). Instead of packing the structures into an array, the structures' addresses are now put into the _ftrace_event section. As pointers are always the natural alignment, gcc should always pack them tightly together (otherwise initcall, extable, etc would also fail). By having the pointers to the structures in the section, we can still iterate the trace_events without causing unnecessary alignment problems with other architectures, or depending on the current behaviour of gcc that will likely change in the future just to tick us kernel developers off a little more. The _ftrace_event section is also moved into the .init.data section as it is now only needed at boot up. Suggested-by: David Miller <davem@davemloft.net> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-02 21:37:13 -05:00
Thomas Gleixner	f1a06390d0	genirq: Prevent irq storm on migration move_native_irq() masks and unmasks the interrupt line unconditionally, but the interrupt line might be masked due to a threaded oneshot handler in progress. Unmasking the line in that case can lead to interrupt storms. Observed on PREEMPT_RT. Originally-from: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org	2011-02-02 22:15:08 +01:00
Richard Cochran	0606f422b4	posix clocks: Introduce dynamic clocks This patch adds support for adding and removing posix clocks. The clock lifetime cycle is patterned after usb devices. Each clock is represented by a standard character device. In addition, the driver may optionally implement custom character device operations. The posix clock and timer system calls listed below now work with dynamic posix clocks, as well as the traditional static clocks. The following system calls are affected: - clock_adjtime (brand new syscall) - clock_gettime - clock_getres - clock_settime - timer_create - timer_delete - timer_gettime - timer_settime [ tglx: Adapted to the posix-timer cleanup. Moved clock_posix_dynamic to posix-clock.c and made all referenced functions static ] Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Acked-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20110201134420.164172635@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-02 15:28:20 +01:00
Thomas Gleixner	527087374f	posix-timers: Cleanup namespace Rename register_posix_clock() to posix_timers_register_clock(). That's what the function really does. As a side effect this cleans up the posix_clock namespace for the upcoming dynamic posix_clock infrastructure. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Richard Cochran <richard.cochran@omicron.at> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <alpine.LFD.2.00.1102021222240.31804@localhost6.localdomain6>	2011-02-02 15:28:19 +01:00
Richard Cochran	81e294cba2	posix-timers: Add support for fd based clocks Extend the negative clockids which are currently used by posix cpu timers to encode the PID with a file descriptor based type which encodes the fd in the upper bits. Originally-from: Richard Cochran <richard.cochran@omicron.at> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20110201134420.062860200@linutronix.de>	2011-02-02 15:28:19 +01:00
Richard Cochran	f1f1d5ebd1	posix-timers: Introduce a syscall for clock tuning. A new syscall is introduced that allows tuning of a POSIX clock. The new call, clock_adjtime, takes two parameters, the clock ID and a pointer to a struct timex. Any ADJTIMEX(2) operation may be requested via this system call, but various POSIX clocks may or may not support tuning. [ tglx: Adapted to the posix-timer cleanup series. Avoid copy_to_user in the error case ] Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Acked-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20110201134419.869804645@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-02 15:28:19 +01:00
Richard Cochran	65f5d80bdf	time: Splitout compat timex accessors Split out the compat timex accessors into separate functions. Preparatory patch for a new syscall. [ tglx: Split that patch from Richards "posix-timers: Introduce a syscall for clock tuning.". Keeps the changes strictly separate ] Originally-from: Richard Cochran <richardcochran@gmail.com> Acked-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <20110201134419.772343089@linutronix.de>	2011-02-02 15:28:18 +01:00
Richard Cochran	094aa1881f	ntp: Add ADJ_SETOFFSET mode bit This patch adds a new mode bit into the timex structure. When set, the bit instructs the kernel to add the given time value to the current time. Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Acked-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20110201134320.688829863@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-02 15:28:18 +01:00
John Stultz	c528f7c6c2	time: Introduce timekeeping_inject_offset This adds a kernel-internal timekeeping interface to add or subtract a fixed amount from CLOCK_REALTIME. This makes it so kernel users or interfaces trying to do so do not have to read the time, then add an offset and then call settimeofday(), which adds some extra error in comparision to just simply adding the offset in the kernel timekeeping core. Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134419.584311693@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-02 15:28:18 +01:00
Richard Cochran	0061748dd2	posix-timer: Update comment Pick the cleanup to the comment in posix-timers.c from Richards all in one conversion patch. Originally-from: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20110201134419.487708516@linutronix.de>	2011-02-02 15:28:18 +01:00
Thomas Gleixner	bc2c8ea483	posix-timers: Make posix-cpu-timers functions static All functions are accessed via clock_posix_cpu now. So make them static. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134419.389755466@linutronix.de>	2011-02-02 15:28:17 +01:00
Thomas Gleixner	0aa3975f02	posix-timers: Remove CLOCK_DISPATCH leftovers All users gone. Remove the cruft. Huge thanks to Richard Cochran who tackled that maze first. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134419.294620613@linutronix.de>	2011-02-02 15:28:17 +01:00
Thomas Gleixner	6761c6702e	posix-timers: Convert timer_delete() to clockid_to_kclock() Set the common function for CLOCK_MONOTONIC and CLOCK_REALTIME kclocks and use the new decoding function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134419.198999420@linutronix.de>	2011-02-02 15:28:17 +01:00
Thomas Gleixner	a7319fa253	posix-timers: Convert timer_gettime() to clockid_to_kclock() Set the common function for CLOCK_MONOTONIC and CLOCK_REALTIME kclocks and use the new decoding function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134419.101243181@linutronix.de>	2011-02-02 15:28:16 +01:00
Thomas Gleixner	27722df16e	posix-timers: Convert timer_settime() to clockid_to_kclock() Set the common function for CLOCK_MONOTONIC and CLOCK_REALTIME kclocks and use the new decoding function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134419.001863714@linutronix.de>	2011-02-02 15:28:16 +01:00
Thomas Gleixner	838394fbf9	posix-timers: Convert timer_create() to clockid_to_kclock() Setup timer_create for CLOCK_MONOTONIC and CLOCK_REALTIME kclocks and remove the no_timer_create() implementation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134418.903604289@linutronix.de>	2011-02-02 15:28:15 +01:00
Thomas Gleixner	ebaac757ac	posix-timers: Remove useless res field from k_clock The res member of kclock is only used by mmtimer.c, but even there it contains redundant information. Remove the field and fixup mmtimer. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134418.808714587@linutronix.de>	2011-02-02 15:28:15 +01:00
Thomas Gleixner	e5e542eea9	posix-timers: Convert clock_getres() to clockid_to_kclock() Use the new kclock decoding. Fixup the fallout in mmtimer.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134418.709802797@linutronix.de>	2011-02-02 15:28:15 +01:00
Thomas Gleixner	4359ac0ace	posix-timers: Make clock_getres and clock_get mandatory Richard said: "I would think that we can require k_clocks to provide the read function. This could be checked and enforced in register_posix_clock()." Add checks for clock_getres and clock_get in the register function. Suggested-by: Richard Cochran <richardcochran@gmail.com> Cc: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-02 15:28:15 +01:00
Thomas Gleixner	4228577763	posix-timers: Convert clock_gettime() to clockid_to_kclock() Use the new kclock decoding mechanism and rename the misnomed common_clock_get() to posix_clock_realtime_get(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134418.611097203@linutronix.de>	2011-02-02 15:28:14 +01:00
Thomas Gleixner	26f9a4796a	posix-timers: Convert clock_settime to clockid_to_kclock() Use the new kclock decoding function in clock_settime and cleanup all kclocks which use the default functions. Rename the misnomed common_clock_set() to posix_clock_realtime_set(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134418.518851246@linutronix.de>	2011-02-02 15:28:14 +01:00
Thomas Gleixner	79c9da0d05	posix-cpu-timers: Remove the stub nanosleep functions CLOCK_THREAD_CPUTIME_ID implements stub functions for nanosleep and nanosleep_restart, which return -EINVAL. That return value is wrong. The correct return value is -ENOTSUP. Remove the stubs and let the new dispatch code return the correct error code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134418.422446502@linutronix.de>	2011-02-02 15:28:14 +01:00
Thomas Gleixner	3751f9f29b	posix-timers: Cleanup restart_block usage posix timers still use the legacy arg0-arg3 members of restart_block. Use restart_block.nanosleep instead Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134418.232288779@linutronix.de>	2011-02-02 15:28:13 +01:00
Thomas Gleixner	59bd5bc24a	posix-timers: Convert clock_nanosleep_restart to clockid_to_kclock() Use the new kclock decoding function in clock_nanosleep_restart. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134418.131263211@linutronix.de>	2011-02-02 15:28:13 +01:00
Thomas Gleixner	a5cd288010	posix-timers: Convert clock_nanosleep to clockid_to_kclock() Use the new kclock decoding function in clock_nanosleep and cleanup all kclocks which use the default functions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134418.034175556@linutronix.de>	2011-02-02 15:28:13 +01:00
Thomas Gleixner	cc785ac22b	posix-timers: Introduce clockid_to_kclock() New function to find the kclock for a given clockid. Returns a pointer to clock_posix_cpu if clockid < 0. If clockid >= MAXCLOCK or if the clock_getres pointer is not set it returns NULL. For valid clocks it returns a pointer to the matching posix_clock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <johnstul@us.ibm.com> Acked-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134417.938447839@linutronix.de>	2011-02-02 15:28:12 +01:00
Thomas Gleixner	1976945eea	posix-timers: Introduce clock_posix_cpu The CLOCK_DISPATCH() macro is a horrible magic. We call common functions if a function pointer is not set. That's just backwards. To support dynamic file decriptor based clocks we need to cleanup that dispatch logic. Create a k_clock struct clock_posix_cpu which has all the posix-cpu-timer functions filled in. After the cleanup the functions can be made static. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134417.841974553@linutronix.de>	2011-02-02 15:28:12 +01:00
Thomas Gleixner	2fd1f04089	posix-timers: Cleanup struct initializers Cosmetic. No functional change Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134417.745627057@linutronix.de>	2011-02-02 15:28:12 +01:00
Thomas Gleixner	65da528d7c	posix-timers: Define nanosleep not supported error separate Define the conditional nanosleep not supported error value outside of do_posix_clock_nonanosleep(). Preparatory patch for further cleanups. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134417.643486574@linutronix.de>	2011-02-02 15:28:11 +01:00
Richard Cochran	1e6d767924	time: Correct the settime parameters Both settimeofday() and clock_settime() promise with a 'const' attribute not to alter the arguments passed in. This patch adds the missing 'const' attribute into the various kernel functions implementing these calls. Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Acked-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20110201134417.545698637@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-02 15:28:11 +01:00
Lucian Adrian Grijincu	4916ca401e	security: remove unused security_sysctl hook The only user for this hook was selinux. sysctl routes every call through /proc/sys/. Selinux and other security modules use the file system checks for sysctl too, so no need for this hook any more. Signed-off-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com> Signed-off-by: Eric Paris <eparis@redhat.com>	2011-02-01 11:54:02 -05:00
Thomas Gleixner	7cf37e87dd	time: Fix legacy arch fallout The xtime/dotimer cleanup broke architectures which do not implement clockevents. Time to send out another __do_IRQ threat. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Ingo Molnar <mingo@elte.hu> Cc: Torben Hohn <torbenh@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: johnstul@us.ibm.com Cc: yong.zhang0@gmail.com Cc: hch@infradead.org LKML-Reference: <20110127145905.23248.30458.stgit@localhost> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-01 09:46:47 +01:00
Torben Hohn	e2830b5c1b	time: Make do_timer() and xtime_lock local to kernel/time/ All callers of do_timer() are converted to xtime_update(). The only users of xtime_lock are in kernel/time/. Make both local to kernel/time/ and remove them from the global header files. [ tglx: Reuse tick-internal.h instead of creating another local header file. Massaged changelog ] Signed-off-by: Torben Hohn <torbenh@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: johnstul@us.ibm.com Cc: yong.zhang0@gmail.com Cc: hch@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-31 19:26:50 +01:00
Thomas Gleixner	51563cd53c	Merge branch 'tip/rtmutex' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into core/locking *git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace tip/rtmutex: rtmutex: Simplify PI algorithm and make highest prio task get lock	2011-01-31 15:09:14 +01:00
Torben Hohn	f0af911a9d	time: Provide xtime_update() xtime_update() takes xtime_lock write locked and calls do_timer(). Provided to replace the do_timer() calls in the architecture code. Signed-off-by: Torben Hohn <torbenh@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: johnstul@us.ibm.com Cc: yong.zhang0@gmail.com Cc: hch@infradead.org LKML-Reference: <20110127145910.23248.21379.stgit@localhost> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-31 14:55:43 +01:00
Thomas Gleixner	79ecaf0d15	time: Remove unused __get_wall_to_monotonic() No users left. Remove it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-31 14:55:43 +01:00
Torben Hohn	48cf76f710	time: Provide get_xtime_and_monotonic_offset() The hrtimer code accesses timekeeping variables under xtime_lock. Provide a sensible accessor function and use it. [ tglx: Removed the conditionals, unused variable, fixed codingstyle and massaged changelog ] Signed-off-by: Torben Hohn <torbenh@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: johnstul@us.ibm.com Cc: yong.zhang0@gmail.com Cc: hch@infradead.org LKML-Reference: <20110127145905.23248.30458.stgit@localhost> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-31 14:55:42 +01:00
Torben Hohn	fbad1ea941	time: Move get_jiffies_64 to kernel/time/jiffies.c Move the jiffies access functions to the jiffies clocksource code. [ tglx: Add missing include ] Signed-off-by: Torben Hohn <torbenh@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: johnstul@us.ibm.com Cc: yong.zhang0@gmail.com Cc: hch@infradead.org LKML-Reference: <20110127145900.23248.73352.stgit@localhost> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-31 14:55:42 +01:00
Torben Hohn	871cf1e5f2	time: Move do_timer() to kernel/time/timekeeping.c do_timer() is primary timekeeping related. calc_global_load() is called from do_timer() as well, but that's more for historical reasons. [ tglx: Fixed up the calc_global_load() reject andmassaged changelog ] Signed-off-by: Torben Hohn <torbenh@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: johnstul@us.ibm.com Cc: yong.zhang0@gmail.com Cc: hch@infradead.org LKML-Reference: <20110127145855.23248.56933.stgit@localhost> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-31 14:55:41 +01:00
Marcin Slusarz	9ffdc6c37d	watchdog: Don't change watchdog state on read of sysctl Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> [ add {}'s to fix a warning ] Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> LKML-Reference: <1296230433-6261-3-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-31 13:22:43 +01:00
Marcin Slusarz	397357666d	watchdog: Fix sysctl consistency If it was not possible to enable watchdog for any cpu, switch watchdog_enabled back to 0, because it's visible via kernel.watchdog sysctl. Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> LKML-Reference: <1296230433-6261-2-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-31 13:22:43 +01:00
Marcin Slusarz	4135038a58	watchdog: Fix broken nowatchdog logic Passing nowatchdog to kernel disables 2 things: creation of watchdog threads AND initialization of percpu watchdog_hrtimer. As hrtimers are initialized only at boot it's not possible to enable watchdog later - for me all watchdog threads started to eat 100% of CPU time, but they could just crash. Additionally, even if these threads would start properly, watchdog_disable_all_cpus was guarded by no_watchdog check, so you couldn't disable watchdog. To fix this, remove no_watchdog variable and use already existing watchdog_enabled variable. Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> [ removed another no_watchdog instance ] Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Stephane Eranian <eranian@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> LKML-Reference: <1296230433-6261-1-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-31 13:22:42 +01:00
Thomas Gleixner	1fb0ef31f4	genirq: Fix affinity notifier fallout The new code of commit cd7eab44e(genirq: Add IRQ affinity notifiers) references irq_desc.affinity which fails to compile with CONFIG_GENERIC_HARDIRQS_NO_DEPRECATED=y. Use irq_desc.irq_data.affinity instead. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ben Hutchings <bhutchings@solarflare.com>	2011-01-31 08:57:41 +01:00
Kacper Kornet	aa5bd67dcf	Fix prlimit64 for suid/sgid processes Since check_prlimit_permission always fails in the case of SUID/GUID processes, such processes are not able to read or set their own limits. This commit changes this by assuming that process can always read/change its own limits. Signed-off-by: Kacper Kornet <kornet@camk.edu.pl> Acked-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-31 13:01:27 +10:00
Lai Jiangshan	8161239a8b	rtmutex: Simplify PI algorithm and make highest prio task get lock In current rtmutex, the pending owner may be boosted by the tasks in the rtmutex's waitlist when the pending owner is deboosted or a task in the waitlist is boosted. This boosting is unrelated, because the pending owner does not really take the rtmutex. It is not reasonable. Example. time1: A(high prio) onwers the rtmutex. B(mid prio) and C (low prio) in the waitlist. time2 A release the lock, B becomes the pending owner A(or other high prio task) continues to run. B's prio is lower than A, so B is just queued at the runqueue. time3 A or other high prio task sleeps, but we have passed some time The B and C's prio are changed in the period (time2 ~ time3) due to boosting or deboosting. Now C has the priority higher than B. **Is it reasonable that C has to boost B and help B to get the rtmutex? NO!! I think, it is unrelated/unneed boosting before B really owns the rtmutex. We should give C a chance to beat B and win the rtmutex. This is the motivation of this patch. This patch ensures* only the top waiter or higher priority task can take the lock. How? 1) we don't dequeue the top waiter when unlock, if the top waiter is changed, the old top waiter will fail and go to sleep again. 2) when requiring lock, it will get the lock when the lock is not taken and: there is no waiter OR higher priority than waiters OR it is top waiter. 3) In any time, the top waiter is changed, the top waiter will be woken up. The algorithm is much simpler than before, no pending owner, no boosting for pending owner. Other advantage of this patch: 1) The states of a rtmutex are reduced a half, easier to read the code. 2) the codes become shorter. 3) top waiter is not dequeued until it really take the lock: they will retain FIFO when it is stolen. Not advantage nor disadvantage 1) Even we may wakeup multiple waiters(any time when top waiter changed), we hardly cause "thundering herd", the number of wokenup task is likely 1 or very little. 2) two APIs are changed. rt_mutex_owner() will not return pending owner, it will return NULL when the top waiter is going to take the lock. rt_mutex_next_owner() always return the top waiter. will not return NULL if we have waiters because the top waiter is not dequeued. I have fixed the code that use these APIs. need updated after this patch is accepted 1) Document/* 2) the testcase scripts/rt-tester/t4-l2-pi-deboost.tst Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4D3012D5.4060709@cn.fujitsu.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-01-27 21:13:51 -05:00
Linus Torvalds	bffb276fff	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Use rq->clock_task instead of rq->clock for correctly maintaining load averages sched: Fix/remove redundant cfs_rq checks sched: Fix sign under-flows in wake_affine	2011-01-28 06:45:04 +10:00
Eric Dumazet	88d4f0db7f	perf: Fix alloc_callchain_buffers() Commit `927c7a9e92` ("perf: Fix race in callchains") introduced a mismatch in the sizing of struct callchain_cpus_entries. nr_cpu_ids must be used instead of num_possible_cpus(), or we might get out of bound memory accesses on some machines. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Miller <davem@davemloft.net> Cc: Stephane Eranian <eranian@google.com> CC: stable@kernel.org LKML-Reference: <1295980851.3588.351.camel@edumazet-laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-27 19:21:50 +01:00
Peter Zijlstra	6ea72f1206	sched: Avoid expensive initial update_cfs_load(), on UP too Fix the build on UP. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> LKML-Reference: <20110122044852.102126037@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-27 12:48:14 +01:00
Thomas Gleixner	10389a15e2	cred: Replace deprecated spinlock initialization SPIN_LOCK_UNLOCK is deprecated. Use the lockdep capable variant instead. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-27 12:30:37 +01:00
Thomas Gleixner	f97b12cce6	Merge commit 'v2.6.38-rc2' into core/locking Reason: Update to mainline before adding the locking cleanup Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-27 12:29:37 +01:00
Arnd Bergmann	ccaa8d6571	rtmutex-tester: Remove BKL tests The BKL is going away, no need to test it any more. I left the definitions of the test case numbers in, so that the other tests do not get renumbered. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <1295993854-4971-19-git-send-email-arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-26 15:58:12 +01:00
Peter Zijlstra	da7a735e51	sched: Fix switch_from_fair() When a task is taken out of the fair class we must ensure the vruntime is properly normalized because when we put it back in it will assume to be normalized. The case that goes wrong is when changing away from the fair class while sleeping. Sleeping tasks have non-normalized vruntime in order to make sleeper-fairness work. So treat the switch away from fair as a wakeup and preserve the relative vruntime. Also update sysrq-n to call the ->switch_{to,from} methods. Reported-by: Onkalo Samu <samu.p.onkalo@nokia.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:33:22 +01:00
Peter Zijlstra	a8941d7ec8	sched: Simplify the idle scheduling class Since commit `48c5ccae88` (sched: Simplify cpu-hot-unplug task migration) this should no longer happen, so remove the code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:33:22 +01:00
Venkatesh Pallipadi	414bee9ba6	softirqs: Account ksoftirqd time as cpustat softirq softirq time in ksoftirqd context is not accounted in ns granularity per cpu softirq stats, as we want that to be a part of ksoftirqd exec_runtime. Accounting them as softirq on /proc/stat separately. Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292980144-28796-6-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:33:22 +01:00
Venkatesh Pallipadi	abb74cefa9	sched: Export ns irqtimes through /proc/stat CONFIG_IRQ_TIME_ACCOUNTING adds ns granularity irq time on each CPU. This info is already used in scheduler to do proper task chargeback (earlier patches). This patch retro-fits this ns granularity hardirq and softirq information to /proc/stat irq and softirq fields. The update is still done on timer tick, where we look at accumulated ns hardirq/softirq time and account the tick to user/system/irq/hardirq/guest accordingly. No new interface added. Earlier versions looked at adding this as new fields in some /proc files. This one seems to be the best in terms of impact to existing apps, even though it has somewhat more kernel code than earlier versions. Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292980144-28796-5-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:33:21 +01:00
Venkatesh Pallipadi	70a89a6620	sched: Refactor account_system_time separating id-update Refactor account_system_time, to separate out the logic of identifying the update needed and code that does actual update. This is used by following patch for IRQ_TIME_ACCOUNTING, which has different identification logic and same update logic. Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292980144-28796-4-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:33:21 +01:00
Venkatesh Pallipadi	a1dabb6bff	time: Add nsecs_to_cputime64 interface for asm-generic Add nsecs_to_cputime64 interface. This is used in following patches that updates cpu irq stat based on ns granularity info in IRQ_TIME_ACCOUNTING. Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292980144-28796-3-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:33:20 +01:00
Venkatesh Pallipadi	4dd53d891c	softirqs: Free up pf flag PF_KSOFTIRQD Cleanup patch, freeing up PF_KSOFTIRQD and use per_cpu ksoftirqd pointer instead, as suggested by Eric Dumazet. Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292980144-28796-2-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:33:20 +01:00
Paul Turner	f07333bf6e	sched: Avoid expensive initial update_cfs_load() Since cfs->{load_stamp,load_last} are zero-initalized the initial load update will consider the delta to be 'since the beginning of time'. This results in a lot of pointless divisions to bring this large period to be within the sysctl_sched_shares_window. Fix this by initializing load_stamp to be 1 at cfs_rq initialization, this allows for an initial load_stamp > load_last which then lets standard idle truncation proceed. We avoid spinning (and slightly improve consistency) by fixing delta to be [period - 1] in this path resulting in a slightly more predictable shares ramp. (Previously the amount of idle time preserved by the overflow would range between [period/2,period-1].) Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110122044852.102126037@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:33:19 +01:00
Paul Turner	6d5ab2932a	sched: Simplify update_cfs_shares parameters Re-visiting this: Since update_cfs_shares will now only ever re-weight an entity that is a relative parent of the current entity in enqueue_entity; we can safely issue the account_entity_enqueue relative to that cfs_rq and avoid the requirement for special handling of the enqueue case in update_cfs_shares. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110122044851.915214637@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:33:19 +01:00
Paul Turner	05ca62c6ca	sched: Use rq->clock_task instead of rq->clock for correctly maintaining load averages The delta in clock_task is a more fair attribution of how much time a tg has been contributing load to the current cpu. While not really important it also means we're more in sync (by magnitude) with respect to periodic updates (since __update_curr deltas are clock_task based). Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110122044852.007092349@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:31:03 +01:00
Paul Turner	b815f1963e	sched: Fix/remove redundant cfs_rq checks Since updates are against an entity's queuing cfs_rq it's not possible to enter update_cfs_{shares,load} with a NULL cfs_rq. (Indeed, update_cfs_load would crash prior to the check if we did anyway since we load is examined during the initializers). Also, in the update_cfs_load case there's no point in maintaining averages for rq->cfs_rq since we don't perform shares distribution at that level -- NULL check is replaced accordingly. Thanks to Dan Carpenter for pointing out the deference before NULL check. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110122044851.825284940@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:31:02 +01:00
Paul Turner	e37b6a7b27	sched: Fix sign under-flows in wake_affine While care is taken around the zero-point in effective_load to not exceed the instantaneous rq->weight, it's still possible (e.g. using wake_idx != 0) for (load + effective_load) to underflow. In this case the comparing the unsigned values can result in incorrect balanced decisions. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110122044851.734245014@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:31:01 +01:00
Linus Torvalds	6fb1b30425	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: wacom - pass touch resolution to clients through input_absinfo Input: wacom - add 2 Bamboo Pen and touch models Input: sysrq - ensure sysrq_enabled and __sysrq_enabled are consistent Input: sparse-keymap - fix KEY_VSW handling in sparse_keymap_setup Input: tegra-kbc - add tegra keyboard driver Input: gpio_keys - switch to using request_any_context_irq Input: serio - allow registered drivers to get status flag Input: ct82710c - return proper error code for ct82c710_open Input: bu21013_ts - added regulator support Input: bu21013_ts - remove duplicate resolution parameters Input: tnetv107x-ts - don't treat NULL clk as an error Input: tnetv107x-keypad - don't treat NULL clk as an error Fix up trivial conflicts in drivers/input/keyboard/Makefile due to additions of tc3589x/Tegra drivers	2011-01-26 16:31:44 +10:00
Torben Hohn	ac751efa6a	console: rename acquire/release_console_sem() to console_lock/unlock() The -rt patches change the console_semaphore to console_mutex. As a result, a quite large chunk of the patches changes all acquire/release_console_sem() to acquire/release_console_mutex() This commit makes things use more neutral function names which dont make implications about the underlying lock. The only real change is the return value of console_trylock which is inverted from try_acquire_console_sem() This patch also paves the way to switching console_sem from a semaphore to a mutex. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: make console_trylock return 1 on success, per Geert] Signed-off-by: Torben Hohn <torbenh@gmx.de> Cc: Thomas Gleixner <tglx@tglx.de> Cc: Greg KH <gregkh@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-26 10:50:06 +10:00
David S. Miller	c2df88cbb4	Merge branch 'irq/numa' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip	2011-01-24 14:30:13 -08:00
David S. Miller	5bdc22a565	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/sched/sch_hfsc.c net/sched/sch_htb.c net/sched/sch_tbf.c	2011-01-24 14:09:35 -08:00
Linus Torvalds	500d85ce39	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf tools: Fix time function double declaration with glibc perf tools: Fix build by checking if extra warnings are supported perf tools: Fix build when using gcc 3.4.6 perf tools: Add missing header, fixes build perf tools: Fix 64 bit integer format strings perf test: Fix build on older glibcs perf: perf_event_exit_task_context: s/rcu_dereference/rcu_dereference_raw/ perf test: Use cpu_map->[cpu] when setting affinity perf symbols: Fix annotation of thumb code perf: Annotate cpuctx->ctx.mutex to avoid a lockdep splat powerpc, perf: Fix frequency calculation for overflowing counters (FSL version) perf: Fix perf_event_init_task()/perf_event_free_task() interaction perf: Fix find_get_context() vs perf_event_exit_task() race	2011-01-25 05:26:47 +10:00
Linus Torvalds	ce84d539ce	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: RTC: Remove Kconfig symbol for UIE emulation RTC: Properly handle rtc_read_alarm error propagation and fix bug RTC: Propagate error handling via rtc_timer_enqueue properly acpi_pm: Clear pmtmr_ioport if acpi_pm initialization fails rtc: Cleanup removed UIE emulation declaration hrtimers: Notify hrtimer users of switches to NOHZ mode	2011-01-25 05:25:55 +10:00
Linus Torvalds	bc094757f4	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix poor interactivity on UP systems due to group scheduler nice tune bug	2011-01-25 05:25:13 +10:00
Andy Whitcroft	8c6a98b22b	Input: sysrq - ensure sysrq_enabled and __sysrq_enabled are consistent Currently sysrq_enabled and __sysrq_enabled are initialised separately and inconsistently, leading to sysrq being actually enabled by reported as not enabled in sysfs. The first change to the sysfs configurable synchronises these two: static int __read_mostly sysrq_enabled = 1; static int __sysrq_enabled; Add a common define to carry the default for these preventing them becoming out of sync again. Default this to 1 to mirror previous behaviour. Signed-off-by: Andy Whitcroft <apw@canonical.com> Cc: stable@kernel.org Signed-off-by: Dmitry Torokhov <dtor@mail.ru>	2011-01-24 09:33:36 -08:00
Yong Zhang	3ff6dcac73	sched: Fix poor interactivity on UP systems due to group scheduler nice tune bug Michael Witten and Christian Kujau reported that the autogroup scheduling feature hurts interactivity on their UP systems. It turns out that this is an older bug in the group scheduling code, and the wider appeal provided by the autogroup feature exposed it more prominently. When on UP with FAIR_GROUP_SCHED enabled, tune shares only affect tg->shares, but is not reflected in tg->se->load. The reason is that update_cfs_shares() does nothing on UP. So introduce update_cfs_shares() for UP && FAIR_GROUP_SCHED. This issue was found when enable autogroup scheduling was enabled, but it is an older bug that also exists on cgroup.cpu on UP. Reported-and-Tested-by: Michael Witten <mfwitten@gmail.com> Reported-and-Tested-by: Christian Kujau <christian@nerdbynature.de> Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Acked-by: Pekka Enberg <penberg@kernel.org> Acked-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20110124073352.GA24186@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-24 11:47:50 +01:00
Dmitry Torokhov	e94965ed5b	module: show version information for built-in modules in sysfs Currently only drivers that are built as modules have their versions shown in /sys/module/<module_name>/version, but this information might also be useful for built-in drivers as well. This especially important for drivers that do not define any parameters - such drivers, if built-in, are completely invisible from userspace. This patch changes MODULE_VERSION() macro so that in case when we are compiling built-in module, version information is stored in a separate section. Kernel then uses this data to create 'version' sysfs attribute in the same fashion it creates attributes for module parameters. Signed-off-by: Dmitry Torokhov <dtor@vmware.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2011-01-24 14:32:51 +10:30
Ben Hutchings	cd7eab44e9	genirq: Add IRQ affinity notifiers When initiating I/O on a multiqueue and multi-IRQ device, we may want to select a queue for which the response will be handled on the same or a nearby CPU. This requires a reverse-map of IRQ affinity. Add a notification mechanism to support this. This is based closely on work by Thomas Gleixner <tglx@linutronix.de>. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Cc: linux-net-drivers@solarflare.com Cc: Tom Herbert <therbert@google.com> Cc: David Miller <davem@davemloft.net> LKML-Reference: <1295470904.11126.84.camel@bwh-desktop> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-22 17:34:25 +01:00
Linus Torvalds	5bf7a6503f	Merge branch 'fixes-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'fixes-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: note the nested NOT_RUNNING test in worker_clr_flags() isn't a noop workqueue: relax lockdep annotation on flush_work()	2011-01-21 13:38:57 -08:00
Oleg Nesterov	806839b22c	perf: perf_event_exit_task_context: s/rcu_dereference/rcu_dereference_raw/ In theory, almost every user of task->child->perf_event_ctxp[] is wrong. find_get_context() can install the new context at any moment, we need read_barrier_depends(). `dbe08d82ce` "perf: Fix find_get_context() vs perf_event_exit_task() race" added rcu_dereference() into perf_event_exit_task_context() to make the precedent, but this makes __rcu_dereference_check() unhappy. Use rcu_dereference_raw() to shut up the warning. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: acme@redhat.com Cc: paulus@samba.org Cc: stern@rowland.harvard.edu Cc: a.p.zijlstra@chello.nl Cc: fweisbec@gmail.com Cc: roland@redhat.com Cc: prasad@linux.vnet.ibm.com Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> LKML-Reference: <20110121174547.GA8796@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-21 22:08:16 +01:00
Peter Zijlstra	547e9fd7d3	perf: Annotate cpuctx->ctx.mutex to avoid a lockdep splat Lockdep spotted: loop_1b_instruc/1899 is trying to acquire lock: (event_mutex){+.+.+.}, at: [<ffffffff810e1908>] perf_trace_init+0x3b/0x2f7 but task is already holding lock: (&ctx->mutex){+.+.+.}, at: [<ffffffff810eb45b>] perf_event_init_context+0xc0/0x218 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&ctx->mutex){+.+.+.}: -> #2 (cpu_hotplug.lock){+.+.+.}: -> #1 (module_mutex){+.+...}: -> #0 (event_mutex){+.+.+.}: But because the deadlock would be cpuhotplug (cpu-event) vs fork (task-event) it cannot, in fact, happen. We can annotate this by giving the perf_event_context used for the cpuctx a different lock class from those used by tasks. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-21 16:32:42 +01:00
Thomas Gleixner	1c77ff22f5	genirq: Remove __do_IRQ All architectures are finally converted. Remove the cruft. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Henderson <rth@twiddle.net> Cc: Mike Frysinger <vapier@gentoo.org> Cc: David Howells <dhowells@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Greg Ungerer <gerg@uclinux.org> Cc: Michal Simek <monstr@monstr.eu> Acked-by: David Howells <dhowells@redhat.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Chen Liqin <liqin.chen@sunplusct.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jeff Dike <jdike@addtoit.com>	2011-01-21 11:55:31 +01:00
Linus Torvalds	2b1caf6ed7	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: smp: Allow on_each_cpu() to be called while early_boot_irqs_disabled status to init/main.c lockdep: Move early boot local IRQ enable/disable status to init/main.c	2011-01-20 18:30:37 -08:00
Linus Torvalds	8d99641f6c	Merge branch 'akpm' * akpm: kernel/smp.c: consolidate writes in smp_call_function_interrupt() kernel/smp.c: fix smp_call_function_many() SMP race memcg: correctly order reading PCG_USED and pc->mem_cgroup backlight: fix 88pm860x_bl macro collision drivers/leds/ledtrig-gpio.c: make output match input, tighten input checking MAINTAINERS: update Atmel AT91 entry mm: fix truncate_setsize() comment memcg: fix rmdir, force_empty with THP memcg: fix LRU accounting with THP memcg: fix USED bit handling at uncharge in THP memcg: modify accounting function for supporting THP better fs/direct-io.c: don't try to allocate more than BIO_MAX_PAGES in a bio mm: compaction: prevent division-by-zero during user-requested compaction mm/vmscan.c: remove duplicate include of compaction.h memblock: fix memblock_is_region_memory() thp: keep highpte mapped until it is no longer needed kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT	2011-01-20 17:02:14 -08:00
Milton Miller	225c8e010f	kernel/smp.c: consolidate writes in smp_call_function_interrupt() We have to test the cpu mask in the interrupt handler before checking the refs, otherwise we can start to follow an entry before its deleted and find it partially initailzed for the next trip. Presently we also clear the cpumask bit before executing the called function, which implies getting write access to the line. After the function is called we then decrement refs, and if they go to zero we then unlock the structure. However, this implies getting write access to the call function data before and after another the function is called. If we can assert that no smp_call_function execution function is allowed to enable interrupts, then we can move both writes to after the function is called, hopfully allowing both writes with one cache line bounce. On a 256 thread system with a kernel compiled for 1024 threads, the time to execute testcase in the "smp_call_function_many race" changelog was reduced by about 30-40ms out of about 545 ms. I decided to keep this as WARN because its now a buggy function, even though the stack trace is of no value -- a simple printk would give us the information needed. Raw data: Without patch: ipi_test startup took 1219366ns complete 539819014ns total 541038380ns ipi_test startup took 1695754ns complete 543439872ns total 545135626ns ipi_test startup took 7513568ns complete 539606362ns total 547119930ns ipi_test startup took 13304064ns complete 533898562ns total 547202626ns ipi_test startup took 8668192ns complete 544264074ns total 552932266ns ipi_test startup took 4977626ns complete 548862684ns total 553840310ns ipi_test startup took 2144486ns complete 541292318ns total 543436804ns ipi_test startup took 21245824ns complete 530280180ns total 551526004ns With patch: ipi_test startup took 5961748ns complete 500859628ns total 506821376ns ipi_test startup took 8975996ns complete 495098924ns total 504074920ns ipi_test startup took 19797750ns complete 492204740ns total 512002490ns ipi_test startup took 14824796ns complete 487495878ns total 502320674ns ipi_test startup took 11514882ns complete 494439372ns total 505954254ns ipi_test startup took 8288084ns complete 502570774ns total 510858858ns ipi_test startup took 6789954ns complete 493388112ns total 500178066ns #include <linux/module.h> #include <linux/init.h> #include <linux/sched.h> /* sched clock / #define ITERATIONS 100 static void do_nothing_ipi(void dummy) { } static void do_ipis(struct work_struct *dummy) { int i; for (i = 0; i < ITERATIONS; i++) smp_call_function(do_nothing_ipi, NULL, 1); printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id()); } static struct work_struct work[NR_CPUS]; static int __init testcase_init(void) { int cpu; u64 start, started, done; start = local_clock(); for_each_online_cpu(cpu) { INIT_WORK(&work[cpu], do_ipis); schedule_work_on(cpu, &work[cpu]); } started = local_clock(); for_each_online_cpu(cpu) flush_work(&work[cpu]); done = local_clock(); pr_info("ipi_test startup took %lldns complete %lldns total %lldns\n", started-start, done-started, done-start); return 0; } static void __exit testcase_exit(void) { } module_init(testcase_init) module_exit(testcase_exit) MODULE_LICENSE("GPL"); MODULE_AUTHOR("Anton Blanchard"); Signed-off-by: Milton Miller <miltonm@bga.com> Cc: Anton Blanchard <anton@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-20 17:02:06 -08:00
Anton Blanchard	6dc1989995	kernel/smp.c: fix smp_call_function_many() SMP race I noticed a failure where we hit the following WARN_ON in generic_smp_call_function_interrupt: if (!cpumask_test_and_clear_cpu(cpu, data->cpumask)) continue; data->csd.func(data->csd.info); refs = atomic_dec_return(&data->refs); WARN_ON(refs < 0); <------------------------- We atomically tested and cleared our bit in the cpumask, and yet the number of cpus left (ie refs) was 0. How can this be? It turns out commit `54fdade1c3` ("generic-ipi: make struct call_function_data lockless") is at fault. It removes locking from smp_call_function_many and in doing so creates a rather complicated race. The problem comes about because: - The smp_call_function_many interrupt handler walks call_function.queue without any locking. - We reuse a percpu data structure in smp_call_function_many. - We do not wait for any RCU grace period before starting the next smp_call_function_many. Imagine a scenario where CPU A does two smp_call_functions back to back, and CPU B does an smp_call_function in between. We concentrate on how CPU C handles the calls: CPU A CPU B CPU C CPU D smp_call_function smp_call_function_interrupt walks call_function.queue sees data from CPU A on list smp_call_function smp_call_function_interrupt walks call_function.queue sees (stale) CPU A on list smp_call_function int clears last ref on A list_del_rcu, unlock smp_call_function reuses percpu data A data->cpumask sees and clears bit in cpumask might be using old or new fn! decrements refs below 0 set data->refs (too late!) The important thing to note is since the interrupt handler walks a potentially stale call_function.queue without any locking, then another cpu can view the percpu data structure at any time, even when the owner is in the process of initialising it. The following test case hits the WARN_ON 100% of the time on my PowerPC box (having 128 threads does help :) #include <linux/module.h> #include <linux/init.h> #define ITERATIONS 100 static void do_nothing_ipi(void dummy) { } static void do_ipis(struct work_struct dummy) { int i; for (i = 0; i < ITERATIONS; i++) smp_call_function(do_nothing_ipi, NULL, 1); printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id()); } static struct work_struct work[NR_CPUS]; static int __init testcase_init(void) { int cpu; for_each_online_cpu(cpu) { INIT_WORK(&work[cpu], do_ipis); schedule_work_on(cpu, &work[cpu]); } return 0; } static void __exit testcase_exit(void) { } module_init(testcase_init) module_exit(testcase_exit) MODULE_LICENSE("GPL"); MODULE_AUTHOR("Anton Blanchard"); I tried to fix it by ordering the read and the write of ->cpumask and ->refs. In doing so I missed a critical case but Paul McKenney was able to spot my bug thankfully :) To ensure we arent viewing previous iterations the interrupt handler needs to read ->refs then ->cpumask then ->refs _again_. Thanks to Milton Miller and Paul McKenney for helping to debug this issue. [miltonm@bga.com: add WARN_ON and BUG_ON, remove extra read of refs before initial read of mask that doesn't help (also noted by Peter Zijlstra), adjust comments, hopefully clarify scenario ] [miltonm@bga.com: remove excess tests] Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Milton Miller <miltonm@bga.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: <stable@kernel.org> [2.6.32+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-20 17:02:06 -08:00
Linus Torvalds	466c19063b	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched, cgroup: Use exit hook to avoid use-after-free crash sched: Fix signed unsigned comparison in check_preempt_tick() sched: Replace rq->bkl_count with rq->rq_sched_info.bkl_count sched, autogroup: Fix CONFIG_RT_GROUP_SCHED sched_setscheduler() failure sched: Display autogroup names in /proc/sched_debug sched: Reinstate group names in /proc/sched_debug sched: Update effective_load() to use global share weights	2011-01-20 16:37:55 -08:00
Tejun Heo	bd924e8cbd	smp: Allow on_each_cpu() to be called while early_boot_irqs_disabled status to init/main.c percpu may end up calling vfree() during early boot which in turn may call on_each_cpu() for TLB flushes. The function of on_each_cpu() can be done safely while IRQ is disabled during early boot but it assumed that the function is always called with local IRQ enabled which ended up enabling local IRQ prematurely during boot and triggering a couple of warnings. This patch updates on_each_cpu() and smp_call_function_many() such on_each_cpu() can be used safely while early_boot_irqs_disabled is set. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20110120110713.GC6036@htj.dyndns.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Reported-by: Ingo Molnar <mingo@elte.hu>	2011-01-20 13:32:34 +01:00
Tejun Heo	2ce802f62b	lockdep: Move early boot local IRQ enable/disable status to init/main.c During early boot, local IRQ is disabled until IRQ subsystem is properly initialized. During this time, no one should enable local IRQ and some operations which usually are not allowed with IRQ disabled, e.g. operations which might sleep or require communications with other processors, are allowed. lockdep tracked this with early_boot_irqs_off/on() callbacks. As other subsystems need this information too, move it to init/main.c and make it generally available. While at it, toggle the boolean to early_boot_irqs_disabled instead of enabled so that it can be initialized with %false and %true indicates the exceptional condition. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20110120110635.GB6036@htj.dyndns.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-20 13:32:33 +01:00
Patrick McHardy	14f0290ba4	Merge branch 'master' of /repos/git/net-next-2.6	2011-01-19 23:51:37 +01:00
Stephen Boyd	2d0640b47d	hrtimers: Notify hrtimer users of switches to NOHZ mode When NOHZ=y and high res timers are disabled (via cmdline or Kconfig) tick_nohz_switch_to_nohz() will notify the user about switching into NOHZ mode. Nothing is printed for the case where HIGH_RES_TIMERS=y. Fix this for the HIGH_RES_TIMERS=y case by duplicating the printk from the low res NOHZ path in the high res NOHZ path. This confused me since I was thinking 'dmesg \| grep -i NOHZ' would tell me if NOHZ was enabled, but if I have hrtimers there is nothing. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1295419594-13085-1-git-send-email-sboyd@codeaurora.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-19 20:08:15 +01:00
Oleg Nesterov	8550d7cb6e	perf: Fix perf_event_init_task()/perf_event_free_task() interaction perf_event_init_task() should clear child->perf_event_ctxp[] before anything else. Otherwise, if perf_event_init_context(perf_hw_context) fails, perf_event_free_task() can free perf_event_ctxp[perf_sw_context] copied from parent->perf_event_ctxp[] by dup_task_struct(). Also move the initialization of perf_event_mutex and perf_event_list from perf_event_init_context() to perf_event_init_context(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> LKML-Reference: <20110119182228.GC12183@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-19 20:04:28 +01:00
Oleg Nesterov	dbe08d82ce	perf: Fix find_get_context() vs perf_event_exit_task() race find_get_context() must not install the new perf_event_context if the task has already passed perf_event_exit_task(). If nothing else, this means the memory leak. Initially ctx->refcount == 2, it is supposed that perf_event_exit_task_context() should participate and do the necessary put_ctx(). find_lively_task_by_vpid() checks PF_EXITING but this buys nothing, by the time we call find_get_context() this task can be already dead. To the point, cmpxchg() can succeed when the task has already done the last schedule(). Change find_get_context() to populate task->perf_event_ctxp[] under task->perf_event_mutex, this way we can trust PF_EXITING because perf_event_exit_task() takes the same mutex. Also, change perf_event_exit_task_context() to use rcu_dereference(). Probably this is not strictly needed, but with or without this change find_get_context() can race with setup_new_exec()->perf_event_exit_task(), rcu_dereference() looks better. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> LKML-Reference: <20110119182207.GB12183@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-19 20:04:27 +01:00
Tao Ma	490da40d82	blktrace: Don't output messages if NOTIFY isn't set. Now if we enable blktrace, cfq has too many messages output to the trace buffer. It is fine if we don't specify any action mask. But if I do like this: blktrace /dev/sdb -a issue -a complete -o - \| blkparse -i - I only want to see 'D' and 'C', while with the following command dd if=/mnt/ocfs2/test of=/dev/null bs=4k count=1 iflag=direct I will get(with a 2.6.37 vanilla kernel): 8,16 0 0 0.000000000 0 m N cfq3805 alloced 8,16 0 0 0.000004126 0 m N cfq3805 insert_request 8,16 0 0 0.000004884 0 m N cfq3805 add_to_rr 8,16 0 0 0.000008417 0 m N cfq workload slice:300 8,16 0 0 0.000009557 0 m N cfq3805 set_active wl_prio:0 wl_type:2 8,16 0 0 0.000010640 0 m N cfq3805 fifo= (null) 8,16 0 0 0.000011193 0 m N cfq3805 dispatch_insert 8,16 0 0 0.000012221 0 m N cfq3805 dispatched a request 8,16 0 0 0.000012802 0 m N cfq3805 activate rq, drv=1 8,16 0 1 0.000013181 3805 D R 114759 + 8 [dd] 8,16 0 2 0.000164244 0 C R 114759 + 8 [0] 8,16 0 0 0.000167997 0 m N cfq3805 complete rqnoidle 0 8,16 0 0 0.000168782 0 m N cfq3805 set_slice=100 8,16 0 0 0.000169874 0 m N cfq3805 arm_idle: 8 group_idle: 0 8,16 0 0 0.000170189 0 m N cfq schedule dispatch 8,16 0 0 0.000397938 0 m N cfq3805 slice expired t=0 8,16 0 0 0.000399763 0 m N cfq3805 sl_used=1 disp=1 charge=1 iops=0 sect=8 8,16 0 0 0.000400227 0 m N cfq3805 del_from_rr 8,16 0 0 0.000400882 0 m N cfq3805 put_queue See, there are 19 lines while I only need 2. I don't think it is appropriate for a user. So this patch will disable any messages if the BLK_TC_NOTIFY isn't set. Now the output for the same command will look like: 8,16 0 1 0.000000000 4908 D R 114759 + 8 [dd] 8,16 0 2 0.000146827 0 C R 114759 + 8 [0] Yes, it is what I want to see. Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-01-19 08:25:02 -07:00
Jesper Juhl	42b16b3fbb	Kill off warning: ‘inline’ is not at beginning of declaration Fix a bunch of warning: ‘inline’ is not at beginning of declaration messages when building a 'make allyesconfig' kernel with -Wextra. These warnings are trivial to kill, yet rather annoying when building with -Wextra. The more we can cut down on pointless crap like this the better (IMHO). A previous patch to do this for a 'allnoconfig' build has already been merged. This just takes the cleanup a little further. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-01-19 15:43:08 +01:00
Peter Zijlstra	068c5cc5ac	sched, cgroup: Use exit hook to avoid use-after-free crash By not notifying the controller of the on-exit move back to init_css_set, we fail to move the task out of the previous cgroup's cfs_rq. This leads to an opportunity for a cgroup-destroy to come in and free the cgroup (there are no active tasks left in it after all) to which the not-quite dead task is still enqueued. Reported-by: Miklos Vajna <vmiklos@frugalware.org> Fixed-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <1293206353.29444.205.camel@laptop>	2011-01-19 12:51:32 +01:00
Linus Torvalds	335bc70b6b	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Validate cpu early in perf_event_alloc() perf: Find_get_context: fix the per-cpu-counter check perf: Fix contexted inheritance	2011-01-18 14:29:37 -08:00
Oleg Nesterov	66832eb4ba	perf: Validate cpu early in perf_event_alloc() Starting from perf_event_alloc()->perf_init_event(), the kernel assumes that event->cpu is either -1 or the valid CPU number. Change perf_event_alloc() to validate this argument early. This also means we can remove the similar check in find_get_context(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Cc: gregkh@suse.de Cc: stable@kernel.org LKML-Reference: <20110118161032.GC693@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-18 19:34:23 +01:00
Oleg Nesterov	22a4ec7290	perf: Find_get_context: fix the per-cpu-counter check If task == NULL, find_get_context() should always check that cpu is correct. Afaics, the bug was introduced by `38a81da2` "perf events: Clean up pid passing", but even before that commit "&& cpu != -1" was not exactly right, -ESRCH from find_task_by_vpid() is not accurate. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Cc: gregkh@suse.de Cc: stable@kernel.org LKML-Reference: <20110118161008.GB693@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-18 19:34:23 +01:00
Peter Zijlstra	c5ed514559	perf: Fix contexted inheritance Linus reported that the RCU lockdep annotation bits triggered for this rcu_dereference() because we're not holding rcu_read_lock(). Going over the code I cannot convince myself its correct: - holding a ref on the parent_ctx, doesn't avoid it being uncloned concurrently (as the comment says), so we can race with a free. - holding parent_ctx->mutex doesn't avoid the above free from taking place either, it would at best avoid parent_ctx from being freed. I.e. the warning is correct. To fix the bug, serialize against the unclone_ctx() call by extending the reach of the parent_ctx->lock. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-18 15:10:35 +01:00
Mike Galbraith	d7d8294415	sched: Fix signed unsigned comparison in check_preempt_tick() Signed unsigned comparison may lead to superfluous resched if leftmost is right of the current task, wasting a few cycles, and inadvertently _lengthening_ the current task's slice. Reported-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1294202477.9384.5.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-18 15:09:44 +01:00
Yong Zhang	fce2097983	sched: Replace rq->bkl_count with rq->rq_sched_info.bkl_count Now rq->rq_sched_info.bkl_count is not used for rq, scroll rq->bkl_count into it. Thus we can save some space for rq. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1294991859-13246-1-git-send-email-yong.zhang0@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-18 15:09:43 +01:00
Mike Galbraith	f44937718c	sched, autogroup: Fix CONFIG_RT_GROUP_SCHED sched_setscheduler() failure If CONFIG_RT_GROUP_SCHED is set, __sched_setscheduler() fails due to autogroup not allocating rt_runtime. Free unused/unusable rt_se and rt_rq, redirect RT tasks to the root task group, and tell __sched_setscheduler() that it's ok. Reported-and-tested-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1294890890.8089.39.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-18 15:09:42 +01:00
Bharata B Rao	8ecedd7a06	sched: Display autogroup names in /proc/sched_debug Add autogroup name to cfs_rq and tasks information to /proc/sched_debug. Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110111101257.GF4772@in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-18 15:09:40 +01:00
Bharata B Rao	efe25c2c7b	sched: Reinstate group names in /proc/sched_debug Displaying of group names in /proc/sched_debug was dropped in autogroup patches. Add group names while displaying cfs_rq and tasks information. Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110111101153.GE4772@in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-18 15:09:39 +01:00
Paul Turner	977dda7c9b	sched: Update effective_load() to use global share weights Previously effective_load would approximate the global load weight present on a group taking advantage of: entity_weight = tg->shares ( lw / global_lw ), where entity_weight was provided by tg_shares_up. This worked (approximately) for an 'empty' (at tg level) cpu since we would place boost load representative of what a newly woken task would receive. However, now that load is instantaneously updated this assumption is no longer true and the load calculation is rather incorrect in this case. Fix this (and improve the general case) by re-writing effective_load to take advantage of the new shares distribution code. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110115015817.069769529@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-18 15:09:38 +01:00
Jan Engelhardt	ae9d67aff6	audit: export symbol for use with xt_AUDIT When xt_AUDIT is built as a module, modpost reports a problem. MODPOST 322 modules ERROR: "audit_enabled" [net/netfilter/x_tables.ko] undefined! WARNING: modpost: Found 1 section mismatch(es). Cc: Thomas Graf <tgraf@redhat.com> Signed-off-by: Jan Engelhardt <jengelh@medozas.de>	2011-01-18 06:48:29 +01:00
Linus Torvalds	f9ee7f60d6	Merge branches 'core-fixes-for-linus', 'x86-fixes-for-linus', 'timers-fixes-for-linus' and 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: avoid pointless blocked-task warnings rcu: demote SRCU_SYNCHRONIZE_DELAY from kernel-parameter status rtmutex: Fix comment about why new_owner can be NULL in wake_futex_pi() * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, olpc: Add missing Kconfig dependencies x86, mrst: Set correct APB timer IRQ affinity for secondary cpu x86: tsc: Fix calibration refinement conditionals to avoid divide by zero x86, ia64, acpi: Clean up x86-ism in drivers/acpi/numa.c * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: timekeeping: Make local variables static time: Rename misnamed minsec argument of clocks_calc_mult_shift() * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Remove syscall_exit_fields tracing: Only process module tracepoints once perf record: Add "nodelay" mode, disabled by default perf sched: Fix list of events, dropping unsupported ':r' modifier Revert "perf tools: Emit clearer message for sys_perf_event_open ENOENT return" perf top: Fix annotate segv perf evsel: Fix order of event list deletion	2011-01-15 12:45:00 -08:00
Lai Jiangshan	7f85803a26	tracing: Remove syscall_exit_fields There is no need for syscall_exit_fields as the syscall exit event class can already host the fields in its structure, like most other trace events do by default. Use that default behavior instead. Following this scheme, we don't need anymore to override the get_fields() callback of the syscall exit event class either. Hence both syscall_exit_fields and syscall_get_exit_fields() can be removed. Also changed some indentation to keep the following under 80 characters: ".fields = LIST_HEAD_INIT(event_class_syscall_exit.fields)," Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4D301C0E.8090408@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-01-14 18:06:35 -05:00
Linus Torvalds	acda4721ae	Merge branch 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin * 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin: kernel: fix hlist_bl again cgroups: Fix a lockdep warning at cgroup removal fs: namei fix ->put_link on wrong inode in do_filp_open	2011-01-14 09:08:29 -08:00
Al Viro	c72a04e347	cgroup_fs: fix cgroup use of simple_lookup() cgroup can't use simple_lookup(), since that'd override its desired ->d_op. Tested-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-14 08:07:48 -08:00
Ingo Molnar	1161ec9449	Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/urgent	2011-01-14 15:30:16 +01:00
Paul E. McKenney	b24efdfdf6	rcu: avoid pointless blocked-task warnings If the RCU callback-processing kthread has nothing to do, it parks in a wait_event(). If RCU remains idle for more than two minutes, the kernel complains about this. This commit changes from wait_event() to wait_event_interruptible() to prevent the kernel from complaining just because RCU is idle. Reported-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Thomas Weber <weber@corscience.de> Tested-by: Russell King <rmk+kernel@arm.linux.org.uk>	2011-01-14 04:58:08 -08:00
Paul E. McKenney	c072a388d5	rcu: demote SRCU_SYNCHRONIZE_DELAY from kernel-parameter status Because the adaptive synchronize_srcu_expedited() approach has worked very well in testing, remove the kernel parameter and replace it by a C-preprocessor macro. If someone finds problems with this approach, a more complex and aggressively adaptive approach might be required. Longer term, SRCU will be merged with the other RCU implementations, at which point synchronize_srcu_expedited() will be event driven, just as synchronize_sched_expedited() currently is. At that point, there will be no need for this adaptive approach. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-01-14 04:56:49 -08:00
Li Zefan	3ec762ad8b	cgroups: Fix a lockdep warning at cgroup removal Commit `2fd6b7f5` ("fs: dcache scale subdirs") forgot to annotate a dentry lock, which caused a lockdep warning. Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>	2011-01-14 08:46:29 +00:00
Linus Torvalds	52cfd503ad	Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (59 commits) ACPI / PM: Fix build problems for !CONFIG_ACPI related to NVS rework ACPI: fix resource check message ACPI / Battery: Update information on info notification and resume ACPI: Drop device flag wake_capable ACPI: Always check if _PRW is present before trying to evaluate it ACPI / PM: Check status of power resources under mutexes ACPI / PM: Rename acpi_power_off_device() ACPI / PM: Drop acpi_power_nocheck ACPI / PM: Drop acpi_bus_get_power() Platform / x86: Make fujitsu_laptop use acpi_bus_update_power() ACPI / Fan: Rework the handling of power resources ACPI / PM: Register power resource devices as soon as they are needed ACPI / PM: Register acpi_power_driver early ACPI / PM: Add function for updating device power state consistently ACPI / PM: Add function for device power state initialization ACPI / PM: Introduce __acpi_bus_get_power() ACPI / PM: Introduce function for refcounting device power resources ACPI / PM: Add functions for manipulating lists of power resources ACPI / PM: Prevent acpi_power_get_inferred_state() from making changes ACPICA: Update version to 20101209 ...	2011-01-13 20:15:35 -08:00
Andrea Arcangeli	ba76149f47	thp: khugepaged Add khugepaged to relocate fragmented pages into hugepages if new hugepages become available. (this is indipendent of the defrag logic that will have to make new hugepages available) The fundamental reason why khugepaged is unavoidable, is that some memory can be fragmented and not everything can be relocated. So when a virtual machine quits and releases gigabytes of hugepages, we want to use those freely available hugepages to create huge-pmd in the other virtual machines that may be running on fragmented memory, to maximize the CPU efficiency at all times. The scan is slow, it takes nearly zero cpu time, except when it copies data (in which case it means we definitely want to pay for that cpu time) so it seems a good tradeoff. In addition to the hugepages being released by other process releasing memory, we have the strong suspicion that the performance impact of potentially defragmenting hugepages during or before each page fault could lead to more performance inconsistency than allocating small pages at first and having them collapsed into large pages later... if they prove themselfs to be long lived mappings (khugepaged scan is slow so short lived mappings have low probability to run into khugepaged if compared to long lived mappings). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:43 -08:00
Andrea Arcangeli	e7a00c45f2	thp: add pmd_huge_pte to mm_struct This increase the size of the mm struct a bit but it is needed to preallocate one pte for each hugepage so that split_huge_page will not require a fail path. Guarantee of success is a fundamental property of split_huge_page to avoid decrasing swapping reliability and to avoid adding -ENOMEM fail paths that would otherwise force the hugepage-unaware VM code to learn rolling back in the middle of its pte mangling operations (if something we need it to learn handling pmd_trans_huge natively rather being capable of rollback). When split_huge_page runs a pte is needed to succeed the split, to map the newly splitted regular pages with a regular pte. This way all existing VM code remains backwards compatible by just adding a split_huge_page* one liner. The memory waste of those preallocated ptes is negligible and so it is worth it. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:41 -08:00
Andrea Arcangeli	a5b338f2b0	thp: update futex compound knowledge Futex code is smarter than most other gup_fast O_DIRECT code and knows about the compound internals. However now doing a put_page(head_page) will not release the pin on the tail page taken by gup-fast, leading to all sort of refcounting bugchecks. Getting a stable head_page is a little tricky. page_head = page is there because if this is not a tail page it's also the page_head. Only in case this is a tail page, compound_head is called, otherwise it's guaranteed unnecessary. And if it's a tail page compound_head has to run atomically inside irq disabled section __get_user_pages_fast before returning. Otherwise ->first_page won't be a stable pointer. Disableing irq before __get_user_page_fast and releasing irq after running compound_head is needed because if __get_user_page_fast returns == 1, it means the huge pmd is established and cannot go away from under us. pmdp_splitting_flush_notify in __split_huge_page_splitting will have to wait for local_irq_enable before the IPI delivery can return. This means __split_huge_page_refcount can't be running from under us, and in turn when we run compound_head(page) we're not reading a dangling pointer from tailpage->first_page. Then after we get to stable head page, we are always safe to call compound_lock and after taking the compound lock on head page we can finally re-check if the page returned by gup-fast is still a tail page. in which case we're set and we didn't need to split the hugepage in order to take a futex on it. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:39 -08:00
Mandeep Singh Baines	dabb16f639	oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down We'd like to be able to oom_score_adj a process up/down as it enters/leaves the foreground. Currently, it is not possible to oom_adj down without CAP_SYS_RESOURCE. This patch allows a task to decrease its oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to or its inherited value at fork. Assuming the thread that has forked it has oom_score_adj of 0, each process could decrease it back from 0 upon activation unless a CAP_SYS_RESOURCE thread elevated it to something higher. Alternative considered: * a setuid binary * a daemon with CAP_SYS_RESOURCE Since you don't wan't all processes to be able to reduce their oom_adj, a setuid or daemon implementation would be complex. The alternatives also have much higher overhead. This patch updated from original patch based on feedback from David Rientjes. Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Acked-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:35 -08:00

... 4 5 6 7 8 ...

11457 Commits (a26ac2455ffcf3be5c6ef92bc6df7182700f2114)