linux

Commit Graph

Author	SHA1	Message	Date
Darren Hart	7ada876a87	futex: Fix errors in nested key ref-counting futex_wait() is leaking key references due to futex_wait_setup() acquiring an additional reference via the queue_lock() routine. The nested key ref-counting has been masking bugs and complicating code analysis. queue_lock() is only called with a previously ref-counted key, so remove the additional ref-counting from the queue_(un)lock() functions. Also futex_wait_requeue_pi() drops one key reference too many in unqueue_me_pi(). Remove the key reference handling from unqueue_me_pi(). This was paired with a queue_lock() in futex_lock_pi(), so the count remains unchanged. Document remaining nested key ref-counting sites. Signed-off-by: Darren Hart <dvhart@linux.intel.com> Reported-and-tested-by: Matthieu Fertré<matthieu.fertre@kerlabs.com> Reported-by: Louis Rilling<louis.rilling@kerlabs.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: John Kacur <jkacur@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <4CBB17A8.70401@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org	2010-10-19 11:41:54 +02:00
Arnd Bergmann	7ff52efdca	dabusb: remove the BKL The dabusb device driver is sufficiently serialized using its own mutex, no need for the big kernel lock here in addition. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Mauro Carvalho Chehab <mchehab@infradead.org>	2010-10-19 11:30:00 +02:00
Arnd Bergmann	a6f8dbc654	sunrpc: remove the big kernel lock The sunrpc cache_ioctl function does not need the big kernel lock because it uses its own queue_lock already. rpc_pipe_ioctl apparently should be using i_lock like the other operations on the pipe file descriptor do. Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2010-10-19 11:29:59 +02:00
Namhyung Kim	1fa4f3b57c	init/main.c: remove BKL notations According to commit `5e3d20a68f` (init: Remove the BKL from startup code) these sparse notations should be removed also. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2010-10-19 11:29:58 +02:00
Arnd Bergmann	01b284f9b6	blktrace: remove the big kernel lock According to Jens, this code does not need the BKL at all, it is sufficiently serialized by bd_mutex. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Jens Axboe <jaxboe@fusionio.com> Cc: Steven Rostedt <rostedt@goodmis.org>	2010-10-19 11:29:57 +02:00
Arnd Bergmann	0fc86c7bd9	rtmutex-tester: make it build without BKL The big kernel lock is going away, so make sure that if it is disabled by Kconfig, we do not try to validate it, which would result in compile errors. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org>	2010-10-19 11:29:56 +02:00
Arnd Bergmann	72024f1ec5	dvb-core: kill the big kernel lock The dvb core only uses the big kernel lock in the open and ioctl functions, which means it can be replaced with a dvb specific mutex. Fortunately, all the ioctl functions go through dvb_usercopy, so we can move the serialization in there. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: linux-media@vger.kernel.org	2010-10-19 11:29:56 +02:00
Arnd Bergmann	adfedd216d	dvb/bt8xx: kill the big kernel lock The bt8xx driver only uses the big kernel lock in its dst_ca_ioctl function and never to serialize against other code, so we can trivially replace it with a private mutex. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: linux-media@vger.kernel.org Cc: Mauro Carvalho Chehab <mchehab@infradead.org>	2010-10-19 11:29:55 +02:00
Arnd Bergmann	efbec1cd04	tlclk: remove big kernel lock This driver already has a global mutex, so let's just use that in the open function instead of the BKL. It may not even be needed there, but this patch should have the smallest impact. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Mark Gross <mark.gross@intel.com>	2010-10-19 11:29:54 +02:00
Al Viro	c4a0472725	fix rawctl compat ioctls breakage on amd64 and itanic RAW_SETBIND and RAW_GETBIND 32bit versions are fscked in interesting ways. 1) fs/compat_ioctl.c has COMPATIBLE_IOCTL(RAW_SETBIND) followed by HANDLE_IOCTL(RAW_SETBIND, raw_ioctl). The latter is ignored. 2) on amd64 (and itanic) the damn thing is broken - we have int + u64 + u64 and layouts on i386 and amd64 are _not_ the same. raw_ioctl() would work there, but it's never called due to (1). As it is, i386 /sbin/raw definitely doesn't work on amd64 boxen. 3) switching to raw_ioctl() as is would not work on e.g. sparc64 and ppc64, which would be rather sad, seeing that normal userland there is 32bit. The thing is, slapping __packed on the struct in question does not DTRT - it eliminates all padding. The real solution is to use compat_u64. 4) of course, all that stuff has no business being outside of raw.c in the first place - there should be ->compat_ioctl() for /dev/rawctl instead of messing with compat_ioctl.c. [akpm@linux-foundation.org: coding-style fixes] [arnd@arndb.de: port to 2.6.36] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2010-10-19 11:29:54 +02:00
Arnd Bergmann	9a181c5861	uml: kill big kernel lock Three uml device drivers still use the big kernel lock, but all of them can be safely converted to using a per-driver mutex instead. Most likely this is not even necessary, so after further review these can and should be removed as well. The exec system call no longer requires the BKL either, so remove it from there, too. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Jeff Dike <jdike@addtoit.com> Cc: user-mode-linux-devel@lists.sourceforge.net	2010-10-19 11:29:42 +02:00
Yinghai Lu	9717967c4b	x86: ioapic: Call free_irte only if interrupt remapping enabled On a system that support intr-rempping when booting with "intremap=off" [ 177.895501] BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8 [ 177.913316] IP: [<ffffffff8145fc18>] free_irte+0x47/0xc0 ... [ 178.173326] Call Trace: [ 178.173574] [<ffffffff810515b4>] destroy_irq+0x3a/0x75 [ 178.192934] [<ffffffff81051834>] arch_teardown_msi_irq+0xe/0x10 [ 178.193418] [<ffffffff81458dc3>] arch_teardown_msi_irqs+0x56/0x7f [ 178.213021] [<ffffffff81458e79>] free_msi_irqs+0x8d/0xeb Call free_irte only when interrupt remapping is enabled. Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4CBCB274.7010108@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-10-19 09:25:33 +02:00
Paul Mackerras	57fa721433	perf, powerpc: Fix power_pmu_event_init to not use event->ctx Commit `c3f00c70` ("perf: Separate find_get_context() from event initialization") changed the generic perf_event code to call perf_event_alloc, which calls the arch-specific event_init code, before looking up the context for the new event. Unfortunately, power_pmu_event_init uses event->ctx->task to see whether the new event is a per-task event or a system-wide event, and thus crashes since event->ctx is NULL at the point where power_pmu_event_init gets called. (The reason it needs to know whether it is a per-task event is because there are some hardware events on Power systems which only count when the processor is not idle, and there are some fixed-function counters which count such events. For example, the "run cycles" event counts cycles when the processor is not idle. If the user asks to count cycles, we can use "run cycles" if this is a per-task event, since the processor is running when the task is running, by definition. We can't use "run cycles" if the user asks for "cycles" on a system-wide counter.) Fortunately the information we need is in the event->attach_state field, so we just use that instead. Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20101019055535.GA10398@drongo> Signed-off-by: Ingo Molnar <mingo@elte.hu> Reported-by: Alexey Kardashevskiy <aik@au1.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-19 09:18:34 +02:00
Ingo Molnar	1fa41266e9	Merge branch 'tip/perf/recordmcount-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core	2010-10-19 08:21:10 +02:00
Kukjin Kim	fe0cdec8ba	ARM: S5PV310: Fix build error on GPIO map This patch fixes build error about GPIO address due to conflict of commit `4d914705` and `19a2c065`. - commit 4d914705: Fix on GPIO base addresses - commit 19a2c065: Moves initial map for merging S5P64X0 Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>	2010-10-19 08:02:57 +09:00
Russell King	a0a55682b8	Merge branch 'hotplug' into devel Conflicts: arch/arm/kernel/head-common.S	2010-10-18 22:34:47 +01:00
Russell King	23beab76b4	Merge branches 'at91', 'dcache', 'ftrace', 'hwbpt', 'misc', 'mmci', 's3c', 'st-ux' and 'unwind' into devel	2010-10-18 22:34:25 +01:00
Steven Rostedt	d7b4d6de57	ftrace: Remove recursion between recordmcount and scripts/mod/empty When DYNAMIC_FTRACE is enabled and we use the C version of recordmcount, all objects are run through the recordmcount program to create a separate section that stores all the callers of mcount. The build process has a special file: scripts/mod/empty.o. This is built from empty.c which is literally an empty file (except for a single comment). This file is used to find information about the target elf format, like endianness and word size. The problem comes up when we need to build recordmcount. The build process requires that empty.o is built first. The build rules for empty.o will try to execute recordmcount on the empty.o file. We get an error that recordmcount does not exist. To avoid this recursion, the build file will skip running recordmcount if the file that it is building is script/mod/empty.o. [ extra comment Suggested-by: Sam Ravnborg <sam@ravnborg.org> ] Reported-by: Ingo Molnar <mingo@elte.hu> Tested-by: Ingo Molnar <mingo@elte.hu> Cc: Michal Marek <mmarek@suse.cz> Cc: linux-kbuild@vger.kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-10-18 17:08:10 -04:00
Srinidhi Kasagar	f3af03de0b	ARM: 6441/1: ux500: The platform is not just based on early drop silicon version. Update Kconfig text accordingly. Signed-off-by: srinidhi kasagar <srinidhi.kasagar@stericsson.com> Acked-by: Linus Walleij <linus.walleij@stericsson.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>	2010-10-18 22:07:26 +01:00
Linus Torvalds	547af560dd	Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus * 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus: MIPS: Enable ISA_DMA_API config to fix build failure MIPS: 32-bit: Fix build failure in asm/fcntl.h MIPS: Remove all generated vmlinuz* files on "make clean" MIPS: do_sigaltstack() expects userland pointers MIPS: Fix error values in case of bad_stack MIPS: Sanitize restart logics MIPS: secure_computing, syscall audit: syscall number should in r2, not r0. MIPS: Don't block signals if we'd failed to setup a sigframe	2010-10-18 13:10:36 -07:00
Linus Torvalds	b0579fc089	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: evdev - fix EVIOCSABS regression Input: evdev - fix Ooops in EVIOCGABS/EVIOCSABS	2010-10-18 13:10:08 -07:00
Linus Torvalds	7f81c56cf2	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6: firewire: ohci: fix TI TSB82AA2 regression since 2.6.35	2010-10-18 13:09:26 -07:00
Sascha Hauer	63f1474c69	mxc_nand: do not depend on disabling the irq in the interrupt handler This patch reverts the driver to enabling/disabling the NFC interrupt mask rather than enabling/disabling the system interrupt. This cleans up the driver so that it doesn't rely on interrupts being disabled within the interrupt handler. For i.MX21 we keep the current behaviour, that is calling enable_irq/disable_irq_nosync to enable/disable interrupts. This patch is based on earlier work by John Ogness. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Acked-by: John Ogness <john.ogness@linutronix.de> Tested-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-10-18 13:09:05 -07:00
Linus Torvalds	f68c834b04	Merge branch 'for-linus/i2c/2636-rc8' of git://git.fluff.org/bjdooks/linux * 'for-linus/i2c/2636-rc8' of git://git.fluff.org/bjdooks/linux: i2c-imx: do not allow interruptions when waiting for I2C to complete i2c-davinci: Fix TX setup for more SoCs	2010-10-18 13:05:10 -07:00
Linus Torvalds	822a2e4524	Merge branch 'fixes' * fixes: v4l1: fix 32-bit compat microcode loading translation De-pessimize rds_page_copy_user	2010-10-18 13:04:33 -07:00
Ingo Molnar	b7dadc3879	sched: Export account_system_vtime() KVM uses it for example: ERROR: "account_system_vtime" [arch/x86/kvm/kvm.ko] undefined! Cc: Venkatesh Pallipadi <venki@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 20:52:30 +02:00
Venkatesh Pallipadi	d267f87fb8	sched: Call tick_check_idle before __irq_enter When CPU is idle and on first interrupt, irq_enter calls tick_check_idle() to notify interruption from idle. But, there is a problem if this call is done after __irq_enter, as all routines in __irq_enter may find stale time due to yet to be done tick_check_idle. Specifically, trace calls in __irq_enter when they use global clock and also account_system_vtime change in this patch as it wants to use sched_clock_cpu() to do proper irq timing. But, tick_check_idle was moved after __irq_enter intentionally to prevent problem of unneeded ksoftirqd wakeups by the commit ee5f80a: irq: call __irq_enter() before calling the tick_idle_check Impact: avoid spurious ksoftirqd wakeups Moving tick_check_idle() before __irq_enter and wrapping it with local_bh_enable/disable would solve both the problems. Fixed-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-9-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 20:52:29 +02:00
Venkatesh Pallipadi	aa48380851	sched: Remove irq time from available CPU power The idea was suggested by Peter Zijlstra here: http://marc.info/?l=linux-kernel&m=127476934517534&w=2 irq time is technically not available to the tasks running on the CPU. This patch removes irq time from CPU power piggybacking on sched_rt_avg_update(). Tested this by keeping CPU X busy with a network intensive task having 75% oa a single CPU irq processing (hard+soft) on a 4-way system. And start seven cycle soakers on the system. Without this change, there will be two tasks on each CPU. With this change, there is a single task on irq busy CPU X and remaining 7 tasks are spread around among other 3 CPUs. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 20:52:27 +02:00
Venkatesh Pallipadi	305e6835e0	sched: Do not account irq time to current task Scheduler accounts both softirq and interrupt processing times to the currently running task. This means, if the interrupt processing was for some other task in the system, then the current task ends up being penalized as it gets shorter runtime than otherwise. Change sched task accounting to acoount only actual task time from currently running task. Now update_curr(), modifies the delta_exec to depend on rq->clock_task. Note that this change only handles CONFIG_IRQ_TIME_ACCOUNTING case. We can extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But, thats for later. This change will impact scheduling behavior in interrupt heavy conditions. Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2 spending 75%+ of its time in irq processing. CPU 3 spending around 35% time running nc task. Now, if I run another CPU intensive task on CPU 2, without this change /proc/<pid>/schedstat shows 100% of time accounted to this task. With this change, it rightly shows less than 25% accounted to this task as remaining time is actually spent on irq processing. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 20:52:26 +02:00
Venkatesh Pallipadi	e82b8e4ea4	x86: Add IRQ_TIME_ACCOUNTING This patch adds IRQ_TIME_ACCOUNTING option on x86 and runtime enables it when TSC is enabled. This change just enables fine grained irq time accounting, isn't used yet. Following patches use it for different purposes. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-6-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 20:52:25 +02:00
Venkatesh Pallipadi	b52bfee445	sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does the fine granularity accounting of user, system, hardirq, softirq times. Adding that option on archs like x86 will be challenging however, given the state of TSC reliability on various platforms and also the overhead it will add in syscall entry exit. Instead, add a lighter variant that only does finer accounting of hardirq and softirq times, providing precise irq times (instead of timer tick based samples). This accounting is added with a new config option CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not interested in paying the perf penalty. This accounting is based on sched_clock, with the code being generic. So, other archs may find it useful as well. This patch just adds the core logic and does not enable this logic yet. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 20:52:24 +02:00
Venkatesh Pallipadi	6cdd5199da	sched: Add a PF flag for ksoftirqd identification To account softirq time cleanly in scheduler, we need to identify whether softirq is invoked in ksoftirqd context or softirq at hardirq tail context. Add PF_KSOFTIRQD for that purpose. As all PF flag bits are currently taken, create space by moving one of the infrequently used bits (PF_THREAD_BOUND) down in task_struct to be along with some other state fields. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 20:52:22 +02:00
Venkatesh Pallipadi	e1e10a265d	sched: Consolidate account_system_vtime extern declaration Just a minor cleanup patch that makes things easier to the following patches. No functionality change in this patch. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 20:52:21 +02:00
Venkatesh Pallipadi	75e1056f5c	sched: Fix softirq time accounting Peter Zijlstra found a bug in the way softirq time is accounted in VIRT_CPU_ACCOUNTING on this thread: http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html The problem is, softirq processing uses local_bh_disable internally. There is no way, later in the flow, to differentiate between whether softirq is being processed or is it just that bh has been disabled. So, a hardirq when bh is disabled results in time being wrongly accounted as softirq. Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING as well. As account_system_time() in normal tick based accouting also uses softirq_count, which will be set even when not in softirq with bh disabled. Peter also suggested solution of using 2*SOFTIRQ_OFFSET as irq count for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq processing. The patch below does that and adds API in_serving_softirq() which returns whether we are currently processing softirq or not. Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c to in_serving_softirq. Looks like many usages of in_softirq really want in_serving_softirq. Those changes can be made individually on a case by case basis. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 20:52:20 +02:00
Nikhil Rao	75dd321d79	sched: Drop group_capacity to 1 only if local group has extra capacity When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1 only if the local group has extra capacity. The extra check prevents the case where you always pull from the heaviest group when it is already under-utilized (possible with a large weight task outweighs the tasks on the system). For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task, and each task is running on one core. In this case, we observe the following events when balancing at the NUMA domain: - find_busiest_group() will always pick the sched group containing the niced task to be the busiest group. - find_busiest_queue() will then always pick one of the cpus running the nice0 task (never picks the cpu with the nice -15 task since weighted_cpuload > imbalance). - The load balancer fails to migrate the task since it is the running task and increments sd->nr_balance_failed. - It repeats the above steps a few more times until sd->nr_balance_failed > 5, at which point it kicks off the active load balancer, wakes up the migration thread and kicks the nice 0 task off the cpu. The load balancer doesn't stop until we kick out all nice 0 tasks from the sched group, leaving you with 3 idle cpus and one cpu running the nice -15 task. When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child domain (in this case MC) has SD_PREFER_SIBLING set. Subsequent load checks are not relevant because the niced task has a very large weight. In this patch, we add an extra condition to the "if(prefer_sibling)" check in update_sd_lb_stats(). We drop the capacity of a group only if the local group has extra capacity, ie. nr_running < group_capacity. This patch preserves the original intent of the prefer_siblings check (to spread tasks across the system in low utilization scenarios) and fixes the case above. It helps in the following ways: - In low utilization cases (where nr_tasks << nr_cpus), we still drop group_capacity down to 1 if we prefer siblings. - On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most likely be > sgs.group_capacity. - When balancing large weight tasks, if the local group does not have extra capacity, we do not pick the group with the niced task as the busiest group. This prevents failed balances, active migration and the under-utilization described above. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 20:52:19 +02:00
Nikhil Rao	fab476228b	sched: Force balancing on newidle balance if local group has capacity This patch forces a load balance on a newly idle cpu when the local group has extra capacity and the busiest group does not have any. It improves system utilization when balancing tasks with a large weight differential. Under certain situations, such as a niced down task (i.e. nice = -15) in the presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and kicks away other tasks because of its large weight. This leads to sub-optimal utilization of the machine. Even though the sched group has capacity, it does not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL. With this patch, if the local group has extra capacity, we shortcut the checks in f_b_g() and try to pull a task over. A sched group has extra capacity if the group capacity is greater than the number of running tasks in that group. Thanks to Mike Galbraith for discussions leading to this patch and for the insight to reuse SD_NEWIDLE_BALANCE. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1287173550-30365-4-git-send-email-ncrao@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 20:52:18 +02:00
Nikhil Rao	2582f0eba5	sched: Set group_imb only a task can be pulled from the busiest cpu When cycling through sched groups to determine the busiest group, set group_imb only if the busiest cpu has more than 1 runnable task. This patch fixes the case where two cpus in a group have one runnable task each, but there is a large weight differential between these two tasks. The load balancer is unable to migrate any task from this group, and hence do not consider this group to be imbalanced. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286996978-7007-3-git-send-email-ncrao@google.com> [ small code readability edits ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 20:52:17 +02:00
Nikhil Rao	ef8002f684	sched: Do not consider SCHED_IDLE tasks to be cache hot This patch adds a check in task_hot to return if the task has SCHED_IDLE policy. SCHED_IDLE tasks have very low weight, and when run with regular workloads, are typically scheduled many milliseconds apart. There is no need to consider these tasks hot for load balancing. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1287173550-30365-2-git-send-email-ncrao@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 20:52:15 +02:00
Peter Zijlstra	ebf31f5024	jump_label: Add COND_STMT(), reducer wrappery The use of the JUMP_LABEL() construct ends up creating endless silly wrappers, create a higher level construct to reduce this clutter. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 19:59:01 +02:00
Peter Zijlstra	7e54a5a0b6	perf: Optimize sw events Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 19:58:59 +02:00
Peter Zijlstra	82cd6def98	perf: Use jump_labels to optimize the scheduler hooks Trades a call + conditional + ret for an unconditional jmp. Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101014203625.501657727@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 19:58:58 +02:00
Peter Zijlstra	8b92538d84	jump_label: Add atomic_t interface Add an interface to allow usage of jump_labels with atomic counters. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20101014203625.501657727@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 19:58:57 +02:00
Peter Zijlstra	3b6e901f83	jump_label: Use more consistent naming Now that there's still only a few users around, rename things to make them more consistent. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101014203625.448565169@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 19:58:56 +02:00
Peter Zijlstra	d580ff8699	perf, hw_breakpoint: Fix crash in hw_breakpoint creation hw_breakpoint creation needs to account stuff per-task to ensure there is always sufficient hardware resources to back these things due to ptrace. With the perf per pmu context changes the event initialization no longer has access to the event context, for the simple reason that we need to first find the pmu (result of initialization) before we can find the context. This makes hw_breakpoints unhappy, because it can no longer do per task accounting, cure this by frobbing a task pointer in the event::hw bits for now... Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20101014203625.391543667@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 19:58:55 +02:00
Peter Zijlstra	c6be5a5cb6	perf: Find task before event alloc So that we can pass the task pointer to the event allocation, so that we can use task associated data during event initialization. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101014203625.340789919@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 19:58:54 +02:00
Peter Zijlstra	e7d0bc0475	perf: Fix task refcount bugs Currently it looks like find_lively_task_by_vpid() takes a task ref and relies on find_get_context() to drop it. The problem is that perf_event_create_kernel_counter() shouldn't be dropping task refs. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Matt Helsley <matthltc@us.ibm.com> LKML-Reference: <20101014203625.278436085@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 19:58:52 +02:00
Peter Zijlstra	74c3337c2f	perf: Fix group moving Matt found we trigger the WARN_ON_ONCE() in perf_group_attach() when we take the move_group path in perf_event_open(). Since we cannot de-construct the group (we rely on it to move the events), we have to simply ignore the double attach. The group state is context invariant and doesn't need changing. Reported-by: Matt Fleming <matt@console-pimps.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1287135757.29097.1368.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 19:58:51 +02:00
Peter Zijlstra	e360adbe29	irq_work: Add generic hardirq context callbacks Provide a mechanism that allows running code in IRQ context. It is most useful for NMI code that needs to interact with the rest of the system -- like wakeup a task to drain buffers. Perf currently has such a mechanism, so extract that and provide it as a generic feature, independent of perf so that others may also benefit. The IRQ context callback is generated through self-IPIs where possible, or on architectures like powerpc the decrementer (the built-in timer facility) is set to generate an interrupt immediately. Architectures that don't have anything like this get to do with a callback from the timer tick. These architectures can call irq_work_run() at the tail of any IRQ handlers that might enqueue such work (like the perf IRQ handler) to avoid undue latencies in processing the work. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Kyle McMartin <kyle@mcmartin.ca> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [ various fixes ] Signed-off-by: Huang Ying <ying.huang@intel.com> LKML-Reference: <1287036094.7768.291.camel@yhuang-dev> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 19:58:50 +02:00
Stephane Eranian	8e5fc1a732	perf_events: Fix transaction recovery in group_sched_in() The group_sched_in() function uses a transactional approach to schedule a group of events. In a group, either all events can be scheduled or none are. To schedule each event in, the function calls event_sched_in(). In case of error, event_sched_out() is called on each event in the group. The problem is that event_sched_out() does not completely cancel the effects of event_sched_in(). Furthermore event_sched_out() changes the state of the event as if it had run which is not true is this particular case. Those inconsistencies impact time tracking fields and may lead to events in a group not all reporting the same time_enabled and time_running values. This is demonstrated with the example below: $ task -eunhalted_core_cycles,baclears,baclears -e unhalted_core_cycles,baclears,baclears sleep 5 `1946101` unhalted_core_cycles (32.85% scaling, ena=829181, run=556827) 11423 baclears (32.85% scaling, ena=829181, run=556827) 7671 baclears (0.00% scaling, ena=556827, run=556827) 2250443 unhalted_core_cycles (57.83% scaling, ena=962822, run=405995) 11705 baclears (57.83% scaling, ena=962822, run=405995) 11705 baclears (57.83% scaling, ena=962822, run=405995) Notice that in the first group, the last baclears event does not report the same timings as its siblings. This issue comes from the fact that tstamp_stopped is updated by event_sched_out() as if the event had actually run. To solve the issue, we must ensure that, in case of error, there is no change in the event state whatsoever. That means timings must remain as they were when entering group_sched_in(). To do this we defer updating tstamp_running until we know the transaction succeeded. Therefore, we have split event_sched_in() in two parts separating the update to tstamp_running. Similarly, in case of error, we do not want to update tstamp_stopped. Therefore, we have split event_sched_out() in two parts separating the update to tstamp_stopped. With this patch, we now get the following output: $ task -eunhalted_core_cycles,baclears,baclears -e unhalted_core_cycles,baclears,baclears sleep 5 2492050 unhalted_core_cycles (71.75% scaling, ena=1093330, run=308841) 11243 baclears (71.75% scaling, ena=1093330, run=308841) 11243 baclears (71.75% scaling, ena=1093330, run=308841) 1852746 unhalted_core_cycles (0.00% scaling, ena=784489, run=784489) 9253 baclears (0.00% scaling, ena=784489, run=784489) 9253 baclears (0.00% scaling, ena=784489, run=784489) Note that the uneven timing between groups is a side effect of the process spending most of its time sleeping, i.e., not enough event rotations (but that's a separate issue). Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4cb86b4c.41e9d80a.44e9.3e19@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 19:58:49 +02:00
Stephane Eranian	ba0cef3d14	perf_events: Fix bogus AMD64 generic TLB events PERF_COUNT_HW_CACHE_DTLB:READ:MISS had a bogus umask value of 0 which counts nothing. Needed to be 0x7 (to count all possibilities). PERF_COUNT_HW_CACHE_ITLB:READ:MISS had a bogus umask value of 0 which counts nothing. Needed to be 0x3 (to count all possibilities). Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: <stable@kernel.org> # as far back as it applies LKML-Reference: <4cb85478.41e9d80a.44e2.3f00@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 19:58:48 +02:00

... 4 5 6 7 8 ...

213017 Commits (adbb39b5485b72dca963a2bc9b1b22bfc19d4967) All Branches Search

213017 Commits (adbb39b5485b72dca963a2bc9b1b22bfc19d4967)

All Branches