Commit Graph

1135 Commits (a34db9b28a1c63317e1d6f1080a12d711579e7d0)

Author SHA1 Message Date
Ingo Molnar a34db9b28a [PATCH] genirq: update copyrights
Update/add copyrights in the generic IRQ code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner 94d39e1f6e [PATCH] genirq: add IRQ_NOAUTOEN support
Enable platforms to disable the automatic enabling of freshly set up irqs.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner 6550c775cb [PATCH] genirq: add IRQ_NOREQUEST support
Enable platforms to disable request_irq() for certain interrupts.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner 3418d72404 [PATCH] genirq: add IRQ_NOPROBE support
Introduce IRQ_NOPROBE: enables platforms to control chip-probing.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner a4633adcdb [PATCH] genirq: add genirq sw IRQ-retrigger
Enable platforms that do not have a hardware-assisted hardirq-resend mechanism
to resend them via a softirq-driven IRQ emulation mechanism.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar 77a5afecdb [PATCH] genirq: cleanup: no_irq_type cleanups
Clean up no_irq_type: share the NOP functions where possible, and properly
name the ack_bad() function.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar 8d28bc751b [PATCH] genirq: doc: handle_IRQ_event() and __do_IRQ() comments
Document handle_IRQ_event() and __do_IRQ().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar c0ad90a32f [PATCH] genirq: add ->retrigger() irq op to consolidate hw_irq_resend()
Add ->retrigger() irq op to consolidate hw_irq_resend() implementations.
(Most architectures had it defined to NOP anyway.)

NOTE: ia64 needs testing. i386 and x86_64 tested.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Thomas Gleixner 096c8131c5 [PATCH] genirq: debug: better debug printout in enable_irq()
Make enable_irq() debug printouts user-readable.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar 0d7012a968 [PATCH] genirq: cleanup: turn ARCH_HAS_IRQ_PER_CPU into CONFIG_IRQ_PER_CPU
Cleanup: change ARCH_HAS_IRQ_PER_CPU into a Kconfig method.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar cd916d31cc [PATCH] genirq: cleanup: merge pending_irq_cpumask[] into irq_desc[]
Consolidation: remove the pending_irq_cpumask[NR_IRQS] array and move it into
the irq_desc[NR_IRQS].pending_mask field.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar 4a733ee126 [PATCH] genirq: cleanup: merge irq_dir[], smp_affinity_entry[] into irq_desc[]
Consolidation: remove the irq_dir[NR_IRQS] and the smp_affinity_entry[NR_IRQS]
arrays and move them into the irq_desc[] array.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar 34ffdb7233 [PATCH] genirq: cleanup: reduce irq_desc_t use, mark it obsolete
Cleanup: remove irq_desc_t use from the generic IRQ code, and mark it
obsolete.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar 06fcb0c6fb [PATCH] genirq: cleanup: misc code cleanups
Assorted code cleanups to the generic IRQ code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar 2e60bbb6d5 [PATCH] genirq: cleanup: remove fastcall
Now that i386 defaults to regparm, explicit uses of fastcall are not needed
anymore.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar a8553acd6c [PATCH] genirq: cleanup: remove irq_descp()
Cleanup: remove irq_descp() - explicit use of irq_desc[] is shorter and more
readable.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar a53da52fd7 [PATCH] genirq: cleanup: merge irq_affinity[] into irq_desc[]
Consolidation: remove the irq_affinity[NR_IRQS] array and move it into the
irq_desc[NR_IRQS].affinity field.

[akpm@osdl.org: sparc64 build fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar 74ffd553a3 [PATCH] genirq: sem2mutex probe_sem -> probing_active
Convert the irq auto-probing semaphore to a mutex.  (This allows us to find
probing API usage bugs sooner, via the mutex debugging code.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:21 -07:00
Ingo Molnar d1bef4ed5f [PATCH] genirq: rename desc->handler to desc->chip
This patch-queue improves the generic IRQ layer to be truly generic, by adding
various abstractions and features to it, without impacting existing
functionality.

While the queue can be best described as "fix and improve everything in the
generic IRQ layer that we could think of", and thus it consists of many
smaller features and lots of cleanups, the one feature that stands out most is
the new 'irq chip' abstraction.

The irq-chip abstraction is about describing and coding and IRQ controller
driver by mapping its raw hardware capabilities [and quirks, if needed] in a
straightforward way, without having to think about "IRQ flow"
(level/edge/etc.) type of details.

This stands in contrast with the current 'irq-type' model of genirq
architectures, which 'mixes' raw hardware capabilities with 'flow' details.
The patchset supports both types of irq controller designs at once, and
converts i386 and x86_64 to the new irq-chip design.

As a bonus side-effect of the irq-chip approach, chained interrupt controllers
(master/slave PIC constructs, etc.) are now supported by design as well.

The end result of this patchset intends to be simpler architecture-level code
and more consolidation between architectures.

We reused many bits of code and many concepts from Russell King's ARM IRQ
layer, the merging of which was one of the motivations for this patchset.

This patch:

rename desc->handler to desc->chip.

Originally i did not want to do this, because it's a big patch.  But having
both "desc->handler", "desc->handle_irq" and "action->handler" caused a
large degree of confusion and made the code appear alot less clean than it
truly is.

I have also attempted a dual approach as well by introducing a
desc->chip alias - but that just wasnt robust enough and broke
frequently.

So lets get over with this quickly.  The conversion was done automatically
via scripts and converts all the code in the kernel.

This renaming patch is the first one amongst the patches, so that the
remaining patches can stay flexible and can be merged and split up
without having some big monolithic patch act as a merge barrier.

[akpm@osdl.org: build fix]
[akpm@osdl.org: another build fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:21 -07:00
Andrew Morton 84860f9979 [PATCH] load_module() cleanup
Undo bizarre declaration in load_module().

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-28 14:59:04 -07:00
Arjan van de Ven f71d20e961 [PATCH] Add EXPORT_UNUSED_SYMBOL and EXPORT_UNUSED_SYMBOL_GPL
Temporarily add EXPORT_UNUSED_SYMBOL and EXPORT_UNUSED_SYMBOL_GPL.  These
will be used as a transition measure for symbols that aren't used in the
kernel and are on the way out.  When a module uses such a symbol, a warning
is printk'd at modprobe time.

The main reason for removing unused exports is size: eacho export takes
roughly between 100 and 150 bytes of kernel space in the binary.  This
patch gives users the option to immediately get this size gain via a config
option.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-28 14:59:04 -07:00
Jesper Juhl 9a66a53f55 [PATCH] Remove redundant NULL checks before [kv]free - in kernel/
Remove redundant kfree NULL checks from kernel/

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:48 -07:00
Sebastien Dugue 59e0e0ace7 [PATCH] futex_requeue() optimization
In futex_requeue(), when the 2 futexes keys hash to the same bucket, there
is no need to move the futex_q to the end of the bucket list.

Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:48 -07:00
Thomas Gleixner 95e02ca9bb [PATCH] rtmutex: Propagate priority settings into PI lock chains
When the priority of a task, which is blocked on a lock, changes we must
propagate this change into the PI lock chain.  Therefor the chain walk code
is changed to get rid of the references to current to avoid false positives
in the deadlock detector, as setscheduler might be called by a task which
holds the lock on which the task whose priority is changed is blocked.

Also add some comments about the get/put_task_struct usage to avoid
confusion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:48 -07:00
Thomas Gleixner 0bafd214e4 [PATCH] rtmutex: Modify rtmutex-tester to test the setscheduler propagation
Make test suite setscheduler calls asynchronously.  Remove the waits in the
test cases and add a new testcase to verify the correctness of the
setscheduler priority propagation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Thomas Gleixner e74c69f46d [PATCH] Drop tasklist lock in do_sched_setscheduler
There is no need to hold tasklist_lock across the setscheduler call, when
we pin the task structure with get_task_struct().  Interrupts are disabled
in setscheduler anyway and the permission checks do not need interrupts
disabled.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar c87e2837be [PATCH] pi-futex: futex_lock_pi/futex_unlock_pi support
This adds the actual pi-futex implementation, based on rt-mutexes.

[dino@in.ibm.com: fix an oops-causing race]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar 0cdbee9920 [PATCH] pi-futex: rt mutex futex api
Add proxy-locking rt-mutex functionality needed by pi-futexes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Thomas Gleixner 61a8712286 [PATCH] pi-futex: rt mutex tester
RT-mutex tester: scriptable tester for rt mutexes, which allows userspace
scripting of mutex unit-tests (and dynamic tests as well), using the actual
rt-mutex implementation of the kernel.

[akpm@osdl.org: fixlet]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar e7eebaf6a8 [PATCH] pi-futex: rt mutex debug
Runtime debugging functionality for rt-mutexes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar 23f78d4a03 [PATCH] pi-futex: rt mutex core
Core functions for the rt-mutex subsystem.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar b29739f902 [PATCH] pi-futex: scheduler support for pi
Add framework to boost/unboost the priority of RT tasks.

This consists of:

 - caching the 'normal' priority in ->normal_prio
 - providing a functions to set/get the priority of the task
 - make sched_setscheduler() aware of boosting

The effective_prio() cleanups also fix a priority-calculation bug pointed out
by Andrey Gelman, in set_user_nice().

has_rt_policy() fix: Peter Williams <pwil3058@bigpond.net.au>

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andrey Gelman <agelman@012.net.il>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:46 -07:00
Ingo Molnar e2970f2fb6 [PATCH] pi-futex: futex code cleanups
We are pleased to announce "lightweight userspace priority inheritance" (PI)
support for futexes.  The following patchset and glibc patch implements it,
ontop of the robust-futexes patchset which is included in 2.6.16-mm1.

We are calling it lightweight for 3 reasons:

 - in the user-space fastpath a PI-enabled futex involves no kernel work
   (or any other PI complexity) at all.  No registration, no extra kernel
   calls - just pure fast atomic ops in userspace.

 - in the slowpath (in the lock-contention case), the system call and
   scheduling pattern is in fact better than that of normal futexes, due to
   the 'integrated' nature of FUTEX_LOCK_PI.  [more about that further down]

 - the in-kernel PI implementation is streamlined around the mutex
   abstraction, with strict rules that keep the implementation relatively
   simple: only a single owner may own a lock (i.e.  no read-write lock
   support), only the owner may unlock a lock, no recursive locking, etc.

  Priority Inheritance - why, oh why???
  -------------------------------------

Many of you heard the horror stories about the evil PI code circling Linux for
years, which makes no real sense at all and is only used by buggy applications
and which has horrible overhead.  Some of you have dreaded this very moment,
when someone actually submits working PI code ;-)

So why would we like to see PI support for futexes?

We'd like to see it done purely for technological reasons.  We dont think it's
a buggy concept, we think it's useful functionality to offer to applications,
which functionality cannot be achieved in other ways.  We also think it's the
right thing to do, and we think we've got the right arguments and the right
numbers to prove that.  We also believe that we can address all the
counter-arguments as well.  For these reasons (and the reasons outlined below)
we are submitting this patch-set for upstream kernel inclusion.

What are the benefits of PI?

  The short reply:
  ----------------

User-space PI helps achieving/improving determinism for user-space
applications.  In the best-case, it can help achieve determinism and
well-bound latencies.  Even in the worst-case, PI will improve the statistical
distribution of locking related application delays.

  The longer reply:
  -----------------

Firstly, sharing locks between multiple tasks is a common programming
technique that often cannot be replaced with lockless algorithms.  As we can
see it in the kernel [which is a quite complex program in itself], lockless
structures are rather the exception than the norm - the current ratio of
lockless vs.  locky code for shared data structures is somewhere between 1:10
and 1:100.  Lockless is hard, and the complexity of lockless algorithms often
endangers to ability to do robust reviews of said code.  I.e.  critical RT
apps often choose lock structures to protect critical data structures, instead
of lockless algorithms.  Furthermore, there are cases (like shared hardware,
or other resource limits) where lockless access is mathematically impossible.

Media players (such as Jack) are an example of reasonable application design
with multiple tasks (with multiple priority levels) sharing short-held locks:
for example, a highprio audio playback thread is combined with medium-prio
construct-audio-data threads and low-prio display-colory-stuff threads.  Add
video and decoding to the mix and we've got even more priority levels.

So once we accept that synchronization objects (locks) are an unavoidable fact
of life, and once we accept that multi-task userspace apps have a very fair
expectation of being able to use locks, we've got to think about how to offer
the option of a deterministic locking implementation to user-space.

Most of the technical counter-arguments against doing priority inheritance
only apply to kernel-space locks.  But user-space locks are different, there
we cannot disable interrupts or make the task non-preemptible in a critical
section, so the 'use spinlocks' argument does not apply (user-space spinlocks
have the same priority inversion problems as other user-space locking
constructs).  Fact is, pretty much the only technique that currently enables
good determinism for userspace locks (such as futex-based pthread mutexes) is
priority inheritance:

Currently (without PI), if a high-prio and a low-prio task shares a lock [this
is a quite common scenario for most non-trivial RT applications], even if all
critical sections are coded carefully to be deterministic (i.e.  all critical
sections are short in duration and only execute a limited number of
instructions), the kernel cannot guarantee any deterministic execution of the
high-prio task: any medium-priority task could preempt the low-prio task while
it holds the shared lock and executes the critical section, and could delay it
indefinitely.

  Implementation:
  ---------------

As mentioned before, the userspace fastpath of PI-enabled pthread mutexes
involves no kernel work at all - they behave quite similarly to normal
futex-based locks: a 0 value means unlocked, and a value==TID means locked.
(This is the same method as used by list-based robust futexes.) Userspace uses
atomic ops to lock/unlock these mutexes without entering the kernel.

To handle the slowpath, we have added two new futex ops:

  FUTEX_LOCK_PI
  FUTEX_UNLOCK_PI

If the lock-acquire fastpath fails, [i.e.  an atomic transition from 0 to TID
fails], then FUTEX_LOCK_PI is called.  The kernel does all the remaining work:
if there is no futex-queue attached to the futex address yet then the code
looks up the task that owns the futex [it has put its own TID into the futex
value], and attaches a 'PI state' structure to the futex-queue.  The pi_state
includes an rt-mutex, which is a PI-aware, kernel-based synchronization
object.  The 'other' task is made the owner of the rt-mutex, and the
FUTEX_WAITERS bit is atomically set in the futex value.  Then this task tries
to lock the rt-mutex, on which it blocks.  Once it returns, it has the mutex
acquired, and it sets the futex value to its own TID and returns.  Userspace
has no other work to perform - it now owns the lock, and futex value contains
FUTEX_WAITERS|TID.

If the unlock side fastpath succeeds, [i.e.  userspace manages to do a TID ->
0 atomic transition of the futex value], then no kernel work is triggered.

If the unlock fastpath fails (because the FUTEX_WAITERS bit is set), then
FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on the behalf of
userspace - and it also unlocks the attached pi_state->rt_mutex and thus wakes
up any potential waiters.

Note that under this approach, contrary to other PI-futex approaches, there is
no prior 'registration' of a PI-futex.  [which is not quite possible anyway,
due to existing ABI properties of pthread mutexes.]

Also, under this scheme, 'robustness' and 'PI' are two orthogonal properties
of futexes, and all four combinations are possible: futex, robust-futex,
PI-futex, robust+PI-futex.

  glibc support:
  --------------

Ulrich Drepper and Jakub Jelinek have written glibc support for PI-futexes
(and robust futexes), enabling robust and PI (PTHREAD_PRIO_INHERIT) POSIX
mutexes.  (PTHREAD_PRIO_PROTECT support will be added later on too, no
additional kernel changes are needed for that).  [NOTE: The glibc patch is
obviously inofficial and unsupported without matching upstream kernel
functionality.]

the patch-queue and the glibc patch can also be downloaded from:

  http://redhat.com/~mingo/PI-futex-patches/

Many thanks go to the people who helped us create this kernel feature: Steven
Rostedt, Esben Nielsen, Benedikt Spranger, Daniel Walker, John Cooper, Arjan
van de Ven, Oleg Nesterov and others.  Credits for related prior projects goes
to Dirk Grambow, Inaky Perez-Gonzalez, Bill Huey and many others.

Clean up the futex code, before adding more features to it:

 - use u32 as the futex field type - that's the ABI
 - use __user and pointers to u32 instead of unsigned long
 - code style / comment style cleanups
 - rename hash-bucket name from 'bh' to 'hb'.

I checked the pre and post futex.o object files to make sure this
patch has no code effects.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:46 -07:00
Steven Rostedt 66e5393a78 [PATCH] BUG() if setscheduler is called from interrupt context
Thomas Gleixner is adding the call to a rtmutex function in setscheduler.
This call grabs a spin_lock that is not always protected by interrupts
disabled.  So this means that setscheduler cant be called from interrupt
context.

To prevent this from happening in the future, this patch adds a
BUG_ON(in_interrupt()) in that function.  (Thanks to akpm <aka.  Andrew
Morton> for this suggestion).

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:46 -07:00
Oleg Nesterov 9fea80e4d9 [PATCH] sched: uninline task_rq_lock()
Saves 543 bytes from sched.o (gcc 3.3.3).

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Con Kolivas <kernel@kolivas.org>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Siddha, Suresh B 5c45bf279d [PATCH] sched: mc/smt power savings sched policy
sysfs entries 'sched_mc_power_savings' and 'sched_smt_power_savings' in
/sys/devices/system/cpu/ control the MC/SMT power savings policy for the
scheduler.

Based on the values (1-enable, 0-disable) for these controls, sched groups
cpu power will be determined for different domains.  When power savings
policy is enabled and under light load conditions, scheduler will minimize
the physical packages/cpu cores carrying the load and thus conserving
power(with a perf impact based on the workload characteristics...  see OLS
2005 CMP kernel scheduler paper for more details..)

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Con Kolivas <kernel@kolivas.org>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri 369381694d [PATCH] sched_domai: Allocate sched_group structures dynamically
As explained here:
	http://marc.theaimsgroup.com/?l=linux-kernel&m=114327539012323&w=2

there is a problem with sharing sched_group structures between two
separate sched_group structures for different sched_domains.

The patch has been tested and found to avoid the kernel lockup problem
described in above URL.

Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri 15f0b676a4 [PATCH] sched_domai: Use kmalloc_node
The sched group structures used to represent various nodes need to be
allocated from respective nodes (as suggested here also:

	http://uwsg.ucs.indiana.edu/hypermail/linux/kernel/0603.3/0051.html)

Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri d3a5aa9858 [PATCH] sched_domai: Don't use GFP_ATOMIC
Replace GFP_ATOMIC allocation for sched_group_nodes with GFP_KERNEL based
allocation.

Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri 51888ca25a [PATCH] sched_domain: handle kmalloc failure
Try to handle mem allocation failures in build_sched_domains by bailing out
and cleaning up thus-far allocated memory.  The patch has a direct consequence
that we disable load balancing completely (even at sibling level) upon *any*
memory allocation failure.

[Lee.Schermerhorn@hp.com: bugfix]
Signed-off-by: Srivatsa Vaddagir <vatsa@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Peter Williams 615052dc3b [PATCH] sched: Avoid unnecessarily moving highest priority task move_tasks()
Problem:

To help distribute high priority tasks evenly across the available CPUs
move_tasks() does not, under some circumstances, skip tasks whose load
weight is bigger than the designated amount.  Because the highest priority
task on the busiest queue may be on the expired array it may be moved as a
result of this mechanism.  Apart from not being the most desirable way to
redistribute the high priority tasks (we'd rather move the second highest
priority task), there is a risk that this could set up a loop with this
task bouncing backwards and forwards between the two queues.  (This latter
possibility can be demonstrated by running a nice==-20 CPU bound task on an
otherwise quiet 2 CPU system.)

Solution:

Modify the mechanism so that it does not override skip for the highest
priority task on the CPU.  Of course, if there are more than one tasks at
the highest priority then it will allow the override for one of them as
this is a desirable redistribution of high priority tasks.

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Peter Williams 50ddd96917 [PATCH] sched: modify move_tasks() to improve load balancing outcomes
Problem:

The move_tasks() function is designed to move UP TO the amount of load it
is asked to move and in doing this it skips over tasks looking for ones
whose load weights are less than or equal to the remaining load to be
moved.  This is (in general) a good thing but it has the unfortunate result
of breaking one of the original load balancer's good points: namely, that
(within the limits imposed by the active/expired array model and the fact
the expired is processed first) it moves high priority tasks before low
priority ones and this means there's a good chance (see active/expired
problem for why it's only a chance) that the highest priority task on the
queue but not actually on the CPU will be moved to the other CPU where (as
a high priority task) it may preempt the current task.

Solution:

Modify move_tasks() so that high priority tasks are not skipped when moving
them will make them the highest priority task on their new run queue.

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Peter Williams 2dd73a4f09 [PATCH] sched: implement smpnice
Problem:

The introduction of separate run queues per CPU has brought with it "nice"
enforcement problems that are best described by a simple example.

For the sake of argument suppose that on a single CPU machine with a
nice==19 hard spinner and a nice==0 hard spinner running that the nice==0
task gets 95% of the CPU and the nice==19 task gets 5% of the CPU.  Now
suppose that there is a system with 2 CPUs and 2 nice==19 hard spinners and
2 nice==0 hard spinners running.  The user of this system would be entitled
to expect that the nice==0 tasks each get 95% of a CPU and the nice==19
tasks only get 5% each.  However, whether this expectation is met is pretty
much down to luck as there are four equally likely distributions of the
tasks to the CPUs that the load balancing code will consider to be balanced
with loads of 2.0 for each CPU.  Two of these distributions involve one
nice==0 and one nice==19 task per CPU and in these circumstances the users
expectations will be met.  The other two distributions both involve both
nice==0 tasks being on one CPU and both nice==19 being on the other CPU and
each task will get 50% of a CPU and the user's expectations will not be
met.

Solution:

The solution to this problem that is implemented in the attached patch is
to use weighted loads when determining if the system is balanced and, when
an imbalance is detected, to move an amount of weighted load between run
queues (as opposed to a number of tasks) to restore the balance.  Once
again, the easiest way to explain why both of these measures are necessary
is to use a simple example.  Suppose that (in a slight variation of the
above example) that we have a two CPU system with 4 nice==0 and 4 nice=19
hard spinning tasks running and that the 4 nice==0 tasks are on one CPU and
the 4 nice==19 tasks are on the other CPU.  The weighted loads for the two
CPUs would be 4.0 and 0.2 respectively and the load balancing code would
move 2 tasks resulting in one CPU with a load of 2.0 and the other with
load of 2.2.  If this was considered to be a big enough imbalance to
justify moving a task and that task was moved using the current
move_tasks() then it would move the highest priority task that it found and
this would result in one CPU with a load of 3.0 and the other with a load
of 1.2 which would result in the movement of a task in the opposite
direction and so on -- infinite loop.  If, on the other hand, an amount of
load to be moved is calculated from the imbalance (in this case 0.1) and
move_tasks() skips tasks until it find ones whose contributions to the
weighted load are less than this amount it would move two of the nice==19
tasks resulting in a system with 2 nice==0 and 2 nice=19 on each CPU with
loads of 2.1 for each CPU.

One of the advantages of this mechanism is that on a system where all tasks
have nice==0 the load balancing calculations would be mathematically
identical to the current load balancing code.

Notes:

struct task_struct:

has a new field load_weight which (in a trade off of space for speed)
stores the contribution that this task makes to a CPU's weighted load when
it is runnable.

struct runqueue:

has a new field raw_weighted_load which is the sum of the load_weight
values for the currently runnable tasks on this run queue.  This field
always needs to be updated when nr_running is updated so two new inline
functions inc_nr_running() and dec_nr_running() have been created to make
sure that this happens.  This also offers a convenient way to optimize away
this part of the smpnice mechanism when CONFIG_SMP is not defined.

int try_to_wake_up():

in this function the value SCHED_LOAD_BALANCE is used to represent the load
contribution of a single task in various calculations in the code that
decides which CPU to put the waking task on.  While this would be a valid
on a system where the nice values for the runnable tasks were distributed
evenly around zero it will lead to anomalous load balancing if the
distribution is skewed in either direction.  To overcome this problem
SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
or by the average load_weight per task for the queue in question (as
appropriate).

int move_tasks():

The modifications to this function were complicated by the fact that
active_load_balance() uses it to move exactly one task without checking
whether an imbalance actually exists.  This precluded the simple
overloading of max_nr_move with max_load_move and necessitated the addition
of the latter as an extra argument to the function.  The internal
implementation is then modified to move up to max_nr_move tasks and
max_load_move of weighted load.  This slightly complicates the code where
move_tasks() is called and if ever active_load_balance() is changed to not
use move_tasks() the implementation of move_tasks() should be simplified
accordingly.

struct sched_group *find_busiest_group():

Similar to try_to_wake_up(), there are places in this function where
SCHED_LOAD_SCALE is used to represent the load contribution of a single
task and the same issues are created.  A similar solution is adopted except
that it is now the average per task contribution to a group's load (as
opposed to a run queue) that is required.  As this value is not directly
available from the group it is calculated on the fly as the queues in the
groups are visited when determining the busiest group.

A key change to this function is that it is no longer to scale down
*imbalance on exit as move_tasks() uses the load in its scaled form.

void set_user_nice():

has been modified to update the task's load_weight field when it's nice
value and also to ensure that its run queue's raw_weighted_load field is
updated if it was runnable.

From: "Siddha, Suresh B" <suresh.b.siddha@intel.com>

With smpnice, sched groups with highest priority tasks can mask the imbalance
between the other sched groups with in the same domain.  This patch fixes some
of the listed down scenarios by not considering the sched groups which are
lightly loaded.

a) on a simple 4-way MP system, if we have one high priority and 4 normal
   priority tasks, with smpnice we would like to see the high priority task
   scheduled on one cpu, two other cpus getting one normal task each and the
   fourth cpu getting the remaining two normal tasks.  but with current
   smpnice extra normal priority task keeps jumping from one cpu to another
   cpu having the normal priority task.  This is because of the
   busiest_has_loaded_cpus, nr_loaded_cpus logic..  We are not including the
   cpu with high priority task in max_load calculations but including that in
   total and avg_load calcuations..  leading to max_load < avg_load and load
   balance between cpus running normal priority tasks(2 Vs 1) will always show
   imbalanace as one normal priority and the extra normal priority task will
   keep moving from one cpu to another cpu having normal priority task..

b) 4-way system with HT (8 logical processors).  Package-P0 T0 has a
   highest priority task, T1 is idle.  Package-P1 Both T0 and T1 have 1 normal
   priority task each..  P2 and P3 are idle.  With this patch, one of the
   normal priority tasks on P1 will be moved to P2 or P3..

c) With the current weighted smp nice calculations, it doesn't always make
   sense to look at the highest weighted runqueue in the busy group..
   Consider a load balance scenario on a DP with HT system, with Package-0
   containing one high priority and one low priority, Package-1 containing one
   low priority(with other thread being idle)..  Package-1 thinks that it need
   to take the low priority thread from Package-0.  And find_busiest_queue()
   returns the cpu thread with highest priority task..  And ultimately(with
   help of active load balance) we move high priority task to Package-1.  And
   same continues with Package-0 now, moving high priority task from package-1
   to package-0..  Even without the presence of active load balance, load
   balance will fail to balance the above scenario..  Fix find_busiest_queue
   to use "imbalance" when it is lightly loaded.

[kernel@kolivas.org: sched: store weighted load on up]
[kernel@kolivas.org: sched: add discrete weighted cpu load function]
[suresh.b.siddha@intel.com: sched: remove dead code]
Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: John Hawkes <hawkes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Kirill Korotaev efc30814a8 [PATCH] sched: CPU hotplug race vs. set_cpus_allowed()
There is a race between set_cpus_allowed() and move_task_off_dead_cpu().
__migrate_task() doesn't report any err code, so task can be left on its
runqueue if its cpus_allowed mask changed so that dest_cpu is not longer a
possible target.  Also, chaning cpus_allowed mask requires rq->lock being
held.

Signed-off-by: Kirill Korotaev <dev@openvz.org>
Acked-By: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Steven Rostedt cc94abfcbc [PATCH] unnecessary long index i in sched
Unless we expect to have more than 2G CPUs, there's no reason to have 'i'
as a long long here.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Con Kolivas 72d2854d4e [PATCH] sched: fix interactive ceiling code
The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect
and not explicit enough.  The sleep boost is not supposed to be any larger
than without this code and the comment is not clear enough about what
exactly it does, just the reason it does it.  Fix it.

There is a ceiling to the priority beyond which tasks that only ever sleep
for very long periods cannot surpass.  Fix it.

Prevent the on-runqueue bonus logic from defeating the idle sleep logic.

Opportunity to micro-optimise.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Steven Rostedt d444886e14 [PATCH] sched: simplify bitmap definition
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Chen, Kenneth W c96d145e71 [PATCH] sched: fix smt nice lock contention and optimization
Initial report and lock contention fix from Chris Mason:

Recent benchmarks showed some performance regressions between 2.6.16 and
2.6.5.  We tracked down one of the regressions to lock contention in
schedule heavy workloads (~70,000 context switches per second)

kernel/sched.c:dependent_sleeper() was responsible for most of the lock
contention, hammering on the run queue locks.  The patch below is more of a
discussion point than a suggested fix (although it does reduce lock
contention significantly).  The dependent_sleeper code looks very expensive
to me, especially for using a spinlock to bounce control between two
different siblings in the same cpu.

It is further optimized:

* perform dependent_sleeper check after next task is determined
* convert wake_sleeping_dependent to use trylock
* skip smt runqueue check if trylock fails
* optimize double_rq_lock now that smt nice is converted to trylock
* early exit in searching first SD_SHARE_CPUPOWER domain
* speedup fast path of dependent_sleeper

[akpm@osdl.org: cleanup]
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Chris Mason <mason@suse.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Chandra Seetharaman 26c2143b63 [PATCH] cpu hotplug: make cpu_notifier related notifier calls __cpuinit only
Make notifier_calls associated with cpu_notifier as __cpuinit.

__cpuinit makes sure that the function is init time only unless
CONFIG_HOTPLUG_CPU is defined.

[akpm@osdl.org: section fix]
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:41 -07:00
Chandra Seetharaman 65edc68c34 [PATCH] cpu hotplug: make [un]register_cpu_notifier init time only
CPUs come online only at init time (unless CONFIG_HOTPLUG_CPU is defined).
So, cpu_notifier functionality need to be available only at init time.

This patch makes register_cpu_notifier() available only at init time, unless
CONFIG_HOTPLUG_CPU is defined.

This patch exports register_cpu_notifier() and unregister_cpu_notifier() only
if CONFIG_HOTPLUG_CPU is defined.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:41 -07:00