Commit Graph

3920 Commits (e958b3600484533ff801920290468adc8135f89d)

Author SHA1 Message Date
Thomas Gleixner e958b36004 sched: move weighted_cpuload into #ifdef CONFIG_SMP section
weighted_cpuload is only used on SMP. move it into the CONFIG_SMP
section.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:25:02 +02:00
Max Krasnyansky 68f4f1ec08 sched: Move cpu masks from kernel/sched.c into kernel/cpu.c
kernel/cpu.c seems a more logical place for those maps since they do not really
have much to do with the scheduler these days.

kernel/cpu.c is now built for the UP kernel too, but it does not affect the size
the kernel sections.

$ size vmlinux

before
   text       data        bss        dec        hex    filename
3313797     307060     310352    3931209     3bfc49    vmlinux

after
   text       data        bss        dec        hex    filename
3313797     307060     310352    3931209     3bfc49    vmlinux

Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Cc: pj@sgi.com
Cc: menage@google.com
Cc: rostedt@goodmis.org
Cc: mingo@elte.hu
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:25:01 +02:00
Max Krasnyansky 5c8e1ed1d2 sched: CPU hotplug events must not destroy scheduler domains created by the cpusets
First issue is not related to the cpusets. We're simply leaking doms_cur.
It's allocated in arch_init_sched_domains() which is called for every
hotplug event. So we just keep reallocation doms_cur without freeing it.
I introduced free_sched_domains() function that cleans things up.

Second issue is that sched domains created by the cpusets are
completely destroyed by the CPU hotplug events. For all CPU hotplug
events scheduler attaches all CPUs to the NULL domain and then puts
them all into the single domain thereby destroying domains created
by the cpusets (partition_sched_domains).
The solution is simple, when cpusets are enabled scheduler should not
create default domain and instead let cpusets do that. Which is
exactly what the patch does.

Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Cc: pj@sgi.com
Cc: menage@google.com
Cc: rostedt@goodmis.org
Cc: mingo@elte.hu
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:25:00 +02:00
Ingo Molnar 1100ac91b6 sched: fix cpuprio build bug
this patch was not built on !SMP:

 kernel/sched_rt.c: In function 'inc_rt_tasks':
 kernel/sched_rt.c:404: error: 'struct rq' has no member named 'online'

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-06 15:19:46 +02:00
Thomas Gleixner e539d8fcd1 sched: fix the cpuprio count really
Peter pointed out that the last version of the "fix" was still one off
under certain circumstances. Use BITS_TO_LONG instead to get an
accurate result.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:44 +02:00
Gregory Haskins 709d4b0c60 sched: fix cpupri priocount
A rounding error was pointed out by Peter Zijlstra which would result
in the structure holding priorities to be off by one.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:43 +02:00
Gregory Haskins 1f11eb6a8b sched: fix cpupri hotplug support
The RT folks over at RedHat found an issue w.r.t. hotplug support which
was traced to problems with the cpupri infrastructure in the scheduler:

https://bugzilla.redhat.com/show_bug.cgi?id=449676

This bug affects 23-rt12+, 24-rtX, 25-rtX, and sched-devel.  This patch
applies to 25.4-rt4, though it should trivially apply to most cpupri enabled
kernels mentioned above.

It turned out that the issue was that offline cpus could get inadvertently
registered with cpupri so that they were erroneously selected during
migration decisions.  The end result would be an OOPS as the offline cpu
had tasks routed to it.

This patch generalizes the old join/leave domain interface into an
online/offline interface, and adjusts the root-domain/hotplug code to
utilize it.

I was able to easily reproduce the issue prior to this patch, and am no
longer able to reproduce it after this patch.  I can offline cpus
indefinately and everything seems to be in working order.

Thanks to Arnaldo (acme), Thomas, and Peter for doing the legwork to point
me in the right direction.  Also thank you to Peter for reviewing the
early iterations of this patch.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:42 +02:00
Gautham R Shenoy 099f98c8a1 sched: print the sd->level in sched_domain_debug code
While printing out the visual representation of the sched-domains, print
the level (MC, SMT, CPU, NODE, ... ) of each of the sched_domains.

Credit: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-06 15:19:41 +02:00
Dhaval Giani 6d6bc0ad86 sched: add comments for ifdefs in sched.c
make sched.c easier to read.

Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:38 +02:00
Arjan van de Ven e21f5b153b sched: print module list in the "scheduling while atomic" warning
For the normal WARN_ON() etc we added a print-the-modules-list already,
which is very useful to figure out candidates for certain types of bugs.

This patch adds the same print to the "scheduling while atomic" BUG warning,
for the same reason: when we get here it's very useful to see which modules
are loaded, to narrow down the candidate code list.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: mingo@elte.hu
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:37 +02:00
Rabin Vincent 81d41d7ece sched: fix defined-but-unused warning
Fix this warning, which appears with !CONFIG_SMP:
kernel/sched.c:1216: warning: `init_hrtick' defined but not used

Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:34 +02:00
Thomas Gleixner f7dcd80bbc namespacecheck: fixes in kernel/sched.c
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:32 +02:00
Dmitry Adamushko d07355f5de sched: check for SD_SERIALIZE atomically in rebalance_domains()
Nothing really serious here, mainly just a matter of nit-picking :-/

From: Dmitry Adamushko <dmitry.adamushko@gmail.com>
For CONFIG_SCHED_DEBUG && CONFIG_SYSCT configs, sd->flags can be altered
while being manipulated in rebalance_domains(). Let's do an atomic check.
We rely here on the atomicity of read/write accesses for aligned words.

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:30 +02:00
Gregory Haskins 6d299f1b53 sched: fix SCHED_OTHER balance iterator to include all tasks
The currently logic inadvertently skips the last task on the run-queue,
resulting in missed balance opportunities.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: David Bahi <dbahi@novell.com>
CC: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:29 +02:00
Gregory Haskins 6e0534f278 sched: use a 2-d bitmap for searching lowest-pri CPU
The current code use a linear algorithm which causes scaling issues
on larger SMP machines.  This patch replaces that algorithm with a
2-dimensional bitmap to reduce latencies in the wake-up path.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:28 +02:00
Mike Galbraith f333fdc909 sched: make !hrtick faster
it is safe to ignore timers and flags when the feature is disabled.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:27 +02:00
Gregory Haskins 45c01e8249 sched: prioritize non-migratable tasks over migratable ones
Dmitry Adamushko pointed out a known flaw in the rt-balancing algorithm
that could allow suboptimal balancing if a non-migratable task gets
queued behind a running migratable one.  It is discussed in this thread:

http://lkml.org/lkml/2008/4/22/296

This issue has been further exacerbated by a recent checkin to
sched-devel (git-id 5eee63a5ebc19a870ac40055c0be49457f3a89a3).

>From a pure priority standpoint, the run-queue is doing the "right"
thing. Using Dmitry's nomenclature, if T0 is on cpu1 first, and T1
wakes up at equal or lower priority (affined only to cpu1) later, it
*should* wait for T0 to finish.  However, in reality that is likely
suboptimal from a system perspective if there are other cores that
could allow T0 and T1 to run concurrently.  Since T1 can not migrate,
the only choice for higher concurrency is to try to move T0.  This is
not something we addessed in the recent rt-balancing re-work.

This patch tries to enhance the balancing algorithm by accomodating this
scenario.  It accomplishes this by incorporating the migratability of a
task into its priority calculation.  Within a numerical tsk->prio, a
non-migratable task is logically higher than a migratable one.  We
maintain this by introducing a new per-priority queue (xqueue, or
exclusive-queue) for holding non-migratable tasks.  The scheduler will
draw from the xqueue over the standard shared-queue (squeue) when
available.

There are several details for utilizing this properly.

1) During task-wake-up, we not only need to check if the priority
   preempts the current task, but we also need to check for this
   non-migratable condition.  Therefore, if a non-migratable task wakes
   up and sees an equal priority migratable task already running, it
   will attempt to preempt it *if* there is a likelyhood that the
   current task will find an immediate home.

2) Tasks only get this non-migratable "priority boost" on wake-up.  Any
   requeuing will result in the non-migratable task being queued to the
   end of the shared queue.  This is an attempt to prevent the system
   from being completely unfair to migratable tasks during things like
   SCHED_RR timeslicing.

I am sure this patch introduces potentially "odd" behavior if you
concoct a scenario where a bunch of non-migratable threads could starve
migratable ones given the right pattern.  I am not yet convinced that
this is a problem since we are talking about tasks of equal RT priority
anyway, and there never is much in the way of guarantees against
starvation under that scenario anyway. (e.g. you could come up with a
similar scenario with a specific timing environment verses an affinity
environment).  I can be convinced otherwise, but for now I think this is
"ok".

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-06 15:19:25 +02:00
Linus Torvalds 3b5b60b821 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
  kgdbts: Use HW breakpoints with CONFIG_DEBUG_RODATA
  kgdb: use common ascii helpers and put_unaligned_be32 helper
2008-06-04 08:08:27 -07:00
Linus Torvalds a7f75d3bed Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: re-tune NUMA topologies
  sched: stop wake_affine from causing serious imbalance
  sched: fix sched_clock_cpu()
  revert ("sched: fair-group: SMP-nice for group scheduling")
  sched: cleanup
  show_schedstat(): fix memleak
  sched: unite unlikely pairs in rt_policy() and schedule_debug()
  revert ("sched: fair: weight calculations")
2008-05-29 09:26:17 -07:00
Ingo Molnar 6715930654 Merge commit 'linus/master' into sched-fixes-for-linus 2008-05-29 16:05:05 +02:00
Mike Galbraith b3137bc8e7 sched: stop wake_affine from causing serious imbalance
Prevent short-running wakers of short-running threads from overloading a single
cpu via wakeup affinity, and wire up disconnected debug option.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29 11:29:20 +02:00
Peter Zijlstra a381759d6a sched: fix sched_clock_cpu()
Make sched_clock_cpu() return 0 before it has been initialized and avoid
corrupting its state due to doing so.

This fixes the weird printk timestamp jump reported.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-05-29 11:29:19 +02:00
Ingo Molnar 6363ca57c7 revert ("sched: fair-group: SMP-nice for group scheduling")
Yanmin Zhang reported:

Comparing with 2.6.25, volanoMark has big regression with kernel 2.6.26-rc1.
It's about 50% on my 8-core stoakley, 16-core tigerton, and Itanium Montecito.

With bisect, I located the following patch:

| 18d95a2832 is first bad commit
| commit 18d95a2832
| Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
| Date:   Sat Apr 19 19:45:00 2008 +0200
|
|     sched: fair-group: SMP-nice for group scheduling

Revert it so that we get v2.6.25 behavior.

Bisected-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29 11:28:57 +02:00
Ingo Molnar 4285f594f8 sched: cleanup
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29 11:25:15 +02:00
Adrian Bunk c6fba5451a show_schedstat(): fix memleak
The Coverity checker spotted a memleak introduced by commit
39106dcf85 (cpumask: use new cpus_scnprintf
function).

It seems the kfree() got lost between v2 and v3 of this patch...

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Mike Travis <travis@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29 11:25:15 +02:00
Roel Kluin 3f33a7ce95 sched: unite unlikely pairs in rt_policy() and schedule_debug()
Removes obfuscation and may improve assembly.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29 11:25:14 +02:00
Ingo Molnar f9305d4a09 revert ("sched: fair: weight calculations")
Yanmin Zhang reported:

Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has many
regressions with 2.6.26-rc1:

 1) 8-core stoakley: 28%;
 2) 16-core tigerton: 20%;
 3) Itanium Montvale: 50%.

Bisect located this patch:

| 8f1bc385cf is first bad commit
| commit 8f1bc385cf
| Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
| Date:   Sat Apr 19 19:45:00 2008 +0200
|
|     sched: fair: weight calculations

Revert it to the 2.6.25 state.

Bisected-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-29 11:24:01 +02:00
Harvey Harrison 827e609b45 kgdb: use common ascii helpers and put_unaligned_be32 helper
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2008-05-28 12:49:56 -05:00
Tom Zanussi a82c53a0e3 splice: fix sendfile() issue with relay
Splice isn't always incrementing the ppos correctly, which broke
relay splice.

Signed-off-by: Tom Zanussi <zanussi@comcast.net>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-05-28 14:49:27 +02:00
Oleg Nesterov cbaffba12c posix timers: discard SI_TIMER signals on exec
Based on Roland's patch. This approach was suggested by Austin Clements
from the very beginning, and then by Linus.

As Austin pointed out, the execing task can be killed by SI_TIMER signal
because exec flushes the signal handlers, but doesn't discard the pending
signals generated by posix timers. Perhaps not a bug, but people find this
surprising. See http://bugzilla.kernel.org/show_bug.cgi?id=10460

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-26 10:37:07 -07:00
Oleg Nesterov c8e85b4f4b posix timers: sigqueue_free: don't free sigqueue if it is queued
Currently sigqueue_free() removes sigqueue from list, but doesn't cancel the
pending signal. This is not consistent, the task should either receive the
"full" signal along with siginfo_t, or it shouldn't receive the signal at all.

Change sigqueue_free() to clear SIGQUEUE_PREALLOC but leave sigqueue on list
if it is queued.

This is a user-visible change. If the signal is blocked, it stays queued
after sys_timer_delete() until unblocked with the "stale" si_code/si_value,
and of course it is still counted wrt RLIMIT_SIGPENDING which also limits
the number of posix timers.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-26 10:37:06 -07:00
Cedric Le Goater 5c02b57578 cgroups: remove node_ prefix_from ns subsystem
This is a slight change in the namespace cgroup subsystem api.

The change is that previously when cgroup_clone() was called (currently
only from the unshare path in ns_proxy cgroup, you'd get a new group named
"node_$pid" whereas now you'll get a group named after just your pid.)

The only users who would notice it are those who are using the ns_proxy
cgroup subsystem to auto-create cgroups when namespaces are unshared -
something of an experimental feature, which I think really needs more
complete container/namespace support in order to be useful.  I suspect the
only users are Cedric and Serge, or maybe a few others on
containers@lists.linux-foundation.org.  And in fact it would only be
noticed by the users who make the assumption about how the name is
generated, rather than getting it from the /proc/<pid>/cgroups file for
the process in question.

Whether the change is actually needed or not I'm fairly agnostic on, but I
guess it is more elegant to just use the pid as the new group name rather
than adding a fairly arbitrary "node_" prefix on the front.

[menage@google.com: provided changelog]
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Paul Menage" <menage@google.com>
Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:14 -07:00
Shi Weihua 7b26655f62 sys_prctl(): fix return of uninitialized value
If none of the switch cases match, the PR_SET_PDEATHSIG and
PR_SET_DUMPABLE cases of the switch statement will never write to local
variable `error'.

Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com>
Cc: Andrew G. Morgan <morgan@kernel.org>
Acked-by: "Serge E. Hallyn" <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:13 -07:00
Oleg Nesterov da7978b034 signals: fix sigqueue_free() vs __exit_signal() race
__exit_signal() does flush_sigqueue(tsk->pending) outside of ->siglock.
This can race with another thread doing sigqueue_free(), we can free the
same SIGQUEUE_PREALLOC sigqueue twice or corrupt the pending->list.

Note that even sys_exit_group() can trigger this race, not only
sys_timer_delete().

Move the callsite of flush_sigqueue(tsk->pending) under ->siglock.

This patch doesn't touch flush_sigqueue(->shared_pending) below, it is
called when there are no other threads which can play with signals, and
sigqueue_free() can't be used outside of our thread group.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:10 -07:00
Christian Borntraeger 3401a61e16 stop_machine: make stop_machine_run more virtualization friendly
On kvm I have seen some rare hangs in stop_machine when I used more guest
cpus than hosts cpus. e.g. 32 guest cpus on 1 host cpu triggered the
hang quite often. I could also reproduce the problem on a 4 way z/VM host with
a 64 way guest.

It turned out that the guest was consuming all available cpus mostly for
spinning on scheduler locks like rq->lock. This is expected as the threads are
calling yield all the time.
The problem is now, that the host scheduling decisings together with the guest
scheduling decisions and spinlocks not being fair managed to create an
interesting scenario similar to a live lock. (Sometimes the hang resolved
itself after some minutes)

Changing stop_machine to yield the cpu to the hypervisor when yielding inside
the guest fixed the problem for me. While I am not completely happy with this
patch, I think it causes no harm and it really improves the situation for me.

I used cpu_relax for yielding to the hypervisor, does that work on all
architectures?

p.s.: If you want to reproduce the problem, cpu hotplug and kprobes use
stop_machine_run and both triggered the problem after some retries.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
CC: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-23 13:09:34 +10:00
Denis V. Lunev 34e4e2fef4 modules: proper cleanup of kobject without CONFIG_SYSFS
kobject: '<NULL>' (ffffffffa0104050): is not initialized, yet kobject_put() is being called.
------------[ cut here ]------------
WARNING: at /home/den/src/linux-netns26/lib/kobject.c:583 kobject_put+0x53/0x55()
Modules linked in: ipv6 nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ide_cd_mod cdrom button [last unloaded: pktgen]
comm: rmmod Tainted: G        W 2.6.26-rc3 #585
Call Trace:
  [<ffffffff802359ab>] warn_on_slowpath+0x58/0x7a
  [<ffffffff80236aca>] ? printk+0x67/0x69
  [<ffffffff80236aca>] ? printk+0x67/0x69
  [<ffffffff80324289>] kobject_put+0x53/0x55
  [<ffffffff8025e2ee>] free_module+0x87/0xfa
  [<ffffffff8025fee5>] sys_delete_module+0x178/0x1e1
  [<ffffffff804b1e70>] ? lockdep_sys_exit_thunk+0x35/0x67
  [<ffffffff804b1dff>] ? trace_hardirqs_on_thunk+0x35/0x3a
  [<ffffffff8020c0bb>] system_call_after_swapgs+0x7b/0x80
---[ end trace 8f5aafa7f6406cf8 ]---

mod->mkobj.kobj is not initialized without CONFIG_SYSFS. Do not call
kobject_put in this case.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-23 13:09:33 +10:00
Cyrill Gorcunov c4ea6fcf5a module loading ELF handling: use SELFMAG instead of numeric constant
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-05-23 13:09:32 +10:00
Linus Torvalds 16ae527bfa Merge branch 'audit.b51' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b51' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
  [PATCH] list_for_each_rcu must die: audit
  [patch 1/1] audit_send_reply(): fix error-path memory leak
  [PATCH] open sessionid permissions
2008-05-19 16:38:10 -07:00
Paul E. McKenney 6793a051fb [PATCH] list_for_each_rcu must die: audit
All uses of list_for_each_rcu() can be profitably replaced by the
easier-to-use list_for_each_entry_rcu().  This patch makes this change
for the Audit system, in preparation for removing the list_for_each_rcu()
API entirely.  This time with well-formed SOB.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-17 03:30:23 -04:00
Andrew Morton fcaf1eb868 [patch 1/1] audit_send_reply(): fix error-path memory leak
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=10663

Reporter: Daniel Marjamki <danielm77@spray.se>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-17 03:30:22 -04:00
Al Viro eceea0b3df [PATCH] avoid multiplication overflows and signedness issues for max_fds
Limit sysctl_nr_open - we don't want ->max_fds to exceed MAX_INT and
we don't want size calculation for ->fd[] to overflow.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:52 -04:00
Al Viro 02afc6267f [PATCH] dup_fd() fixes, part 1
Move the sucker to fs/file.c in preparation to the rest

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:26 -04:00
Harvey Harrison 3fc957721d lib: create common ascii hex array
Add a common hex array in hexdump.c so everyone can use it.

Add a common hi/lo helper to avoid the shifting masking that is
done to get the upper and lower nibbles of a byte value.

Pull the pack_hex_byte helper from kgdb as it is opencoded many
places in the tree that will be consolidated.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-14 19:11:14 -07:00
Mirco Tischler 0c70814c31 cgroups: fix compile warning
Return type of cpu_rt_runtime_write() should be int instead of ssize_t.

Signed-off-by: Mirco Tischler <mt-ml@gmx.de>
Acked-by: Paul Menage <menage@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-14 19:11:14 -07:00
Linus Torvalds c3921ab715 Add new 'cond_resched_bkl()' helper function
It acts exactly like a regular 'cond_resched()', but will not get
optimized away when CONFIG_PREEMPT is set.

Normal kernel code is already preemptable in the presense of
CONFIG_PREEMPT, so cond_resched() is optimized away (see commit
02b67cc3ba "sched: do not do
cond_resched() when CONFIG_PREEMPT").

But when wanting to conditionally reschedule while holding a lock, you
need to use "cond_sched_lock(lock)", and the new function is the BKL
equivalent of that.

Also make fs/locks.c use it.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-11 16:04:48 -07:00
Linus Torvalds 8e3e076c5a BKL: revert back to the old spinlock implementation
The generic semaphore rewrite had a huge performance regression on AIM7
(and potentially other BKL-heavy benchmarks) because the generic
semaphores had been rewritten to be simple to understand and fair.  The
latter, in particular, turns a semaphore-based BKL implementation into a
mess of scheduling.

The attempt to fix the performance regression failed miserably (see the
previous commit 00b41ec261 'Revert
"semaphore: fix"'), and so for now the simple and sane approach is to
instead just go back to the old spinlock-based BKL implementation that
never had any issues like this.

This patch also has the advantage of being reported to fix the
regression completely according to Yanmin Zhang, unlike the semaphore
hack which still left a couple percentage point regression.

As a spinlock, the BKL obviously has the potential to be a latency
issue, but it's not really any different from any other spinlock in that
respect.  We do want to get rid of the BKL asap, but that has been the
plan for several years.

These days, the biggest users are in the tty layer (open/release in
particular) and Alan holds out some hope:

  "tty release is probably a few months away from getting cured - I'm
   afraid it will almost certainly be the very last user of the BKL in
   tty to get fixed as it depends on everything else being sanely locked."

so while we're not there yet, we do have a plan of action.

Tested-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-10 20:58:02 -07:00
Linus Torvalds 00b41ec261 Revert "semaphore: fix"
This reverts commit bf726eab37, as it has
been reported to cause a regression with processes stuck in __down(),
apparently because some missing wakeup.

Quoth Sven Wegener:
 "I'm currently investigating a regression that has showed up with my
  last git pull yesterday.  Bisecting the commits showed bf726e
  "semaphore: fix" to be the culprit, reverting it fixed the issue.

  Symptoms: During heavy filesystem usage (e.g.  a kernel compile) I get
  several compiler processes in uninterruptible sleep, blocking all i/o
  on the filesystem.  System is an Intel Core 2 Quad running a 64bit
  kernel and userspace.  Filesystem is xfs on top of lvm.  See below for
  the output of sysrq-w."

See

	http://lkml.org/lkml/2008/5/10/45

for full report.

In the meantime, we can just fix the BKL performance regression by
reverting back to the good old BKL spinlock implementation instead,
since any sleeping lock will generally perform badly, especially if it
tries to be fair.

Reported-by: Sven Wegener <sven.wegener@stealer.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-10 20:43:22 -07:00
Rusty Russell 91e37a793b module: don't ignore vermagic string if module doesn't have modversions
Linus found a logic bug: we ignore the version number in a module's
vermagic string if we have CONFIG_MODVERSIONS set, but modversions
also lets through a module with no __versions section for modprobe
--force (with tainting, but still).

We should only ignore the start of the vermagic string if the module
actually *has* crcs to check.  Rather than (say) having an
entertaining hissy fit and creating a config option to work around the
buggy code.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-09 07:45:18 -07:00
Rusty Russell a5dd697074 module: be more picky about allowing missing module versions
We allow missing __versions sections, because modprobe --force strips
it.  It makes less sense to allow sections where there's no version
for a specific symbol the module uses, so disallow that.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-09 07:45:18 -07:00
Linus Torvalds c4f51b4662 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes:
  sched: fix weight calculations
  semaphore: fix
2008-05-08 11:31:07 -07:00