Commit Graph

13982 Commits (9a0c6fef423528ba5b62aa31b29aabf689eb8f70)

Author SHA1 Message Date
Paul E. McKenney 79bce67243 rcu: Prevent initialization-time quiescent-state race
The next step in reducing RCU's grace-period initialization latency on
large systems will make this initialization preemptible.  Unfortunately,
making the grace-period initialization subject to interrupts (let alone
preemption) exposes the following race on systems whose rcu_node tree
contains more than one node:

1.	CPU 31 starts initializing the grace period, including the
    	first leaf rcu_node structures, and is then preempted.

2.	CPU 0 refers to the first leaf rcu_node structure, and notes
    	that a new grace period has started.  It passes through a
    	quiescent state shortly thereafter, and informs the RCU core
    	of this rite of passage.

3.	CPU 0 enters an RCU read-side critical section, acquiring
    	a pointer to an RCU-protected data item.

4.	CPU 31 takes an interrupt whose handler removes the data item
	referenced by CPU 0 from the data structure, and registers an
	RCU callback in order to free it.

5.	CPU 31 resumes initializing the grace period, including its
    	own rcu_node structure.  In invokes rcu_start_gp_per_cpu(),
    	which advances all callbacks, including the one registered
    	in #4 above, to be handled by the current grace period.

6.	The remaining CPUs pass through quiescent states and inform
    	the RCU core, but CPU 0 remains in its RCU read-side critical
    	section, still referencing the now-removed data item.

7.	The grace period completes and all the callbacks are invoked,
    	including the one that frees the data item that CPU 0 is still
    	referencing.  Oops!!!

One way to avoid this race is to remove grace-period acceleration from
rcu_start_gp_per_cpu().  Now, the only reason for this acceleration was
to allow CPUs bringing RCU out of idle state to have their callbacks
invoked after only one grace period, rather than the two grace periods
that would otherwise be required.  But this acceleration does not
work when RCU grace-period initialization is moved to a kthread because
the CPU posting the callback is no longer necessarily the CPU that is
initializing the resulting grace period.

This commit therefore removes this now-pointless (and soon to be dangerous)
grace-period acceleration, thus avoiding the above race.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23 07:41:52 -07:00
Paul E. McKenney b3dbec76e5 rcu: Move RCU grace-period initialization into a kthread
As the first step towards allowing grace-period initialization to be
preemptible, this commit moves the RCU grace-period initialization
into its own kthread.  This is needed to keep large-system scheduling
latency at reasonable levels.

Also change raw_spin_lock_irqsave() to raw_spin_lock_irq() as suggested
by Peter Zijlstra in review comments.

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:52 -07:00
Paul E. McKenney a10d206ef1 rcu: Fix day-one dyntick-idle stall-warning bug
Each grace period is supposed to have at least one callback waiting
for that grace period to complete.  However, if CONFIG_NO_HZ=n, an
extra callback-free grace period is no big problem -- it will chew up
a tiny bit of CPU time, but it will complete normally.  In contrast,
CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
sleep indefinitely, in turn indefinitely delaying completion of the
callback-free grace period.  Given that nothing is waiting on this grace
period, this is also not a problem.

That is, unless RCU CPU stall warnings are also enabled, as they are
in recent kernels.  In this case, if a CPU wakes up after at least one
minute of inactivity, an RCU CPU stall warning will result.  The reason
that no one noticed until quite recently is that most systems have enough
OS noise that they will never remain absolutely idle for a full minute.
But there are some embedded systems with cut-down userspace configurations
that consistently get into this situation.

All this begs the question of exactly how a callback-free grace period
gets started in the first place.  This can happen due to the fact that
CPUs do not necessarily agree on which grace period is in progress.
If a CPU still believes that the grace period that just completed is
still ongoing, it will believe that it has callbacks that need to wait for
another grace period, never mind the fact that the grace period that they
were waiting for just completed.  This CPU can therefore erroneously
decide to start a new grace period.  Note that this can happen in
TREE_RCU and TREE_PREEMPT_RCU even on a single-CPU system:  Deadlock
considerations mean that the CPU that detected the end of the grace
period is not necessarily officially informed of this fact for some time.

Once this CPU notices that the earlier grace period completed, it will
invoke its callbacks.  It then won't have any callbacks left.  If no
other CPU has any callbacks, we now have a callback-free grace period.

This commit therefore makes CPUs check more carefully before starting a
new grace period.  This new check relies on an array of tail pointers
into each CPU's list of callbacks.  If the CPU is up to date on which
grace periods have completed, it checks to see if any callbacks follow
the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
follow the RCU_WAIT_TAIL segment.  The reason that this works is that
the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment
as soon as the CPU is officially notified that the old grace period
has ended.

This change is to cpu_needs_another_gp(), which is called in a number
of places.  The only one that really matters is in rcu_start_gp(), where
the root rcu_node structure's ->lock is held, which prevents any
other CPU from starting or completing a grace period, so that the
comparison that determines whether the CPU is missing the completion
of a grace period is stable.

Reported-by: Becky Bruce <bgillbruce@gmail.com>
Reported-by: Subodh Nijsure <snijsure@grid-net.com>
Reported-by: Paul Walmsley <paul@pwsan.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Walmsley <paul@pwsan.com>  # OMAP3730, OMAP4430
Cc: stable@vger.kernel.org
2012-09-23 07:31:52 -07:00
Linus Torvalds 519b3b742d Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
 "One more timekeeping fix for v3.6"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time: Fix timeekeping_get_ns overflow on 32bit systems
2012-09-21 14:25:46 -07:00
Linus Torvalds c5c473e29c Merge branch 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue / powernow-k8 fix from Tejun Heo:
 "This is the fix for the bug where cpufreq/powernow-k8 was tripping
  BUG_ON() in try_to_wake_up_local() by migrating workqueue worker to a
  different CPU.

    https://bugzilla.kernel.org/show_bug.cgi?id=47301

  As discussed, the fix is now two parts - one to reimplement
  work_on_cpu() so that it doesn't create a new kthread each time and
  the actual fix which makes powernow-k8 use work_on_cpu() instead of
  performing manual migration.

  While pretty late in the merge cycle, both changes are on the safer
  side.  Jiri and I verified two existing users of work_on_cpu() and
  Duncan confirmed that the powernow-k8 fix survived about 18 hours of
  testing."

* 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  cpufreq/powernow-k8: workqueue user shouldn't migrate the kworker to another CPU
  workqueue: reimplement work_on_cpu() using system_wq
2012-09-19 11:00:07 -07:00
Tejun Heo ed48ece27c workqueue: reimplement work_on_cpu() using system_wq
The existing work_on_cpu() implementation is hugely inefficient.  It
creates a new kthread, execute that single function and then let the
kthread die on each invocation.

Now that system_wq can handle concurrent executions, there's no
advantage of doing this.  Reimplement work_on_cpu() using system_wq
which makes it simpler and way more efficient.

stable: While this isn't a fix in itself, it's needed to fix a
        workqueue related bug in cpufreq/powernow-k8.  AFAICS, this
        shouldn't break other existing users.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@vger.kernel.org
2012-09-19 10:13:12 -07:00
Linus Torvalds 4651afbbae Merge branch 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull another workqueue fix from Tejun Heo:
 "Unfortunately, yet another late fix.  This too is discovered and fixed
  by Lai.  This bug was introduced during this merge window by commit
  25511a4776 ("workqueue: reimplement CPU online rebinding to handle
  idle workers") which started using WORKER_REBIND flag for idle rebind
  too.

  The bug is relatively easy to trigger if the CPU rapidly goes through
  off, on and then off (and stay off).  The fix is on the safer side.
  This hasn't been on linux-next yet but I'm pushing early so that it
  can get more exposure before v3.6 release."

* 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn()
2012-09-17 16:05:23 -07:00
Lai Jiangshan 960bd11bf2 workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn()
busy_worker_rebind_fn() didn't clear WORKER_REBIND if rebinding failed
(CPU is down again).  This used to be okay because the flag wasn't
used for anything else.

However, after 25511a477 "workqueue: reimplement CPU online rebinding
to handle idle workers", WORKER_REBIND is also used to command idle
workers to rebind.  If not cleared, the worker may confuse the next
CPU_UP cycle by having REBIND spuriously set or oops / get stuck by
prematurely calling idle_worker_rebind().

  WARNING: at /work/os/wq/kernel/workqueue.c:1323 worker_thread+0x4cd/0x5
 00()
  Hardware name: Bochs
  Modules linked in: test_wq(O-)
  Pid: 33, comm: kworker/1:1 Tainted: G           O 3.6.0-rc1-work+ #3
  Call Trace:
   [<ffffffff8109039f>] warn_slowpath_common+0x7f/0xc0
   [<ffffffff810903fa>] warn_slowpath_null+0x1a/0x20
   [<ffffffff810b3f1d>] worker_thread+0x4cd/0x500
   [<ffffffff810bc16e>] kthread+0xbe/0xd0
   [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10
  ---[ end trace e977cf20f4661968 ]---
  BUG: unable to handle kernel NULL pointer dereference at           (null)
  IP: [<ffffffff810b3db0>] worker_thread+0x360/0x500
  PGD 0
  Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  Modules linked in: test_wq(O-)
  CPU 0
  Pid: 33, comm: kworker/1:1 Tainted: G        W  O 3.6.0-rc1-work+ #3 Bochs Bochs
  RIP: 0010:[<ffffffff810b3db0>]  [<ffffffff810b3db0>] worker_thread+0x360/0x500
  RSP: 0018:ffff88001e1c9de0  EFLAGS: 00010086
  RAX: 0000000000000000 RBX: ffff88001e633e00 RCX: 0000000000004140
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009
  RBP: ffff88001e1c9ea0 R08: 0000000000000000 R09: 0000000000000001
  R10: 0000000000000002 R11: 0000000000000000 R12: ffff88001fc8d580
  R13: ffff88001fc8d590 R14: ffff88001e633e20 R15: ffff88001e1c6900
  FS:  0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 0000000000000000 CR3: 00000000130e8000 CR4: 00000000000006f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Process kworker/1:1 (pid: 33, threadinfo ffff88001e1c8000, task ffff88001e1c6900)
  Stack:
   ffff880000000000 ffff88001e1c9e40 0000000000000001 ffff88001e1c8010
   ffff88001e519c78 ffff88001e1c9e58 ffff88001e1c6900 ffff88001e1c6900
   ffff88001e1c6900 ffff88001e1c6900 ffff88001fc8d340 ffff88001fc8d340
  Call Trace:
   [<ffffffff810bc16e>] kthread+0xbe/0xd0
   [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10
  Code: b1 00 f6 43 48 02 0f 85 91 01 00 00 48 8b 43 38 48 89 df 48 8b 00 48 89 45 90 e8 ac f0 ff ff 3c 01 0f 85 60 01 00 00 48 8b 53 50 <8b> 02 83 e8 01 85 c0 89 02 0f 84 3b 01 00 00 48 8b 43 38 48 8b
  RIP  [<ffffffff810b3db0>] worker_thread+0x360/0x500
   RSP <ffff88001e1c9de0>
  CR2: 0000000000000000

There was no reason to keep WORKER_REBIND on failure in the first
place - WORKER_UNBOUND is guaranteed to be set in such cases
preventing incorrectly activating concurrency management.  Always
clear WORKER_REBIND.

tj: Updated comment and description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-09-17 15:42:31 -07:00
Andrew Vagin 579035dc5d pid-namespace: limit value of ns_last_pid to (0, max_pid)
The kernel doesn't check the pid for negative values, so if you try to
write -2 to /proc/sys/kernel/ns_last_pid, you will get a kernel panic.

The crash happens because the next pid is -1, and alloc_pidmap() will
try to access to a nonexistent pidmap.

  map = &pid_ns->pidmap[pid/BITS_PER_PAGE];

Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17 15:00:38 -07:00
Linus Torvalds 37407ea7f9 Revert "sched: Improve scalability via 'CPU buddies', which withstand random perturbations"
This reverts commit 970e178985.

Nikolay Ulyanitsky reported thatthe 3.6-rc5 kernel has a 15-20%
performance drop on PostgreSQL 9.2 on his machine (running "pgbench").

Borislav Petkov was able to reproduce this, and bisected it to this
commit 970e178985 ("sched: Improve scalability via 'CPU buddies' ...")
apparently because the new single-idle-buddy model simply doesn't find
idle CPU's to reschedule on aggressively enough.

Mike Galbraith suspects that it is likely due to the user-mode spinlocks
in PostgreSQL not reacting well to preemption, but we don't really know
the details - I'll just revert the commit for now.

There are hopefully other approaches to improve scheduler scalability
without it causing these kinds of downsides.

Reported-by: Nikolay Ulyanitsky <lystor@gmail.com>
Bisected-by: Borislav Petkov <bp@alien8.de>
Acked-by: Mike Galbraith <efault@gmx.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-16 12:29:43 -07:00
Linus Torvalds 889cb3b9a4 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Smaller fixlets"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix kernel-doc warnings in kernel/sched/fair.c
  sched: Unthrottle rt runqueues in __disable_runtime()
  sched: Add missing call to calc_load_exit_idle()
  sched: Fix load avg vs cpu-hotplug
2012-09-14 17:44:52 -07:00
Linus Torvalds 7ef6e97380 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "This tree includes various fixes"

Ingo really needs to improve on the whole "explain git pull" part.
"Various fixes" indeed.

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/hwpb: Invoke __perf_event_disable() if interrupts are already disabled
  perf/x86: Enable Intel Cedarview Atom suppport
  perf_event: Switch to internal refcount, fix race with close()
  oprofile, s390: Fix uninitialized memory access when writing to oprofilefs
  perf/x86: Fix microcode revision check for SNB-PEBS
2012-09-14 17:43:45 -07:00
John Stultz ec145babe7 time: Fix timeekeping_get_ns overflow on 32bit systems
Daniel Lezcano reported seeing multi-second stalls from
keyboard input on his T61 laptop when NOHZ and CPU_IDLE
were enabled on a 32bit kernel.

He bisected the problem down to commit
1e75fa8be9 ("time: Condense timekeeper.xtime into xtime_sec").

After reproducing this issue, I narrowed the problem down
to the fact that timekeeping_get_ns() returns a 64bit
nsec value that hasn't been accumulated. In some cases
this value was being then stored in timespec.tv_nsec
(which is a long).

On 32bit systems, with idle times larger then 4 seconds
(or less, depending on the value of xtime_nsec), the
returned nsec value would overflow 32bits. This limited
kept time from increasing, causing timers to not expire.

The fix is to make sure we don't directly store the
result of timekeeping_get_ns() into a tv_nsec field,
instead using a 64bit nsec value which can then be
added into the timespec via timespec_add_ns().

Reported-and-bisected-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Tested-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Link: http://lkml.kernel.org/r/1347405963-35715-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-13 17:39:14 +02:00
Linus Torvalds 0bd1189e23 Merge branch 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
 "It's later than I'd like but well the timing just didn't work out this
  time.

  There are three bug fixes.  One from before 3.6-rc1 and two from the
  new CPU hotplug code.  Kudos to Lai for discovering all of them and
  providing fixes.

   * Atomicity bug when clearing a flag and setting another.  The two
     operation should have been atomic but wasn't.  This bug has existed
     for a long time but is unlikely to have actually happened.  Fix is
     safe.  Marked for -stable.

   * If CPU hotplug cycles happen back-to-back before workers finish the
     previous cycle, the states could get out of sync and it could get
     stuck.  Fixed by waiting for workers to complete before finishing
     hotplug cycle.

   * While CPU hotplug is in progress, idle workers could be depleted
     which can then lead to deadlock.  I think both happening together
     is highly unlikely but still better to fix it and the fix isn't too
     scary.

  There's another workqueue related regression which reported a few days
  ago:

    https://bugzilla.kernel.org/show_bug.cgi?id=47301

  It's a bit of head scratcher but there is a semi-reliable reproduce
  case, so I'm hoping to resolve it soonish."

* 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: fix possible idle worker depletion across CPU hotplug
  workqueue: restore POOL_MANAGING_WORKERS
  workqueue: fix possible deadlock in idle worker rebinding
  workqueue: move WORKER_REBIND clearing in rebind_workers() to the end of the function
  workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomic
2012-09-12 07:16:54 +08:00
Lai Jiangshan ee378aa49b workqueue: fix possible idle worker depletion across CPU hotplug
To simplify both normal and CPU hotplug paths, worker management is
prevented while CPU hoplug is in progress.  This is achieved by CPU
hotplug holding the same exclusion mechanism used by workers to ensure
there's only one manager per pool.

If someone else seems to be performing the manager role, workers
proceed to execute work items.  CPU hotplug using the same mechanism
can lead to idle worker depletion because all workers could proceed to
execute work items while CPU hotplug is in progress and CPU hotplug
itself wouldn't actually perform the worker management duty - it
doesn't guarantee that there's an idle worker left when it releases
management.

This idle worker depletion, under extreme circumstances, can break
forward-progress guarantee and thus lead to deadlock.

This patch fixes the bug by using separate mechanisms for manager
exclusion among workers and hotplug exclusion.  For manager exclusion,
POOL_MANAGING_WORKERS which was restored by the previous patch is
used.  pool->manager_mutex is now only used for exclusion between the
elected manager and CPU hotplug.  The elected manager won't proceed
without holding pool->manager_mutex.

This ensures that the worker which won the manager position can't skip
managing while CPU hotplug is in progress.  It will block on
manager_mutex and perform management after CPU hotplug is complete.

Note that hotplug may happen while waiting for manager_mutex.  A
manager isn't either on idle or busy list and thus the hoplug code
can't unbind/rebind it.  Make the manager handle its own un/rebinding.

tj: Updated comment and description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-09-10 10:05:54 -07:00
Lai Jiangshan 552a37e936 workqueue: restore POOL_MANAGING_WORKERS
This patch restores POOL_MANAGING_WORKERS which was replaced by
pool->manager_mutex by 6037315269 "workqueue: use mutex for global_cwq
manager exclusion".

There's a subtle idle worker depletion bug across CPU hotplug events
and we need to distinguish an actual manager and CPU hotplug
preventing management.  POOL_MANAGING_WORKERS will be used for the
former and manager_mutex the later.

This patch just lays POOL_MANAGING_WORKERS on top of the existing
manager_mutex and doesn't introduce any synchronization changes.  The
next patch will update it.

Note that this patch fixes a non-critical anomaly where
too_many_workers() may return %true spuriously while CPU hotplug is in
progress.  While the issue could schedule idle timer spuriously, it
didn't trigger any actual misbehavior.

tj: Rewrote patch description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-09-10 10:04:54 -07:00
Tejun Heo ec58815ab0 workqueue: fix possible deadlock in idle worker rebinding
Currently, rebind_workers() and idle_worker_rebind() are two-way
interlocked.  rebind_workers() waits for idle workers to finish
rebinding and rebound idle workers wait for rebind_workers() to finish
rebinding busy workers before proceeding.

Unfortunately, this isn't enough.  The second wait from idle workers
is implemented as follows.

	wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND));

rebind_workers() clears WORKER_REBIND, wakes up the idle workers and
then returns.  If CPU hotplug cycle happens again before one of the
idle workers finishes the above wait_event(), rebind_workers() will
repeat the first part of the handshake - set WORKER_REBIND again and
wait for the idle worker to finish rebinding - and this leads to
deadlock because the idle worker would be waiting for WORKER_REBIND to
clear.

This is fixed by adding another interlocking step at the end -
rebind_workers() now waits for all the idle workers to finish the
above WORKER_REBIND wait before returning.  This ensures that all
rebinding steps are complete on all idle workers before the next
hotplug cycle can happen.

This problem was diagnosed by Lai Jiangshan who also posted a patch to
fix the issue, upon which this patch is based.

This is the minimal fix and further patches are scheduled for the next
merge window to simplify the CPU hotplug path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <1346516916-1991-3-git-send-email-laijs@cn.fujitsu.com>
2012-09-05 16:10:15 -07:00
Tejun Heo 90beca5de5 workqueue: move WORKER_REBIND clearing in rebind_workers() to the end of the function
This doesn't make any functional difference and is purely to help the
next patch to be simpler.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
2012-09-05 16:10:14 -07:00
Lai Jiangshan 96e65306b8 workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomic
The compiler may compile the following code into TWO write/modify
instructions.

	worker->flags &= ~WORKER_UNBOUND;
	worker->flags |= WORKER_REBIND;

so the other CPU may temporarily see worker->flags which doesn't have
either WORKER_UNBOUND or WORKER_REBIND set and perform local wakeup
prematurely.

Fix it by using single explicit assignment via ACCESS_ONCE().

Because idle workers have another WORKER_NOT_RUNNING flag, this bug
doesn't exist for them; however, update it to use the same pattern for
consistency.

tj: Applied the change to idle workers too and updated comments and
    patch description a bit.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
2012-09-04 17:04:45 -07:00
K.Prasad 500ad2d8b0 perf/hwpb: Invoke __perf_event_disable() if interrupts are already disabled
While debugging a warning message on PowerPC while using hardware
breakpoints, it was discovered that when perf_event_disable is invoked
through hw_breakpoint_handler function with interrupts disabled, a
subsequent IPI in the code path would trigger a WARN_ON_ONCE message in
smp_call_function_single function.

This patch calls __perf_event_disable() when interrupts are already
disabled, instead of perf_event_disable().

Reported-by: Edjunior Barbosa Machado <emachado@linux.vnet.ibm.com>
Signed-off-by: K.Prasad <Prasad.Krishnan@gmail.com>
[naveen.n.rao@linux.vnet.ibm.com: v3: Check to make sure we target current task]
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120802081635.5811.17737.stgit@localhost.localdomain
[ Fixed build error on MIPS. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-04 17:29:53 +02:00
Al Viro a6fa941d94 perf_event: Switch to internal refcount, fix race with close()
Don't mess with file refcounts (or keep a reference to file, for
that matter) in perf_event.  Use explicit refcount of its own
instead.  Deal with the race between the final reference to event
going away and new children getting created for it by use of
atomic_long_inc_not_zero() in inherit_event(); just have the
latter free what it had allocated and return NULL, that works
out just fine (children of siblings of something doomed are
created as singletons, same as if the child of leader had been
created and immediately killed).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120820135925.GG23464@ZenIV.linux.org.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-04 17:29:22 +02:00
Randy Dunlap 9450d57eab sched: Fix kernel-doc warnings in kernel/sched/fair.c
Fix two kernel-doc warnings in kernel/sched/fair.c:

  Warning(kernel/sched/fair.c:3660): Excess function parameter 'cpus' description in 'update_sg_lb_stats'
  Warning(kernel/sched/fair.c:3806): Excess function parameter 'cpus' description in 'update_sd_lb_stats'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/50303714.3090204@xenotime.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-04 14:30:49 +02:00
Peter Boonstoppel a4c96ae319 sched: Unthrottle rt runqueues in __disable_runtime()
migrate_tasks() uses _pick_next_task_rt() to get tasks from the
real-time runqueues to be migrated. When rt_rq is throttled
_pick_next_task_rt() won't return anything, in which case
migrate_tasks() can't move all threads over and gets stuck in an
infinite loop.

Instead unthrottle rt runqueues before migrating tasks.

Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()

Signed-off-by: Peter Boonstoppel <pboonstoppel@nvidia.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/5FBF8E85CA34454794F0F7ECBA79798F379D3648B7@HQMAIL04.nvidia.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-04 14:30:30 +02:00
Charles Wang 749c8814f0 sched: Add missing call to calc_load_exit_idle()
Azat Khuzhin reported high loadavg in Linux v3.6

After checking the upstream scheduler code, I found Peter's commit:

  5167e8d541 sched/nohz: Rewrite and fix load-avg computation -- again

not fully applied, missing the call to calc_load_exit_idle().

After that idle exit in sampling window will always be calculated
to non-idle, and the load will be higher than normal.

This patch adds the missing call to calc_load_exit_idle().

Signed-off-by: Charles Wang <muming.wq@taobao.com>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1345449754-27130-1-git-send-email-muming.wq@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-04 14:30:29 +02:00
Peter Zijlstra f319da0c68 sched: Fix load avg vs cpu-hotplug
Rabik and Paul reported two different issues related to the same few
lines of code.

Rabik's issue is that the nr_uninterruptible migration code is wrong in
that he sees artifacts due to this (Rabik please do expand in more
detail).

Paul's issue is that this code as it stands relies on us using
stop_machine() for unplug, we all would like to remove this assumption
so that eventually we can remove this stop_machine() usage altogether.

The only reason we'd have to migrate nr_uninterruptible is so that we
could use for_each_online_cpu() loops in favour of
for_each_possible_cpu() loops, however since nr_uninterruptible() is the
only such loop and its using possible lets not bother at all.

The problem Rabik sees is (probably) caused by the fact that by
migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
involved.

So don't bother with fancy migration schemes (meaning we now have to
keep using for_each_possible_cpu()) and instead fold any nr_active delta
after we migrate all tasks away to make sure we don't have any skewed
nr_active accounting.

Reported-by: Rakib Mullick <rakib.mullick@gmail.com>
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-04 14:30:18 +02:00
John Stultz cee58483cf time: Move ktime_t overflow checking into timespec_valid_strict
Andreas Bombe reported that the added ktime_t overflow checking added to
timespec_valid in commit 4e8b14526c ("time: Improve sanity checking of
timekeeping inputs") was causing problems with X.org because it caused
timeouts larger then KTIME_T to be invalid.

Previously, these large timeouts would be clamped to KTIME_MAX and would
never expire, which is valid.

This patch splits the ktime_t overflow checking into a new
timespec_valid_strict function, and converts the timekeeping codes
internal checking to use this more strict function.

Reported-and-tested-by: Andreas Bombe <aeb@debian.org>
Cc: Zhouping Liu <zliu@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-01 10:24:48 -07:00
Linus Torvalds 7ca63ee1b0 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "This tree contains misc fixlets: a perf script python binding fix, a
  uprobes fix and a syscall tracing fix."

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf tools: Add missing files to build the python binding
  uprobes: Fix mmap_region()'s mm->mm_rb corruption if uprobe_mmap() fails
  tracing/syscalls: Fix perf syscall tracing when syscall_nr == -1
2012-08-23 21:48:41 -07:00
Linus Torvalds b5bc0c7054 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "Mostly small fixes for the fallout of the timekeeping overhaul in 3.6
  along with stable fixes to address an accumulation problem and missing
  sanity checks for RTC readouts and user space provided values."

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time: Avoid making adjustments if we haven't accumulated anything
  time: Avoid potential shift overflow with large shift values
  time: Fix casting issue in timekeeping_forward_now
  time: Ensure we normalize the timekeeper in tk_xtime_add
  time: Improve sanity checking of timekeeping inputs
2012-08-23 21:46:57 -07:00
John Stultz bf2ac31219 time: Avoid making adjustments if we haven't accumulated anything
If update_wall_time() is called and the current offset isn't large
enough to accumulate, avoid re-calling timekeeping_adjust which may
change the clock freq and can cause 1ns inconsistencies with
CLOCK_REALTIME_COARSE/CLOCK_MONOTONIC_COARSE.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1345595449-34965-5-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-22 10:42:13 +02:00
John Stultz 6ea565a9be time: Avoid potential shift overflow with large shift values
Andreas Schwab noticed that the 1 << tk->shift could overflow if the
shift value was greater than 30, since 1 would be a 32bit long on
32bit architectures. This issue was introduced by 1e75fa8be (time:
Condense timekeeper.xtime into xtime_sec)

Use 1ULL instead to ensure we don't overflow on the shift.

Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1345595449-34965-4-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-22 10:42:13 +02:00
Andreas Schwab 85dc8f05c9 time: Fix casting issue in timekeeping_forward_now
arch_gettimeoffset returns a u32 value which when shifted by tk->shift
can overflow. This issue was introduced with 1e75fa8be (time: Condense
timekeeper.xtime into xtime_sec)

Cast it to u64 first.

Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1345595449-34965-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-22 10:42:13 +02:00
John Stultz 784ffcbb96 time: Ensure we normalize the timekeeper in tk_xtime_add
Andreas noticed problems with resume on specific hardware after commit
1e75fa8b (time: Condense timekeeper.xtime into xtime_sec) combined
with commit b44d50dca (time: Fix casting issue in tk_set_xtime and
tk_xtime_add)

After some digging I realized we aren't normalizing the timekeeper
after the add. Add the missing normalize call.

Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Tested-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1345595449-34965-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-22 10:42:12 +02:00
Linus Torvalds 1456c75a80 Merge branch 'audit-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull audit-tree fixes from Miklos Szeredi:
 "The audit subsystem maintainers (Al and Eric) are not responding to
  repeated resends.  Eric did ack them a while ago, but no response
  since then.  So I'm sending these directly to you."

* 'audit-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  audit: clean up refcounting in audit-tree
  audit: fix refcounting in audit-tree
  audit: don't free_chunk() after fsnotify_add_mark()
2012-08-21 12:25:24 -07:00
Eric Dumazet f341861fb0 task_work: add a scheduling point in task_work_run()
It seems commit 4a9d4b024a ("switch fput to task_work_add") re-
introduced the problem addressed in 944be0b224 ("close_files(): add
scheduling point")

If a server process with a lot of files (say 2 million tcp sockets) is
killed, we can spend a lot of time in task_work_run() and trigger a soft
lockup.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21 09:11:44 -07:00
Ingo Molnar 5c65ca7520 Merge branch 'tip/perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent
Pull syscall tracing fix from Steve Rostedt.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-21 11:49:37 +02:00
Oleg Nesterov c7a3a88c93 uprobes: Fix mmap_region()'s mm->mm_rb corruption if uprobe_mmap() fails
This patch fixes:

  https://bugzilla.redhat.com/show_bug.cgi?id=843640

If mmap_region()->uprobe_mmap() fails, unmap_and_free_vma path
does unmap_region() but does not remove the soon-to-be-freed vma
from rb tree. Actually there are more problems but this is how
William noticed this bug.

Perhaps we could do do_munmap() + return in this case, but in
fact it is simply wrong to abort if uprobe_mmap() fails. Until
at least we move the !UPROBE_COPY_INSN code from
install_breakpoint() to uprobe_register().

For example, uprobe_mmap()->install_breakpoint() can fail if the
probed insn is not supported (remember, uprobe_register()
succeeds if nobody mmaps inode/offset), mmap() should not fail
in this case.

dup_mmap()->uprobe_mmap() is wrong too by the same reason,
fork() can race with uprobe_register() and fail for no reason if
it wins the race and does install_breakpoint() first.

And, if nothing else, both mmap_region() and dup_mmap() return
success if uprobe_mmap() fails. Change them to ignore the error
code from uprobe_mmap().

Reported-and-tested-by: William Cohen <wcohen@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> # v3.5
Cc: Anton Arapov <anton@redhat.com>
Cc: William Cohen <wcohen@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120819171042.GB26957@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-21 11:48:12 +02:00
Linus Torvalds 53795ced6e Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar.

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix migration thread runtime bogosity
  sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttled
  sched,cgroup: Fix up task_groups list
  sched: fix divide by zero at {thread_group,task}_times
  sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies
2012-08-20 10:35:05 -07:00
Linus Torvalds 90785be317 Merge branch 'alpha' (alpha architecture patches)
Merge alpha architecture update from Michael Cree:
 "The Alpha Maintainer, Matt Turner, is currently unavailable, so I have
  collected up patches that have been posted to the linux-alpha mailing
  list over the last couple of months, and are forwarding them to you in
  the hope that you are prepared to accept them via me.

  The patches by Al Viro and myself I have been running against kernels
  for two months now so have had quite a bit of testing.  All except one
  patch were intended for the 3.5 kernel but because of Matt's
  unavailability never got forwarded to you."

* emailed patches from Michael Cree <mcree@orcon.net.nz>: (9 commits)
  alpha: Fix fall-out from disintegrating asm/system.h
  Redefine ATOMIC_INIT and ATOMIC64_INIT to drop the casts
  alpha: fix fpu.h usage in userspace
  alpha/mm/fault.c: Port OOM changes to do_page_fault
  alpha: take kernel_execve() out of entry.S
  alpha: take a bunch of syscalls into osf_sys.c
  alpha: Use new generic strncpy_from_user() and strnlen_user()
  alpha: Wire up cross memory attach syscalls
  alpha: Don't export SOCK_NONBLOCK to user space.
2012-08-19 08:41:29 -07:00
Al Viro be53db6e4e alpha: take a bunch of syscalls into osf_sys.c
New helper: current_thread_info().  Allows to do a bunch of odd syscalls
in C. While we are at it, there had never been a reason to do
osf_getpriority() in assembler.  We also get "namespace"-aware (read:
consistent with getuid(2), etc.) behaviour from getx?id() syscalls now.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Michael Cree <mcree@orcon.net.nz>
Acked-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-19 08:41:19 -07:00
Will Deacon 60916a9382 tracing/syscalls: Fix perf syscall tracing when syscall_nr == -1
syscall_get_nr can return -1 in the case that the task is not executing
a system call.

This patch fixes perf_syscall_{enter,exit} to check that the syscall
number is valid before using it as an index into a bitmap.

Link: http://lkml.kernel.org/r/1345137254-7377-1-git-send-email-will.deacon@arm.com

Cc: Jason Baron <jbaron@redhat.com>
Cc: Wade Farnsworth <wade_farnsworth@mentor.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-08-17 15:19:46 -04:00
John Stultz 4e8b14526c time: Improve sanity checking of timekeeping inputs
Unexpected behavior could occur if the time is set to a value large
enough to overflow a 64bit ktime_t (which is something larger then the
year 2262).

Also unexpected behavior could occur if large negative offsets are
injected via adjtimex.

So this patch improves the sanity check timekeeping inputs by
improving the timespec_valid() check, and then makes better use of
timespec_valid() to make sure we don't set the time to an invalid
negative value or one that overflows ktime_t.

Note: This does not protect from setting the time close to overflowing
ktime_t and then letting natural accumulation cause the overflow.

Reported-by: CAI Qian <caiqian@redhat.com>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Zhouping Liu <zliu@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1344454580-17031-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-15 15:54:01 +02:00
Miklos Szeredi b3e8692b4d audit: clean up refcounting in audit-tree
Drop the initial reference by fsnotify_init_mark early instead of
audit_tree_freeing_mark() at destroy time.

In the cases we destroy the mark before we drop the initial reference we need to
get rid of the get_mark that balances the put_mark in audit_tree_freeing_mark().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2012-08-15 12:55:22 +02:00
Miklos Szeredi a2140fc0cb audit: fix refcounting in audit-tree
Refcounting of fsnotify_mark in audit tree is broken.  E.g:

                              refcount
create_chunk
  alloc_chunk                 1
  fsnotify_add_mark           2

untag_chunk
  fsnotify_get_mark           3
  fsnotify_destroy_mark
    audit_tree_freeing_mark   2
  fsnotify_put_mark           1
  fsnotify_put_mark           0
  via destroy_list
    fsnotify_mark_destroy    -1

This was reported by various people as triggering Oops when stopping auditd.

We could just remove the put_mark from audit_tree_freeing_mark() but that would
break freeing via inode destruction.  So this patch simply omits a put_mark
after calling destroy_mark or adds a get_mark before.

The additional get_mark is necessary where there's no other put_mark after
fsnotify_destroy_mark() since it assumes that the caller is holding a reference
(or the inode is keeping the mark pinned, not the case here AFAICS).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reported-by: Valentin Avram <aval13@gmail.com>
Reported-by: Peter Moody <pmoody@google.com>
Acked-by: Eric Paris <eparis@redhat.com>
CC: stable@vger.kernel.org
2012-08-15 12:55:22 +02:00
Miklos Szeredi 0fe33aae0e audit: don't free_chunk() after fsnotify_add_mark()
Don't do free_chunk() after fsnotify_add_mark().  That one does a delayed unref
via the destroy list and this results in use-after-free.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Eric Paris <eparis@redhat.com>
CC: stable@vger.kernel.org
2012-08-15 12:55:22 +02:00
Mike Galbraith 8f6189684e sched: Fix migration thread runtime bogosity
Make stop scheduler class do the same accounting as other classes,

Migration threads can be caught in the act while doing exec balancing,
leading to the below due to use of unmaintained ->se.exec_start.  The
load that triggered this particular instance was an apparently out of
control heavily threaded application that does system monitoring in
what equated to an exec bomb, with one of the VERY frequently migrated
tasks being ps.

%CPU   PID USER     CMD
99.3    45 root     [migration/10]
97.7    53 root     [migration/12]
97.0    57 root     [migration/13]
90.1    49 root     [migration/11]
89.6    65 root     [migration/15]
88.7    17 root     [migration/3]
80.4    37 root     [migration/8]
78.1    41 root     [migration/9]
44.2    13 root     [migration/2]

Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344051854.6739.19.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13 18:41:55 +02:00
Mike Galbraith e221d028bb sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttled
Root task group bandwidth replenishment must service all CPUs, regardless of
where the timer was last started, and regardless of the isolation mechanism,
lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13 18:41:55 +02:00
Mike Galbraith 35cf4e50b1 sched,cgroup: Fix up task_groups list
With multiple instances of task_groups, for_each_rt_rq() is a noop,
no task groups having been added to the rt.c list instance.  This
renders __enable/disable_runtime() and print_rt_stats() noop, the
user (non) visible effect being that rt task groups are missing in
/proc/sched_debug.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: stable@kernel.org # v3.3+
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13 18:41:54 +02:00
Stanislaw Gruszka bea6832cc8 sched: fix divide by zero at {thread_group,task}_times
On architectures where cputime_t is 64 bit type, is possible to trigger
divide by zero on do_div(temp, (__force u32) total) line, if total is a
non zero number but has lower 32 bit's zeroed. Removing casting is not
a good solution since some do_div() implementations do cast to u32
internally.

This problem can be triggered in practice on very long lived processes:

  PID: 2331   TASK: ffff880472814b00  CPU: 2   COMMAND: "oraagent.bin"
   #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
   #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
   #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
   #3 [ffff880472a51cd0] die at ffffffff8100f26b
   #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
   #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
   #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
      [exception RIP: thread_group_times+0x56]
      RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
      RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
      RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
      RBP: ffff880472a51ef8   R8: 00000000b10a3a64   R9: ffff880874150800
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
      R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
   #8 [ffff880472a51f40] sys_times at ffffffff81088524
   #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
      RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
      RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
      RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
      RBP: 00007fcba27ab700   R8: 0000000000000020   R9: 000000000000091b
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
      R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
      ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b

Cc: stable@vger.kernel.org
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13 18:41:54 +02:00
Peter Zijlstra a35b6466aa sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies
Peter Portante reported that for large cgroup hierarchies (and or on
large CPU counts) we get immense lock contention on rq->lock and stuff
stops working properly.

His workload was a ton of processes, each in their own cgroup,
everybody idling except for a sporadic wakeup once every so often.

It was found that:

  schedule()
    idle_balance()
      load_balance()
        local_irq_save()
        double_rq_lock()
        update_h_load()
          walk_tg_tree(tg_load_down)
            tg_load_down()

Results in an entire cgroup hierarchy walk under rq->lock for every
new-idle balance and since new-idle balance isn't throttled this
results in a lot of work while holding the rq->lock.

This patch does two things, it removes the work from under rq->lock
based on the good principle of race and pray which is widely employed
in the load-balancer as a whole. And secondly it throttles the
update_h_load() calculation to max once per jiffy.

I considered excluding update_h_load() for new-idle balance
all-together, but purely relying on regular balance passes to update
this data might not work out under some rare circumstances where the
new-idle busiest isn't the regular busiest for a while (unlikely, but
a nightmare to debug if someone hits it and suffers).

Cc: pjt@google.com
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Reported-by: Peter Portante <pportant@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13 18:41:54 +02:00
Paul E. McKenney 62ab707247 rcu: Use smp_hotplug_thread facility for RCUs per-CPU kthread
Bring RCU into the new-age CPU-hotplug fold by modifying RCU's per-CPU
kthread code to use the new smp_hotplug_thread facility.

[ tglx: Adapted it to use callbacks and to the simplified rcu yield ]
    
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20120716103948.673354828@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13 17:01:08 +02:00
Thomas Gleixner bcd951cf10 watchdog: Use hotplug thread infrastructure
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20120716103948.563736676@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13 17:01:07 +02:00
Thomas Gleixner 3e339b5dae softirq: Use hotplug thread infrastructure
[ paulmck: Call rcu_note_context_switch() with interrupts enabled. ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20120716103948.456416747@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13 17:01:07 +02:00
Paul E. McKenney 3180d89b47 hotplug: Fix UP bug in smpboot hotplug code
Because kernel subsystems need their per-CPU kthreads on UP systems as
well as on SMP systems, the smpboot hotplug kthread functions must be
provided in UP builds as well as in SMP builds.  This commit therefore
adds smpboot.c to UP builds and excludes irrelevant code via #ifdef.
    
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13 17:01:07 +02:00
Thomas Gleixner f97f8f06a4 smpboot: Provide infrastructure for percpu hotplug threads
Provide a generic interface for setting up and tearing down percpu
threads.

On registration the threads for already online cpus are created and
started. On deregistration (modules) the threads are stoppped.

During hotplug operations the threads are created, started, parked and
unparked. The datastructure for registration provides a pointer to
percpu storage space and optional setup, cleanup, park, unpark
functions. These functions are called when the thread state changes.

Each implementation has to provide a function which is queried and
returns whether the thread should run and the thread function itself.

The core code handles all state transitions and avoids duplicated code
in the call sites.

[ paulmck: Preemption leak fix ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20120716103948.352501068@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13 17:01:07 +02:00
Thomas Gleixner 2a1d446019 kthread: Implement park/unpark facility
To avoid the full teardown/setup of per cpu kthreads in the case of
cpu hot(un)plug, provide a facility which allows to put the kthread
into a park position and unpark it when the cpu comes online again.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120716103948.236618824@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13 17:01:06 +02:00
Thomas Gleixner 5d01bbd111 rcu: Yield simpler
The rcu_yield() code is amazing. It's there to avoid starvation of the
system when lots of (boosting) work is to be done.

Now looking at the code it's functionality is:

 Make the thread SCHED_OTHER and very nice, i.e. get it out of the way
 Arm a timer with 2 ticks
 schedule()

Now if the system goes idle the rcu task returns, regains SCHED_FIFO
and plugs on. If the systems stays busy the timer fires and wakes a
per node kthread which in turn makes the per cpu thread SCHED_FIFO and
brings it back on the cpu. For the boosting thread the "make it FIFO"
bit is missing and it just runs some magic boost checks. Now this is a
lot of code with extra threads and complexity.

It's way simpler to let the tasks when they detect overload schedule
away for 2 ticks and defer the normal wakeup as long as they are in
yielded state and the cpu is not idle.

That solves the same problem and the only difference is that when the
cpu goes idle it's not guaranteed that the thread returns right away,
but it won't be longer out than two ticks, so no harm is done. If
that's an issue than it is way simpler just to wake the task from
idle as RCU has callbacks there anyway.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120716103948.131256723@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13 17:01:06 +02:00
Linus Torvalds e4e139bebd Power management fixes for 3.6-rc2
* Fix for two recent regressions in the generic PM domains framework.
 * Revert of a commit that introduced a resume regression and is conceptually
   incorrect in my opinion.
 * Fix for a return value in pcc-cpufreq.c from Julia Lawall.
 * RTC wakeup signaling fix from Neil Brown.
 * Suppression of compiler warnings for CONFIG_PM_SLEEP unset in ACPI,
   platform/x86 and TPM drivers.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIcBAABAgAGBQJQJWxZAAoJEKhOf7ml8uNs1EcP/ApgCk1SfMo779Lcq8OQVVqq
 2jbtoqnsuPMs/rl4VrW1adJspEkWb39KgE5XIlfg6tIKm5nuIauFtJEGskMq00w7
 8PT7bQOSJdLKIOjsBEUugUtp+HZO0iUuGahciQf4V11eOAZKODqxtomL8Ry2mY3P
 gDohYBa3J+xnkvRqKUY0k0OkSNDDlI3+y+WPr+tamjDzT5uqjWLR9LJ1+1eGtmou
 6DrgjD3eOus/r53OXKlNldXc9HbzVdnmoZwMNtswlNTaCL7HkdpRnPClSWt+NvVi
 cOviJ6F4d6FRmYRFvatFEaXmSAfpB9v/dt1C9VYtoLyZsZWs1sRGd/bxgCofYWnE
 GZckKl8pI80u14345P9R+QF3CculV2itfbKBiXxWunmOeokBYIz5sWdTh4mNg/vy
 VZdeO9jJy2542aF8P9Up9EE3IjkrEz7gEL0Sv4hfmEoHI1jKJDdAn/9/lmfrujPh
 e3vpBeqlBmSTU0rKj97x/G8zwWhPscqJDPkDUEEe+wfS3oPvhymYesV1bF7OCNwr
 WMMcFoDuSRzZ1lvEY7w4IWAKRDCqjaJ1kkBZvzoOEIC4gi4i3pAehpYEZMNFtFrf
 RB2z5Jx1Z1w0LOgcz69TTMY274kZ8N/v7/SVUBk5+tSs1VNHo/p+WYGqW/8ExSvH
 D4H8kQvz8uBK23g7ekVR
 =lo6A
 -----END PGP SIGNATURE-----

Merge tag 'pm-for-3.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael J. Wysocki:

 - Fix for two recent regressions in the generic PM domains framework.

 - Revert of a commit that introduced a resume regression and is
   conceptually incorrect in my opinion.

 - Fix for a return value in pcc-cpufreq.c from Julia Lawall.

 - RTC wakeup signaling fix from Neil Brown.

 - Suppression of compiler warnings for CONFIG_PM_SLEEP unset in ACPI,
   platform/x86 and TPM drivers.

* tag 'pm-for-3.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  tpm_tis / PM: Fix unused function warning for CONFIG_PM_SLEEP
  platform / x86 / PM: Fix unused function warnings for CONFIG_PM_SLEEP
  ACPI / PM: Fix unused function warnings for CONFIG_PM_SLEEP
  Revert "NMI watchdog: fix for lockup detector breakage on resume"
  PM: Make dev_pm_get_subsys_data() always return 0 on success
  drivers/cpufreq/pcc-cpufreq.c: fix error return code
  RTC: Avoid races between RTC alarm wakeup and suspend.
2012-08-12 21:34:09 +03:00
Jeff Mahoney e3756477ae printk: Fix calculation of length used to discard records
While tracking down a weird buffer overflow issue in a program that
looked to be sane, I started double checking the length returned by
syslog(SYSLOG_ACTION_READ_ALL, ...) to make sure it wasn't overflowing
the buffer.

Sure enough, it was.  I saw this in strace:

  11339 syslog(SYSLOG_ACTION_READ_ALL, "<5>[244017.708129] REISERFS (dev"..., 8192) = 8279

It turns out that the loops that calculate how much space the entries
will take when they're copied don't include the newlines and prefixes
that will be included in the final output since prev flags is passed as
zero.

This patch properly accounts for it and fixes the overflow.

CC: stable@kernel.org
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-12 21:25:50 +03:00
Rafael J. Wysocki 300d3739e8 Revert "NMI watchdog: fix for lockup detector breakage on resume"
Revert commit 45226e9 (NMI watchdog: fix for lockup detector breakage
on resume) which breaks resume from system suspend on my SH7372
Mackerel board (by causing a NULL pointer dereference to happen) and
is generally wrong, because it abuses the CPU hotplug functionality
in a shamelessly blatant way.

The original issue should be addressed through appropriate syscore
resume callback instead.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-08-08 20:49:45 +02:00
Ingo Molnar 1d17d17484 time: Fix adjustment cleanup bug in timekeeping_adjust()
Tetsuo Handa reported that sporadically the system clock starts
counting up too quickly which is enough to confuse the hangcheck
timer to print a bogus stall warning.

Commit 2a8c0883 "time: Move xtime_nsec adjustment underflow handling
timekeeping_adjust" overlooked this exit path:

        } else
                return;

which should really be a proper exit sequence, fixing the bug as a
side effect.

Also make the flow more readable by properly balancing curly
braces.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: john.stultz@linaro.org
Cc: a.p.zijlstra@chello.nl
Cc: richardcochran@gmail.com
Cc: prarit@redhat.com
Link: http://lkml.kernel.org/r/20120804192114.GA28347@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-05 12:37:14 +02:00
Linus Torvalds c4e62d6785 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull futex fixes from Ingo Molnar:
 "A couple of futex fixes from Darren Hart: two bugs reported by Dave
  Jones (found with his trinity test) and Dan Carpenter through static
  analysis.  The third found while debugging the first two."

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Forbid uaddr == uaddr2 in futex_wait_requeue_pi()
  futex: Fix bug in WARN_ON for NULL q.pi_state
  futex: Test for pi_mutex on fault in futex_wait_requeue_pi()
2012-08-03 11:00:26 -07:00
Linus Torvalds ddc5057c1c Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
 "One regression fix, and a couple of cleanups that clean up the code
  flow in areas that had high-profile bugs recently."

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time: Remove all direct references to timekeeper
  time: Clean up offs_real/wall_to_mono and offs_boot/total_sleep_time updates
  time: Clean up stray newlines
  time/jiffies: Rename ACTHZ to SHIFTED_HZ
  time/jiffies: Allow CLOCK_TICK_RATE to be undefined
  time: Fix casting issue in tk_set_xtime and tk_xtime_add
2012-08-03 10:58:57 -07:00
Linus Torvalds fcc1d2a9ce Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Fixes and two late cleanups"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/cleanups: Add load balance cpumask pointer to 'struct lb_env'
  sched: Fix comment about PREEMPT_ACTIVE bit location
  sched: Fix minor code style issues
  sched: Use task_rq_unlock() in __sched_setscheduler()
  sched/numa: Add SD_PERFER_SIBLING to CPU domain
2012-08-03 10:58:13 -07:00
Linus Torvalds bd463a0606 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Fix merge window fallout and fix sleep profiling (this was always
  broken, so it's not a fix for the merge window - we can skip this one
  from the head of the tree)."

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/trace: Add ability to set a target task for events
  perf/x86: Fix USER/KERNEL tagging of samples properly
  perf/x86/intel/uncore: Make UNCORE_PMU_HRTIMER_INTERVAL 64-bit
2012-08-03 10:57:20 -07:00
Linus Torvalds 148311d2ad Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Ingo Molnar.

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Allow irq chips to mark themself oneshot safe
2012-08-03 10:56:44 -07:00
Linus Torvalds d97e1dcde5 KGDB/KDB/usb-dbgp fixes and cleanups
usb-dbgp - increase the controller wait time to come out of halt.
    kdb - Remove unused KDB_FLAG_ONLY_DO_DUMP code and cpu in more prompt
    debug core - pass NMI type on archs that provide NMI types
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJQG8wyAAoJEIciOldedpOjN2oP/ipaQSLnvoKUhutFl/qL2239
 mMsxh9ga9rfKuCujpSkHZUwjo3VX7put7cnhVwETd4y2gN2YWPYg4OIKt+Y0AhNe
 4NzHwB+lm6iGE33Q1x4uEHBH5aWLzWcOM/9n4avwY2DjtDfpecki5ChP/CVHK8qU
 VVF2PfY8nbxcEonCbP1b/0KaD3xrPqwgZ70HdFi5eUuXiBajAyp9c9zqVUWJ6j+H
 r+2PVkzn9NxRCkyq3tzK5gYk5SzoJPClkpB67CWugG35MiFLkz2csJNztFxtaInZ
 t8HLkMTVLdCgLZqnw/ZEVWfqQA+q6N5NS6zs9j0siLg5HbEb+UtCebPwpChdQrmh
 Sol+0vmT9Hi4Jm6onhDnQYchaDI7gMhynUC9sWAPhtSHS9e7D9c5IBLHQd3YbOHK
 c8ELzxduszw8+jaiDJStkWM+tbQzJXD9bT1KpLVJd8t9BKAmuBX7ETSD+eKtjynJ
 SgywSfVOdxXzEMvRWqeK3qgkLJYCWCfFsc+75hzJl18dRoM3NDyuOxKyOLWF9tFV
 QUjaCvndIFz7CgM7FTToJZbACqxFHRGh4UUXHiXPUBPXvE5Zt34n4VsxAXYpJc95
 1by/TBcYKWPd3hsOJOh1qMeKXD6TEwN/eEmsjBfnHTp8EBcgRtvue8ZHPMcfYnn/
 Iauy66b/nHJLRsi1gWrn
 =LAWq
 -----END PGP SIGNATURE-----

Merge tag 'for_linux-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb

Pull KGDB/KDB/usb-dbgp fixes and cleanups from Jason Wessel:
 "There are no new features, those will be delayed to the 3.7 window.
  There are only fixes/cleanup against the usual kernel churn and we are
  removing more lines than we add:

   - usb-dbgp - increase the controller wait time to come out of halt.
   - kdb - Remove unused KDB_FLAG_ONLY_DO_DUMP code and cpu in more prompt
   - debug core - pass NMI type on archs that provide NMI types"

* tag 'for_linux-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
  USB: echi-dbgp: increase the controller wait time to come out of halt.
  kernel/debug: Make use of KGDB_REASON_NMI
  kdb: Remove cpu from the more prompt
  kdb: Remove unused KDB_FLAG_ONLY_DO_DUMP
2012-08-03 10:53:47 -07:00
Linus Torvalds a0e881b7c1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second vfs pile from Al Viro:
 "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
  deadlock reproduced by xfstests 068), symlink and hardlink restriction
  patches, plus assorted cleanups and fixes.

  Note that another fsfreeze deadlock (emergency thaw one) is *not*
  dealt with - the series by Fernando conflicts a lot with Jan's, breaks
  userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
  for massive vfsmount leak; this is going to be handled next cycle.
  There probably will be another pull request, but that stuff won't be
  in it."

Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
  delousing target_core_file a bit
  Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
  fs: Remove old freezing mechanism
  ext2: Implement freezing
  btrfs: Convert to new freezing mechanism
  nilfs2: Convert to new freezing mechanism
  ntfs: Convert to new freezing mechanism
  fuse: Convert to new freezing mechanism
  gfs2: Convert to new freezing mechanism
  ocfs2: Convert to new freezing mechanism
  xfs: Convert to new freezing code
  ext4: Convert to new freezing mechanism
  fs: Protect write paths by sb_start_write - sb_end_write
  fs: Skip atime update on frozen filesystem
  fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
  fs: Improve filesystem freezing handling
  switch the protection of percpu_counter list to spinlock
  nfsd: Push mnt_want_write() outside of i_mutex
  btrfs: Push mnt_want_write() outside of i_mutex
  fat: Push mnt_want_write() outside of i_mutex
  ...
2012-08-01 10:26:23 -07:00
Linus Torvalds 2d53492620 irqdomain changes for Linux v3.6
Round of refactoring and enhancements to irq_domain infrastructure. This
 series starts the process of simplifying irqdomain. The ultimate goal is
 to merge LEGACY, LINEAR and TREE mappings into a single system, but had
 to back off from that after some last minute bugs. Instead it mainly
 reorganizes the code and ensures that the reverse map gets populated
 when the irq is mapped instead of the first time it is looked up.
 
 Merging of the irq_domain types is deferred to v3.7
 
 In other news, this series adds helpers for creating static mappings on
 a linear or tree mapping.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJQGKKkAAoJEEFnBt12D9kB59gQAJnTjrihej1tr0OEkffIthGK
 RyVI/DMo0jMgLs4K/rIo3Y+PdTSsNYd8x4R7ln8O7rNRQn8W6jE6NQgoMh51EvNc
 FAltmTsBldq6hUNuz2FEnbmojBP4QklTzL8bAiXtX5EufWQsgMsP4guOuHXLCjEV
 CkWYVk/slXEWJ8yYJc6GKVRvL+CNeiXVCTcOsYA0CI3ofN7O0rd+YAL314CRllIc
 e5uARbWM+s9FJ/eXwCZP4+3jCmdI/CHJb284WldMc/mBD8Rbiqpb4kH6AZI+TH2O
 CyiNEPWs6FG5eJPTID7HrOarXGzwYq/pvv8iG7Mh8NiKSae1C1HdkHelCjbLQ+pU
 POya0fWF1Gvzlmw0gHik86dqaKjwb29btjj7SFg8KnQExWn2ifhsY70mM9wCTo3s
 cwcQlssDIsARE83nttTFCoV/iAWh9AvTxafrXu/+9OKTjpsYlC8kgzdVjq5aAxON
 JaAUK1OduTWRsd1TabKlh6naRXr9nRcLKikwKri2oYVKkj97wahBuib4ffzAcNqz
 VklRBxTH6M+dz/t5NpcVyLXJpqzTN++QNdTAmeQG6LOnHJL4tpFTsx5sMa7ghmzX
 LNpmp/AkVfP0MT7Drf0FUUx6iFA7sjANYzcepUVDrPGKHx0E3LyqbG5JKcC5LgM6
 +UIoKAktF3vY7pdZJL9z
 =ZUF/
 -----END PGP SIGNATURE-----

Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6

Pull irqdomain changes from Grant Likely:
 "Round of refactoring and enhancements to irq_domain infrastructure.
  This series starts the process of simplifying irqdomain.  The ultimate
  goal is to merge LEGACY, LINEAR and TREE mappings into a single
  system, but had to back off from that after some last minute bugs.
  Instead it mainly reorganizes the code and ensures that the reverse
  map gets populated when the irq is mapped instead of the first time it
  is looked up.

  Merging of the irq_domain types is deferred to v3.7

  In other news, this series adds helpers for creating static mappings
  on a linear or tree mapping."

* tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6:
  irqdomain: Improve diagnostics when a domain mapping fails
  irqdomain: eliminate slow-path revmap lookups
  irqdomain: Fix irq_create_direct_mapping() to test irq_domain type.
  irqdomain: Eliminate dedicated radix lookup functions
  irqdomain: Support for static IRQ mapping and association.
  irqdomain: Always update revmap when setting up a virq
  irqdomain: Split disassociating code into separate function
  irq_domain: correct a minor wrong comment for linear revmap
  irq_domain: Standardise legacy/linear domain selection
  irqdomain: Make ops->map hook optional
  irqdomain: Remove unnecessary test for IRQ_DOMAIN_MAP_LEGACY
  irqdomain: Simple NUMA awareness.
  devicetree: add helper inline for retrieving a node's full name
2012-07-31 20:44:03 -07:00
Linus Torvalds ac694dbdbc Merge branch 'akpm' (Andrew's patch-bomb)
Merge Andrew's second set of patches:
 - MM
 - a few random fixes
 - a couple of RTC leftovers

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
  rtc/rtc-88pm80x: remove unneed devm_kfree
  rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
  mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
  tmpfs: distribute interleave better across nodes
  mm: remove redundant initialization
  mm: warn if pg_data_t isn't initialized with zero
  mips: zero out pg_data_t when it's allocated
  memcg: gix memory accounting scalability in shrink_page_list
  mm/sparse: remove index_init_lock
  mm/sparse: more checks on mem_section number
  mm/sparse: optimize sparse_index_alloc
  memcg: add mem_cgroup_from_css() helper
  memcg: further prevent OOM with too many dirty pages
  memcg: prevent OOM with too many dirty pages
  mm: mmu_notifier: fix freed page still mapped in secondary MMU
  mm: memcg: only check anon swapin page charges for swap cache
  mm: memcg: only check swap cache pages for repeated charging
  mm: memcg: split swapin charge function into private and public part
  mm: memcg: remove needless !mm fixup to init_mm when charging
  mm: memcg: remove unneeded shmem charge type
  ...
2012-07-31 19:25:39 -07:00
Linus Torvalds 3e9a97082f This patch series contains a major revamp of how we collect entropy
from interrupts for /dev/random and /dev/urandom.  The goal is to
 addresses weaknesses discussed in the paper "Mining your Ps and Qs:
 Detection of Widespread Weak Keys in Network Devices", by Nadia
 Heninger, Zakir Durumeric, Eric Wustrow, J. Alex Halderman, which will
 be published in the Proceedings of the 21st Usenix Security Symposium,
 August 2012.  (See https://factorable.net for more information and an
 extended version of the paper.)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQF/0DAAoJENNvdpvBGATwIowQAOep9QKtLrBvb2lwIRVmeiy8
 lRf7V/tYZnz4FePbR0W92JQfKYkCV8yyOO0bmeRzWL3v4m+lRwDTSyA1DDyQMoH+
 LOMzvDKSLJMSXTXdSOIr1WYACphViCR/9CrbMBCKSkYfZLJ1MdaEDxT3rcpTGD0T
 6iknUweiSkHHhkerU5yQL7FKzD5kYUe0hsF47w7QVlHRHJsW2fsZqkFoh+RpnhNw
 03u+djxNGBo9qV81vZ9D1b0vA9uRlEjoWOOEG2XE4M2iq6TUySueA72dQnCwunfi
 3kG/u1Swv2dgq6aRrP3H7zdwhYSourGxziu3jNhEKwKEohrxYY7xjNX3RVeTqP67
 AzlKsOTWpRLIDrzjSLlb8VxRQiZewu8Unex3e1G+eo20sbcIObHGrxNp7K00zZvd
 QZiMHhOwItwFTe4lBO+XbqH2JKbL9/uJmwh5EipMpQTraKO9E6N3CJiUHjzBLo2K
 iGDZxRMKf4gVJRwDxbbP6D70JPVu8ZJ09XVIpsXQ3Z1xNqaMF0QdCmP3ty56q1o0
 NvkSXxPKrijZs8Sk0rVDqnJ3ll8PuDnXMv5eDtL42VT818I5WxESn9djjwEanGv0
 TYxbFub/NRxmPEE5B2Js5FBpqsLf5f282OSMeS/5WLBbnHJR1OoPoAhGVpHvxntC
 bi5FC1OolqhvzVIdsqgt
 =u7KM
 -----END PGP SIGNATURE-----

Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random

Pull random subsystem patches from Ted Ts'o:
 "This patch series contains a major revamp of how we collect entropy
  from interrupts for /dev/random and /dev/urandom.

  The goal is to addresses weaknesses discussed in the paper "Mining
  your Ps and Qs: Detection of Widespread Weak Keys in Network Devices",
  by Nadia Heninger, Zakir Durumeric, Eric Wustrow, J.  Alex Halderman,
  which will be published in the Proceedings of the 21st Usenix Security
  Symposium, August 2012.  (See https://factorable.net for more
  information and an extended version of the paper.)"

Fix up trivial conflicts due to nearby changes in
drivers/{mfd/ab3100-core.c, usb/gadget/omap_udc.c}

* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: (33 commits)
  random: mix in architectural randomness in extract_buf()
  dmi: Feed DMI table to /dev/random driver
  random: Add comment to random_initialize()
  random: final removal of IRQF_SAMPLE_RANDOM
  um: remove IRQF_SAMPLE_RANDOM which is now a no-op
  sparc/ldc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  [ARM] pxa: remove IRQF_SAMPLE_RANDOM which is now a no-op
  board-palmz71: remove IRQF_SAMPLE_RANDOM which is now a no-op
  isp1301_omap: remove IRQF_SAMPLE_RANDOM which is now a no-op
  pxa25x_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  omap_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  goku_udc: remove IRQF_SAMPLE_RANDOM which was commented out
  uartlite: remove IRQF_SAMPLE_RANDOM which is now a no-op
  drivers: hv: remove IRQF_SAMPLE_RANDOM which is now a no-op
  xen-blkfront: remove IRQF_SAMPLE_RANDOM which is now a no-op
  n2_crypto: remove IRQF_SAMPLE_RANDOM which is now a no-op
  pda_power: remove IRQF_SAMPLE_RANDOM which is now a no-op
  i2c-pmcmsp: remove IRQF_SAMPLE_RANDOM which is now a no-op
  input/serio/hp_sdc.c: remove IRQF_SAMPLE_RANDOM which is now a no-op
  mfd: remove IRQF_SAMPLE_RANDOM which is now a no-op
  ...
2012-07-31 19:07:42 -07:00
Mel Gorman 907aed48f6 mm: allow PF_MEMALLOC from softirq context
This is needed to allow network softirq packet processing to make use of
PF_MEMALLOC.

Currently softirq context cannot use PF_MEMALLOC due to it not being
associated with a task, and therefore not having task flags to fiddle with
- thus the gfp to alloc flag mapping ignores the task flags when in
interrupts (hard or soft) context.

Allowing softirqs to make use of PF_MEMALLOC therefore requires some
trickery.  This patch borrows the task flags from whatever process happens
to be preempted by the softirq.  It then modifies the gfp to alloc flags
mapping to not exclude task flags in softirq context, and modify the
softirq code to save, clear and restore the PF_MEMALLOC flag.

The save and clear, ensures the preempted task's PF_MEMALLOC flag doesn't
leak into the softirq.  The restore ensures a softirq's PF_MEMALLOC flag
cannot leak back into the preempted process.  This should be safe due to
the following reasons

Softirqs can run on multiple CPUs sure but the same task should not be
	executing the same softirq code. Neither should the softirq
	handler be preempted by any other softirq handler so the flags
	should not leak to an unrelated softirq.

Softirqs re-enable hardware interrupts in __do_softirq() so can be
	preempted by hardware interrupts so PF_MEMALLOC is inherited
	by the hard IRQ. However, this is similar to a process in
	reclaim being preempted by a hardirq. While PF_MEMALLOC is
	set, gfp_to_alloc_flags() distinguishes between hard and
	soft irqs and avoids giving a hardirq the ALLOC_NO_WATERMARKS
	flag.

If the softirq is deferred to ksoftirq then its flags may be used
        instead of a normal tasks but as the softirq cannot be preempted,
        the PF_MEMALLOC flag does not leak to other code by accident.

[davem@davemloft.net: Document why PF_MEMALLOC is safe]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:45 -07:00
Jiang Liu 9adb62a5df mm/hotplug: correctly setup fallback zonelists when creating new pgdat
When hotadd_new_pgdat() is called to create new pgdat for a new node, a
fallback zonelist should be created for the new node.  There's code to try
to achieve that in hotadd_new_pgdat() as below:

	/*
	 * The node we allocated has no zone fallback lists. For avoiding
	 * to access not-initialized zonelist, build here.
	 */
	mutex_lock(&zonelists_mutex);
	build_all_zonelists(pgdat, NULL);
	mutex_unlock(&zonelists_mutex);

But it doesn't work as expected.  When hotadd_new_pgdat() is called, the
new node is still in offline state because node_set_online(nid) hasn't
been called yet.  And build_all_zonelists() only builds zonelists for
online nodes as:

        for_each_online_node(nid) {
                pg_data_t *pgdat = NODE_DATA(nid);

                build_zonelists(pgdat);
                build_zonelist_cache(pgdat);
        }

Though we hope to create zonelist for the new pgdat, but it doesn't.  So
add a new parameter "pgdat" the build_all_zonelists() to build pgdat for
the new pgdat too.

Signed-off-by: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:44 -07:00
Andrew Morton c255a45805 memcg: rename config variables
Sanity:

CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

[mhocko@suse.cz: fix missed bits]
Cc: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:43 -07:00
Wanpeng Li 3965c9ae47 mm: prepare for removal of obsolete /proc/sys/vm/nr_pdflush_threads
Since per-BDI flusher threads were introduced in 2.6, the pdflush
mechanism is not used any more.  But the old interface exported through
/proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.

For back-compatibility, printk warning information and return 2 to notify
the users that the interface is removed.

Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:40 -07:00
Huang Shijie 44de9d0cad mm: account the total_vm in the vm_stat_account()
vm_stat_account() accounts the shared_vm, stack_vm and reserved_vm now.
But we can also account for total_vm in the vm_stat_account() which makes
the code tidy.

Even for mprotect_fixup(), we can get the right result in the end.

Signed-off-by: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:39 -07:00
Linus Torvalds bca1a5c0ea Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "The biggest changes are Intel Nehalem-EX PMU uncore support, uprobes
  updates/cleanups/fixes from Oleg and diverse tooling updates (mostly
  fixes) now that Arnaldo is back from vacation."

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  uprobes: __replace_page() needs munlock_vma_page()
  uprobes: Rename vma_address() and make it return "unsigned long"
  uprobes: Fix register_for_each_vma()->vma_address() check
  uprobes: Introduce vaddr_to_offset(vma, vaddr)
  uprobes: Teach build_probe_list() to consider the range
  uprobes: Remove insert_vm_struct()->uprobe_mmap()
  uprobes: Remove copy_vma()->uprobe_mmap()
  uprobes: Fix overflow in vma_address()/find_active_uprobe()
  uprobes: Suppress uprobe_munmap() from mmput()
  uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe()
  uprobes: Clean up and document write_opcode()->lock_page(old_page)
  uprobes: Kill write_opcode()->lock_page(new_page)
  uprobes: __replace_page() should not use page_address_in_vma()
  uprobes: Don't recheck vma/f_mapping in write_opcode()
  perf/x86: Fix missing struct before structure name
  perf/x86: Fix format definition of SNB-EP uncore QPI box
  perf/x86: Make bitfield unsigned
  perf/x86: Fix LLC-* and node-* events on Intel SandyBridge
  perf/x86: Add Intel Nehalem-EX uncore support
  perf/x86: Fix typo in format definition of uncore PCU filter
  ...
2012-07-31 15:34:13 -07:00
John Stultz 4e250fdde9 time: Remove all direct references to timekeeper
Ingo noted that the numerous timekeeper.value references made
the timekeeping code ugly and caused many long lines that
had to be broken up. He recommended replacing timekeeper.value
references with tk->value.

This patch provides a local tk value for all top level time
functions and sets it to &timekeeper. Then all timekeeper
access is done via a tk pointer.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1343414893-45779-6-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-31 17:09:14 +02:00
John Stultz 6d0ef903e2 time: Clean up offs_real/wall_to_mono and offs_boot/total_sleep_time updates
For performance reasons, we maintain ktime_t based duplicates of
wall_to_monotonic (offs_real) and total_sleep_time (offs_boot).

Since large problems could occur (such as the resume regression
on 3.5-rc7, or the leapsecond hrtimer issue) if these value
pairs were to be inconsistently updated, this patch this cleans
up how we modify these value pairs to ensure we are always
consistent.

As a side-effect this is also more efficient as we only
caulculate the duplicate values when they are changed,
rather then every update_wall_time call.

This also provides WARN_ONs to detect if future changes break
the invariants.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1343414893-45779-5-git-send-email-john.stultz@linaro.org
[ Cleaned up minor style issues. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-31 17:09:14 +02:00
John Stultz d4e3ab384b time: Clean up stray newlines
Ingo noted inconsistent newline usage between functions.
This patch cleans those up.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1343414893-45779-4-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-31 17:09:13 +02:00
John Stultz 02ab20ae38 time/jiffies: Rename ACTHZ to SHIFTED_HZ
Ingo noted that ACTHZ is a confusing name, and requested it
be renamed, so this patch renames ACTHZ to SHIFTED_HZ to
better describe it.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1343414893-45779-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-31 17:09:12 +02:00
Ingo Molnar 1f815faec4 Merge branch 'linus' into timers/urgent
Merge in Linus's branch which already has timers/core merged.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-31 17:05:27 +02:00
Andrew Vagin e6dab5ffab perf/trace: Add ability to set a target task for events
A few events are interesting not only for a current task.
For example, sched_stat_* events are interesting for a task
which wakes up. For this reason, it will be good if such
events will be delivered to a target task too.

Now a target task can be set by using __perf_task().

The original idea and a draft patch belongs to Peter Zijlstra.

I need these events for profiling sleep times. sched_switch is used for
getting callchains and sched_stat_* is used for getting time periods.
These events are combined in user space, then it can be analyzed by
perf tools.

Inspired-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1342016098-213063-1-git-send-email-avagin@openvz.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-31 17:02:05 +02:00
Michael Wang b9403130a5 sched/cleanups: Add load balance cpumask pointer to 'struct lb_env'
With this patch struct ld_env will have a pointer of the load balancing
cpumask and we don't need to pass a cpumask around anymore.

Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FFE8665.3080705@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-31 17:00:16 +02:00
Anton Vorontsov b10d22d6e8 kernel/debug: Make use of KGDB_REASON_NMI
Currently kernel never set KGDB_REASON_NMI. We do now, when we enter
KGDB/KDB from an NMI.

This is not to be confused with kgdb_nmicallback(), NMI callback is
an entry for the slave CPUs during CPUs roundup, but REASON_NMI is the
entry for the master CPU.

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2012-07-31 08:16:43 -05:00
Jason Wessel 07cd27bbd4 kdb: Remove cpu from the more prompt
Having the CPU in the more prompt is completely redundent vs the
standard kdb prompt, and it also wastes 32 bytes on the stack.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2012-07-31 08:16:43 -05:00
Jason Wessel 0f26d0e0a7 kdb: Remove unused KDB_FLAG_ONLY_DO_DUMP
This code cleanup was missed in the original kdb merge, and this code
is simply not used at all.  The code that was previously used to set
the KDB_FLAG_ONLY_DO_DUMP was removed prior to the initial kdb merge.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2012-07-31 08:16:42 -05:00
Octavian Purdila 65fed8f6f2 resource: make sure requested range is included in the root range
When the requested range is outside of the root range the logic in
__reserve_region_with_split will cause an infinite recursion which will
overflow the stack as seen in the warning bellow.

This particular stack overflow was caused by requesting the
(100000000-107ffffff) range while the root range was (0-ffffffff).  In
this case __request_resource would return the whole root range as
conflict range (i.e.  0-ffffffff).  Then, the logic in
__reserve_region_with_split would continue the recursion requesting the
new range as (conflict->end+1, end) which incidentally in this case
equals the originally requested range.

This patch aborts looking for an usable range when the request does not
intersect with the root range.  When the request partially overlaps with
the root range, it ajust the request to fall in the root range and then
continues with the new request.

When the request is modified or aborted errors and a stack trace are
logged to allow catching the errors in the upper layers.

[    5.968374] WARNING: at kernel/sched.c:4129 sub_preempt_count+0x63/0x89()
[    5.975150] Modules linked in:
[    5.978184] Pid: 1, comm: swapper Not tainted 3.0.22-mid27-00004-gb72c817 #46
[    5.985324] Call Trace:
[    5.987759]  [<c1039dfc>] ? console_unlock+0x17b/0x18d
[    5.992891]  [<c1039620>] warn_slowpath_common+0x48/0x5d
[    5.998194]  [<c1031758>] ? sub_preempt_count+0x63/0x89
[    6.003412]  [<c1039644>] warn_slowpath_null+0xf/0x13
[    6.008453]  [<c1031758>] sub_preempt_count+0x63/0x89
[    6.013499]  [<c14d60c4>] _raw_spin_unlock+0x27/0x3f
[    6.018453]  [<c10c6349>] add_partial+0x36/0x3b
[    6.022973]  [<c10c7c0a>] deactivate_slab+0x96/0xb4
[    6.027842]  [<c14cf9d9>] __slab_alloc.isra.54.constprop.63+0x204/0x241
[    6.034456]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[    6.039842]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[    6.045232]  [<c10c7dc9>] kmem_cache_alloc_trace+0x51/0xb0
[    6.050710]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[    6.056100]  [<c103f78f>] kzalloc.constprop.5+0x29/0x38
[    6.061320]  [<c17b45e9>] __reserve_region_with_split+0x1c/0xd1
[    6.067230]  [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
...
[    7.179057]  [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
[    7.184970]  [<c17b4779>] reserve_region_with_split+0x30/0x42
[    7.190709]  [<c17a8ebf>] e820_reserve_resources_late+0xd1/0xe9
[    7.196623]  [<c17c9526>] pcibios_resource_survey+0x23/0x2a
[    7.202184]  [<c17cad8a>] pcibios_init+0x23/0x35
[    7.206789]  [<c17ca574>] pci_subsys_init+0x3f/0x44
[    7.211659]  [<c1002088>] do_one_initcall+0x72/0x122
[    7.216615]  [<c17ca535>] ? pci_legacy_init+0x3d/0x3d
[    7.221659]  [<c17a27ff>] kernel_init+0xa6/0x118
[    7.226265]  [<c17a2759>] ? start_kernel+0x334/0x334
[    7.231223]  [<c14d7482>] kernel_thread_helper+0x6/0x10

Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:21 -07:00
Alan Cox 25353b3377 taskstats: check nla_reserve() return
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=44621

Reported-by: <rucsoftsec@gmail.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:21 -07:00
Steven Rostedt fd4b616b0f sysctl: suppress kmemleak messages
register_sysctl_table() is a strange function, as it makes internal
allocations (a header) to register a sysctl_table.  This header is a
handle to the table that is created, and can be used to unregister the
table.  But if the table is permanent and never unregistered, the header
acts the same as a static variable.

Unfortunately, this allocation of memory that is never expected to be
freed fools kmemleak in thinking that we have leaked memory.  For those
sysctl tables that are never unregistered, and have no pointer referencing
them, kmemleak will think that these are memory leaks:

unreferenced object 0xffff880079fb9d40 (size 192):
  comm "swapper/0", pid 0, jiffies 4294667316 (age 12614.152s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8146b590>] kmemleak_alloc+0x73/0x98
    [<ffffffff8110a935>] kmemleak_alloc_recursive.constprop.42+0x16/0x18
    [<ffffffff8110b852>] __kmalloc+0x107/0x153
    [<ffffffff8116fa72>] kzalloc.constprop.8+0xe/0x10
    [<ffffffff811703c9>] __register_sysctl_paths+0xe1/0x160
    [<ffffffff81170463>] register_sysctl_paths+0x1b/0x1d
    [<ffffffff8117047d>] register_sysctl_table+0x18/0x1a
    [<ffffffff81afb0a1>] sysctl_init+0x10/0x14
    [<ffffffff81b05a6f>] proc_sys_init+0x2f/0x31
    [<ffffffff81b0584c>] proc_root_init+0xa5/0xa7
    [<ffffffff81ae5b7e>] start_kernel+0x3d0/0x40a
    [<ffffffff81ae52a7>] x86_64_start_reservations+0xae/0xb2
    [<ffffffff81ae53ad>] x86_64_start_kernel+0x102/0x111
    [<ffffffffffffffff>] 0xffffffffffffffff

The sysctl_base_table used by sysctl itself is one such instance that
registers the table to never be unregistered.

Use kmemleak_not_leak() to suppress the kmemleak false positive.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:21 -07:00
Vivek Goyal 63dca8d5b5 kdump: append newline to the last lien of vmcoreinfo note
The last line of vmcoreinfo note does not end with \n.  Parsing all the
lines in note becomes easier if all lines end with \n instead of trying to
special case the last line.

I know at least one tool, vmcore-dmesg in kexec-tools tree which made the
assumption that all lines end with \n.  I think it is a good idea to fix
it.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Akinobu Mita f19b9f74b7 fork: fix error handling in dup_task()
The function dup_task() may fail at the following function calls in the
following order.

0) alloc_task_struct_node()
1) alloc_thread_info_node()
2) arch_dup_task_struct()

Error by 0) is not a matter, it can just return.  But error by 1) requires
releasing task_struct allocated by 0) before it returns.  Likewise, error
by 2) requires releasing task_struct and thread_info allocated by 0) and
1).

The existing error handling calls free_task_struct() and
free_thread_info() which do not only release task_struct and thread_info,
but also call architecture specific arch_release_task_struct() and
arch_release_thread_info().

The problem is that task_struct and thread_info are not fully initialized
yet at this point, but arch_release_task_struct() and
arch_release_thread_info() are called with them.

For example, x86 defines its own arch_release_task_struct() that releases
a task_xstate.  If alloc_thread_info_node() fails in dup_task(),
arch_release_task_struct() is called with task_struct which is just
allocated and filled with garbage in this error handling.

This actually happened with tools/testing/fault-injection/failcmd.sh

	# env FAILCMD_TYPE=fail_page_alloc \
		./tools/testing/fault-injection/failcmd.sh --times=100 \
		--min-order=0 --ignore-gfp-wait=0 \
		-- make -C tools/testing/selftests/ run_tests

In order to fix this issue, make free_{task_struct,thread_info}() not to
call arch_release_{task_struct,thread_info}() and call
arch_release_{task_struct,thread_info}() implicitly where needed.

Default arch_release_task_struct() and arch_release_thread_info() are
defined as empty by default.  So this change only affects the
architectures which implement their own arch_release_task_struct() or
arch_release_thread_info() as listed below.

arch_release_task_struct(): x86, sh
arch_release_thread_info(): mn10300, tile

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Salman Qazi <sqazi@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Andrew Morton 87bec58a52 revert "sched: Fix fork() error path to not crash"
To make way for "fork: fix error handling in dup_task()", which fixes the
errors more completely.

Cc: Salman Qazi <sqazi@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Huang Shijie b2412b7fa7 fork: use vma_pages() to simplify the code
The current code can be replaced by vma_pages().  So use it to simplify
the code.

[akpm@linux-foundation.org: initialise `len' at its definition site]
Signed-off-by: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Tetsuo Handa 0f20784d4b kmod: avoid deadlock from recursive kmod call
The system deadlocks (at least since 2.6.10) when
call_usermodehelper(UMH_WAIT_EXEC) request triggers
call_usermodehelper(UMH_WAIT_PROC) request.

This is because "khelper thread is waiting for the worker thread at
wait_for_completion() in do_fork() since the worker thread was created
with CLONE_VFORK flag" and "the worker thread cannot call complete()
because do_execve() is blocked at UMH_WAIT_PROC request" and "the khelper
thread cannot start processing UMH_WAIT_PROC request because the khelper
thread is waiting for the worker thread at wait_for_completion() in
do_fork()".

The easiest example to observe this deadlock is to use a corrupted
/sbin/hotplug binary (like shown below).

  # : > /tmp/dummy
  # chmod 755 /tmp/dummy
  # echo /tmp/dummy > /proc/sys/kernel/hotplug
  # modprobe whatever

call_usermodehelper("/tmp/dummy", UMH_WAIT_EXEC) is called from
kobject_uevent_env() in lib/kobject_uevent.c upon loading/unloading a
module.  do_execve("/tmp/dummy") triggers a call to
request_module("binfmt-0000") from search_binary_handler() which in turn
calls call_usermodehelper(UMH_WAIT_PROC).

In order to avoid deadlock, as a for-now and easy-to-backport solution, do
not try to call wait_for_completion() in call_usermodehelper_exec() if the
worker thread was created by khelper thread with CLONE_VFORK flag.  Future
and fundamental solution might be replacing singleton khelper thread with
some workqueue so that recursive calls up to max_active dependency loop
can be handled without deadlock.

[akpm@linux-foundation.org: add comment to kmod_thread_locker]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Andrew Morton 79c743dd1e kernel/kmod.c: document call_usermodehelper_fns() a bit
This function's interface is, uh, subtle.  Attempt to apologise for it.

Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Joe Perches 088a52aac8 printk: only look for prefix levels in kernel messages
vprintk_emit() prefix parsing should only be done for internal kernel
messages.  This allows existing behavior to be kept in all cases.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:14 -07:00
Joe Perches acc8fa41ad printk: add generic functions to find KERN_<LEVEL> headers
The current form of a KERN_<LEVEL> is "<.>".

Add printk_get_level and printk_skip_level functions to handle these
formats.

These functions centralize tests of KERN_<LEVEL> so a future modification
can change the KERN_<LEVEL> style and shorten the number of bytes consumed
by these headers.

[akpm@linux-foundation.org: fix build error and warning]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Wu Fengguang <wfg@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:13 -07:00
Kay Sievers cdf5344136 kmsg: /dev/kmsg - properly return possible copy_from_user() failure
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:13 -07:00
Andrew Morton b57b44ae69 kernel/sys.c: avoid argv_free(NULL)
If argv_split() failed, the code will end up calling argv_free(NULL).  Fix
it up and clean things up a bit.

Addresses Coverity report 703573.

Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:13 -07:00
Sameer Nanda 45226e944c NMI watchdog: fix for lockup detector breakage on resume
On the suspend/resume path the boot CPU does not go though an
offline->online transition.  This breaks the NMI detector post-resume
since it depends on PMU state that is lost when the system gets
suspended.

Fix this by forcing a CPU offline->online transition for the lockup
detector on the boot CPU during resume.

To provide more context, we enable NMI watchdog on Chrome OS.  We have
seen several reports of systems freezing up completely which indicated
that the NMI watchdog was not firing for some reason.

Debugging further, we found a simple way of repro'ing system freezes --
issuing the command 'tasket 1 sh -c "echo nmilockup > /proc/breakme"'
after the system has been suspended/resumed one or more times.

With this patch in place, the system freeze result in panics, as
expected.

These panics provide a nice stack trace for us to debug the actual issue
causing the freeze.

[akpm@linux-foundation.org: fiddle with code comment]
[akpm@linux-foundation.org: make lockup_detector_bootcpu_resume() conditional on CONFIG_SUSPEND]
[akpm@linux-foundation.org: fix section errors]
Signed-off-by: Sameer Nanda <snanda@chromium.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:13 -07:00