Commit Graph

254 Commits (661a85dc0d2ec0404e3b80909e413a9d5e42a239)

Author SHA1 Message Date
Paul E. McKenney 661a85dc0d rcu: Add random PROVE_RCU_DELAY to grace-period initialization
Preemption greatly raised the probability of certain types of race
conditions, so this commit adds an anti-heisenbug to greatly increase
the collision cross section, also known as the probability of occurrence.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:55 -07:00
Paul E. McKenney 5d4b865949 rcu: Fix day-zero grace-period initialization/cleanup race
The current approach to grace-period initialization is vulnerable to
extremely low-probability races.  These races stem from the fact that
the old grace period is marked completed on the same traversal through
the rcu_node structure that is marking the start of the new grace period.
This means that some rcu_node structures will believe that the old grace
period is still in effect at the same time that other rcu_node structures
believe that the new grace period has already started.

These sorts of disagreements can result in too-short grace periods,
as shown in the following scenario:

1.	CPU 0 completes a grace period, but needs an additional
	grace period, so starts initializing one, initializing all
	the non-leaf rcu_node structures and the first leaf rcu_node
	structure.  Because CPU 0 is both completing the old grace
	period and starting a new one, it marks the completion of
	the old grace period and the start of the new grace period
	in a single traversal of the rcu_node structures.

	Therefore, CPUs corresponding to the first rcu_node structure
	can become aware that the prior grace period has completed, but
	CPUs corresponding to the other rcu_node structures will see
	this same prior grace period as still being in progress.

2.	CPU 1 passes through a quiescent state, and therefore informs
	the RCU core.  Because its leaf rcu_node structure has already
	been initialized, this CPU's quiescent state is applied to the
	new (and only partially initialized) grace period.

3.	CPU 1 enters an RCU read-side critical section and acquires
	a reference to data item A.  Note that this CPU believes that
	its critical section started after the beginning of the new
	grace period, and therefore will not block this new grace period.

4.	CPU 16 exits dyntick-idle mode.  Because it was in dyntick-idle
	mode, other CPUs informed the RCU core of its extended quiescent
	state for the past several grace periods.  This means that CPU 16
	is not yet aware that these past grace periods have ended.  Assume
	that CPU 16 corresponds to the second leaf rcu_node structure --
	which has not yet been made aware of the new grace period.

5.	CPU 16 removes data item A from its enclosing data structure
	and passes it to call_rcu(), which queues a callback in the
	RCU_NEXT_TAIL segment of the callback queue.

6.	CPU 16 enters the RCU core, possibly because it has taken a
	scheduling-clock interrupt, or alternatively because it has
	more than 10,000 callbacks queued.  It notes that the second
	most recent grace period has completed (recall that because it
	corresponds to the second as-yet-uninitialized rcu_node structure,
	it cannot yet become aware that the most recent grace period has
	completed), and therefore advances its callbacks.  The callback
	for data item A is therefore in the RCU_NEXT_READY_TAIL segment
	of the callback queue.

7.	CPU 0 completes initialization of the remaining leaf rcu_node
	structures for the new grace period, including the structure
	corresponding to CPU 16.

8.	CPU 16 again enters the RCU core, again, possibly because it has
	taken a scheduling-clock interrupt, or alternatively because
	it now has more than 10,000 callbacks queued.	It notes that
	the most recent grace period has ended, and therefore advances
	its callbacks.	The callback for data item A is therefore in
	the RCU_DONE_TAIL segment of the callback queue.

9.	All CPUs other than CPU 1 pass through quiescent states.  Because
	CPU 1 already passed through its quiescent state, the new grace
	period completes.  Note that CPU 1 is still in its RCU read-side
	critical section, still referencing data item A.

10.	Suppose that CPU 2 wais the last CPU to pass through a quiescent
	state for the new grace period, and suppose further that CPU 2
	did not have any callbacks queued, therefore not needing an
	additional grace period.  CPU 2 therefore traverses all of the
	rcu_node structures, marking the new grace period as completed,
	but does not initialize a new grace period.

11.	CPU 16 yet again enters the RCU core, yet again possibly because
	it has taken a scheduling-clock interrupt, or alternatively
	because it now has more than 10,000 callbacks queued.	It notes
	that the new grace period has ended, and therefore advances
	its callbacks.	The callback for data item A is therefore in
	the RCU_DONE_TAIL segment of the callback queue.  This means
	that this callback is now considered ready to be invoked.

12.	CPU 16 invokes the callback, freeing data item A while CPU 1
	is still referencing it.

This scenario represents a day-zero bug for TREE_RCU.  This commit
therefore ensures that the old grace period is marked completed in
all leaf rcu_node structures before a new grace period is marked
started in any of them.

That said, it would have been insanely difficult to force this race to
happen before the grace-period initialization process was preemptible.
Therefore, this commit is not a candidate for -stable.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>

Conflicts:

	kernel/rcutree.c
2012-09-23 07:41:55 -07:00
Paul E. McKenney 7e5c2dfb4d rcu: Make rcutree module parameters visible in sysfs
The module parameters blimit, qhimark, and qlomark (and more
recently, rcu_fanout_leaf) have permission masks of zero, so
that their values are not visible from sysfs.  This is unnecessary
and inconvenient to administrators who might like an easy way to
see what these values are on a running system.  This commit therefore
sets their permission masks to 0444, allowing them to be read but
not written.

Reported-by: Rusty Russell <rusty@ozlabs.org>
Reported-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:55 -07:00
Paul E. McKenney d40011f601 rcu: Control grace-period duration from sysfs
Although almost everyone is well-served by the defaults, some uses of RCU
benefit from shorter grace periods, while others benefit more from the
greater efficiency provided by longer grace periods.  Situations requiring
a large number of grace periods to elapse (and wireshark startup has
been called out as an example of this) are helped by lower-latency
grace periods.  Furthermore, in some embedded applications, people are
willing to accept a small degradation in update efficiency (due to there
being more of the shorter grace-period operations) in order to gain the
lower latency.

In contrast, those few systems with thousands of CPUs need longer grace
periods because the CPU overhead of a grace period rises roughly
linearly with the number of CPUs.  Such systems normally do not make
much use of facilities that require large numbers of grace periods to
elapse, so this is a good tradeoff.

Therefore, this commit allows the durations to be controlled from sysfs.
There are two sysfs parameters, one named "jiffies_till_first_fqs" that
specifies the delay in jiffies from the end of grace-period initialization
until the first attempt to force quiescent states, and the other named
"jiffies_till_next_fqs" that specifies the delay (again in jiffies)
between subsequent attempts to force quiescent states.  They both default
to three jiffies, which is compatible with the old hard-coded behavior.

At some future time, it may be possible to automatically increase the
grace-period length with the number of CPUs, but we do not yet have
sufficient data to do a good job.  Preliminary data indicates that we
should add an addiitonal jiffy to each of the delays for every 200 CPUs
in the system, but more experimentation is needed.  For now, the number
of systems with more than 1,000 CPUs is small enough that this can be
relegated to boot-time hand tuning.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:54 -07:00
Paul E. McKenney 394f2769aa rcu: Prevent force_quiescent_state() memory contention
Large systems running RCU_FAST_NO_HZ kernels see extreme memory
contention on the rcu_state structure's ->fqslock field.  This
can be avoided by disabling RCU_FAST_NO_HZ, either at compile time
or at boot time (via the nohz kernel boot parameter), but large
systems will no doubt become sensitive to energy consumption.
This commit therefore uses a combining-tree approach to spread the
memory contention across new cache lines in the leaf rcu_node structures.
This can be thought of as a tournament lock that has only a try-lock
acquisition primitive.

The effect on small systems is minimal, because such systems have
an rcu_node "tree" consisting of a single node.  In addition, this
functionality is not used on fastpaths.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:54 -07:00
Paul E. McKenney b4be093fee rcu: Allow RCU quiescent-state forcing to be preempted
RCU quiescent-state forcing is currently carried out without preemption
points, which can result in excessive latency spikes on large systems
(many hundreds or thousands of CPUs).  This patch therefore inserts
a voluntary preemption point into force_qs_rnp(), which should greatly
reduce the magnitude of these spikes.

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:54 -07:00
Paul E. McKenney 4cdfc175c2 rcu: Move quiescent-state forcing into kthread
As the first step towards allowing quiescent-state forcing to be
preemptible, this commit moves RCU quiescent-state forcing into the
same kthread that is now used to initialize and clean up after grace
periods.  This is yet another step towards keeping scheduling
latency down to a dull roar.

Updated to change from raw_spin_lock_irqsave() to raw_spin_lock_irq()
and to remove the now-unused rcu_state structure fields as suggested by
Peter Zijlstra.

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23 07:41:54 -07:00
Paul E. McKenney bfa00b4c40 rcu: Prevent offline CPUs from executing RCU core code
Earlier versions of RCU invoked the RCU core from the CPU_DYING notifier
in order to note a quiescent state for the outgoing CPU.  Because the
CPU is marked "offline" during the execution of the CPU_DYING notifiers,
the RCU core had to tolerate being invoked from an offline CPU.  However,
commit b1420f1c (Make rcu_barrier() less disruptive) left only tracing
code in the CPU_DYING notifier, so the RCU core need no longer execute
on offline CPUs.  This commit therefore enforces this restriction.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:53 -07:00
Paul E. McKenney 7fdefc10e1 rcu: Break up rcu_gp_kthread() into subfunctions
Then rcu_gp_kthread() function is too large and furthermore needs to
have the force_quiescent_state() code pulled in.  This commit therefore
breaks up rcu_gp_kthread() into rcu_gp_init() and rcu_gp_cleanup().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:53 -07:00
Paul E. McKenney c856bafae7 rcu: Allow RCU grace-period cleanup to be preempted
RCU grace-period cleanup is currently carried out with interrupts
disabled, which can result in excessive latency spikes on large systems
(many hundreds or thousands of CPUs).  This patch therefore makes the
RCU grace-period cleanup be preemptible, including voluntary preemption
points, which should eliminate those latency spikes.  Similar spikes from
forcing of quiescent states will be dealt with similarly by later patches.

Updated to replace uses of spin_lock_irqsave() with spin_lock_irq(), as
suggested by Peter Zijlstra.

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:53 -07:00
Paul E. McKenney cabc49c1ff rcu: Move RCU grace-period cleanup into kthread
As a first step towards allowing grace-period cleanup to be preemptible,
this commit moves the RCU grace-period cleanup into the same kthread
that is now used to initialize grace periods.  This is needed to keep
scheduling latency down to a dull roar.

[ paulmck: Get rid of stray spin_lock_irqsave() calls. ]

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:52 -07:00
Paul E. McKenney 755609a908 rcu: Allow RCU grace-period initialization to be preempted
RCU grace-period initialization is currently carried out with interrupts
disabled, which can result in 200-microsecond latency spikes on systems
on which RCU has been configured for 4096 CPUs.  This patch therefore
makes the RCU grace-period initialization be preemptible, which should
eliminate those latency spikes.  Similar spikes from grace-period cleanup
and the forcing of quiescent states will be dealt with similarly by later
patches.

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:52 -07:00
Paul E. McKenney 79bce67243 rcu: Prevent initialization-time quiescent-state race
The next step in reducing RCU's grace-period initialization latency on
large systems will make this initialization preemptible.  Unfortunately,
making the grace-period initialization subject to interrupts (let alone
preemption) exposes the following race on systems whose rcu_node tree
contains more than one node:

1.	CPU 31 starts initializing the grace period, including the
    	first leaf rcu_node structures, and is then preempted.

2.	CPU 0 refers to the first leaf rcu_node structure, and notes
    	that a new grace period has started.  It passes through a
    	quiescent state shortly thereafter, and informs the RCU core
    	of this rite of passage.

3.	CPU 0 enters an RCU read-side critical section, acquiring
    	a pointer to an RCU-protected data item.

4.	CPU 31 takes an interrupt whose handler removes the data item
	referenced by CPU 0 from the data structure, and registers an
	RCU callback in order to free it.

5.	CPU 31 resumes initializing the grace period, including its
    	own rcu_node structure.  In invokes rcu_start_gp_per_cpu(),
    	which advances all callbacks, including the one registered
    	in #4 above, to be handled by the current grace period.

6.	The remaining CPUs pass through quiescent states and inform
    	the RCU core, but CPU 0 remains in its RCU read-side critical
    	section, still referencing the now-removed data item.

7.	The grace period completes and all the callbacks are invoked,
    	including the one that frees the data item that CPU 0 is still
    	referencing.  Oops!!!

One way to avoid this race is to remove grace-period acceleration from
rcu_start_gp_per_cpu().  Now, the only reason for this acceleration was
to allow CPUs bringing RCU out of idle state to have their callbacks
invoked after only one grace period, rather than the two grace periods
that would otherwise be required.  But this acceleration does not
work when RCU grace-period initialization is moved to a kthread because
the CPU posting the callback is no longer necessarily the CPU that is
initializing the resulting grace period.

This commit therefore removes this now-pointless (and soon to be dangerous)
grace-period acceleration, thus avoiding the above race.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-09-23 07:41:52 -07:00
Paul E. McKenney b3dbec76e5 rcu: Move RCU grace-period initialization into a kthread
As the first step towards allowing grace-period initialization to be
preemptible, this commit moves the RCU grace-period initialization
into its own kthread.  This is needed to keep large-system scheduling
latency at reasonable levels.

Also change raw_spin_lock_irqsave() to raw_spin_lock_irq() as suggested
by Peter Zijlstra in review comments.

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-09-23 07:41:52 -07:00
Paul E. McKenney a10d206ef1 rcu: Fix day-one dyntick-idle stall-warning bug
Each grace period is supposed to have at least one callback waiting
for that grace period to complete.  However, if CONFIG_NO_HZ=n, an
extra callback-free grace period is no big problem -- it will chew up
a tiny bit of CPU time, but it will complete normally.  In contrast,
CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
sleep indefinitely, in turn indefinitely delaying completion of the
callback-free grace period.  Given that nothing is waiting on this grace
period, this is also not a problem.

That is, unless RCU CPU stall warnings are also enabled, as they are
in recent kernels.  In this case, if a CPU wakes up after at least one
minute of inactivity, an RCU CPU stall warning will result.  The reason
that no one noticed until quite recently is that most systems have enough
OS noise that they will never remain absolutely idle for a full minute.
But there are some embedded systems with cut-down userspace configurations
that consistently get into this situation.

All this begs the question of exactly how a callback-free grace period
gets started in the first place.  This can happen due to the fact that
CPUs do not necessarily agree on which grace period is in progress.
If a CPU still believes that the grace period that just completed is
still ongoing, it will believe that it has callbacks that need to wait for
another grace period, never mind the fact that the grace period that they
were waiting for just completed.  This CPU can therefore erroneously
decide to start a new grace period.  Note that this can happen in
TREE_RCU and TREE_PREEMPT_RCU even on a single-CPU system:  Deadlock
considerations mean that the CPU that detected the end of the grace
period is not necessarily officially informed of this fact for some time.

Once this CPU notices that the earlier grace period completed, it will
invoke its callbacks.  It then won't have any callbacks left.  If no
other CPU has any callbacks, we now have a callback-free grace period.

This commit therefore makes CPUs check more carefully before starting a
new grace period.  This new check relies on an array of tail pointers
into each CPU's list of callbacks.  If the CPU is up to date on which
grace periods have completed, it checks to see if any callbacks follow
the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
follow the RCU_WAIT_TAIL segment.  The reason that this works is that
the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment
as soon as the CPU is officially notified that the old grace period
has ended.

This change is to cpu_needs_another_gp(), which is called in a number
of places.  The only one that really matters is in rcu_start_gp(), where
the root rcu_node structure's ->lock is held, which prevents any
other CPU from starting or completing a grace period, so that the
comparison that determines whether the CPU is missing the completion
of a grace period is stable.

Reported-by: Becky Bruce <bgillbruce@gmail.com>
Reported-by: Subodh Nijsure <snijsure@grid-net.com>
Reported-by: Paul Walmsley <paul@pwsan.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Walmsley <paul@pwsan.com>  # OMAP3730, OMAP4430
Cc: stable@vger.kernel.org
2012-09-23 07:31:52 -07:00
Paul E. McKenney c701d5d9b3 rcu: Fix code-style issues involving "else"
The Linux kernel coding style says that single-statement blocks should
omit curly braces unless the other leg of the "if" statement has
multiple statements, in which case the curly braces should be included.
This commit fixes RCU's violations of this rule.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-06 06:01:48 -07:00
Paul E. McKenney 02a0677b0b Merge branches 'bigrtm.2012.07.04a', 'doctorture.2012.07.02a', 'fixes.2012.07.06a' and 'fnh.2012.07.02a' into HEAD
bigrtm: First steps towards getting RCU out of the way of
	tens-of-microseconds real-time response on systems compiled
	with NR_CPUS=4096.  Also cleanups for and increased concurrency
	of rcu_barrier() family of primitives.
doctorture: rcutorture and documentation improvements.
fixes:  Miscellaneous fixes.
fnh: RCU_FAST_NO_HZ fixes and improvements.
2012-07-06 05:59:30 -07:00
Paul E. McKenney cfca927972 rcu: Introduce check for callback list/count mismatch
The recent bug that introduced the RCU callback list/count mismatch
showed the need for a diagnostic to check for this, which this commit
adds.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-06 05:55:16 -07:00
Paul E. McKenney bf1304e9cd rcu: Dump only the current CPU's buffers for idle-entry/exit warnings
Problems in RCU idle entry and exit are almost always confined to the
offending CPU.  This commit therefore switches ftrace_dump() from
DUMP_ALL to DUMP_ORIG.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-07-02 12:34:42 -07:00
Paul E. McKenney cf01537ecf rcu: Add check for CPUs going offline with callbacks queued
If a CPU goes offline with callbacks queued, those callbacks might be
indefinitely postponed, which can result in a system hang.  This commit
therefore inserts warnings for this condition.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:34:25 -07:00
Paul E. McKenney 95f0c1de3e rcu: Disable preemption in rcu_blocking_is_gp()
It is time to optimize CONFIG_TREE_PREEMPT_RCU's synchronize_rcu()
for uniprocessor optimization, which means that rcu_blocking_is_gp()
can no longer rely on RCU read-side critical sections having disabled
preemption.  This commit therefore disables preemption across
rcu_blocking_is_gp()'s scan of the cpu_online_mask.

(Updated from previous version to fix embarrassing bug spotted by
Wu Fengguang.)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:34:25 -07:00
Paul E. McKenney 29154c57e3 rcu: Split RCU core processing out of __call_rcu()
The __call_rcu() function is a bit overweight, so this commit splits
it into actual enqueuing of and accounting for the callback (__call_rcu())
and associated RCU-core processing (__call_rcu_core()).

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:34:24 -07:00
Paul E. McKenney a16b7a6934 rcu: Prevent __call_rcu() from invoking RCU core on offline CPUs
The __call_rcu() function will invoke the RCU core, for example, if
it detects that the current CPU has too many callbacks.  However, this
can happen on an offline CPU that is on its way to the idle loop, in
which case it is an error to invoke the RCU core, and the excess callbacks
will be adopted in any case.  This commit therefore adds checks to
__call_rcu() for running on an offline CPU, refraining from invoking
the RCU core in this case.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:34:24 -07:00
Paul E. McKenney 62fde6edf1 rcu: Make __call_rcu() handle invocation from idle
Although __call_rcu() is handled correctly when called from a momentary
non-idle period, if it is called on a CPU that RCU believes to be idle
on RCU_FAST_NO_HZ kernels, the callback might be indefinitely postponed.
This commit therefore ensures that RCU is aware of the new callback and
has a chance to force the CPU out of dyntick-idle mode when a new callback
is posted.

Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:34:24 -07:00
Paul E. McKenney 1d1fb395f6 rcu: Add ACCESS_ONCE() to ->qlen accesses
The _rcu_barrier() function accesses other CPUs' rcu_data structure's
->qlen field without benefit of locking.  This commit therefore adds
the required ACCESS_ONCE() wrappers around accesses and updates that
need it.

ACCESS_ONCE() is not needed when a CPU accesses its own ->qlen, or
in code that cannot run while _rcu_barrier() is sampling ->qlen fields.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:34:22 -07:00
Paul E. McKenney 3f5d3ea64f rcu: Consolidate duplicate callback-list initialization
There are a couple of open-coded initializations of the rcu_data
structure's RCU callback list.  This commit therefore consolidates
them into a new init_callback_list() function.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:34:21 -07:00
Paul E. McKenney 285fe29481 rcu: Fix detection of abruptly-ending stall
The code that attempts to identify stalls that end just as we detect
them is broken by both flavors of initialization failure.  This commit
therefore properly initializes and computes the count of the number
of reasons why the RCU grace period is stalled.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:34:21 -07:00
Paul E. McKenney ff015030c9 rcu: RCU_SAVE_DYNTICK code no longer ever dead
Before RCU had unified idle, the RCU_SAVE_DYNTICK leg of the switch
statement in force_quiescent_state() was dead code for CONFIG_NO_HZ=n
kernel builds.  With unified idle, the code is never dead.  This commit
therefore removes the "if" statement designed to make gcc aware of when
the code was and was not dead.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:33:24 -07:00
Paul E. McKenney 6ce75a2326 rcu: Introduce for_each_rcu_flavor() and use it
The arrival of TREE_PREEMPT_RCU some years back included some ugly
code involving either #ifdef or #ifdef'ed wrapper functions to iterate
over all non-SRCU flavors of RCU.  This commit therefore introduces
a for_each_rcu_flavor() iterator over the rcu_state structures for each
flavor of RCU to clean up a bit of the ugliness.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:33:24 -07:00
Paul E. McKenney 1bca8cf1a2 rcu: Remove unneeded __rcu_process_callbacks() argument
With the advent of __this_cpu_ptr(), it is no longer necessary to pass
both the rcu_state and rcu_data structures into __rcu_process_callbacks().
This commit therefore computes the rcu_data pointer from the rcu_state
pointer within __rcu_process_callbacks() so that callers can pass in
only the pointer to the rcu_state structure.  This paves the way for
linking the rcu_state structures together and iterating over them.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:33:23 -07:00
Paul E. McKenney a83eff0a82 rcu: Add tracing for _rcu_barrier()
This commit adds event tracing for _rcu_barrier() execution.  This
is defined only if RCU_TRACE=y.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:33:23 -07:00
Paul E. McKenney cf3a9c4842 rcu: Increase rcu_barrier() concurrency
The traditional rcu_barrier() implementation has serialized all requests,
regardless of RCU flavor, and also does not coalesce concurrent requests.
In the past, this has been good and sufficient.

However, systems are getting larger and use of rcu_barrier() has been
increasing.  This commit therefore introduces a counter-based scheme
that allows _rcu_barrier() calls for the same flavor of RCU to take
advantage of each others' work.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:33:23 -07:00
Paul E. McKenney cfed0a85da rcu: Remove needless initialization
For global variables, C defaults all fields to zero.  The initialization
of the rcu_state structure's ->n_force_qs and ->n_force_qs_ngp fields
is therefore redundant, so this commit removes these initializations.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:33:22 -07:00
Paul E. McKenney 7be7f0be90 rcu: Move rcu_barrier_mutex to rcu_state structure
In order to allow each RCU flavor to concurrently execute its
rcu_barrier() function, it is necessary to move the relevant
state to the rcu_state structure.  This commit therefore moves the
rcu_barrier_mutex global variable to a new ->barrier_mutex field
in the rcu_state structure.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:33:22 -07:00
Paul E. McKenney 7db74df88b rcu: Move rcu_barrier_completion to rcu_state structure
In order to allow each RCU flavor to concurrently execute its
rcu_barrier() function, it is necessary to move the relevant
state to the rcu_state structure.  This commit therefore moves the
rcu_barrier_completion global variable to a new ->barrier_completion
field in the rcu_state structure.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:33:22 -07:00
Paul E. McKenney 24ebbca8ec rcu: Move rcu_barrier_cpu_count to rcu_state structure
In order to allow each RCU flavor to concurrently execute its rcu_barrier()
function, it is necessary to move the relevant state to the rcu_state
structure.  This commit therefore moves the rcu_barrier_cpu_count global
variable to a new ->barrier_cpu_count field in the rcu_state structure.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:33:22 -07:00
Paul E. McKenney 06668efa91 rcu: Move _rcu_barrier()'s rcu_head structures to rcu_data structures
In order for multiple flavors of RCU to each concurrently run one
rcu_barrier(), each flavor needs its own per-CPU set of rcu_head
structures.  This commit therefore moves _rcu_barrier()'s set of
per-CPU rcu_head structures from per-CPU variables to the existing
per-CPU and per-RCU-flavor rcu_data structures.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:33:21 -07:00
Paul E. McKenney 037b64ed0b rcu: Place pointer to call_rcu() in rcu_data structure
This is a preparatory commit for increasing rcu_barrier()'s concurrency.
It adds a pointer in the rcu_data structure to the corresponding call_rcu()
function.  This allows a pointer to the rcu_data structure to imply the
function pointer, which allows _rcu_barrier() state to be placed in the
rcu_state structure.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:33:21 -07:00
Paul E. McKenney 6c90cc7bf0 rcu: Prevent excessive line length in RCU_STATE_INITIALIZER()
Upcoming rcu_barrier() concurrency commits will result in line lengths
greater than 80 characters in the RCU_STATE_INITIALIZER(), so this commit
shortens the name of the macro's argument to prevent this.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-07-02 12:33:21 -07:00
Paul E. McKenney cca6f39319 rcu: Size rcu_node tree from nr_cpu_ids rather than NR_CPUS
The rcu_node tree array is sized based on compile-time constants,
including NR_CPUS.  Although this approach has worked well in the past,
the recent trend by many distros to define NR_CPUS=4096 results in
excessive grace-period-initialization latencies.

This commit therefore substitutes the run-time computed nr_cpu_ids for
the compile-time NR_CPUS when building the tree.  This can result in
much of the compile-time-allocated rcu_node array being unused.  If
this is a major problem, you are in a specialized situation anyway,
so you can manually adjust the NR_CPUS, RCU_FANOUT, and RCU_FANOUT_LEAF
kernel config parameters.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:33:21 -07:00
Paul E. McKenney f885b7f2b2 rcu: Control RCU_FANOUT_LEAF from boot-time parameter
Although making RCU_FANOUT_LEAF a kernel configuration parameter rather
than a fixed constant makes it easier for people to decrease cache-miss
overhead for large systems, it is of little help for people who must
run a single pre-built kernel binary.

This commit therefore allows the value of RCU_FANOUT_LEAF to be
increased (but not decreased!) via a boot-time parameter named
rcutree.rcu_fanout_leaf.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-07-02 12:33:20 -07:00
Paul E. McKenney cba6d0d64e Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation"
This reverts commit 616c310e83.
(Move PREEMPT_RCU preemption to switch_to() invocation).
Testing by Sasha Levin <levinsasha928@gmail.com> showed that this
can result in deadlock due to invoking the scheduler when one of
the runqueue locks is held.  Because this commit was simply a
performance optimization, revert it.

Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
2012-07-02 11:39:19 -07:00
Paul E. McKenney b41772abeb rcu: Stop rcu_do_batch() from multiplexing the "count" variable
Commit b1420f1c (Make rcu_barrier() less disruptive) rearranged the
code in rcu_do_batch(), moving the ->qlen manipulation to follow
the requeueing of the callbacks.  Unfortunately, this rearrangement
clobbered the value of the "count" local variable before the value
of rdp->qlen was adjusted, resulting in the value of rdp->qlen being
inaccurate.  This commit therefore introduces an index variable "i",
avoiding the inadvertent multiplexing.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-06-25 12:35:25 -07:00
Paul E. McKenney 8f5af6f1f2 rcu: RCU_FAST_NO_HZ detection of callback adoption
In the present implementations of CPU hotplug, the outgoing CPU is
guaranteed to run its stop-machine process on the way out, which
will guarantee that RCU_FAST_NO_HZ forces the CPU out of dyntick-idle
mode.

However, new versions of CPU hotplug might not work this way.  This
commit therefore removes this design constraint by explicitly notifying
CPUs when they adopt non-lazy RCU callbacks.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-06-06 20:43:27 -07:00
Ingo Molnar 2d84e023cb Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull the v3.5 RCU tree from Paul E. McKenney:

 1)	A set of improvements and fixes to the RCU_FAST_NO_HZ feature
	(with more on the way for 3.6).  Posted to LKML:
	https://lkml.org/lkml/2012/4/23/324 (commits 1-3 and 5),
	https://lkml.org/lkml/2012/4/16/611 (commit 4),
	https://lkml.org/lkml/2012/4/30/390 (commit 6), and
	https://lkml.org/lkml/2012/5/4/410 (commit 7, combined with
	the other commits for the convenience of the tester).

 2)	Changes to make rcu_barrier() avoid disrupting execution of CPUs
	that have no RCU callbacks.  Posted to LKML:
	https://lkml.org/lkml/2012/4/23/322.

 3)	A couple of commits that improve the efficiency of the interaction
	between preemptible RCU and the scheduler, these two being all
	that survived an abortive attempt to allow preemptible RCU's
	__rcu_read_lock() to be inlined.  The full set was posted to
	LKML at https://lkml.org/lkml/2012/4/14/143, and the first and
	third patches of that set remain.

 4)	Lai Jiangshan's algorithmic implementation of SRCU, which includes
	call_srcu() and srcu_barrier().  A major feature of this new
	implementation is that synchronize_srcu() no longer disturbs
	the execution of other CPUs.  This work is based on earlier
	implementations by Peter Zijlstra and Paul E. McKenney.  Posted to
	LKML: https://lkml.org/lkml/2012/2/22/82.

 5)	A number of miscellaneous bug fixes and improvements which were
	posted to LKML at: https://lkml.org/lkml/2012/4/23/353 with
	subsequent updates posted to LKML.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-14 08:41:46 +02:00
Paul E. McKenney dc36be4419 Merge branches 'barrier.2012.05.09a', 'fixes.2012.04.26a', 'inline.2012.05.02b' and 'srcu.2012.05.07b' into HEAD
barrier:  Reduce the amount of disturbance by rcu_barrier() to the rest of
    	the system.  This branch also includes improvements to
    	RCU_FAST_NO_HZ, which are included here due to conflicts.
fixes:  Miscellaneous fixes.
inline:  Remaining changes from an abortive attempt to inline
    	preemptible RCU's __rcu_read_lock().  These are (1) making
    	exit_rcu() avoid unnecessary work and (2) avoiding having
    	preemptible RCU record a blocked thread when the scheduler
    	declines to do a context switch.
srcu:	Lai Jiangshan's algorithmic implementation of SRCU, including
    	call_srcu().
2012-05-11 10:14:21 -07:00
Paul E. McKenney b1420f1c8b rcu: Make rcu_barrier() less disruptive
The rcu_barrier() primitive interrupts each and every CPU, registering
a callback on every CPU.  Once all of these callbacks have been invoked,
rcu_barrier() knows that every callback that was registered before
the call to rcu_barrier() has also been invoked.

However, there is no point in registering a callback on a CPU that
currently has no callbacks, most especially if that CPU is in a
deep idle state.  This commit therefore makes rcu_barrier() avoid
interrupting CPUs that have no callbacks.  Doing this requires reworking
the handling of orphaned callbacks, otherwise callbacks could slip through
rcu_barrier()'s net by being orphaned from a CPU that rcu_barrier() had
not yet interrupted to a CPU that rcu_barrier() had already interrupted.
This reworking was needed anyway to take a first step towards weaning
RCU from the CPU_DYING notifier's use of stop_cpu().

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-05-09 14:27:54 -07:00
Paul E. McKenney 616c310e83 rcu: Move PREEMPT_RCU preemption to switch_to() invocation
Currently, PREEMPT_RCU readers are enqueued upon entry to the scheduler.
This is inefficient because enqueuing is required only if there is a
context switch, and entry to the scheduler does not guarantee a context
switch.

The commit therefore moves the enqueuing to immediately precede the
call to switch_to() from the scheduler.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-02 14:43:23 -07:00
Paul E. McKenney c57afe80db rcu: Make RCU_FAST_NO_HZ account for pauses out of idle
Both Steven Rostedt's new idle-capable trace macros and the RCU_NONIDLE()
macro can cause RCU to momentarily pause out of idle without the rest
of the system being involved.  This can cause rcu_prepare_for_idle()
to run through its state machine too quickly, which can in turn result
in needless scheduling-clock interrupts.

This commit therefore adds code to enable rcu_prepare_for_idle() to
distinguish between an initial entry to idle on the one hand (which needs
to advance the rcu_prepare_for_idle() state machine) and an idle reentry
due to idle-capable trace macros and RCU_NONIDLE() on the other hand
(which should avoid advancing the rcu_prepare_for_idle() state machine).
Additional state is maintained to allow the timer to be correctly reposted
when returning after a momentary pause out of idle, and even more state
is maintained to detect when new non-lazy callbacks have been enqueued
(which may require re-evaluation of the approach to idleness).

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-24 20:55:20 -07:00
Paul E. McKenney 6d8133919b rcu: Document why rcu_blocking_is_gp() is safe
The rcu_blocking_is_gp() function tests to see if there is only one
online CPU, and if so, synchronize_sched() and friends become no-ops.
However, for larger systems, num_online_cpus() scans a large vector,
and might be preempted while doing so.  While preempted, any number
of CPUs might come online and go offline, potentially resulting in
num_online_cpus() returning 1 when there never had only been one
CPU online.  This could result in a too-short RCU grace period, which
could in turn result in total failure, except that the only way that
the grace period is too short is if there is an RCU read-side critical
section spanning it.  For RCU-sched and RCU-bh (which are the only
cases using rcu_blocking_is_gp()), RCU read-side critical sections
have either preemption or bh disabled, which prevents CPUs from going
offline.  This in turn prevents actual failures from occurring.

This commit therefore adds a large block comment to rcu_blocking_is_gp()
documenting why it is safe.  This commit also moves rcu_blocking_is_gp()
into kernel/rcutree.c, which should help prevent unwary developers from
mistaking it for a generally useful function.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-24 20:54:53 -07:00